This example predicts house prices using a Linear Regression predictor. Treasure Workflow provides a way to predict continuous values, such as a price or energy consumption, with a Linear Regression predictor.

  • regression.dig is a TD workflow script for regression
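To make "predicting a continuous value" concrete, here is a minimal sketch of ordinary least squares with a single feature. It is not the code that regression.dig runs; the function names are hypothetical, and the data are the rm and medv values of the four sample rows listed in the Input section, which are far too few for a meaningful model.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict a continuous target value for a single feature value."""
    return slope * x + intercept


# rm (average number of rooms) and medv (target) from the four sample rows.
rm = [6.575, 6.421, 7.185, 6.998]
medv = [24.0, 21.6, 34.7, 33.4]

slope, intercept = fit_simple_linear_regression(rm, medv)
print(f"medv ~= {slope:.2f} * rm + {intercept:.2f}")
print(f"predicted medv for rm=6.8: {predict(slope, intercept, 6.8):.2f}")
```

A real regressor combines several explanatory variables, but the idea is the same: learn weights from known prices, then apply them to new rows.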

About Feature Selection with the Filter Methods

Filter methods select variables independently of the model. They rely only on general characteristics of the data, such as each variable's correlation with the variable to predict. Filter methods discard the least informative variables; the remaining variables are then used in a classification or regression model. These methods are fast to compute and robust to overfitting.

Filter methods tend to select redundant variables because they do not consider the relationships between variables. More elaborate filter methods reduce this problem by removing variables that are highly correlated with each other.
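The two ideas above, ranking variables by their correlation with the target and optionally skipping variables that duplicate one already chosen, can be sketched as follows. This is an illustration under stated assumptions, not the workflow's implementation: pearson, filter_select, and the redundancy_threshold parameter are hypothetical names, and the toy columns are made up for the example rather than taken from the Boston data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def filter_select(columns, target, k, redundancy_threshold=None):
    """Pick up to k feature names ranked by |correlation with target|.

    If redundancy_threshold is set, a candidate is skipped when its absolute
    correlation with an already selected feature exceeds the threshold.
    """
    y = columns[target]
    ranked = sorted(
        (name for name in columns if name != target),
        key=lambda name: abs(pearson(columns[name], y)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if len(selected) == k:
            break
        if redundancy_threshold is not None and any(
            abs(pearson(columns[name], columns[s])) > redundancy_threshold
            for s in selected
        ):
            continue
        selected.append(name)
    return selected


# Made-up toy columns: a_dup is nearly a copy of a, const never varies.
toy = {
    "y": [1, 2, 3, 4, 5],
    "a": [2, 4, 6, 8, 10],
    "a_dup": [2.1, 3.9, 6.2, 7.8, 10.1],
    "b": [5, 3, 4, 1, 2],
    "const": [1, 1, 1, 1, 1],
}

print(filter_select(toy, "y", k=2))  # ['a', 'a_dup']
print(filter_select(toy, "y", k=2, redundancy_threshold=0.95))  # ['a', 'b']
```

Without the threshold, the plain filter keeps both a and its near-duplicate a_dup. With the threshold, the second slot goes to b, which carries different information.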

Building and Running the Workflow

To build and run the workflow:

  1. Obtain the Boston Housing Dataset.

  2. Prepare the sample data set.
    $ ./data.sh

  3. Push the workflow to Treasure Data:
    $ td wf push regressor

  4. Run the workflow:
    $ td wf start regressor regression --session now -p apikey=${YOUR_TD_API_KEY}

Input

This workflow uses the Boston Housing Dataset.

This workflow assumes a table as follows:

crim     zn      indus   chas  nox     rm      age     dis     rad  tax  ptratio  b       lstat   medv
double   double  double  int   double  double  double  double  int  int  double   double  double  double
0.00632  18      2.31    0     0.538   6.575   65.2    4.09    1    296  15.3     396.9   4.98    24
0.02731  0       7.07    0     0.469   6.421   78.9    4.9671  2    242  17.8     396.9   9.14    21.6
0.02729  0       7.07    0     0.469   7.185   61.1    4.9671  2    242  17.8     392.83  4.03    34.7
0.03237  0       2.18    0     0.458   6.998   45.8    6.0622  3    222  18.7     394.63  2.94    33.4

medv, the median value of owner-occupied homes in thousands of dollars, is the target value for regression. chas and rad are categorical values; the other features are quantitative.

By default, the workflow keeps the four columns most strongly correlated with medv. This technique, feature selection with a filter method, is intended to improve prediction accuracy. To change the explanatory variables, modify the following file:

  • vectorize_log1p_features.sql
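The contents of vectorize_log1p_features.sql are not shown here. Judging only from the file name, it applies a log1p transform (log(1 + x)) to the selected columns and turns each row into a feature vector. The sketch below illustrates that idea in Python under those assumptions: the "name:value" string format, the function name, and the choice of rm and lstat as explanatory variables are all hypothetical, not taken from the workflow.

```python
import math


def vectorize_log1p(row, feature_names):
    """Return 'name:value' strings with log1p applied to each selected column."""
    features = []
    for name in feature_names:
        value = row[name]
        if value <= -1:
            # log1p(x) is only defined for x > -1
            raise ValueError(f"log1p is undefined for {name}={value}")
        features.append(f"{name}:{math.log1p(value)}")
    return features


# A few columns of the first sample row from the Input section.
row = {"rm": 6.575, "lstat": 4.98, "ptratio": 15.3, "indus": 2.31}

# Hypothetical choice of explanatory variables; edit this list to change them.
print(vectorize_log1p(row, ["rm", "lstat"]))
```

Changing the explanatory variables then amounts to changing which columns are passed into the vectorization step.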

Output

This workflow outputs the predicted house prices to the predictions table as follows:

rowid   predicted_price
string  double
1-10    33.97034232809395
1-121   30.3377696027913

