# Feature Selection With Scikit Learn and Hivemall for House Price Prediction

Treasure Data Workflow provides an easy way to predict continuous values, such as a price or energy consumption, using a linear regression predictor. Machine learning algorithms can run as part of your scheduled workflows through Python custom scripts. This article introduces feature selection using scikit-learn in a Python script, which selects important features and builds a partial query for Hivemall to predict house prices.

[Feature selection](https://en.wikipedia.org/wiki/Feature_selection) is a common machine learning technique used to build a simplified model that is easier to understand and to improve generalization by removing irrelevant or redundant information.

## Feature Selection Using Python Custom Scripts

This article describes how to predict house prices using the [Boston house pricing data set](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.md) with the [Linear Regression predictor](http://hivemall.incubator.apache.org/userguide/regression/general.md). Feature selection with scikit-learn helps identify meaningful attributes from which to create supervised models.

### Example Workflow to Predict House Prices

The workflow:

- Splits the training and test data sets
- Selects important features with scikit-learn
- Builds a partial query for Hivemall
- Trains and evaluates the model with selected features on Hivemall

### Prerequisites

- Make sure that the custom scripts feature is enabled for your TD account.
- Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see [TD Workflow Quickstart](/products/customer-data-platform/data-workbench/workflows/treasure-workflow-quick-start-using-td-toolbelt-in-a-cli).
- Basic knowledge of Treasure Workflow's syntax.

### Run the Example Workflow

1. Download the [house price prediction project](https://github.com/treasure-data/treasure-boxes/tree/master/machine-learning-box/house-price-prediction).
2. From the command line terminal window, change to the house-price-prediction directory. For example: `cd house-price-prediction`
3. Run *data.sh* to ingest training and test data on Treasure Data. The script uses the Boston Housing Dataset, which contains 506 cases, to build the model. The script also creates a database named *boston* and a table named *house_prices* to store the data.

```bash
$ ./data.sh
```

### Input

Assume the input table is:

| crim (double) | zn (double) | indus (double) | chas (int) | nox (double) | rm (double) | age (double) | dis (double) | rad (int) | tax (int) | ptratio (double) | b (double) | lstat (double) | medv (double) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24 |
| 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Here, `medv` is the median value of owner-occupied homes in $1000's and is the target value for regression. `chas` and `rad` are categorical values, and the other features are quantitative. By default, the four columns most strongly correlated with `medv` are used.

Run the example workflow as follows:

```bash
td workflow push regressor
```

### Set credentials for this workflow

```bash
export TD_API_KEY=1/xxxxx
export TD_API_SERVER=https://api.treasuredata.com
td wf secrets --project regressor --set apikey --set endpoint
```

Enter the apikey (for example, `X/XXXXXXX`) and the endpoint (for example, `https://api.treasuredata.com`). Then start the workflow:

```bash
td wf start regressor regression-py
```

The predicted house prices are stored in the *predictions* table. To view the table:

1. Log into TD Console.
2. Search for the *boston* database.
3. Locate the *predictions* table.

### Output

This workflow outputs the predicted results:

| rowid (string) | predicted_price (double) |
| --- | --- |
| 1-10 | 33.97034232809395 |
| 1-121 | 30.3377696027913 |
| ... | ... |

## Review the Workflow Custom Python Script

Review the contents of the directory:

- [regression-py.dig](https://github.com/treasure-data/treasure-boxes/blob/master/machine-learning-box/house-price-prediction/regression-py.dig) - Example workflow for house price prediction.
- [`tasks/__init__.py`](https://github.com/treasure-data/treasure-boxes/tree/master/machine-learning-box/house-price-prediction/tasks) - Custom Python script with scikit-learn. It selects important features and builds a partial query for Hivemall.

This example uses scikit-learn's [SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.md#sklearn.feature_selection.SelectFromModel) meta-transformer, which selects features based on the importance weights of a fitted model.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel

# Excerpt: X (features), y (target), and df (source DataFrame)
# are assumed to be defined earlier in the script.
reg = ExtraTreesRegressor()
reg = reg.fit(X, y)
# Wrap the already fitted regressor and keep only the important features.
model = SelectFromModel(reg, prefit=True)
feature_idx = model.get_support()
feature_name = df.drop(columns=['medv']).columns[feature_idx]
selected_features = set(feature_name)
# (snip)
feature_query = self._feature_column_query(selected_features, feature_types=feature_types)
```

This example uses [ExtraTreesRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.md) to compute feature importances, but you can use another estimator, such as [LassoCV](http://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_boston.md).

The example code also includes a `_create_vectorize_table` function, which creates a vectorized table for training Hivemall models from Python.
```python
self._create_vectorize_table(
    engine_hive, dbname, "train",
    "{}_train".format(source_table), feature_query
)
```

While you can use this function within the custom Python script instead of exporting the partial query, we recommend that you export queries. Exporting queries gives you the benefit of Digdag parallelization and manageability on TD Console.
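
To make the feature selection and partial query steps concrete, the following is a minimal, self-contained sketch rather than the project's actual script. It assumes synthetic data in place of the *house_prices* table, and `build_feature_query` is a hypothetical helper that stands in for `_feature_column_query`, whose real implementation may differ. The query fragment uses Hivemall's `quantitative_features` and `categorical_features` functions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the house_prices table (assumption: not the real data).
columns = ["crim", "rm", "lstat", "age", "chas"]
categorical = {"chas"}  # treated as categorical, as described in the article

rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(columns)))
X[:, columns.index("chas")] = rng.integers(0, 2, size=300)
# The target depends only on rm and lstat, plus a little noise.
y = 4.0 * X[:, 1] - 4.0 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Fit a tree ensemble, then keep features whose importance exceeds the mean
# importance (the default SelectFromModel threshold for this estimator).
reg = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(reg, prefit=True)
mask = selector.get_support()
selected = [name for name, keep in zip(columns, mask) if keep]


def build_feature_query(names, categorical_names):
    """Hypothetical helper: build a partial Hivemall feature expression."""
    quantitative = [n for n in names if n not in categorical_names]
    categorical_cols = [n for n in names if n in categorical_names]
    parts = []
    for func, cols in (
        ("quantitative_features", quantitative),
        ("categorical_features", categorical_cols),
    ):
        if cols:
            quoted = ", ".join('"{}"'.format(c) for c in cols)
            parts.append("{}(array({}), {})".format(func, quoted, ", ".join(cols)))
    return ",\n".join(parts)


feature_query = build_feature_query(selected, categorical)
print(selected)
print(feature_query)
```

A fragment like this could be exported from the Python task and embedded in the training query, which matches the recommendation above to export queries instead of running them inside the script.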