Treasure Data Workflow provides an easy way to predict continuous values, such as a price or energy consumption, using a Linear Regression predictor. Machine learning algorithms can be run as part of your scheduled workflows using custom Python scripts. This article introduces feature selection with scikit-learn in a Python script, which selects important features and builds a partial query for Hivemall to predict house prices.
Feature selection is a common machine learning technique used to build a simpler, more interpretable model and to improve generalization by removing irrelevant or redundant information.
Feature Selection Using Python Custom Scripts
This article describes how to predict house prices from the Boston house pricing data set with a Linear Regression predictor. Feature selection from scikit-learn helps identify the meaningful attributes from which to create supervised models.
Example Workflow to Predict House Prices
Splits the data into training and test sets
Selects important features with scikit-learn
Builds a partial query for Hivemall
Trains and evaluates the model with selected features on Hivemall
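The first step above can be sketched locally with scikit-learn's train_test_split. This is an illustration only: the actual workflow performs the split with queries on Treasure Data, and the diabetes dataset stands in here for the house price data.

```python
# Illustrative sketch of the train/test split step, run locally with
# scikit-learn. The real workflow splits the data with queries on
# Treasure Data; the diabetes dataset is a stand-in.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the rows for evaluation, with a fixed seed for
# repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```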
Make sure that the custom scripts feature is enabled for your TD account.
Download and install the TD Toolbelt and the TD Toolbelt Workflow module. For more information, see TD Workflow Quickstart.
Basic knowledge of Treasure Workflow syntax
Run the Example Workflow
Download the house price prediction project.
From a command line terminal window, change to the house-price-prediction directory (for example, `cd house-price-prediction`).
Run data.sh to ingest the training and test data into Treasure Data. The script uses the Boston Housing Dataset, which has 506 cases, to build the model. The script also creates a database named boston and a table named house_prices to store the data.
Assume the input table is:
Here “medv”, the median value of owner-occupied homes in $1000's, is the target value for regression; “chas” and “rad” are categorical values, and the other features are quantitative.
By default, the 4 columns most strongly correlated with medv are used.
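The "most correlated columns" selection can be sketched with pandas. The DataFrame below is synthetic (only the column names mirror the Boston data); the workflow computes correlations over the real house_prices table.

```python
# Sketch of ranking columns by absolute Pearson correlation with the
# target medv. The data here is synthetic; only the column names
# mirror the Boston dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),      # rooms per dwelling
    "lstat": rng.normal(12.0, 7.0, n),  # % lower-status population
    "crim": rng.normal(3.6, 8.6, n),    # crime rate (noise here)
})
df["medv"] = 22.5 + 5.0 * df["rm"] - 0.6 * df["lstat"] + rng.normal(0, 2, n)

# Rank features by |correlation| with the target and keep the top 4
# (this toy frame has only 3 candidate columns, so all are ranked).
corr = df.corr()["medv"].drop("medv").abs()
top = corr.sort_values(ascending=False).head(4).index.tolist()
print(top)
```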
Run the example workflow as follows:
```sh
# Set credentials for this workflow
# export TD_API_KEY=1/xxxxx
# export TD_API_SERVER=https://api.treasuredata.com
```

Enter your API key (e.g., X/XXXXXXX) and endpoint (e.g., https://api.treasuredata.com).
The predicted results for the price of houses are stored in the predictions table.
To view the table:
Log into TD Console.
Search for the boston database.
Locate the predictions table, which contains the workflow's predicted results.
Review the Workflow Custom Python Script
Review the contents of the directory:
regression-py.dig - Example workflow for house price prediction.
task/__init__.py - Custom Python script with scikit-learn. It selects important features and builds a partial query for Hivemall.
This example uses scikit-learn's SelectFromModel function, which enables selection of features when building a predictive model.
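A minimal sketch of SelectFromModel is shown below. This is not the project's actual task/__init__.py: it fits a Lasso model and keeps the features with the largest absolute coefficients, and it uses the diabetes dataset as a stand-in (recent scikit-learn releases no longer ship the Boston dataset).

```python
# Sketch of feature selection with SelectFromModel: fit a Lasso model
# and keep the features with the largest absolute coefficients. Not
# the project's actual script; the diabetes dataset is a stand-in.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True, as_frame=True)

# max_features caps the selection at the 4 strongest features.
selector = SelectFromModel(Lasso(alpha=0.1), max_features=4)
selector.fit(X, y)

selected = X.columns[selector.get_support()].tolist()
print(selected)
```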
In the example code, there is a _create_vectorize_table function for creating a vectorized table to train Hivemall models with Python.
While you can use this function within the custom Python script instead of exporting the partial query, we recommend exporting queries: doing so gives you the benefit of Digdag parallelization and easier management in TD Console.
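The exported partial query can be sketched as string building over the selected columns. The helper below is hypothetical (the project's real builder lives in task/__init__.py) and assumes Hivemall's "name:value" format for quantitative features and "name#value" for categorical ones.

```python
# Hypothetical sketch of building a partial Hivemall feature-vector
# expression from selected columns. The project's real builder lives
# in task/__init__.py; the Hivemall feature formats assumed here are
# "name:value" (quantitative) and "name#value" (categorical).
def build_feature_expr(quantitative, categorical):
    parts = [f'concat("{col}:", {col})' for col in quantitative]
    parts += [f'concat("{col}#", {col})' for col in categorical]
    return "array(" + ", ".join(parts) + ") as features"

expr = build_feature_expr(["rm", "lstat"], ["chas"])
print(expr)
# → array(concat("rm:", rm), concat("lstat:", lstat), concat("chas#", chas)) as features
```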