This article introduces a machine learning workflow for Treasure Data, using the Chicago Energy Benchmarking dataset to predict future energy consumption. The city of Chicago publishes measured energy efficiency for each building to encourage participants to improve their efficiency.

Through this article, you will learn how to:

  • create a machine learning model for regression with Hivemall

  • create a machine learning workflow with Treasure Workflow

Writing a workflow is essential for production use, where ML queries must be executed repeatedly.

Prerequisites

This example assumes you are familiar with:

  • basic machine learning concepts

  • Treasure Data machine learning

  • one-hot representations

A one-hot representation for machine learning purposes is a group of bits in which the only legal combinations of values are those with a single high (1) bit and all the others low (0). In natural language processing, a one-hot vector is a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary. The vector consists of 0s in all cells except for a single 1 in the cell that uniquely identifies the word.
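For example, given a hypothetical three-word vocabulary, each word maps to a vector with a single 1 in its own position:

cold → [1, 0, 0]
warm → [0, 1, 0]
hot  → [0, 0, 1]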

Ingesting Sample Data

To ingest the Chicago Energy Benchmarking dataset to Treasure Data:

  1. Navigate to the Chicago data portal.

  2. Select Export.

  3. Select CSV. The file downloads to your local machine.

  4. Open the TD Console.

  5. Navigate to Integrations > Source.

  6. At the top-right of the page, select Upload File.

  7. Browse to the Chicago_Energy_Benchmarking.csv file.

  8. Open the file within Treasure Data.

  9. On the Preview page, select Advanced settings.

  10. In the dialog, scroll down to locate the column named # of Buildings.

  11. Change the column name to "Num of Buildings" to avoid problems caused by irregular characters in the column name.

  12. Save and select Next.

  13. Select Create new database? and type chicago_smart_green as the database name.

  14. Select Create new table? and type energy_benchmarking as the table name.

  15. Select Start Upload.

  16. Verify that the data uploaded by navigating to Databases > chicago_smart_green > energy_benchmarking (or with the quick query sketched after this list).

  17. Optionally, review the description of each column in the Columns in this Dataset section of the Chicago Energy Benchmarking page.
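Alternatively, a quick query can confirm the ingest. The following is a minimal sketch to run with the Presto engine; the exact row count depends on the dataset version at download time:

select count(1) as row_count
from chicago_smart_green.energy_benchmarking
;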

Creating a Workflow

Before writing your core machine learning queries, create a workflow so that the queries can be run repeatedly.

  1. Open the workflow creation page, then create a workflow from the Blank template.

  2. The workflow name and project are both assumed to be chicago_smart_green.

  3. Edit the workflow so that it has the following code:

_export:
  td:
    database: chicago_smart_green
    engine: hive

+vectorize:
  td>: queries/vectorize.sql
  create_table: vectors

+shuffle:
  td>: queries/shuffle.sql
  create_table: samples
  engine: presto

+train:
  td>: queries/train.sql
  create_table: model

+evaluate:
  td>: queries/evaluate.sql
  store_last_results: true

+show_accuracy:
  echo>: "RMSE: ${td.last_results.rmse}, MAE: ${td.last_results.mae}"
  4. Save the workflow.

  5. Select SAVE & COMMIT.

Writing Queries to Cleanse Data and Create Feature Vectors

Create feature vectors with Hivemall.

  1. Open the workflow.

  2. Select EDIT.

  3. Add a SQL file as a Project File.

  4. Rename the file to:

    • queries/vectorize.sql

  5. Optionally, convert potentially useful columns into categorical or quantitative features.

  6. Optionally, concatenate them into a single feature vector (represented as an array of strings). For example:

select
  id,
  array_concat(
    categorical_features(
      array('Chicago community area', 'Primary use of property'),
      community_area, primary_property_type
    ),
    quantitative_features(
      array('Total interior floor space', 'Building age', 'Number of buildings'),
      gross_floor_area___buildings__sq_ft_, data_year - year_built, num_of_buildings
    )
  ) as features,
  electricity_use__kbtu_ as annual_electricity_consumption
from
  energy_benchmarking
where
  electricity_use__kbtu_ is not null
;
  7. Create the feature vector with Hivemall.

Each Hivemall feature consists of an index (i.e., a feature name) and a value:

  • Numerical value: <index>:<value>

    • for example, price:600.0

  • Categorical value: <index>#<value>

    • for example, gender#male

Feature index and feature value are separated by a colon. If the colon and the value are omitted, the value is assumed to be 1.0.

So, a categorical feature of gender#male is a one-hot representation with index := gender#male and value := 1.0. Note that # is not a special character here; it is simply part of the feature name.

Each feature is a string value, and a "feature vector" is an array of such strings:

["price:600.0", "day of week#Saturday", "gender#male", "category#book"]

Hivemall provides useful functions for preparing feature vectors: quantitative_features(), categorical_features(), and array_concat(). See the Hivemall input format web page for details.
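For instance, the following minimal sketch (with literal values standing in for table columns) shows how these helpers combine into a single feature vector:

select
  array_concat(
    categorical_features(array('gender'), 'male'),       -- one-hot categorical feature
    quantitative_features(array('price'), 600.0)         -- numerical feature
  ) as features -- => ["gender#male", "price:600.0"]
;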

Creating your Machine Learning Model

To split the data into 80% training and 20% testing sets:

  1. Create the queries/shuffle.sql file for the +shuffle phase of your workflow as follows:

select
  *,
  -- deterministic hash-based value in [0, 99]; rows with random_flag < 80
  -- become training data and the remaining rows become test data
  abs(from_big_endian_64(xxhash64(to_utf8(cast(id as varchar)))) % 100)
  as random_flag
from
  vectors
;
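Because the flag is derived from a hash of id, the split is deterministic and reproducible across runs. As a quick sanity check (a minimal sketch, also run with the Presto engine), the resulting proportions can be inspected:

select
  count_if(random_flag < 80)  as training_rows,
  count_if(random_flag >= 80) as testing_rows
from
  samples
;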
  2. Using the training data (80% of the rows), build a linear regressor that predicts annual electricity consumption for a given building. For example, your queries/train.sql should be edited as follows:

select
  feature,
  avg(weight) as weight
from (
  select
    train_regressor(
      features,
      annual_electricity_consumption,
      '-loss_function squared -optimizer AdaGrad -regularization l1' 
        -- hyper-parameters
    ) as (feature, weight)
  from (
    select features, annual_electricity_consumption
    from samples
    where random_flag < 80
    CLUSTER BY rand(1) -- random shuffling
  ) t1
) t2
group by
  feature
;
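Note that train_regressor may run in parallel across multiple mappers, each producing its own set of weights; the outer query averages the per-feature weights to merge those partial models into a single model table.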

Evaluating the Energy Prediction Model

  1. Prepare a SQL file named queries/evaluate.sql that predicts the electricity usage of the 20% of buildings held out for testing and compares the predicted use with the actual use. For example:

with features_exploded as (
  -- explode each test sample's feature vector into (id, feature, value) rows
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from (
    select * from samples where random_flag >= 80
  ) t1
  LATERAL VIEW explode(features) t2 as fv
),
prediction as (
  -- predicted consumption is the dot product of feature values and model weights
  select
    t1.annual_electricity_consumption,
    t2.predicted_electricity_consumption
  from
    samples t1
  join (
      select
        t1.id,
        sum(p1.weight * t1.value) as predicted_electricity_consumption
      from
        features_exploded t1
        LEFT OUTER JOIN model p1 ON (t1.feature = p1.feature)
      group by
        t1.id
    ) t2
    on t1.id = t2.id
)
select
  mae(predicted_electricity_consumption, annual_electricity_consumption) as mae,
  rmse(predicted_electricity_consumption, annual_electricity_consumption) as rmse
from
  prediction
;
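For reference, MAE is the mean of |predicted - actual| over the test buildings, and RMSE is the square root of the mean of (predicted - actual)^2; lower values indicate a better fit, with RMSE penalizing large errors more heavily.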
  2. Run the following SQL on the TD Console to check the model weights.

select
  feature,
  weight
from
  model
order by
  abs(weight) desc
limit 10
;

You should get output similar to the following:

feature                                                          weight
Building age                                                     2.27515483
Primary use of property#Office                                   1.89947212
Chicago community area#LOOP                                      1.69078124
Total interior floor space                                       1.46410573
Number of buildings                                              1.42170572
Primary use of property#K-12 School                              0.90449739
Chicago community area#NEAR NORTH SIDE                           0.89309341
Primary use of property#Hospital (General Medical & Surgical)    0.64525992
Chicago community area#NEAR WEST SIDE                            0.63020498
Chicago community area#HYDE PARK                                 0.58977646

Predicting Unknown Energy Consumption

Although the electricity usage of some buildings has not yet been benchmarked, you can now predict their potential energy consumption with the linear regression model.

  1. Run a Hive query as follows:

with null_samples as (
  -- feature vectors for buildings whose electricity usage is not yet benchmarked
  select
    id,
    array_concat(
      categorical_features(
        array('Chicago community area', 'Primary use of property'),
        community_area, primary_property_type
      ),
      quantitative_features(
        array('Total interior floor space', 'Building age', 'Number of buildings'),
        gross_floor_area___buildings__sq_ft_, data_year - year_built, num_of_buildings
      )
    ) as features
  from
    energy_benchmarking
  where
    electricity_use__kbtu_ is null
),
features_exploded as (
  select 
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from null_samples t1
  LATERAL VIEW explode(features) t2 as fv
)
select
  t1.id,
  sum(p1.weight * t1.value) as predicted_electricity_consumption
from
  features_exploded t1
  LEFT OUTER JOIN model p1 ON (t1.feature = p1.feature)
group by
  t1.id
;

You should get a predicted electricity consumption value for each building whose usage was previously unknown.

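To make this prediction step repeatable, a task could also be appended to the chicago_smart_green workflow. In this hypothetical sketch, the task name +predict, the file queries/predict.sql (containing the query above), and the output table predictions are all illustrative assumptions:

+predict:
  td>: queries/predict.sql
  create_table: predictions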