Treasure Data Workflow provides an easy way to run custom Python scripts for sentiment analysis with TensorFlow and export the resulting model to Amazon S3. Machine learning algorithms can run as part of your scheduled workflows using custom Python scripts. This article introduces the steps to run a sentiment analysis algorithm within a Treasure Data Workflow.

Sentiment analysis classifies movie review text as positive or negative using TensorFlow and TensorFlow Hub. See the official documentation.


Sentiment Analysis using Custom Python Scripts

There are two versions of the workflow discussed in this article:

Example Workflow using TensorFlow with Amazon S3

The workflow:

  • Fetches review data from Treasure Data

  • Builds a model with TensorFlow

  • Stores the model on S3

  • Predicts polarities for unknown review data and writes it back to Treasure Data

Prerequisites

  • Make sure the custom scripts feature is enabled for your TD account.

  • Download and install the TD Toolbelt and the TD Toolbelt Workflow module.

  • Basic knowledge of Treasure Data Workflow syntax

  • An Amazon S3 bucket

  • S3 credentials (an AWS access key ID and secret access key)

Run the Example Workflow

  1. Download the sentimental-analysis project from this repository.

  2. In a terminal window, change directory to sentimental-analysis.

  3. Run data.sh to ingest training and test data into Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data. For example:

$ ./data.sh 


Assume that the input table is:

rowid | sentence | sentiment | polarity
----- | -------- | --------- | --------
1-10531 | "Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish... | 2 | 0
1-10960 | Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of... | 3 | 0
1-24370 | I viewed the movie together with my arrogant, film critic friend, my wife and her female friend. So... | 7 | 1
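The polarity column is a binary label derived from the raw sentiment score. A minimal sketch of that mapping, assuming (based only on the sample rows, where scores 2 and 3 map to 0 and score 7 maps to 1) a threshold of 7 — the actual threshold used by data.sh may differ:

```python
def to_polarity(sentiment_score: int, threshold: int = 7) -> int:
    """Map a raw review score to a binary polarity label.

    The threshold of 7 is an assumption inferred from the sample rows;
    adjust it to match the actual labeling scheme of your data.
    """
    return 1 if sentiment_score >= threshold else 0

# The sample rows above:
print(to_polarity(2))  # 0
print(to_polarity(3))  # 0
print(to_polarity(7))  # 1
```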


  4. Push the example workflow to Treasure Data:

    td workflow push sentiment
  5. Set the secrets. You are prompted for each value on STDIN, for example apikey=x/xxxxx, endpoint=https://api.treasuredata.com, s3_bucket=my_bucket, aws_access_key_id=AAAAAAAAAA, and aws_secret_access_key=XXXXXXXXX:

    td workflow secrets \
      --project sentiment \
      --set apikey \
      --set endpoint \
      --set s3_bucket \
      --set aws_access_key_id \
      --set aws_secret_access_key
  6. Start the analysis:

    td workflow start sentiment sentiment-analysis --session now

Results of the script are stored in the test_predicted_polarities table in Treasure Data.



To view the table:

  1. Log into TD Console.

  2. Search for the sentiment database.

  3. Locate the test_predicted_polarities table.

  4. The prediction results are stored in this table as shown below:

rowid | predicted_polarity
----- | ------------------
1-21643 | 0
1-22967 | 1

Example Workflow using TensorFlow without Amazon S3

The workflow:

  • Fetches review data from Treasure Data

  • Builds a model with TensorFlow

  • Predicts polarities for unknown review data and writes the data back to Treasure Data

Prerequisites

  • Make sure the custom scripts feature is enabled for your TD account.

  • Download and install the TD Toolbelt and the TD Toolbelt Workflow module.

  • Basic knowledge of Treasure Data Workflow syntax


Run the Example Workflow

  1. Download the sentimental-analysis project from this repository.

  2. From a terminal window, change directory to sentimental-analysis. For example:

    cd sentimental-analysis
  3. Run data.sh to ingest training and test data into Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data.

    $ ./data.sh 

Assume that the input table is as follows:

rowid | sentence | sentiment | polarity
----- | -------- | --------- | --------
1-10531 | "Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish... | 2 | 0
1-10960 | Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of... | 3 | 0
1-24370 | I viewed the movie together with my arrogant, film critic friend, my wife and her female friend. So... | 7 | 1


  4. Push the example workflow to Treasure Data:

    td workflow push sentiment
  5. Set the secrets. You are prompted for each value on STDIN, for example apikey=x/xxxxx and endpoint=https://api.treasuredata.com:

    td workflow secrets \
      --project sentiment \
      --set apikey \
      --set endpoint
  6. Start the analysis:

    td workflow start sentiment sentiment-analysis-simple --session now


Results of the script are stored in the test_predicted_polarities table in Treasure Data.


To view the table:

  1. Log into TD Console.

  2. Search for the sentiment database.

  3. Locate the test_predicted_polarities table.

The prediction results should be similar to the following:

rowid | predicted_polarity
----- | ------------------
1-21643 | 0
1-22967 | 1


Review the Workflow Custom Python Script

Review the contents of the sentimental-analysis directory:

  • sentiment-analysis.dig - This is the TD Workflow YAML file for sentiment analysis with TensorFlow.

  • sentiment.py - This is the custom Python script with TensorFlow. It builds a prediction model from existing data and predicts polarity for unknown data.
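As a rough illustration, a workflow definition along these lines can invoke the script with the py> operator. The task name, entry-point function, and docker image below are assumptions; refer to the actual sentiment-analysis.dig in the project for the real definition:

```yaml
# Hypothetical sketch of a .dig file invoking a custom Python script.
_export:
  docker:
    image: "digdag/digdag-python:3.9"

+sentiment_analysis:
  # Runs the main() function defined in sentiment.py.
  py>: sentiment.main
  _env:
    TD_API_KEY: ${secret:apikey}
    TD_API_ENDPOINT: ${secret:endpoint}
```

The secrets registered with td workflow secrets are exposed to the script here as environment variables via ${secret:...} references.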

In this example, we use a pre-trained model from TensorFlow Hub for word embedding of English text.

embedded_text_feature_column = hub.text_embedding_column(
   key="sentence",
   module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")

If you want to change this model to another one, for example a Japanese model, you can modify it as follows:

embedded_text_feature_column = hub.text_embedding_column(
   key="sentence",
   module_spec="https://tfhub.dev/google/nnlm-ja-dim128/1")

Before word embedding, you need to prepare tokenized (space-separated) sentences for Japanese.
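Japanese text has no spaces between words, so it must be segmented before it reaches the embedding column. In practice you would use a morphological analyzer such as MeCab; the character-level fallback below is only a stand-in to show the expected space-separated format:

```python
def tokenize_ja(text: str) -> str:
    """Naive stand-in tokenizer: splits Japanese text into characters.

    A real pipeline would use a morphological analyzer (e.g. MeCab)
    to produce proper word boundaries; either way, the embedding
    module expects space-separated tokens.
    """
    return " ".join(ch for ch in text if not ch.isspace())

print(tokenize_ja("映画が好き"))  # 映 画 が 好 き
```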


Because this custom script also saves the TensorFlow model trained on movie reviews to Amazon S3, you can build your own prediction server with TensorFlow Serving.

To change the serving_input_receiver_fn, modify the following code:

# Build a parsing spec from the embedding feature column.
feature_spec = tf.feature_column.make_parse_example_spec([embedded_text_feature_column])
# Receiver that parses serialized tf.Example protos at serving time.
serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator.export_saved_model(EXPORT_DIR_BASE, serving_input_receiver_fn)


See the TensorFlow documentation for details.


