Treasure Data Workflow lets you run Python custom scripts that perform sentiment analysis with TensorFlow and export the trained model to Amazon S3. Machine learning algorithms can run as part of your scheduled workflows through Python custom scripts. This article describes the steps to run sentiment analysis within a Treasure Data Workflow.
The sentiment analysis example classifies movie review texts as positive or negative using TensorFlow and TensorFlow Hub. See the official document.
This article discusses two versions of the workflow:
- Example Workflow using TensorFlow with Amazon S3
- Example Workflow using TensorFlow without Amazon S3
The workflow with Amazon S3:
- Fetches review data from Treasure Data
- Builds a model with TensorFlow
- Stores the model on S3
- Predicts polarities for unknown review data and writes them back to Treasure Data
Prerequisites:
- Make sure the custom scripts feature is enabled for your TD account.
- Download and install the TD Toolbelt and the TD Toolbelt Workflow module.
- Basic knowledge of Treasure Data Workflow syntax
- Amazon S3
- S3 secrets
Run the Example Workflow
- Download the sentimental-analysis project from this repository.
- In the Terminal window, change directory to sentimental-analysis.
- Run data.sh to ingest training and test data on Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data. For example:
$ ./data.sh
Assume that the input table is:
| rowid | sentence | sentiment | polarity |
|---|---|---|---|
| 1-10531 | "Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish... | 2 | 0 |
| 1-10960 | Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of... | 3 | 0 |
| 1-24370 | I viewed the movie together with my arrogant, film critic friend, my wife and her female friend. So... | 7 | 1 |
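
The sample rows suggest how the two label columns relate: sentiment looks like a 1-10 star rating, and polarity is the binary label the model learns. The following is a minimal sketch of that mapping. It assumes the data follows the Large Movie Review Dataset convention (ratings of 4 or lower are negative, ratings of 7 or higher are positive, and neutral ratings are excluded). The function name is hypothetical and is not part of the example project.

```python
def polarity_from_sentiment(sentiment: int) -> int:
    """Map a 1-10 star rating to a binary polarity (0 = negative, 1 = positive)."""
    if not 1 <= sentiment <= 10:
        raise ValueError(f"rating out of range: {sentiment}")
    if sentiment <= 4:
        return 0
    if sentiment >= 7:
        return 1
    # Assumption: neutral ratings (5 and 6) do not occur in the data set.
    raise ValueError(f"neutral rating has no polarity: {sentiment}")


# The three sample rows from the table above, as (sentiment, polarity) pairs.
sample_rows = [(2, 0), (3, 0), (7, 1)]
for sentiment, polarity in sample_rows:
    assert polarity_from_sentiment(sentiment) == polarity
```

If your own data uses a different rating scale, only the two thresholds need to change.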
- Run the example workflow as follows:
td workflow push sentiment
- Set the secrets:
td workflow secrets \
--project sentiment \
--set apikey \
--set endpoint \
--set s3_bucket \
--set aws_access_key_id \
--set aws_secret_access_key
# Set secrets from STDIN like:
# apikey=x/xxxxx, endpoint=https://api.treasuredata.com, s3_bucket=my_bucket, aws_access_key_id=AAAAAAAAAA, aws_secret_access_key=XXXXXXXXX
- Start the analysis:
td workflow start sentiment sentiment-analysis --session now
Results of the script are stored in the test_predicted_polarities table in Treasure Data.
To view the table:
- Log in to TD Console.
- Search for the sentiment database.
- Locate the test_predicted_polarities table.
The prediction results are stored in this table as shown below:
| rowid | predicted_polarity |
|---|---|
| 1-21643 | 0 |
| 1-22967 | 1 |
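
The secrets set in the steps above only become useful once the custom script can read them. Treasure Data workflows typically pass secrets to a Python custom script as environment variables declared in the .dig file. The sketch below shows one way a script might collect and validate them. The environment variable names and the load_settings helper are assumptions for illustration, not names taken from the example project.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names; the real names depend on how the workflow's
# .dig file maps each secret into the script's environment.
REQUIRED_VARS = (
    "TD_API_KEY",
    "TD_API_SERVER",
    "S3_BUCKET",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, failing early if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Checking all variables up front means a misconfigured secret fails the task immediately instead of after the model has been trained.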
The workflow without Amazon S3:
- Fetches review data from Treasure Data
- Builds a model with TensorFlow
- Predicts polarities for unknown review data and writes the data back to Treasure Data
Prerequisites:
- Make sure the custom scripts feature is enabled for your TD account.
- Download and install the TD Toolbelt and the TD Toolbelt Workflow module.
- Basic knowledge of Treasure Data Workflow syntax
Run the Example Workflow
- Download the sentimental-analysis project from this repository.
- From the command line Terminal window, change directory to sentimental-analysis. For example:
cd sentiment-analysis
- Run data.sh to ingest training and test data on Treasure Data. About 80 million records are fetched to build the model. The script also creates a database named sentiment and tables named movie_review_train and movie_review_test to store the data.
$ ./data.sh
Assume that the input table is as follows:
| rowid | sentence | sentiment | polarity |
|---|---|---|---|
| 1-10531 | "Bela Lugosi revels in his role as European horticulturist (sic) Dr. Lorenz in this outlandish... | 2 | 0 |
| 1-10960 | Fragmentaric movie about a couple of people in Austria during a heatwave. This kind of... | 3 | 0 |
| 1-24370 | I viewed the movie together with my arrogant, film critic friend, my wife and her female friend. So... | 7 | 1 |
- Run the example workflow as follows:
td workflow push sentiment
- Add the secrets:
td workflow secrets \
--project sentiment \
--set apikey \
--set endpoint
# Set secrets from STDIN like:
# apikey=x/xxxxx, endpoint=https://api.treasuredata.com
- Start the analysis:
td workflow start sentiment sentiment-analysis-simple --session now
Results of the script are stored in the test_predicted_polarities table in Treasure Data.
To view the table:
- Log in to TD Console.
- Search for the sentiment database.
- Locate the test_predicted_polarities table.
The prediction results should be similar to the following:
| rowid | predicted_polarity |
|---|---|
| 1-21643 | 0 |
| 1-22967 | 1 |
Review the contents of the sentimental-analysis directory:
- sentiment-analysis.dig: The TD Workflow YAML file for sentiment analysis with TensorFlow.
- sentiment.py: The custom Python script that uses TensorFlow. It builds a prediction model from existing data and predicts the polarity of unknown data.
In this example, we use a pre-trained model from TensorFlow Hub to embed English text:
import tensorflow_hub as hub  # provides text_embedding_column
# Embed the "sentence" column with a pre-trained English NNLM module.
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1"
)
To switch to another model, for example a Japanese model, modify the code as follows:
# Same column, but with the Japanese NNLM module.
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence",
    module_spec="https://tfhub.dev/google/nnlm-ja-dim128/1"
)
Before word embedding, you need to prepare tokenized sentences for Japanese.
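Japanese text has no spaces between words, so tokenizing here means inserting word boundaries before the text reaches the embedding column. The sketch below assumes the embedding module splits its input on whitespace, and it uses a toy longest-match tokenizer as a stand-in for a real morphological analyzer. Both toy_tokenize and to_space_separated are hypothetical helpers, not part of the example project.

```python
from typing import Callable, List, Sequence


def toy_tokenize(
    sentence: str,
    vocabulary: Sequence[str] = ("この", "映画", "は", "面白い"),
) -> List[str]:
    """Greedy longest-match tokenizer over a tiny fixed vocabulary.

    Stand-in only: a real pipeline would call a morphological analyzer here.
    Characters that match no vocabulary entry become single-character tokens.
    """
    by_length = sorted(vocabulary, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(sentence):
        match = next((w for w in by_length if sentence.startswith(w, i)), sentence[i])
        tokens.append(match)
        i += len(match)
    return tokens


def to_space_separated(sentence: str, tokenize: Callable[[str], List[str]]) -> str:
    """Join tokens with single spaces so whitespace splitting recovers them."""
    tokens = (token.strip() for token in tokenize(sentence))
    return " ".join(token for token in tokens if token)


print(to_space_separated("この映画は面白い", toy_tokenize))  # この 映画 は 面白い
```

Passing the tokenizer as a parameter keeps the preprocessing step independent of whichever analyzer you choose.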
Because this custom script also saves the TensorFlow model trained on movie reviews to Amazon S3, you can build a prediction server with TensorFlow Serving.
To change the serving_input_receiver_fn, modify the following code:
feature_spec = tf.feature_column.make_parse_example_spec([embedded_text_feature_column])
serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator.export_saved_model(EXPORT_DIR_BASE, serving_input_receiver_fn)
See TensorFlow documentation for details.
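Once TensorFlow Serving loads the exported model, clients can query it over its REST API. The sketch below only builds the URL and JSON body for a classify request; it does not send anything. It assumes the exported estimator is a classifier whose input feature is named sentence (matching the key used above). The host, port, model name, and the build_classify_request helper are hypothetical.

```python
import json
from typing import Dict, List


def build_classify_request(
    sentences: List[str],
    host: str = "localhost:8501",   # assumed REST endpoint of the serving instance
    model_name: str = "sentiment",  # assumed name the model is served under
) -> Dict[str, str]:
    """Build the URL and JSON body for a TensorFlow Serving classify call."""
    url = f"http://{host}/v1/models/{model_name}:classify"
    # Each example is a JSON object keyed by feature name.
    body = json.dumps(
        {"examples": [{"sentence": sentence} for sentence in sentences]},
        ensure_ascii=False,
    )
    return {"url": url, "body": body}


request = build_classify_request(["A touching and beautifully shot film."])
print(request["url"])
print(request["body"])
```

The returned URL and body can then be sent with any HTTP client as a POST request.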