You can use td-pyspark to bridge the results of data manipulations in Google Colab with your data in Treasure Data.

Google Colab notebooks make it easy to model with PySpark in Google. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.


To follow the steps in this example, you must have the following Treasure Data items:

Configuring your Google Colab Environment

You create an envelope, install pyspark and td-pyspark libraries and configure the notebook for your connection code.

Create an Envelope in Google Colab

  1. Open Google Colab.

  2. Click File > New Python 3 notebook.

  3. Ensure that the runtime is connected.

The notebook shows a green check on the top right corner.

Prepare your Environment for the PySpark and TD-PySpark Libraries

Click the icon to add a code cell:

Enter the following code:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install pyspark td-pyspark

Create and Upload the td-spark.conf File

You specify your TD API Key and site on your local file system. Create a file as follows:

An example of the format is as follows. You provide the actual values: (Your TD API KEY) (Your site: us, jp, eu01, ap02)
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true

Name the file td-spark.conf and upload the file by clicking Files > Upload on the Google Colab menu. Verify that the td-spark.conf file is saved in the /content directory.

Run the Installation and Begin Work in Google Colab

Run the current cell by selecting the cell and clicking shift + enter keys.

Create a second code cell and create a script similar to the following code:

import os
os.environ[‘PYSPARK_SUBMIT_ARGS’] = ‘--jars /usr/local/lib/python2.7/dist-packages/td_pyspark/jars/td-spark-assembly.jar --properties-file /content/td-spark.conf pyspark-shell’
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

import td_pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("td-pyspark-test")
td = td_pyspark.TDSparkContextBuilder(builder).build()
df = td.table("sample_datasets.www_access").within("-10y").df()

TDSparkContextBuilder is an entry point to access td_pyspark's functionalities. As shown in the preceding code sample, you read tables in Treasure Data as data frames:

df = td.table("tablename").df()

You see a result similar to the following:

Your connection is working.

Interacting with Treasure Data from Google Colab

In Google Colab, you can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.

In Google Colab, you can use the following commands:

Read Tables as DataFrames

By calling .df() your table data is read as Spark's DataFrame. The usage of the DataFrame is the same with PySpark. See also PySpark DataFrame documentation.

df = td.table("sample_datasets.www_access").df()

Submit Presto Queries

If your Spark cluster is small, reading all of the data as in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processing with PySpark.

q = td.presto("select code, * from sample_datasets.www_access")

q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")

You see:

Create or Drop a Database



Upload DataFrames to Treasure Data

To save your local DataFrames as a table, you have two options:

td.insert_into(df, "mydb.tbl1")

td.create_or_replace(df, "mydb.tbl2")

Checking Google Colab in Treasure Data

You can use td toolbelt to check your database from a command line. Alternatively, if you have TD Console, you can check your databases and queries. Read about Database and Table Management.