You can use td-pyspark to bridge the results of data manipulations in Google Colab with your data in Treasure Data.
Google Colab notebooks make it easy to model with PySpark. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark with Treasure Data, built on td-spark.
To follow the steps in this example, you must have the following Treasure Data items:
Treasure Data API key
td-spark feature enabled
You create an environment, install the pyspark and td-pyspark libraries, and configure the notebook for your connection code.
Open Google Colab.
Click File > New Python 3 notebook.
Ensure that the runtime is connected.
The notebook shows a green check in the top right corner.
Click the icon to add a code cell:
Enter the following code:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install pyspark td-pyspark
Specify your TD API key and site in a configuration file on your local file system. Create the file as follows:
An example of the format is as follows. You provide the actual values:
spark.td.apikey (Your TD API KEY)
spark.td.site (Your site: us, jp, eu01, ap02)
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true
Name the file td-spark.conf and upload it by clicking Files > Upload in the Google Colab menu. Verify that the td-spark.conf file is saved in the /content directory.
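As an alternative to uploading the file manually, you can write it from a notebook cell. The following is a minimal sketch; the API key and site values are placeholders that you replace with your own:

# Placeholder values: replace with your actual TD API key and site.
conf_text = """spark.td.apikey YOUR_TD_API_KEY
spark.td.site us
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.execution.arrow.enabled true
"""

# Write the configuration to the /content directory that Colab uses for uploads.
with open("/content/td-spark.conf", "w") as f:
    f.write(conf_text)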
Run the current cell by selecting it and pressing Shift+Enter.
Create a second code cell and enter a script similar to the following:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /usr/local/lib/python2.7/dist-packages/td_pyspark/jars/td-spark-assembly.jar --properties-file /content/td-spark.conf pyspark-shell'
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

import td_pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("td-pyspark-test")
td = td_pyspark.TDSparkContextBuilder(builder).build()

df = td.table("sample_datasets.www_access").within("-10y").df()
df.show()
TDSparkContextBuilder is the entry point for accessing td_pyspark's functionality. As shown in the preceding code sample, you read tables in Treasure Data as DataFrames:
df = td.table("tablename").df()
You see the contents of the www_access table printed as a DataFrame. Your connection is working.
In Google Colab, you can run select and insert queries against Treasure Data and read data back from Treasure Data. You can also create and delete databases and tables, as sketched below.
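For example, database and table management operations are available as methods on the TD context created earlier. The following is a minimal sketch; the method names follow the td-pyspark documentation, and mydb is a placeholder database name, so verify both against your environment:

# Create a database and a table, then remove them (placeholder names).
td.create_database_if_not_exists("mydb")
td.create_table_if_not_exists("mydb.tbl1")
td.drop_table_if_exists("mydb.tbl1")
td.drop_database_if_exists("mydb")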
In Google Colab, you can use the following commands:
.df() reads your table data as a Spark DataFrame. You use the DataFrame the same way as in PySpark. See also the PySpark DataFrame documentation.
df = td.table("sample_datasets.www_access").df()
df.show()
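As in the connection script above, you can call .within() before .df() to limit the time range that is scanned. A minimal sketch, assuming the same td context and sample table:

# Read only the last day of data using a relative duration.
df_recent = td.table("sample_datasets.www_access").within("-1d").df()
df_recent.show()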
If your Spark cluster is small, reading all of the data as an in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processed with PySpark.
q = td.presto("select code, * from sample_datasets.www_access")
q.show()

q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
q.show()
To save a local DataFrame as a table, you have two options:
Insert the records in the input DataFrame to the target table
Create or replace the target table with the content of the input DataFrame
td.insert_into(df, "mydb.tbl1")
td.create_or_replace(df, "mydb.tbl2")
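For example, you can build a small DataFrame in the notebook and write it back to Treasure Data. This is a minimal sketch; mydb, tbl1, and tbl2 are placeholder names, and the target database must already exist:

from pyspark.sql import SparkSession

# Reuse the active Spark session and build a small DataFrame from in-memory rows.
spark = SparkSession.builder.getOrCreate()
sample_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Append to an existing table, or create/replace the table with this content.
td.insert_into(sample_df, "mydb.tbl1")
td.create_or_replace(sample_df, "mydb.tbl2")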
You can use TD Toolbelt to check your database from the command line. Alternatively, if you have access to TD Console, you can check your databases and queries there. Read about Database and Table Management.
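For example, with TD Toolbelt installed and authenticated, commands like the following list your databases and the tables in a given database (mydb is a placeholder):

# List the databases in your account.
td db:list

# List the tables in a specific database.
td table:list mydb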