pytd provides user-friendly interfaces to Treasure Data’s REST APIs, Presto query engine, and Plazma primary storage. The seamless connection allows your Python code to efficiently read and write a large volume of data from and to Treasure Data.


Plazma Public API limits its free tier at 100GB Read and 100TB Write. Contact your Customer Success representative at support@treasuredata.com for information about additional tiers.


Installing pytd

pip install pytd

Using pytd

Set your API key and endpoint in the environment variables TD_API_KEY and TD_API_SERVER, respectively, and create a client instance:

import pytd

client = pytd.Client(database='sample_datasets')
# or, hard-code your API key, endpoint, and/or query engine:
# >>> pytd.Client(apikey='X/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')
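If you prefer to configure the connection in code rather than in your shell, a minimal sketch (with placeholder credentials) sets the environment variables via os.environ before creating the client:

import os

import pytd

# Placeholder credentials; replace with your own API key and endpoint.
os.environ['TD_API_KEY'] = '1/XXX'
os.environ['TD_API_SERVER'] = 'https://api.treasuredata.com/'

client = pytd.Client(database='sample_datasets')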

You might also want to review sample usage on Google Colaboratory.

Querying in Treasure Data

There are three ways to query data: issuing a Presto query, issuing a Hive query, or initializing pytd.Client with Hive as its default engine.

Presto Query

client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
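Because the result is a dictionary with columns and data keys, it expands directly into a pandas DataFrame; a small illustrative sketch:

import pandas as pd

res = client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
df = pd.DataFrame(**res)  # same as pd.DataFrame(columns=res['columns'], data=res['data'])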

Hive Query

client.query('select hivemall_version()', engine='hive')
# {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)

Initialize pytd.Client

client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
client_hive.query('select hivemall_version()')

Importing Data to Treasure Data

Through its writer option, pytd supports three different ways to ingest data into Treasure Data: the Bulk Import API, a Presto INSERT INTO query, and td-spark.

Bulk Import API (bulk_import)

The bulk_import method (the default) converts the data into a CSV file and uploads it in batch.

Presto INSERT INTO query (insert_into)

This method is recommended only for a small volume of data. It inserts single rows of the DataFrame by issuing an INSERT INTO query through the Presto query engine.
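For illustration, writing a small DataFrame with this method could look like the following sketch (mydb.foo is a hypothetical table):

import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
# Issue INSERT INTO through Presto; suitable only for small data.
client.load_table_from_dataframe(df, 'mydb.foo', writer='insert_into', if_exists='append')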

td-spark (spark)

The td-spark method uses a locally customized Spark instance to write a DataFrame directly to Treasure Data's primary storage system.

Because td-spark gives special access to the primary storage system via PySpark, you must enable the Spark Writer first:

  1. Contact support@treasuredata.com to activate the permission for your Treasure Data account.

  2. Install pytd with the [spark] option: pip install pytd[spark]

If you want to use an existing td-spark JAR file, create a SparkWriter with the td_spark_path option:

from pytd.writer import SparkWriter

writer = SparkWriter(apikey='X/XXX', endpoint='https://api.treasuredata.com/', td_spark_path='/path/to/td-spark-assembly.jar')
client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')
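Alternatively, if td-spark was installed through pip install pytd[spark], passing the writer='spark' shortcut lets pytd manage the Spark instance for you; a sketch reusing the mydb.bar table above:

# pytd creates and configures the td-spark instance internally.
client.load_table_from_dataframe(df, 'mydb.bar', writer='spark', if_exists='overwrite')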

Method Comparison

                                     bulk_import    insert_into    spark
Scalable against data volume              ✓                          ✓
Write performance for larger data                                    ✓
Memory efficient                          ✓                          ✓
Disk efficient                                            ✓          ✓
Minimal package dependency                ✓               ✓

Exporting Data to Treasure Data

Data must be represented as a pandas.DataFrame. For example:

import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')

Working with Python Clients

Treasure Data offers three different Python clients on GitHub. The following list summarizes each client’s characteristics.

td-client-python

  • Basic REST API wrapper, suited to executing basic operations (e.g., CRUD) from Python applications.

pytd

  • Access to Plazma via td-spark as introduced above.

  • Efficient connection to Presto based on presto-python-client.

  • Multiple data ingestion methods and a variety of utility functions.

pandas-td (deprecated)

pytd provides a set of functions compatible with pandas-td. To replace pandas-td with pytd:

  1. Install the package from PyPI.

pip install pytd
# or, `pip install pytd[spark]` if you wish to use `to_td`

  2. Make the following changes to the import statements.

Before                                       After
import pandas_td as td                       import pytd.pandas_td as td
In [1]: %load_ext pandas_td.ipython          In [1]: %load_ext pytd.pandas_td.ipython

All pandas_td code should keep running correctly with pytd. If you notice any incompatible behavior, report an issue on the pytd GitHub repository.
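As a quick sanity check after migration, reading a table through the compatible functions could look like this sketch (credentials are taken from the environment variables set earlier):

import pytd.pandas_td as td

con = td.connect()  # uses TD_API_KEY and TD_API_SERVER
engine = td.create_engine('presto:sample_datasets', con=con)
df = td.read_td('select symbol, count(1) as cnt from nasdaq group by 1 limit 5', engine)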

Choosing a Client

The client you choose depends on your specific use case. Here are some common guidelines:

  • Use td-client-python if you want to execute basic CRUD operations from Python applications.

  • Use pytd (1) for analytical purposes relying on pandas and Jupyter Notebook, and (2) to achieve more efficient data access.

Important!

There is a known difference from the pandas_td.to_td function in type conversion. Since pytd.writer.BulkImportWriter (the default writer of pytd) uses CSV as an intermediate file before uploading a table, column types might change via pandas.read_csv. To preserve column types as much as possible, pass a fmt='msgpack' argument to the to_td function.
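For example, a minimal sketch of writing a DataFrame through the compatible to_td function while preserving column types (mydb.foo is a hypothetical table):

import pandas as pd
import pytd.pandas_td as td

con = td.connect()
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
# fmt='msgpack' uses msgpack instead of CSV as the intermediate format,
# keeping column types intact.
td.to_td(df, 'mydb.foo', con, if_exists='replace', fmt='msgpack')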
