pytd provides user-friendly interfaces to Treasure Data’s REST APIs, Presto query engine, and Plazma primary storage. The seamless connection allows your Python code to efficiently read and write a large volume of data from and to Treasure Data.
The Plazma Public API free tier is limited to 100GB of reads and 100TB of writes. Contact your Customer Success representative at email@example.com for information about additional tiers.
You might also want to review sample usage on Google Colaboratory.
Querying in Treasure Data
There are three main ways to query data: issuing a Presto query (the default), issuing a Hive query, or initializing pytd.Client with Hive as its default engine.
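As a sketch of the two engines, the helpers below assume your credentials are available via the TD_API_KEY and TD_API_SERVER environment variables and use the public sample_datasets database; adjust both for your account:

```python
def query_with_presto(sql):
    """Run a query through the default Presto engine."""
    import pytd  # third-party: pip install pytd

    # Credentials are read from TD_API_KEY / TD_API_SERVER when omitted.
    client = pytd.Client(database='sample_datasets')
    # Returns a dict with 'columns' and 'data' keys.
    return client.query(sql)


def query_with_hive(sql):
    """Run the same query through Hive by choosing it at client creation."""
    import pytd

    client = pytd.Client(database='sample_datasets', default_engine='hive')
    return client.query(sql)
```

For example, `query_with_presto('select count(1) as cnt from www_access')` would return the row count of the sample table.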
Importing Data to Treasure Data
Depending on the writer option, pytd supports three different ways to ingest data to Treasure Data: bulk import, a Presto INSERT INTO query, and td-spark.
Bulk Import API (bulk_import)
The bulk_import (default) method converts data into a CSV file and batch uploads the data.
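A minimal sketch of the bulk import path, assuming credentials come from the environment and using a placeholder database and table name ('mydb', 'mydb.mytable'):

```python
import pandas as pd


def upload_in_bulk(df, destination):
    """Batch-upload a DataFrame via the Bulk Import API."""
    import pytd  # third-party: pip install pytd

    client = pytd.Client(database='mydb')  # credentials from environment
    # writer='bulk_import' is the default; shown explicitly for clarity.
    client.load_table_from_dataframe(
        df, destination, writer='bulk_import', if_exists='overwrite'
    )
```

Calling `upload_in_bulk(pd.DataFrame({'a': [1, 2]}), 'mydb.mytable')` converts the frame to an intermediate file and uploads it in one batch.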
Presto INSERT INTO query (insert_into)
This method is recommended only for a small volume of data. It inserts the DataFrame rows one by one by issuing INSERT INTO queries through the Presto query engine.
The td-spark method uses a local, customized Spark instance that writes the DataFrame directly to Treasure Data’s primary storage system.
Since td-spark gives special access to the main storage system via PySpark, you must enable the Spark Writer.
Contact firstname.lastname@example.org to activate the permission to your Treasure Data account.
Install pytd with the [spark] option if you use the third method, td-spark:
pip install pytd[spark]
If you want to use an existing td-spark JAR file, consider creating a SparkWriter with the td_spark_path option.
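A sketch of that setup, assuming pytd was installed with the [spark] extra and that 'mydb' and the JAR path are placeholders for your own values:

```python
import pandas as pd


def write_via_spark(df, destination, jar_path):
    """Write a DataFrame through td-spark, reusing a local td-spark JAR."""
    import pytd
    from pytd.writer import SparkWriter  # requires: pip install pytd[spark]

    # Point the writer at an existing JAR instead of downloading one.
    writer = SparkWriter(td_spark_path=jar_path)
    client = pytd.Client(database='mydb')  # credentials from environment
    client.load_table_from_dataframe(
        df, destination, writer=writer, if_exists='overwrite'
    )
```

The writer instance is passed per call here; it can be reused across multiple uploads.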
When choosing among these methods, weigh the following criteria:
- Scalability against data volume
- Write performance for larger data
- Minimal package dependency
Exporting Data to Treasure Data
Data must be represented as pandas.DataFrame. For example:
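A minimal example, using a made-up two-column frame and a placeholder destination table ('mydb.stock_prices'):

```python
import pandas as pd

# Any tabular data to upload must first be a pandas DataFrame.
df = pd.DataFrame({
    'symbol': ['AAPL', 'MSFT'],
    'price': [150.0, 300.0],
})


def upload(frame):
    """Send the DataFrame to Treasure Data."""
    import pytd  # third-party: pip install pytd

    client = pytd.Client(database='mydb')  # credentials from environment
    client.load_table_from_dataframe(
        frame, 'mydb.stock_prices', if_exists='overwrite'
    )
```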
Working with Python Clients
Treasure Data offers three different Python clients on GitHub. The following list summarizes each client’s characteristics.

td-client-python
- A basic REST API wrapper.
- Its capability is limited by what the Treasure Data REST API can do.

pytd
- Access to Plazma via td-spark, as introduced above.
- Efficient connection to Presto based on presto-python-client.
- Multiple data ingestion methods and a variety of utility functions.

pandas-td (deprecated)
- Superseded by pytd’s pandas-td compatible functions.
To replace pandas-td with pytd’s pandas-td compatible functions:
1. Install the package from PyPI.
2. Make the following changes to the import statements.
Existing pandas_td code should keep running correctly with pytd. Report an issue if you notice any incompatible behavior.
Choosing a Client
The client you choose depends on your specific use case. Here are some common guidelines:
Use td-client-python if you want to execute basic CRUD operations from Python applications.
Use pytd (1) for analytical purposes relying on pandas and Jupyter Notebook, and (2) for more efficient data access.
There is a known difference from the pandas_td.to_td function in type conversion. Since pytd.writer.BulkImportWriter (the default writer in pytd) uses CSV as an intermediate file before uploading a table, column types might change via pandas.read_csv. To preserve the column types as much as possible, pass a fmt="msgpack" argument to the to_td function.
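A sketch of that workaround through the compatibility API, assuming credentials come from the environment and using a placeholder table name:

```python
import pandas as pd


def upload_preserving_types(df, table):
    """Upload via the pandas-td compatible API, keeping column dtypes."""
    import pytd.pandas_td as td  # third-party: pip install pytd

    con = td.connect()  # reads apikey/endpoint from environment
    # fmt='msgpack' skips the intermediate CSV, so dtypes survive the round trip.
    td.to_td(df, table, con, if_exists='replace', fmt='msgpack')
```

For example, `upload_preserving_types(df, 'mydb.mytable')` would keep an int64 column from being re-inferred as float by pandas.read_csv.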