pytd provides user-friendly interfaces to Treasure Data’s REST APIs, Presto query engine, and Plazma primary storage. The seamless connection allows your Python code to efficiently read and write a large volume of data from and to Treasure Data.
Plazma Public API limits its free tier at 100GB Read and 100TB Write. Contact your Customer Success representative at support@treasuredata.com for information about additional tiers.
Installing pytd
```
pip install pytd
```
Using pytd
Set your API key and endpoint to the environment variables TD_API_KEY and TD_API_SERVER, respectively, and create a client instance:
```python
import pytd

client = pytd.Client(database='sample_datasets')
# or, hard-code your API key, endpoint, and/or query engine:
# >>> pytd.Client(apikey='X/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')
```
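If you prefer not to export the variables in your shell, one option is to set them from Python before creating the client. A minimal sketch, using placeholder credentials:

```python
import os

# Placeholder values; in practice, export TD_API_KEY and TD_API_SERVER
# in your shell or load them from a secret manager instead of hard-coding.
os.environ['TD_API_KEY'] = 'X/XXX'
os.environ['TD_API_SERVER'] = 'https://api.treasuredata.com/'
```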
You might also want to review sample usage on Google Colaboratory.
Querying in Treasure Data
There are three main ways to query data: running a Presto query, running a Hive query, or initializing pytd.Client with Hive as the default engine.
Presto Query
```python
client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
```
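Because query() returns a dictionary with 'columns' and 'data' keys, as shown above, you can turn the result into a pandas DataFrame. A minimal sketch, reusing the client created earlier:

```python
import pandas as pd

res = client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# Build a DataFrame from the returned columns and rows
df = pd.DataFrame(res['data'], columns=res['columns'])
print(df.head())
```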
Hive Query
```python
client.query('select hivemall_version()', engine='hive')
# {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb 2019)
```
Initialize pytd.Client
```python
client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
client_hive.query('select hivemall_version()')
```
Importing Data to Treasure Data
For the writer option, pytd supports three different ways to ingest data to Treasure Data: bulk import, Presto INSERT INTO query, and td-spark.
Bulk Import API (bulk_import)
The bulk_import method (the default) converts the data into a CSV file and uploads it as a batch.
Presto INSERT INTO query (insert_into)
This method is recommended for small amounts of data only. It inserts the DataFrame rows one by one by issuing INSERT INTO queries through the Presto query engine.
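For instance, writing a small DataFrame with this writer might look like the following sketch (mydb.small_table is a hypothetical table name; client is the instance created earlier):

```python
import pandas as pd

small_df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 10]})

# insert_into issues Presto INSERT INTO statements row by row, so keep the data small
client.load_table_from_dataframe(small_df, 'mydb.small_table', writer='insert_into', if_exists='overwrite')
```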
td-spark (spark)
The spark method writes the DataFrame directly to Treasure Data’s primary storage system through a locally customized Spark instance (td-spark).
Since td-spark gives special access to the main storage system via PySpark, you must enable the Spark Writer.
Contact support@treasuredata.com to have this permission activated for your Treasure Data account.
Install pytd with the [spark] option if you use this method:

```
pip install pytd[spark]
```
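Once the extra is installed, a DataFrame can be written through td-spark by passing writer='spark'. A minimal sketch, reusing the client created earlier and the mydb.bar table name from the example below:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 10]})

# writer='spark' sends the DataFrame through td-spark to the primary storage
client.load_table_from_dataframe(df, 'mydb.bar', writer='spark', if_exists='overwrite')
```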
If you want to use an existing td-spark JAR file, consider creating a SparkWriter with the td_spark_path option.
```python
from pytd.writer import SparkWriter

writer = SparkWriter(apikey='X/XXX', endpoint='https://api.treasuredata.com/', td_spark_path='/path/to/td-spark-assembly.jar')
client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')
```
Method Comparison
| | bulk_import | insert_into | spark |
|---|:---:|:---:|:---:|
| Scalable against data volume | ✓ | | ✓ |
| Write performance for larger data | | | ✓ |
| Memory efficient | ✓ | | ✓ |
| Disk efficient | | ✓ | |
| Minimal package dependency | ✓ | ✓ | |
Exporting Data to Treasure Data
Data must be represented as a pandas.DataFrame. For example:
```python
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')
```
Working with Python Clients
Treasure Data offers three different Python clients on GitHub. The following list summarizes each client’s characteristics.
td-client-python
Basic REST API wrapper.
Similar functionalities to td-client-{ruby, java, node, go}.
The capability is limited by what the Treasure Data REST API can do.
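As an illustration, a basic query with td-client-python (the tdclient package) might look like the following sketch; this follows tdclient's documented usage and is not a pytd API:

```python
import tdclient

# tdclient reads the API key from the TD_API_KEY environment variable,
# or you can pass apikey='X/XXX' explicitly.
with tdclient.Client() as td:
    job = td.query('sample_datasets', 'select count(1) from www_access', type='presto')
    job.wait()
    for row in job.result():
        print(row)
```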
pytd
Access to Plazma via td-spark as introduced above.
Efficient connection to Presto based on presto-python-client.
Multiple data ingestion methods and a variety of utility functions.
pandas-td (deprecated)
Old tool optimized for pandas and Jupyter Notebook.
pytd offers a compatible function set.
To replace pandas-td with the pytd pandas-td compatible functions:
1. Install the package from PyPI.
```
pip install pytd
# or, `pip install pytd[spark]` if you wish to use `to_td`
```
2. Make the following changes to the import statements.
| Before | After |
|---|---|
| `import pandas_td as td` | `import pytd.pandas_td as td` |
All pandas_td code should keep running correctly with pytd. Report an issue on the pytd GitHub repository if you notice any incompatible behaviors.
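After the import change, existing pandas-td style code should keep working. A minimal sketch, assuming the compatible create_engine and read_td functions and the TD_API_KEY / TD_API_SERVER environment variables:

```python
import pytd.pandas_td as td

# Create a Presto engine bound to the sample_datasets database
engine = td.create_engine('presto:sample_datasets')

# Read query results directly into a pandas DataFrame
df = td.read_td('select symbol, count(1) as cnt from nasdaq group by 1', engine)
```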
Choosing a Client
The client you choose depends on your specific use case. Here are some common guidelines:
Use td-client-python if you want to execute basic CRUD operations from Python applications.
Use pytd for (1) analytical purposes relying on pandas and Jupyter Notebook, and (2) more efficient data access.
Important!
There is a known difference from the pandas_td.to_td function in type conversion. Since pytd.writer.BulkImportWriter (the default writer in pytd) uses CSV as an intermediate file before uploading a table, column types might change via pandas.read_csv. To preserve the column types as much as possible, pass a fmt='msgpack' argument to the to_td function.
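For example, a sketch assuming the pandas-td compatible connect and to_td functions (mydb.my_table is a hypothetical table name):

```python
import pandas as pd
import pytd.pandas_td as td

con = td.connect()  # reads TD_API_KEY / TD_API_SERVER from the environment
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 10]})

# fmt='msgpack' skips the CSV intermediate file so column types are preserved
td.to_td(df, 'mydb.my_table', con, fmt='msgpack')
```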
Additional Resources
User Guides | API References | Changelog | Development