# pytd Quickstart

`pytd` provides user-friendly interfaces to Treasure Data’s [REST APIs](https://docs.treasuredata.com/tools/cli-and-sdks/td-toolbelt), [Trino query engine](https://docs.treasuredata.com/products/customer-data-platform/data-workbench/queries/trino/quickstart), and [Plazma primary storage](https://www.slideshare.net/treasure-data/td-techplazma). The seamless connection allows your Python code to efficiently read and write a large volume of data to and from Treasure Data.

We recommend using iPython with this library for the best experience during development.

Info
Treasure Data is no longer accepting new users for the Plazma Public API.

# Setup

## Installing


```shell
pip install pytd
```

## Initializing Client

Set your [TD API key](/products/my-settings/getting-your-api-keys) and [endpoint](/apis/endpoints/endpoints) to the environment variables, `TD_API_KEY` and `TD_API_SERVER`, respectively, and create a client instance.

You can also optionally set the query engine to use for all queries. The default query engine is Trino.


```python
import pytd

client = pytd.Client(database='sample_datasets')

# or, hard-code your API key, endpoint, and/or query engine:
pytd.Client(apikey='X/XXX',
            endpoint='https://api.treasuredata.com/',
            database='sample_datasets',
            default_engine='presto')
```

Info
Note that the default engine is set to `presto`. To use Hive you can either

1) Set the `default_engine='hive'` for all queries at initialization.
2) Pass the `engine='hive'` parameter at run time.


# Basic Use

## Running a Query

All queries in Treasure Data either run on Trino or Hive. You can specify the query engine at client initialization or per query at run time.

The default engine is Trino.


```python
client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
```

When you want to run Hive query, you should pass `hive` to engine option.


```python
client.query('select hivemall_version()', engine='hive')
# {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)
```

## Loading a DataFrame from Treasure Data

When pulling data from Treasure Data all data must be represented as `pandas.DataFrame`. To do this we will use functions from the `pandas_td` sub library.

### Load Table

To read a Treasure Data table into a pandas DataFrame use the pandas [`read_td_table` function](https://pytd-doc.readthedocs.io/en/latest/pandas_td.html?highlight=read_td_table#pytd.pandas_td.read_td_table).

Note that `engine` is a Class returned by the [`create_engine` function](https://pytd-doc.readthedocs.io/en/latest/pandas_td.html#pytd.pandas_td.create_engine).


```python
import pytd.pandas_td as td

# Assumes TD_API_KEY and TD_API_SERVER env variables are set
engine = td.create_engine("presto:my_db")

df = td.read_td_table('table_name', engine, limit=10000)
```


```python
import pytd.pandas_td as td

# Declare API Key and Endpoint manually
con = td.connect(apikey='XXX', endpoint="https://api.treasuredata.com")
engine = td.create_engine("presto:my_db", con=con)

df = td.read_td_table('table_name', engine, limit=10000)
```

### Load from Query

To run a query and load the output into a dataframe use the [`read_td_query` function](https://pytd-doc.readthedocs.io/en/latest/pandas_td.html?highlight=read_td_query#pytd.pandas_td.read_td_query).


```python
query='SELECT foo FROM bar'
df = td.read_td_query(query,
                      engine,
                      index_col=None,
                      parse_dates=None,
                      distributed_join=False,
                      params=None)
```

`read_td_query` takes the following parameters:

- query (`String`) — SQL string to be executed
- engine — For example, `presto_engine`
- index_col
- parse_dates (`Array`, `none`) — For Array, column names must be given to parse as dates.
- distributed_join (`true`, `false`) — (Trino only) If true, distributed join is enabled. If false (default), broadcast join is used.
- Params


The `priority` parameter can be set by using syntax similar to the following:


```
df_2 = td.read_td('SELECT foo FROM bar',
                  engine,
                  params={'priority':1})
# Note priority is ordered highest 2,1,0,-1,-2 lowest
```

## Writing DataFrame to Treasure Data

To write data to Treasure Data you simply use the [`load_table_from_dataframe` function](https://pytd-doc.readthedocs.io/en/latest/generated/pytd.Client.html?highlight=load_table_from_dataframe#pytd.Client.load_table_from_dataframe) to write a locally defined pandas dataframe to a table.


```
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})

client.load_table_from_dataframe(df,
                                'table_name',
                                writer='bulk_import',
                                if_exists='overwrite')
```

**`writer` parameter**

For the `writer` parameter, pytd supports different ways to ingest data to Treasure Data:

1. `bulk_import` - Convert data into a CSV file and upload in the batch fashion.
2. `insert_into` - Insert every single row in `DataFrame` by issuing an `INSERT INTO` query through the Trino query engine. Recommended only for a small volume of data.


Characteristics of each of these methods can be summarized as follows:

|  | bulk_import | insert_into |
|  --- | --- | --- |
| Scalable against data volume | ✓ |  |
| Write performance for larger data |  |  |
| Memory efficient | ✓ | ✓ |
| Disk efficient |  | ✓ |
| Minimal package dependency | ✓ | ✓ |


**`if_exists` parameter**

For the `if_exists` parameter, pytd supports 4 different behavior patterns.

- `error`: raise an exception
- `overwrite`: drop it, recreate it, and insert data
- `append`: insert data (create if does not exist)
- `ignore`: do nothing


# Advanced Use

For more advanced examples see the [sample code on Google Colaboratory](https://colab.research.google.com/drive/1ps_ChU-H2FvkeNlj1e1fcOebCt4ryN11).

# Further Reading

- [Choosing between PyTD, td-client-python, and Pandas-TD](/tools/pytd/pytd-vs-td-client-python-vs-pandas-td) - Different libraries for different needs.
- [PyTD API Reference](https://pytd-doc.readthedocs.io/en/latest/reference.html)