# Pandas and Jupyter Configuration for Treasure Data

Treasure Data provides a cloud-based analytics infrastructure accessible via SQL. Interactive engines like [Presto](https://api-docs.treasuredata.com/en/tools/presto/quickstart/) enable you to crunch billions of records easily. However, writing SQL queries is sometimes painful for data scientists, and you would still need external tools like Excel or Tableau to visualize the results. Instead, you can use Treasure Data with the Python-based data analysis library [Pandas](http://pandas.pydata.org/) and visualize the data interactively via [Jupyter Notebook](https://jupyter.org/).

![](/assets/image-20190925-190645.bf0be2045c23219c01095f0673eacf916240b2b202f3bee7f426bd2d7a0d99c9.60bcc915.png)

* [Prerequisites](#prerequisites)
* [Set Treasure Data API Key](#set-treasure-data-api-key)
* [Set Treasure Data API Endpoint](#set-treasure-data-api-endpoint)
* [Install the Necessary Packages and Configure your Environment](#install-the-necessary-packages-and-configure-your-environment)
* [Run Jupyter and Create your First Notebook](#run-jupyter-and-create-your-first-notebook)
* [Explore Data](#explore-data)
* [Running a Query in Jupyter](#running-a-query-in-jupyter)
* [Sample Data](#sample-data)

## Prerequisites

* Basic knowledge of Python.
* Basic knowledge of Treasure Data.

## Set Treasure Data API Key

Set your master API key as an environment variable before launching Jupyter. You can retrieve the master API key from your profile in the TD Console.

```bash
$ export TD_API_KEY="1234/abcde..."
```

Alternatively, you can set the environment variable with a command such as the following in a Jupyter Notebook cell:

```cmd
%env TD_API_KEY = "123c/abcdefghjk..."
```

## Set Treasure Data API Endpoint

Set the Treasure Data API endpoint as an environment variable if your account does not belong to the US region.
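When you later connect from Python, you can resolve both variables together, defaulting the endpoint to the US region. Below is a minimal sketch; the helper function is illustrative and not part of pytd:

```python
import os

# The US region is the default endpoint; other regions (for example,
# https://api.treasuredata.co.jp) must be set explicitly via TD_API_SERVER.
US_ENDPOINT = "https://api.treasuredata.com"

def td_connection_settings():
    """Resolve the API key and endpoint from environment variables.

    Raises a descriptive error instead of a bare KeyError when the
    API key is missing.
    """
    apikey = os.environ.get("TD_API_KEY")
    if not apikey:
        raise RuntimeError(
            "TD_API_KEY is not set; export it before launching Jupyter"
        )
    endpoint = os.environ.get("TD_API_SERVER", US_ENDPOINT)
    return apikey, endpoint
```

Passing the resolved values into your connection call keeps the same notebook portable across regions.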
You can find endpoint information [here](https://api-docs.treasuredata.com/en/overview/aboutendpoints/).

```bash
$ export TD_API_SERVER="https://api.treasuredata.co.jp"
```

Alternatively, you can set the environment variable with a command such as the following in a Jupyter Notebook cell:

```cmd
%env TD_API_SERVER = "https://api.treasuredata.co.jp"
```

## Install the Necessary Packages and Configure your Environment

For instructions on installing Conda, Pandas, matplotlib, Jupyter Notebook, and pytd, see [Installing](/tools/cli-and-sdks/installing-python-pandas-matplotlib-jupyter-notebook-pytd).

## Run Jupyter and Create your First Notebook

We'll use Jupyter as the frontend for our analysis project.

1. Start the Notebook server:

   ```bash
   (analysis)$ jupyter notebook
   ```

2. Your web browser opens.

3. Select **New** > **Python 3**.

4. Copy and paste the following text into your notebook:

   ```python
   %matplotlib inline

   import os
   import pandas as pd
   import pytd.pandas_td as td

   # Initialize the connection to Treasure Data
   con = td.connect(apikey=os.environ['TD_API_KEY'],
                    endpoint='https://api.treasuredata.com')
   ```

   ![](/assets/image-20190925-190918.5bf9b8848977614d4845d71e9ff6ccf5b502fde7ab1270be7a6ef7c962f9a063.60bcc915.png)

5. Your notebook should now look similar to:

   ![](/assets/jupyter_pytd.c691f29202d2b11bce5d2b9dd9f8a9dd71b237a44272834e5a30d7c7707ac0b2.60bcc915.png)

6. Press Shift-Enter. If you get a "KeyError: 'TD_API_KEY'" error, try `apikey='your master apikey'` instead of `apikey=os.environ['TD_API_KEY']`. If that works, Jupyter didn't inherit the `TD_API_KEY` variable from the OS; confirm that `TD_API_KEY` is exported and re-launch Jupyter.

7. Optionally, save your notebook.

## Explore Data

There are two tables in `sample_datasets`. You can use the magic command `td_tables` to view all the tables in your database.

![](/assets/image-20200421-215102.77c3d435e3eabf4495b11e93741de1a1e7986434701583e6e6ad1640c6185e11.60bcc915.png)

Let's explore the `nasdaq` table.
In Jupyter, type the following syntax:

```python
import pytd
import pytd.pandas_td as td

engine = td.create_engine("presto:sample_datasets")
client = pytd.Client(database='sample_datasets')  # reads TD_API_KEY from the environment
client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
```

For example:

![](/assets/image-20200418-223403.5f2a746f8f0485079a18bc0eb7633bbc08f8e4faf276f2eb35d5e94fb17824e8.60bcc915.png)

## Running a Query in Jupyter

For the purposes of this example, Presto is used as the query engine. In Jupyter, type the following syntax:

```python
import os
import pytd.pandas_td as td

con = td.connect(apikey=os.environ['TD_API_KEY'],
                 endpoint="https://api.treasuredata.com")
engine = td.create_engine("presto:sample_datasets", con=con)

query = "select symbol, count(1) as cnt from nasdaq group by 1 order by 1"
df = td.read_td_query(query, engine, index_col=None, parse_dates=None,
                      distributed_join=False, params=None)
```

You can also use the `time_range` parameter to retrieve data within a specific time range:

![](/assets/image-20190925-191519.1d48dd56769110424b6dd90d31037b2a5ac0bb5e1a85fa4ebed0e7fc2fdf4fe8.60bcc915.png)

Your data is stored in the local variable `df` as a DataFrame. Because the data is located in the local memory of your computer, you can analyze it interactively using the power of Pandas and Jupyter. See [Time Series / Date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) for the details of handling time-series data.

## Sample Data

As your data set grows very large, the method from the previous step doesn't scale well. We don't recommend retrieving more than a few million rows at a time, because of memory limitations and slow network transfer. If you're analyzing a large amount of data, you need to limit the amount of data that gets transferred. There are two ways to do this:

* You can sample the data. For example, the `nasdaq` table has 8,807,278 rows.
  Setting a limit of 100000 retrieves 100,000 rows, which is a reasonable size:

  ![](/assets/image-20190925-191603.2c0d4d1eb14b4e64f49a6c2d7405f83d6f7fd3f86e0694478061590cb60f167b.60bcc915.png)

* You can write SQL to limit the data on the server side. For example, because we are interested only in data related to "AAPL", let's count the matching records using `read_td_query`:

  ![](/assets/image-20190925-191649.7c74860a4312c5f17c87b0b4eb22890b8c92e0a96723b5b48d7786c8f409775a.60bcc915.png)

  The result is small enough, so we can retrieve all the rows and start analyzing the data:

  ![](/assets/image-20190925-191754.2d7f06efaab93a3ebf0d9f947587125e7857504afc88bcde007e8303c9a9105d.60bcc915.png)

See the following for further information:

* [Python for Data Analysis (Book by O'Reilly Media)](http://shop.oreilly.com/product/0636920023784.do)

Jupyter Notebooks are supported by GitHub, so you can share the results of your analysis session with your team:

* [GitHub + Jupyter Notebooks = <3](https://github.com/blog/1995-github-jupyter-notebooks-3)
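As a closing example, the local-analysis step described above can be sketched with a small fabricated DataFrame standing in for a real query result; the rows below are invented for illustration only:

```python
import pandas as pd

# Fabricated stand-in for the result of a server-side filtered query,
# e.g. select symbol, close from nasdaq where symbol = 'AAPL'
df = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "AAPL", "GOOG"],
    "close": [100.0, 102.5, 101.0, 550.0],
})

# Once the data is in local memory, ordinary Pandas operations apply
aapl = df[df["symbol"] == "AAPL"]
mean_close = aapl["close"].mean()
```

The same filter-then-aggregate pattern works unchanged on a DataFrame returned by `read_td_query`; only the data source differs.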