Interactive Analysis with Jupyter + Pandas + TD

Treasure Data provides a cloud-based analytics infrastructure accessible via SQL. Interactive engines like Presto enable you to crunch billions of records easily. However, writing an SQL query is sometimes painful for data scientists, and you’ll still need to use external tools like Excel or Tableau to visualize the result.

This article shows you how to access Treasure Data from the popular Python-based data analysis tool called Pandas, and visualize the data interactively via Jupyter (iPython Notebook)

Table of Contents


  • Basic knowledge of Python.
  • Basic knowledge of Treasure Data.

Step 0: Set Treasure Data API Key

First, set your master api key as an environment variable before launch jupyter. The master API KEY can be retrieved from the Console’s profile page.

$ export TD_API_KEY="1234/abcde..."
TD_API_KEY has to be set before launch jupyter because jupyter doesn't care variable which is set after launch.

Step 1: Install Jupyter, Pandas, and Pandas-TD

We recommend that you use Miniconda to install all required packages for Pandas. Download an installer for your OS and install it. Let’s create a virtual environment for our first project “analysis”. We’ll use Python 3 for this project:

$ conda create -n analysis python=3
$ source activate analysis
discarding .../miniconda/bin from PATH
prepending .../miniconda/envs/analysis/bin to PATH

We need “pandas”, “matplotlib” and “ipython-notebook”:

(analysis)$ conda install pandas
(analysis)$ conda install matplotlib
(analysis)$ conda install ipython-notebook

You can use “pip” for general Python packages. We need “pandas-td”:

(analysis)$ pip install pandas-td

Step 2: Run Jupyter and Create your First Notebook

We’ll use “Jupyter” (formerly known as “IPython notebook”) as a frontend for our analysis project. Run “ipython notebook” and your web browser will open:

(analysis)$ ipython notebook

Let’s create a new notebook by “New > Python 3”. Copy & paste the following text to your notebook and type “Shift-Enter”:

%matplotlib inline

import os
import pandas as pd
import pandas_td as td

# Initialize the connection to Treasure Data
con = td.connect(apikey=os.environ['TD_API_KEY'], endpoint='')

Your notebook now looks something like this:

If you get "KeyError: 'TD_API_KEY'" error, please try "apikey='your apikey'" instead of "apikey=os.environ['TD_API_KEY']". If it works, jupyter didn't recognize TD_API_KEY variable from OS, so please confirm TD_API_KEY again and re-launch jupyter.

Step 3: Explore Data

We have two tables in sample_datasets. Let’s explore the nasdaq table as an example.

We’ll use presto as a query engine for this session. To see how the table looks, you can retrieve a few lines by read_td_table:

You can also use the time_range parameter to retrieve data within a specific time range:

Now, your data is stored in the local variable df as a DataFrame. Since the data is located in the local memory of your computer, you can analyze it interactively using the power of Pandas and Jupyter. See Time Series / Date functionality for the details of time-series data.

Step 4: Sample Data

As your data set grows very large, the method from the previous step doesn’t actually scale very well. It isn’t actually a very good idea to retrieve more than a few million rows at a time due to memory limitations or slow network transfer. If you’re analyzing a large amount of data, you need to limit the amount of data getting transferred.

There are two ways to do this:

First, you can sample data. We know, for example, that the “nasdaq” table has 8,807,278 rows (at presstime). Sampling 1% of this results in ~88,000 rows, which is a reasonable size to retrieve:

Another way is to write SQL and limit data from the server side. For example, as we are interested only in data related to “AAPL”, let’s count the number of records, using read_td_query:

It’s small enough, so we can retrieve all the rows and start analyzing data:

See Also

See the contents below for further information.

Jupyter Notebooks are now supported by GitHub and you can share the result of your analysis session with your team:

Last modified: Jun 15 2016 08:33:24 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.