Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Treasure Data provides a cloud-based analytics infrastructure accessible via SQL. Interactive engines like Presto enable you to crunch billions of records easily. However, writing a SQL query is sometimes painful for data scientists, and you’ll still need to use external tools like Excel or Tableau to visualize the result. You can use Treasure Data with the Python-based data analysis tool called Pandas and visualize the data interactively via Jupyter Notebook.

Image Modified


Table of Contents
maxLevel1

...

  1. Run Notebook using the following syntax:

    Code Block
    linenumberstrue
    (analysis)$ ipython notebook


  2. Your web browser will open:

  3. Select New > Python 3.

  4. Copy and paste the following text into your notebook:

    Code Block
    linenumberstrue
    %matplotlib inline
    
    import os
    import pandas as pd
    import pytd.pandas_td as td
    # Initialize the connection to Treasure Data
    
    con = td.connect(apikey=os.environ['TD_API_KEY'], endpoint='https://api.treasuredata.com')

    Image Modified

  5. Your notebook should now look similar to

    Image Modified
  6. Type Shift-Enter.
    If you get "KeyError: 'TD_API_KEY'" error, try "apikey='<your master apikey>'" instead of "apikey=os.environ['TD_API_KEY']".
    If it works, Jupyter didn't recognize the TD_API_KEY variable from the OS.
    Confirm the TD_API_KEY again and re-launch Jupyter.

  7. Optionally, save your notebook.

...

There are two tables in sample_datasets. You can use the magic command td_tables to view all the tables in your database.

Image Modified

Let’s explore the nasdaq table.

...

Code Block
linenumberstrue
engine = td.create_engine("presto:sample_datasets")
client =td.Client(database='sample_datasets')
client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')

For example:

Image Modified

Running a Query in Jupyter

...

You can also use the time_range parameter to retrieve data within a specific time range:

Image Modified

Your data is stored in the local variable df as a DataFrame. Because the data is located in the local memory of your computer, you can analyze it interactively using the power of Pandas and Jupyter. See Time Series / Date functionality for the details of time-series data.

...

  • You can sample data. For example, the “nasdaq” table has 8,807,278 rows. Setting a limit of 100000 results in 100,000 rows, which is a reasonable size to retrieve:

    Image Modified


  • Write SQL and limit data from the server side. For example, as we are interested only in data related to “AAPL”, let’s count the number of records, using read_td_query:

    Image Modified

    It’s small enough, so we can retrieve all the rows and start analyzing data:

    Image Modified


See the contents below for further information.

...