You can import Tweets, Retweets, Follower IDs, Follower List, or User Timeline from Twitter into Treasure Data.

Prerequisites

Establish a Twitter Dev Environment for Treasure Data

In Twitter, you specify the apps used to import into Treasure Data.

The Twitter-to-Treasure Data authentication flow generally is as follows:

  • Treasure Data makes a request to the POST oauth2/token endpoint to exchange credentials for a bearer token.

  • When accessing the REST API, Treasure Data uses the bearer token to authenticate.
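
The two steps above can be sketched in Python. This is only an illustration of the general application-only flow, not the connector's actual implementation. The endpoint URL and header layout are assumptions taken from Twitter's public application-only authentication scheme, the function names are hypothetical, and the request is built but never sent.

```python
import base64
import urllib.parse
import urllib.request

# Assumption: Twitter's public application-only auth endpoint.
TOKEN_URL = "https://api.twitter.com/oauth2/token"


def build_token_request(consumer_key: str, consumer_secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST oauth2/token request."""
    # URL-encode each credential, join with ":", then Base64-encode the pair.
    pair = "{}:{}".format(
        urllib.parse.quote(consumer_key, safe=""),
        urllib.parse.quote(consumer_secret, safe=""),
    )
    basic = base64.b64encode(pair.encode("ascii")).decode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=b"grant_type=client_credentials",
        headers={
            "Authorization": "Basic " + basic,
            "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        },
        method="POST",
    )


def bearer_header(access_token: str) -> dict:
    """Header that later REST API calls would carry."""
    return {"Authorization": "Bearer " + access_token}


request = build_token_request("my-key", "my-secret")  # placeholder credentials
```

The consumer key and consumer secret are the values retrieved from the Twitter app in the next step.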

In Twitter

Log in to your Twitter account. Go to Apps to retrieve your consumer key and consumer secret.


Go back to the dashboard. Select Account > Dev Environments.


Select the Search Tweets 30-Days label and the Search Tweets Full-Archive label. These labels specify the Twitter search endpoint APIs used to import data. For more information, see the Twitter documentation about 30-day and Full-Archive searches.



Use the TD Console to create your connection

Create a new connection

Go to Treasure Data Connections, then search for and select Tweet Insights.


Select Create to create an authenticated connection.

The following dialog opens.


Enter the consumer key and consumer secret that you retrieved from the Twitter app. Indicate whether you have a Twitter paid Premium account.

Select Continue.


Name your new Twitter Tweet Insights Connection. Select Done.

Transfer your Twitter Tweet Insights data to Treasure Data

After creating the connection, you are automatically taken to the Authentications tab. Look for the connection you created and select New Source.

Specify the data that you want to import.

Import Tweets from handle


Parameters:

  • Data Type: Tweets (default) or Account

  • Object to Import: If the data type is Tweets: All Tweets or Retweets. If the data type is Account: User Timeline, Follower IDs, or Follower List

  • 30-day dev environment: The Twitter dev environment API label for a search of Tweets or Retweets from the last 30 days. Not applicable for the Account data type.

  • Full archive dev environment: dev environment API label for a full search of archived Tweets or Retweets. Not applicable for the Account data type.

  • From this screen name or handle: Required. The Twitter account's numeric account ID or username. This value specifies the account whose data is queried.

  • Include videos: Only import Tweets that include a video link.

  • From Date: Import Tweets created from this time. Time is set in UTC.

  • To Date: Import Tweets created until this time. Time is set in UTC.

  • Incremental: When importing based on a schedule, the time window of the fetched data automatically shifts forward on each run. For example, if the initial config is January 1, with ten days in duration, the first run fetches data modified from January 1 to January 10, the second run fetches from January 11 to January 20, and so on.
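
The shifting time window described for Incremental can be sketched as follows. This is a minimal illustration that assumes whole-day, inclusive windows as in the January example. The function name is hypothetical, the year 2019 is arbitrary, and the connector's real boundary handling may differ.

```python
from datetime import date, timedelta


def incremental_windows(first_day: date, days: int, runs: int):
    """Yield (from_day, to_day) for each scheduled run, both days inclusive."""
    start = first_day
    for _ in range(runs):
        end = start + timedelta(days=days - 1)
        yield start, end
        start = end + timedelta(days=1)  # the next run starts the day after


windows = list(incremental_windows(date(2019, 1, 1), days=10, runs=3))
for from_day, to_day in windows:
    print(from_day.isoformat(), "to", to_day.isoformat())
```

With a ten-day duration, the first two windows match the example: January 1 to January 10, then January 11 to January 20.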

Import Retweets from handle


Parameters:

  • Retweets of this screen name/handle: username of Twitter Account

  • Include videos: Only import Retweets that have video.

  • From Date: Import Retweets created from this time. Time is set in UTC.

  • To Date: Import Retweets created until this time. Time is set in UTC.

  • Incremental: When importing based on a schedule, the time window of the fetched data automatically shifts forward on each run. For example, if the initial config is January 1, with ten days in duration, the first run fetches data modified from January 1 to January 10, the second run fetches from January 11 to January 20, and so on.

Import User Timeline of handle


Parameters:

  • From this screen name/handle: username of Twitter Account

  • Incremental: When importing based on a schedule, the max ID of the fetched Tweets automatically shifts forward on each run. For example, if the initial max ID is 1 and the first run returns Tweets with a max ID of 100, the second run fetches from ID 100 and then stores the max ID returned by that run, and so on.
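
The max ID bookkeeping can be sketched as follows. This is a simplified model, not the connector's code: the in-memory timeline stands in for the Twitter API, the function name is hypothetical, and it assumes that a run fetches only Tweets with an ID strictly greater than the stored max ID.

```python
def run_once(fetch_timeline, last_max_id: int):
    """One scheduled run: keep Tweets newer than last_max_id and return the new max ID."""
    tweets = [t for t in fetch_timeline() if t["id"] > last_max_id]
    # If nothing new arrived, the stored max ID stays unchanged.
    new_max_id = max((t["id"] for t in tweets), default=last_max_id)
    return tweets, new_max_id


# Hypothetical in-memory timeline standing in for the Twitter API.
timeline = [{"id": i} for i in (40, 75, 100)]

first, max_id = run_once(lambda: timeline, last_max_id=1)
timeline.append({"id": 130})  # a new Tweet arrives between runs
second, max_id_2 = run_once(lambda: timeline, last_max_id=max_id)
third, max_id_3 = run_once(lambda: timeline, last_max_id=max_id_2)
```

After the first run the stored max ID is 100, so the second run picks up only the Tweet that arrived afterward.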

Import Follower IDs of handle


Parameters:

  • From this screen name/handle: username of Twitter Account

Import Follower List of handle


Parameters:

  • From this screen name/handle: username of Twitter Account

After completing your configuration, select Next.


Preview

You’ll see a preview of your data. To make changes, select Advanced Settings; otherwise, select Next.


Advanced Settings


You can specify the following parameters:

  • Maximum retry times. Specifies the maximum retry times for each API call.

      Type: number
      Default: 7
    
  • Initial retry interval milliseconds. Specifies the wait time for the first retry.

      Type: number
      Default: 1000
    
  • Maximum retry interval milliseconds. Specifies the maximum time between retries.

      Type: number
      Default: 120000
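
How the three retry settings might interact can be sketched as follows. The doubling of the wait between retries is an assumption (a common exponential backoff scheme). This page states only the initial interval, the maximum interval, and the maximum number of retries, so the connector's actual schedule may differ. The function name is hypothetical.

```python
def retry_waits_ms(max_retries: int = 7, initial_ms: int = 1000, max_ms: int = 120000):
    """Return the wait before each retry, doubling each time and capped at max_ms."""
    waits = []
    wait = initial_ms
    for _ in range(max_retries):
        waits.append(min(wait, max_ms))
        wait *= 2  # assumption: exponential backoff
    return waits


default_waits = retry_waits_ms()
print(default_waits)
```

Under this assumption, the defaults never reach the 120000 ms cap; the cap only matters when the maximum retry count is raised.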
    

Choose the target database and table

Choose existing ones or create a new database and table.


Create a new database and give your database a name. Complete similar steps for Create new table.

Select whether to append records to an existing table or replace your existing table.

If you want to set a different partition key seed rather than use the default key, you can specify one using the popup menu.

Scheduling

In the When tab, you can specify a one-time transfer, or schedule an automated recurring transfer.

Parameters

  • Once now: Run a one-time job.

  • Repeat…

    • Schedule: Accepts @hourly, @daily, @monthly, or a custom cron expression.

    • Delay Transfer: Add a delay to the execution time.

  • TimeZone: supports extended timezone formats like ‘Asia/Tokyo’.


Details

Name your transfer and select Done to start.


After your transfer has run, you can see the results of your transfer in the Databases tab.

Use the Command Line to create your Twitter Tweet Insights connection

You can use the TD Toolbelt command line to configure your connection.


Install the Treasure Data Toolbelt

Install the newest Treasure Data Toolbelt.

Create a Configuration File (load.yml)

The configuration file includes an in: section where you specify what comes into the connector from Twitter Tweet Insights and an out: section where you specify what the connector puts out to the database in Treasure Data.

The following example shows how to import Tweets without incremental scheduling.

in:
  type: twitter_tweet_insights
  comsumer_key: xxxxxxxx
  comsumer_secret: xxxxxxxx
  is_paid_account: false
  data_type: tweets
  tweet_type: all
  30_day_env: xxxxxxxx
  full_archive_env: xxxxxxxx
  handle: xxxxxxxx
  include_video: false
  from_date: 2019-01-17T00:00:00.000Z
  to_date: 2019-01-27T00:00:00.000Z
  tweet_incremental: false
out:
 mode: append


The following example shows how to import Tweets with incremental scheduling.

in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: tweets
 tweet_type: all
 30_day_env: xxxxxxxx
 full_archive_env: xxxxxxxx
 handle: xxxxxxxx
 include_video: false
 from_date: 2019-01-17T00:00:00.000Z
 to_date: 2019-01-27T00:00:00.000Z
 tweet_incremental: true
out:
 mode: append


The following example shows how to import Retweets without incremental scheduling.

in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: tweets
 tweet_type: retweets
 30_day_env: xxxxxxxx
 full_archive_env: xxxxxxxx
 handle_retweet: xxxxxxxx
 include_video: false
 from_date: 2019-01-17T00:00:00.000Z
 to_date: 2019-01-27T00:00:00.000Z
 tweet_incremental: false
out:
 mode: append


The following example shows how to import Retweets with incremental scheduling.

in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: tweets
 tweet_type: retweets
 30_day_env: xxxxxxx
 full_archive_env: xxxxxxxx
 handle_retweet: xxxxxxxx
 include_video: false
 from_date: 2019-01-17T00:00:00.000Z
 to_date: 2019-01-27T00:00:00.000Z
 tweet_incremental: true
out:
 mode: append


The following example shows how to import a User Timeline without incremental scheduling.

in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: account
 account_type: user
 account_label: xxxxxxxx
 account_incremental: false
out:
 mode: append


The following example shows how to import a User Timeline with incremental scheduling.

in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: account
 account_type: user
 account_label: xxxxxxxx
 account_incremental: true
out:
 mode: append


The following example shows how to import Follower IDs.

in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: account
 account_type: id
 account_label: xxxxxxxx
out:
 mode: append


The following example shows how to import a Follower List.

in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: account
 account_type: list
 account_label: xxxxxxxx
out:
 mode: append
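
The parameter combinations used in the examples above can be summarized in a small Python check. This is a hypothetical helper, not part of the TD Toolbelt. The key names and values are copied from the examples, and the rule that each object type needs its own handle key is inferred from them rather than stated by the connector.

```python
# Values taken from the load.yml examples above; the check itself is hypothetical.
ALLOWED_OBJECTS = {
    "tweets": ("tweet_type", {"all", "retweets"}),
    "account": ("account_type", {"user", "id", "list"}),
}


def check_in_section(config: dict) -> list:
    """Return a list of problems found in the in: section (empty if none)."""
    data_type = config.get("data_type")
    if data_type not in ALLOWED_OBJECTS:
        return [f"unknown data_type: {data_type!r}"]
    problems = []
    key, allowed = ALLOWED_OBJECTS[data_type]
    if config.get(key) not in allowed:
        problems.append(f"{key} must be one of {sorted(allowed)}")
    # Pick the handle key that the matching example uses.
    if data_type == "tweets":
        handle_key = "handle_retweet" if config.get("tweet_type") == "retweets" else "handle"
    else:
        handle_key = "account_label"
    if not config.get(handle_key):
        problems.append(f"{handle_key} is required")
    return problems
```

For example, a Retweets configuration that sets handle instead of handle_retweet is reported as missing handle_retweet.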


Preview the Data to be Imported (Optional)

You can preview data to be imported using the command td connector:preview.

$ td connector:preview load.yml 

Execute the Load Job

You use td connector:issue to execute the job.

You must specify the database and table where you want to store the data before you execute the load job, for example, td_sample_db and td_sample_table.

$ td connector:issue load.yml \
     --database td_sample_db \
     --table td_sample_table \
     --time-column date_time_column

Specifying the --time-column option is recommended because Treasure Data’s storage is partitioned by time. If the option is not given, the data connector selects the first long or timestamp column as the partitioning time. The column specified by --time-column must be of either long or timestamp type. Use the preview results to check the available column names and types. Generally, most data types have a last_modified_date column.
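
The default selection rule can be sketched as follows. This is an illustration of the stated rule only. The schema below is hypothetical, and real column names and types come from the td connector:preview output.

```python
def default_time_column(schema):
    """Return the first column whose type is long or timestamp, or None."""
    for name, col_type in schema:
        if col_type in ("long", "timestamp"):
            return name
    return None


# Hypothetical preview schema; real columns come from td connector:preview.
preview_schema = [("text", "string"), ("id", "long"), ("created_at", "timestamp")]
print(default_time_column(preview_schema))
```

In this hypothetical schema the rule picks id rather than created_at, because id is the first long column. This is the kind of surprise that an explicit --time-column avoids.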

If your data doesn’t have a time column, you can add one by using the add_time filter option. For details, see the add_time filter plugin documentation.

td connector:issue assumes that you have already created the database (td_sample_db) and the table (td_sample_table). If the database or the table does not exist in TD, td connector:issue fails. Therefore, you must either create the database and table manually or use the --auto-create-table option with td connector:issue to create them automatically.

$ td connector:issue load.yml \
     --database td_sample_db \
     --table td_sample_table \
     --time-column date_time_column \
     --auto-create-table

From the command line, submit the load job. Processing might take a couple of hours depending on the data size.

Scheduled execution

You can schedule periodic data connector execution to import Tweets or Retweets on a recurring basis. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon in your local data center.

Scheduled execution supports configuration parameters that control the behavior of the data connector during its periodic attempts to fetch data from Twitter:

  • incremental This configuration is used to control the load mode, which governs how the data connector fetches data from Twitter based on one of the native timestamp fields associated with each object.

  • columns This configuration is used to define a custom schema for data to be imported into Treasure Data. You can define only columns that you are interested in here but make sure they exist in the object that you are fetching. Otherwise, these columns aren’t available in the result.

  • last_record This configuration is used to control the last record from the previous load job. It requires the object include a key for the column name and a value for the column’s value. The key needs to match the Twitter Data column name.

See Appendix: How Incremental Loading works for details and examples.

Create the schedule

A new schedule can be created using the td connector:create command. The name of the schedule, the cron-style schedule, the database and table where the data will be stored, and the data connector configuration file are required.

The cron parameter accepts a standard cron expression, such as "10 0 * * *", or one of these shortcuts: @hourly, @daily, and @monthly.

By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The --timezone option only supports extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are not supported and may lead to unexpected schedules.

$ td connector:create \
    daily_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml

It’s also recommended to specify the --time-column option, since Treasure Data’s storage is partitioned by time.

$ td connector:create \
    daily_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --time-column created_at

List the Schedules

You can see the list of currently scheduled entries by entering the command td connector:list.

$ td connector:list
+--------------+------------+----------+-------+--------------+-----------------+--------------------------------------------+
| Name         | Cron       | Timezone | Delay | Database     | Table           | Config                                     |
+--------------+------------+----------+-------+--------------+-----------------+--------------------------------------------+
| daily_import | 10 0 * * * | UTC      | 0     | td_sample_db | td_sample_table | {"in"=>{"type"=>"twitter_tweet_insights",  |
+--------------+------------+----------+-------+--------------+-----------------+--------------------------------------------+

Show the Schedule Settings and History of Schedules

td connector:show shows the execution setting of a schedule entry.

$ td connector:show daily_import
Name     : daily_import
Cron     : 10 0 * * *
Timezone : UTC
Delay    : 0
Database : td_sample_db
Table    : td_sample_table
Config
---
in:
 type: twitter_tweet_insights
 comsumer_key: xxxxxxxx
 comsumer_secret: xxxxxxxx
 is_paid_account: false
 data_type: tweets
 tweet_type: all
 30_day_env: xxxxxxxx
 full_archive_env: xxxxxxxx
 handle: xxxxxxxx
 include_video: false
 from_date: 2019-01-17T00:00:00.000Z
 to_date: 2019-01-27T00:00:00.000Z
 tweet_incremental: true

td connector:history shows the execution history of a schedule entry. To investigate the results of each individual execution, use td job <jobid>.

$ td connector:history daily_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID  | Status  | Records | Database     | Table           | Priority | Started                   | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-18 00:10:05 +0000 | 160      |
| 577968 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-17 00:10:07 +0000 | 161      |
| 577914 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-16 00:10:03 +0000 | 152      |
| 577872 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-15 00:10:04 +0000 | 163      |
| 577810 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-14 00:10:04 +0000 | 164      |
| 577766 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-13 00:10:04 +0000 | 155      |
| 577710 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-12 00:10:05 +0000 | 156      |
| 577610 | success | 10000   | td_sample_db | td_sample_table | 0        | 2018-04-11 00:10:04 +0000 | 157      |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set

Delete the Schedule

td connector:delete removes the schedule.

$ td connector:delete daily_import

Appendix

Modes for the out plugin

You can specify file import mode in the out section of the load.yml file.

The out: section controls how data is imported into a Treasure Data table.
For example, you may choose to append data or replace data in an existing table in Treasure Data.

Output modes are ways to modify the data as the data is placed in Treasure Data.

  • Append (default): Records are appended to the target table.

  • Replace (available in td 0.11.10 and later): Replaces data in the target table. Any manual schema changes made to the target table remain intact.

Examples:

in:
  ...
out:
  mode: append


in:
  ...
out:
  mode: replace

How Incremental Loading works

Incremental loading uses monotonically increasing unique columns (such as AUTO_INCREMENT column) to load records that were inserted (or updated) after the last execution.

If incremental: true is set, this connector loads all records with an additional ORDER BY clause. This mode is useful when you want to fetch just the object targets that have changed since the previously scheduled run. For example, if the incremental_columns: [updated_at, id] option is set, the query is as follows:

SELECT * FROM (
 ...original query is here...
)
ORDER BY updated_at, id

When bulk data loading finishes successfully, it outputs the last_record: parameter as a config-diff so that the next execution uses it.

At the next execution, when last_record: is also set, this plugin generates additional WHERE conditions to load records larger than the last record. For example, if last_record: ["2017-01-01T00:32:12.000000", 5291] is set, the query is as follows:

SELECT * FROM (
 ...original query is here...
)
WHERE updated_at > '2017-01-01T00:32:12.000000' OR (updated_at = '2017-01-01T00:32:12.000000' AND id > 5291)
ORDER BY updated_at, id

Then, it updates last_record: so that the next execution uses the updated last_record.

IMPORTANT: If you set the incremental_columns: option, make sure that there is an index on the columns to avoid a full table scan. For this example, the following index should be created:

CREATE INDEX embulk_incremental_loading_index ON table (updated_at, id);
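
The WHERE condition generated from last_record, described above, can be reproduced with a short Python sketch. This is not the plugin's source code. The function names are hypothetical and the quoting is deliberately simplistic. It only illustrates how the condition extends the "greater than the last record" comparison across several ordered columns.

```python
def sql_literal(value):
    """Quote strings; leave numbers bare."""
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def incremental_where(columns, last_record):
    """Build the 'rows after last_record' condition for ordered columns."""
    clauses = []
    for i, (col, val) in enumerate(zip(columns, last_record)):
        # Earlier columns must tie; the current column must be strictly greater.
        ties = [f"{c} = {sql_literal(v)}" for c, v in zip(columns[:i], last_record[:i])]
        clause = " AND ".join(ties + [f"{col} > {sql_literal(val)}"])
        clauses.append(clause if i == 0 else f"({clause})")
    return " OR ".join(clauses)


where = incremental_where(["updated_at", "id"], ["2017-01-01T00:32:12.000000", 5291])
print("WHERE " + where)
```

For the two columns in the example, the printed line matches the WHERE clause shown in the query above.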

Recommended usage is to leave incremental_columns unset and let the connector automatically find an AUTO_INCREMENT primary key.

Currently, only Timestamp, Datetime, and numerical columns are supported as incremental_columns.

For a raw query, incremental_columns is required because the connector cannot detect the primary keys of a complex query.

If incremental: false is set, the data connector fetches all the records of the specified Twitter object type, regardless of when they were last updated. This mode is best combined with writing data into a destination table using the ‘replace’ mode.

Incremental Loading for Data Extensions

Treasure Data supports incremental loading for Data Extensions that have a date field.

If incremental: true is set, the data connector loads records according to the range specified by the from_date and the fetch_days for the specified date field.

Sandbox application limitation

The MaxResults parameter is limited to 100 for Sandbox accounts and 500 for Premium accounts.

Request rate limits apply at both minute and second granularity. The per-minute rate limit is 30 requests per minute. Requests are also limited to 10 per second. Requests are aggregated across both the data and counts endpoints. Monthly request limits are also applied. Sandbox environments are limited to 250 requests per month.

Users should check their application dashboard for request usage and monthly quota.
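
The arithmetic behind these limits can be worked through in a short Python sketch. The constants are the figures quoted in this section. The helper names are hypothetical, and the calculation assumes that every request returns a full page of results.

```python
SANDBOX_MAX_RESULTS = 100         # results per request (Sandbox)
PREMIUM_MAX_RESULTS = 500         # results per request (Premium)
REQUESTS_PER_MINUTE = 30
REQUESTS_PER_SECOND = 10
SANDBOX_REQUESTS_PER_MONTH = 250


def min_spacing_seconds(per_minute: int, per_second: int) -> float:
    """Smallest even spacing between requests that respects both limits."""
    return max(60 / per_minute, 1 / per_second)


def requests_needed(tweet_count: int, max_results: int) -> int:
    """Requests required to page through tweet_count results."""
    return -(-tweet_count // max_results)  # ceiling division


# Upper bound on Tweets a Sandbox environment can return in one month.
sandbox_monthly_ceiling = SANDBOX_REQUESTS_PER_MONTH * SANDBOX_MAX_RESULTS
print(sandbox_monthly_ceiling)
```

Under these figures, a Sandbox environment can return at most 25,000 Tweets per month, and requests spaced evenly at two seconds apart stay within the per-minute limit.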
