Mixpanel Import Integration

The Data Connector for Mixpanel allows you to back up event data from Mixpanel to Treasure Data. Use cases include:

  • One-time migration: The organization is migrating away from Mixpanel and wants to keep a raw copy of the data.
  • Incremental daily backup: The organization needs more granular SQL access to event data inside Mixpanel.

For sample workflows on how to import event data from Mixpanel, view Treasure Boxes.

Use TD Console

Create a New Connection

  1. Navigate to Integrations Hub > Catalog.
  2. Search and select Mixpanel.
  3. The following dialog opens. Edit the Mixpanel credential details.
  4. Select Continue and provide a name for this connection.

Create a New Source

After creating the connection, you are automatically taken to the Authentications page. Look for the connection you created and select New Source.

Connection

The following dialog opens.

  1. Name the data transfer and select Next.

Source Table

The following dialog opens.

  1. Edit your source parameters and select Next.

  • Data Export API Endpoint: The endpoint used for the Export API call. If left empty, the default value is https://data.mixpanel.com/api/2.0/export/
  • JQL API Endpoint: The endpoint used for the JQL API call. If left empty, the default value is https://mixpanel.com/api/2.0/jql/
  • Timezone: Your project timezone. It can be found in PROJECT SETTINGS >> YOUR PROJECT >> OVERVIEW.
  • JQL Mode: Whether to use the JQL endpoint instead of the Export endpoint. The default value is false.
  • JQL: The JQL script. It only takes effect when JQL Mode is true.
  • Incremental Loading: Whether to run the transfer in incremental mode.
  • Incremental Field: The field used as the index for incremental loading. The default value is time.
  • From Date: The start date of the incremental load.
  • Number of Days to Fetch: The total number of days of data to fetch in one run.
  • Number of Days to Slice: The number of days fetched in a single request.

Data Settings

Your data settings can be edited on this page.

  1. Edit the data settings or optionally, skip this page of the dialog.

  • Backfill Days: The number of days subtracted from from_date to calculate the final from_date used for the API request. This accounts for Mixpanel caching data on user devices before sending it to the Mixpanel server. It only takes effect when Incremental is true and Incremental Field is specified. The default value is 5.
  • Incremental Column Upper Limit Delay In Seconds: The upper limit of the incremental column. When exporting with an incremental column, the plugin locks the upper limit of the incremental column query to the job start time. This supports cases where Mixpanel has a delay in its processing. It only affects the Export endpoint. The default value is 0.
  • Allow Partial Import: Allows the plugin to skip errors during the import. It only affects the Export endpoint. The default value is true.
  • Fetch Custom Properties: Allows the plugin to import all custom properties. Export endpoint only. The default value is false.
  • Events: The events used to filter data. Export endpoint only.
  • Filter Expression: The segmentation expression. Export endpoint only.
  • Bucket: The data bucket used to filter data. Export endpoint only.
  • Schema: Your schema with column names and types, as stored in the TD table.

Data Preview

You can see a preview of your data before running the import by selecting Generate Preview. Data preview is optional and you can safely skip to the next page of the dialog if you choose to.

  1. Select Next. The Data Preview page opens.
  2. If you want to preview your data, select Generate Preview.
  3. Verify the data.

Data Placement

For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

  1. Select Next. Under Storage, create a new database or select an existing one, and create a new table or select an existing one, where you want to place the imported data.

  2. Select a Database > Select an existing or Create New Database.

  3. Optionally, type a database name.

  4. Select a Table > Select an existing or Create New Table.

  5. Optionally, type a table name.

  6. Choose the method for importing the data.

    • Append (default): Data import results are appended to the table. If the table does not exist, it is created.
    • Always Replace: Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.
    • Replace on New Data: Replaces the entire content of an existing table with the result output only when there is new data.
  7. Select the Timestamp-based Partition Key column. If you want to set a different partition key seed than the default key, you can specify the long or timestamp column as the partitioning time. As a default time column, it uses upload_time with the add_time filter.

  8. Select the Timezone for your data storage.

  9. Under Schedule, you can choose when and how often you want to run this query.

Run Once

  1. Select Off.
  2. Select Scheduling Timezone.
  3. Select Create & Run Now.

Repeat Regularly

  1. Select On.
  2. Select the Schedule. The UI provides these four options: @hourly, @daily, @monthly, or custom cron.
  3. You can also select Delay Transfer and add a delay to the execution time.
  4. Select Scheduling Timezone.
  5. Select Create & Run Now.

After your transfer has run, you can see the results of your transfer in Data Workbench > Databases.

Use Command Line

Install ‘td’ Command v0.11.9 or Later

Install the newest Treasure Data Toolbelt.

Look up Mixpanel API Credentials

Log in to your Mixpanel account, go to “Account” then “Projects” to look up your API key and API secret.

Create the Seed Config File (seed.yml)

Create a file called seed.yml as follows.

in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-28"
  fetch_days: 1
out:
  mode: append

This seed file is used to “guess” the full configuration file with column names. In this example, one day’s worth of data on Oct 28, 2015, is exported from Mixpanel to Treasure Data.

For more details on available out modes, see the Appendix below.

Guess Fields (Generate load.yml)

Based on seed.yml, generate load.yml with the connector:guess command.

$ td connector:guess seed.yml -o load.yml

Your column names and types depend on how you configure Mixpanel. The load.yml file should look as follows:

in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: UTC
  from_date: '2015-10-28'
  fetch_days: 1
  columns:
  - name: event
    type: string
  - name: "$browser"
    type: string
  - name: "$browser_version"
    type: long
  - name: "$city"
    type: string
  - name: "$current_url"
    type: string
  - name: "$initial_referrer"
    type: string
  - name: "$initial_referring_domain"
    type: string
  - name: "$lib_version"
    type: string
  - name: "$os"
    type: string
  - name: "$referrer"
    type: string
  - name: "$referring_domain"
    type: string
  - name: "$region"
    type: string
  - name: "$screen_height"
    type: long
  - name: "$screen_width"
    type: long
  - name: ENV
    type: string
  - name: RAILS_ENV
    type: string
  - name: distinct_id
    type: long
  - name: from
    type: string
  - name: job_id
    type: long
  - name: mp_country_code
    type: string
  - name: mp_lib
    type: string
  - name: name
    type: string
  - name: time
    type: long
filters:
- type: rename
  rules:
  - rule: upper_to_lower
  - rule: character_types
    pass_types: ["a-z", "0-9"]
    pass_characters: "_"
    replace: "_"
  - rule: first_character_types
    pass_types: ["a-z"]
    pass_characters: "_"
    prefix: "_"
  - rule: unique_number_suffix
    max_length: 128
out:
  mode: append
exec: {}

For more details on the rename filter, see rename filter plugin for Data Connector.

The Data Connector picked up fields such as:

  • ENV
  • RAILS_ENV
  • $current_url

Column names are normalized in the type: rename filter section.
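As a rough sketch of what those rename rules do, the following reimplements them in Python (illustrative only; it is not the actual filter plugin, and the unique_number_suffix rule is omitted):

```python
import re

def normalize_column(name: str) -> str:
    """Approximate the rename rules above: lowercase the name, replace any
    character outside [a-z0-9_] with '_', and prefix '_' if the first
    character is not a lowercase letter or underscore."""
    name = name.lower()                      # upper_to_lower
    name = re.sub(r"[^a-z0-9_]", "_", name)  # character_types
    if not re.match(r"[a-z_]", name):        # first_character_types
        name = "_" + name
    return name

print(normalize_column("$current_url"))  # _current_url
print(normalize_column("RAILS_ENV"))     # rails_env
```

For example, "$current_url" becomes "_current_url" because "$" is replaced with "_", and "RAILS_ENV" becomes "rails_env" through lowercasing alone.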

Preview Data Loading

Use the td connector:preview command to preview the data load.

td connector:preview load.yml

If the data looks good, then create the database and the table on Treasure Data where you wish to import Mixpanel data:

td db:create mixpanel_historical
td table:create mixpanel_historical app_name

Load Data

Submit the load job. It may take a couple of hours depending on the size of the data. Users need to specify the database and table where their data is stored.

It is also recommended to specify the --time-column option because Treasure Data’s storage is partitioned by time (see architecture). If the option is not provided, the data connector chooses the first long or timestamp column as the partitioning time. The column specified by --time-column must be of either long or timestamp type.

td connector:issue load.yml --database mixpanel_historical --table app_name --time-column time

The preceding command assumes that the database (mixpanel_historical) and table (app_name) already exist. If the database or the table does not exist in TD, the command fails, so either create them manually or pass the --auto-create-table option to td connector:issue to create them automatically:

$ td connector:issue load.yml --database mixpanel_historical --table app_name --time-column time --auto-create-table

You can assign a time-format column as the partitioning key with the --time-column option.

You can check that the data is loaded by running a SQL query directly against your data in Treasure Data:

td query -T presto -w -d mixpanel_historical 'SELECT COUNT(1) FROM app_name'

Scheduling Incremental Data Loading

Unless you are migrating off of Mixpanel completely, Mixpanel data must be incrementally loaded into Treasure Data regularly. The Data Connector’s scheduling function comes in handy for this purpose.

After scheduling, the Mixpanel Data Connector’s successive runs increment the from_date parameter by fetch_days. For example, if the initial run in load.yml was

from_date: '2015-10-28'
fetch_days: 1

Then, the next run is as follows:

from_date: '2015-10-29'
fetch_days: 1

You do not need to update load.yml after it is uploaded since from_date field is automatically updated on the server side.
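The server-side advancement of from_date can be modeled with a small sketch (illustrative only; the real update happens on the Treasure Data server):

```python
from datetime import date, timedelta

def next_from_date(from_date: str, fetch_days: int) -> str:
    """Advance from_date by fetch_days, as the scheduler does between runs.

    Sketch of the documented behavior, not the server-side implementation.
    """
    start = date.fromisoformat(from_date)
    return (start + timedelta(days=fetch_days)).isoformat()

# A run with from_date '2015-10-28' and fetch_days 1 is
# followed by a run starting at '2015-10-29'.
print(next_from_date("2015-10-28", 1))
```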

Suppose you wish to schedule a daily upload. Then, make sure that the initial from_date is at least two days ago and set fetch_days: 1 in load.yml. Then, the following command creates a daily job called “daily_mixpanel_import” which loads historical data to mixpanel_historical.app_name on Treasure Data every day.

$ td connector:create \
    daily_mixpanel_import \
    "10 0 * * *" \
    mixpanel_historical \
    app_name \
    load.yml \
    --time-column time # optional

The historical runs of the import can be seen with td connector:history name, e.g., td connector:history daily_mixpanel_import.

Incremental Data Loading With Incremental Column

Some Mixpanel projects are set up with an additional field that indicates when the data was processed by Mixpanel, for example mp_processing_time_ms.

You can add the incremental_column parameter to the Mixpanel input plugin. The maximum value of incremental_column seen in one run is stored and used as a filter for the next run, for example: where incremental_column > previous_run_max_value.
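The bookkeeping can be sketched as follows (hypothetical helper names; the actual plugin persists the maximum value on the server side between runs):

```python
def run_incremental(records, previous_max, column="mp_processing_time_ms"):
    """Keep only records newer than the previous run's maximum value, and
    return them together with the new maximum to store for the next run.

    Illustrative sketch of the incremental_column behavior described above.
    """
    new_rows = [r for r in records
                if previous_max is None or r[column] > previous_max]
    new_max = max((r[column] for r in new_rows), default=previous_max)
    return new_rows, new_max

rows = [{"mp_processing_time_ms": t} for t in (100, 200, 300)]
imported, checkpoint = run_incremental(rows, previous_max=150)
# imported holds the rows with values 200 and 300; checkpoint is 300,
# which a subsequent run would use as its filter.
```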

Example:

in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-28"
  incremental_column: mp_processing_time_ms
  fetch_days: 1
out:
  mode: append

Look back for data with back_fill_days

For device data (phones, tablets, and so on), Mixpanel can keep events on the user’s device for a while before sending them to the Mixpanel server, so data can appear in queries up to 5 days late. An incremental import from Mixpanel can therefore miss data that is still cached on user devices. To solve this, set the back_fill_days parameter (default: 5); the plugin then looks back that number of days (from_date - back_fill_days). For performance reasons, this feature only works with incremental_column.
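Under the assumption of simple date arithmetic mirroring the description above, the effective start date can be sketched as:

```python
from datetime import date, timedelta

def effective_from_date(from_date: str, back_fill_days: int = 5) -> str:
    """Look back back_fill_days before from_date to pick up events that
    were still cached on user devices. Sketch of the documented behavior."""
    start = date.fromisoformat(from_date)
    return (start - timedelta(days=back_fill_days)).isoformat()

# With the default back_fill_days of 5, a run configured with
# from_date '2015-10-28' actually requests data from '2015-10-23'.
print(effective_from_date("2015-10-28"))
```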

Split range query into smaller API queries with slice_range

In some cases, the data returned by a single query can exceed Mixpanel’s limits and cause the job to fail. In that case, you can split the large range query into smaller ones using the slice_range configuration parameter. This parameter is optional and defaults to 7.

Example:

in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-20"
  incremental_column: mp_processing_time_ms
  fetch_days: 6
  slice_range: 2
out:
  mode: append

The above configuration produces queries with the following ranges: [2015-10-20, 2015-10-21], [2015-10-22, 2015-10-23], [2015-10-24, 2015-10-25] instead of a single query for [2015-10-20, 2015-10-25].
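The slicing described above can be sketched in a few lines (illustrative only, not the plugin's implementation):

```python
from datetime import date, timedelta

def slice_ranges(from_date: str, fetch_days: int, slice_range: int = 7):
    """Split the inclusive window [from_date, from_date + fetch_days - 1]
    into sub-ranges of at most slice_range days each."""
    start = date.fromisoformat(from_date)
    end = start + timedelta(days=fetch_days - 1)  # inclusive end of window
    ranges = []
    while start <= end:
        stop = min(start + timedelta(days=slice_range - 1), end)
        ranges.append((start.isoformat(), stop.isoformat()))
        start = stop + timedelta(days=1)
    return ranges

print(slice_ranges("2015-10-20", 6, 2))
# [('2015-10-20', '2015-10-21'), ('2015-10-22', '2015-10-23'),
#  ('2015-10-24', '2015-10-25')]
```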

Appendix

Modes for Out Plugin

You can specify the import mode in the out section of seed.yml.

  • append (default)

This is the default mode and records are appended to the target table.

in:
  ...
out:
  mode: append
  • replace (In td 0.11.10 and later)

This mode replaces the data in the target table. Manual schema changes made to the target table remain intact with this mode.

in:
  ...
out:
  mode: replace