The Data Connector for Mixpanel allows you to back up event data from Mixpanel to Treasure Data. Use cases include:
The organization is migrating away from Mixpanel and wants to keep a raw copy of the data
Incremental daily backup
The organization needs to have more granular access with SQL to event data inside Mixpanel
For sample workflows on how to import event data from Mixpanel, view Treasure Boxes.
Use TD Console
Create a New Connection
Navigate to Integrations Hub > Catalog
Search and select Mixpanel.
The following dialog opens. Edit the Mixpanel credential details.
Select Continue and provide a name for this connection.
Create a New Source
After creating the connection, you are automatically taken to the Authentications page. Look for the connection you created and select New Source.
The following dialog opens.
Name the data transfer and select Next.
The following dialog opens.
Edit your source parameters and select Next.
Data Export API Endpoint
The endpoint used for Export API calls. If you leave it empty, the default value is https://data.mixpanel.com/api/2.0/export/.
JQL API Endpoint
The endpoint used for JQL API calls. If you leave it empty, the default value is https://mixpanel.com/api/2.0/jql/.
Timezone
Your project timezone. You can find it in Project Settings > Your Project > Overview.
JQL Mode
Whether to use the JQL endpoint instead of the Export endpoint. The default value is false.
JQL Script
The JQL script to run. It only takes effect when JQL Mode is true.
Incremental
Whether to run your transfer in incremental mode.
Incremental Field
The field used as the index for incremental loading. The default value is `time`.
From Date
The start date of the incremental load.
Number of Days to Fetch
The total number of days of data to fetch in one run.
Number of Days to Slice
The number of days that one request fetches.
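Taken together, these source parameters map onto the connector configuration roughly as follows. This is a hedged sketch: the key names follow the typical Mixpanel input plugin layout and should be verified against the configuration TD generates for your connection.

```yaml
in:
  type: mixpanel
  api_secret: xxxxxxxxxx          # from your Mixpanel project settings
  timezone: "US/Pacific"          # your project timezone
  export_endpoint: "https://data.mixpanel.com/api/2.0/export/"  # Data Export API Endpoint
  jql_endpoint: "https://mixpanel.com/api/2.0/jql/"             # JQL API Endpoint
  jql_mode: false                 # use the Export endpoint, not JQL
  incremental: true               # run the transfer in incremental mode
  from_date: "2015-10-28"         # start date of the incremental load
  fetch_days: 1                   # total number of days to fetch per run
  slice_range: 7                  # number of days per API request
```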
Your data settings can be edited on this page.
Edit the data settings or optionally, skip this page of the dialog.
Back Fill Days
The number of days subtracted from `from_date` to calculate the final `from_date` used for the API request. This accounts for Mixpanel caching data on user devices before sending it to the Mixpanel server. It only takes effect when Incremental is true and Incremental Field is specified. The default value is 5.
Incremental Column Upper Limit Delay In Seconds
The upper limit of the incremental column. When exporting with an incremental column, the plugin locks the upper limit of the incremental column query to the job start time, to handle delays in Mixpanel's processing. It only affects the Export endpoint. The default value is 0.
Allow Partial Import
Allows the plugin to skip errors during the import. It only affects the Export endpoint. The default value is true.
Fetch Custom Properties
Allows the plugin to import all custom properties. It only affects the Export endpoint. The default value is false.
Event
The events used to filter data. It only affects the Export endpoint.
Where
The segmentation expression used to filter data. It only affects the Export endpoint.
Bucket
The data bucket used to filter data. It only affects the Export endpoint.
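These data-settings options can be sketched in the same `in:` section. The key names are illustrative (they follow the common Mixpanel input plugin option names) and the filter values below are hypothetical examples; check both against your generated configuration.

```yaml
in:
  type: mixpanel
  back_fill_days: 5                # days subtracted from from_date (device caching)
  incremental_column_upper_limit_delay_in_seconds: 0
  allow_partial_import: true       # skip errors during import (Export endpoint only)
  fetch_custom_properties: false   # import custom properties (Export endpoint only)
  event: '["Sign Up", "Purchase"]'          # hypothetical event filter (Export endpoint only)
  where: 'properties["plan"] == "premium"'  # hypothetical segmentation expression
  bucket: "main"                            # hypothetical data bucket filter
```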
Your schema with columns names and types, which are stored in the TD table.
You can see a preview of your data in the Data Preview page before running the import by selecting Generate Preview. Data preview is optional and you can safely skip to the next page of the dialog if you choose to.
If you want to preview your data, select Generate Preview.
Verify the correct data is showing.
In this dialog, you specify where your data will be placed and schedule how often the import will run.
Under Storage, create a new database or select an existing one, and create a new table or select an existing one, as the destination for the imported data.
Select a Database > Select an existing or Create New Database.
Select a Table > Select an existing or Create New Table.
Choose the Append or Replace method for importing the data.
Append (default) - Data import results are appended to the table.
If the table does not exist, it will be created.
Replace - Replaces the entire content of an existing table with the resulting output of the query.
If the table does not exist, a new table is created.
Select the Timestamp-based Partition Key column.
If you want to set a partition key seed other than the default key, you can specify a long or timestamp column as the partitioning time. By default, the time column is upload_time, added by the add_time filter.
Select the Timezone for your data storage.
Under Schedule, you can choose when and how often you want to run this query.
Select Scheduling Timezone.
Select Create & Run Now.
Repeat the query:
Select the Schedule. The UI provides four options: @hourly, @daily, and @monthly, or custom cron.
You can also select Delay Transfer to delay the execution time.
Select Scheduling Timezone.
Select Create & Run Now.
After your transfer has run, you can see the results of your transfer in Data Workbench > Databases or Jobs.
Use Command Line
Install ‘td’ Command v0.11.9 or Later
Install the newest Treasure Data Toolbelt.
Look up Mixpanel API Credentials
Log in to your Mixpanel account, then go to Account > Projects to look up your API key and API secret.
Create the Seed Config File (seed.yml)
Create a file called `seed.yml` as follows.
This seed file is used to “guess” the full configuration file with column names. In this example, one day’s worth of data on Oct 28, 2015, is exported from Mixpanel to Treasure Data.
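A minimal `seed.yml` for this example might look like the following sketch; the API secret is a placeholder, and the exact key names should be confirmed against your own connection.

```yaml
in:
  type: mixpanel
  api_secret: xxxxxxxxxx      # your Mixpanel API secret
  timezone: "US/Pacific"      # your project timezone
  from_date: "2015-10-28"     # export data starting Oct 28, 2015
  fetch_days: 1               # one day's worth of data
out:
  mode: append
```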
For more details on available out modes, see the Appendix below.
Guess Fields (Generate load.yml)
Run `td connector:guess seed.yml -o load.yml`; the command reads the seed file and generates `load.yml` with the guessed configuration.
Your column names and types depend on how you configure Mixpanel. The generated `load.yml` file should look as follows:
For more details on the rename filter, see the rename filter plugin for Data Connector.
The Data Connector picked up fields such as:
Column names are normalized in the type: rename filter section.
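As a sketch, the generated `load.yml` extends the seed with the guessed columns and a `rename` filter section. The column mappings below are illustrative only, since your columns depend on your Mixpanel configuration.

```yaml
in:
  type: mixpanel
  api_secret: xxxxxxxxxx
  timezone: "US/Pacific"
  from_date: "2015-10-28"
  fetch_days: 1
filters:
- type: rename                # normalizes Mixpanel property names to valid columns
  columns:
    $browser: browser         # hypothetical mapping of a Mixpanel property
    $city: city
out:
  mode: append
```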
Preview Data Loading
Use the `td connector:preview` command to preview the data load, e.g., `td connector:preview load.yml`.
If the data looks good, create the database and the table on Treasure Data where you want to import the Mixpanel data, e.g., `td db:create td_sample_db` and `td table:create td_sample_db td_sample_table`.
Submit the load job. It may take a couple of hours depending on the size of the data. You must specify the database and table where the data is stored.
It is also recommended to specify the `--time-column` option, because Treasure Data's storage is partitioned by time (see architecture). If the option is not provided, the Data Connector chooses the first `timestamp` column as the partitioning time. The type of the column specified by `--time-column` must be either `long` or `timestamp`.
The preceding command assumes you have already created the database (`td_sample_db`) and table (`td_sample_table`). If the database or the table does not exist in TD, the command fails, so create the database and table manually or use the `--auto-create-table` option with the `td connector:issue` command to auto-create them:
You can assign a Time Format column to the partitioning key with the `--time-column` option.
You can check that the data is loaded by running a SQL query directly against your data in Treasure Data:
Scheduling Incremental Data Loading
Unless you are migrating off of Mixpanel completely, Mixpanel data must be incrementally loaded into Treasure Data regularly. The Data Connector’s scheduling function comes in handy for this purpose.
After scheduling, each successive run of the Mixpanel Data Connector increments the `from_date` parameter by `fetch_days`. For example, if the initial run is configured as follows:
Then, the next run is as follows:
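As a sketch, assuming the initial run used `from_date: 2015-10-28` with `fetch_days: 1`, the server-side configuration for the next run would have advanced to:

```yaml
in:
  type: mixpanel
  from_date: "2015-10-29"   # advanced by fetch_days (1 day) after the previous run
  fetch_days: 1
```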
You do not need to update `load.yml` after it is uploaded, because the `from_date` field is automatically updated on the server side.
Suppose you wish to schedule a daily upload. Make sure that the initial `from_date` is at least two days ago and set `fetch_days: 1` in `load.yml`. Then, a command such as `td connector:create daily_mixpanel_import "10 0 * * *" mixpanel_historical app_name load.yml` creates a daily job called "daily_mixpanel_import", which loads historical data to `mixpanel_historical.app_name` on Treasure Data every day.
The historical runs of the import can be seen with `td connector:history <name>`, e.g., `td connector:history daily_mixpanel_import`.
Incremental Data Loading With Incremental Column
Some Mixpanel accounts have projects set up with an additional field that indicates when the data was processed by Mixpanel. For example:
You can add the `incremental_column` parameter to the Mixpanel input plugin. The maximum `incremental_column` value of a run is stored and used as a filter for the next run, for example: `where incremental_column > previous_run_max_value`.
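A hedged sketch of such a configuration follows; the field name `mp_processing_time_ms` is hypothetical, so use whatever processed-time field your project records.

```yaml
in:
  type: mixpanel
  incremental: true
  incremental_column: mp_processing_time_ms  # hypothetical processed-time field
  # After each run, the plugin stores the maximum value seen and filters the
  # next run with: where incremental_column > previous_run_max_value
```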
Look Back for Data with back_fill_days
For device data (phones, tablets, and so on), Mixpanel can keep data on the user's device for a while before sending it to the Mixpanel server, so data can appear in queries with a delay of up to 5 days. When incrementally importing data from Mixpanel, we can miss data that is still cached on user devices. To solve this issue, set the `back_fill_days` parameter (default: 5). The plugin then looks back that number of days (`from_date - back_fill_days`). For performance reasons, this feature only works with `incremental_column`.
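For example, a configuration that looks back five days might be sketched as follows (remember this only works together with `incremental_column`; the column name here is hypothetical):

```yaml
in:
  type: mixpanel
  incremental: true
  incremental_column: mp_processing_time_ms  # required for back_fill_days; name is hypothetical
  back_fill_days: 5    # re-read the last 5 days to pick up late-arriving device data
```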
Split Range Queries into Smaller API Queries with slice_range
In some cases, the returned data can be too big; the data returned by a single query can exceed Mixpanel's limits and cause the job to fail. In that case, you can split the big range query into smaller ones using the `slice_range` configuration parameter. This parameter is optional and defaults to 7.
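As an illustration, a sketch that slices a six-day window into two-day requests:

```yaml
in:
  type: mixpanel
  from_date: "2015-10-20"
  fetch_days: 6       # six days in total
  slice_range: 2      # split into two-day API queries
```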
The above configuration produces a query with the following range:
`[2015-10-20,2015-10-21]`, `[2015-10-22,2015-10-23]`, `[2015-10-24,2015-10-25]` instead of `[2015-10-20,2015-10-25]`.
Modes for Out Plugin
You can specify the file import mode in the `out` section of `seed.yml`.
append
This is the default mode; records are appended to the target table.
replace (In td 0.11.10 and later)
This mode replaces data in the target table. Manual schema changes made to the target table remain intact with this mode.
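The two modes can be sketched in the `out` section like this:

```yaml
out:
  mode: append      # default: records are appended to the target table
  # mode: replace   # alternative (td 0.11.10 and later): replace the table contents
```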
Incremental for JQL mode
Incremental loading for JQL is based on the idea that parameters can be sent along with the JQL script, so the parameters `params.start_time` and `params.end_time` are required in the JQL script when using incremental in JQL mode.
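A sketch of a JQL-mode configuration whose script uses the required parameters; the script body itself is illustrative.

```yaml
in:
  type: mixpanel
  jql_mode: true
  incremental: true
  jql_script: |
    function main() {
      // params.start_time and params.end_time are supplied by the plugin
      return Events({
        from_date: params.start_time,
        to_date: params.end_time
      });
    }
```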