# Mixpanel Import Integration

The Data Connector for Mixpanel allows you to back up your Mixpanel event data to Treasure Data. Use cases include:

| **Use Case** | **Description** |
| --- | --- |
| One-time migration | The organization is migrating away from Mixpanel and wants to keep a raw copy of the data. |
| Incremental daily backup | The organization needs more granular SQL access to the event data inside Mixpanel. |

For sample workflows that import event data from Mixpanel, see [Treasure Boxes](https://github.com/treasure-data/treasure-boxes/tree/master/td_load/mixpanel).

## Use TD Console

### Create a New Connection

1. Navigate to **Integrations Hub** > **Catalog**.
2. Search for and select Mixpanel.
3. The following dialog opens. Edit the Mixpanel credential details.
4. Select **Continue** and provide a name for this connection.

![](/assets/image-20200331-191757.62ae4934c2d416ee52ac82c22d39f0c09cd44e57f2928e0c14050423ccb5d3cc.6d61b60d.png)

### Create a New Source

After creating the connection, you are automatically taken to the Authentications page. Look for the connection you created and select **New Source**.

![](/assets/image-20200331-191831.26c095e5bbdc4e186d5fa3ccfc1a83d8bbc25bb940fc52d25101bcf1fbcc4a60.6d61b60d.png)

#### Connection

The following dialog opens.

1. Name the data transfer and select **Next**.

![](/assets/image-20200331-191847.490cea31436f6abfee2dbbf9ba1cc8877628c2017c0e39fedf50e839bc192311.6d61b60d.png)

#### Source Table

The following dialog opens.

1. Edit your source parameters and select **Next**.

![](/assets/image-20200331-205825.b6c7b1431dcb1fe90a3ca2a28e9507ccef368b7d4f718f2404ab4610b9f39289.6d61b60d.png)

![](/assets/image-20200331-191941.156b803358278dab82a2798475a19f9d207c16048e0db26d98c769ce663c861f.6d61b60d.png)

| **Parameters** | **Description** |
| --- | --- |
| **Data Export API Endpoint** | The endpoint used for Export API calls. If left empty, the default value is [https://data.mixpanel.com/api/2.0/export/](https://data.mixpanel.com/api/2.0/export/). |
| **JQL API Endpoint** | The endpoint used for JQL API calls. If left empty, the default value is [https://mixpanel.com/api/2.0/jql/](https://mixpanel.com/api/2.0/jql/). |
| **Timezone** | Your project timezone. It can be found in Mixpanel under PROJECT SETTINGS > YOUR PROJECT > OVERVIEW. |
| **JQL Mode** | Whether to use the JQL endpoint instead of the Export endpoint. The default value is false. |
| **JQL** | The JQL script. It takes effect only when JQL Mode is true. |
| **Incremental Loading** | Whether to run the transfer in incremental mode. |
| **Incremental Field** | The field used as the index for incremental loading. The default value is `time`. |
| **From date** | The start date of the incremental load. |
| **Number of Days to Fetch** | The total number of days of data to fetch in one run. |
| **Number of Days to Slice** | The number of days each API request fetches. |
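These UI fields correspond to configuration keys used by the connector in the CLI flow described later on this page. As a minimal sketch (using only the key names shown in the CLI examples below; the exact mapping of the JQL and endpoint fields may differ), an incremental source over the default `time` field might look like:

```yaml
in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY       # from your Mixpanel project settings
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'                 # should match your Mixpanel project timezone
  from_date: "2015-10-28"         # "From date" in the UI
  fetch_days: 1                   # "Number of Days to Fetch" in the UI
out:
  mode: append
```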
### Data Settings

Your data settings can be edited on this page.

1. Edit the data settings or, optionally, skip this page of the dialog.

![](/assets/image-20200331-205910.f234277625ea61950315dda899a5b3add865c944ac5d12c3f5c22882471a0288.6d61b60d.png)

![](/assets/image-20200331-205934.3f7e6cd4df143e81f63184b21ad59f4393eba4b89cdf45399886ab5adf3a1706.6d61b60d.png)

| **Parameters** | **Description** |
| --- | --- |
| **Backfill Days** | The number of days subtracted from `from_date` to calculate the final `from_date` used for the API request. This compensates for Mixpanel caching data on user devices before sending it to the Mixpanel server. It takes effect only when Incremental is true and Incremental Field is specified. The default value is 5. |
| **Incremental Column Upper Limit Delay In Seconds** | The upper limit of the incremental column. When exporting with an incremental column, the plugin locks the upper limit of the incremental column query to the job start time. This supports cases where Mixpanel's processing is delayed. It affects the Export endpoint only. The default value is 0. |
| **Allow Partial Import** | Allows the plugin to skip errors during the import. It affects the Export endpoint only. The default value is true. |
| **Fetch Custom Properties** | Allows the plugin to import all custom properties. Export endpoint only. The default value is false. |
| **Events** | The events used to filter data. Export endpoint only. |
| **Filter Expression** | The segmentation expression used to filter data. Export endpoint only. |
| **Bucket** | The data bucket used to filter data. Export endpoint only. |
| **Schema** | Your schema with column names and types, as stored in the TD table. |

### Data Preview

You can see a [preview](/products/customer-data-platform/integration-hub/batch/import/previewing-your-source-data) of your data before running the import by selecting Generate Preview. Data preview is optional, and you can safely skip to the next page of the dialog if you choose to.

1. Select **Next**. The Data Preview page opens.
2. If you want to preview your data, select **Generate Preview**.
3. Verify the data.

### Data Placement

For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

1. Select **Next**. Under Storage, create a new database or select an existing one, and create a new table or select an existing one, for where you want to place the imported data.
2. Select a **Database** > **Select an existing** or **Create New Database**.
3. Optionally, type a database name.
4. Select a **Table** > **Select an existing** or **Create New Table**.
5. Optionally, type a table name.
6. Choose the method for importing the data.
   - **Append** (default): Data import results are appended to the table. If the table does not exist, it will be created.
   - **Always Replace**: Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.
   - **Replace on New Data**: Replaces the entire content of an existing table with the result output only when there is new data.
7. Select the **Timestamp-based Partition Key** column. If you want to set a different partition key seed than the default key, you can specify a long or timestamp column as the partitioning time. By default, the connector uses `upload_time` with the `add_time` filter as the time column.
8. Select the **Timezone** for your data storage.
9. Under **Schedule**, choose when and how often you want to run this query.

#### Run Once

1. Select **Off**.
2. Select **Scheduling Timezone**.
3. Select **Create & Run Now**.

#### Repeat Regularly

1. Select **On**.
2. Select the **Schedule**. The UI provides four options: *@hourly*, *@daily*, and *@monthly*, or custom *cron* (see the example after this list).
3. You can also select **Delay Transfer** and add a delay to the execution time.
4. Select **Scheduling Timezone**.
5. Select **Create & Run Now**.
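The custom *cron* option uses standard five-field cron syntax (minute, hour, day of month, month, day of week). For example, the same schedule used by the CLI example later on this page:

```
10 0 * * *    # run every day at 00:10 in the scheduling timezone
```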
After your transfer has run, you can see the results of your transfer in **Data Workbench** > **Databases**.

## Use Command Line

### Install ‘td’ Command v0.11.9 or Later

Install the newest [Treasure Data Toolbelt](https://support.treasuredata.com/hc/en-us/articles/360000720048-Treasure-Data-Toolbelt-Command-line-Interface).

### Look up Mixpanel API Credentials

Log in to your Mixpanel account, then go to “Account” > “Projects” to look up your API key and API secret.

### Create the Seed Config File (seed.yml)

Create a file called `seed.yml` as follows:

```yaml
in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-28"
  fetch_days: 1
out:
  mode: append
```

This seed file is used to “guess” the full configuration file with column names. In this example, one day’s worth of data from Oct 28, 2015, is exported from Mixpanel to Treasure Data. For more details on the available `out` modes, see the Appendix below.

### Guess Fields (Generate load.yml)

Based on `seed.yml`, generate `load.yml` with the `connector:guess` command.

```
$ td connector:guess seed.yml -o load.yml
```

Your column names and types depend on how you configure Mixpanel. The `load.yml` file should look as follows:

```yaml
in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: UTC
  from_date: '2015-10-28'
  fetch_days: 1
  columns:
  - name: event
    type: string
  - name: "$browser"
    type: string
  - name: "$browser_version"
    type: long
  - name: "$city"
    type: string
  - name: "$current_url"
    type: string
  - name: "$initial_referrer"
    type: string
  - name: "$initial_referring_domain"
    type: string
  - name: "$lib_version"
    type: string
  - name: "$os"
    type: string
  - name: "$referrer"
    type: string
  - name: "$referring_domain"
    type: string
  - name: "$region"
    type: string
  - name: "$screen_height"
    type: long
  - name: "$screen_width"
    type: long
  - name: ENV
    type: string
  - name: RAILS_ENV
    type: string
  - name: distinct_id
    type: long
  - name: from
    type: string
  - name: job_id
    type: long
  - name: mp_country_code
    type: string
  - name: mp_lib
    type: string
  - name: name
    type: string
  - name: time
    type: long
filters:
- type: rename
  rules:
  - rule: upper_to_lower
  - rule: character_types
    pass_types: ["a-z", "0-9"]
    pass_characters: "_"
    replace: "_"
  - rule: first_character_types
    pass_types: ["a-z"]
    pass_characters: "_"
    prefix: "_"
  - rule: unique_number_suffix
    max_length: 128
out:
  mode: append
exec: {}
```

For more details on the `rename` filter, see [rename filter plugin for Data Connector](https://docs.treasuredata.com/smart/project-product-documentation/rename-filter-function).

The Data Connector picked up fields such as:

- ENV
- RAILS_ENV
- $current_url

Column names are normalized in the `type: rename` filter section.
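For example, applying the rename rules above to those fields yields lowercase names with disallowed characters replaced by `_` (a sketch of the expected result; the actual names are produced by `connector:guess`):

```
ENV           -> env             # upper_to_lower
RAILS_ENV     -> rails_env
$current_url  -> _current_url    # "$" is not in pass_types, so it is replaced with "_"
```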
### Preview Data Loading

Use the `connector:preview` command to preview the data load.

```bash
td connector:preview load.yml
```

If the data looks good, create the database and the table on Treasure Data where you wish to import the Mixpanel data:

```bash
td db:create mixpanel_historical
td table:create mixpanel_historical app_name
```

### Load Data

Submit the load job. It may take a couple of hours depending on the size of the data. You need to specify the database and table where your data is stored.

It is also recommended to specify the `--time-column` option, because Treasure Data’s storage is partitioned by time (see [architecture](https://support.treasuredata.com/hc/en-us/articles/360001480667)). If the option is not provided, the Data Connector chooses the first `long` or `timestamp` column as the partitioning time. The column specified by `--time-column` must be of either `long` or `timestamp` type.

```bash
td connector:issue load.yml --database mixpanel_historical --table app_name --time-column time
```

The preceding command assumes you have already created the database (`mixpanel_historical`) and table (`app_name`). If the database or the table does not exist in TD, this command fails, so create the database and table [manually](https://docs.treasuredata.com/articles/product-documentation-publication/data-management) or use the `--auto-create-table` option with the `td connector:issue` command to auto-create the database and table:

```
$ td connector:issue load.yml --database mixpanel_historical --table app_name --time-column time --auto-create-table
```

You can assign a time-format column to the partitioning key with the `--time-column` option.

You can check that the data has loaded by running a SQL query directly against your data in Treasure Data:

```bash
td query -T presto -w -d mixpanel_historical 'SELECT COUNT(1) FROM app_name'
```

### Scheduling Incremental Data Loading

Unless you are migrating off of Mixpanel completely, Mixpanel data must be incrementally loaded into Treasure Data on a regular basis. The Data Connector’s scheduling function comes in handy for this purpose. After scheduling, each successive run of the Mixpanel Data Connector increments the `from_date` parameter by `fetch_days`. For example, if the initial run in `load.yml` was

```
from_date: '2015-10-28'
fetch_days: 1
```

then the next run is as follows:

```
from_date: '2015-10-29'
fetch_days: 1
```

You **do not** need to update `load.yml` after it is uploaded, because the `from_date` field is automatically updated on the server side.

Suppose you wish to schedule a daily upload. Make sure that the initial `from_date` is at least two days in the past and set `fetch_days: 1` in `load.yml`. Then, the following command creates a daily job called “daily_mixpanel_import”, which loads data into `mixpanel_historical.app_name` on Treasure Data every day.

```
$ td connector:create \
    daily_mixpanel_import \
    "10 0 * * *" \
    mixpanel_historical \
    app_name \
    load.yml \
    --time-column time # optional
```

The historical runs of the import can be seen with `td connector:history name`, e.g., `td connector:history daily_mixpanel_import`.

### Incremental Data Loading With an Incremental Column

Some Mixpanel projects are set up with an additional field that indicates the time the data was processed by Mixpanel, for example `mp_processing_time_ms`. You can pass this field to the Mixpanel input plugin with the `incremental_column` parameter. The maximum `incremental_column` value seen in a run is stored and used as a filter for the next run, e.g., `where incremental_column > previous_run_max_value`. Example:

```
in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-28"
  incremental_column: mp_processing_time_ms
  fetch_days: 1
out:
  mode: append
```

#### Look Back for Data With ***back_fill_days***

For device data (phones, tablets, and so on), Mixpanel can keep data on the user’s device for a while before sending it to the Mixpanel server, so data can appear in queries with a delay of up to 5 days. When incrementally importing data from Mixpanel, this means data still cached on user devices can be missed. To solve this, set the `back_fill_days` parameter (default: 5). The plugin then looks back that number of days, starting the query from `from_date - back_fill_days`. For performance reasons, this feature only works with `incremental_column`.
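For instance, a sketch of the previous incremental configuration with the look-back set explicitly (with `back_fill_days: 5`, a run whose server-side `from_date` has advanced to 2015-11-02 would actually query from 2015-10-28):

```yaml
in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-28"
  incremental_column: mp_processing_time_ms  # required for back_fill_days
  back_fill_days: 5                          # query from from_date - 5 days
  fetch_days: 1
out:
  mode: append
```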
#### Split a Range Query Into Smaller API Queries With ***slice_range***

In some cases, the data returned by a single query can exceed Mixpanel’s limits and cause the job to fail. In that case, you can split the large range query into smaller ones using the `slice_range` configuration parameter. This parameter is optional and defaults to ***7***. Example:

```yaml
in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-20"
  incremental_column: mp_processing_time_ms
  fetch_days: 6
  slice_range: 2
out:
  mode: append
```

The above configuration produces queries with the ranges `[2015-10-20,2015-10-21]`, `[2015-10-22,2015-10-23]`, and `[2015-10-24,2015-10-25]`, instead of a single query for `[2015-10-20,2015-10-25]`.

## Appendix

### Modes for Out Plugin

You can specify the file import mode in the `out` section of `seed.yml`.

- append (default)

  This is the default mode; records are appended to the target table.

  ```
  in:
    ...
  out:
    mode: append
  ```

- replace (in td 0.11.10 and later)

  This mode replaces data in the target table. Manual schema changes made to the target table remain intact with this mode.

  ```
  in:
    ...
  out:
    mode: replace
  ```
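For reference, a complete minimal seed combining the `in` settings from the earlier examples with replace mode (same keys as above):

```yaml
in:
  type: mixpanel
  api_key: MIXPANEL_API_KEY
  api_secret: MIXPANEL_API_SECRET
  timezone: 'UTC'
  from_date: "2015-10-28"
  fetch_days: 1
out:
  mode: replace   # requires td 0.11.10 or later
```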