Skip to content
Last updated

Overview

ID Unification Guide

This guide provides context on profiles and illustrates the most common ways to combine or stitch profiles based on shared identifiers using the ID Unification capability on Treasure Data CDP.

ID Unification Overview

While it was created for marketers, you will likely need a developer on your team to implement some of these features, which can utilize Treasure Data Workflows.

Why is there a need for a unique customer ID?

The customer journey is far more complex than it used to be and takes place across far more channels. Customers might hop on with a display ad, then continue through social media, your website, review sites, a coupon app, a local search, a brick-and-mortar visit, and an online purchase.

Without the ability to combine physical and digital data into a unique customer ID, you wouldn't see the totality of the journey. For example, if you send the brick-and-mortar visitor a follow-up email asking them to buy something they just purchased from your online store, you might not attribute the display ad to the eventual purchase. You could track all of these touchpoints as separate individual encounters rather than the journey of a single customer.

In short, customer ID unification enables marketing that is more efficient, more relevant to the customer, and ultimately more effective in building relationships.

Treasure Data ID Unification

What is Treasure Data ID Unification?

ID Unification is one of Treasure Data CDP's core capabilities. It unifies customer data across data sources and identities to create a single view of the customer. Currently, ID Unification is entirely constructed with Treasure Workflow, which requires the definition of source data, unification requirements, and an algorithm to unify with YAML-formatted configuration.

ID Unification Overview

The ID Unification service runs at the end of a custom workflow

It runs the following steps:

  1. Extract data and stitch IDs
  2. Join the tables by the canonical ID to create master table(s)

After the unification, you can use the master tables in a parent segment configuration.

Prerequisites

To perform ID unification, you need permission to edit parent segments. In the TD Console, you will need "Full" access to configure all master segments.

ID Unification Steps

Step 1: Create a Custom Workflow

Sample Workflow

_export:
 td:
   database: <database>

+unification:
  http_call>:  https://<cdp_endpoint>/unifications/workflow_call
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:
    #full_refresh: true        # OK to remove because full refresh runs every 3 days
    unification:
      !include : unify.yml     # Set YAML config file name

Workflow Details

Create a custom workflow with a step to call the unification API: http_call>: https://<cdp_endpoint>/unifications/workflow_call.

The unification workflow API differs depending on your region:

  • US region: https://api-cdp.treasuredata.com/unifications/workflow_call
  • EU01 Region: https://api-cdp.eu01.treasuredata.com/unifications/workflow_call
  • Tokyo Region: https://api-cdp.treasuredata.co.jp/unifications/workflow_call
  • Korea: https://api-cdp.ap02.treasuredata.com/unifications/workflow_call

The custom workflow for ID Unification automatically makes a secret, so you should not register the td.apikey manually.

The custom workflow should also include a .yml file containing the configuration for the unification process. In the following example, the file is called unify.yml.

  • unification:
    • !include : unify.yml
  • keep_debug_tables : (Optional) The default is false - # Enable this option not to drop intermediate tables
  • run_canonical_ids : (Optional) The default is true - If set to false, the unification process will not create unified canonical ids
  • run_enrichments : (Optional) The default is true. - If set to false, the unification process will not enrich tables in the created CDP unification database with canonical IDs
  • run_master_tables : (Optional) The default is true. If set to false, the unification process will not create master tables listed in the master table section
  • full_refresh : (Optional) Set to true to run full refresh on every run. A full refresh will run every three days, regardless of the setting

Step 2. Create a Unification YML Configuration File

ID Unification is configured through a YAML file.

Example Configuration

name: production
keys:
  - name: td_client_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['']

  - name: td_global_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['', '0000000000-...']

  - name: email
    valid_regexp: ".*@.*"

tables:
  - database: prod
    table: pageviews
    incremental_columns: [updated_at, id]
    key_columns:
      - {column: td_client_id, key: td_client_id}
 - database: brand2
    table: pageviews
    as: brand2_pageviews
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}
      - {column: email, key: email}

  - database: prod
    table: contacts
    key_columns:
      - {column: email, key: email}

canonical_ids:
  - name: browser_id
    merge_by_keys: [td_client_id, td_global_id]

master_tables:
  - name: marketing_master
    canonical_id: browser_id

    attributes:
      - name: browser_id
        source_canonical_id: browser_id

      - name: email
        source_columns:
          - {table: contacts, column: email}

Sections

Step 2.0: Name

Name is the name of your unification process. This name is used when the unification database is created.

Step 2.1: Keys

Keys are used to unify customer profiles.

keys:
  - name: td_client_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['']

  - name: td_global_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['', '0000000000-...']

  - name: email
    valid_regexp: ".*@.*"
  • The keys are based on the column names from the tables section.

    • If you're unsure which columns you should use as "keys:" use the following guidelines.
  • A key should be able to specify a customer or other entities uniquely:

    • For example, "first_name" is not a good key because many people have the same first name.
    • "td_ip" (IPv4 address) is also not good because many devices reuse the same address.
    • "vin_number" (VIN: Vehicle Identifier Number) is not a good key because multiple people could own the same car.
  • A key should be consistent. For example, "home_address" is not a good key because the same address may have different text representations.

  • "keys" entries of ID Unification YAML describe the specification of the key column values. For example, the td_client_id key column contains one of the strings that matches [0-9a-fA-F]{8}-... (valid_regexp) but must not be an empty value (invalid_texts). All of "307423ab-9cbc-4bca-9cae-05c3700cc8f2", "fc6422e1-48bb-4396-b7aa-bdd2075c000d" are sample values of the td_client_id key column. Any column that is a td_client_id key column can be named "td_client_id" (same name as the key), but it can be other names.

  • Name(required): Name of the key.

  • Valid_regexp(optional): use regex string to validate valid keys example: "[0-9a-fA-F]{8}-..."

  • Invalid_texts(optional): examples of text that shouldn't be considered valid keys. For example, ["'] is used for a valid key.

Step 2.2: Tables

Tables to be used for unification.

tables:
  - database: prod
	table: pageviews
	incremental_columns: [updated_at, id]
	key_columns:
	 - {column: td_client_id, key: td_client_id}

  - database: brand2
	table: pageviews
	as: brand2_pageviews
	key_columns:
	 - {column: td_client_id, key: td_client_id}
	 - {column: td_global_id, key: td_global_id}
	 - {column: email, key: email}

  - database: prod
	table: contacts
	key_columns:
	 - {column: email, key: email}
  • Tables : The tables to be used for unification declare the relationship between them. For example, prod.pageviews and brand2.pageviews can be joined by td_client_id and td_global_id.
    • database: (Required) The database where the table is contained
    • Table: (Required) The name of the table
    • Key_columns: (Required) The mapping between columns in the table and the keys for unification.
      • Column : (Required) The name of the column in the table.
      • Key : (Required) The key from the keys section to which this column corresponds.
  • as (Optional): Add this for the table name that the unification process should use. It is applicable when duplicate table names exist in different databases in the tables section.
  • incremental_columns (Optional): Columns to be used when incremental=true is set

Step 2.3: Canonical IDs

canonical_ids:
  - name: browser_id
	merge_by_keys: [td_client_id, td_global_id, td_ssc_id]

  - name: marketing_id
    merge_by_canonical_ids: [browser_id]
    merge_by_keys: [email]
    source_tables: [pageviews, contacts]
 
  - name: contact_id
    merge_by_canonical_ids: [browser_id]
	merge_by_keys: [membership_id, email]
	merge_iterations: 3
	incremental_merge_iterations: 2

Declare creating a canonical ID in the "canonical_ids:" section. You can create multiple canonical IDs in a single YAML file.

  • name : Name of the canonical ID
  • Merge_by_keys : keys to use to create a canonical id. The left key has higher priority. An un-canonicalized ID must have at least one key in the "merge_by_keys:" section. Set more consistent keys first in the merge_by_keys array. For example, [email, td_client_id] is better than [td_client_id, email].
  • Merge_by_canonical_ids : ID can inherit other IDs (merge_by_canonical_ids). This is helpful to create a hierarchy in the IDs, e.g., browser ID > device ID > person ID > family ID
  • source_tables : Sources tables from the "tables" section that contain the keys
  • Merge_iterations : The number of times the unification algorithm should iterate through the data set to create a canonical id. The number of merge iterations is three for full refresh (every three days) and two for following incremental refresh by default. Use incremental_merge_iterations options if the number of IDs needs to be unified more.
  • incremental_merge_iterations : The number of times the unification algorithm should iterate through the data set when doing an incremental merge.

Step 2.4: Master Table

Declare how to create master tables in the "master_tables:" section. You can create multiple master tables in a single YAML file. A master table must have one canonical_id as its primary key. Attributes must be declared. Attributes are columns of the master table.

  • name : name of the master table to use canonical_id : name of canonical id to use for master table
 master_tables:
  - name: marketing_master
  canonical_id: marketing_id
  attributes:
    - name: browser_id
    source_canonical_id: browser_id

    - name: firstname
    source_columns:
      - {table: pageviews}
      - {table: form_submits, column: first_name}

    - name: birthdate
    valid_regexp: "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}"
    invalid_texts: ['']
    source_columns:
      - {table: contacts, priority: 1}
      - {table: pageviews, priority: 2}
      - {table: form_submits, priority: 2}

    - name: all_emails
    array_elements: 5
    source_columns:
      - {table: contacts, order: last, order_by: time, priority: 1}
      - {table: pageviews, priority: 2}
      - {table: form_submits, priority: 2}

Step 2.4.1: How to Create the Attributes

An attribute can be created either of columns of tables ("source_columns") or canonical_id ("source_canonical_id").

  • Valid_regexp : Regex string to validate attributes.
  • invalid_texts : Values to be considered as invalid attributes.
  • Array_elements [x]: If "array_elements:" is set, the column becomes an array with at most [x] elements. array_elements accepts a maximum of 10 elements at present.
  • Source_columns has a table name and column name. The attribute name is used as a column name if "column:" is omitted. "source_columns " can have multiple columns. There are options to control which column is used as the attribute.
  • Table (Required): The name of the table to use.
  • Column (Optional): The name of the column to use.
  • Priority(Optional): Set the priority of the attribute.
  • Order(Optional): Sort the rows by "order_by: "column (default: "time"), and take first or last (configured by "order" section. default: "last").
  • Order_by:The column to determine the order of rows with the same priority.

How the Master Table Attribute / Priority is Resolved

  • Workflow takes the highest priority first.
  • Get the value from "priority: 1." If it's null, get from "priority: 2." Repeat until you reach a more significant priority number.
  • Multiple columns can have the same priority, except numerous columns from the same table are configured.
  • If multiple rows exist in the same priority, sort the rows by the "order_by: "column (default: "time") and take first or last (configured by the "order" section, default: "last").

ID Unification Outcome

Outcome 1. Unification Database

The unification process will create a database with the naming convention cdp_unification_<name specified at the top of the unification yml file>. For example, if the name specified at the top of your unification yml file is jack123. A database named cdp_unification_jack123 will be created on your account.

Outcome 2: ID Lookup Tables and Statistics Tables

Mapping from other IDs to canonical IDs.

Outcome 2.1: Unification Generates Statistics Tables

  • < canonical_id_name>_result_key_stats: Statistics of the unification results

New records are inserted into the result statistics table (<canonical_id_name>_result_key_stats) always when a workflow runs.

  • < canonical_id_name>_source_key_stats: Statistics of source tables before unification

New records are inserted into the source statistics table (<canonical_id_name>_source_key_stats) only when a workflow runs in full-refresh mode.

  • These tables keep historical information. To get the latest data only, sort the table by the "time" column and get the latest rows ("time" is the workflow's session_time).

Example of a query to get the latest information only:

SELECT
   * ,
  row_number(
  ) over(
partition BY from_table
  ORDER BY
time DESC
  ) AS i
FROM
  XXX_key_stats

ID unification workflow generates <canonical_id_name>_lookup table for each canonical ID. This table is usable with <canonical_id_name>_keys and <canonical_id_name>_tables.

The assignment of id_key_type and id_source_table_ids is consistent, but you can't predict which ID will be assigned. Therefore, you must use <canonical_id_name>_lookup table always JOIN-ed with _keys and _tables tables as following (Trino query):

SELECT
  lookup.* ,
  keys.key_name,
  transform(
id_source_table_ids,
tid -> element_at(
  tables.mapping,
  tid
)
  ) AS id_source_table_names
FROM
  CANONICAL_ID_lookup lookup
JOIN
  CANONICAL_ID_keys keys
  ON lookup.id_key_type = keys.key_type CROSS
JOIN (
SELECT
  map_agg(
    table_id,
    table_name
  ) AS mapping
FROM
  CANONICAL_ID_tables
  ) tables

Outcome 3: Copy and Enrich Source Tables With the Canonical ID

The unification process will create copies of the tables from the tables section of the unification yml configuration. Each table's name will be prefixed with the enriched_<table_name>.

Each table will include a column with the canonical ID assigned to that record.

About Unification Support

The unification workflow results heavily depend on the underlying data at the time it is run and changes in the underlying data between runs. The Treasure Data Support team does not have access to customer data, so the scope of support possible for unexpected unification results is limited. Treasure Data support can help unify workflow errors and confirm yml configuration syntax.

Complete ID Unification Process Tutorial

The following is an example of a complete ID Unification process with all the components.

1. Create Tables

This ID Unification example consists of three tables: contacts, web_visits, and pageviews_email. The unification process will create a single customer ID from these three tables using three keys: td_client_id, td_global_id, and email.

contacts table

Schema : The contacts table schema has one column: email (excluding the time column).

Sample records from the contacts table.

pageviews_email table

Schema : The pageview_email table schema has two columns: email and td_client_id (excluding the time column).

Sample records

web_visits table

Schema

Sample Records Sample Records

2. Create Custom Workflow

  1. Create a workflow with the following .dig file.

Update the ID Unification http_call endpoint to point to your respective region.

  • sample dig file
+unification:
 http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
 headers:
   - authorization: ${secret:td_apikey}
 method: POST
 retry: true
 content_format: json
 content:
   full_refresh: true
   unification:
     !include : unify.yml

3. Add unify.yml

Add a unify.yml file to the workflow with the following content.

name: unification_tutorial

keys:
 - name: td_client_id
   valid_regexp: "[0-9a-fA-F]{8}"
   invalid_texts: ['']

 - name: td_global_id
   valid_regexp: "[0-9a-fA-F]{8}"
   invalid_texts: ['', '0000000000']

 - name: email
   valid_regexp: ".*@.*"

tables:
 - database: unify_db
   table: web_visits
   key_columns:
- {column: td_client_id, key: td_client_id}
     - {column: td_global_id, key: td_global_id}

 - database: unify_db
   table: pageviews_email
   key_columns:
     - {column: td_client_id, key: td_client_id}
     - {column: email, key: email}

 - database: unify_db
   table: contacts
   key_columns:
     - {column: email, key: email}

canonical_ids:
 - name: canonical_id
   merge_iterations: 3
   merge_by_keys: [td_client_id, td_global_id, email]


master_tables:
- name: unifcation_tutorial_master
  canonical_id: canonical_id
  attributes:

   - name: td_client_id
     invalid_texts: ['']
     source_columns:
       - {table: web_visits , column: td_client_id, order: first, order_by: time,  priority: 1}

   - name: td_global_id
     invalid_texts: ['']
     source_columns:
       - {table: web_visits , column: td_global_id, order: first, order_by: time,  priority: 1}

   - name: email
     invalid_texts: ['']
     source_columns:
       - {table: contacts , column: email ,order: last, priority: 1}

Unification Output

These three keys are based on corresponding columns within each table.

For example, the record for

email : Aleksandra_James6578@1kmd3.online,

td_client_id: df4a5935-0693-42be-979c-c3c4003c2530

would be unified into a single ID between the three tables.

The ID unification process will produce a master table unification_tutorial_master with unified records. The records for **email:**Aleksandra_James6578@1kmd3.online, has been unified into canonical id: 4OyhRjnMNWZP

Appendix

About do_not_merge_key

The do_not_merge_key is explicitly designed for this particular case only:

  • A trustable table has a unique ID, and two IDs in the table shouldn't be merged, even if other kinds of IDs indicate that they should be incorporated.
  • The example can be two salespeople using the same device (laptop) attached to the same email for whatever reason, so the email is the same but later links to two different user identifiers elsewhere (cust_id=1 or cust_id=2).

For example, if you have two tables, unify_1 and unify_2, and are trying to unify the two tables and unify a user profile.

The columns of each table are the following:

unify_1

  • email_1, phone_number_1, cust_id_1

unify_2

  • email_2, phone_number_2, cust_id_2

  • In the previous example table, the two records have the same email but different phone numbers and different cust_ids. They should normally be merged based on having the same email.
  • However, adding the do_not_merge_key: cust_id in the canonical id section will not merge these records even though they have the same email address.

The unification configuration file (unification.yml) would look like the following. Review the do_not_merge_key: cust_id in the canonical id section.

name: do_not_merge_test
keys:
 - name: email
 - name: cust_id
 - name: phone_number

tables:
 - database: unify_do_not_merge
   table: unify_1
   key_columns:
     - {column: email_1, key: email}
     - {column: cust_id_1, key: cust_id}
     - {column: phone_number_1, key: phone_number}

 - database: unify_do_not_merge
   table: unify_2
   key_columns:
     - {column: email_1, key: email}
     - {column: cust_id_1, key: cust_id}
     - {column: phone_number_2, key: phone_number}

canonical_ids:
 - name: canonical_id
   merge_by_keys: [cust_id,phone_number,email]
   source_tables: [unify_1, unify_2]
   merge_iterations: 3
   do_not_merge_key: cust_id

master_tables:
 - name: customer_master
   canonical_id: canonical_id
   attributes:

     - name: email
       source_columns:
         - {table: unify_1, column: email_1}
         - {table: unify_2, column: email_1}

     - name: phone
       source_columns:
         - {table: unify_1, column: phone_number_1}
         - {table: unify_2, column: phone_number_2}

     - name: cust_id
       source_columns:
         - {table: unify_1, column: cust_id_1, priority: 1}
         - {table: unify_2, column: cust_id_1,  order: last, order_by: time,  priority: 2}

The output of the unification process would produce two unique profiles.

Please note that the do_not_merge key will not unify profiles where the value of the do_not_merge key is unique. If the value of the do_not_merge key is the same for two profiles, those profiles will be unified, even if that key is specified as the do_not_merge key. For example:

Scenario #1

  • do_not_merge_key: cust_id

  • unique cust_id value for each record

Outcome: profiles are not unified, 2 canonical ids.

Scenario #2

  • do_not_merge_key: cust_id

  • same cust_id values for each record

Outcome: profiles are unified, 1 canonical id

There is a significant side effect to keep in mind. In the previous example, email=abc@123.com should be associated with cust_id=1 or cust_id=2. Realistically, email=abc@123.com will be assigned to both cust_id=1 and cust_id=2. The implication is that activities directly or indirectly associated with the email abc@123.com will be copied to cust_id=1 and cust_id=2. **Please remember this, as it can lead to surprising results when querying for activity only using the email abc@123.com. Do not use _do_not_merge_key_ unless necessary. the use case is similar to the above example.

To illustrate further, if you remove the do_not_merge_key: cust_id line from the unification yml file. The output will be one unified record.

Checking Results

Confirm the number of canonical_ids generated

You execute a query like the one below on the canonical_id_lookup table. The expectation is that this query should return 0 results as all keys should have been unified to a single canonical ID.

SELECT
  id,
  id_key_type,
  COUNT(canonical_id) AS cnt
FROM
  canonical_id_lookup
GROUP BY
  id,
  id_key_type
HAVING
  COUNT(canonical_id)> 1

If the query above returns results:

  • One possible cause of this is that the number of merge iterations isn't enough. To confirm this, you can check the setting for the merge_iterations in the canonical_id section of the unification yml configuration file.

  • Another option is to check the workflow logs for the result of the loop task numbers in the unification workflow.

The tasks look like +unify_loop+loop-N+iteration are the targets.

Confirm histogram data in the canonical_id_result_key_stats table

The other way to check the results is by checking the histogram in the result stats table. The histogram shows the distribution of how many ids one canonical_id has. There isn't necessarily a good or bad result for this query; that will depend on your dataset. The interpretation of the results of this query depends on your underlying dataset. Still, for example, if one canonical_id has 100 emails or keys, and that is unexpected, there might be an issue. The query is as follows.

SELECT
   *
FROM
  canonical_id_result_key_stats
WHERE
  from_table = '*'
  AND time = (
SELECT
  MAX(time)
FROM
  canonical_id_result_key_stats
  )

Difference between Full Refresh And Incremental

The difference between full_refresh=true and full_refresh=false is whether we incorporate the previous result or the unification workflow. There isn't a big difference in computation resources required, as the unification algorithm still has to unify all the records, but there is a difference in how to treat removed IDs. Removed IDs (a record) from the sources remain until a full refresh is run.

Limitations / Errors

  • The number of key_columns you can use in the tables section of the yml configuration file is limited to 32 keys.

    • Error: ....400 Bad Request: {"keys[1].relation_key_columns ":[" is too long (maximum is 32 characters)", "is invalid"], "tables[0].key_columns[0].key.relation_key_columns ":[" is too long (maximum is 32 characters)", "is invalid"]
  • A limitation of 255 tables can be used in the tables section.

  • When using do_not_merge_key. The do_not_merge_key must be the first element of merge_by_keys. The unification workflow fails if we give a key other than the first one from merge_by_keys as do_not_merge_key. Otherwise, the workflow fails with an error message 400 Bad Request: {"canonical_ids[0].do_not_merge_key" :[ "must be the first element of merge_by_keys"]}

#run_master_tables: true   # OK to remove because default is true
    #full_refresh: true        # OK to remove because refresh runs 3 days
    unification:
      !include : unify.yml     # Set YAML config file name
  • keep_debug_tables : not required, default is false # Enable this option not to drop intermediate tables
  • run_canonical_ids : not required as default is true, not required - Disable not to run canonical ids
  • run_enrichments : enrich tables in CDP unification db with canonical IDs, not required as the default is true
  • run_master_tables : create master tables from the master table section; not required as the default is true
  • full_refresh : run full refresh on every run. Full refresh will run every three days regardless of the setting. Not required
  • use_hive_engine : (Optional) - If set to true, the unification workflow would run on both Hive and Trino (depends on memory consumption tasks), but if not so, the unification workflow would still run on Trino only as currently.
  • unification :
  • !include: unify.yml