This guide provides context on profiles and illustrates the most common ways to combine or stitch profiles based on shared identifiers using the ID Unification capability on Treasure Data CDP.

While it was created for marketers, you will likely need a developer on your team to implement some of these features, which can utilize Treasure Data Workflows.
The customer journey is far more complex than it used to be and takes place across far more channels. Customers might hop on with a display ad, then continue through social media, your website, review sites, a coupon app, a local search, a brick-and-mortar visit, and an online purchase.
Without the ability to combine physical and digital data into a unique customer ID, you wouldn't see the totality of the journey. For example, if you send the brick-and-mortar visitor a follow-up email asking them to buy something they just purchased from your online store, you might not attribute the display ad to the eventual purchase. You could track all of these touchpoints as separate individual encounters rather than the journey of a single customer.
In short, customer ID unification enables marketing that is more efficient, more relevant to the customer, and ultimately more effective in building relationships.
ID Unification is one of Treasure Data CDP's core capabilities. It unifies customer data across data sources and identities to create a single view of the customer. Currently, ID Unification is entirely constructed with Treasure Workflow, which requires the definition of source data, unification requirements, and an algorithm to unify with YAML-formatted configuration.
The ID Unification service runs at the end of a custom workflow
It runs the following steps:
- Extract data and stitch IDs
- Join the tables by the canonical ID to create master table(s)
After the unification, you can use the master tables in a parent segment configuration.
To perform ID unification, you need permission to edit parent segments. In the TD Console, you will need "Full" access to configure all master segments.

_export:
td:
database: <database>
+unification:
http_call>: https://<cdp_endpoint>/unifications/workflow_call
headers:
- authorization: ${secret:td.apikey}
method: POST
retry: true
content_format: json
content:
#full_refresh: true # OK to remove because full refresh runs every 3 days
unification:
!include : unify.yml # Set YAML config file nameWorkflow Details
Create a custom workflow with a step to call the unification API: http_call>: https://<cdp_endpoint>/unifications/workflow_call.
The unification workflow API differs depending on your region:
- US region: https://api-cdp.treasuredata.com/unifications/workflow_call
- EU01 Region: https://api-cdp.eu01.treasuredata.com/unifications/workflow_call
- Tokyo Region: https://api-cdp.treasuredata.co.jp/unifications/workflow_call
- Korea: https://api-cdp.ap02.treasuredata.com/unifications/workflow_call
The custom workflow for ID Unification automatically makes a secret, so you should not register the td.apikey manually.
The custom workflow should also include a .yml file containing the configuration for the unification process. In the following example, the file is called unify.yml.
- unification:
- !include : unify.yml
- keep_debug_tables : (Optional) The default is false - # Enable this option not to drop intermediate tables
- run_canonical_ids : (Optional) The default is true - If set to false, the unification process will not create unified canonical ids
- run_enrichments : (Optional) The default is true. - If set to false, the unification process will not enrich tables in the created CDP unification database with canonical IDs
- run_master_tables : (Optional) The default is true. If set to false, the unification process will not create master tables listed in the master table section
- full_refresh : (Optional) Set to true to run full refresh on every run. A full refresh will run every three days, regardless of the setting
ID Unification is configured through a YAML file.
name: production
keys:
- name: td_client_id
valid_regexp: "[0-9a-fA-F]{8}-..."
invalid_texts: ['']
- name: td_global_id
valid_regexp: "[0-9a-fA-F]{8}-..."
invalid_texts: ['', '0000000000-...']
- name: email
valid_regexp: ".*@.*"
tables:
- database: prod
table: pageviews
incremental_columns: [updated_at, id]
key_columns:
- {column: td_client_id, key: td_client_id}
- database: brand2
table: pageviews
as: brand2_pageviews
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- {column: email, key: email}
- database: prod
table: contacts
key_columns:
- {column: email, key: email}
canonical_ids:
- name: browser_id
merge_by_keys: [td_client_id, td_global_id]
master_tables:
- name: marketing_master
canonical_id: browser_id
attributes:
- name: browser_id
source_canonical_id: browser_id
- name: email
source_columns:
- {table: contacts, column: email}Name is the name of your unification process. This name is used when the unification database is created.
Keys are used to unify customer profiles.
keys:
- name: td_client_id
valid_regexp: "[0-9a-fA-F]{8}-..."
invalid_texts: ['']
- name: td_global_id
valid_regexp: "[0-9a-fA-F]{8}-..."
invalid_texts: ['', '0000000000-...']
- name: email
valid_regexp: ".*@.*"The keys are based on the column names from the tables section.
- If you're unsure which columns you should use as "keys:" use the following guidelines.
A key should be able to specify a customer or other entities uniquely:
- For example, "first_name" is not a good key because many people have the same first name.
- "td_ip" (IPv4 address) is also not good because many devices reuse the same address.
- "vin_number" (VIN: Vehicle Identifier Number) is not a good key because multiple people could own the same car.
A key should be consistent. For example, "home_address" is not a good key because the same address may have different text representations.
"keys" entries of ID Unification YAML describe the specification of the key column values. For example, the td_client_id key column contains one of the strings that matches
[0-9a-fA-F]{8}-...(valid_regexp) but must not be an empty value (invalid_texts). All of "307423ab-9cbc-4bca-9cae-05c3700cc8f2", "fc6422e1-48bb-4396-b7aa-bdd2075c000d" are sample values of the td_client_id key column. Any column that is a td_client_id key column can be named "td_client_id" (same name as the key), but it can be other names.Name(required): Name of the key.
Valid_regexp(optional): use regex string to validate valid keys example: "[0-9a-fA-F]{8}-..."
Invalid_texts(optional): examples of text that shouldn't be considered valid keys. For example, ["'] is used for a valid key.
Tables to be used for unification.
tables:
- database: prod
table: pageviews
incremental_columns: [updated_at, id]
key_columns:
- {column: td_client_id, key: td_client_id}
- database: brand2
table: pageviews
as: brand2_pageviews
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- {column: email, key: email}
- database: prod
table: contacts
key_columns:
- {column: email, key: email}- Tables : The tables to be used for unification declare the relationship between them. For example, prod.pageviews and brand2.pageviews can be joined by td_client_id and td_global_id.
- database: (Required) The database where the table is contained
- Table: (Required) The name of the table
- Key_columns: (Required) The mapping between columns in the table and the keys for unification.
- Column : (Required) The name of the column in the table.
- Key : (Required) The key from the keys section to which this column corresponds.
- as (Optional): Add this for the table name that the unification process should use. It is applicable when duplicate table names exist in different databases in the tables section.
- incremental_columns (Optional): Columns to be used when incremental=true is set
canonical_ids:
- name: browser_id
merge_by_keys: [td_client_id, td_global_id, td_ssc_id]
- name: marketing_id
merge_by_canonical_ids: [browser_id]
merge_by_keys: [email]
source_tables: [pageviews, contacts]
- name: contact_id
merge_by_canonical_ids: [browser_id]
merge_by_keys: [membership_id, email]
merge_iterations: 3
incremental_merge_iterations: 2Declare creating a canonical ID in the "canonical_ids:" section. You can create multiple canonical IDs in a single YAML file.
- name : Name of the canonical ID
- Merge_by_keys : keys to use to create a canonical id. The left key has higher priority. An un-canonicalized ID must have at least one key in the "merge_by_keys:" section. Set more consistent keys first in the merge_by_keys array. For example, [email, td_client_id] is better than [td_client_id, email].
- Merge_by_canonical_ids : ID can inherit other IDs (merge_by_canonical_ids). This is helpful to create a hierarchy in the IDs, e.g., browser ID > device ID > person ID > family ID
- source_tables : Sources tables from the "tables" section that contain the keys
- Merge_iterations : The number of times the unification algorithm should iterate through the data set to create a canonical id. The number of merge iterations is three for full refresh (every three days) and two for following incremental refresh by default. Use incremental_merge_iterations options if the number of IDs needs to be unified more.
- incremental_merge_iterations : The number of times the unification algorithm should iterate through the data set when doing an incremental merge.
Declare how to create master tables in the "master_tables:" section. You can create multiple master tables in a single YAML file. A master table must have one canonical_id as its primary key. Attributes must be declared. Attributes are columns of the master table.
- name : name of the master table to use canonical_id : name of canonical id to use for master table
master_tables:
- name: marketing_master
canonical_id: marketing_id
attributes:
- name: browser_id
source_canonical_id: browser_id
- name: firstname
source_columns:
- {table: pageviews}
- {table: form_submits, column: first_name}
- name: birthdate
valid_regexp: "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}"
invalid_texts: ['']
source_columns:
- {table: contacts, priority: 1}
- {table: pageviews, priority: 2}
- {table: form_submits, priority: 2}
- name: all_emails
array_elements: 5
source_columns:
- {table: contacts, order: last, order_by: time, priority: 1}
- {table: pageviews, priority: 2}
- {table: form_submits, priority: 2}An attribute can be created either of columns of tables ("source_columns") or canonical_id ("source_canonical_id").
- Valid_regexp : Regex string to validate attributes.
- invalid_texts : Values to be considered as invalid attributes.
- Array_elements [x]: If "array_elements:" is set, the column becomes an array with at most [x] elements. array_elements accepts a maximum of 10 elements at present.
- Source_columns has a table name and column name. The attribute name is used as a column name if "column:" is omitted. "source_columns " can have multiple columns. There are options to control which column is used as the attribute.
- Table (Required): The name of the table to use.
- Column (Optional): The name of the column to use.
- Priority(Optional): Set the priority of the attribute.
- Order(Optional): Sort the rows by "order_by: "column (default: "time"), and take first or last (configured by "order" section. default: "last").
- Order_by:The column to determine the order of rows with the same priority.
- Workflow takes the highest priority first.
- Get the value from "priority: 1." If it's null, get from "priority: 2." Repeat until you reach a more significant priority number.
- Multiple columns can have the same priority, except numerous columns from the same table are configured.
- If multiple rows exist in the same priority, sort the rows by the "order_by: "column (default: "time") and take first or last (configured by the "order" section, default: "last").
The unification process will create a database with the naming convention cdp_unification_<name specified at the top of the unification yml file>. For example, if the name specified at the top of your unification yml file is jack123. A database named cdp_unification_jack123 will be created on your account.
Mapping from other IDs to canonical IDs.

- < canonical_id_name>_result_key_stats: Statistics of the unification results
New records are inserted into the result statistics table (<canonical_id_name>_result_key_stats) always when a workflow runs.

- < canonical_id_name>_source_key_stats: Statistics of source tables before unification
New records are inserted into the source statistics table (<canonical_id_name>_source_key_stats) only when a workflow runs in full-refresh mode.

- These tables keep historical information. To get the latest data only, sort the table by the "time" column and get the latest rows ("time" is the workflow's session_time).
Example of a query to get the latest information only:
SELECT
* ,
row_number(
) over(
partition BY from_table
ORDER BY
time DESC
) AS i
FROM
XXX_key_statsID unification workflow generates <canonical_id_name>_lookup table for each canonical ID. This table is usable with <canonical_id_name>_keys and <canonical_id_name>_tables.

The assignment of id_key_type and id_source_table_ids is consistent, but you can't predict which ID will be assigned. Therefore, you must use <canonical_id_name>_lookup table always JOIN-ed with _keys and _tables tables as following (Trino query):
SELECT
lookup.* ,
keys.key_name,
transform(
id_source_table_ids,
tid -> element_at(
tables.mapping,
tid
)
) AS id_source_table_names
FROM
CANONICAL_ID_lookup lookup
JOIN
CANONICAL_ID_keys keys
ON lookup.id_key_type = keys.key_type CROSS
JOIN (
SELECT
map_agg(
table_id,
table_name
) AS mapping
FROM
CANONICAL_ID_tables
) tablesThe unification process will create copies of the tables from the tables section of the unification yml configuration. Each table's name will be prefixed with the enriched_<table_name>.

Each table will include a column with the canonical ID assigned to that record.

The unification workflow results heavily depend on the underlying data at the time it is run and changes in the underlying data between runs. The Treasure Data Support team does not have access to customer data, so the scope of support possible for unexpected unification results is limited. Treasure Data support can help unify workflow errors and confirm yml configuration syntax.
The following is an example of a complete ID Unification process with all the components.
This ID Unification example consists of three tables: contacts, web_visits, and pageviews_email. The unification process will create a single customer ID from these three tables using three keys: td_client_id, td_global_id, and email.
Schema : The contacts table schema has one column: email (excluding the time column).

Sample records from the contacts table.

Schema : The pageview_email table schema has two columns: email and td_client_id (excluding the time column).

Sample records

Schema
Sample Records 

- Create a workflow with the following .dig file.
Update the ID Unification http_call endpoint to point to your respective region.
- sample dig file
+unification:
http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
headers:
- authorization: ${secret:td_apikey}
method: POST
retry: true
content_format: json
content:
full_refresh: true
unification:
!include : unify.ymlAdd a unify.yml file to the workflow with the following content.
name: unification_tutorial
keys:
- name: td_client_id
valid_regexp: "[0-9a-fA-F]{8}"
invalid_texts: ['']
- name: td_global_id
valid_regexp: "[0-9a-fA-F]{8}"
invalid_texts: ['', '0000000000']
- name: email
valid_regexp: ".*@.*"
tables:
- database: unify_db
table: web_visits
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- database: unify_db
table: pageviews_email
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: email, key: email}
- database: unify_db
table: contacts
key_columns:
- {column: email, key: email}
canonical_ids:
- name: canonical_id
merge_iterations: 3
merge_by_keys: [td_client_id, td_global_id, email]
master_tables:
- name: unifcation_tutorial_master
canonical_id: canonical_id
attributes:
- name: td_client_id
invalid_texts: ['']
source_columns:
- {table: web_visits , column: td_client_id, order: first, order_by: time, priority: 1}
- name: td_global_id
invalid_texts: ['']
source_columns:
- {table: web_visits , column: td_global_id, order: first, order_by: time, priority: 1}
- name: email
invalid_texts: ['']
source_columns:
- {table: contacts , column: email ,order: last, priority: 1}These three keys are based on corresponding columns within each table.
For example, the record for
email : Aleksandra_James6578@1kmd3.online,
td_client_id: df4a5935-0693-42be-979c-c3c4003c2530
would be unified into a single ID between the three tables.



The ID unification process will produce a master table unification_tutorial_master with unified records. The records for **email:**Aleksandra_James6578@1kmd3.online, has been unified into canonical id: 4OyhRjnMNWZP

The do_not_merge_key is explicitly designed for this particular case only:
- A trustable table has a unique ID, and two IDs in the table shouldn't be merged, even if other kinds of IDs indicate that they should be incorporated.
- The example can be two salespeople using the same device (laptop) attached to the same email for whatever reason, so the email is the same but later links to two different user identifiers elsewhere (cust_id=1 or cust_id=2).
For example, if you have two tables, unify_1 and unify_2, and are trying to unify the two tables and unify a user profile.
The columns of each table are the following:
unify_1
- email_1, phone_number_1, cust_id_1

unify_2
- email_2, phone_number_2, cust_id_2

- In the previous example table, the two records have the same email but different phone numbers and different cust_ids. They should normally be merged based on having the same email.
- However, adding the do_not_merge_key: cust_id in the canonical id section will not merge these records even though they have the same email address.
The unification configuration file (unification.yml) would look like the following. Review the do_not_merge_key: cust_id in the canonical id section.
name: do_not_merge_test
keys:
- name: email
- name: cust_id
- name: phone_number
tables:
- database: unify_do_not_merge
table: unify_1
key_columns:
- {column: email_1, key: email}
- {column: cust_id_1, key: cust_id}
- {column: phone_number_1, key: phone_number}
- database: unify_do_not_merge
table: unify_2
key_columns:
- {column: email_1, key: email}
- {column: cust_id_1, key: cust_id}
- {column: phone_number_2, key: phone_number}
canonical_ids:
- name: canonical_id
merge_by_keys: [cust_id,phone_number,email]
source_tables: [unify_1, unify_2]
merge_iterations: 3
do_not_merge_key: cust_id
master_tables:
- name: customer_master
canonical_id: canonical_id
attributes:
- name: email
source_columns:
- {table: unify_1, column: email_1}
- {table: unify_2, column: email_1}
- name: phone
source_columns:
- {table: unify_1, column: phone_number_1}
- {table: unify_2, column: phone_number_2}
- name: cust_id
source_columns:
- {table: unify_1, column: cust_id_1, priority: 1}
- {table: unify_2, column: cust_id_1, order: last, order_by: time, priority: 2}The output of the unification process would produce two unique profiles.

Please note that the do_not_merge key will not unify profiles where the value of the do_not_merge key is unique. If the value of the do_not_merge key is the same for two profiles, those profiles will be unified, even if that key is specified as the do_not_merge key. For example:
Scenario #1
do_not_merge_key: cust_id
unique cust_id value for each record
Outcome: profiles are not unified, 2 canonical ids.

Scenario #2
do_not_merge_key: cust_id
same cust_id values for each record
Outcome: profiles are unified, 1 canonical id

There is a significant side effect to keep in mind. In the previous example, email=abc@123.com should be associated with cust_id=1 or cust_id=2. Realistically, email=abc@123.com will be assigned to both cust_id=1 and cust_id=2. The implication is that activities directly or indirectly associated with the email abc@123.com will be copied to cust_id=1 and cust_id=2. **Please remember this, as it can lead to surprising results when querying for activity only using the email abc@123.com. Do not use _do_not_merge_key_ unless necessary. the use case is similar to the above example.
To illustrate further, if you remove the do_not_merge_key: cust_id line from the unification yml file. The output will be one unified record.

You execute a query like the one below on the canonical_id_lookup table. The expectation is that this query should return 0 results as all keys should have been unified to a single canonical ID.
SELECT
id,
id_key_type,
COUNT(canonical_id) AS cnt
FROM
canonical_id_lookup
GROUP BY
id,
id_key_type
HAVING
COUNT(canonical_id)> 1If the query above returns results:
- One possible cause of this is that the number of merge iterations isn't enough. To confirm this, you can check the setting for the merge_iterations in the canonical_id section of the unification yml configuration file.

- Another option is to check the workflow logs for the result of the loop task numbers in the unification workflow.
The tasks look like +unify_loop+loop-N+iteration are the targets.

The other way to check the results is by checking the histogram in the result stats table. The histogram shows the distribution of how many ids one canonical_id has. There isn't necessarily a good or bad result for this query; that will depend on your dataset. The interpretation of the results of this query depends on your underlying dataset. Still, for example, if one canonical_id has 100 emails or keys, and that is unexpected, there might be an issue. The query is as follows.
SELECT
*
FROM
canonical_id_result_key_stats
WHERE
from_table = '*'
AND time = (
SELECT
MAX(time)
FROM
canonical_id_result_key_stats
)The difference between full_refresh=true and full_refresh=false is whether we incorporate the previous result or the unification workflow. There isn't a big difference in computation resources required, as the unification algorithm still has to unify all the records, but there is a difference in how to treat removed IDs. Removed IDs (a record) from the sources remain until a full refresh is run.
The number of key_columns you can use in the tables section of the yml configuration file is limited to 32 keys.
- Error:
....400 Bad Request: {"keys[1].relation_key_columns ":[" is too long (maximum is 32 characters)", "is invalid"], "tables[0].key_columns[0].key.relation_key_columns ":[" is too long (maximum is 32 characters)", "is invalid"]
- Error:
A limitation of 255 tables can be used in the tables section.
When using do_not_merge_key. The do_not_merge_key must be the first element of merge_by_keys. The unification workflow fails if we give a key other than the first one from merge_by_keys as do_not_merge_key. Otherwise, the workflow fails with an error message 400 Bad Request: {"canonical_ids[0].do_not_merge_key" :[ "must be the first element of merge_by_keys"]}
#run_master_tables: true # OK to remove because default is true
#full_refresh: true # OK to remove because refresh runs 3 days
unification:
!include : unify.yml # Set YAML config file name- keep_debug_tables : not required, default is false # Enable this option not to drop intermediate tables
- run_canonical_ids : not required as default is true, not required - Disable not to run canonical ids
- run_enrichments : enrich tables in CDP unification db with canonical IDs, not required as the default is true
- run_master_tables : create master tables from the master table section; not required as the default is true
- full_refresh : run full refresh on every run. Full refresh will run every three days regardless of the setting. Not required
- use_hive_engine : (Optional) - If set to true, the unification workflow would run on both Hive and Trino (depends on memory consumption tasks), but if not so, the unification workflow would still run on Trino only as currently.
- unification :
- !include: unify.yml