# Learning `canonical_id` for Cookie ID and Email This page is based on the previous page [Learning `canonical_id` for Cookie ID](/products/customer-data-platform/id-unification/p3_canonical_id_learning), so some duplicate explanations have been omitted; please make sure you have read through the previous page first before continuing. This page deals with a case where the same person is identified by stitching together the `td_client_id`, `td_global_id`, `td_ssc_id` and email of multiple sites. There are four sites, but each table has a different combination of keys that it has. | | | site_aaa | site_aaa | site_aaa | | site_xxx | site_xxx | | site_yyy | site_yyy | | site_zzz | site_zzz | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | date | | td_client_id | td_global_id | td_ssc_id | | td_ssc_id | td_global_id | | email | td_ssc_id | | td_client_id | email | | Jan | 5th | aaa_001 | 3rd_001 | ssc_001 | | | | | a@ex.com | ssc_001 | | | | | | 15th | aaa_001 | 3rd_002 | ssc_001 | | | | | | | | zzz_001 | a@ex.com | | | 25th | aaa_001 | 3rd_003 | ssc_001 | | | | | | | | | | | Feb | 5th | aaa_001 | 3rd_004 | ssc_001 | | ssc_001 | 3rd_004 | | | | | | | | | 15th | | | | | ssc_001 | 3rd_005 | | a@ex.com | ssc_001 | | | | | | 25th | | | | | | | | a@ex.com | ssc_001 | | zzz_003 | a@ex.com | | Mar | 5th | | | | | | | | | | | zzz_003 | a@ex.com | | | 15th | | | | | ssc_001 | 3rd_008 | | | | | zzz_003 | a@ex.com | | | 25th | aaa_002 | 3rd_009 | ssc_002 | | ssc_001 | 3rd_009 | | | | | | | | Apr | 5th | aaa_002 | 3rd_010 | ssc_002 | | | | | b@ex.com | ssc_003 | | | | | | 15th | | | | | | | | b@ex.com | ssc_003 | | zzz_004 | b@ex.com | | | 25th | | | | | ssc_002 | 3rd_010 | | | | | zzz_004 | c@ex.com | | May | 5th | aaa_003 | 3rd_013 | ssc_003 | | ssc_002 | 3rd_013 | | | | | | | | | 15th | aaa_003 | 3rd_014 | ssc_003 | | | | | | | | | | | | 25th | aaa_003 | 3rd_015 | ssc_004 | | | | | c@ex.com | ssc_003 | | zzz_005 | c@ex.com | | Jun | 5th | aaa_003 | 3rd_016 | ssc_004 | | ssc_003 | 3rd_016 | | | | | | | | | 15th | | | | | ssc_003 | 3rd_017 | | | | | zzz_005 | c@ex.com | | | 25th | | | | | | | | c@ex.com | ssc_004 | | zzz_005 | c@ex.com | ## Data Preparation The above data is assumed to be stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using [Workflow Samples](https://github.com/treasure-data/treasure-boxes/tree/master/tool-box/id-unification-samples). ### site_aaa | time | site | td_client_id | td_global_id | td_ssc_id | | --- | --- | --- | --- | --- | | 2023/01/05 | aaa.jp | aaa_001 | 3rd_001 | ssc_001 | | 2023/01/15 | aaa.jp | aaa_001 | 3rd_002 | ssc_001 | | 2023/01/25 | aaa.jp | aaa_001 | 3rd_003 | ssc_001 | | 2023/02/05 | aaa.jp | aaa_001 | 3rd_004 | ssc_001 | | 2023/02/15 | aaa.jp | | | | | 2023/02/25 | aaa.jp | | | | | 2023/03/05 | aaa.jp | | | | | 2023/03/15 | aaa.jp | | | | | 2023/03/25 | aaa.jp | aaa_002 | 3rd_009 | ssc_002 | | 2023/04/05 | aaa.jp | aaa_002 | 3rd_010 | ssc_002 | | 2023/04/15 | aaa.jp | | | | | 2023/04/25 | aaa.jp | | | | | 2023/05/05 | aaa.jp | aaa_003 | 3rd_013 | ssc_003 | | 2023/05/15 | aaa.jp | aaa_003 | 3rd_014 | ssc_003 | | 2023/05/25 | aaa.jp | aaa_003 | 3rd_015 | ssc_004 | | 2023/06/05 | aaa.jp | aaa_003 | 3rd_016 | ssc_004 | | 2023/06/15 | aaa.jp | | | | | 2023/06/25 | aaa.jp | | | | ### site_xxx | time | site | td_ssc_id | td_global_id | | --- | --- | --- | --- | | 2023/01/05 | xxx.jp | NULL | NULL | | 2023/01/15 | xxx.jp | NULL | NULL | | 2023/01/25 | xxx.jp | NULL | NULL | | 2023/02/05 | xxx.jp | ssc_001 | 3rd_004 | | 2023/02/15 | xxx.jp | ssc_001 | 3rd_005 | | 2023/02/25 | xxx.jp | NULL | NULL | | 2023/03/05 | xxx.jp | NULL | NULL | | 2023/03/15 | xxx.jp | ssc_001 | 3rd_008 | | 2023/03/25 | xxx.jp | ssc_001 | 3rd_009 | | 2023/04/05 | xxx.jp | NULL | NULL | | 2023/04/15 | xxx.jp | NULL | NULL | | 2023/04/25 | xxx.jp | ssc_002 | 3rd_010 | | 2023/05/05 | xxx.jp | ssc_002 | 3rd_013 | | 2023/05/15 | xxx.jp | NULL | NULL | | 2023/05/25 | xxx.jp | NULL | NULL | | 2023/06/05 | xxx.jp | ssc_003 | 3rd_016 | | 2023/06/15 | xxx.jp | ssc_003 | 3rd_017 | | 2023/06/25 | xxx.jp | NULL | NULL | ### site_yyy | time | site | email | td_ssc_id | | --- | --- | --- | --- | | 2023/01/05 | yyy.jp | a@ex.com | ssc_001 | | 2023/01/15 | yyy.jp | NULL | NULL | | 2023/01/25 | yyy.jp | NULL | NULL | | 2023/02/05 | yyy.jp | NULL | NULL | | 2023/02/15 | yyy.jp | a@ex.com | ssc_001 | | 2023/02/25 | yyy.jp | a@ex.com | ssc_001 | | 2023/03/05 | yyy.jp | NULL | NULL | | 2023/03/15 | yyy.jp | NULL | NULL | | 2023/03/25 | yyy.jp | NULL | NULL | | 2023/04/05 | yyy.jp | b@ex.com | ssc_003 | | 2023/04/15 | yyy.jp | b@ex.com | ssc_003 | | 2023/04/25 | yyy.jp | NULL | NULL | | 2023/05/05 | yyy.jp | NULL | NULL | | 2023/05/15 | yyy.jp | NULL | NULL | | 2023/05/25 | yyy.jp | c@ex.com | ssc_003 | | 2023/06/05 | yyy.jp | NULL | NULL | | 2023/06/15 | yyy.jp | NULL | NULL | | 2023/06/25 | yyy.jp | c@ex.com | ssc_004 | ### site_zzz | time | site | td_client_id | email | | --- | --- | --- | --- | | 2023/01/05 | zzz.jp | NULL | NULL | | 2023/01/15 | zzz.jp | zzz_001 | a@ex.com | | 2023/01/25 | zzz.jp | NULL | NULL | | 2023/02/05 | zzz.jp | NULL | NULL | | 2023/02/15 | zzz.jp | NULL | NULL | | 2023/02/25 | zzz.jp | zzz_003 | a@ex.com | | 2023/03/05 | zzz.jp | zzz_003 | a@ex.com | | 2023/03/15 | zzz.jp | zzz_003 | a@ex.com | | 2023/03/25 | zzz.jp | NULL | NULL | | 2023/04/05 | zzz.jp | NULL | NULL | | 2023/04/15 | zzz.jp | zzz_004 | b@ex.com | | 2023/04/25 | zzz.jp | zzz_004 | c@ex.com | | 2023/05/05 | zzz.jp | NULL | NULL | | 2023/05/15 | zzz.jp | NULL | NULL | | 2023/05/25 | zzz.jp | zzz_005 | c@ex.com | | 2023/06/05 | zzz.jp | NULL | NULL | | 2023/06/15 | zzz.jp | zzz_005 | c@ex.com | | 2023/06/25 | zzz.jp | zzz_005 | c@ex.com | ## id_unification_ex2.dig ```yaml +call_unification: http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call headers: - authorization: ${secret:td.apikey} method: POST retry: true content_format: json content: run_canonical_ids: true run_enrichments: true run_master_tables: true full_refresh: true keep_debug_tables: true unification: !include : unification_ex2.yml ``` ## unification_ex2.yml ```yaml name: test_id_unification_ex2 keys: - name: td_client_id - name: td_global_id - name: td_ssc_id - name: email tables: - database: test_id_unification_ex2 table: site_aaa key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - {column: td_ssc_id, key: td_ssc_id} - database: test_id_unification_ex2 table: site_xxx key_columns: - {column: td_ssc_id, key: td_ssc_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex2 table: site_yyy key_columns: - {column: email, key: email} - {column: td_ssc_id, key: td_ssc_id} - database: test_id_unification_ex2 table: site_zzz key_columns: - {column: td_client_id, key: td_client_id} - {column: email, key: email} canonical_ids: - name: person_id merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id] merge_iterations: 5 master_tables: - name: master_table_ex2 canonical_id: person_id attributes: - name: email source_columns: - {table: site_yyy, order: last, order_by: time, priority: 1} - {table: site_zzz, order: last, order_by: time, priority: 1} - name: td_ssc_id array_elements: 5 source_columns: - {table: site_xxx, order: last, order_by: time, priority: 1} - {table: site_yyy, order: last, order_by: time, priority: 2} - name: td_client_id array_elements: 5 source_columns: - {table: site_aaa, order: last, order_by: time, priority: 1} - {table: site_zzz, order: last, order_by: time, priority: 4} - name: td_global_id array_elements: 5 source_columns: - {table: site_aaa, order: last, order_by: time, priority: 1} - {table: site_xxx, order: last, order_by: time, priority: 2} ``` ## Explanation of the Unification Algorithm ### graph_unify_loop_0 The initial state of the graph is as follows: ![canonical_id_1](/assets/graph_unify_loop_0.ad3b925765a6850ab843f1e010f14ddd1fa7627471791338b285d5acb8dbe95d.48409074.avif) #### How the graph_unify_loop_0 Table Is Created The `graph_unify_loop_0` table is created within the `+extract_and_merge` task of the workflow. Let's examine how the graph is constructed. ```yaml canonical_ids: - name: person_id merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id] merge_iterations: 5 ``` In this configuration of `canonical_ids:`, `merge_by_keys:` prioritizes `email` to ensure the creation of the most universal `canonical_id`, followed by `td_ssc_id`. Using the original data, the `graph_unify_loop_0` table is generated through the same steps as in the previous section. ![canonical_id_1](/assets/5-1-1.28251666cc10da13276b3bfa7cf136a2eac2c039d2f17f094a20b780a6fe58a1.48409074.avif) Unlike before, pairs for `site_aaa` containing three or more keys in a single table are created. In this case, for a single record, the leader and followers are as follows: - **Leader:** - `td_ssc_id` - **Followers:** - `td_client_id` - `td_global_id` - `td_ssc_id` As a result, the record expands into three pairs for each follower (highlighted in green in the figure above). `follower_ns` and `leader_ns` identify which key corresponds to each id. (In this example, 1 is `td_client_id`, 2 is `td_global_id`, 3 is `td_ssc_id`, and 4 is `email`.) In this configuration, the priority order of `ns` is [4,3,1,2]. ### graph_unify_loop_1 Based on the graph from loop_0, leaders are updated according to the following rules: ![canonical_id_1](/assets/5-2-1.d9b9f8620948a425d49449c5c10f9c55d71e413bc9f3c1010e69b34dd59e4778.48409074.avif) The key difference this time is: - If a leader is connected to a higher-priority leader, it will be replaced before leaders of the same priority adjacent to the follower. All leaders are replaced with `new_leader` (unless they are the minimum or maximum priority, in which case they remain unchanged). The table resulting from merging all these replacements is the `graph_unify_loop_1` table. ![canonical_id_1](/assets/5-3-1.94499b310b5b3ed120146cdbcba5a6f6f7716201d4cf790e5057027079bc6d70.48409074.avif) ![canonical_id_1](/assets/graph_unify_loop_1.d8178a057a4e7ca6cc118c8e0b4a342a6342f04bb8d3c3b4990895c7cf8f24fe.48409074.avif) Subsequent loops follow the same process with the graph from the previous iteration. ### graph_unify_loop_2 ![canonical_id_1](/assets/graph_unify_loop_2.27df0549eddba523457d9f4e3b65692e11df8e41bb1d2d0a986a1ed25ad994b8.48409074.avif) ### graph_unify_loop_3 The process converges in the third loop. ![canonical_id_1](/assets/graph_unify_loop_3.af19e91ed84c4a5e060c57570f6af787b7b28424a0f7a09788b279e27ef0bd28.48409074.avif) ## Generating and Assigning `canonical_id` In this example, the order of keys specified in `merge_by_keys:` prioritizes `email` as the highest. This is because, unlike other keys, the value of `email` is less likely to change over time. For users without an email, the `td_ssc_id` becomes the final leader, and the `canonical_id` is generated based on this value. ![canonical_id_1](/assets/5-4-1.ca24c335088dda43414199cb5d4b3f87faa0b53c5a04e1c170cb34c59c5bb067.48409074.avif) ### `canonical_id` Configurations for `master_table` The `canonical_id` used in the `master_table` is generally created with all keys specified in `merge_by_keys:`. However, `canonical_id` can also be created using a subset of the keys. That said, the `canonical_id` configured in the `master_table` must meet the following condition: - The configuration must ensure that a `canonical_id` is assigned to every record in all source tables. In other words, every source table must have at least one key included in `merge_by_keys:`. For this example (Example 2 data), the keys in each table are as follows: - `site_aaa` - `td_client_id` - `td_global_id` - `td_ssc_id` - `site_xxx` - `td_ssc_id` - `td_global_id` - `site_yyy` - `email` - `td_ssc_id` - `site_zzz` - `td_client_id` - `email` Given this, the `merge_by_keys:` configurations valid for the `master_table` are: - Good: `[email, td_ssc_id, td_global_id, td_client_id]` - Good: `[td_ssc_id, td_global_id, td_client_id]` - Good: `[email, td_global_id]` - Good: `[td_ssc_id, td_client_id]` On the other hand, the following configurations will result in errors: - Bad: `[td_global_id, td_client_id]` - Missing keys from the `site_yyy` table. - Bad: `[email, td_client_id]` - Missing keys from the `site_xxx` table. In such cases, an error like the following will appear: ```yaml 400 Bad Request: {"master_tables[0].attributes[0].source_columns[0].table":["is invalid because it doesn't have columns merged into this master_table's canonical_id 'person_id'"],"master_tables[0].attributes[1].sou... ``` For example, in the `[td_global_id, td_client_id]` case, `canonical_id` is only assigned to records generated using `td_global_id` and `td_client_id`. Since the `site_yyy` table does not include these keys, it cannot be assigned a `canonical_id`. At least one key from `merge_by_keys:` (used for stitching) must be present to ensure that every record in all tables is assigned a `canonical_id`.