# Learning `canonical_id` for Cookie ID and Email

This page is based on the previous page [Learning `canonical_id` for Cookie ID](/products/customer-data-platform/id-unification/p3_canonical_id_learning), so some duplicate explanations have been omitted; please make sure you have read through the previous page first before continuing.

This page deals with a case where the same person is identified by stitching together the `td_client_id`, `td_global_id`, `td_ssc_id` and email of multiple sites. There are four sites, but each table has a different combination of keys that it has.

|  |  | site_aaa | site_aaa | site_aaa |  | site_xxx | site_xxx |  | site_yyy | site_yyy |  | site_zzz | site_zzz |
|  --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| date |  | td_client_id | td_global_id | td_ssc_id |  | td_ssc_id | td_global_id |  | email | td_ssc_id |  | td_client_id | email |
| Jan | 5th | aaa_001 | 3rd_001 | ssc_001 |  |  |  |  | a@ex.com | ssc_001 |  |  |  |
|  | 15th | aaa_001 | 3rd_002 | ssc_001 |  |  |  |  |  |  |  | zzz_001 | a@ex.com |
|  | 25th | aaa_001 | 3rd_003 | ssc_001 |  |  |  |  |  |  |  |  |  |
| Feb | 5th | aaa_001 | 3rd_004 | ssc_001 |  | ssc_001 | 3rd_004 |  |  |  |  |  |  |
|  | 15th |  |  |  |  | ssc_001 | 3rd_005 |  | a@ex.com | ssc_001 |  |  |  |
|  | 25th |  |  |  |  |  |  |  | a@ex.com | ssc_001 |  | zzz_003 | a@ex.com |
| Mar | 5th |  |  |  |  |  |  |  |  |  |  | zzz_003 | a@ex.com |
|  | 15th |  |  |  |  | ssc_001 | 3rd_008 |  |  |  |  | zzz_003 | a@ex.com |
|  | 25th | aaa_002 | 3rd_009 | ssc_002 |  | ssc_001 | 3rd_009 |  |  |  |  |  |  |
| Apr | 5th | aaa_002 | 3rd_010 | ssc_002 |  |  |  |  | b@ex.com | ssc_003 |  |  |  |
|  | 15th |  |  |  |  |  |  |  | b@ex.com | ssc_003 |  | zzz_004 | b@ex.com |
|  | 25th |  |  |  |  | ssc_002 | 3rd_010 |  |  |  |  | zzz_004 | c@ex.com |
| May | 5th | aaa_003 | 3rd_013 | ssc_003 |  | ssc_002 | 3rd_013 |  |  |  |  |  |  |
|  | 15th | aaa_003 | 3rd_014 | ssc_003 |  |  |  |  |  |  |  |  |  |
|  | 25th | aaa_003 | 3rd_015 | ssc_004 |  |  |  |  | c@ex.com | ssc_003 |  | zzz_005 | c@ex.com |
| Jun | 5th | aaa_003 | 3rd_016 | ssc_004 |  | ssc_003 | 3rd_016 |  |  |  |  |  |  |
|  | 15th |  |  |  |  | ssc_003 | 3rd_017 |  |  |  |  | zzz_005 | c@ex.com |
|  | 25th |  |  |  |  |  |  |  | c@ex.com | ssc_004 |  | zzz_005 | c@ex.com |


## Data Preparation

The above data is assumed to be stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using [Workflow Samples](https://github.com/treasure-data/treasure-boxes/tree/master/tool-box/id-unification-samples).

### site_aaa

| time | site | td_client_id | td_global_id | td_ssc_id |
|  --- | --- | --- | --- | --- |
| 2023/01/05 | aaa.jp | aaa_001 | 3rd_001 | ssc_001 |
| 2023/01/15 | aaa.jp | aaa_001 | 3rd_002 | ssc_001 |
| 2023/01/25 | aaa.jp | aaa_001 | 3rd_003 | ssc_001 |
| 2023/02/05 | aaa.jp | aaa_001 | 3rd_004 | ssc_001 |
| 2023/02/15 | aaa.jp |  |  |  |
| 2023/02/25 | aaa.jp |  |  |  |
| 2023/03/05 | aaa.jp |  |  |  |
| 2023/03/15 | aaa.jp |  |  |  |
| 2023/03/25 | aaa.jp | aaa_002 | 3rd_009 | ssc_002 |
| 2023/04/05 | aaa.jp | aaa_002 | 3rd_010 | ssc_002 |
| 2023/04/15 | aaa.jp |  |  |  |
| 2023/04/25 | aaa.jp |  |  |  |
| 2023/05/05 | aaa.jp | aaa_003 | 3rd_013 | ssc_003 |
| 2023/05/15 | aaa.jp | aaa_003 | 3rd_014 | ssc_003 |
| 2023/05/25 | aaa.jp | aaa_003 | 3rd_015 | ssc_004 |
| 2023/06/05 | aaa.jp | aaa_003 | 3rd_016 | ssc_004 |
| 2023/06/15 | aaa.jp |  |  |  |
| 2023/06/25 | aaa.jp |  |  |  |


### site_xxx

| time | site | td_ssc_id | td_global_id |
|  --- | --- | --- | --- |
| 2023/01/05 | xxx.jp | NULL | NULL |
| 2023/01/15 | xxx.jp | NULL | NULL |
| 2023/01/25 | xxx.jp | NULL | NULL |
| 2023/02/05 | xxx.jp | ssc_001 | 3rd_004 |
| 2023/02/15 | xxx.jp | ssc_001 | 3rd_005 |
| 2023/02/25 | xxx.jp | NULL | NULL |
| 2023/03/05 | xxx.jp | NULL | NULL |
| 2023/03/15 | xxx.jp | ssc_001 | 3rd_008 |
| 2023/03/25 | xxx.jp | ssc_001 | 3rd_009 |
| 2023/04/05 | xxx.jp | NULL | NULL |
| 2023/04/15 | xxx.jp | NULL | NULL |
| 2023/04/25 | xxx.jp | ssc_002 | 3rd_010 |
| 2023/05/05 | xxx.jp | ssc_002 | 3rd_013 |
| 2023/05/15 | xxx.jp | NULL | NULL |
| 2023/05/25 | xxx.jp | NULL | NULL |
| 2023/06/05 | xxx.jp | ssc_003 | 3rd_016 |
| 2023/06/15 | xxx.jp | ssc_003 | 3rd_017 |
| 2023/06/25 | xxx.jp | NULL | NULL |


### site_yyy

| time | site | email | td_ssc_id |
|  --- | --- | --- | --- |
| 2023/01/05 | yyy.jp | a@ex.com | ssc_001 |
| 2023/01/15 | yyy.jp | NULL | NULL |
| 2023/01/25 | yyy.jp | NULL | NULL |
| 2023/02/05 | yyy.jp | NULL | NULL |
| 2023/02/15 | yyy.jp | a@ex.com | ssc_001 |
| 2023/02/25 | yyy.jp | a@ex.com | ssc_001 |
| 2023/03/05 | yyy.jp | NULL | NULL |
| 2023/03/15 | yyy.jp | NULL | NULL |
| 2023/03/25 | yyy.jp | NULL | NULL |
| 2023/04/05 | yyy.jp | b@ex.com | ssc_003 |
| 2023/04/15 | yyy.jp | b@ex.com | ssc_003 |
| 2023/04/25 | yyy.jp | NULL | NULL |
| 2023/05/05 | yyy.jp | NULL | NULL |
| 2023/05/15 | yyy.jp | NULL | NULL |
| 2023/05/25 | yyy.jp | c@ex.com | ssc_003 |
| 2023/06/05 | yyy.jp | NULL | NULL |
| 2023/06/15 | yyy.jp | NULL | NULL |
| 2023/06/25 | yyy.jp | c@ex.com | ssc_004 |


### site_zzz

| time | site | td_client_id | email |
|  --- | --- | --- | --- |
| 2023/01/05 | zzz.jp | NULL | NULL |
| 2023/01/15 | zzz.jp | zzz_001 | a@ex.com |
| 2023/01/25 | zzz.jp | NULL | NULL |
| 2023/02/05 | zzz.jp | NULL | NULL |
| 2023/02/15 | zzz.jp | NULL | NULL |
| 2023/02/25 | zzz.jp | zzz_003 | a@ex.com |
| 2023/03/05 | zzz.jp | zzz_003 | a@ex.com |
| 2023/03/15 | zzz.jp | zzz_003 | a@ex.com |
| 2023/03/25 | zzz.jp | NULL | NULL |
| 2023/04/05 | zzz.jp | NULL | NULL |
| 2023/04/15 | zzz.jp | zzz_004 | b@ex.com |
| 2023/04/25 | zzz.jp | zzz_004 | c@ex.com |
| 2023/05/05 | zzz.jp | NULL | NULL |
| 2023/05/15 | zzz.jp | NULL | NULL |
| 2023/05/25 | zzz.jp | zzz_005 | c@ex.com |
| 2023/06/05 | zzz.jp | NULL | NULL |
| 2023/06/15 | zzz.jp | zzz_005 | c@ex.com |
| 2023/06/25 | zzz.jp | zzz_005 | c@ex.com |


## id_unification_ex2.dig


```yaml
+call_unification:
  http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call  
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:

    run_canonical_ids: true
    run_enrichments: true
    run_master_tables: true

    full_refresh: true
    keep_debug_tables: true

    unification:
      !include : unification_ex2.yml
```

## unification_ex2.yml


```yaml
name: test_id_unification_ex2

keys:
  - name: td_client_id
  - name: td_global_id
  - name: td_ssc_id
  - name: email

tables:
  - database: test_id_unification_ex2
    table: site_aaa
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}
      - {column: td_ssc_id, key: td_ssc_id}

  - database: test_id_unification_ex2
    table: site_xxx
    key_columns:
      - {column: td_ssc_id, key: td_ssc_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex2
    table: site_yyy
    key_columns:
      - {column: email, key: email}
      - {column: td_ssc_id, key: td_ssc_id}

  - database: test_id_unification_ex2
    table: site_zzz
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: email, key: email}

canonical_ids:
  - name: person_id
    merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]    
    merge_iterations: 5

master_tables:
  - name: master_table_ex2
    canonical_id: person_id
    attributes:
      - name: email
        source_columns:
          - {table: site_yyy, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 1}                    
      - name: td_ssc_id
        array_elements: 5
        source_columns:
          - {table: site_xxx, order: last, order_by: time, priority: 1}
          - {table: site_yyy, order: last, order_by: time, priority: 2}
      - name: td_client_id
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 4}                    
      - name: td_global_id
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_xxx, order: last, order_by: time, priority: 2}
```

## Explanation of the Unification Algorithm

### graph_unify_loop_0

The initial state of the graph is as follows:

![canonical_id_1](/assets/graph_unify_loop_0.ad3b925765a6850ab843f1e010f14ddd1fa7627471791338b285d5acb8dbe95d.48409074.avif)

#### How the graph_unify_loop_0 Table Is Created

The `graph_unify_loop_0` table is created within the `+extract_and_merge` task of the workflow. Let's examine how the graph is constructed.


```yaml
canonical_ids:
  - name: person_id
    merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]    
    merge_iterations: 5
```

In this configuration of `canonical_ids:`, `merge_by_keys:` prioritizes `email` to ensure the creation of the most universal `canonical_id`, followed by `td_ssc_id`.

Using the original data, the `graph_unify_loop_0` table is generated through the same steps as in the previous section.

![canonical_id_1](/assets/5-1-1.28251666cc10da13276b3bfa7cf136a2eac2c039d2f17f094a20b780a6fe58a1.48409074.avif)

Unlike before, pairs for `site_aaa` containing three or more keys in a single table are created. In this case, for a single record, the leader and followers are as follows:

- **Leader:**
  - `td_ssc_id`
- **Followers:**
  - `td_client_id`
  - `td_global_id`
  - `td_ssc_id`


As a result, the record expands into three pairs for each follower (highlighted in green in the figure above).

`follower_ns` and `leader_ns` identify which key corresponds to each id. (In this example, 1 is `td_client_id`, 2 is `td_global_id`, 3 is `td_ssc_id`, and 4 is `email`.) In this configuration, the priority order of `ns` is [4,3,1,2].

### graph_unify_loop_1

Based on the graph from loop_0, leaders are updated according to the following rules:

![canonical_id_1](/assets/5-2-1.d9b9f8620948a425d49449c5c10f9c55d71e413bc9f3c1010e69b34dd59e4778.48409074.avif)

The key difference this time is:

- If a leader is connected to a higher-priority leader, it will be replaced before leaders of the same priority adjacent to the follower.


All leaders are replaced with `new_leader` (unless they are the minimum or maximum priority, in which case they remain unchanged). The table resulting from merging all these replacements is the `graph_unify_loop_1` table.

![canonical_id_1](/assets/5-3-1.94499b310b5b3ed120146cdbcba5a6f6f7716201d4cf790e5057027079bc6d70.48409074.avif)

![canonical_id_1](/assets/graph_unify_loop_1.d8178a057a4e7ca6cc118c8e0b4a342a6342f04bb8d3c3b4990895c7cf8f24fe.48409074.avif)

Subsequent loops follow the same process with the graph from the previous iteration.

### graph_unify_loop_2

![canonical_id_1](/assets/graph_unify_loop_2.27df0549eddba523457d9f4e3b65692e11df8e41bb1d2d0a986a1ed25ad994b8.48409074.avif)

### graph_unify_loop_3

The process converges in the third loop.

![canonical_id_1](/assets/graph_unify_loop_3.af19e91ed84c4a5e060c57570f6af787b7b28424a0f7a09788b279e27ef0bd28.48409074.avif)

## Generating and Assigning `canonical_id`

In this example, the order of keys specified in `merge_by_keys:` prioritizes `email` as the highest. This is because, unlike other keys, the value of `email` is less likely to change over time. For users without an email, the `td_ssc_id` becomes the final leader, and the `canonical_id` is generated based on this value.

![canonical_id_1](/assets/5-4-1.ca24c335088dda43414199cb5d4b3f87faa0b53c5a04e1c170cb34c59c5bb067.48409074.avif)

### `canonical_id` Configurations for `master_table`

The `canonical_id` used in the `master_table` is generally created with all keys specified in `merge_by_keys:`. However, `canonical_id` can also be created using a subset of the keys.

That said, the `canonical_id` configured in the `master_table` must meet the following condition:

- The configuration must ensure that a `canonical_id` is assigned to every record in all source tables. In other words, every source table must have at least one key included in `merge_by_keys:`.


For this example (Example 2 data), the keys in each table are as follows:

- `site_aaa`
  - `td_client_id`
  - `td_global_id`
  - `td_ssc_id`
- `site_xxx`
  - `td_ssc_id`
  - `td_global_id`
- `site_yyy`
  - `email`
  - `td_ssc_id`
- `site_zzz`
  - `td_client_id`
  - `email`


Given this, the `merge_by_keys:` configurations valid for the `master_table` are:

- Good: `[email, td_ssc_id, td_global_id, td_client_id]`
- Good: `[td_ssc_id, td_global_id, td_client_id]`
- Good: `[email, td_global_id]`
- Good: `[td_ssc_id, td_client_id]`


On the other hand, the following configurations will result in errors:

- Bad: `[td_global_id, td_client_id]`
  - Missing keys from the `site_yyy` table.
- Bad: `[email, td_client_id]`
  - Missing keys from the `site_xxx` table.


In such cases, an error like the following will appear:


```yaml
400 Bad Request: {"master_tables[0].attributes[0].source_columns[0].table":["is invalid because it doesn't have columns merged into this master_table's canonical_id 'person_id'"],"master_tables[0].attributes[1].sou...
```

For example, in the `[td_global_id, td_client_id]` case, `canonical_id` is only assigned to records generated using `td_global_id` and `td_client_id`. Since the `site_yyy` table does not include these keys, it cannot be assigned a `canonical_id`. At least one key from `merge_by_keys:` (used for stitching) must be present to ensure that every record in all tables is assigned a `canonical_id`.