Merging Data

This article explains how to merge data from and to Treasure Data tables.

Table of Contents

Prerequisites

Overview

Both:

  • Merging from one table into an existing table
  • Merging from two tables into a third table

consist of querying the data from one or more source tables and writing the result using Job Result Output to Treasure Data to a table, whether existing or new.

Merging One Table into Another Existing Table

In this scenario both the source and the destination table exist and contain data. Merging consists of querying the data from the source table and writing the result with Job Result Output to Treasure Data into the destination table.

Let’s assume the source table is called src_dbs.src_tbl and has schema:

$ td schema:show src_dbs src_tbl
src_dbs.src_tbl (
  common_col:string
  src_col_1:string
  src_col_2:long
)
Untitled-3
The 'time' column is implicit. See the Schema documentation page for more details.

and the the destination table dst_dbs.dst_tbl has schema:

$ td schema:show dst_dbs dst_tbl
dst_dbs.dst_tbl (
  common_col:string
  dst_col_1:string
  dst_col_2:long
)

The data belonging to columns having the same name in the source and destination table will be merged – this includes the time column or its equivalent. Source table columns that don’t exist in the destination table are instead just copied over.

To merge the tables run this command in the CLI:

$ td query -d src_dbs -w -r 'td://@/dst_dbs/dst_tbl' \
   'SELECT `time`, src_col_1, src_col_2, common_col FROM src_tbl'
Untitled-3
we select the time in order to preserve the original timestamps when writing to the result to the destination table. If the time column was omitted, the time at which the query is ran will be used instead.

By using td://@/dst_dbs/dst_tbl as a job result output target, we will write the result in append mode, appending the result of this query to the data already in the destination table dst_dbs.dst_tbl. After the query completes, the schema of the destination table will be updated to include the columns in the source table’s result (that is, those columns indicated in the select clause of the query) that did not exist before; the new schema will be:

$ td schema:show dst_dbs dst_tbl
dst_dbs.dst_tbl (
  common_col:string
  dst_col_1:string
  dst_col_2:long
  src_col_1:string
  src_col_2:long
)

Clearly data for both time and common_col was merged in the process. If one would want to avoid merging common_col in the destination table with common_col in the source table, perhaps because the two columns contain data with different meanings that should not mixed, a easy way is to rename the source column in the ‘SELECT’ clause with the old_name AS new_name diction:

$ td query ... 'SELECT `time`, src_col_1, src_col_2, common_col AS src_common_col FROM src_tbl'

Merging Two Tables Into A Third Table

In this scenario we want to merge data from two tables. Whether writing into a new or existing table, the approach is very similar, with the schema caveats discussed above.

Let’s assume the two source tables are called src_dbs_1.src_tbl_1 and src_dbs_2.src_tbl_2 and have schema:

$ td schema:show src_dbs_1 src_tbl_1
src_dbs_1.src_tbl_1 (
  common_col:string
  src_col_1_1:string
  src_col_1_2:long
)
$ td schema:show src_dbs_2 src_tbl_2
src_dbs_2.src_tbl_2 (
  common_col:string
  src_col_2_1:string
  src_col_2_2:long
)

For simplicity let’s assume the destination database dst_dbs exists but the destination table dst_dbs.dst_tbl does not (for schema update simplification).

This query will merge the data from the 2 tables into the new destination table, which will be created automatically once the result of the query is written to the destination.

$ td query -d src_dbs_1 -w -r 'td://@/dst_dbs/dst_tbl' \
   'SELECT                              \
      *                                 \
    FROM (                              \
      SELECT                            \
        tbl_1_alias.time,               \
        tbl_1_alias.common_col,         \
        tbl_1_alias.src_col_1_1,        \
        tbl_1_alias.src_col_1_2,        \
        NULL as src_col_2_1,            \
        NULL as src_col_2_2             \
      FROM                              \
        src_dbs_1.src_tbl_1 tbl_1_alias \
      UNION ALL                         \
      SELECT                            \
        tbl_2_alias.time,               \
        tbl_2_alias.common_col,         \
        NULL as src_col_1_1,            \
        NULL as src_col_1_2             \
        tbl_2_alias.src_col_2_1,        \
        tbl_2_alias.src_col_2_2         \
      FROM                              \
        src_dbs_2.src_tbl_2 tbl_2_alias \
    ) tbl_1_union_tbl_2'

These details are worth noting:

  • We use an UNION ALL to merge the content of the columns with identical names in the two source tables (common_col in this example) without elimination/de-duplication – this also allows preserving the original timestmaps of the records from both tables by including them in the two inner select clauses for each respective tables. Therefore even if there were no columns to be merged, we would still need to use a UNION ALL to merge the time column;
  • Since in an UNION ALL both sides of the union must contain the same number, name, and type of columns we need to specify the second source table’s columns missing in the first source table as NULL as tbl_2_col_X to act as column placeholders and vice versa. This can very easily become complicated if the tables contain many unrelated columns;
  • If the two source tables reside in the same database, one can drop the database specified in the two (or more) inner FROM clauses;
  • Both source tables need to be aliased with a ‘src_dbs_2.src_tbl_2 tbl_2_alias’ syntax and the tbl_2_alias is used to namespace columns in either table;
  • The last alias at the end and outside of the round parenthesis is an arbitrary name, require by syntax.

Naturally if the destination table existed and contained data, the query could have to get a bit more complicated. Special care must be taken to make sure no unwanted merging or overwriting of columns happens in the process. The actual query will vary dramatically depending on the specific use case.


Last modified: Feb 24 2017 09:27:52 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.