If you decide to use incremental loading, you should determine if you need to create an index on the columns to avoid a full table scan. For example, the following index could be created: 

CREATE INDEX embulk_incremental_loading_index ON table (updated_at, id);

In some cases, the recommendation is to leave incremental_columns unset and let the integration automatically find an AUTO_INCREMENT primary key.

If incremental loading is not chosen, the integration fetches all the records of the specified <integration> object type, regardless of when they were last updated. This mode is best combined with writing data into a destination table using the ‘replace’ mode. 

When using incremental loading for some integrations the following behavior occurs:

If incremental_columns: [updated_at, id] option is set, the query is as follows: 

SELECT *
    FROM ( ...original query is here... )
    ORDER BY updated_at, id

When bulk data loading finishes successfully, it outputs last_record: parameter as config-diff so that the next execution uses it.
At the next execution, when last_record: is also set, and generates an additional WHERE clause to load records larger than the last record. For example, if last_record: ["2017-01-01T00:32:12.000000", 5291] is set:

SELECT * FROM ( ...original query is here... )
    WHERE updated_at > '2017-01-01T00:32:12.000000'
        OR (updated_at = '2017-01-01T00:32:12.000000'
        AND id > 5291)
    ORDER BY updated_at, id

Then, it updates last_record: so that the next execution uses the updated last_record.

For other integrations, the following behavior occurs:

SELECT ...
  FROM `target_table`
  WHERE ((`col1` > ?))

The bind variable (`?` literal in the query above) includes the maximum value of the previous execution. Larger values are recognized as the difference as to the target column (`col1`).

  • No labels