About Incremental Loading

Many batch input connectors in Treasure Data CDP support incremental data loading. This feature is useful in cases where:

  • Data sources are too large to fully re-import regularly.
  • Frequent imports of updated data are required to keep information fresh.
  • The number of rows ingested should be minimized to optimize usage.

How Incremental Load Works

To enable incremental loading, the data source must have one or more columns (e.g., an id or created_date column) whose values can be used to identify records added since the last run. How you enable incremental load differs depending on whether you are using the Integration Hub or Treasure Workflow.

Incremental Load via Integration Hub

When a source is configured through the Integration Hub, the incremental load mechanism is managed by Treasure Data. The process works as follows:

  1. Users specify one or more incremental columns (e.g., id, created_at).
  2. The input connector is scheduled to run periodically.
  3. Treasure Data automatically records the latest values of the incremental columns in a special section called Config Diff.
  4. On each scheduled execution, the connector fetches only new records based on these recorded values.

Since Treasure Data maintains and updates last_records automatically, users do not need to manually configure parameters for each execution.
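
As a rough illustration (field names and values are examples only, and the exact layout varies by connector), the state recorded in Config Diff after a run that used id and created_at as incremental columns might look like this:

```yaml
# Illustrative Config Diff contents; values are examples only.
in:
  last_record:
    - 104325                          # highest id fetched in the last run
    - "2024-05-01T00:00:00.000000"    # latest created_at fetched in the last run
```

On the next scheduled run, the connector resumes from these recorded values instead of re-reading the entire source.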

This approach is useful when you simply want to fetch the target objects that have changed since the previous scheduled run.

Database integrations, such as MySQL, BigQuery, and SQL Server, require one or more column or field names in order to load data incrementally. For example:
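
A MySQL source configured for incremental loading might look like the sketch below (connection details, table, and column names are placeholders, and the exact keys available depend on the integration):

```yaml
in:
  type: mysql
  host: db.example.com          # placeholder connection details
  port: 3306
  user: td_import
  password: "********"
  database: production
  table: orders
  incremental: true             # enable incremental loading
  incremental_columns:          # columns used to detect newly added rows
    - id
    - created_at
out:
  mode: append                  # append only the newly fetched rows
```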

Incremental Load via Treasure Workflow

When configuring incremental loading through Treasure Workflow, the process must be managed explicitly by the user. Unlike the UI-based approach, Treasure Workflow does not automatically track incremental values. Instead:

  1. Users define their own incremental logic using workflow variables such as session_time or other workflow parameters.
  2. This session_time value is injected into the connector configuration file.
  3. Each time the workflow runs, it uses the specified session time to determine which data to import.
  4. If a workflow session fails, the same session_time can be reused to ensure data consistency.

This approach is designed for greater flexibility, allowing users to handle scenarios such as backfill operations where historical data needs to be reloaded.
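
To make the timing concrete, the following comment-only sketch shows how the built-in session variables might resolve for a workflow scheduled daily (dates are illustrative):

```yaml
# Illustrative values for a workflow scheduled daily at 00:00 UTC:
#
#   session_time      = 2024-05-02T00:00:00+00:00   # logical time of this run
#   last_session_time = 2024-05-01T00:00:00+00:00   # logical time of the previous run
#
# A connector filter such as
#   created_at >= last_session_time AND created_at < session_time
# therefore imports one day of data per session. Because both values are
# fixed for a session, a retried session re-imports the same window.
```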

Create a YML file for incremental loading

In this approach, the incremental-loading feature of the input connector is not used. Instead, workflow variables are injected into the connector configuration, which allows it to load a different slice of the data in each workflow session.

  1. Create a workflow definition that declares the use of a workflow variable. Treasure Workflow provides a variety of built-in variables that can be useful for incremental loading and other purposes; a combined sketch of the workflow definition and connector configuration follows this list.

  2. Inject the last_session_time variable into the input connector configuration file.

  3. Whenever the input connector is triggered by the workflow, the appropriate last session time is injected into the connector configuration file. If a failed workflow session is retried, the variable value remains unchanged, so the retry loads the same data window.
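
Putting the steps together, a minimal sketch might look like the following. The file names, databases, tables, and query are placeholders; the td_load> operator and the session variables are standard Treasure Workflow features, but verify the exact configuration keys against your own connector.

```yaml
# daily_incremental.dig -- workflow definition (names are placeholders)
timezone: UTC

schedule:
  daily>: 03:00:00

+load_new_orders:
  td_load>: config/orders_load.yml   # connector configuration shown below
  database: analytics_db             # destination database in Treasure Data
  table: orders_incremental          # destination table
```

```yaml
# config/orders_load.yml -- connector configuration with injected variables
in:
  type: mysql
  host: db.example.com
  user: td_import
  password: "********"
  database: production
  table: orders
  select: "*"
  # last_session_time and session_time are injected by the workflow at run
  # time, so each session loads only the rows created in its own window.
  where: >-
    created_at >= '${moment(last_session_time).format("YYYY-MM-DD HH:mm:ss")}'
    AND created_at < '${moment(session_time).format("YYYY-MM-DD HH:mm:ss")}'
out:
  mode: append
```

Because both variables are fixed for a given session, retrying a failed session runs the load with exactly the same time window, which keeps the destination table consistent.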

By understanding the differences between these approaches, users can choose the best method based on their specific data ingestion needs.