You can specify incremental transfers of data and control when data transfers start as part of managing your data pipeline. Use TD Workflows to ensure that query processing steps in Treasure Data run only after the necessary data has been ingested into your account.
Many batch import integrations support incremental data loading features for various use cases:
- Tables are too large to regularly re-import in their entirety (for example, from large production databases).
- You are running frequent imports of updated data (for example, every 15 minutes) to keep the data fresh.
- You want to minimize the number of rows ingested, to make the most efficient use of your Treasure Data account plan’s capacity.
Incremental processing works by tracking a column value in the records being imported (for example, a time or ID column), and then using the highest value imported during the last ingest as the starting point for the next ingest. For example:

The from_date (which can be last_fetched_date or any incremental field) is updated and stored after each execution, and the new value is used as the starting point for the next run.
For example, using the Mixpanel import integration:
- During the first run, you import all data.
- For subsequent incremental runs, you use the last_fetched_time (the maximum ingestion timestamp from the previous run), as illustrated in the sketch below.
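As a rough sketch only (the type and parameter names such as api_secret, from_date, and fetch_days are assumptions modeled on typical Mixpanel-style import options, not a definitive configuration), the stored incremental value effectively advances the import window on each run:

in:
  type: mixpanel
  api_secret: <YOUR_API_SECRET>
  from_date: "2024-01-01"   # first run: backfill starting from this date
  fetch_days: 1             # assumed window size fetched per run
  incremental: true         # connector stores the last fetched time and advances from_date on the next run
out:
  mode: append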
There are two approaches to triggering an import integration as part of a workflow:
- Use the incremental-loading feature of the integration itself, in which the Unique ID of a pre-defined Source is referred to in the workflow definition
- Utilize the ability to inject the session time of a workflow (via workflow variables) into the integration parameters. In this approach, the import integration parameters are declared in a YAML file
Unique ID-based approach
- Pros
- Create and manage the Source directly in TD Console, making setup more user-friendly.
- Supports incremental logic using IDs, filenames, or time-based fields provided by the integration.
- Self-healing behavior: if one ingestion fails but the next succeeds, the diff is computed from the last successful run.
- Cons
- Not ideal for sources with delayed data because missed records may not be picked up by the default incremental logic.
- Re-running historical workflow sessions will not reload past data—the latest incremental values are stored in the integration configuration.
YAML file-based approach
- Pros
- Offers flexible timing controls; useful when the source system produces late-arriving data (see the time-offset sketch at the end of this section).
- Cons
- Only applies when the incremental logic can be driven by dates.
Create an Authentication for your import integration on the Integration Hub.
Create a Source from the Authentication created above, and select the incremental loading feature.

- Find your Source in Integrations Hub > Sources, select the ... menu, and select Copy Unique ID.

Navigate to Data Workbench > Workflows.
Create a new workflow, or select an existing one and open the workflow definition.
Use the td_load>: command with the Unique ID (for example, s3_v2_import_1725874557) saved in the clipboard, as shown in the sketch after this list.

- The incremental data load occurs every time the workflow is executed, manually or via a schedule
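A minimal sketch of such a workflow definition follows. The Unique ID is the example value from the step above, the schedule is optional and shown only for illustration, and the destination database and table names are placeholders:

schedule:
  daily>: 07:00:00

+incremental_import:
  td_load>: s3_v2_import_1725874557   # Unique ID copied from the Source
  database: my_sample_db              # placeholder destination database
  table: my_sample_table              # placeholder destination table

Because the incremental state lives in the Source itself, each run of this workflow picks up only the records added since the last successful ingest.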
With this approach, you create a custom configuration for your import integration, using the time-based variables provided by the workflow. You can use this method when the incremental flow is based on time.
In this partial example, the integration configuration file is daily_load.yml, which is referred to in the workflow definition. The workflow variable last_session_time, one of many available workflow variables, is used as the incremental start time. Refer to Treasure Workflow Basics for more details about workflow session time and variables.
Navigate to Data Workbench > Workflows.
Create a new workflow, or select an existing one and open the workflow definition.
Use the td_load>: command with the configuration filename, and add a variable for last_session_time.
schedule:
  daily>: 07:00:00

_export:
  td:
    dest_db: my_sample_db
    dest_table: my_sample_table
  wf:
    start_time: "${last_session_time}"

+data_import:
  td_load>: config/daily_load.yml
  database: ${td.dest_db}
  table: ${td.dest_table}

- Edit the config/daily_load.yml to use the start_time variable provided by the workflow.
in:
  type: salesforce
  login_url: https://<YOUR_DOMAIN_NAME>.salesforce.com
  auth_method: token
  username: <YOUR_EMAIL_ADDRESS>
  token: <YOUR_API_TOKEN>
  target: tickets
  start_time: ${wf.start_time} # use workflow variable
out:
  mode: append

- The workflow runs daily and fetches data incrementally since the last_session_time.
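If your source produces late-arriving data, one way to use the flexible timing controls of this approach is to shift the start time back by a fixed margin. The sketch below is an assumption, not a definitive recipe: the two-hour offset is arbitrary, and it relies on the Moment.js helper that Treasure Workflow exposes inside ${...} expressions.

_export:
  wf:
    # Assumption: start two hours before the last session time so that
    # late-arriving records from the previous window are re-fetched.
    start_time: "${moment(last_session_time).subtract(2, 'hours').format('YYYY-MM-DD HH:mm:ss Z')}"

Because the output mode is append, re-fetching an overlapping window can introduce duplicate rows, so deduplicate downstream if your queries require it.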