
Workflows Incremental Processing

Introduction

Often the purpose of creating a workflow is to incrementally process the data that regularly arrives in your system. For example, you might have a workflow that processes the data that arrived yesterday.

The benefits of incremental processing include:

  • Reduced total processing, because you avoid reprocessing the same data day after day.
  • Shorter query run times, so you get your results sooner, or more often.

Tutorial

In this tutorial, you set up a basic workflow that processes data incrementally. The workflow that you'll create queries data from the previous day.

The workflow:

timezone: UTC

schedule:
  daily>: 07:00:00

+daily:
  td>: incremental.sql

The query incremental.sql:

SELECT
  *
FROM
  database.table
WHERE
  TD_TIME_RANGE(td, ${last_session_time}, ${session_time})

Workflows include a concept called sessions. A session represents a scheduled run of a workflow. In this tutorial, because the workflow runs daily, the session is defined as the day the workflow runs.

In the SQL file referenced by the workflow definition, you use workflow variables to apply the correct dates to the SQL code, depending on the session day.

For example, if this workflow is set to run on April 2nd at 7:00 am UTC, then:

  • the workflow session is April 2nd
  • the ${session_time} variable is April 2nd at midnight UTC
  • the ${last_session_time} variable is April 1st at midnight UTC (the previous session time)
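The relationship between these variables can be sketched in Python. This is an illustration only, not part of the workflow; `daily_session_variables` is a hypothetical helper, assuming a daily schedule where each session time is midnight UTC of the session day:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper (for illustration): for a daily schedule,
# session_time is midnight UTC of the session day, and
# last_session_time is one schedule interval (one day) earlier.
def daily_session_variables(session_day: datetime):
    session_time = session_day.replace(
        hour=0, minute=0, second=0, microsecond=0, tzinfo=timezone.utc
    )
    last_session_time = session_time - timedelta(days=1)
    return session_time, last_session_time

s, ls = daily_session_variables(datetime(2018, 4, 2))
print(s.isoformat())   # 2018-04-02T00:00:00+00:00
print(ls.isoformat())  # 2018-04-01T00:00:00+00:00
```

With these values, the `TD_TIME_RANGE` condition in the query selects rows whose time column falls within April 1st, i.e. the previous day's data.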

Handling Frequently Running Workflows that Sometimes Take Too Long

Now, what if you have an incrementally running workflow that takes longer than the time-span between sessions?

For example, let's say a workflow runs hourly and normally takes only 30 minutes. But it's the holiday season, and a huge increase in usage of your site means so much data is now being processed that the workflow takes 1 hour and 30 minutes.

During this time period, a second workflow session, scheduled for the next hour, starts running. This second run puts further strain on your available resources, because both sessions are running at the same time.

In a case such as this, it’s best to skip the second hour’s workflow session, and instead use the subsequent session to process 2 hours of data. To do this, you can utilize the following features:

  • Set the 'skip_on_overtime: true' schedule option to control whether a scheduled session is skipped when another session is already running.
  • Use the ${last_executed_session_time} variable, which contains the time of the previously executed session. Its value is usually the same as ${last_session_time}, but the two differ when 'skip_on_overtime: true' has caused a session to be skipped, or on the first execution of a scheduled workflow.
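The skip-and-catch-up behavior can be illustrated with a toy simulation. This is not Digdag's actual scheduler, just a simplified Python model of the assumption above: a session is skipped while a previous run is still in progress, and each executed session processes the range from the last *executed* session time up to its own session time:

```python
from datetime import datetime, timedelta

# Toy model of an hourly schedule with skip_on_overtime semantics.
# running_minutes[h] is how long the session scheduled at hour h runs.
def simulate(hours, running_minutes):
    start = datetime(2018, 4, 2, 0, 0)
    busy_until = start        # when the currently running session finishes
    last_executed = None      # models ${last_executed_session_time}
    ranges = []               # (range start, range end) processed per run
    for h in range(hours):
        session_time = start + timedelta(hours=h)
        if session_time < busy_until:
            continue  # skip_on_overtime: previous session still running
        if last_executed is not None:
            ranges.append((last_executed, session_time))
        busy_until = session_time + timedelta(minutes=running_minutes[h])
        last_executed = session_time
    return ranges

# Hour 1 overruns (90 min), so hour 2 is skipped and hour 3 covers 2 hours.
for range_start, range_end in simulate(4, [30, 90, 30, 30]):
    print(range_start, "->", range_end)
```

In this run, the session at 01:00 takes 90 minutes, so the 02:00 session is skipped; the 03:00 session then processes the full two-hour span from 01:00 to 03:00, because ${last_executed_session_time} still points at 01:00.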

To use these schedule adjustment features, modify the workflow to look as follows:

The workflow:

timezone: UTC

schedule:
  daily>: 07:00:00
  skip_on_overtime: true

+daily:
  td>: incremental.sql

The query incremental.sql:

SELECT
  *
FROM
  database.table
WHERE
  TD_TIME_RANGE(td, ${last_executed_session_time}, ${session_time})

Incremental Data Transfer using a Data Connector

If you want to use a configured data connector to transfer input incrementally, refer to Incremental Data Transfer Input.


Last modified: Apr 19 2018 23:59:47 UTC
