Scheduled Jobs (Web Console)
Treasure Data has a scheduler feature that supports periodic query execution. We take great care in distributing and operating our scheduler in order to achieve high availability. By using this feature, you no longer need a
cron daemon within your local datacenter.
Table of Contents
- Basic knowledge of Treasure Data
- A table with some data. An example is provided in the Getting Started guide.
Create the Schedule (Web Console)
A new schedule can be created using the Web Console. You can do so by selecting the “New Query” pane, and setting a schedule for the query.
There are three predefined Cron schedules:
|@hourly||Run once an hour||0 * * * *|
|@daily||Run once a day at midnight||0 0 * * *|
|@monthly||Run once a month at midnight on the morning of the first day of the month||0 0 1 * *|
For any other scheduling definition, please select the ‘Other cron…’ option. The cron specification format is based off the cron-spec gem. The five required fields are:
* * * * * - - - - - | | | | | | | | | +----- day of week (0 - 6) (Sunday=0) | | | +---------- month (1 - 12) | | +--------------- day of month (1 - 31) | +-------------------- hour (0 - 23) +------------------------- min (0 - 59)
The following named entries can be used:
- Day of Week: sun, mon, tue, wed, thu, fri, sat
- Month: jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
A single space is required between each field. The values for each of the fields can be composed of:
- a single value, within the limits displayed above for each field.
- a wildcard ‘*’ to indicate no restriction based on the field. E.g. ‘0 0 1 * *’ configures the schedule to run at midnight (00:00) on the first day of each month.
- a range ‘2-5’, indicating the range of accepted values for the field. E.g. ‘0 0 1-10 * *’ configures the schedule to run at midnight (00:00) on the first 10 days of each month.
- a list of comma-separated values ‘2,3,4,5’, indicating the list of accepted values for the field. E.g. ‘0 0 1,11,21 * *’ configures the schedule to run at midnight (00:00) every 1st, 11th, and 21st day of each month.
- a periodicity indicator ‘*/5’ to express how often based on the field’s valid range of values a schedule is allowed to run. E.g. ‘30 */2 1 * *’ configures the schedule to run on the 1st of every month, every 2 hours starting at 00:30. ‘0 0 */5 * *’ configures the schedule to run at midnight (00:00) every 5 days starting on the 5th of each month.
- a comma-separated list of any of the above except the ‘*’ widlcard is also supported ‘2,*/5,8-10’. E.g. ‘0 0 5,*/10,25 * *’ configures the schedule to run at midnight (00:00) every 5th, 10th, 20th, and 25th day of each month.
When two specification provide conflicting schedule specifications, the specification requesting to execute more often is followed while the other is ignored. For example, if the cron schedule is ‘0 0 1 * 1’, the ‘day of month’ specification and ‘day of week’ are discordant because the former requires to run every 1st day of each month at midnight (00:00) while the latter requires to run every Monday at midnight (00:00): in this case the latter is followed. The same concept applies to all other fields.
For more information, please refer to this the Wikipedia page on Cron and the Crontab Linux ‘man’ page](http://linux.die.net/man/5/crontab).
Example: Daily KPIs
A common pattern is to periodically calculate the fixed KPIs or metrics in a certain interval.
SELECT TD_TIME_FORMAT(TIME, "yyyy-MM-dd") AS day, COUNT(1) AS cnt FROM www_access GROUP BY TD_TIME_FORMAT(TIME, "yyyy-MM-dd"), WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME())
We can schedule the above query to run on a daily basis by setting the following parameters:
- Recurring?: @daily
- Delay (seconds): 3600
The example above aggregates the daily page views from an access log. It makes use of several common UDFs to set the proper time range for the aggregation. TD_SCHEDULED_TIME() returns the time when the job gets scheduled to be run. In this case, TD_SCHEDULED_TIME() returns 12 am.
By setting the delay parameter to an hour (3600 seconds), we launch the job at 1 AM each day – thus allowing some additional time (an hour in this case) for data to be imported before we run the job.
See link below on how to write the result to an end-system of your choice to track KPIs.
Example: List of Daily Active Users
Another common pattern is to periodically export results to a Treasure Data table, which in turn can be used by other jobs.
SELECT user_id FROM www_access GROUP BY user_id WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME())
Like the previous example, we can schedule the above query to run on a daily basis by setting the following parameters:
- Recurring?: @daily
- Delay (seconds): 3600
The example above aggregates the list of daily active users from an access log.
You can write this list to another Treasure Data table by setting the following parameters:
- Export Result To: Treasure Data
- Database: YOUR_DATABASE
- Table: YOUR_TABLE
- Mode: Replace
Since writing the result into another table is atomic, you can reference this table from other queries at any time.
Example: Data Mart Generation
One more common pattern is to periodically summarize data from logs, and build a datamart by pushing results to an RDBMS.
SELECT user, code, method, path, agent, host, avg(size) FROM www_access GROUP BY user, code, method, path, agent, host WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1h'), TD_SCHEDULED_TIME())
To schedule this job, we set the following parameters:
- Recurring?: @hourly
- Delay (seconds): 600
The example above aggregates web request results by user, code, path, agent, host, and average size from an access log on an hourly basis.
See link below on how to write the result into another RDBMS:
Once you have pushed the query results to an external RDBMS, you can slice and dice the results using the BI tool(s) of your choice.
Update the Schedule
Most of the settings of scheduled queries can be updated/reconfigured anytime. This includes the cron schedule, delay, name, query string, etc… On the other hand, the ‘next scheduled run time’ (also referred to as ‘Next run’ in the Web Console) is an inferred configuration that is auto updated based on current time, cron schedule, and delay setting.
When a scheduled query is updated, the updated settings will apply immediately to the next scheduled run, occurring at ‘next scheduled run time + delay’ (with delay >0, <0 or 0).
Last modified: Sep 01 2016 21:36:41 UTC