Visit our new documentation site! This documentation page is no longer updated.

Job SLA Notification

This article describes how to set a Job SLA (Service Level Agreement) for important jobs.

Customers can add magic comments to Hive or Presto queries to configure an SLA. The comments tell the system when a query is running slower than it should, which helps ensure that important jobs complete within their allotted time every day.

This feature is available only to customers in the Enterprise Support tier.

Magic Comments Overview

You can specify three types of settings for the Job SLA notification. All of them are optional.

  • SLA for Job Elapsed Time
  • SLA for Job Completion Time
  • Job Name

The example below indicates that the job daily_important_batch must complete within 3 hours (10800 seconds):

-- @TD name: daily_important_batch
-- @TD alert_duration: 10800

Magic Comment Syntax

Setting SLA for Job Elapsed Time

With this magic comment, you can set the expected maximum execution time in seconds, measured from when the job starts.

-- @TD alert_duration: \d+

Here are a few examples:

# Job needs to finish within 1 hour
-- @TD alert_duration: 3600

# Job needs to finish within 3 hours
-- @TD alert_duration: 10800

# Job needs to finish within 1 day
-- @TD alert_duration: 86400
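
Because alert_duration takes seconds, converting from hours by hand is an easy place to slip. The following is a minimal Python sketch of that conversion; the helper name alert_duration_comment is hypothetical and not part of any Treasure Data tool.

```python
def alert_duration_comment(hours: float) -> str:
    """Render an alert_duration magic comment for a limit given in hours.

    Hypothetical helper for illustration only.
    """
    seconds = round(hours * 3600)  # the magic comment expects whole seconds
    if seconds <= 0:
        raise ValueError("the duration must be positive")
    return f"-- @TD alert_duration: {seconds}"


if __name__ == "__main__":
    # Prints the comment line for a 3-hour limit
    print(alert_duration_comment(3))
```

The returned line can be placed at the top of the Hive or Presto query text, as in the examples above.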

Setting SLA for Job Completion Time

With this magic comment, you can set the time of day by which the job is expected to complete.

-- @TD alert_at: \d\d:\d\d(Z|[+-]\d\d\d\d)

Here are a few examples:

# Job needs to finish by 6 a.m. UTC (on the same day the job started, in UTC)
-- @TD alert_at: 06:00:00

# Job needs to finish by 6 a.m. PDT (on the same day the job started, in PDT)
-- @TD alert_at: 06:00:00-0700
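
The sign of the UTC offset is easy to get wrong: zones west of UTC, such as PDT (UTC-7), take a negative offset. The sketch below builds an alert_at line in Python from an hour, a minute, and an offset in minutes. It follows the HH:MM-plus-offset shape of the pattern line above and is an assumption about the accepted format, not a confirmed specification; the helper name alert_at_comment is hypothetical.

```python
import re

# Pattern line taken from this article; assumed to describe the accepted value.
ALERT_AT_VALUE = re.compile(r"\d\d:\d\d(Z|[+-]\d\d\d\d)")


def alert_at_comment(hour: int, minute: int, utc_offset_minutes: int = 0) -> str:
    """Render an alert_at magic comment (hypothetical helper).

    utc_offset_minutes is negative for zones west of UTC, e.g. -420 for PDT.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour or minute out of range")
    sign = "+" if utc_offset_minutes >= 0 else "-"
    off_h, off_m = divmod(abs(utc_offset_minutes), 60)
    value = f"{hour:02d}:{minute:02d}{sign}{off_h:02d}{off_m:02d}"
    # Sanity check against the documented pattern before returning
    assert ALERT_AT_VALUE.fullmatch(value)
    return f"-- @TD alert_at: {value}"


if __name__ == "__main__":
    print(alert_at_comment(6, 0, -7 * 60))  # 6 a.m. in a UTC-7 zone
```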

Setting the Job Name

With this magic comment, you can set the job name. Although the name has no functional effect, it helps a lot in identifying and categorizing jobs.

-- @TD name: [a-z0-9_]+

Here’s an example:

-- @TD name: daily_important_batch
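
To check which settings a query carries before submitting it, the magic comments can be read back out of the query text. The following Python sketch does this with a generic regular expression; it is an illustration only and makes no claim about how Treasure Data parses these comments. The function name parse_magic_comments is hypothetical.

```python
import re

# Matches lines such as "-- @TD name: daily_important_batch".
# The key/value shape is assumed from the examples in this article.
MAGIC_COMMENT = re.compile(r"^--\s*@TD\s+(\w+):\s*(\S+)$")


def parse_magic_comments(query: str) -> dict:
    """Return the @TD settings found in the query text as a dict of strings."""
    settings = {}
    for line in query.splitlines():
        match = MAGIC_COMMENT.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


if __name__ == "__main__":
    query = "\n".join([
        "-- @TD name: daily_important_batch",
        "-- @TD alert_duration: 10800",
        "SELECT 1",
    ])
    print(parse_magic_comments(query))
```

Lines that do not match the shape, including ordinary SQL comments, are ignored.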

How the Notification Works

This notification is dispatched to Treasure Data’s dedicated SRE and Support teams. When the SLA is violated, the notification triggers an internal escalation that pages the on-call engineer, who then takes appropriate action.

The actions include:

  • Adding more compute capacity
  • Diagnosing the possible underlying issue

We use the magic comment information for on-demand capacity planning, so that we can add compute resources at the right time to get the jobs completed.


Last modified: Apr 06 2018 23:33:49 UTC
