Troubleshoot Submitted Workflows

Introduction

In this tutorial we will go through the process of troubleshooting a workflow that you’ve submitted to Treasure Data and that has had an error you want to investigate.

Prerequisites

Introductory Tutorial

If you haven’t already, start by going through the TD Workflows introductory tutorial.

Below, we will use the workflow project downloaded in that tutorial.

Tutorial

Create error to debug

We will now create an error to debug.

# Start by entering the `nasdaq_analysis` directory from the introductory tutorial.
$ cd nasdaq_analysis

# Create an error for us to debug by introducing a typo (`daily_opne`) into the query
$ cat > queries/monthly_open.sql <<EOF
SELECT TD_DATE_TRUNC('month', time), AVG(daily_avg_open) AS monthly_avg_open, AVG(daily_avg_close) AS month_avg_close
FROM daily_opne
GROUP BY 1
EOF
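
To confirm the typo made it into the file, print it back and check that the FROM clause now reads daily_opne:

$ cat queries/monthly_open.sql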

Push broken workflow to Treasure Data

$ td wf push nasdaq_analysis
# Submitting workflow "nasdaq_analysis"...
# Done!

Start the workflow on Treasure Data’s side

$ td wf start nasdaq_analysis nasdaq_analysis --session now
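
Here --session now uses the current time as the session time. If you want to reproduce a specific day’s session instead, the Digdag CLI behind td wf also accepts an explicit date or timestamp, for example:

$ td wf start nasdaq_analysis nasdaq_analysis --session "2016-05-11 00:00:00"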

Check the status and see that the workflow failed

$ td wf session nasdaq_analysis nasdaq_analysis

You should see the following output:

2016-05-11 16:40:24 +0900: Digdag v0.6.1
Session attempts:
  attempt id: 100
  uuid: ef704e1f-3eb5-4ba7-9be0-4ebfaeee4424
  project: nasdaq_analysis
  workflow: nasdaq_analysis
  session time: 2016-05-11 07:38:15 +0000
  retry attempt name:
  params: {"td":{"apikey":"..."},"last_session_time":"2016-05-11T00:00:00+00:00","next_session_time":"2016-05-12T00:00:00+00:00"}
  created at: 2016-05-11 16:38:17 +0900
  kill requested: false
  status: error
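
If you no longer have this output handy, you can list recent session attempts and their ids at any time:

$ td wf attempts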

Determine what task(s) failed

# In the example above, attempt_id == 100
$ td wf tasks <attempt_id>
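
For the example attempt above, that is:

$ td wf tasks 100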

You should see the following output:

2016-05-16 21:18:19 -0700: Digdag v0.7.1
   id: 1105
   name: +nasdaq_analysis
   state: group_error
   config: {"schedule":{"daily>":"07:00:00"},"_export":{"td":{"database":"workflow_temp"}}}
   parent: null
   upstreams: []
   export params: {"td":{"database":"workflow_temp"}}
   store params: {}
   state params: {}

   id: 1106
   name: +nasdaq_analysis+task1
   state: success
   config: {"td>":"queries/daily_open.sql","create_table":"daily_open"}
   parent: 1105
   upstreams: []
   export params: {}
   store params: {"td":{"last_job_id":"66338029"}}
   state params: {}

   id: 1107
   name: +nasdaq_analysis+task2
   state: error
   config: {"td>":"queries/monthly_open.sql","create_table":"monthly_open"}
   parent: 1105
   upstreams: [1106]
   export params: {}
   store params: {}
   state params: {}

Under the last task listed, named +nasdaq_analysis+task2, you can see state: error, meaning this is the task that failed.

Review logs of the failed task

The command to get the logs for a particular task is as follows:

$ td wf logs <attempt_id> <task_name>

Specifically, run the following:

$ td wf logs <attempt_id> +nasdaq_analysis+task2

You should see the following output:

2016-05-16 21:28:34 -0700: Digdag v0.7.1
2016-05-17 04:17:39.804 +0000 [INFO] (1072@+nasdaq_analysis+task2) io.digdag.core.agent.OperatorManager: td>: queries/monthly_open.sql
2016-05-17 04:17:40.026 +0000 [INFO] (1072@+nasdaq_analysis+task2) io.digdag.standards.operator.td.TdOperatorFactory$TdOperator: Started presto job id=66338037:
DROP TABLE IF EXISTS "monthly_open";
CREATE TABLE "monthly_open" AS
SELECT TD_DATE_TRUNC('month', time), AVG(daily_avg_open) AS monthly_avg_open, AVG(daily_avg_close) AS month_avg_close
FROM daily_opne -- <<< TYPO HERE
GROUP BY 1

2016-05-17 04:17:41.235 +0000 [WARN] (1072@+nasdaq_analysis+task2) io.digdag.standards.operator.td.TdOperatorFactory: Job 66338037:
===
started at 2016-05-17T04:17:40Z
executing query: DROP TABLE IF EXISTS "monthly_open"
Started fetching results.
executing query: CREATE TABLE "monthly_open" AS SELECT TD_DATE_TRUNC('month', time), AVG(daily_avg_open) AS monthly_avg_open, AVG(daily_avg_close) AS month_avg_close FROM daily_opne -- <<< TYPO HERE
GROUP BY 1
finished at 2016-05-17T04:17:40Z

Query 20160517_041740_15306_gf734 failed: line 1:155: Table td-presto.workflow_temp.daily_opne does not exist

===
2016-05-17 04:17:41.244 +0000 [ERROR] (1072@+nasdaq_analysis+task2) io.digdag.core.agent.OperatorManager: Task failed
java.lang.RuntimeException: TD job 66338037 failed with status ERROR
    at io.digdag.standards.operator.td.TDJobOperator.ensureSucceeded(TDJobOperator.java:87)
    at io.digdag.standards.operator.td.TdOperatorFactory.joinJob(TdOperatorFactory.java:276)
    at io.digdag.standards.operator.td.TdOperatorFactory$TdOperator.runTask(TdOperatorFactory.java:163)
    at io.digdag.standards.operator.BaseOperator.run(BaseOperator.java:49)
    at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:255)
    at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:200)
    at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$1(OperatorManager.java:129)
    at io.digdag.core.agent.LocalWorkspaceManager.withExtractedArchive(LocalWorkspaceManager.java:71)
    at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:128)
    at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:105)
    at io.digdag.core.agent.LocalAgent.lambda$run$0(LocalAgent.java:61)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Wow! That’s a lot of text!

Don’t worry, we’re actively working on summarized error logs so that you only see the most important information. In the meantime, let’s walk through what you’re seeing in each of the three sections above, each separated by a ===.

In the first section, above the first ===, you see the workflow running normally and kicking off the task. In the last line of this section you can see the TD job id: 2016-05-17 04:17:41.235 +0000 [WARN] (1072@+nasdaq_analysis+task2) io.digdag.standards.operator.td.TdOperatorFactory: Job 66338037:. You can use this job id in the TD Console to view the error logs in a UI.

In the second section, you see the query logs from Treasure Data: what time the query started, the query that was run, and its final status. In this case the status is Query 20160517_041740_15306_gf734 failed: line 1:155: Table td-presto.workflow_temp.daily_opne does not exist.

The third section starts with a line showing that the task failed, followed by a Java stack trace from Digdag’s internals. You can generally ignore this section, as the traceback is not relevant to debugging your query.

Fix the query

Now let’s fix the query and rerun the workflow.

# Overwrite queries/monthly_open.sql with the corrected query (daily_open, not daily_opne)
$ cat > queries/monthly_open.sql <<EOF
SELECT TD_DATE_TRUNC('month', time), AVG(daily_avg_open) AS monthly_avg_open, AVG(daily_avg_close) AS month_avg_close
FROM daily_open
GROUP BY 1
EOF
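
Optionally, you can sanity-check the corrected SQL before pushing by running it directly with the td CLI. This is a quick spot check, assuming the daily_open table lives in the workflow_temp database from the task config:

$ td query -w -T presto -d workflow_temp "SELECT COUNT(1) FROM daily_open"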

Push the fix to Treasure Data

$ td wf push nasdaq_analysis

Retry the workflow session

Here we will rerun the workflow.

$ td wf retry <attempt_id> --name fix-typo --latest-revision --all
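
Using the example attempt id from earlier, that would be:

$ td wf retry 100 --name fix-typo --latest-revision --all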

Quickly run td wf attempts to see the new session attempt running. Run it again, and you’ll likely see that it has succeeded.

Note how the most recent attempt has the same session time as the previous, failed attempt. This is the benefit of using retry here instead of start: for a daily scheduled workflow where you only want to retry a particular day’s session, any time-related parameters embedded in the workflow keep that day’s values.

Note: in the future, the last command will change from `--all` to `--resume`, which will let you rerun starting at the failed task and all subsequent tasks.

Feedback

If you have any feedback, we welcome your thoughts on our TD Workflows ideas forum.

Also, if you have any ideas or feedback on the tutorial itself, we’d welcome them here!


Last modified: Sep 27 2017 03:55:48 UTC

If this article is incorrect or outdated, or omits critical information, let us know. For all other issues, access our support channels.