Skip to content
Last updated

CSV Parser Function

The CSV parser plugin for Treasure Data's integrations parses CSV data. It has 15 options as explained in the following table. For more information, see CSV parser plugin.

OptionDescription
delimiterDelimiter character. This can be any single-byte character. For example, for CSV you could us \t, or for TSV you could use
use_string_literalChecked when you want to use special characters such as \t in your configuration
quoteThe character surrounding a quoted value. Setting \0 disables quoting.
escapeEscape character to escape a special character. Setting \0 disables escaping.
skip_header_linesSkip this number of lines first. Set 1 if the file has a header line.
null_stringIf a value is in this string, converts it to NULL. For example, set \N for CSV files created by mysqldump
trim_if_not_quotedIf true, remove spaces of a value if the value is not surrounded by the quote character
comment_line_markerSkip a line if the line begins with this string
allow_optional_columnsIf true, set null to insufficient columns. Otherwise, skip the row in case of insufficient number of columns. Because the CSV parser can't know exactly missing columns when parse row, any missing columns will be considered these last columns.
allow_extra_columnsIf true, ignore too many columns. Otherwise, skip the row in case of too many columns.
max_quoted_size_limitMaximum number of bytes of a quoted value. If a value exceeds the limit, the row will be skipped
stop_on_invalid_recordStop bulk load transaction if a file includes invalid record (such as invalid timestamp)
default_timezoneTime zone of timestamp columns if the value itself doesn’t include time zone description (eg. Asia/Tokyo)
newlineNewline character (CRLF, LF or CR)
line_delimiter_recognizedIf true, the specified newline character is processed as a new line, and the character is also preserved as data.
charsetCharacter encoding (eg. ISO-8859-1, UTF-8). See Supported Values for charset Option in CSV Parser.
quotes_in_quoted_fieldsSpecify how to deal with irregular unescaped quote characters in quoted fields. Specify either ACCEPT_ONLY_RFC4180_ESCAPED or ACCEPT_STRAY_QUOTES_ASSUMING_NO_DELIMITERS_IN_FIELDS.
columnsColumns (see below)

The columns option declares the list of columns. This CSV parser plugin ignores the header line. The order of columns will be used to map csv columns (by their orders), and discard the header (if exists).

ColumnDescription
nameName of the column
typeType of the column (see below).
boolean : true or false
long : 64-bit signed integers
timestamp : Date and time with nano-seconds precision
double : 64-bit floating point numbers
string : Strings
formatFormat of the timestamp if type is timestamp<

Quotes in Quoted Fields Options

The quotes_in_quotes_fields option specifies how to deal with irregular non-escaped stray quote characters.

NameDescription
ACCEPT_ONLY_RFC4180_ESCAPEDDefault. CSV parsing is RFC 4180 compliant.

ACCEPT_STRAY_QUOTES_ASSUMING_NO_DELIMITERS_IN_FIELDS

Accept stray quotes as-is in between quotation marks. If there are delimiters inside the quotation marks, the behavior is undefined.

For example, if the field were "a"b", the middle quotation mark would be treated as just another character, resulting in a"b. However, if the field contained "a""b", the middle two quotation marks would be considered an escaped quotation mark, resulting in a"b.

CSV Parser Configuration

You can use guess to automatically generate the column settings. Or set the parser section like an example:

in:
...
  parser:
    type: csv
    charset: UTF-8
    newline: CRLF
    delimiter: "\t"
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    skip_header_lines: 1
    comment_line_marker: '#'
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
...

CSV Parser Use Case Examples

Here are some example use cases for the CSV parser function:

CSV Parser Use Case Example 1

Here is an example of using simple CSV with the parser function.

Example of Source data

id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50,"white","2016-01-06"
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple door",18.50,"purple","2016-01-09"
in:
...
parser:
  charset: UTF-8
  newline: CRLF
  type: csv
  delimiter: ","
  quote: "\""
  escape: "\""
  trim_if_not_quoted: false
  skip_header_lines: 1
  allow_extra_columns: false
  allow_optional_columns: false
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

You can show parsed CSV data by the preview command.

$ td connector:preview load.yml

If your data doesn’t have a time column you may add it using add_time filter option. More details at add_time filter function.

$ td connector:issue load.yml \
--database <database name> \
--table <table name>  \
--time-column timestamp --auto-create-table

CSV Parser Use Case Example 2

Here are examples for using

  • null_string
  • trim_if_not_quoted
  • comment_line_marker
  • max_quoted_size_limit

Source data is.

id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50,"   white   ","2016-01-06"
6,"A black door",16.50,    black    ,"2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple doooooooooooooooooooooooor",18.50,"purple","2016-01-09"

If you use following load.yml for it.

in:
...
parser:
  charset: UTF-8
  newline: CRLF
  type: csv
  delimiter: ","
  quote: "\""
  escape: "\""
  skip_header_lines: 1
  null_string "red"
  trim_if_not_quoted: true
  comment_line_marker: "2"
  max_quoted_size_limit: 20
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

The preview result is.

$ td connector:preview load.yml
+---------+-----------------+--------------+---------------+---------------------------+
| id:long | name:string     | price:double | tag:string    | timestamp:timestamp       |
+---------+-----------------+--------------+---------------+---------------------------+
| 1       | "A green door"  | 11.5         | "green"       | "2016-01-02 00:00:00 UTC" |
| 3       | "A red door"    | 13.5         | nil           | "2016-01-04 00:00:00 UTC" |
| 4       | "A pink door"   | 14.5         | "pink"        | "2016-01-05 00:00:00 UTC" |
| 5       | "A white door"  | 15.5         | "   white   " | "2016-01-06 00:00:00 UTC" |
| 6       | "A black door"  | 16.5         | "black"       | "2016-01-07 00:00:00 UTC" |
| 7       | "A yellow door" | 17.5         | "yellow"      | "2016-01-08 00:00:00 UTC" |
+---------+-----------------+--------------+---------------+---------------------------+
  • null_string replaced the “red” tag to nil. (“A red door” is not target, because of not full match.)

  • trim_if_not_quoted trimmed “black” tag’s Beginning and Ending spaces because it’s not quoted. (“white” tag is not target, because it’s quoted.)

  • comment_line_marker skipped “blue” record because it raw beginning with “2”.

  • max_quoted_size_limit skipped “purple” record because it name exceeded value.

CSV Parser Use Case Example 3

Here is an example of how setting allow_optional_columns to FALSE (default) or TRUE affects the parsing results.

Source data is.

id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaa

load.yml setting as below

in:
...
parser:
  ...
  allow_optional_columns: [ false / true ]
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
  - {name: additional, type: string}
filters: []
out: {mode: append}
exec: {}

Result of false doesn’t have “red” and “black” rows because they don’t have “additional” column values.

$ td connector:preview load.yml
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" | "aaa"             |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" | "aaa"             |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" | "aaa"             |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" | "aaa"             |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" | "aaa"             |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" | "aaa"             |
+---------+-----------------+--------------+------------+---------------------------+-------------------+

On the other hand, result of true is filled null.

+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" | "aaa"             |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" | "aaa"             |
| 3       | "A red door"    | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" | nil               |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" | "aaa"             |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" | "aaa"             |
| 6       | "A black door"  | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" | nil               |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" | "aaa"             |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" | "aaa"             |
+---------+-----------------+--------------+------------+---------------------------+-------------------+

CSV Parser Use Case Example 4

Here is an example of how setting allow_extra_columns to FALSE (default) or TRUE affects the parsing results.

The source data is:

id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaa

The load.yml setting is as below:

in:
...
parser:
  ...
  allow_extra_columns: [ false / true ]
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

Result of false shows only “red” and “black” rows because “additional” column isn’t defined in columns and other rows have undefined column values.

$ td connector:preview load.yml
+---------+----------------+--------------+------------+---------------------------+
| id:long | name:string    | price:double | tag:string | timestamp:timestamp       |
+---------+----------------+--------------+------------+---------------------------+
| 3       | "A red door"   | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" |
| 6       | "A black door" | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" |
+---------+----------------+--------------+------------+---------------------------+

On the other hand, result of true shows all rows without “additional” columns.

$ td connector:preview load.yml
+---------+-----------------+--------------+------------+---------------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       |
+---------+-----------------+--------------+------------+---------------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" |
| 3       | "A red door"    | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" |
| 6       | "A black door"  | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" |
+---------+-----------------+--------------+------------+---------------------------+

CSV Parser Use Case Example 5

Here is an example of specifying \0 as the quote and escape character.

The source data is:

id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue",2016-01-03
3,"A \"red\" door",13.50,"red",2016-01-04
4,"A pink door",14.50,"pink",2016-01-05
in:
...
parser:
  charset: UTF-8
  newline: CRLF
  type: csv
  delimiter: ","
  quote: "\0"
  escape: "\0"
  trim_if_not_quoted: false
  skip_header_lines: 1
  allow_extra_columns: false
  allow_optional_columns: false
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}
  • All quotation marks are presented in the results because they are not specified as the "quote character" to the parser.
  • Row 1 is not presented because the value of quote is set to \0. The quotation marks make "2016-01-02" and invalid timestamp.
  • Embedded quotation marks and slash marks on row 4 are presented because they are not specified as the "quote character" to the parser.
$ td connector:preview load.yml
+---------+-----------------+--------------+------------+---------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp |
+---------+-----------------+--------------+------------+---------------------+
| 2       | "A blue door"   | 12.5         | "blue"     | 2016-01-03 00:00:00 |
| 3       | "A \"red\" door"| 13.5         | "red"      | 2016-01-04 00:00:00 |
| 4       | "A pink door"   | 14.5         | "pink"     | 2016-01-05 00:00:00 |
+---------+-----------------+--------------+------------+---------------------+