The CSV parser plugin for Treasure Data's integrations parses CSV data. It has 15 options as explained in the following table. For more information, see CSV parser plugin.
| Option | Description |
|---|---|
| delimiter | Delimiter character. This can be any single-byte character. For example, for CSV you could us \t, or for TSV you could use |
| use_string_literal | Checked when you want to use special characters such as \t in your configuration |
| quote | The character surrounding a quoted value. Setting \0 disables quoting. |
| escape | Escape character to escape a special character. Setting \0 disables escaping. |
| skip_header_lines | Skip this number of lines first. Set 1 if the file has a header line. |
| null_string | If a value is in this string, converts it to NULL. For example, set \N for CSV files created by mysqldump |
| trim_if_not_quoted | If true, remove spaces of a value if the value is not surrounded by the quote character |
| comment_line_marker | Skip a line if the line begins with this string |
| allow_optional_columns | If true, set null to insufficient columns. Otherwise, skip the row in case of insufficient number of columns. Because the CSV parser can't know exactly missing columns when parse row, any missing columns will be considered these last columns. |
| allow_extra_columns | If true, ignore too many columns. Otherwise, skip the row in case of too many columns. |
| max_quoted_size_limit | Maximum number of bytes of a quoted value. If a value exceeds the limit, the row will be skipped |
| stop_on_invalid_record | Stop bulk load transaction if a file includes invalid record (such as invalid timestamp) |
| default_timezone | Time zone of timestamp columns if the value itself doesn’t include time zone description (eg. Asia/Tokyo) |
| newline | Newline character (CRLF, LF or CR) |
| line_delimiter_recognized | If true, the specified newline character is processed as a new line, and the character is also preserved as data. |
| charset | Character encoding (eg. ISO-8859-1, UTF-8). See Supported Values for charset Option in CSV Parser. |
| quotes_in_quoted_fields | Specify how to deal with irregular unescaped quote characters in quoted fields. Specify either ACCEPT_ONLY_RFC4180_ESCAPED or ACCEPT_STRAY_QUOTES_ASSUMING_NO_DELIMITERS_IN_FIELDS. |
| columns | Columns (see below) |
The columns option declares the list of columns. This CSV parser plugin ignores the header line. The order of columns will be used to map csv columns (by their orders), and discard the header (if exists).
| Column | Description |
|---|---|
| name | Name of the column |
| type | Type of the column (see below). boolean : true or false long : 64-bit signed integers timestamp : Date and time with nano-seconds precision double : 64-bit floating point numbers string : Strings |
| format | Format of the timestamp if type is timestamp< |
The quotes_in_quotes_fields option specifies how to deal with irregular non-escaped stray quote characters.
| Name | Description |
|---|---|
| ACCEPT_ONLY_RFC4180_ESCAPED | Default. CSV parsing is RFC 4180 compliant. |
ACCEPT_STRAY_QUOTES_ASSUMING_NO_DELIMITERS_IN_FIELDS | Accept stray quotes as-is in between quotation marks. If there are delimiters inside the quotation marks, the behavior is undefined. For example, if the field were "a"b", the middle quotation mark would be treated as just another character, resulting in a"b. However, if the field contained "a""b", the middle two quotation marks would be considered an escaped quotation mark, resulting in a"b. |
You can use guess to automatically generate the column settings. Or set the parser section like an example:
in:
...
parser:
type: csv
charset: UTF-8
newline: CRLF
delimiter: "\t"
quote: '"'
escape: '"'
null_string: 'NULL'
skip_header_lines: 1
comment_line_marker: '#'
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
out:
...Here are some example use cases for the CSV parser function:
- Simple CSV
- [Using null_string,trim_if_not_quoted,comment_line_marker and max_quoted_size_limit](<csv-parser-function.md#csv-parser-use-case-example-2)
- Effect of FALSE and TRUE with allow_optional_columns
- Effect of FALSE and TRUE with allow_extra_columns
- Using \0 as the quote and escape character
Here is an example of using simple CSV with the parser function.
Example of Source data
id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50,"white","2016-01-06"
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple door",18.50,"purple","2016-01-09"in:
...
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ","
quote: "\""
escape: "\""
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: name, type: string}
- {name: price, type: double}
- {name: tag, type: string}
- {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}You can show parsed CSV data by the preview command.
$ td connector:preview load.ymlIf your data doesn’t have a time column you may add it using add_time filter option. More details at add_time filter function.
$ td connector:issue load.yml \
--database <database name> \
--table <table name> \
--time-column timestamp --auto-create-tableHere are examples for using
- null_string
- trim_if_not_quoted
- comment_line_marker
- max_quoted_size_limit
Source data is.
id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50," white ","2016-01-06"
6,"A black door",16.50, black ,"2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple doooooooooooooooooooooooor",18.50,"purple","2016-01-09"If you use following load.yml for it.
in:
...
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ","
quote: "\""
escape: "\""
skip_header_lines: 1
null_string "red"
trim_if_not_quoted: true
comment_line_marker: "2"
max_quoted_size_limit: 20
columns:
- {name: id, type: long}
- {name: name, type: string}
- {name: price, type: double}
- {name: tag, type: string}
- {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}The preview result is.
$ td connector:preview load.yml+---------+-----------------+--------------+---------------+---------------------------+
| id:long | name:string | price:double | tag:string | timestamp:timestamp |
+---------+-----------------+--------------+---------------+---------------------------+
| 1 | "A green door" | 11.5 | "green" | "2016-01-02 00:00:00 UTC" |
| 3 | "A red door" | 13.5 | nil | "2016-01-04 00:00:00 UTC" |
| 4 | "A pink door" | 14.5 | "pink" | "2016-01-05 00:00:00 UTC" |
| 5 | "A white door" | 15.5 | " white " | "2016-01-06 00:00:00 UTC" |
| 6 | "A black door" | 16.5 | "black" | "2016-01-07 00:00:00 UTC" |
| 7 | "A yellow door" | 17.5 | "yellow" | "2016-01-08 00:00:00 UTC" |
+---------+-----------------+--------------+---------------+---------------------------+null_string replaced the “red” tag to nil. (“A red door” is not target, because of not full match.)
trim_if_not_quoted trimmed “black” tag’s Beginning and Ending spaces because it’s not quoted. (“white” tag is not target, because it’s quoted.)
comment_line_marker skipped “blue” record because it raw beginning with “2”.
max_quoted_size_limit skipped “purple” record because it name exceeded value.
Here is an example of how setting allow_optional_columns to FALSE (default) or TRUE affects the parsing results.
Source data is.
id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaaload.yml setting as below
in:
...
parser:
...
allow_optional_columns: [ false / true ]
columns:
- {name: id, type: long}
- {name: name, type: string}
- {name: price, type: double}
- {name: tag, type: string}
- {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
- {name: additional, type: string}
filters: []
out: {mode: append}
exec: {}Result of false doesn’t have “red” and “black” rows because they don’t have “additional” column values.
$ td connector:preview load.yml+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string | price:double | tag:string | timestamp:timestamp | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1 | "A green door" | 11.5 | "green" | "2016-01-02 00:00:00 UTC" | "aaa" |
| 2 | "A blue door" | 12.5 | "blue" | "2016-01-03 00:00:00 UTC" | "aaa" |
| 4 | "A pink door" | 14.5 | "pink" | "2016-01-05 00:00:00 UTC" | "aaa" |
| 5 | "A white door" | 15.5 | "white" | "2016-01-06 00:00:00 UTC" | "aaa" |
| 7 | "A yellow door" | 17.5 | "yellow" | "2016-01-08 00:00:00 UTC" | "aaa" |
| 8 | "A purple door" | 18.5 | "purple" | "2016-01-09 00:00:00 UTC" | "aaa" |
+---------+-----------------+--------------+------------+---------------------------+-------------------+On the other hand, result of true is filled null.
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string | price:double | tag:string | timestamp:timestamp | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1 | "A green door" | 11.5 | "green" | "2016-01-02 00:00:00 UTC" | "aaa" |
| 2 | "A blue door" | 12.5 | "blue" | "2016-01-03 00:00:00 UTC" | "aaa" |
| 3 | "A red door" | 13.5 | "red" | "2016-01-04 00:00:00 UTC" | nil |
| 4 | "A pink door" | 14.5 | "pink" | "2016-01-05 00:00:00 UTC" | "aaa" |
| 5 | "A white door" | 15.5 | "white" | "2016-01-06 00:00:00 UTC" | "aaa" |
| 6 | "A black door" | 16.5 | "black" | "2016-01-07 00:00:00 UTC" | nil |
| 7 | "A yellow door" | 17.5 | "yellow" | "2016-01-08 00:00:00 UTC" | "aaa" |
| 8 | "A purple door" | 18.5 | "purple" | "2016-01-09 00:00:00 UTC" | "aaa" |
+---------+-----------------+--------------+------------+---------------------------+-------------------+Here is an example of how setting allow_extra_columns to FALSE (default) or TRUE affects the parsing results.
The source data is:
id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaaThe load.yml setting is as below:
in:
...
parser:
...
allow_extra_columns: [ false / true ]
columns:
- {name: id, type: long}
- {name: name, type: string}
- {name: price, type: double}
- {name: tag, type: string}
- {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}Result of false shows only “red” and “black” rows because “additional” column isn’t defined in columns and other rows have undefined column values.
$ td connector:preview load.yml+---------+----------------+--------------+------------+---------------------------+
| id:long | name:string | price:double | tag:string | timestamp:timestamp |
+---------+----------------+--------------+------------+---------------------------+
| 3 | "A red door" | 13.5 | "red" | "2016-01-04 00:00:00 UTC" |
| 6 | "A black door" | 16.5 | "black" | "2016-01-07 00:00:00 UTC" |
+---------+----------------+--------------+------------+---------------------------+On the other hand, result of true shows all rows without “additional” columns.
$ td connector:preview load.yml+---------+-----------------+--------------+------------+---------------------------+
| id:long | name:string | price:double | tag:string | timestamp:timestamp |
+---------+-----------------+--------------+------------+---------------------------+
| 1 | "A green door" | 11.5 | "green" | "2016-01-02 00:00:00 UTC" |
| 2 | "A blue door" | 12.5 | "blue" | "2016-01-03 00:00:00 UTC" |
| 3 | "A red door" | 13.5 | "red" | "2016-01-04 00:00:00 UTC" |
| 4 | "A pink door" | 14.5 | "pink" | "2016-01-05 00:00:00 UTC" |
| 5 | "A white door" | 15.5 | "white" | "2016-01-06 00:00:00 UTC" |
| 6 | "A black door" | 16.5 | "black" | "2016-01-07 00:00:00 UTC" |
| 7 | "A yellow door" | 17.5 | "yellow" | "2016-01-08 00:00:00 UTC" |
| 8 | "A purple door" | 18.5 | "purple" | "2016-01-09 00:00:00 UTC" |
+---------+-----------------+--------------+------------+---------------------------+Here is an example of specifying \0 as the quote and escape character.
The source data is:
id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue",2016-01-03
3,"A \"red\" door",13.50,"red",2016-01-04
4,"A pink door",14.50,"pink",2016-01-05in:
...
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ","
quote: "\0"
escape: "\0"
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: name, type: string}
- {name: price, type: double}
- {name: tag, type: string}
- {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}- All quotation marks are presented in the results because they are not specified as the "quote character" to the parser.
- Row 1 is not presented because the value of quote is set to
\0. The quotation marks make "2016-01-02" and invalid timestamp. - Embedded quotation marks and slash marks on row 4 are presented because they are not specified as the "quote character" to the parser.
$ td connector:preview load.yml+---------+-----------------+--------------+------------+---------------------+
| id:long | name:string | price:double | tag:string | timestamp:timestamp |
+---------+-----------------+--------------+------------+---------------------+
| 2 | "A blue door" | 12.5 | "blue" | 2016-01-03 00:00:00 |
| 3 | "A \"red\" door"| 13.5 | "red" | 2016-01-04 00:00:00 |
| 4 | "A pink door" | 14.5 | "pink" | 2016-01-05 00:00:00 |
+---------+-----------------+--------------+------------+---------------------+