csv parser plugin for Data Connector

The csv parser plugin for Data Connector parse csv data. It takes 15 options as explained below. Detail is here.

Option Description
delimiter Delimiter character such as , for CSV, “\t” for TSV, “|” or any single-byte character
quote The character surrounding a quoted value. Setting null disables quoting.
escape Escape character to escape a special character. Setting null disables escaping.
skip_header_lines Skip this number of lines first. Set 1 if the file has header line.
null_string If a value is this string, converts it to NULL. For example, set \N for CSV files created by mysqldump
trim_if_not_quoted If true, remove spaces of a value if the value is not surrounded by the quote character
comment_line_marker Skip a line if the line begins with this string
allow_optional_columns If true, set null to insufficient columns. Otherwise, skip the row in case of insufficient number of columns
allow_extra_columns If true, ignore too many columns. Otherwise, skip the row in case of too many columns
max_quoted_size_limit Maximum number of bytes of a quoted value. If a value exceeds the limit, the row will be skipped
stop_on_invalid_record Stop bulk load transaction if a file includes invalid record (such as invalid timestamp)
default_timezone Time zone of timestamp columns if the value itself doesn’t include time zone description (eg. Asia/Tokyo)
newline Newline character (CRLF, LF or CR)
charset Character encoding (eg. ISO-8859-1, UTF-8)
columns Columns (see below)

The columns option declares the list of columns. This CSV parser plugin ignores the header line.

Column Description
name Name of the column
type Type of the column (see below).

boolean: true or false
long: 64-bit signed integers
timestamp: Date and time with nano-seconds precision
double: 64-bit floating point numbers
string: Strings
format Format of the timestamp if type is timestamp

Table of Contents

Configuration

You can use guess to automatically generate the column settings. Or set the parser section like an example:

in:
...
  parser:
    type: csv
    charset: UTF-8
    newline: CRLF
    delimiter: "\t"
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    skip_header_lines: 1
    comment_line_marker: '#'
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
...

Use Case Example: Simple CSV

Example of Source data

id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50,"white","2016-01-06"
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple door",18.50,"purple","2016-01-09"

Example of load.yml

in:
...
parser:
  charset: UTF-8
  newline: CRLF
  type: csv
  delimiter: ","
  quote: "\""
  escape: "\""
  trim_if_not_quoted: false
  skip_header_lines: 1
  allow_extra_columns: false
  allow_optional_columns: false
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

You can show parsed csv data by preview command.

$ td connector:preview load.yml
+---------+-----------------+--------------+------------+---------------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       |
+---------+-----------------+--------------+------------+---------------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" |
| 3       | "A red door"    | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" |
| 6       | "A black door"  | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" |
+---------+-----------------+--------------+------------+---------------------------+

If your data doesn’t have a time column you may add it using add_time filter option. More details at add_time filter plugin

$ td connector:issue load.yml --database <database name> --table <table name>  --time-column timestamp --auto-create-table

Use Case Example: Using null_string, trim_if_not_quoted, comment_line_marker and max_quoted_size_limit

Source data is.

id,name,price,tag,timestamp
1,"A green door",11.50,"green","2016-01-02"
2,"A blue door",12.50,"blue","2016-01-03"
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05"
5,"A white door",15.50,"   white   ","2016-01-06"
6,"A black door",16.50,    black    ,"2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08"
8,"A purple doooooooooooooooooooooooor",18.50,"purple","2016-01-09"

If you use following load.yml for it.

in:
...
parser:
  charset: UTF-8
  newline: CRLF
  type: csv
  delimiter: ","
  quote: "\""
  escape: "\""
  skip_header_lines: 1
  null_string "red"
  trim_if_not_quoted: true
  comment_line_marker: "2"
  max_quoted_size_limit: 20
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

The result is.

$ td connector:preview load.yml
+---------+-----------------+--------------+---------------+---------------------------+
| id:long | name:string     | price:double | tag:string    | timestamp:timestamp       |
+---------+-----------------+--------------+---------------+---------------------------+
| 1       | "A green door"  | 11.5         | "green"       | "2016-01-02 00:00:00 UTC" |
| 3       | "A red door"    | 13.5         | nil           | "2016-01-04 00:00:00 UTC" |
| 4       | "A pink door"   | 14.5         | "pink"        | "2016-01-05 00:00:00 UTC" |
| 5       | "A white door"  | 15.5         | "   white   " | "2016-01-06 00:00:00 UTC" |
| 6       | "A black door"  | 16.5         | "black"       | "2016-01-07 00:00:00 UTC" |
| 7       | "A yellow door" | 17.5         | "yellow"      | "2016-01-08 00:00:00 UTC" |
+---------+-----------------+--------------+---------------+---------------------------+
  • null_string replaced the “red” tag to nil. (“A red door” is not target, because of not full match.)
  • trim_if_not_quoted trimed “black” tag’s Beginning and Ending spaces because it’s not quoted. (“white” tag is not target, because it’s quoted.)
  • comment_line_marker skipped “blue” record because it raw beginning with “2”.
  • max_quoted_size_limit skipped “purple” record because it name exceeded value.

Use Case Example: Difference of allow_optional_columns false(default) and true

Source data is.

id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaa

load.yml setting as below

in:
...
parser:
  ...
  allow_optional_columns: [ false / true ]
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
  - {name: additional, type: string}
filters: []
out: {mode: append}
exec: {}

Result of false doesn’t have “red” and “black” raws because they don’t have “additional” column value

+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" | "aaa"             |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" | "aaa"             |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" | "aaa"             |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" | "aaa"             |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" | "aaa"             |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" | "aaa"             |
+---------+-----------------+--------------+------------+---------------------------+-------------------+

On the other hand, result of true is filled null.

+---------+-----------------+--------------+------------+---------------------------+-------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       | additional:string |
+---------+-----------------+--------------+------------+---------------------------+-------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" | "aaa"             |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" | "aaa"             |
| 3       | "A red door"    | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" | nil               |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" | "aaa"             |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" | "aaa"             |
| 6       | "A black door"  | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" | nil               |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" | "aaa"             |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" | "aaa"             |
+---------+-----------------+--------------+------------+---------------------------+-------------------+

Use Case Example: Difference of allow_extra_columns false(default) and true

Source data is.

id,name,price,tag,timestamp,additional
1,"A green door",11.50,"green","2016-01-02",aaa
2,"A blue door",12.50,"blue","2016-01-03",aaa
3,"A red door",13.50,"red","2016-01-04"
4,"A pink door",14.50,"pink","2016-01-05",aaa
5,"A white door",15.50,"white","2016-01-06",aaa
6,"A black door",16.50,"black","2016-01-07"
7,"A yellow door",17.50,"yellow","2016-01-08",aaa
8,"A purple door",18.50,"purple","2016-01-09",aaa

load.yml setting as below

in:
...
parser:
  ...
  allow_extra_columns: [ false / true ]
  columns:
  - {name: id, type: long}
  - {name: name, type: string}
  - {name: price, type: double}
  - {name: tag, type: string}
  - {name: timestamp, type: timestamp, format: '%Y-%m-%d'}
filters: []
out: {mode: append}
exec: {}

Result of false shows only “red” and “black” raws because “additional” column isn’t defined in columns and other raws have un-defined column’s value.

+---------+----------------+--------------+------------+---------------------------+
| id:long | name:string    | price:double | tag:string | timestamp:timestamp       |
+---------+----------------+--------------+------------+---------------------------+
| 3       | "A red door"   | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" |
| 6       | "A black door" | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" |
+---------+----------------+--------------+------------+---------------------------+

On the other hand, result of true shows all raws without “additional” column.

+---------+-----------------+--------------+------------+---------------------------+
| id:long | name:string     | price:double | tag:string | timestamp:timestamp       |
+---------+-----------------+--------------+------------+---------------------------+
| 1       | "A green door"  | 11.5         | "green"    | "2016-01-02 00:00:00 UTC" |
| 2       | "A blue door"   | 12.5         | "blue"     | "2016-01-03 00:00:00 UTC" |
| 3       | "A red door"    | 13.5         | "red"      | "2016-01-04 00:00:00 UTC" |
| 4       | "A pink door"   | 14.5         | "pink"     | "2016-01-05 00:00:00 UTC" |
| 5       | "A white door"  | 15.5         | "white"    | "2016-01-06 00:00:00 UTC" |
| 6       | "A black door"  | 16.5         | "black"    | "2016-01-07 00:00:00 UTC" |
| 7       | "A yellow door" | 17.5         | "yellow"   | "2016-01-08 00:00:00 UTC" |
| 8       | "A purple door" | 18.5         | "purple"   | "2016-01-09 00:00:00 UTC" |
+---------+-----------------+--------------+------------+---------------------------+

Last modified: May 25 2016 01:45:40 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.