2.4 CSV

MCMD processes tabular data in CSV format (Comma Separated Values) as illustrated in Figure 2.5. CSV is a de facto standard format for table-structured data. It is widely used as a tabular format to import and export data between application programs.

Product code, product name, classification, price
0899781,bread, food,128
8879674,orange juice, beverage,98
3244565,cheese, food,350
6711298,bowl, tableware,168

Figure 2.5: Example of CSV data

However, CSV is not endorsed as a standard by any standards organization or corporate initiative, so the way CSV is handled currently differs from one software vendor to another. RFC 4180, published in October 2005, is an effort to formalize CSV and a significant step toward improving its portability. The Augmented Backus-Naur Form (ABNF) grammar for CSV files given in RFC 4180 is shown in Figure 2.6.

(A) file = [header CRLF] record *(CRLF record) [CRLF]
(B) header = name *(COMMA name)
(C) record = field *(COMMA field)
(D) name = field
(E) field = (escaped / non-escaped)
(F) escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
(G) non-escaped = *TEXTDATA
(H) COMMA = %x2C
(I) CR = %x0D ;as per section 6.1 of RFC 2234 [2]
(J) DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]
(K) LF = %x0A ;as per section 6.1 of RFC 2234 [2]
(L) CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
(M) TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

Figure 2.6: Definition of ABNF for CSV

The meaning of each line in Figure 2.6 is as follows.

(A) A file consists of an optional header followed by one or more records. A line break (CRLF) follows the header and separates the records. The line break after the last record is optional.
(B) A header consists of one or more names separated by commas.
(C) A record consists of one or more fields separated by commas.
(D) A name is a field.
(E) A field is either escaped or non-escaped.
(F) An escaped field is enclosed in double quotes (DQUOTE) and contains zero or more text characters (TEXTDATA), commas (COMMA), newline characters (CR or LF), or pairs of consecutive double quotes (2DQUOTE). A double quote inside the field is escaped by doubling it.
(G) A non-escaped field consists of zero or more text characters (TEXTDATA).
(H) The ASCII code of the comma (COMMA) is 2C in hexadecimal.
(I) The ASCII code of the carriage return (CR) is 0D in hexadecimal.
(J) The ASCII code of the double quote (DQUOTE) is 22 in hexadecimal.
(K) The ASCII code of the line feed (LF) is 0A in hexadecimal.
(L) A line break (CRLF) is a carriage return followed by a line feed.
(M) Text characters (TEXTDATA) cover the ranges 20-21, 23-2B, and 2D-7E in hexadecimal ASCII code, that is, the printable characters other than the double quote and the comma.
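
The grammar in Figure 2.6 can be transcribed almost line by line into a regular expression. The following Python sketch is only an illustration of the grammar, not part of MCMD or of RFC 4180. The function name is_rfc4180 is hypothetical, and the header rule is folded into the record rule because the two have the same shape.

```python
import re

# (M) TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
TEXTDATA = r"[\x20-\x21\x23-\x2B\x2D-\x7E]"
# (F) escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
ESCAPED = rf'"(?:{TEXTDATA}|,|\r|\n|"")*"'
# (G) non-escaped = *TEXTDATA
NON_ESCAPED = rf"{TEXTDATA}*"
# (E) field = (escaped / non-escaped)
FIELD = rf"(?:{ESCAPED}|{NON_ESCAPED})"
# (B), (C) header and record are both comma-separated lists of fields
RECORD = rf"{FIELD}(?:,{FIELD})*"
# (A) file = [header CRLF] record *(CRLF record) [CRLF]
FILE = re.compile(rf"{RECORD}(?:\r\n{RECORD})*(?:\r\n)?")


def is_rfc4180(text: str) -> bool:
    """Return True if the whole text matches the grammar of Figure 2.6."""
    return FILE.fullmatch(text) is not None


print(is_rfc4180('f1,f2\r\n"abc,def",2\r\n'))  # True
print(is_rfc4180('a"b,c'))                     # False: bare quote in a non-escaped field
```

Note that a file using LF alone as the line break does not match this grammar, because LF is only allowed inside an escaped field.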

2.4.1 Definition specific to KGMOD

KGMOD (MCMD) adds the following rules to the CSV definition above.

All rows must have the same number of fields.
The length of a single row is limited. The default limit is 1,024,000 bytes (about 1 MB) and can be raised up to 10 MB.
A line break is represented by a line feed (LF) only.
A line break is mandatory even after the last record.
The range 80-FF (hexadecimal) is added to the text characters to handle multibyte characters.

The mchkcsv command can be used to verify whether a CSV file meets the above definition.
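
To make the additional rules concrete, the following Python sketch checks a byte string against four of them. It is not how mchkcsv is implemented. The function names and messages are hypothetical, and it is an assumption here that the row length limit excludes the trailing LF.

```python
def split_records(data: bytes) -> list[bytes]:
    """Split at LF bytes that lie outside double-quoted fields."""
    records, start, in_quotes = [], 0, False
    for i, byte in enumerate(data):
        if byte == 0x22:                    # DQUOTE toggles the quoted state
            in_quotes = not in_quotes       # (a doubled quote toggles twice)
        elif byte == 0x0A and not in_quotes:
            records.append(data[start:i])
            start = i + 1
    if start < len(data):                   # data after the last LF
        records.append(data[start:])
    return records


def count_fields(record: bytes) -> int:
    """Count fields by counting commas outside double quotes."""
    count, in_quotes = 1, False
    for byte in record:
        if byte == 0x22:
            in_quotes = not in_quotes
        elif byte == 0x2C and not in_quotes:
            count += 1
    return count


def check_kgmod_rules(data: bytes, max_row_bytes: int = 1_024_000) -> list[str]:
    """Return a list of rule violations (empty if none were found)."""
    problems = []
    if not data.endswith(b"\n"):
        problems.append("last record has no line break")
    records = split_records(data)
    for index, record in enumerate(records):
        if record.endswith(b"\r"):
            problems.append(f"record {index}: line break contains CR")
        if len(record) > max_row_bytes:
            problems.append(f"record {index}: exceeds {max_row_bytes} bytes")
    if len({count_fields(r) for r in records}) > 1:
        problems.append("rows have different numbers of fields")
    return problems


print(check_kgmod_rules(b'f1,f2\n"abc\ndef",1\n'))  # []
print(check_kgmod_rules(b"f1,f2\nabc,1"))           # ['last record has no line break']
```

Here the record index counts the field name row as well, unlike the row numbering used by MCMD.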

2.4.2 Common input and output process

MCMD commands read and write CSV files in the following steps.

Read the file into memory blocks.
Split each comma-delimited string into fields, taking escape characters into account.
Interpret the escape characters and convert the fields back to the original data (except DQUOTE).
Run the command-specific processing and write the results to the output buffer.
Add escape characters where necessary.
Write the buffer to the file when it is full.
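
The splitting, unescaping, and re-escaping steps above can be sketched for a single record as follows. This is a simplified Python illustration, not MCMD's internal code. All function names are hypothetical, buffering is omitted, and the sketch also undoes doubled double quotes, which may differ from how MCMD treats DQUOTE internally.

```python
from typing import Callable


def split_fields(record: str) -> list[str]:
    """Split on commas that are outside double quotes, keeping the quotes."""
    fields, current, in_quotes = [], [], False
    for ch in record:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "," and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def unescape(field: str) -> str:
    """Drop the enclosing quotes and undo quote doubling."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


def escape(value: str) -> str:
    """Quote a value only if it contains a comma, quote, or line break."""
    if any(ch in value for ch in ',"\n\r'):
        return '"' + value.replace('"', '""') + '"'
    return value


def process_record(record: str, func: Callable[[list[str]], list[str]]) -> str:
    """Split and unescape a record, apply func, then escape and join."""
    values = [unescape(f) for f in split_fields(record)]
    return ",".join(escape(v) for v in func(values))


# An identity command keeps the needed quotes and drops the unneeded ones.
print(process_record('"abc,def","x",2', lambda v: v))  # "abc,def",x,2
```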

2.4.3 Notes

Points to note when preparing CSV data are described below with examples.

Data containing comma characters

Escape commas in data by enclosing the field in double quotes. The following CSV file consists of two fields, f1 and f2. The data in row 0¹ of column f1 is enclosed in double quotes because it contains a comma.

f1,f2
"abc,def",2
xyz,2
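
As an illustration of why the quotes matter, the following Python sketch compares a naive split on commas with a quote-aware parse of the first data row. It uses Python's standard csv module as a stand-in parser, which is an assumption for illustration and not the parser used by MCMD.

```python
import csv
import io

line = '"abc,def",2'

# A naive split treats the comma inside the quotes as a delimiter.
naive = line.split(",")
print(naive)   # ['"abc', 'def"', '2']

# A quote-aware parser keeps the quoted comma as part of the data.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['abc,def', '2']
```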

Data containing double quotes

A double quote in data is represented by a pair of consecutive double quotes, and the field is enclosed in double quotes. The following CSV file consists of two columns, f1 and f2. The data in rows 0 and 1 of column f1 is escaped in this way. The original values are abc"def and " respectively.

f1,f2
"abc""def",2
"""",2
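
The same file can be reproduced from the original values with a writer that applies the doubling rule. The sketch below uses Python's standard csv module with LF as the line terminator. It illustrates the escaping rule only and makes no claim about how MCMD writes its output.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # LF-only line breaks
writer.writerow(["f1", "f2"])
writer.writerow(['abc"def', "2"])  # original value: abc"def
writer.writerow(['"', "2"])        # original value: a single double quote

escaped_text = buffer.getvalue()
print(escaped_text, end="")
# f1,f2
# "abc""def",2
# """",2
```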

Line breaks in data

Data containing a line break can be processed when it is enclosed in double quotes. In the following example, the data in row 0 of column f1 contains a line break after abc. Because the data is enclosed in double quotes, the line break is treated as part of the data rather than as the end of the row.

f1,f2
"abc
def",1
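
One consequence is that the number of physical lines in a file can exceed the number of records. The following Python sketch shows the difference for the example above, again using the standard csv module as an assumed stand-in for a quote-aware parser.

```python
import csv
import io

text = 'f1,f2\n"abc\ndef",1\n'

# Counting physical lines splits the quoted field in two.
physical_lines = text.splitlines()
print(len(physical_lines))  # 3

# A quote-aware parser sees the header plus one data row.
records = list(csv.reader(io.StringIO(text, newline="")))
print(len(records))         # 2
print(records[1])           # ['abc\ndef', '1']
```

Line-oriented tools that are unaware of quoting may therefore miscount or split such rows.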

Unnecessary double quotes

Double quotes that enclose a field unnecessarily are removed in the output.

$ more data.csv
f1,f2
"abc",efg
abc,"efg"
$ mcut f=f1,f2 i=data.csv
f1,f2
abc,efg
abc,efg
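
The same effect appears with any reader and writer pair that quotes only where needed. The following Python sketch round-trips the data above through the standard csv module, whose default quoting mode is minimal. It mirrors the behavior shown for mcut but is not MCMD code.

```python
import csv
import io

source = 'f1,f2\n"abc",efg\nabc,"efg"\n'

output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")  # default: quote only when needed
writer.writerows(csv.reader(io.StringIO(source, newline="")))

result = output.getvalue()
print(result, end="")
# f1,f2
# abc,efg
# abc,efg
```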

Footnotes

¹ MCMD consistently numbers rows from 0, not counting the field name row.