2.4 CSV

MCMD processes tabular data in CSV (Comma-Separated Values) format, as illustrated in Figure 2.5. CSV is a de facto standard for table-structured data and is widely used to import and export data between application programs.

Product code, product name, classification, price
0899781,bread, food,128
8879674,orange juice, beverage,98
3244565,cheese, food,350
6711298,bowl, tableware,168
Figure 2.5: Example of CSV data

However, CSV is not a standard endorsed by a standards organization or by a corporate initiative, so the way CSV is handled currently differs from one software vendor to another. RFC 4180, published in October 2005, is an effort to formalize CSV and a significant step toward improving its portability. The Augmented Backus-Naur Form (ABNF) grammar for CSV files given in RFC 4180 is shown in Figure 2.6.

(A) file = [header CRLF] record *(CRLF record) [CRLF]
(B) header = name *(COMMA name)
(C) record = field *(COMMA field)
(D) name = field
(E) field = (escaped / non-escaped)
(F) escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
(G) non-escaped = *TEXTDATA
(H) COMMA = %x2C
(I) CR = %x0D ;as per section 6.1 of RFC 2234 [2]
(J) DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]
(K) LF = %x0A ;as per section 6.1 of RFC 2234 [2]
(L) CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
(M) TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
Figure 2.6: ABNF definition of CSV
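Rules (E) to (G) and (M) of the grammar can be tried out with a regular expression. The following is a minimal sketch in Python, not part of MCMD: the names TEXTDATA, FIELD and is_valid_field are chosen here for illustration, and the check covers a single field only, not a whole file.

```python
import re

# (M) TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
# Printable ASCII without the double quote (%x22) and the comma (%x2C).
TEXTDATA = r"[\x20-\x21\x23-\x2B\x2D-\x7E]"

# (G) non-escaped = *TEXTDATA
NON_ESCAPED = rf"{TEXTDATA}*"

# (F) escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
ESCAPED = rf'"(?:{TEXTDATA}|,|\r|\n|"")*"'

# (E) field = (escaped / non-escaped)
FIELD = re.compile(rf"(?:{ESCAPED}|{NON_ESCAPED})")


def is_valid_field(text):
    """Return True if text is a single field allowed by rules (E) to (G)."""
    return FIELD.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ['bread', '"abc,def"', '"abc""def"', 'abc"def', 'abc,def']:
        print(repr(sample), is_valid_field(sample))
```

Under this sketch, a comma or a double quote is accepted only inside an escaped field, which is the property the notes in section 2.4.3 rely on.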

The meaning of each line in Figure 2.6 is as follows.

2.4.1 Definition specific to KGMOD

KGMOD (MCMD) adds the following rules to the CSV definition above.

To verify whether a CSV file meets the above definition, use the mchkcsv command.

2.4.2 Common input and output process

MCMD reads and writes CSV files in the following sequence.

  1. Read file into memory blocks.

  2. Split each comma-delimited string into fields, taking escape characters into account.

  3. Interpret escape characters and convert the fields back to the original data (except DQUOTE).

  4. Run the specific processing function of the command and write the results to the output buffer.

  5. Add escape characters if necessary.

  6. Write the buffer to a file when it is full.
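Steps 2 to 5 can be sketched for a single record as follows. This is an illustration in Python under stated assumptions, not MCMD's actual implementation: the function names split_record, unescape, escape and process_record are hypothetical, buffering (steps 1 and 6) is left out, and the sketch collapses doubled double quotes in step 3, which may differ from what MCMD does internally given the "(except DQUOTE)" remark.

```python
def split_record(line):
    """Step 2: split on commas that lie outside double quotes."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            # Every double quote toggles the state, so a doubled quote
            # inside an escaped field leaves the state unchanged.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == ',' and not in_quotes:
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
    fields.append(''.join(current))
    return fields


def unescape(field):
    """Step 3 (assumed behavior): strip enclosing quotes, collapse "" to "."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


def escape(value):
    """Step 5: quote a value only if it needs quoting."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value


def process_record(line, func):
    """Steps 2 to 5 for one record; func stands in for the command (step 4)."""
    values = [unescape(field) for field in split_record(line)]
    return ','.join(escape(value) for value in func(values))


if __name__ == "__main__":
    # Swap the two fields of a record whose first field contains a comma.
    print(process_record('"abc,def",2', lambda values: values[::-1]))
```

Because quoting is reapplied only in step 5, a field that was quoted on input without needing it comes out unquoted in this sketch.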

2.4.3 Notes

Points to note when preparing CSV data are described below with examples.

Data containing comma characters

Escape comma characters in data by enclosing the field in double quotes. The following CSV file consists of two fields, f1 and f2. The data in row 0 (see footnote 1) of column f1 is enclosed in double quotes because it contains a comma.

f1,f2
"abc,def",2
xyz,2

Data containing double quotes

A double quote character in data is represented by a pair of consecutive double quotes, and the field is enclosed in double quotes. The following CSV file consists of two columns, f1 and f2. The data in rows 0 and 1 of column f1 is escaped in this way. The original data is abc"def and " respectively.

f1,f2
"abc""def",2
"""",2

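The two records above can be decoded with a parser that follows the same quoting rule. The sketch below uses the csv module from the Python standard library as a stand-in. It is assumed, not confirmed, that MCMD decodes these records identically.

```python
import csv
import io

# The sample file from this note, written out as a string.
raw = 'f1,f2\n"abc""def",2\n"""",2\n'

# newline="" leaves line endings untouched so the csv module can handle them.
rows = list(csv.reader(io.StringIO(raw, newline="")))
header, records = rows[0], rows[1:]

# Column f1 of rows 0 and 1 after the escaping has been removed.
originals = [record[0] for record in records]

if __name__ == "__main__":
    for value in originals:
        print(value)
```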
Line breaks in data

Data that includes a line break can be processed when it is enclosed in double quotes. In the example below, the data in row 0 of column f1 contains a line break after abc. Because the data is enclosed in double quotes, the line break is treated as part of the data rather than as the end of the record.

f1,f2
"abc
def",1
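One consequence is that the number of physical lines in a file can exceed the number of records, so line-oriented tools may miscount. The following sketch, again using Python's standard csv module as an assumed stand-in for a conforming parser, contrasts the two counts for the sample above.

```python
import csv
import io

# The sample file from this note: the first data field spans two lines.
raw = 'f1,f2\n"abc\ndef",1\n'

# Naive view: one entry per physical line.
physical_lines = raw.splitlines()

# Quote-aware view: the line break inside the quotes stays in the field.
rows = list(csv.reader(io.StringIO(raw, newline="")))

if __name__ == "__main__":
    print(len(physical_lines), "physical lines")
    print(len(rows), "rows including the field name row")
    print(repr(rows[1][0]))
```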

Unnecessary double quotes

Double quotes that enclose data without being needed for escaping are removed in the output.

$ more data.csv
f1,f2
"abc",efg
abc,"efg"
$ mcut f=f1,f2 i=data.csv
f1,f2
abc,efg
abc,efg
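The same effect can be reproduced outside MCMD by reading the file and writing it back with minimal quoting. The sketch below uses Python's standard csv module with csv.QUOTE_MINIMAL. It imitates the behavior shown above and makes no claim about how mcut is implemented.

```python
import csv
import io

# Contents of data.csv from the session above.
raw = 'f1,f2\n"abc",efg\nabc,"efg"\n'

rows = list(csv.reader(io.StringIO(raw, newline="")))

# QUOTE_MINIMAL quotes a field only when it contains a delimiter,
# a double quote, or a line break.
out = io.StringIO(newline="")
writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
result = out.getvalue()

if __name__ == "__main__":
    print(result, end="")
```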

Footnotes

  1. MCMD consistently numbers rows from 0, not counting the field name row.