MCMD processes tabular data in CSV format (Comma Separated Values) as illustrated in Figure 2.5. CSV is a de facto standard format for table-structured data. It is widely used as a tabular format to import and export data between application programs.
Product code, product name, classification, price
0899781,bread, food,128
8879674,orange juice, beverage,98
3244565,cheese, food,350
6711298,bowl, tableware,168
However, CSV is not a standard endorsed by a standards organization or a corporate initiative; as a result, the way CSV is handled currently differs from one software vendor to another. RFC 4180, proposed in October 2005, is an effort to formalize CSV as an Internet standard and a significant move toward increasing the portability of CSV. The Augmented Backus-Naur Form (ABNF) grammar for CSV files in RFC 4180 is shown in Figure 2.6.
(A) file = [header CRLF] record *(CRLF record) [CRLF]
(B) header = name *(COMMA name)
(C) record = field *(COMMA field)
(D) name = field
(E) field = (escaped / non-escaped)
(F) escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
(G) non-escaped = *TEXTDATA
(H) COMMA = %x2C
(I) CR = %x0D ;as per section 6.1 of RFC 2234 [2]
(J) DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]
(K) LF = %x0A ;as per section 6.1 of RFC 2234 [2]
(L) CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
(M) TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
The meaning of each line in Figure 2.6 is as follows.
(A) A file consists of an optional header followed by one or more records. A line break (CRLF) follows the header and separates consecutive records. The line break (CRLF) after the last record is optional.
(B) A header consists of one or more names separated by commas.
(C) A record consists of one or more fields separated by commas.
(D) A name has the same form as a field.
(E) A field is either escaped or non-escaped.
(F) An escaped field is enclosed in double quotes (DQUOTE) and contains zero or more text characters (TEXTDATA), commas (COMMA), line break characters (CR or LF), or pairs of consecutive double quotes (2DQUOTE). A double quote inside the data is therefore escaped by doubling it.
(G) A non-escaped field consists of zero or more text characters (TEXTDATA).
(H) ASCII code of comma in hexadecimal is 2C.
(I) ASCII code of carriage return (CR) in hexadecimal is 0D.
(J) ASCII code of double quotation (DQUOTE) in hexadecimal is 22.
(K) ASCII code of line feed (LF) in hexadecimal is 0A.
(L) Line break or newline is represented as a carriage return + line feed.
(M) Text characters (TEXTDATA) cover the ranges 20-21, 23-2B, and 2D-7E in hexadecimal ASCII code, that is, the printable ASCII characters other than the double quote (22) and the comma (2C).
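As an illustration, rules (C) and (E) to (M) can be transcribed almost literally into a regular expression. The following Python sketch is not part of MCMD; the names TEXTDATA, RECORD and is_valid_record are chosen here for illustration only, and the function checks a single record without its terminating line break.

```python
import re

# Rule (M): printable ASCII except the double quote (22) and the comma (2C).
TEXTDATA = r"[\x20-\x21\x23-\x2B\x2D-\x7E]"
# Rule (F): quoted field; a literal double quote is written as "".
ESCAPED = rf'"(?:{TEXTDATA}|,|\r|\n|"")*"'
# Rule (G): zero or more text characters, no quoting.
NON_ESCAPED = rf"{TEXTDATA}*"
# Rule (E): a field is either escaped or non-escaped.
FIELD = rf"(?:{ESCAPED}|{NON_ESCAPED})"
# Rule (C): one or more fields separated by commas.
RECORD = re.compile(rf"{FIELD}(?:,{FIELD})*")


def is_valid_record(line: str) -> bool:
    """Return True if `line` matches rule (C) of the RFC 4180 grammar."""
    return RECORD.fullmatch(line) is not None
```

For example, is_valid_record('"abc,def",2') is accepted, while is_valid_record('abc"def,2') is rejected because a double quote may only appear inside an escaped field.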
KGMOD (MCMD) adds the following rules to the CSV format defined above.
The number of fields must be the same in all the rows.
The maximum length of a single row is limited (the default is 1024000 bytes (1MB), expandable up to 10MB).
Line break is only marked with Line Feed (LF).
Line break is mandatory even in the last record.
Added the 80-FF (hexadecimal) range to text characters for handling multibyte characters.
The mchkcsv command can be used to verify whether a CSV file meets the above definition.
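To make the additional rules concrete, the following Python sketch checks four of them on raw bytes: equal field counts, the row length limit, LF-only line breaks, and the mandatory final line break. It is an illustration only and is not how mchkcsv is implemented; the function name check_mcmd_rules and the message texts are hypothetical. It assumes that the row length includes the terminating LF and that rows are numbered from 0 starting with the header, and it does not check the character ranges.

```python
QUOTE, COMMA, LF, CR = 0x22, 0x2C, 0x0A, 0x0D


def check_mcmd_rules(data: bytes, max_row_len: int = 1024000) -> list:
    """Return a list of rule violations (empty if none were found)."""
    problems = []
    if data and not data.endswith(b"\n"):
        problems.append("last record has no trailing LF")
    expected = None     # field count of the first row
    fields = 1          # fields seen so far in the current row
    row_start = 0       # byte offset where the current row begins
    row_no = 0
    in_quotes = False
    for pos, byte in enumerate(data):
        if byte == QUOTE:
            in_quotes = not in_quotes   # a doubled quote toggles twice
        elif byte == COMMA and not in_quotes:
            fields += 1
        elif byte == LF and not in_quotes:
            # An unquoted LF ends the row; now apply the per-row rules.
            if pos > row_start and data[pos - 1] == CR:
                problems.append(f"row {row_no}: CRLF line break")
            if pos + 1 - row_start > max_row_len:
                problems.append(f"row {row_no}: longer than {max_row_len} bytes")
            if expected is None:
                expected = fields
            elif fields != expected:
                problems.append(f"row {row_no}: {fields} fields, expected {expected}")
            fields, row_start, row_no = 1, pos + 1, row_no + 1
    return problems
```

Calling check_mcmd_rules(b"f1,f2\na\n") reports that row 1 has 1 field where 2 were expected, while a line break inside a quoted field is not counted as the end of a row.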
MCMD reads and writes CSV files in the steps listed below.
Read file into memory blocks.
Split the comma-delimited string into fields, taking escape characters into account.
Interpret escape characters and convert to original data (except DQUOTE).
Run the specific processing function of the command and write the results to the output buffer.
Add character escapes if necessary.
Output to a file when buffer is full.
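The splitting, unescaping and re-escaping steps above can be sketched in Python as follows. This is generic CSV handling written for illustration, not MCMD's internal code: the function names split_record, escape_field and join_record are hypothetical, block-wise reading and output buffering are left out, and the sketch simply collapses doubled double quotes, so it does not model the DQUOTE exception noted in the list.

```python
def split_record(line: str) -> list:
    """Split on unquoted commas and undo the quoting of each field."""
    fields, buf, in_quotes, i = [], [], False, 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if line[i + 1:i + 2] == '"':    # doubled quote: literal "
                    buf.append('"')
                    i += 1
                else:                           # closing quote
                    in_quotes = False
            else:
                buf.append(ch)                  # commas and line breaks are data
        elif ch == '"':
            in_quotes = True                    # opening quote
        elif ch == ",":
            fields.append("".join(buf))         # field separator
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))
    return fields


def escape_field(value: str) -> str:
    """Quote a value only when it contains a comma, quote or line break."""
    if any(c in value for c in ',"\n\r'):
        return '"' + value.replace('"', '""') + '"'
    return value


def join_record(fields) -> str:
    """Build an output record, adding escapes where necessary."""
    return ",".join(escape_field(f) for f in fields)
```

Under these assumptions, join_record(split_record('"abc",efg')) yields abc,efg, because the quotes around abc are not needed on output.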
Points to note when preparing CSV data are described below with examples.
Escape comma characters in data by enclosing the field in double quotes. The following is a CSV file comprising two fields, f1 and f2. The data in row 0 of column f1 is enclosed in double quotes because it contains a comma.
f1,f2
"abc,def",2
xyz,2
Data containing double quote characters is represented by doubling each double quote and enclosing the field in double quotes. The following CSV file consists of two columns, f1 and f2. The data in rows 0 and 1 of column f1 is escaped with double quotes; the original values are abc"def and " respectively.
f1,f2
"abc""def",2
"""",2
Data that includes a line break can be processed when it is enclosed in double quotes. In row 0 of column f1, a line break follows abc; because the data is enclosed in double quotes, the line break is treated as part of the data rather than as the end of the row.
f1,f2
"abc
def",1
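The three quoting cases above (a comma, a double quote, and a line break inside a field) can be reproduced with Python's standard csv module, which follows the same quoting conventions. This is only a way to experiment with the format outside MCMD; the helper name parse_csv is chosen here for illustration.

```python
import csv
import io


def parse_csv(text: str) -> list:
    """Parse CSV text into a list of rows, each a list of field values."""
    # newline="" keeps line breaks inside quoted fields untouched.
    return list(csv.reader(io.StringIO(text, newline="")))


comma_case = parse_csv('f1,f2\n"abc,def",2\nxyz,2\n')
quote_case = parse_csv('f1,f2\n"abc""def",2\n"""",2\n')
break_case = parse_csv('f1,f2\n"abc\ndef",1\n')

print(comma_case[1][0])         # abc,def
print(quote_case[1][0])         # abc"def
print(quote_case[2][0])         # "
print(repr(break_case[1][0]))   # 'abc\ndef'
```

In each case the header is row index 0 of the returned list, so the first data row discussed in the text is at index 1.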
Double quotes in data are removed in the output where unnecessary.
$ more data.csv
f1,f2
"abc",efg
abc,"efg"
$ mcut f=f1,f2 i=data.csv
f1,f2
abc,efg
abc,efg
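The same effect can be observed outside MCMD with Python's csv module, whose minimal quoting mode writes double quotes only where a field needs them. The sketch below is an analogy to the mcut output above, not a description of how mcut works; the function name normalize_quotes is hypothetical.

```python
import csv
import io


def normalize_quotes(text: str) -> str:
    """Re-write CSV text, keeping double quotes only where they are required."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(csv.reader(io.StringIO(text, newline="")))
    return out.getvalue()


print(normalize_quotes('f1,f2\n"abc",efg\nabc,"efg"\n'), end="")
# f1,f2
# abc,efg
# abc,efg
```

A field that does need quoting, such as "abc,def", keeps its double quotes after the round trip.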