3.3 mrecount - Row count method for CSV data

Method that counts the number of rows in a CSV data file. The features are as follows.

Implemented in C++ and thus operates at high speed (slightly faster than the UNIX command wc -l).
Counts only the data rows, excluding the field names in the first row.
Counts only line break characters, so a line break inside a double-quoted field is also counted as a row. Use MCMD::Mtable to avoid this problem.
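
The quoted line break caveat can be illustrated without MCMD. The following is a minimal sketch using only Ruby's standard csv library as a stand-in: the sample data and variable names are made up for illustration, and the newline count only imitates the counting rule described above rather than calling mrecount itself.

```ruby
require 'csv'
require 'tempfile'

# Two data rows; the second row has a line break inside a double-quoted field.
csv_text = "customer,memo\nA,plain\nB,\"line one\nline two\"\n"

file = Tempfile.new(['quoted', '.csv'])
file.write(csv_text)
file.close

# Count line break characters and subtract one for the field name row
# (an imitation of the counting rule, not a call to mrecount).
newline_rows = File.read(file.path).count("\n") - 1

# Parse the file as CSV, so the quoted line break stays inside one field.
parsed_rows = CSV.read(file.path, headers: true).size

puts newline_rows # -> 3
puts parsed_rows  # -> 2

file.unlink
```

The difference between the two numbers is the number of line breaks embedded in quoted fields, which is the case where a CSV-aware reader such as Mtable is the safer choice.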

3.3.1 Format

MCMD::mrecount(arguments)

Specify the following arguments as a single character string, separated by spaces.

`i=`	Input file name (String)
`-nfn`	No field names in the first row.

3.3.2 Examples

Example 1 Count rows with and without field names in the first row

# dat1.csv
customer,date,amount
A,20081201,10
B,20081002,40

p MCMD::mrecount("i=dat1.csv")      # -> 2
p MCMD::mrecount("i=dat1.csv -nfn") # -> 3
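
For readers without mtools installed, the effect of -nfn in the example above can be reproduced with a rough pure-Ruby sketch. The helper approx_recount and its nfn keyword are hypothetical names, not part of MCMD, and the logic is only assumed from the feature description (count line breaks, subtract one for the field name row unless -nfn is given).

```ruby
require 'tempfile'

# Hypothetical helper, not part of MCMD: counts line break characters and
# subtracts one for the field name row unless nfn is true.
def approx_recount(path, nfn: false)
  lines = 0
  File.open(path, 'rb') do |f|
    # Read in fixed-size chunks so the whole file is never held in memory.
    while (chunk = f.read(64 * 1024))
      lines += chunk.count("\n")
    end
  end
  nfn ? lines : [lines - 1, 0].max
end

# Same contents as dat1.csv in the example.
file = Tempfile.new(['dat1', '.csv'])
file.write("customer,date,amount\nA,20081201,10\nB,20081002,40\n")
file.close

p approx_recount(file.path)            # -> 2
p approx_recount(file.path, nfn: true) # -> 3
```

This sketch assumes the file ends with a line break; a final row without one would not be counted.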

3.3.3 Related Command

Mtable : Class to read CSV data into cells

3.3.4 Benchmark Test

The row-counting speed of mrecount is benchmarked against the UNIX command wc and Mtable. The results of the benchmark test are shown in Table 3.1. The experiment is carried out for data with one million, two million, three million, four million and five million rows. As shown in the results, mrecount is slightly faster than wc. Mtable is not a class intended for counting rows, but for reference, mrecount is roughly 5 to 7 times faster than Mtable.

An excerpt of the script used in the benchmark test is shown in Figure 3.1.

Table 3.1: Comparison of execution speed among row-counting methods (in seconds)

Number of rows	1000K	2000K	3000K	4000K	5000K
mrecount	0.034	0.066	0.097	0.129	0.161
wc -l	0.038	0.070	0.103	0.133	0.169
Mtable	0.231	0.407	0.503	0.731	0.828

Each value is the average real time over 10 benchmark runs.
1000K rows means one million rows. The data consists of 5 columns, and 1000K rows is about 25MB.
Test environment: MacBook Pro, 2.66GHz Intel Core i7, 8GB memory, Mac OS X 10.6.8

require 'rubygems'
require 'mtools'

require 'benchmark'

puts Benchmark.measure{
  (0...10).each{|i|
    # Case of mrecount
    p MCMD::mrecount("i=data.csv")

    # Case of wc
    system "wc -l data.csv"

    # Case of Mtable
    MCMD::Mtable("i=data.csv -array"){|tbl|
      p tbl.recordSize
    }
  }
}

Figure 3.1: Excerpt of benchmark test script
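
To obtain a separate average per method, as reported in Table 3.1, each candidate can be timed on its own. The following is a minimal sketch of that pattern using Benchmark.realtime. The two candidates are pure-Ruby stand-ins chosen so the script runs without mtools, and the sample file is tiny, so the timings it prints are not comparable to the table. In an actual run, the lambdas would presumably wrap MCMD::mrecount, wc -l and MCMD::Mtable as in the excerpt.

```ruby
require 'benchmark'
require 'tempfile'

runs = 10

# Small 5-column sample file (illustrative only).
file = Tempfile.new(['data', '.csv'])
file.write("a,b,c,d,e\n" + "1,2,3,4,5\n" * 1000)
file.close
path = file.path

# Stand-in candidates; replace the lambda bodies with the methods under test.
candidates = {
  'newline count'  => -> { File.read(path).count("\n") - 1 },
  'line iteration' => -> { File.foreach(path).count - 1 }
}

# Time each candidate separately and average the real time over all runs.
averages = candidates.transform_values do |job|
  total = 0.0
  runs.times { total += Benchmark.realtime { job.call } }
  total / runs
end

averages.each { |name, secs| puts format('%-16s %.6f', name, secs) }
```

Timing each candidate in its own loop keeps one method's cost from being folded into another's, which a single measurement around all three calls cannot do.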