3.3 mrecount - Row count method for CSV data

Method that returns the number of rows in a CSV data file.

3.3.1 Format

MCMD::mrecount(arguments)

Specify the following arguments in arguments as a single character string, separated by spaces.

i=

Input file name (String)

-nfn

No field names in the first row.

3.3.2 Examples

Example 1 Count rows with and without a field name row

# dat1.csv
customer,date,amount
A,20081201,10
B,20081002,40

p MCMD::mrecount("i=dat1.csv")      # -> 2
p MCMD::mrecount("i=dat1.csv -nfn") # -> 3
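
For readers without mtools installed, the counting rule shown in Example 1 can be mimicked in plain Ruby. This is only a minimal sketch, not the implementation of mrecount: the helper name count_rows and its header keyword are hypothetical, and it assumes one record per line (no quoted fields containing line breaks).

```ruby
require 'tempfile'

# Hypothetical helper (not part of mtools): counts the lines of a CSV file,
# skipping the first line when it holds field names.
# Assumes one record per line (no quoted fields containing newlines).
def count_rows(path, header: true)
  lines = File.foreach(path).count
  header ? [lines - 1, 0].max : lines
end

# Recreate dat1.csv from Example 1 in a temporary file.
Tempfile.create(['dat1', '.csv']) do |f|
  f.write("customer,date,amount\nA,20081201,10\nB,20081002,40\n")
  f.flush
  p count_rows(f.path)                 # like "i=dat1.csv"      -> 2
  p count_rows(f.path, header: false)  # like "i=dat1.csv -nfn" -> 3
end
```

With header: false the first line is counted as data, which mirrors the effect of -nfn in the example above.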

3.3.3 Related Command

Mtable : Class to read CSV data into cells

3.3.4 Benchmark Test

The row-counting speed of mrecount is benchmarked against the UNIX command wc and Mtable. The results are shown in Table 3.1. The experiment is carried out for data with one million, two million, three million, four million and five million rows. As the results show, mrecount is slightly faster than wc. Mtable is not a class intended for counting rows, and mrecount is roughly 5 to 7 times faster than Mtable.

An excerpt of the script used in the benchmark test is shown in Figure 3.1.

Table 3.1: Comparison of row-count execution time among mrecount, wc -l and Mtable (in seconds)

Number of rows   1000K   2000K   3000K   4000K   5000K
mrecount         0.034   0.066   0.097   0.129   0.161
wc -l            0.038   0.070   0.103   0.133   0.169
Mtable           0.231   0.407   0.503   0.731   0.828


Each value is the average real time over 10 benchmark runs.
1000K rows means one million rows. The data size for 1000K rows is about 25MB. The data consists of 5 columns.
Test environment: MacBook Pro, 2.66GHz Intel Core i7, 8GB memory, Mac OS X 10.6.8
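
The speed ratios discussed in the benchmark text can be checked directly from the values in Table 3.1. The following sketch does that arithmetic; the constant TIMINGS and the helper slowdown_vs_mrecount are hypothetical names introduced only for this illustration.

```ruby
# Timings in seconds, copied from Table 3.1 (columns: 1000K .. 5000K rows).
TIMINGS = {
  'mrecount' => [0.034, 0.066, 0.097, 0.129, 0.161],
  'wc -l'    => [0.038, 0.070, 0.103, 0.133, 0.169],
  'Mtable'   => [0.231, 0.407, 0.503, 0.731, 0.828]
}.freeze

# Hypothetical helper: how many times slower `other` is than mrecount,
# per data size, rounded to one decimal place.
def slowdown_vs_mrecount(other)
  base = TIMINGS.fetch('mrecount')
  TIMINGS.fetch(other).zip(base).map { |t, b| (t / b).round(1) }
end

p slowdown_vs_mrecount('wc -l')   # -> [1.1, 1.1, 1.1, 1.0, 1.0]
p slowdown_vs_mrecount('Mtable')  # -> [6.8, 6.2, 5.2, 5.7, 5.1]
```

The gap to wc -l stays within about 10 percent at every size, while the gap to Mtable ranges from about 5.1 to 6.8 times.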

require 'rubygems'
require 'mtools'

require 'benchmark'

puts Benchmark.measure{
  (0...10).each{|i|
    # Case of mrecount
    p MCMD::mrecount("i=data.csv")

    # Case of wc
    system "wc -l data.csv"

    # Case of Mtable
    MCMD::Mtable("i=data.csv -array"){|tbl|
      p tbl.recordSize
    }
  }
}

Figure 3.1: Excerpt of benchmark test script
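
The table notes state that each figure is an average real time over 10 runs. One way such an average could be obtained per method is sketched below. This is an assumption about the procedure, not the authors' actual script: the helper average_realtime is hypothetical, and a stand-in workload is used so the sketch runs without mtools.

```ruby
require 'benchmark'

# Hypothetical helper: run the given block `runs` times and return the
# average wall-clock (real) time in seconds.
def average_realtime(runs = 10)
  total = 0.0
  runs.times { total += Benchmark.realtime { yield } }
  total / runs
end

# Stand-in workload. With mtools installed, the block could instead call
# MCMD::mrecount("i=data.csv") to time that method on its own.
avg = average_realtime(10) { (1..100_000).reduce(:+) }
printf("average real time: %.6f s\n", avg)
```

Timing each method in its own call keeps the measurements separate, so one method's run time does not leak into another's average.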