Method to count the rows of a CSV data file. Its features are as follows.
Implemented in C++ and thus operates at high speed (slightly faster than the UNIX command wc -l).
Counts only the data rows, excluding the field names in the first row.
Counts only line-break characters, so line breaks inside double-quoted fields are also counted. Use MCMD::Mtable to avoid this problem.
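The two counting rules above can be illustrated in plain Ruby. This is only a minimal sketch of the described behavior, not the mtools implementation: `count_line_breaks` and `count_records` are hypothetical helper names, and the quote-aware version merely stands in for what a CSV-aware reader would report.

```ruby
# Hypothetical stand-in for the documented rule: count line-break characters,
# and subtract one for the field-name row unless nfn is set.
def count_line_breaks(text, nfn: false)
  breaks = text.count("\n")
  nfn ? breaks : breaks - 1
end

# Hypothetical quote-aware count: line breaks inside double quotes are skipped.
def count_records(text, nfn: false)
  in_quotes = false
  rows = 0
  text.each_char do |ch|
    if ch == '"'
      in_quotes = !in_quotes # a doubled quote ("") toggles twice, so the state is kept
    elsif ch == "\n" && !in_quotes
      rows += 1
    end
  end
  nfn ? rows : rows - 1
end

plain  = "customer,date,amount\nA,20081201,10\nB,20081002,40\n"
quoted = "customer,note\nA,\"first\nsecond\"\nB,plain\n"

p count_line_breaks(plain)            # -> 2
p count_line_breaks(plain, nfn: true) # -> 3
p count_line_breaks(quoted)           # -> 3 (the quoted line break is counted)
p count_records(quoted)               # -> 2
```

On data without quoted line breaks both functions agree; they differ only when a field contains a line break.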
MCMD::mrecount(arguments)
Specify the following arguments in arguments as a single character string, separated by spaces.
i= | Input file name (String)
-nfn | No field names in the first row.
# dat1.csv
customer,date,amount
A,20081201,10
B,20081002,40

p MCMD::mrecount("i=dat1.csv") # -> 2
p MCMD::mrecount("i=dat1.csv -nfn") # -> 3
Mtable: Class to read CSV data into cells
The processing speeds of mrecount, the UNIX command wc, and Mtable were benchmarked on counting the rows of CSV data. The results of the benchmark test are shown in Table 3.1. The experiment was carried out on data with one million, two million, three million, four million, and five million rows. As the results show, mrecount is slightly faster than wc. Although Mtable is not a class intended for counting rows, mrecount is roughly 5 to 7 times faster than Mtable.
An excerpt of the script used in the benchmark test is shown in Figure 3.1.
Number of rows | 1000K | 2000K | 3000K | 4000K | 5000K
mrecount | 0.034 | 0.066 | 0.097 | 0.129 | 0.161
wc -l | 0.038 | 0.070 | 0.103 | 0.133 | 0.169
Mtable | 0.231 | 0.407 | 0.503 | 0.731 | 0.828
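The speed comparison can be recomputed directly from the table values. The sketch below copies the timings from the table and divides each alternative by the mrecount figure for the same data size; the helper name `ratios` is an assumption introduced here for illustration.

```ruby
# Timings copied from the benchmark table (1000K to 5000K rows).
mrecount = [0.034, 0.066, 0.097, 0.129, 0.161]
wc       = [0.038, 0.070, 0.103, 0.133, 0.169]
mtable   = [0.231, 0.407, 0.503, 0.731, 0.828]

# Ratio of each alternative to the baseline, per data size (hypothetical helper).
def ratios(other, base)
  other.zip(base).map { |o, b| (o / b).round(2) }
end

mtable_ratios = ratios(mtable, mrecount)
wc_ratios     = ratios(wc, mrecount)

p mtable_ratios # -> [6.79, 6.17, 5.19, 5.67, 5.14]
p wc_ratios     # -> [1.12, 1.06, 1.06, 1.03, 1.05]
```

By these figures Mtable takes between about 5 and 7 times as long as mrecount, while wc -l is within roughly 3 to 12 percent of it.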
require 'rubygems'
require 'mtools'
require 'benchmark'

puts Benchmark.measure{
  (0...10).each{|i|
    # Case of mrecount
    p MCMD::mrecount("i=data.csv")
    # Case of wc
    system "wc -l data.csv"
    # Case of Mtable
    MCMD::Mtable("i=data.csv -array"){|tbl|
      p tbl.recordSize
    }
  }
}
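As excerpted, the script places all three cases inside one Benchmark.measure block, so running it unchanged would report a single combined time. One way to obtain a separate figure per case is to time each block on its own. The following is a minimal sketch under stated assumptions: `time_cases` is a hypothetical helper, and the workloads are pure-Ruby stand-ins so that the example runs without mtools. With mtools installed, each lambda would instead wrap one call from the excerpt, for example `-> { MCMD::mrecount("i=data.csv") }`.

```ruby
require 'benchmark'

# Time each labelled case separately; returns { label => elapsed seconds }.
def time_cases(cases, repeat: 10)
  cases.transform_values do |work|
    Benchmark.realtime { repeat.times { work.call } }
  end
end

# Stand-in workloads (assumptions for illustration, not the mtools calls).
text = "customer,date,amount\n" + "A,20081201,10\n" * 1000
cases = {
  'count'     => -> { text.count("\n") },
  'each_line' => -> { text.each_line.count }
}

timings = time_cases(cases)
timings.each { |label, sec| puts format('%-10s %.6f', label, sec) }
```

Each case is repeated ten times, matching the loop count in the excerpt.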