2.1 Mcsvin - Read CSV Class

This class processes a CSV data file row by row. Its methods are described below.

2.1.1 Method

* MCMD::Mcsvin::new(arguments){block}

Creates an Mcsvin object. Pass the arguments listed below as a single space-delimited character string in “arguments”.

i=

Input file name (String) [required]

k=

Fields in which key breaks are detected. Separate multiple fields with commas.

Note that how the key is specified depends on the arguments yielded by each method.

-nfn

No field names in the first row.

-array

Store the data fields passed to each method in an Array.

If this is not specified, data fields are stored in a Hash by default.

Array storage is about twice as fast as Hash storage (refer to the benchmark for details).

block

If a block is specified, it is executed (via yield) with the created object.

* MCMD::Mcsvin::each{|val| block}
* MCMD::Mcsvin::each{|val,top,bot| block}

Processes the CSV file one row at a time. 1) When no key (k=) is specified, only val is set. 2) When a key is specified, key break information is set in top and bot in addition to val.

val

The values of the row, stored in a Hash keyed by field name (or in an Array). All values are stored as strings.

top

Set to true if the row is the first row of a key specified at k=; otherwise set to false. See remarks for more details.

bot

Set to true if the row is the last row of a key specified at k=; otherwise set to false. See remarks for more details.

* MCMD::Mcsvin::names()

Returns an array of field names. Returns nil if -nfn is specified.
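
The argument string format described for new (space-delimited tokens, name=value parameters, -flag switches, comma-separated key fields) can be illustrated with a small parser. This is only a minimal sketch in plain Ruby: parse_args is a hypothetical helper, not part of MCMD, and it makes no claim about how Mcsvin actually parses its arguments.

```ruby
# Hypothetical helper, not part of MCMD: splits an Mcsvin-style argument
# string into name=value parameters and -flag switches.
def parse_args(arguments)
  params = {}
  flags = []
  arguments.split.each do |token|
    if token.start_with?("-")
      flags << token.sub(/\A-/, "")      # "-nfn" becomes "nfn"
    else
      name, value = token.split("=", 2)  # split on the first "=" only
      params[name] = value
    end
  end
  # k= may list several fields separated by commas
  params["k"] = params["k"].split(",") if params["k"]
  { params: params, flags: flags }
end

opts = parse_args("i=dat1.csv k=customer,date -array")
p opts[:params]["i"]   # input file name
p opts[:params]["k"]   # key fields as an Array
p opts[:flags]         # switches without the leading "-"
```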

2.1.2 Remarks

2.1.3 Example

Example 1 Print the field names and the values of each row by field name

# dat1.csv
customer,date,amount
A,20081201,10
B,20081002,40

MCMD::Mcsvin.new("i=dat1.csv"){|csv|
  puts csv.names.join(",")
  csv.each{|val|
    p val
  }
}
# Output results
customer,date,amount
{"customer"=>"A", "date"=>"20081201", "amount"=>"10"}
{"customer"=>"B", "date"=>"20081002", "amount"=>"40"}

Example 2 Key break processing

# dat1.csv
customer,date
A,20081201
A,20081202
B,20081003
C,20081004
C,20081005
C,20081006

csv=MCMD::Mcsvin.new("i=dat1.csv k=customer")
csv.each{|val,top,bot|
  puts "#{val['customer']},#{val['date']} top=#{top} bot=#{bot}"
}
csv.close

# Output results
A,20081201 top=true bot=false
A,20081202 top=false bot=true
B,20081003 top=true bot=true
C,20081004 top=true bot=false
C,20081005 top=false bot=false
C,20081006 top=false bot=true
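
The top and bot flags in the output above can be reproduced without MCMD by comparing each row's key with the keys of its neighbours. The following is a minimal sketch in plain Ruby, not the Mcsvin implementation; each_with_key_break is a hypothetical helper, and it assumes that rows with the same key are adjacent in the input.

```ruby
# Plain-Ruby sketch of key-break flags; not the Mcsvin implementation.
# rows: Array of Hashes in which rows with equal keys are adjacent (assumption).
# keys: field names that make up the key (the role played by k=).
def each_with_key_break(rows, keys)
  rows.each_with_index do |row, i|
    key  = row.values_at(*keys)
    prev = i.zero? ? nil : rows[i - 1].values_at(*keys)
    nxt  = rows[i + 1] && rows[i + 1].values_at(*keys)
    top = (key != prev)   # first row of a key group
    bot = (key != nxt)    # last row of a key group
    yield row, top, bot
  end
end

# Same data as dat1.csv in Example 2
rows = [
  %w[A 20081201], %w[A 20081202], %w[B 20081003],
  %w[C 20081004], %w[C 20081005], %w[C 20081006]
].map { |customer, date| { "customer" => customer, "date" => date } }

lines = []
each_with_key_break(rows, ["customer"]) do |val, top, bot|
  lines << "#{val['customer']},#{val['date']} top=#{top} bot=#{bot}"
end
puts lines
```

A key that occurs in a single row, such as B, is both the first and the last row of its group, so both flags are true.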

Example 3 Data processing without field names

Data is stored in Array when -nfn is specified.

# dat1.csv
A,20081201
A,20081202

MCMD::Mcsvin.new("i=dat1.csv k=1 -nfn"){|csv|
  puts csv.names # -> nil
  csv.each{|val|
    p val
  }
}

# Output results
nil
["A", "20081201"]
["A", "20081202"]

Example 4 Example of storing data in Array

# dat1.csv
customer,date,amount
A,20081201,10
B,20081002,40

# Array storage with -array option 
MCMD::Mcsvin.new("i=dat1.csv -array"){|csv|
  puts csv.names.join(",")
  csv.each{|val|
    p val
  }
}

# Output results
customer,date,amount
["A", "20081201", "10"]
["B", "20081002", "40"]
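
The relationship between the Array rows above and the default Hash rows of Example 1 can be sketched in plain Ruby: pairing the field names with the values of an Array row gives the Hash form. This only illustrates the two shapes of val and is not a description of how Mcsvin builds its rows.

```ruby
# Field names as returned by names, and one row as yielded with -array
names = ["customer", "date", "amount"]
array_row = ["A", "20081201", "10"]

# Hash form: each field name is paired with the value at the same position
hash_row = Hash[names.zip(array_row)]

# The same field is reached by position in the Array and by name in the Hash
by_position = array_row[names.index("amount")]
by_name = hash_row["amount"]
puts by_position
puts by_name
```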

Related Commands

Mcsvout : Write to CSV data.

Mtable : Read data by cell from CSV file.

2.1.4 Benchmark Test

The speed of reading CSV data is benchmarked for several Ruby extension libraries. The benchmark test covers the following 4 libraries and the mcut command.

CSVScan

http://raa.ruby-lang.org/project/csvscan/

LightCsv

http://tmtm.org/ruby/lightcsv/

FasterCSV

http://www.gesource.jp/programming/ruby/database/fastercsv.html

CSV

http://www.ruby-lang.org/ja/old-man/html/CSV.html

mcut

Fields are extracted with an M-Command (implemented in C++). Its speed is shown for reference.

The results of the benchmark test are shown in Table 2.1. In this experiment, between 10,000 and 5 million rows of data are read. The performance of Mcsvin is almost equivalent to that of CSVScan (implemented in C), and the gap is much larger against the other libraries, which are implemented in native Ruby code. However, Mcsvin lags well behind mcut. Both use the same CSV parsing logic, so the difference is due to the cost of storing the data in an Array through the Ruby interface. An excerpt of the script used in the benchmark test is shown in Figure 2.1.

Table 2.1: Comparison of execution speed among various CSV libraries (in seconds)

Number of rows  10K    100K   1000K  2000K  3000K  4000K  5000K
Mcsvin          0.020  0.196  1.76   3.51   5.26   7.02   8.79
CSVScan         0.021  0.187  1.83   3.67   5.50   7.33   9.17
LightCsv        0.155  1.62   15.99  -      -      -      -
FasterCSV       0.196  1.96   19.50  -      -      -      -
CSV             1.44   14.3   -      -      -      -      -
mcut            -      -      0.095  0.177  0.260  0.342  0.423


Each result is the average real time over 10 benchmark runs.
Cells marked “-” were not measured because the value would be too big (or too small).
10K rows means 10,000 rows. The data consists of 5 columns, and 1000K rows amount to about 25MB.
Version: CSVScan 0.0.20070920, FasterCSV 1.5.1, LightCsv 0.2.2, CSV (Ruby 1.8.7)
Test environment: MacBook Pro, 2.66GHz Intel Core i7, 8GB memory, Mac OS X 10.6.8

require 'rubygems'
require 'csv'
require 'fastercsv'
require 'lightcsv'
require 'csvscan'
require 'mcmd'

require 'benchmark'

puts Benchmark.measure{
  (0...10).each{|i|
    # Case of Mcsvin
    csv=MCMD::Mcsvin.new("i=data.csv -array"){|csv| csv.each{|val|}}

    # Case of CSVScan
    File.open("data.csv","r"){|fp| CSVScan.scan(fp){|row|}}

    # Case of LightCsv
    LightCsv.foreach("data.csv"){|row|}

    # Case of FasterCSV
    FasterCSV.foreach("data.csv"){|row|}

    # Case of CSV
    CSV.open("data.csv", 'r'){|row|}
  }
}

Figure 2.1: Excerpt of benchmark test script
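
The table notes describe each figure as an average real time over 10 runs. One way to take such a per-reader average is sketched below using only the Ruby standard library. Everything here is an assumption for illustration: average_realtime is a hypothetical helper, the reader is a naive split on commas rather than any of the benchmarked libraries, and the data is a small generated stand-in file, so the timing it prints is not comparable to Table 2.1.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical helper: average wall-clock seconds of the given block over `runs` runs.
def average_realtime(runs)
  total = 0.0
  runs.times { total += Benchmark.realtime { yield } }
  total / runs
end

# Small stand-in data set with 5 columns (the real test used far larger files)
file = Tempfile.new(["data", ".csv"])
1000.times { |i| file.puts [i, "A", "20081201", 10, "x"].join(",") }
file.close

row_count = 0
avg = average_realtime(10) do
  row_count = 0
  # Naive reader used only as a placeholder for a CSV library call
  File.foreach(file.path) { |line| line.chomp.split(","); row_count += 1 }
end
puts "naive reader: #{'%.4f' % avg} s on average for #{row_count} rows"
file.unlink
```

Timing each reader in its own block, rather than all of them inside one measured block, gives one number per library.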

The next test compares execution time by storage type, with and without a key (Table 2.2). Specifying a key makes little difference in speed, whereas Array storage is about twice as fast as Hash storage.

Table 2.2: Comparison of execution speed by storage type, with and without key (in seconds)

Key  Type   1000K  2000K  3000K  4000K  5000K
No   Array  1.76   3.51   5.26   7.02   8.79
No   Hash   3.52   6.99   10.50  14.00  17.52
Yes  Array  1.97   3.92   5.88   7.84   9.83
Yes  Hash   3.68   7.34   11.01  14.73  18.34

Each key spans 10 rows on average.
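
The "about twice as fast" claim can be checked directly from the figures in Table 2.2 by dividing each Hash time by the corresponding Array time. The short script below does only that arithmetic; the numbers are copied from the table and nothing is measured.

```ruby
# Times in seconds copied from Table 2.2, columns 1000K to 5000K
times = {
  ["No",  "Array"] => [1.76, 3.51, 5.26, 7.02, 8.79],
  ["No",  "Hash"]  => [3.52, 6.99, 10.50, 14.00, 17.52],
  ["Yes", "Array"] => [1.97, 3.92, 5.88, 7.84, 9.83],
  ["Yes", "Hash"]  => [3.68, 7.34, 11.01, 14.73, 18.34]
}

# Hash time divided by Array time for each data size, rounded to 2 decimals
ratios = {}
%w[No Yes].each do |key|
  pairs = times[[key, "Hash"]].zip(times[[key, "Array"]])
  ratios[key] = pairs.map { |hash_time, array_time| (hash_time / array_time).round(2) }
  puts "key=#{key}: Hash/Array = #{ratios[key].join(', ')}"
end
```

The ratio stays close to 2.0 without a key and close to 1.87 with a key, across all data sizes in the table.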