3.24 mhashsum - Compute Total Sum Using Hash

For each unique value of the key field(s) specified at the k= parameter, compute the sum of the column(s) specified at the f= parameter using a hash table.
This command is faster than msum because the key fields do not require prior sorting. However, large variation in key lengths (strings of different lengths in the key field) will slow down processing.
Users should choose between mhashsum and msum based on the characteristics of the data (refer to "Benchmark" in the second half of this manual).

Format

mhashsum f= [hs=] [k=] [-n] [i=] [o=] [-nfn] [-nfno] [-x] [precision=] [--help] [--version]

Parameters

f=

Compute the sum of values in the specified column(s) (multiple fields can be specified).

 

Specify the new field name after a colon ":", e.g. f=Quantity:TotalQuantity.

k=

Calculate the sum for each group defined by the key field(s) (multiple keys can be specified).

 

This command does not use aggregate-key break processing, so prior sorting is not required.

hs=

Hash size [Default value: 199999]

 

Specify the hash table size based on the desired trade-off between processing speed and memory consumption. A prime number should be used as the hash table size.

 

Processing will slow down if the hash table is not big enough for data with a large number of distinct keys.

 

A larger hash table will speed up processing but will also require more memory (Refer to "Benchmark" in the second half of this manual).

 

Estimated memory requirement: K*(24+F*16) bytes, where K is the number of distinct keys and F is the number of fields specified at the f= parameter.
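As an illustration (not part of the command itself), the memory estimate above and a prime candidate for hs= can be computed with a short script; the key count and field count below are hypothetical:

```python
# Illustrative sketch (not part of mhashsum): estimate memory use from the
# manual's formula K*(24 + F*16) bytes, and find a prime >= the expected
# key count to use as the hs= value.

def estimate_memory_bytes(num_keys: int, num_fields: int) -> int:
    """Memory estimate from the manual: K*(24 + F*16) bytes."""
    return num_keys * (24 + num_fields * 16)

def next_prime(n: int) -> int:
    """Smallest prime >= n (trial division; fine for sizes in this range)."""
    def is_prime(m: int) -> bool:
        if m < 2:
            return False
        i = 2
        while i * i <= m:
            if m % i == 0:
                return False
            i += 1
        return True
    while not is_prime(n):
        n += 1
    return n

# Hypothetical workload: 200,000 distinct keys, 2 aggregated fields.
print(estimate_memory_bytes(200_000, 2))  # 11200000 bytes (about 11 MB)
print(next_prime(200_000))                # 200003, a candidate hs= value
```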

-n

If a field specified at f= contains a null value for a key, output NULL as the sum for that key and field (by default, null values are ignored in the sum).

Example

Example 1: Basic Example

Calculate the total Quantity and total Amount for each Customer using the hash function.

$ more dat1.csv
Customer,Quantity,Amount
A,1,
B,,15
A,2,20
B,3,10
B,1,20
$ mhashsum k=Customer f=Quantity,Amount i=dat1.csv o=rsl1.csv
#END# kghashsum f=Quantity,Amount i=dat1.csv k=Customer o=rsl1.csv
$ more rsl1.csv
Customer,Quantity,Amount
A,3,20
B,4,45

Example 2: NULL value in output

By default, null values are ignored when computing the sum. With the -n option, the output returns NULL for any key whose Quantity or Amount values include a null.

$ mhashsum k=Customer f=Quantity,Amount -n i=dat1.csv o=rsl2.csv
#END# kghashsum -n f=Quantity,Amount i=dat1.csv k=Customer o=rsl2.csv
$ more rsl2.csv
Customer,Quantity,Amount
A,3,
B,,45
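The behaviour of the two examples above can be emulated with a short script (a sketch for illustration only; this is not how mhashsum is implemented internally):

```python
# Illustrative sketch of mhashsum's grouped-sum semantics on dat1.csv.
# With null_propagate=False (default behaviour), nulls are ignored;
# with null_propagate=True (the -n option), a null makes the group sum NULL.

from collections import defaultdict

rows = [  # Customer, Quantity, Amount (None marks a null field)
    ("A", 1, None),
    ("B", None, 15),
    ("A", 2, 20),
    ("B", 3, 10),
    ("B", 1, 20),
]

def group_sum(rows, null_propagate=False):
    sums = defaultdict(lambda: [0, 0])            # key -> [qty_sum, amt_sum]
    has_null = defaultdict(lambda: [False, False])
    for key, *vals in rows:
        for i, v in enumerate(vals):
            if v is None:
                has_null[key][i] = True
            else:
                sums[key][i] += v
    return {
        key: tuple(None if null_propagate and has_null[key][i] else s[i]
                   for i in range(2))
        for key, s in sums.items()
    }

print(group_sum(rows))                       # {'A': (3, 20), 'B': (4, 45)}
print(group_sum(rows, null_propagate=True))  # {'A': (3, None), 'B': (None, 45)}
```

The two printed results match rsl1.csv and rsl2.csv above.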

Overview of algorithm

The mhashsum command uses a hashing method known as separate chaining.
In this method, a hash table of M slots is created, and each slot holds a list of all the keys that hash to it. The hash function converts the character string of each key into an integer (hash value) from 0 to M-1. Two or more keys with the same hash value (colliding keys) are stored in a linked list attached to the conflicting slot of the hash table. Each lookup scans the list for its slot linearly, so the processing speed decreases as key collisions increase. The default hash size of mhashsum is 199,999; with up to about 200,000 distinct keys, the average list length is about one. However, depending on the content and structure of the data, many collisions may occur even when the number of keys is small. The hash size can be changed at the hs= parameter (maximum value: 1,999,999).

Benchmark test

Method of Benchmark test

Compare the computation speed of the mhashsum command (hash size: 199,999) and the msum command (data sorted beforehand with the msort command) on 13 data sets with key sizes ranging from 100 to 1,000,000.

Sample data with two columns (key and fld) and 500 million rows of random values are generated as shown in the following table. The key is a fixed-length 6-digit number and fld is a 3-digit number.

Table 3.3: Table 1: Sample data for Benchmark test

key     fld
100020  120
100007  107
100029  129
100065  165
100030  130
100088  188
100055  155
100093  193
100072  172

Commands for benchmark

Using the mhashsum method
$ time mhashsum k=key f=fld i=dat.csv o=/dev/null

Using the msort+msum method
$ time msort f=key i=dat.csv | msum k=key f=fld o=/dev/null

Experiment Results

The processing speed of mhashsum is about five times that of the sorting method when the key size is small (up to 10,000). As the key size increases, the difference between the two methods shrinks, and the two methods take roughly the same time when the key size exceeds 800,000.

The following is a guideline for choosing between msum and mhashsum; actual results vary depending on the distribution of the key values.

Table 3.4: Table 2: Comparison of processing speed for mhashsum and msum (msort+msum)

key size    mhashsum (a) (seconds)    msort+msum (b) (seconds)    ratio (a/b)
100         0.672                     2.955                       0.227
1,000       0.731                     3.981                       0.184
10,000      0.814                     4.201                       0.194
100,000     1.793                     4.291                       0.418
200,000     2.241                     4.336                       0.517
300,000     2.604                     4.394                       0.593
400,000     2.993                     4.448                       0.673
500,000     3.380                     4.497                       0.752
600,000     3.793                     4.579                       0.828
700,000     4.128                     4.618                       0.894
800,000     4.514                     4.667                       0.967
900,000     4.901                     4.707                       1.041
1,000,000   5.352                     4.771                       1.122

Benchmark environment

iMac, Mac OS X 10.5 Leopard, 2.8GHz Intel Core 2 Duo, 4GB memory

Related commands

msum : Same aggregation function, but requires sorted input for key-break processing.

mhashavg : Compute averages using the same hash method.