3.56 msortf - Sort Records

Sort records by the fields specified in the f= parameter.
This command uses the quicksort algorithm, which is not a stable sort: the original order of rows with the same key value is not guaranteed to be retained.

Format

msortf f= [i=] [o=] [tmpPath=] [-nfn] [-nfno] [-x] [--help] [--version]

Parameters

`f=`	Specify the name of the field (column) by which records are sorted.
	The sort order combines two choices: numeric or character string, and ascending or descending.
	To change the default, append `%` to the field name, followed by `n` (numeric) and/or `r` (reverse, i.e. descending).
	Character string ascending order: `f=field` (`%` is not specified), character string descending order: `f=field%r`,
	numeric ascending order: `f=field%n`, numeric descending order: `f=field%nr`.
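
The difference between character string order and numeric order matters when a field holds numbers of varying width. The following Python sketch is an illustration only: it is not msortf code, and the sample values are made up. It compares the two orderings on the same values.

```python
# Illustration only: the values below are hypothetical sample data.
values = ["10", "9", "100", "2"]

# Character string order compares character by character, so "10" sorts before "2".
string_asc = sorted(values)

# Numeric order (what the %n suffix requests) compares the numbers themselves.
numeric_asc = sorted(values, key=float)

# Adding r (as in %nr) reverses the numeric order.
numeric_desc = sorted(values, key=float, reverse=True)

print(string_asc)
print(numeric_asc)
print(numeric_desc)
```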

Remarks

Character string fields specified at f= may not be sorted correctly when %n is specified.
When k= is not specified, specify the files in merging order at i= (same as mcat).
When the key field includes NULL values, a NULL value is treated as less than any other value.
All input data specified at i= is assumed to have the same field names, whereas mcat is more flexible about field names.

Examples

Example 1: Basic example

Sort by item and date.

$ more dat1.csv
item,date,quantity,price
B,20081201,4,40
A,20081201,10,200
A,20081201,10,100
B,20081203,5,50
B,20081201,2,500
A,20081201,3,300
$ msortf f=item,date i=dat1.csv o=rsl1.csv
#END# kgsortf f=item,date i=dat1.csv o=rsl1.csv
$ more rsl1.csv
item,date,quantity,price
A,20081201,10,200
A,20081201,10,100
A,20081201,3,300
B,20081201,4,40
B,20081201,2,500
B,20081203,5,50

Example 2: Sort by quantity in descending order and price in ascending order.

$ msortf f=quantity%nr,price%n i=dat1.csv o=rsl2.csv
#END# kgsortf f=quantity%nr,price%n i=dat1.csv o=rsl2.csv
$ more rsl2.csv
item,date,quantity,price
A,20081201,10,100
A,20081201,10,200
B,20081203,5,50
B,20081201,4,40
A,20081201,3,300
B,20081201,2,500
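
The ordering in Examples 1 and 2 can be reproduced outside msortf as a cross-check. The Python sketch below is an approximation under stated assumptions, not the msortf implementation: `parse_spec` and `sort_rows` are hypothetical helper names, NULL (empty) numeric values are assumed to sort below every number, and Python's sort is stable whereas msortf's is not, so rows with identical keys may come out in a different order from msortf.

```python
import csv
import io

DAT1 = """item,date,quantity,price
B,20081201,4,40
A,20081201,10,200
A,20081201,10,100
B,20081203,5,50
B,20081201,2,500
A,20081201,3,300
"""

def parse_spec(spec):
    """Split a spec such as 'quantity%nr,price%n' into (field, numeric, descending) tuples."""
    keys = []
    for part in spec.split(","):
        name, _, flags = part.partition("%")
        keys.append((name, "n" in flags, "r" in flags))
    return keys

def sort_rows(rows, spec):
    """Sort dict rows by the spec, applying the keys from last to first with a stable sort."""
    result = list(rows)
    for name, numeric, descending in reversed(parse_spec(spec)):
        if numeric:
            # Assumption: an empty (NULL) value sorts below every number.
            key = lambda r, n=name: float(r[n]) if r[n] != "" else float("-inf")
        else:
            key = lambda r, n=name: r[n]
        result.sort(key=key, reverse=descending)
    return result

rows = list(csv.DictReader(io.StringIO(DAT1)))
for r in sort_rows(rows, "quantity%nr,price%n"):
    print(",".join(r.values()))
```

Sorting from the last key to the first works because each pass keeps the relative order established by the previous passes for rows that tie on the current key.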

Advanced parameters

`pways=`	Number of files merged simultaneously ([2-100]; default 32) [Optional]
	Specify the number of files to merge at a time when sorting multiple files.
`blocks=`	Number of buffer blocks ([1-1000]; default 100; 1 block = 400KB) [Optional]
	Specify the memory limit for in-memory sorting as a number of blocks.
	Maximum size for 1 block is × 4. Default = 400KB.
`maxlines=`	Row limit for in-memory sorting ([100-10,000,000]; default 500,000) [Optional]
	Specify the maximum number of records sorted at once in memory.
	Set the `blocks=` and `maxlines=` limits according to the average record size in the data.
`threadCnt=`	Number of threads used for in-memory sorting ([1-50]; default 8) [Optional]
	Specify the number of threads used for multi-threaded sorting.
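
Parameters such as `maxlines=` and `pways=` are typical of an external merge sort, in which bounded chunks are sorted in memory and the resulting sorted runs are then merged a few at a time. The Python sketch below shows that general technique only; it is an assumption about how such limits interact, not a description of msortf internals. The function name `external_sort` is hypothetical, and in-memory lists stand in for the temporary files a real implementation would write.

```python
import heapq
from itertools import islice

def external_sort(records, maxlines=4, pways=2):
    """Sort records in chunks of at most maxlines, then merge at most pways runs at a time."""
    if maxlines < 1 or pways < 2:
        raise ValueError("maxlines must be >= 1 and pways must be >= 2")
    it = iter(records)
    runs = []
    # Phase 1: sort bounded chunks in memory; each sorted chunk is one run.
    while True:
        chunk = sorted(islice(it, maxlines))
        if not chunk:
            break
        runs.append(chunk)
    # Phase 2: repeatedly merge groups of up to pways runs until one run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + pways]))
                for i in range(0, len(runs), pways)]
    return runs[0] if runs else []

# Hypothetical sample keys, sorted with deliberately small limits.
data = [95, 81, 26, 32, 10, 15, 29, 85, 55, 66]
print(external_sort(data, maxlines=3, pways=2))
```

In this sketch a smaller `maxlines` produces more runs, and a smaller `pways` requires more merge rounds, which is why the two limits are usually tuned together with the available memory.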

Notes on sorting order of CSV special characters

msortf interprets and sorts CSV special characters (e.g. commas and double quotes) differently from the UNIX sort command. Fields (columns) are separated by the comma character. In the example below, the values in the first column (f1), from the first row onwards, are the following ASCII characters:

a -> (0x61)
null -> (0x00)
space -> (0x20)
+ -> (0x2b)
- -> (0x2d)
, -> (0x2c)
" -> (0x22)

Commas and double quotes are special characters in CSV, so a value containing them is enclosed in double quotes. For ease of illustration, the second column (f2) contains "x" in every record.

------------------------------------------------
f1,f2
a,x
,x
 ,x
+,x
-,x
",",x
"""",x
------------------------------------------------

The command "msortf f=f1" sorts the data as follows. The resulting order of the values is: null, space, double quote, +, comma, -, a.

------------------------------------------------
f1,f2
,x
 ,x
"""",x
+,x
",",x
-,x
a,x
------------------------------------------------
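
The result above corresponds to comparing the unquoted field values by their byte codes. The Python sketch below illustrates this with the standard `csv` module and contrasts it with a byte-wise sort of the raw, unparsed lines (roughly what a line-oriented sort does in the C locale). It is an illustration of the ordering only, not msortf code.

```python
import csv

# Data rows of the sample above (header omitted).
BODY = ['a,x', ',x', ' ,x', '+,x', '-,x', '",",x', '"""",x']

# Field-aware order: parse each line as CSV, then compare the unquoted f1 bytes.
rows = list(csv.reader(BODY))
by_field = [r[0] for r in sorted(rows, key=lambda r: r[0].encode("ascii"))]

# Raw-line order: compare the unparsed text, with quotes and delimiters included.
raw_sorted = sorted(BODY, key=lambda s: s.encode("ascii"))
by_line = [next(csv.reader([line]))[0] for line in raw_sorted]

print(by_field)
print(by_line)
```

In the raw-line ordering the enclosing double quotes and the field delimiter take part in the comparison, so the quoted comma and the empty field land in different positions.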

Benchmark Test

The benchmark test described here shows the performance of msort and msortf. The input data consists of 6 fields, and all values are uniformly distributed random numbers.

------------------------------------------------
key,fld1,fld2,fld3,fld4,fldn
95547922,162,159,192,118,74
81438069,138,157,155,122,58
26885062,129,199,133,198,75
32651684,180,107,123,170,-14
10245631,164,103,159,154,-63
15145156,182,191,175,107,-60
29254245,188,185,129,124,5
85423170,116,164,175,113,57
55155879,105,163,195,167,25
66997216,195,139,195,113,39
.
.
------------------------------------------------
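
To experiment with data of a similar shape, a generator along the following lines could be used. This Python sketch is hypothetical: the function name `make_rows`, the value ranges (chosen only to resemble the sample above), and the way the number of distinct key values is limited are all assumptions, since the original generator is not described here.

```python
import random

HEADER = ["key", "fld1", "fld2", "fld3", "fld4", "fldn"]

def make_rows(n_records, key_types, seed=0):
    """Return n_records rows whose key takes at most key_types distinct values."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    rows = []
    for _ in range(n_records):
        key = rng.randrange(key_types)                  # assumed: uniform over 0..key_types-1
        flds = [rng.randint(100, 199) for _ in range(4)]  # assumed range for fld1..fld4
        fldn = rng.randint(-100, 100)                   # assumed range for fldn
        rows.append([key, *flds, fldn])
    return rows

# Print a small sample in CSV form.
print(",".join(HEADER))
for row in make_rows(5, key_types=10):
    print(",".join(str(v) for v in row))
```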

Compare number of key types and values

The sample data contains 1 million records. The following table shows the results as the number of key types (distinct key values) varies over 2, 10, 100, 1000, and 10000. For the "Rand" column, the keys are random numbers generated up to the maximum limit of the random number. For the "Rand Asc" and "Rand Desc" columns, the data is sorted on those values in ascending or descending order before the benchmark test. The table compares the processing results of msort, the MUSASHI xtsort command, and the UNIX sort command against the msortf command. The sort command sorts on one or more sort keys extracted from each line of input; "sort -k1" sorts the data on the first column. The last 3 rows of the table show the results of msortf, xtsort, and sort when sorting on the numeric value stored in the first key field.
※ Input data size: about 28MB.
※ Unit: seconds. Measured in real time from the beginning to the end of the program using the time command.
※ Environment: iMac, Mac OS X 10.5 Leopard, 2.8GHz Intel Core 2 Duo, 4GB memory

Table 3.32: Comparison of types of key values and its condition among various sort commands

No.	Command	2 Types	10 Types	100 Types	1000 Types	10000 Types	Rand	Rand Asc	Rand Desc
(1)	msortf f=key	0.29	0.33	0.37	0.40	0.43	0.50	0.29	0.28
(2)	xtsort -k key	1.25	1.24	1.22	1.20	1.19	1.12	0.85	1.00
(3)	sort -k1	16.96	16.63	16.05	15.56	15.08	13.68	6.85	7.13
(4)	msortf f=key%n	0.46	0.56	0.65	0.72	0.79	1.02	0.59	0.59
(5)	xtsort -k key%n	2.52	2.72	2.96	3.16	3.21	3.22	2.31	2.32
(6)	sort -k1 -n	16.65	14.52	11.54	8.56	5.71	0.95	0.33	0.36


Figure 3.1: Comparison of character string sort times for msort, msortf, and xtsort by number of key types (x-axis: number of key types, y-axis: seconds)


Figure 3.2: Comparison of numeric sort times for msortf and xtsort by number of key types (x-axis: number of key types, y-axis: seconds)

msortf is roughly 2 to 5 times faster than xtsort. Compared with sort, it can be more than ten times faster, depending on the conditions. msortf uses exactly the same quicksort algorithm as MUSASHI; in MCMD, however, the sort is processed in parallel across multiple threads. The results show the impact of this difference.
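
The speed ratios behind these statements can be recomputed from Table 3.32. The Python sketch below copies the timings from the table and divides each competitor's time by the msortf time in the same column; the helper name `ratios` is introduced here for illustration.

```python
# Timings in seconds copied from Table 3.32.
# Column order: 2, 10, 100, 1000, 10000 types, Rand, Rand Asc, Rand Desc.
msortf_str = [0.29, 0.33, 0.37, 0.40, 0.43, 0.50, 0.29, 0.28]
xtsort_str = [1.25, 1.24, 1.22, 1.20, 1.19, 1.12, 0.85, 1.00]
sort_str = [16.96, 16.63, 16.05, 15.56, 15.08, 13.68, 6.85, 7.13]
msortf_num = [0.46, 0.56, 0.65, 0.72, 0.79, 1.02, 0.59, 0.59]
xtsort_num = [2.52, 2.72, 2.96, 3.16, 3.21, 3.22, 2.31, 2.32]
sort_num = [16.65, 14.52, 11.54, 8.56, 5.71, 0.95, 0.33, 0.36]

def ratios(other, base):
    """Return how many times slower 'other' is than 'base', column by column."""
    return [round(o / b, 2) for o, b in zip(other, base)]

print("xtsort / msortf (string): ", ratios(xtsort_str, msortf_str))
print("xtsort / msortf (numeric):", ratios(xtsort_num, msortf_num))
print("sort / msortf (string):   ", ratios(sort_str, msortf_str))
print("sort / msortf (numeric):  ", ratios(sort_num, msortf_num))
```

Note that in the numeric case the ratios for the three Rand columns fall below 1, meaning sort -k1 -n was faster than msortf there; this is one of the conditions on which the comparison with sort depends.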

Next, the experiment measures how the speed of character string sorting changes as the number of records grows from 1 million to 10 million, with the number of key types set to 100 and to the maximum value (random keys). The comparison of the two commands, msortf and xtsort, is shown in Figures 3.3 and 3.4.


Figure 3.3: Sorting results with 100 key types (x-axis: number of records, y-axis: seconds)


Figure 3.4: Sorting results with random keys (maximum number of key types) (x-axis: number of records, y-axis: seconds)