2.9 Specify Parameters

The parameter format used in M-Command differs slightly from that of UNIX commands. A keyword and its value are separated by an equals sign, i.e. "keyword=value". Option-type parameters are preceded by a minus sign, e.g. "-keyword", and do not take a value.
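
The two parameter forms can be illustrated with a short Python sketch. It is not part of MCMD; the function name parse_params is hypothetical and the code only mirrors the format described above.

```python
def parse_params(argv):
    """Split M-Command style arguments into keyword values and option flags."""
    values, options = {}, set()
    for arg in argv:
        if arg.startswith("-"):
            # Option-type parameter such as -nfn or --help: no value attached.
            options.add(arg.lstrip("-"))
        elif "=" in arg:
            # keyword=value; split on the first "=" only, so a value may contain "=".
            key, value = arg.split("=", 1)
            values[key] = value
        else:
            raise ValueError(f"not a keyword=value pair or an option: {arg!r}")
    return values, options


values, options = parse_params(["f=customer,amount", "i=dat1.csv", "-nfn"])
print(values)   # {'f': 'customer,amount', 'i': 'dat1.csv'}
print(options)  # {'nfn'}
```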

Many parameters serve the same function across M-Command. These common parameters are explained below. Note, however, that in some commands a parameter of the same name has a completely different function.

Keyword      Description
i=           Input file name
o=           Output file name
f=           Input and output field name
k=           Key field name
s=           Sort field name
a=           Add item name
-nfn         CSV without field name
-nfno        Output without field name
-x           Specify the field number
-q           Disable automatic sorting
precision=   Number of significant figures
tmpPath=     Work file storage path name
delim=       Delimiter of vector data
bufcount=    Number of buffers
--help       Display help

2.9.1 i= Input file name

Specify the name of the input file. Most commands accept only a single file; the exception is mcat, which accepts multiple files separated by commas. Certain commands, such as mnewnumber and mnewrand, do not require input data.

When this parameter is not specified, data is read from standard input, typically through a pipeline. In the example below, i= is not specified for msum because its input is the result of msortf, which it reads from standard input through the pipeline.

$ msortf f=a i=dat.csv | msum k=a f=b o=rsl.csv

However, errors are difficult to identify when results are piped directly from one command to the next. In the following example, i= is also specified for msum. The result of msortf is sent to standard output, but msum reads its input from dat.csv instead. Since the output of msortf is never used by msum, the result differs from that of the example above.

$ msortf f=a i=dat.csv | msum k=a f=b i=dat.csv o=rsl.csv
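
The precedence described above can be summarized in a small Python sketch. It is only an illustration of the documented behaviour, not MCMD code; open_input is a hypothetical helper name.

```python
import io
import sys


def open_input(params, stdin=sys.stdin):
    """Return the stream a command would read: the i= file if given, else stdin."""
    if "i" in params:
        # i= takes precedence, so anything arriving on the pipe is never read.
        return open(params["i"], newline="")
    return stdin


piped = io.StringIO("a,b\n1,2\n")  # stands in for data arriving through a pipeline
print(open_input({}, stdin=piped).read(), end="")
```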

Examples

Example 1: Basic Example

Run mcut using dat1.csv as input data.

$ more dat1.csv
customer,quantity,amount
A,1,10
A,2,20
$ mcut f=customer,amount i=dat1.csv o=rsl1.csv
#END# kgcut f=customer,amount i=dat1.csv o=rsl1.csv
$ more rsl1.csv
customer,amount
A,10
A,20

Example 2: Read from standard input

Read from standard input using redirection ("<").

$ mcut f=customer,amount o=rsl2.csv <dat1.csv
#END# kgcut f=customer,amount o=rsl2.csv
$ more rsl2.csv
customer,amount
A,10
A,20

Related commands

This parameter can be used in all commands except those that do not require input data, such as mnewnumber and mnewrand.

2.9.2 o= Output file name

Specify the name of the output file. Most commands accept only a single file name; the exception is mtee, which accepts multiple files. Some commands, such as msep, do not require output data.

When this parameter is not specified, data is written to standard output. In the following example, o= is not specified for msortf because its output is sent to standard output and passed on through the pipeline.

$ msortf f=a i=dat.csv | msum k=a f=b o=rsl.csv

The example below is similar to the one above, except that o= is specified for msortf, so the result of msortf is saved to tmp.csv. Even though the two commands are connected by a pipeline, nothing is written to the standard output of msortf, so msum cannot read any data from the pipeline and stays idle.

$ msortf f=a i=dat.csv o=tmp.csv | msum k=a f=b o=rsl.csv

Below is a more complicated example that uses mtee to connect the data streams between the two commands.

$ msortf f=a i=dat.csv | mtee o=tmp.csv | msum k=a f=b o=rsl.csv

The mtee command writes its input to the file specified in o= and, at the same time, sends the same data to standard output. The result of msortf is written to tmp.csv while msum receives the data stream from mtee through the pipeline. The final result is saved to rsl.csv.
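
The behaviour of mtee in this pipeline can be pictured with the following Python sketch. It is an analogy only, not the MCMD implementation; the function tee and the in-memory streams are stand-ins chosen for illustration.

```python
import io


def tee(source, saved, downstream):
    """Copy every line from source to both saved (the o= file) and downstream."""
    for line in source:
        saved.write(line)
        downstream.write(line)


sorted_rows = io.StringIO("a,b\nA,1\nB,2\n")  # stands in for the output of msortf
tmp_csv = io.StringIO()                        # stands in for tmp.csv
to_msum = io.StringIO()                        # stands in for the pipe to msum
tee(sorted_rows, tmp_csv, to_msum)
print(tmp_csv.getvalue() == to_msum.getvalue())  # True
```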

Examples

Example 1: Basic Example

The result of mcut is saved to rsl1.csv as specified in o= parameter.

$ more dat1.csv
customer,quantity,amount
A,1,10
A,2,20
$ mcut f=customer,amount i=dat1.csv o=rsl1.csv
#END# kgcut f=customer,amount i=dat1.csv o=rsl1.csv
$ more rsl1.csv
customer,amount
A,10
A,20

Example 2: Redirect

Write to standard output using redirection (">").

$ mcut f=customer,amount i=dat1.csv >rsl2.csv
#END# kgcut f=customer,amount i=dat1.csv
$ more rsl2.csv
customer,amount
A,10
A,20

Related commands

This parameter can be used in all commands except for certain commands such as msep.

2.9.3 f= Input and output field name

Specify the names of the input and output fields to process. For example, this parameter specifies the fields to select in mcut, the fields to aggregate in magg, and the fields to merge in mjoin. Multiple field names can be specified, separated by commas, as in f=a,b,c.

Each field taken from the input file can be renamed in the output. To do so, write the input field name and the output field name separated by a colon, e.g. f=a:A,b:B. If no output field name is given, the field keeps its input name.
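
The renaming syntax can be read as a list of (input name, output name) pairs. The Python sketch below shows one way to interpret such a specification; parse_field_spec is a hypothetical name and the code is not taken from MCMD.

```python
def parse_field_spec(spec):
    """Turn "a:A,b" into [("a", "A"), ("b", "b")] as (input name, output name)."""
    pairs = []
    for item in spec.split(","):
        name, sep, new_name = item.partition(":")
        # Without ":newname" the output keeps the input field name.
        pairs.append((name, new_name if sep else name))
    return pairs


print(parse_field_spec("val1:sum1,val2"))  # [('val1', 'sum1'), ('val2', 'val2')]
```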

Examples

Example 1: Basic Example

Extract fields val1 and val2.

$ more dat1.csv
id,val1,val2
A,1,2
B,2,3
C,3,4
$ mcut f=val1,val2 i=dat1.csv o=rsl1.csv
#END# kgcut f=val1,val2 i=dat1.csv o=rsl1.csv
$ more rsl1.csv
val1,val2
1,2
2,3
3,4

Example 2: Specify name of output field

Sum val1 and val2, and rename the output fields to sum1 and sum2 respectively.

$ msum f=val1:sum1,val2:sum2 i=dat1.csv o=rsl2.csv
#END# kgsum f=val1:sum1,val2:sum2 i=dat1.csv o=rsl2.csv
$ more rsl2.csv
id,sum1,sum2
C,6,9

Related commands

mcut, msum, mcat, mjoin, etc.

2.9.4 k= Key field name

Specify the key field name. A key field identifies individual rows or entities in the data; it is used as the unit of aggregation or as the common key for joining two files.

For example, in the msum command, the aggregate is computed over records with the same key (aggregate key break processing), whereas in the mjoin command, the key values in the two data files are compared and matching records are joined (join key break processing).

When the k= parameter is specified, the data is first sorted on the specified field(s) in ascending character string order, and the corresponding processing is then carried out (mhashsum is an exception). Key break processing refers to a method that processes each run of records sharing the same key value, on the assumption that the records have been sorted beforehand.

For details, refer to Key break processing. Since frequent sorting may degrade performance, understanding when key break processing is needed helps reduce the number of sorts, which is desirable when optimizing scripts.
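
Key break processing as described above can be sketched in Python as a sort followed by a single pass that starts a new group whenever the key changes. This is a conceptual illustration under that reading, not the MCMD implementation; sum_by_key is a hypothetical name, and the rows are those of Example 1 below.

```python
from itertools import groupby


def sum_by_key(rows):
    """Sum val for each id using key break processing on sorted rows."""
    # Sorting on the key as a string brings equal keys together.
    ordered = sorted(rows, key=lambda r: r[0])
    result = []
    # groupby starts a new group each time the key changes: the "key break".
    for key, group in groupby(ordered, key=lambda r: r[0]):
        result.append((key, sum(val for _, val in group)))
    return result


rows = [("A", 1), ("B", 1), ("B", 2), ("A", 2), ("B", 3)]
print(sum_by_key(rows))  # [('A', 3), ('B', 6)]
```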

Examples

Example 1: Basic Example

Compute the sum of the val column for each id.

$ more dat1.csv
id,val
A,1
B,1
B,2
A,2
B,3
$ msum i=dat1.csv k=id f=val o=rsl1.csv
#END# kgsum f=val i=dat1.csv k=id o=rsl1.csv
$ more rsl1.csv
id%0,val
A,3
B,6

Example 2: Join Process

Using “id” as the join key, join the field “name” from ref1.csv to dat1.csv.

$ more dat1.csv
id,val
A,1
B,1
B,2
A,2
B,3
$ more ref1.csv
id,name
A,nysol
B,mcmd
$ mjoin k=id i=dat1.csv m=ref1.csv f=name o=rsl4.csv
#END# kgjoin f=name i=dat1.csv k=id m=ref1.csv o=rsl4.csv
$ more rsl4.csv
id%0,val,name
A,1,nysol
A,2,nysol
B,1,mcmd
B,2,mcmd
B,3,mcmd

Related commands

msum, mslide, mjoin, mrjoin, mcommon, etc.

2.9.5 s= Sort Field Name

Specify the field name for sorting (multiple fields can be specified).

The order of records affects the result of some commands, such as maccum. When the s= parameter is specified, the data is sorted on the specified fields before the command processes it.

There are four sorting methods, formed by combining numeric or character string order with ascending or descending order. The method is specified by appending % followed by n and/or r to the field name, as follows.

Character string ascending order: s=field (% not required), character string descending order: s=field%r, numeric ascending order: s=field%n, numeric descending order: s=field%nr.
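
The meaning of the %n and %r suffixes can be illustrated with a Python sketch that parses a sort specification and applies it. The names parse_sort_spec and sort_rows are hypothetical, and the code only emulates the ordering rules stated above.

```python
def parse_sort_spec(spec):
    """Turn "id,val%nr" into [("id", False, False), ("val", True, True)]."""
    keys = []
    for item in spec.split(","):
        name, _, flags = item.partition("%")
        keys.append((name, "n" in flags, "r" in flags))  # (field, numeric, reverse)
    return keys


def sort_rows(rows, spec):
    """Sort a list of dicts according to spec."""
    result = list(rows)
    # Python's sort is stable, so sorting on the last key first and the first
    # key last gives the same order as one multi-key sort.
    for name, numeric, reverse in reversed(parse_sort_spec(spec)):
        result.sort(key=lambda r: float(r[name]) if numeric else r[name],
                    reverse=reverse)
    return result


rows = [{"id": i, "val": v} for i, v in
        [("A", "1"), ("B", "1"), ("B", "2"), ("A", "2"), ("B", "3")]]
print([(r["id"], r["val"]) for r in sort_rows(rows, "id,val%nr")])
# [('A', '2'), ('A', '1'), ('B', '3'), ('B', '2'), ('B', '1')]
```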

Examples

Example 1: Basic Example

After sorting by id, calculate the cumulative sum of the val column.

$ more dat1.csv
id,val
A,1
B,1
B,2
A,2
B,3
$ maccum s=id k=id f=val:val_accum i=dat1.csv o=rsl1.csv
#END# kgaccum f=val:val_accum i=dat1.csv k=id o=rsl1.csv s=id
$ more rsl1.csv
id,val,val_accum
A,1,1
A,2,3
B,1,1
B,2,3
B,3,6

Example 2: Specify sort method

After sorting by id and then by val in descending numeric order, calculate the cumulative sum of the val column.

$ more dat1.csv
id,val
A,1
B,1
B,2
A,2
B,3
$ maccum s=id,val%nr k=id f=val:val_accum i=dat1.csv o=rsl1.csv
#END# kgaccum f=val:val_accum i=dat1.csv k=id o=rsl1.csv s=id,val%nr
$ more rsl1.csv
id,val,val_accum
A,2,2
A,1,3
B,3,3
B,2,5
B,1,6

Related commands

maccum, mbest, mmvavg, mnumber, mslide, etc.

2.9.6 a= Add field name

Specify the name of the field (column) to add. Most commands add their result as a single field, so only one field name is specified. mcombi, however, outputs multiple fields, so multiple field names are specified, separated by commas.

Examples

Example 1: Basic Example

Add a new field named “payday”.

$ more dat1.csv
id
A
B
C
$ msetstr v=20070101 a=payday i=dat1.csv o=rsl1.csv
#END# kgsetstr a=payday i=dat1.csv o=rsl1.csv v=20070101
$ more rsl1.csv
id,payday
A,20070101
B,20070101
C,20070101

Example 2: Add multiple fields

Enumerate every combination of two items from the values A, B, C in the “id” column.

$ mcombi f=id n=2 a=id1,id2 i=dat1.csv o=rsl2.csv
#END# kgcombi a=id1,id2 f=id i=dat1.csv n=2 o=rsl2.csv
$ more rsl2.csv
id,id1,id2
C,A,B
C,A,C
C,B,C

Related command

mcal, mcombi, mrand, msetstr etc.

2.9.7 -nfn CSV without field names (No Field Names)

This option reads input data that has no field name row. When it is specified, fields are referred to by field number instead of field name. Field numbers start at 0 and increase by 1 from left to right. When the -nfn option is specified, field names are not included in the output file either.

Examples

Example 1: Basic Example

Extract columns 0 and 2.

$ more dat1.csv
A,1,10
A,2,20
B,1,15
B,3,10
B,1,20
$ mcut -nfn f=0,2 i=dat1.csv o=rsl1.csv
#END# kgcut -nfn f=0,2 i=dat1.csv o=rsl1.csv
$ more rsl1.csv
A,10
A,20
B,15
B,10
B,20

Related command

This option can be used in all M-Commands except mchkcsv.

2.9.8 -nfno Output without field names (No Field Names for Output)

This option removes the field names from the output data. Unlike -nfn, it assumes that the input data specified in i= and m= includes field names in the first row.

Examples

Example 1: Basic Example

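Extract the customer and amount fields and output them without field names.

$ more dat1.csv
customer,quantity,amount
A,1,10
A,2,20
B,1,15
B,3,10
B,1,20
$ mcut -nfno f=customer,amount i=dat1.csv o=rsl1.csv
#END# kgcut -nfno f=customer,amount i=dat1.csv o=rsl1.csv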
$ more rsl1.csv
A,10
A,20
B,15
B,10
B,20

Related commands

This option can be used in all commands except mchkcsv.

2.9.9 -x Specify by field number

This option lets the user specify fields by field number even though the input data includes field names. An output field name can be given by appending a colon and the new name to the field number.
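
Conceptually, -x maps each field number to the name found in the header row. The Python sketch below illustrates that mapping under the assumption that field numbers count from 0, left to right; resolve_field_numbers is a hypothetical name, not an MCMD function.

```python
def resolve_field_numbers(spec, header):
    """Map "1:a,2" to [("quantity", "a"), ("amount", "amount")] using the header."""
    pairs = []
    for item in spec.split(","):
        number, sep, new_name = item.partition(":")
        name = header[int(number)]  # field numbers count from 0, left to right
        pairs.append((name, new_name if sep else name))
    return pairs


header = ["customer", "quantity", "amount"]
print(resolve_field_numbers("1:a,2:b", header))  # [('quantity', 'a'), ('amount', 'b')]
```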

Examples

Example 1: Basic Example

For each key, compute the sums of columns 1 and 2.

$ more dat1.csv
customer,quantity,amount
A,1,10
A,2,20
B,1,15
B,3,10
B,1,20
$ msum -x k=0 f=1,2 i=dat1.csv o=rsl1.csv
#END# kgsum -x f=1,2 i=dat1.csv k=0 o=rsl1.csv
$ more rsl1.csv
customer,quantity,amount
A,3,30
B,5,45

Example 2: Output column names

Rename columns 1 and 2 to a and b respectively.

$ msum -x k=0 f=1:a,2:b i=dat1.csv o=rsl2.csv
#END# kgsum -x f=1:a,2:b i=dat1.csv k=0 o=rsl2.csv
$ more rsl2.csv
customer,a,b
A,3,30
B,5,45

Example 3: Error when using -nfn

If -nfn is used instead of -x to sum "quantity" and "amount" by field number, the result is not as expected. The -nfn option assumes that data starts from the first row, so the field name row is treated as a data record, whereas -x treats the first row as field names.

$ msum -nfn k=0 f=1,2 i=dat1.csv o=rsl3.csv
#END# kgsum -nfn f=1,2 i=dat1.csv k=0 o=rsl3.csv
$ more rsl3.csv
customer,0,0
A,3,30
B,5,45

Related commands

This option can be used in all commands except mchkcsv.

2.9.10 -q Disable Automatic Sorting

Use this option to disable automatic sorting on fields specified at k= parameter.

When this option is specified together with the k= parameter, the s= parameter is not required, and each command behaves as it did in MCMD Ver. 1.0.

Examples

Example 1: Basic Example

Compute the cumulative sum for each id. Because the -q option is specified, sorting on the field specified in k= is disabled.

$ more dat1.csv
id,val
A,1
B,1
B,2
A,2
B,3
$ maccum -q k=id f=val:val_accum i=dat1.csv o=rsl1.csv
#END# kgaccum -q f=val:val_accum i=dat1.csv k=id o=rsl1.csv
$ more rsl1.csv
id,val,val_accum
A,1,1
B,1,1
B,2,3
A,2,2
B,3,3
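
The result above follows from applying key break processing to unsorted data: the running total restarts whenever the key changes from one row to the next. A minimal Python sketch of this reading is shown below; accumulate_without_sort is a hypothetical name and the code is not the MCMD implementation.

```python
from itertools import groupby


def accumulate_without_sort(rows):
    """Running total of val per id, restarting whenever the id changes."""
    out = []
    # No sort here: groupby only groups adjacent rows with the same key,
    # so a key that reappears later starts a fresh total.
    for key, group in groupby(rows, key=lambda r: r[0]):
        total = 0
        for _, val in group:
            total += val
            out.append((key, val, total))
    return out


rows = [("A", 1), ("B", 1), ("B", 2), ("A", 2), ("B", 3)]
for row in accumulate_without_sort(rows):
    print(*row, sep=",")
```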

Related commands

This option is available in all commands that accept the k= parameter.

2.9.11 precision= Number of significant digits

Real numbers are formatted with the C language sprintf format "%.ng", where n is the specified number of significant digits. This format switches between fixed notation (integer part and fractional part, e.g. 123.456) and exponent notation (mantissa e± exponent, e.g. 1.23456e+02). Exponent notation is used when the exponent is greater than or equal to the specified number of significant digits, or when it is less than -4 (i.e. four or more zeros immediately follow the decimal point).

n can be an integer from 1 to 16; the default is 10. A value of n less than 1 is treated as 1, and a value greater than 16 is treated as 16.

In addition, the number of significant figures can be changed by setting the environment variable KG_Precision. However, changes to the environment variable will affect the execution of all commands.
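
Python's % operator follows the same %g rules as C's sprintf, so the formatting can be tried out with the sketch below. The helper format_real is hypothetical; the clamping to the range 1 to 16 mirrors the rule stated above.

```python
def format_real(value, precision=10):
    """Format value like C's sprintf("%.<n>g"), clamping n to the range 1..16."""
    n = max(1, min(16, precision))
    return "%.*g" % (n, value)


for v in (123456789, 1234.56789, 0.123456789, 0.000123456789, 0.0000123456789):
    print(format_real(v, 6))
# 1.23457e+08
# 1234.57
# 0.123457
# 0.000123457
# 1.23457e-05
```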

Examples

Example 1: Basic Example

With precision=6, the values are formatted as follows. id=1 is 1.23456789e+08 in exponent notation; the exponent 8 is not less than the 6 significant digits, so exponent notation is used. id=2 is 1.23456789e+03; the exponent 3 is less than 6, so fixed notation is used, rounded to 6 significant digits. id=3 is 1.23456789e-01 and id=4 is 1.23456789e-04; neither exponent is less than -4, so fixed notation is used. id=5 is 1.23456789e-05; the exponent -5 is less than -4, so exponent notation is used.

$ more dat1.csv
id,val
1,123456789
2,1234.56789
3,0.123456789
4,0.000123456789
5,0.0000123456789
$ mcal c='${val}' a=result precision=6 i=dat1.csv o=rsl1.csv
#END# kgcal a=result c=${val} i=dat1.csv o=rsl1.csv precision=6
$ more rsl1.csv
id,val,result
1,123456789,1.23457e+08
2,1234.56789,1234.57
3,0.123456789,0.123457
4,0.000123456789,0.000123457
5,0.0000123456789,1.23457e-05

Example 2: Case when precision=2

$ mcal c='${val}' a=result precision=2 i=dat1.csv o=rsl2.csv
#END# kgcal a=result c=${val} i=dat1.csv o=rsl2.csv precision=2
$ more rsl2.csv
id,val,result
1,123456789,1.2e+08
2,1234.56789,1.2e+03
3,0.123456789,0.12
4,0.000123456789,0.00012
5,0.0000123456789,1.2e-05

Example 3: Specify the environment variable

When the environment variable is set, the setting will be applied to all commands in subsequent processes.

$ export KG_Precision=4
$ mcal c='${val}' a=result i=dat1.csv o=rsl3.csv
#END# kgcal a=result c=${val} i=dat1.csv o=rsl3.csv
$ more rsl3.csv
id,val,result
1,123456789,1.235e+08
2,1234.56789,1235
3,0.123456789,0.1235
4,0.000123456789,0.0001235
5,0.0000123456789,1.235e-05

Related commands

This setting applies to all commands that compute real numbers, such as msum and mcal.

2.9.12 tmpPath= Path name of temporary file

Specify the directory in which the command stores its temporary files. For example, msortf saves intermediate results as temporary files during a partitioned sort. If the path is not specified, the files are saved in /tmp. The names of temporary files begin with __KGTMP.

Temporary files are deleted when the command terminates normally (this includes termination by an exit signal or by a signal raised by MCMD). Temporary files remain in the directory when the program terminates unexpectedly, for example because of a power outage or a bug.

Depending on the amount of data, an enormous number of temporary files may be generated (more than 1 million files). Leftover files significantly slow down command execution, so clean out the temporary directory on a regular basis. There are currently no plans to implement a garbage collection function that removes temporary files that are no longer in use.
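
Regular cleanup could be scripted along the following lines. This Python sketch is only a suggestion and not an MCMD tool: remove_stale_tmp is a hypothetical helper, and it assumes that leftover temporary entries are regular files whose names start with __KGTMP and that an age limit is an acceptable test for "no longer in use". The demonstration runs in a throwaway directory rather than in /tmp.

```python
import os
import tempfile
import time


def remove_stale_tmp(directory, max_age_seconds, prefix="__KGTMP"):
    """Delete files in directory whose names start with prefix and are older than the limit."""
    removed = []
    cutoff = time.time() - max_age_seconds
    for entry in os.scandir(directory):
        # Only touch regular files carrying the temporary-file prefix.
        if (entry.is_file() and entry.name.startswith(prefix)
                and entry.stat().st_mtime < cutoff):
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


with tempfile.TemporaryDirectory() as workdir:
    stale = os.path.join(workdir, "__KGTMP_example")
    open(stale, "w").close()
    os.utime(stale, (0, 0))  # pretend the file was last modified in 1970
    open(os.path.join(workdir, "rsl1.csv"), "w").close()
    print(remove_stale_tmp(workdir, max_age_seconds=3600))  # ['__KGTMP_example']
    print(os.listdir(workdir))                              # ['rsl1.csv']
```

Before running such a script against a shared directory, make sure no M-Command is still running, since its temporary files would match the same prefix.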

The temporary directory can also be changed by setting the environment variable KG_TmpPath. Note, however, that the variable applies to the execution of all commands.

Examples

Example 1: Basic Example

Use the tmp directory under the current directory for temporary files.

$ msortf f=val tmpPath=./tmp i=dat1.csv o=rsl1.csv
#END# kgsortf f=val i=dat1.csv o=rsl1.csv tmpPath=./tmp

Example 2: Specify the environment variable

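The setting of the environment variable applies to all subsequent commands. In the run below, the command fails because a temporary file could not be created under the specified directory.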

$ export KG_TmpPath=~/tmp
$ msortf f=val i=dat1.csv o=rsl1.csv
#ERROR# internal error: cannot create temp file (kgsortf)

Related commands

This parameter applies to commands such as msortf and mdelnull, which select records by key field, and to commands such as mbucket, mnjoin, and mshare, which require multiple pass scanning based on the key field.

2.9.13 delim= Delimiter of vector element

Specify the delimiter that separates the elements of vector data. The default delimiter is a single-byte space. When a comma is specified as the delimiter, the vector is enclosed in double quotes to distinguish it from the comma that delimits CSV fields.
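
The effect of the delimiter, and the quoting needed when it is a comma, can be reproduced with Python's standard csv module. The sketch below is an illustration only; sort_vector and to_csv_line are hypothetical helpers, not MCMD functions.

```python
import csv
import io


def sort_vector(cell, delim=" "):
    """Sort the elements of one vector cell, splitting and joining on delim."""
    return delim.join(sorted(cell.split(delim)))


def to_csv_line(fields):
    """Write one CSV row; a field containing a comma is enclosed in double quotes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()


print(sort_vector("b:a:c", ":"))           # a:b:c
print(sort_vector("b:a:c"))                # b:a:c (a single element with the default space)
print(to_csv_line(["1", "a,b"]), end="")   # 1,"a,b"
```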

Examples

Example 1: Basic Example

Sort the elements of the vector field “vec” with colon as a delimiter.

$ more dat1.csv
vec
b:a:c
x:p
$ mvsort vf=vec delim=: i=dat1.csv o=rsl1.csv
#END# kgvsort delim=: i=dat1.csv o=rsl1.csv vf=vec
$ more rsl1.csv
vec
a:b:c
p:x

Example 2: When delim parameter is not specified

Since the delim parameter is not specified, b:a:c and x:p are each interpreted as a single element.

$ mvsort vf=vec i=dat1.csv o=rsl2.csv
#END# kgvsort i=dat1.csv o=rsl2.csv vf=vec
$ more rsl2.csv
vec
b:a:c
x:p

Example 3: Use comma as delimiter

If a comma is used as the vector delimiter, the entire vector is enclosed in double quotes to distinguish the vector delimiter from the CSV delimiter.

$ more dat2.csv
id,vec1,vec2
1,a,b
2,p,q
$ mvcat vf=vec1,vec2 a=vec3 delim=, i=dat2.csv o=rsl3.csv
#END# kgvcat a=vec3 delim=, i=dat2.csv o=rsl3.csv vf=vec1,vec2
$ more rsl3.csv
id,vec3
1,"a,b"
2,"p,q"

Related commands

This parameter can be used in all vector-related commands, such as mvcat and mvsort.

2.9.14 bufcount= Buffer size

Specify the size of the internal buffer, in blocks, used by commands such as mbucket, mnjoin, and mshare, which need to scan the data of each key multiple times. One block is 4MB, and the default is 10 blocks (40MB). If the buffer overflows, the data is written to a temporary file. When the data for a single key is very large, processing can be sped up by increasing this parameter, memory permitting.
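
The arithmetic behind this parameter is simple enough to check by hand, but a short Python sketch may help when choosing a value. The names below are hypothetical, and the sketch assumes that 1MB means 1024 × 1024 bytes, which the text does not state.

```python
BLOCK_BYTES = 4 * 1024 * 1024  # one buffer block is 4MB (assuming 1MB = 1024 * 1024 bytes)


def buffer_bytes(bufcount=10):
    """Total buffer size in bytes for a given bufcount."""
    return bufcount * BLOCK_BYTES


def needs_temp_file(key_group_bytes, bufcount=10):
    """True if the records of one key would overflow the in-memory buffer."""
    return key_group_bytes > buffer_bytes(bufcount)


fifty_mb = 50 * 1024 * 1024
print(needs_temp_file(fifty_mb))               # True: exceeds the default 40MB
print(needs_temp_file(fifty_mb, bufcount=20))  # False: fits in 80MB
```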

Examples

Example 1: Basic Example

If the data for each key in the reference file is smaller than 80MB (4MB × 20), no temporary file is used.

$ mnjoin k=id m=ref.csv f=name i=dat.csv o=rsl.csv bufcount=20
#END# kgnjoin bufcount=20 f=name i=dat.csv k=id m=ref.csv o=rsl.csv

Related command

Commands that require multiple pass scanning of the data to process key units, such as mbucket, mnjoin, and mshare.