2.11 Key Break Processing

Key break processing treats the records that share the same value in the specified key field as one group and processes each group as a unit. It is broadly divided into two types: key break processing for aggregate calculation (referred to as "aggregate key break processing" below) and key break processing for joins (referred to as "join key break processing" below).

Join key break processing is executed by commands whose names contain "join" or "common", such as mjoin and mcommon. Aggregate key break processing is carried out by the other commands that take the k= parameter.

For example, when the msum command performs aggregate key break processing, it detects each change of value in the key field and aggregates the records that share the same key. The records must therefore be sorted by the key field beforehand. Unless the input file is already sorted, msum sorts it before aggregating.
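The idea can be illustrated with a minimal Python sketch. This is not the msum implementation; the function name sum_by_key, the field names, and the sample records are all hypothetical. It sorts the records by key as character strings, then emits a total each time the key value changes.

```python
def sum_by_key(records, key, field):
    """Sum `field` over each run of records that share the same `key` value."""
    # Sort by the key as a character string so that equal keys become adjacent.
    ordered = sorted(records, key=lambda r: r[key])
    results = []
    current_key = None
    total = 0
    for rec in ordered:
        if current_key is not None and rec[key] != current_key:
            # Key break: the key value changed, so emit the finished group.
            results.append((current_key, total))
            total = 0
        current_key = rec[key]
        total += int(rec[field])
    if current_key is not None:
        # Emit the last group, which is not followed by a key change.
        results.append((current_key, total))
    return results


# Hypothetical sample data, deliberately unsorted.
sales = [
    {"customer": "B", "amount": "20"},
    {"customer": "A", "amount": "10"},
    {"customer": "B", "amount": "5"},
    {"customer": "A", "amount": "7"},
]
print(sum_by_key(sales, "customer", "amount"))  # [('A', 17), ('B', 25)]
```

Because a group is closed as soon as the key changes, only one running total is held at a time, which is why the input has to be in key order.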

Join key break processing is more complicated. For instance, the mjoin command takes two data files and compares the values in their key fields. The file whose current key value is smaller is read forward, and records are joined when the key values in the input file and the reference file match. Because this comparison depends on key order, both data files must be sorted by their key fields beforehand. Therefore, in this version, mjoin sorts the two data files it uses.
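The comparison loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the mjoin implementation: the function name merge_join and the sample data are hypothetical, keys in the reference data are assumed to be unique, and input records without a match are simply dropped.

```python
def merge_join(input_rows, ref_rows, key, field):
    """Append `field` from ref_rows to each input row that has an equal `key`."""
    # Both sides are sorted by key (as character strings) before comparing.
    left = sorted(input_rows, key=lambda r: r[key])
    right = sorted(ref_rows, key=lambda r: r[key])
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1  # input key is smaller: read the input side forward
        elif left_key > right_key:
            j += 1  # reference key is smaller: read the reference side forward
        else:
            # Keys match: join, then advance only the input side so that
            # further input rows with the same key reuse this reference row.
            joined.append({**left[i], field: right[j][field]})
            i += 1
    return joined


# Hypothetical sample data.
customers = [
    {"id": "3", "name": "Cho"},
    {"id": "1", "name": "Abe"},
    {"id": "2", "name": "Bo"},
]
addresses = [
    {"id": "2", "address": "Osaka"},
    {"id": "1", "address": "Tokyo"},
    {"id": "4", "address": "Nara"},
]
for row in merge_join(customers, addresses, "id", "address"):
    print(row)
```

Each file is read once from start to end, which is only possible because both are in the same key order.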

By default, both kinds of key break processing sort in ascending character string order. However, when the mrjoin command joins records by numerical range, sorting is carried out in ascending numeric order.
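The two orders can give different results for the same values, as this short Python sketch shows (the sample values are hypothetical):

```python
values = ["10", "9", "100", "2"]

# Character string ascending order compares the values character by character.
as_strings = sorted(values)

# Numeric ascending order compares the values as numbers.
as_numbers = sorted(values, key=float)

print(as_strings)  # ['10', '100', '2', '9']
print(as_numbers)  # ['2', '9', '10', '100']
```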

The fields specified in the k= parameter are sorted automatically, and the other commands likewise have predetermined automatic sorting, so users do not need to work out whether an input file requires sorting. Although users no longer need to run a sort command explicitly, note that sorting still takes place internally within each command. Depending on how a script is constructed, sorting may therefore occur frequently, which can reduce performance.

Example of Script

Example of script when sorting takes place frequently

First, the data is sorted by the name column and saved as the xxcustomer output file. Afterwards, the mjoin command joins on the id key field. In this case, mjoin is executed three times, and the id column of the xxcustomer input data is sorted at each execution.

mcut   i=customer.csv f=id,name |
msortf f=name o=xxcustomer

mjoin i=xxcustomer m=address.csv k=id f=address o=cust_address.csv
mjoin i=xxcustomer m=phone.csv   k=id f=phone   o=cust_phone.csv
mjoin i=xxcustomer m=age.csv     k=id f=age     o=cust_age.csv

Example of script to minimize sorting

When the script is modified as follows, xxcustomer is sorted by the id field before it is saved, so the mjoin commands do not automatically sort the input file.

mcut   i=customer.csv f=id,name |
msortf f=id o=xxcustomer

mjoin i=xxcustomer m=address.csv k=id f=address o=cust_address.csv
mjoin i=xxcustomer m=phone.csv   k=id f=phone   o=cust_phone.csv
mjoin i=xxcustomer m=age.csv     k=id f=age     o=cust_age.csv
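The difference between the two scripts can be modelled with a small Python sketch. How the commands actually detect that a file is already sorted is not described in this section, so the scan in is_sorted_by is purely an assumption for illustration. The function names and sample data are hypothetical.

```python
def is_sorted_by(rows, key):
    """Return True when rows are already in ascending string order of `key`."""
    return all(a[key] <= b[key] for a, b in zip(rows, rows[1:]))


def count_input_sorts(rows, key, runs):
    """Count how many of `runs` join-like steps would need to sort `rows`."""
    sorts = 0
    for _ in range(runs):
        if not is_sorted_by(rows, key):
            # Each step sorts its own copy; the saved file stays unchanged,
            # so the next step has to sort again.
            sorts += 1
    return sorts


# Hypothetical customer data.
customers = [
    {"id": "2", "name": "Abe"},
    {"id": "3", "name": "Bo"},
    {"id": "1", "name": "Cho"},
]
by_name = sorted(customers, key=lambda r: r["name"])  # first script
by_id = sorted(customers, key=lambda r: r["id"])      # second script

print(count_input_sorts(by_name, "id", 3))  # 3
print(count_input_sorts(by_id, "id", 3))    # 0
```

In this model, saving the file in id order once replaces three internal sorts of the same input.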