MCMD handles multibyte characters such as Chinese characters in UTF-8 encoding. Other encodings such as SHIFT_JIS can be treated as multibyte characters, however, some functions may not work correctly. The following explains how MCMD process multibyte characters.
Kanji-code is processed as multibyte characters without conversion in order to increase the processing speed when using MCMD. However, character string search and string substitution functions may result in unexpected results depending on the encoding.
For example, "陰(shadow)" is represented as 0x8941 in SHIFT_JIS, the second byte of this character refers to "A" in single-byte characters. Thus, when “A” is substituted with “B”, "陰” will be converted to “隠 (hidden)” (0x8942). The UTF-8 uses an encoding system which could avoid problems with character substitution. Moreover, it is difficult to count the number of characters in strings containing multibyte characters and ASCII characters even in UTF-8.
This problem can be avoided by converting all characters including ASCII code to fixed length character, known as wide character (MCMD adopts 32-bit fixed length).
When converting wide characters, it is necessary to find out the encoding for multibyte characters in the environment variable LANG. Type the following at the command prompt to check the environment variable, .
$ echo $LANG ja_JP.UTF-8
Some MCMD commands have built-in option (-W) to convert input data to wide characters before data processing. The list of commands which support the option is shown in Table 2.3. These commands pertain to search or replace functions, it is not necessary to use this option if encoding is set as UTF-8.
Command name |
Function |
Description |
mchgstr |
Substitution |
-By specifying -W, the field data specified by f= is converted to wide characters internally. |
mselstr |
Search |
In case of substring matching (-sub), |
the field data specified by f= is converted to wide characters internally. |
||
msed |
Substitution |
By specifying -W, the field data specified by f= is converted to wide characters internally. |
mtonull |
Search |
For substring matching (-sub), |
he field data specified by f= is converted to wide characters internally. |
In addition, mcal and msel incorporated functions to handle wide characters (Table 2.4). For instance, the lengthw function counts the number of characters and computes the character position for data in UTF-8 encoding.
Name of the function |
Function |
Details |
lengthw |
Number of characters |
Convert target string to wide character before processing. |
midw |
Substring |
Convert target string to wide character before processing. |
rightw |
Substring |
Convert target string to wide character before processing. |
leftw |
Substring |
Convert target string to wide character before processing. |
regexsw |
Match regular expression |
Convert target string to wide character before processing. |
regexmw |
Match regular expression |
Convert target string to wide character. |
regexrepw |
Substitute by regular expression |
Convert target string to wide character. |
regexlenw |
Match length by regular expression |
Convert target string to wide character. |
regexposw |
Match position by regular expression |
Convert target string to wide character. |
regexstrw |
Substring match by regular expression |
Convert target string to wide character. |
regexpfxw |
Prefix by regular expression |
Convert target string to wide character. |
regexsfxw |
Suffix match by regular expression |
Convert target string to wide character. |
Take note of the following when handling wide-character.
Conversion to wide character involves overhead which sacrifices the processing speed.
Wide characters input data can be converted except for the field names.
File name with multibyte characters can be processed as it.