2.8 Multibyte Characters

MCMD handles multibyte characters, such as Chinese characters, in UTF-8 encoding. Other encodings such as SHIFT_JIS can also be treated as multibyte characters, but some functions may not work correctly with them. This section explains how MCMD processes multibyte characters.

To keep processing fast, MCMD handles kanji codes as multibyte characters without conversion. However, depending on the encoding, string search and string substitution may produce unexpected results.

For example, "陰 (shadow)" is represented as 0x8941 in SHIFT_JIS, and the second byte of this character (0x41) is the same as the single-byte character "A". Thus, when "A" is substituted with "B", "陰" is converted to "隠 (hidden)" (0x8942). UTF-8 is designed so that this substitution problem does not arise, because no byte of a multibyte character coincides with an ASCII byte. Even in UTF-8, however, it is difficult to count the number of characters in a string that mixes multibyte characters and ASCII characters, because the number of bytes per character varies.
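The substitution problem can be reproduced outside MCMD with a few lines of Python. This is only an illustrative sketch of byte-wise replacement, not MCMD's implementation; the variable names are chosen for this example.

```python
# Byte-wise substitution on SHIFT_JIS data can corrupt a kanji whose
# second byte happens to equal an ASCII letter.
text = "陰"

# SHIFT_JIS: two bytes, 0x89 0x41. The second byte equals ASCII "A".
sjis = text.encode("shift_jis")
sjis_replaced = sjis.replace(b"A", b"B")      # naive byte-wise "A" -> "B"
sjis_result = sjis_replaced.decode("shift_jis")

# UTF-8: every byte of a multibyte character is 0x80 or above,
# so a byte-wise ASCII substitution cannot touch it.
utf8 = text.encode("utf-8")
utf8_replaced = utf8.replace(b"A", b"B")
utf8_result = utf8_replaced.decode("utf-8")

print(sjis.hex(), "->", sjis_replaced.hex(), sjis_result)
print(utf8.hex(), "->", utf8_replaced.hex(), utf8_result)
```

In the SHIFT_JIS case the character changes even though the input contains no letter "A"; in the UTF-8 case the bytes are left untouched.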

This problem can be avoided by converting all characters, including ASCII characters, to fixed-length characters known as wide characters (MCMD uses a 32-bit fixed length).
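The effect of a fixed-length representation can be sketched in Python, using UTF-32 as a stand-in for 32-bit wide characters. This is an assumption made for illustration; the source does not specify MCMD's internal wide-character format beyond its 32-bit width.

```python
# Counting and indexing characters: variable-length UTF-8 bytes versus
# a fixed 4-bytes-per-character representation (UTF-32, little-endian, no BOM).
s = "ABC陰隠"

utf8_bytes = s.encode("utf-8")
wide = s.encode("utf-32-le")

byte_count = len(utf8_bytes)   # 3 ASCII bytes + 2 kanji * 3 bytes each
char_count = len(wide) // 4    # every character occupies exactly 4 bytes

# With a fixed width, the n-th character (0-based) sits at byte offset 4 * n.
n = 3
fourth_char = wide[4 * n : 4 * (n + 1)].decode("utf-32-le")

print(byte_count, char_count, fourth_char)
```

The byte count and the character count differ for the UTF-8 data, whereas in the fixed-width form the character count and any character position follow directly from the byte length.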

To convert to wide characters, the encoding of the multibyte characters must be known; it is identified from the environment variable LANG. To check this environment variable, type the following at the command prompt.

$ echo $LANG
ja_JP.UTF-8
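A script that prepares data for MCMD could make the same check programmatically. The sketch below extracts the codeset part of a locale name of the form language[_territory][.codeset][@modifier]; the helper name codeset_from_lang is hypothetical and not part of MCMD.

```python
import os


def codeset_from_lang(lang):
    """Return the codeset part of a locale name, or None if it has none.

    Hypothetical helper for illustration; not part of MCMD.
    """
    if not lang or "." not in lang:
        return None
    codeset = lang.split(".", 1)[1]      # drop language and territory
    codeset = codeset.split("@", 1)[0]   # drop an optional @modifier
    return codeset or None


# Inspect the current environment (prints None if LANG is unset or has no codeset).
print(codeset_from_lang(os.environ.get("LANG", "")))
```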

Some MCMD commands have a built-in option (-W) that converts input data to wide characters before processing. The commands that support this option are listed in Table 2.3. All of these commands perform search or substitution. The option is not necessary if the encoding is UTF-8.

Table 2.3: List of commands with wide character conversion function

Command name	Function	Description
mchgstr	Substitution	By specifying -W, the field data specified by f= is converted to wide characters internally.
mselstr	Search	For substring matching (-sub), the field data specified by f= is converted to wide characters internally.
msed	Substitution	By specifying -W, the field data specified by f= is converted to wide characters internally.
mtonull	Search	For substring matching (-sub), the field data specified by f= is converted to wide characters internally.

In addition, mcal and msel provide functions that handle wide characters (Table 2.4). For instance, functions such as lengthw are needed to count the number of characters or to compute character positions, even for data in UTF-8 encoding.

Table 2.4: List of mcal functions with wide character conversion function

Name of the function	Function	Details
lengthw	Number of characters	Converts the target string to wide characters before processing.
midw	Substring	Converts the target string to wide characters before processing.
rightw	Substring	Converts the target string to wide characters before processing.
leftw	Substring	Converts the target string to wide characters before processing.
regexsw	Match regular expression	Converts the target string to wide characters before processing.
regexmw	Match regular expression	Converts the target string to wide characters before processing.
regexrepw	Substitute by regular expression	Converts the target string to wide characters before processing.
regexlenw	Match length by regular expression	Converts the target string to wide characters before processing.
regexposw	Match position by regular expression	Converts the target string to wide characters before processing.
regexstrw	Substring match by regular expression	Converts the target string to wide characters before processing.
regexpfxw	Prefix by regular expression	Converts the target string to wide characters before processing.
regexsfxw	Suffix match by regular expression	Converts the target string to wide characters before processing.
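Why regular-expression functions need a wide-character variant can be illustrated in Python by matching the same pattern against raw UTF-8 bytes and against decoded characters. This only sketches the general byte-versus-character difference; it does not reproduce the behavior of the mcal functions above.

```python
import re

s = "陰AB"
raw = s.encode("utf-8")   # the kanji occupies 3 bytes, each letter 1 byte

# "." matches one unit: a byte in the raw data, a character in the decoded string.
byte_units = re.findall(rb".", raw)
char_units = re.findall(r".", s)

# The match position of "A" is counted in bytes or in characters, respectively.
byte_pos = re.search(rb"A", raw).start()
char_pos = re.search(r"A", s).start()

print(len(byte_units), len(char_units))
print(byte_pos, char_pos)
```

Match lengths and positions computed on the byte form are measured in bytes, so they only agree with character counts when the data is converted to characters first.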

Take note of the following when handling wide characters.

Conversion to wide characters incurs overhead, which reduces processing speed.
Only input data is converted to wide characters; field names are not converted.
File names containing multibyte characters can be processed as is.