Convert XML formatted data to CSV. The basic rule of conversion is by specifying the element as unit of each record (XML tag) and the element corresponding to the column (or attribute) at the parameters k=, f=. The value of the column can be specified in four ways : text bounded by elements, presence of elements, the value of the attribute, presence of attributes.
When SAX is used as the parser of XML, there is no size constraints of XML. If other encoding besides UTF-8 is used in the XML file, the XML file is converted to UTF-8 and output as CSV. XML data should be structured in a complete, well-formed XML document. Otherwise, the program may return unexpected processing results.
Table 3.50 shows a typical format of XML data. More details are explained in the next section, however, a brief outline is illustrated as follows.
Table 3.51 shows the returned output. Element <b> is used as the key unit of each record (the element is referred to as "key element"). The column is defined by the attribute of element b for the value of att (CSV column name b_att). The attributes of element c includes the value of p (b_p) and flag (b_p_f), as well as the text inside element d and a (d, a).
Here, the flag indicates the presence of specified elements or attributes by the value 0-1 in the output. The text output of the element includes the concatenation of all strings in the range within specified element. However, note that the spaces and the control characters are not included in the output.
Table 3.50: input XML data <a att="aa"> <b att="bb1"> <c p="pp1" q="qq1"/> <d>text1</d> </b> <b att="bb2"> <c q="qq2"></c> <d>text2</d> </b> <b> <c p="pp3"/> <d>text3</d> </b> </a> Table 3.51: key:/a/b, item:b@att,c@p,d,/a
|
Specify the key element as the key unit of each record (specified in the parameter k=) with an absolute path. The absolute path is defined from the root directory starting from the symbol ('/'), the hierarchy of elements is separated by the sign '/'. The role of the key elements in this command is to perform the following two functions corresponding to the end tag of the key elements.
Output one row of column data. In the example above, the end tag of the key elements </b> has appeared three times, one row of CSV data is inserted as a new line for every instance in the output.
Initialize the column data. However, it does not initialize the column elements outside the key elements in the output data. In Table 3.51, the text in the output of element a will be consolidated, even when the end tag of the key elements has emerged, the element /a is outside the key elements /a/b. As a result, the output data is not initialized.
If the element defined at f= is returned as a CSV field in the output, follow the format shown below.
Element path[%flag]: CSV field name
"Field name" is the column name in the CSV output which must be specified.
There are two methods to display elements as columns in the output. The first method is to return the text enclosed by the opening and closing tags of the specified element. The other method is to return 0-1 value to indicate whether the specified element exists. The target element path is defined in the former method, and the the flag %f is added when using the latter method.
The two methods of specifying the element path include absolute and relative paths. A relative path can be specified by defining the path from elements at k=. Table 3.50 shows examples on how to specify the element paths of the XML data.
Given k=/a/b, when f=:B is specified, the key elements is the same when relative path is nil. B is the column name of CSV.
Given k=/a/b, f=c:C and f=/a/b/c:C performs the same function. The former is specified by relative path, while the latter by an absolute path. The CSV field name is defined as C for both methods.
f=d:D returns the text within the element d. f=d%f:D returns output when element \verbd| exists. The field name of CSV is D.
When k=/a/b, it is assumed that f=/a:A, the column element is outside the key element, the text contained in the element a are combined in sequential order. The end tag of the key element exists, however, the end tag of the field element does not exists, this is because data is not cleared at the time.
If the attribute defined at f= is returned as a CSV column, use the format shown below.
Element path@Element name[%flag]:CSV field name
"Field name" is the column name in the CSV output which must be specified.
The method of specifying the element path is the same as specifying the elements in the output field. The attribute name is specified after the element path connected with @. By adding %f after the element name, the presence of the element can be indicated by 0-1 value in the output.
mxml2csv k= f= [i=] [o=] [-nfn] [-nfno] [--help] [--version]
k= Specify the pathname from the root based on the element as the unit of records. The path starts from the root symbol ’/’, and the specified elements are connected with ’/’. Example: /article/sentence/chunk f= Specify the element or attribute as fields in the output by delimiting the field names with comma. Format is as follows. Element path[%flag]:CSV field name Element path@Element name[%flag]:CSV field name i= Specify the file name of XML data. The input is read from standard input by default when the input file is not specified.
The example below is illustrated in the summary above. Output the 5 CSV fields with /a/b set as the key elements.
$ more dat1.xml <a att="aa"> <b att="bb1"> <c p="pp1" q="qq1"/> <d>text1</d> </b> <b att="bb2"> <c q="qq2"></c> <d>text2</d> </b> <b> <c p="pp3"/> <d>text3</d> </b> </a> $ mxml2csv k=/a/b f=@att:b_att,c@p:c_p,c@p%f:c_p_f,d:d,/a:a i=dat1.xml o=rsl1.csv #END# kgxml2csv f=@att:b_att,c@p:c_p,c@p%f:c_p_f,d:d,/a:a i=dat1.xml k=/a/b o=rsl1.csv $ more rsl1.csv b_att,c_p,c_p_f,d,a bb1,pp1,1,text1,text1 bb2,,,text2,text1text2 ,pp3,1,text3,text1text2text3
Specification of same element as in the basic example with an absolute path. Output the 5 CSV fields with /a/b as the key elements.
$ mxml2csv k=/a/b f=/a/b@att:b_att,/a/b/c@p:c_p,/a/b/c@p%f:c_p_f,/a/b/d:d,/a:a i=dat1.xml o=rsl2.csv #END# kgxml2csv f=/a/b@att:b_att,/a/b/c@p:c_p,/a/b/c@p%f:c_p_f,/a/b/d:d,/a:a i=dat1.xml k=/a/b o=rsl2.csv $ more rsl2.csv b_att,c_p,c_p_f,d,a bb1,pp1,1,text1,text1 bb2,,,text2,text1text2 ,pp3,1,text3,text1text2text3
Example of changing a key element to a using an absolute path. Since there is only one end tag a, one row of record will be returned as output. /a/b@att specified at f= appeared twice, the last value is returned as output.
$ mxml2csv k=/a f=/a/b@att:b_att,/a/b/c@p:c_p,/a/b/c@p%f:c_p_f,/a/b/d:d,/a:a i=dat1.xml o=rsl3.csv #END# kgxml2csv f=/a/b@att:b_att,/a/b/c@p:c_p,/a/b/c@p%f:c_p_f,/a/b/d:d,/a:a i=dat1.xml k=/a o=rsl3.csv $ more rsl3.csv b_att,c_p,c_p_f,d,a bb2,pp3,1,text3,text1text2text3