2.5 mdtree.rb Draw a decision tree model from PMML

This command visualizes a decision tree model described in PMML (Predictive Model Markup Language) and generates a diagram in HTML format using the D3 library. It was created to visualize the output of the mbonsai command, but it can also be applied to decision tree models in PMML format generated by other software.

PMML defines how to describe branch rules on numeric values and categories, but it has no way to record the sequence-pattern branches used by mbonsai. mbonsai therefore defines an extension tag to record branching on sequence patterns, and mdtree.rb can visualize decision trees that contain this extension tag.
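To make the structure of such a model concrete, the following is a minimal sketch of walking the nested Node elements of a standard PMML TreeModel and printing each branch rule. The sample document is hand-written for illustration and is not mbonsai output; the helper names (local, walk, describe_tree) are hypothetical, and the mbonsai extension tag for sequence patterns is deliberately left out because its layout is not described here.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical PMML fragment (NOT produced by mbonsai).
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_0" version="4.0">
  <TreeModel functionName="classification">
    <Node id="1" score="Yes" recordCount="10">
      <True/>
      <Node id="2" score="No" recordCount="4">
        <SimplePredicate field="VisitGap" operator="lessOrEqual" value="2.15"/>
      </Node>
      <Node id="3" score="Yes" recordCount="6">
        <SimplePredicate field="VisitGap" operator="greaterThan" value="2.15"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""


def local(tag):
    """Strip the XML namespace prefix '{...}' from an element tag."""
    return tag.rsplit("}", 1)[-1]


def walk(node, depth=0, out=None):
    """Collect one indented line per tree node: rule -> predicted class."""
    out = [] if out is None else out
    rule = "root"
    for child in node:
        if local(child.tag) == "SimplePredicate":
            rule = "{} {} {}".format(
                child.get("field"), child.get("operator"), child.get("value"))
    out.append("  " * depth + "{} -> {} (n={})".format(
        rule, node.get("score"), node.get("recordCount")))
    for child in node:
        if local(child.tag) == "Node":
            walk(child, depth + 1, out)
    return out


def describe_tree(pmml_text):
    """Return the branch rules of the first TreeModel in a PMML document."""
    root = ET.fromstring(pmml_text)
    tree_model = next(e for e in root.iter() if local(e.tag) == "TreeModel")
    top = next(c for c in tree_model if local(c.tag) == "Node")
    return walk(top)


if __name__ == "__main__":
    print("\n".join(describe_tree(SAMPLE_PMML)))
```

A sketch like this only handles SimplePredicate rules; compound predicates, set predicates, and vendor extensions would need additional cases.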

The following examples illustrate decision trees constructed with mbonsai.

Table 2.6: Input data dat1.csv. Refer to the examples for the full data.

Gender  VisitGap  PurchasePattern  Hospitalized
Male    1.2       ABCAAA           Yes
Male    10.5      BCDADD           Yes
Male    0.5       AAAA             No
Male    2.0       BBCC             No
Male    3.1       DEDDA            Yes
Female  0.7       CCCAA            No
Female  1.5       DDDEEE           Yes
Female  2.6       BACD             Yes
Female  3.5       ABBB             Yes
Female  4.0       DDDD             Yes
Female  2.1       DEDE             No
:       :         :                :

Table 2.6 shows the training data used to construct a decision tree model with the mbonsai command. The decision tree is saved as the PMML file model.pmml in the directory specified at O=.

$ mbonsai c=Hospitalized n=VisitGap p=PurchasePattern d=Gender i=dat1.csv O=outdat
#END# kgbonsai O=outdat c=Hospitalized d=Gender i=dat1.csv n=VisitGap p=PurchasePattern; IN=81;
$ ls outdat
alpha_list.csv model.pmml     model.txt      model_info.csv param.csv      predict.csv

The following command visualizes model.pmml. The output is written to model.html, which is shown in Figure 2.7.

$ mdtree.rb i=outdat/model.pmml o=model.html
#END# mdtree.rb i=outdat/model.pmml o=model.html;
$ open model.html 

\includegraphics[scale=0.5]{figure/tree_1.eps}
Figure 2.7: Decision tree drawn by this command. The pie chart inside each node shows the class distribution (the color of each class is given in the legend). A node with a dotted outline is an intermediate node, and a node with a solid outline is a leaf node. The item name used for branching is shown below the node, and the branch rule is shown above the node. For example, at the first child level, records with a VisitGap below 2.15 go to the left branch and records with a VisitGap above 2.15 go to the right branch, with no overlap between the two. When a node branches on a sequence pattern, the branch that contains the pattern is shown on the left and the branch that does not contain it is shown on the right. For instance, the left node on the second level from the top covers purchase patterns containing "44". The characters of the sequence patterns and their corresponding indexes are listed in the alphabet-index table in the upper left of the figure.
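The two kinds of branch rule described in the caption can be sketched in a few lines. The example below uses only the 11 rows listed in Table 2.6, so the class counts will not match the figure, which was built from the full file. Two things are assumptions made for illustration: the alphabet-index mapping A=1 through E=5 (taken from the "ABCDE = 12345" line that mbonsai prints), and the reading of "contains the pattern" as a contiguous substring match. All function names are hypothetical.

```python
from collections import Counter

# The 11 rows shown in Table 2.6: (VisitGap, PurchasePattern, Hospitalized).
ROWS = [
    (1.2, "ABCAAA", "Yes"), (10.5, "BCDADD", "Yes"), (0.5, "AAAA", "No"),
    (2.0, "BBCC", "No"), (3.1, "DEDDA", "Yes"), (0.7, "CCCAA", "No"),
    (1.5, "DDDEEE", "Yes"), (2.6, "BACD", "Yes"), (3.5, "ABBB", "Yes"),
    (4.0, "DDDD", "Yes"), (2.1, "DEDE", "No"),
]

# Assumed alphabet-index table (A=1 ... E=5).
ALPHABET_INDEX = {"A": "1", "B": "2", "C": "3", "D": "4", "E": "5"}


def class_counts(rows):
    """Class distribution of a node, i.e. what the pie chart displays."""
    return dict(Counter(label for _, _, label in rows))


def split_numeric(rows, threshold):
    """Numeric rule: values below the threshold go left, the rest go right."""
    left = [r for r in rows if r[0] < threshold]
    right = [r for r in rows if r[0] >= threshold]
    return left, right


def encode(sequence):
    """Rewrite a sequence such as 'DDDD' with its index characters ('4444')."""
    return "".join(ALPHABET_INDEX[ch] for ch in sequence)


def split_pattern(rows, pattern):
    """Pattern rule: rows containing the pattern go left, the rest go right.

    Containment is treated as a contiguous substring match (an assumption).
    """
    left = [r for r in rows if pattern in encode(r[1])]
    right = [r for r in rows if pattern not in encode(r[1])]
    return left, right


if __name__ == "__main__":
    for name, (left, right) in [
        ("VisitGap 2.15", split_numeric(ROWS, 2.15)),
        ("pattern 44", split_pattern(ROWS, "44")),
    ]:
        print(name, "left:", class_counts(left), "right:", class_counts(right))
```

In the figure the pattern rule is applied inside a subtree; here it is applied to all 11 rows only to keep the sketch short.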

mbonsai stores the maximum tree it built, so the level of pruning can be controlled when drawing by specifying the pruning degree at alpha=. Specify a value greater than 0; the larger the value, the more branches are pruned. When alpha= is not specified and cross validation was not specified for mbonsai, alpha=0.01 is used. If cross validation was specified, the model with the minimum misclassification rate is rendered.

Figure 2.8 shows the decision tree pruned with alpha=0.1.

$ mdtree.rb alpha=0.1 i=outdat/model.pmml o=model2.html
#END# mdtree.rb alpha=0.1 i=outdat/model.pmml o=model2.html;
$ open model2.html 

\includegraphics[scale=0.5]{figure/tree_2.eps}
Figure 2.8: Decision tree drawn with the pruning degree set to alpha=0.1
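To compare several pruning degrees side by side, the same command can be generated for a list of alpha values. The sketch below only assembles the command lines from the documented i=, o=, and alpha= parameters; the output file naming scheme (model_alpha_<value>.html) and the function names are hypothetical, and nothing is executed unless the commands are passed to subprocess.

```python
import shlex


def mdtree_command(pmml_path, html_path, alpha=None):
    """Build the argument list for one mdtree.rb run (key=value style)."""
    args = ["mdtree.rb"]
    if alpha is not None:
        args.append("alpha={}".format(alpha))
    args += ["i={}".format(pmml_path), "o={}".format(html_path)]
    return args


def commands_for_alphas(pmml_path, alphas):
    """One command per pruning degree, each writing to its own HTML file."""
    return [
        mdtree_command(pmml_path, "model_alpha_{}.html".format(a), a)
        for a in alphas
    ]


if __name__ == "__main__":
    for cmd in commands_for_alphas("outdat/model.pmml", [0.01, 0.1, 0.5]):
        # To actually run a command: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

Opening the resulting HTML files next to each other makes it easy to see how the tree shrinks as alpha grows.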

2.5.1 Collaboration with R

Many decision tree packages are available for the R statistical analysis environment. This section explains how to draw a decision tree built with the rpart library.

We build decision trees with an R script for two data sets: the iris data set (iris) and the prostate cancer data set (stagec). Each tree is built with the rpart library and the model is saved in PMML format.

This manual does not go into the details of the data sets or of how the decision tree models are built. Note that the pmml, XML, and rpart R packages must be installed before running the following examples.

The following script saves the decision trees for the iris and prostate cancer data as the PMML files model_r1.pmml and model_r2.pmml, respectively.

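# Load the pmml package (PMML export) and rpart (decision trees).
library(pmml)
library(rpart)

# Iris data: classify Species from all other columns.
iris.rp <- rpart(Species ~ ., data = iris)
# Redirect console output to a file and print the PMML document into it.
sink("model_r1.pmml")
print(pmml(iris.rp))
sink()

# Prostate cancer data (stagec): recode the progression status as a factor.
stagec$progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy, data = stagec, method = "class")
sink("model_r2.pmml")
print(pmml(cfit))
sink()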

After obtaining the two PMML files, draw the decision trees with the same procedure as before. The resulting trees are shown in Figures 2.9 and 2.10.

$ mdtree.rb i=model_r1.pmml o=out_r1.html
#END# mdtree.rb i=model_r1.pmml o=out_r1.html;
$ mdtree.rb i=model_r2.pmml o=out_r2.html
#END# mdtree.rb i=model_r2.pmml o=out_r2.html;
$ open out_r1.html
$ open out_r2.html

\includegraphics[scale=0.5]{figure/tree_3.eps}
Figure 2.9: Decision tree of the iris data set

\includegraphics[scale=0.5]{figure/tree_4.eps}
Figure 2.10: Decision tree of the prostate cancer data set

2.5.2 Format

mdtree.rb i= o= [alpha=] [--help]

i=

: PMML file of decision tree model

o=

: Output file (HTML file)

alpha=

: Specify the pruning degree as a value greater than 0 (the larger the value, the more branches are pruned). When this parameter is not specified and cross validation was not specified for mbonsai, the value is set to 0.01. If cross validation was specified, the model with the minimum misclassification rate is rendered. This parameter is only valid for decision trees built with mbonsai.

--help

: Show help

2.5.3 Example

Example 1: Basic Example

This is the example from the section above, shown with the complete input data.

$ cat dat1.csv
gender,visitgap,purchasepattern,hospitalized
Male,1.2,ABCAAA,Yes
Male,10.5,BCDADD,Yes
Male,0.5,AAAA,No
Male,2.0,BBCC,No
Male,3.1,DEDDA,Yes
Female,0.7,CCCAA,No
Female,1.5,DDDEEE,Yes
Female,2.6,BACD,Yes
Female,3.5,ABBB,Yes
Female,4.0,DDDD,Yes
Female,2.1,DEDE,No
Male,1.2,ABCAAA,Yes
Male,10.5,BCDADD,Yes
Male,0.5,AAAA,No
Male,2.0,BBCC,No
Male,3.1,DEDDA,Yes
Male,0.7,CCCAA,No
Male,1.5,DDDEEE,No
Male,2.6,BACD,Yes
Male,3.5,ABBB,Yes
Male,4.0,DDDD,Yes
Male,2.1,DEDE,No
Male,1.2,ABCAAA,Yes
Male,10.5,BCDADDA,Yes
Male,0.5,AAAAA,No
Male,2.0,BBCCA,No
Male,3.1,DEDDA,Yes
Male,0.7,CCCAA,No
Male,1.5,ADDDEEE,Yes
Male,2.6,BACD,Yes
Male,3.5,ABBB,Yes
Male,4.0,DDDD,Yes
Female,2.1,DEDE,No
Female,1.2,ABCAAA,Yes
Female,10.5,BCDADD,Yes
Female,0.5,AAAA,No
Female,2.0,BBCC,No
Female,3.1,DEDDA,Yes
Female,0.7,CCCAA,No
Female,1.5,DDDEEE,Yes
Female,2.6,BACD,Yes
Female,3.5,ABBB,Yes
Female,4.0,DDDD,Yes
Female,2.1,DEDE,No
Female,1.2,ABCAAA,Yes
Female,10.5,BCDADD,Yes
Female,0.5,AAAA,No
Female,2.0,BBCC,No
Female,3.1,DEDDA,Yes
Female,0.7,CCCAA,No
Female,1.5,DDDEEE,Yes
Female,2.6,BACD,Yes
Female,3.5,ABBB,Yes
Female,1.0,DDDD,Yes
Female,2.5,DEDE,No
Female,2.5,ABBB,Yes
Female,1.0,DDDD,Yes
Female,1.1,DEDE,No
Female,2.2,ABCAAA,Yes
Female,10.5,BCDADD,Yes
Female,1.5,AAAA,No
Female,2.6,BBCC,No
Female,3.3,DEDDA,Yes
Female,1.7,CCCAA,No
Female,1.5,DDDEEE,Yes
Female,2.6,BACD,Yes
Female,3.9,ABBB,Yes
Female,4.5,DDDD,Yes
Female,2.1,DEDE,No
Female,3.9,BABB,Yes
Male,4.5,BAA,No
Female,2.1,DEDE,No
Male,3.9,BABB,Yes
Female,3.9,BABB,Yes
Male,4.5,BAA,No
Female,2.1,DEDE,No
Male,3.9,BABB,Yes
Female,3.9,BABB,Yes
Male,4.5,BAA,No
Female,2.1,DEDE,No
Male,3.9,BABB,Yes
$ mbonsai c=hospitalized n=visitgap p=purchasepattern d=gender i=dat1.csv O=outdat
ABCDE = 12345  *improved(errev:0.037037 *improved(errMin:0,leaf:1)
#END# kgbonsai O=outdat c=hospitalized d=gender i=dat1.csv n=visitgap p=purchasepattern
N=81
$ mdtree.rb i=outdat/model.pmml o=model1.html
#END# /usr/bin/mdtree.rb i=outdat/model.pmml o=model1.html
$ mdtree.rb alpha=0.1 i=outdat/model.pmml o=model2.html
#END# /usr/bin/mdtree.rb alpha=0.1 i=outdat/model.pmml o=model2.html
$ head model1.html
<html lang="ja">
<head>
  <meta charset="utf-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
  <style type="text/css">
	  p.title { border-bottom: 1px solid gray
		g > .type-node > rect { stroke-dasharray: 10,5
hite
		g > .type-leaf > rect { stroke-width: 3px
		.edge path {  fill: none
		svg >.legend > rect { stroke-width: 1px
$ head model2.html
<html lang="ja">
<head>
  <meta charset="utf-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
  <style type="text/css">
	  p.title { border-bottom: 1px solid gray
		g > .type-node > rect { stroke-dasharray: 10,5
hite
		g > .type-leaf > rect { stroke-width: 3px
		.edge path {  fill: none
		svg >.legend > rect { stroke-width: 1px
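Before running mbonsai on a file such as dat1.csv, it can be useful to confirm the record count and the class balance. The following is a minimal sketch using only the first five data rows of the listing above; the function name summarize is hypothetical. Run against the full file, the record count would be expected to agree with the count reported in the mbonsai log (81 in this example).

```python
import csv
import io
from collections import Counter

# First five data rows of dat1.csv, copied from the listing above.
SAMPLE = """gender,visitgap,purchasepattern,hospitalized
Male,1.2,ABCAAA,Yes
Male,10.5,BCDADD,Yes
Male,0.5,AAAA,No
Male,2.0,BBCC,No
Male,3.1,DEDDA,Yes
"""


def summarize(csv_text, class_field="hospitalized"):
    """Return (number of records, class counts) for CSV text with a header."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return len(rows), dict(Counter(r[class_field] for r in rows))


if __name__ == "__main__":
    n, counts = summarize(SAMPLE)
    print("records:", n, "classes:", counts)
```

To check a real file, read it with open("dat1.csv", newline="").read() and pass the text to summarize.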