python scikit-learn decision-tree dot pydot

Visualizing scikit-learn (sklearn) multi-output decision tree regression as PNG or PDF


This is the first question I'm posting on Stack Overflow, so I apologize for any mishaps in layout and so on (advice welcome). Your help is much appreciated!

I'm trying to visualize the output of DecisionTreeRegressor with multiple outputs (as described in http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression_multioutput.html#example-tree-plot-tree-regression-multioutput-py) in png or pdf format using pydot.

The code I tried looks like this:

...
dtreg = tree.DecisionTreeRegressor(max_depth=3)
dtreg.fit(x, y)

# Write the tree to a .dot file on disk
tree.export_graphviz(dtreg, out_file='tree.dot')

# Export again to an in-memory buffer and render it with pydot
dot_data = StringIO()
tree.export_graphviz(dtreg, out_file=dot_data)
print(dot_data.getvalue())
pydot.graph_from_dot_data(dot_data.getvalue()).write_pdf("pydot_try.pdf")

Writing the pdf gives the following errors:

pydot.InvocationException: Program terminated with status: 1. stderr follows:
Warning: /tmp/tmpAy7d59:7: string ran past end of line
Error: /tmp/tmpAy7d59:8: syntax error near line 8
context: >>> [ <<< 0.20938667]
Warning: /tmp/tmpAy7d59:18: string ran past end of line
Warning: /tmp/tmpAy7d59:20: string ran past end of line

and so on with more "string ran past end of line" errors.

I've never worked with .dot before, but I suspect there might be a problem with the multi-output format. For example, part of the tree looks like this:

digraph Tree {
0 [label="X[0] <= 56.0000\nmse = 0.0149315126135\nsamples = 41", shape="box"] ;
1 [label="X[0] <= 40.0000\nmse = 0.0137536911947\nsamples = 25", shape="box"] ;
0 -> 1 ;
2 [label="X[0] <= 24.0000\nmse = 0.0152142545276\nsamples = 21", shape="box"] ;
1 -> 2 ;
3 [label="mse = 0.0140\nsamples = 15\nvalue = [[ 0.83384667]
 [ 0.20938667]
 [ 0.08511333]
 [ 0.04234667]
 [ 0.08158   ]
 [ 0.17948667]
 [ 0.03616   ]
 [ 0.00995333]
 [ 0.99529333]
 [ 0.13715333]
 [ 0.10294667]
 [ 0.06632667]]", shape="box"] ;
2 -> 3 ;
4 [label="mse = 0.0170\nsamples = 6\nvalue = [[ 0.69588333]
 [ 0.20275   ]
 [ 0.0953    ]
 [ 0.0436    ]
 [ 0.1216    ]
 [ 0.17248333]
 [ 0.04393333]
 [ 0.01178333]
 [ 0.99913333]
 [ 0.12348333]
 [ 0.10838333]
 [ 0.06973333]]", shape="box"] ;
2 -> 4 ;
}

I don't know how to solve this, because that's just the output I get from DecisionTreeRegressor.

I also tried converting the dot file:

dot -Tpng tree.dot -o tree.png

But this gives the same errors (string ran past end of line). I also tried visualizing tree.dot using xdot, and that gave the same error.
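
To see which lines of a dot file are likely to trigger "string ran past end of line", a small checker can look for double-quoted strings that open on a line and are still open when that line ends. The following is only a rough sketch using the Python standard library: the function name `unterminated_string_lines` and the shortened `sample` fragment are made up for illustration, and the check is a simplification, not how Graphviz itself parses dot.

```python
def unterminated_string_lines(dot_text):
    """Return 1-based numbers of lines on which a double-quoted string
    opens but is not closed before the line ends."""
    flagged = []
    in_string = False
    for lineno, line in enumerate(dot_text.split("\n"), start=1):
        opened_here = False   # did the currently open string start on this line?
        continued = False     # does the line end with a backslash inside a string?
        i = 0
        while i < len(line):
            ch = line[i]
            if in_string and ch == "\\":
                if i == len(line) - 1:
                    continued = True  # backslash-newline continuation is allowed
                i += 2                # skip the escaped character
                continue
            if ch == '"':
                in_string = not in_string
                opened_here = in_string
            i += 1
        if in_string and opened_here and not continued:
            flagged.append(lineno)
    return flagged


# Hypothetical, shortened fragment in the same shape as the tree above.
sample = (
    'digraph Tree {\n'
    '0 [label="X[0] <= 56.0000\\nsamples = 41", shape="box"] ;\n'
    '1 [label="value = [[ 0.83]\n'
    ' [ 0.20]]", shape="box"] ;\n'
    '0 -> 1 ;\n'
    '}\n'
)
print(unterminated_string_lines(sample))
```

For this fragment the checker flags only the line where the `value = [[ ...` label starts. That is the same kind of line the first warning above points at (line 7 of the temporary file).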


Solution

  • The error message indicates a problem with the multiline strings (the labels). To specify a multiline label in dot you can use the \n escape or, as described in the DOT language documentation:

    As another aid for readability, dot allows double-quoted strings to span multiple physical lines using the standard C convention of a backslash immediately preceding a newline character.

    That said, when I generated your plot using dot from Graphviz version 2.39.20141007.0445, it rendered without errors.

    I can't find a reference to the format changing; however, it may be worth trying again with the latest version of Graphviz installed.
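
If upgrading Graphviz is not an option, one possible workaround is to rewrite the dot text so that raw line breaks inside quoted labels become the \n escape mentioned above. This is a minimal sketch under that assumption: the helper name `escape_newlines_in_strings` and the `broken` fragment are hypothetical, and it has not been tested against the Graphviz version that produced the error.

```python
def escape_newlines_in_strings(dot_text):
    """Replace raw line breaks inside double-quoted DOT strings with the
    two-character escape backslash + n; text outside strings is unchanged."""
    out = []
    in_string = False
    i = 0
    while i < len(dot_text):
        ch = dot_text[i]
        if in_string and ch == "\\" and i + 1 < len(dot_text):
            # Copy an existing escape (such as \n or \") through untouched.
            out.append(dot_text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_string = not in_string
            out.append(ch)
        elif ch == "\n" and in_string:
            out.append("\\n")  # raw newline inside a label becomes an escape
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Hypothetical, shortened node in the same shape as the failing output.
broken = '3 [label="samples = 15\\nvalue = [[ 0.83]\n [ 0.20]]", shape="box"] ;\n'
fixed = escape_newlines_in_strings(broken)
print(fixed)
```

In the question's code, the string returned by `dot_data.getvalue()` could be passed through this function before it is handed to pydot. Whether that is enough to satisfy an older Graphviz is an assumption here, not something the answer confirms.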