
Regarding RandomTree in Weka


I was playing around with Weka when I noticed a minNum field in the RandomTree configuration. Its description reads "The minimum total weight of the instances in a leaf", but I couldn't really understand what that means.

I experimented with that number and found that the generated tree gets smaller as I increase it. I couldn't work out why this happens.

Any help/references will be appreciated.


Solution

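  • minNum (the -M option) sets the minimum number of instances on a leaf node; this is often 2 by default in decision trees, as in J48. Each instance carries a weight of 1 unless you assign weights yourself, so the "total weight" in the description is normally just the instance count. The higher you set this parameter, the more general the tree becomes: nodes that hold only a few instances are no longer split, so you get fewer, larger leaves instead of an overly granular tree structure.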

    Here are two runs on the iris dataset that show how the -M option can affect the size of the resulting tree:

    $ weka weka.classifiers.trees.RandomTree -t iris.arff -i
    
    petallength < 2.45 : Iris-setosa (50/0)
    petallength >= 2.45
    |   petalwidth < 1.75
    |   |   petallength < 4.95
    |   |   |   petalwidth < 1.65 : Iris-versicolor (47/0)
    |   |   |   petalwidth >= 1.65 : Iris-virginica (1/0)
    |   |   petallength >= 4.95
    |   |   |   petalwidth < 1.55 : Iris-virginica (3/0)
    |   |   |   petalwidth >= 1.55
    |   |   |   |   sepallength < 6.95 : Iris-versicolor (2/0)
    |   |   |   |   sepallength >= 6.95 : Iris-virginica (1/0)
    |   petalwidth >= 1.75
    |   |   petallength < 4.85
    |   |   |   sepallength < 5.95 : Iris-versicolor (1/0)
    |   |   |   sepallength >= 5.95 : Iris-virginica (2/0)
    |   |   petallength >= 4.85 : Iris-virginica (43/0)
    
    Size of the tree : 17
    
    $ weka weka.classifiers.trees.RandomTree -M 6 -t iris.arff -i
    
    petallength < 2.45 : Iris-setosa (50/0)
    petallength >= 2.45
    |   petalwidth < 1.75
    |   |   petallength < 4.95
    |   |   |   petalwidth < 1.65 : Iris-versicolor (47/0)
    |   |   |   petalwidth >= 1.65 : Iris-virginica (1/0)
    |   |   petallength >= 4.95 : Iris-virginica (6/2)
    |   petalwidth >= 1.75
    |   |   petallength < 4.85 : Iris-virginica (3/1)
    |   |   petallength >= 4.85 : Iris-virginica (43/0)
    
    Size of the tree : 11
    

    As a side note, RandomTree considers only K randomly chosen attributes when splitting each node; this attribute subsampling is what RandomForest combines with bagging. Unlike REPTree, it does no pruning (the same is true of the trees in a RandomForest), so you may end up with very noisy trees.
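
To make the effect of -M on tree size concrete, here is a minimal, self-contained sketch of a toy tree grower. It is not Weka's code. The class MinNumDemo, the method treeSize, and the 12-point dataset are made up for illustration. The sketch assumes a stopping rule of the form "do not split a node whose total weight is below 2 * minNum", which is consistent with the -M 6 output above: the nodes holding 6 and 3 instances stay unsplit, while a leaf with a single instance can still come out of splitting a large node.

```java
// Toy illustration only: one numeric attribute, binary class, unit weights.
class MinNumDemo {

    // Hypothetical data, already sorted by x; y holds class labels 0 or 1.
    static final double[] X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    static final int[] Y = {0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1};

    // Total number of nodes (internal nodes plus leaves) of the grown tree.
    public static int treeSize(double[] x, int[] y, double minNum) {
        return treeSize(x, y, 0, x.length, minNum);
    }

    // Grows the subtree for the instances in the index range [lo, hi).
    static int treeSize(double[] x, int[] y, int lo, int hi, double minNum) {
        int n = hi - lo; // with unit weights, total weight == instance count
        int ones = 0;
        for (int i = lo; i < hi; i++) ones += y[i];
        boolean pure = ones == 0 || ones == n;

        // ASSUMED stopping rule: too little weight to split, or nothing to gain.
        if (n < 2 * minNum || pure) return 1;

        // Pick the cut point with the lowest weighted Gini impurity.
        int bestCut = -1;
        double bestImpurity = Double.MAX_VALUE;
        int leftOnes = 0;
        for (int cut = lo + 1; cut < hi; cut++) {
            leftOnes += y[cut - 1];
            if (x[cut] == x[cut - 1]) continue; // cannot cut between equal values
            int nl = cut - lo;
            int nr = hi - cut;
            double impurity = gini(leftOnes, nl) * nl + gini(ones - leftOnes, nr) * nr;
            if (impurity < bestImpurity) {
                bestImpurity = impurity;
                bestCut = cut;
            }
        }
        if (bestCut < 0) return 1; // all x values identical: leaf

        return 1 + treeSize(x, y, lo, bestCut, minNum)
                 + treeSize(x, y, bestCut, hi, minNum);
    }

    // Gini impurity of a node with `ones` instances of class 1 out of n.
    static double gini(int ones, int n) {
        double p = (double) ones / n;
        return 2 * p * (1 - p);
    }

    public static void main(String[] args) {
        System.out.println("minNum=1 -> " + treeSize(X, Y, 1.0) + " nodes"); // 11 nodes
        System.out.println("minNum=3 -> " + treeSize(X, Y, 3.0) + " nodes"); // 5 nodes
    }
}
```

With minNum=1 the toy tree keeps splitting until every leaf is pure, isolating the two noisy points along the way. With minNum=3 the 5-instance node is left as an impure leaf, which is the same shrinkage seen between the two iris runs.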