Search code examples
machine-learningwekadecision-tree

Method of finding threshold in Decision tree for continuous data


I am using decision tree in Weka and I have some continuous data, so when I use Weka it automatically find the threshold for me but for some reason I want to implement Decision Tree by myself so I need to know what approach to use to find the threshold to discretize my continuous data?


Solution

  • ID3 and C4.5 use entropy heuristic for discretization of continuous data. The method finds a binary cut for each variable (feature). You could apply the same method recursively to get multiple intervals from continuous data.

    Suppose at a certain tree node, all instances belong to a set of S, and you are working on variable A and a particular boundary (cut) T, the class information entropy of the partition induced by T, denoted as E(A,T,S) is given by:

                 |S1|                 |S2|
    E(A, T, S) = ---- Entropy(S1) +   ---- Entropy(S2)
                  |S|                 |S|
    

    where |S1| is the number of instances in the first partition; |S2| is the number of instances in the second partition; |S| = |S1|+|S2|.

    For a given feature A, the boundary T_min which minimizes the entropy function over all possible partition boundaries, is selected as a binary discretization boundary.

    For example, you might have a variable Length, with all possible values as:

    Length = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}
    

    Then your T could be:

    T = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}
    

    in which you cut at every possible Length value. You could also cut at every middle point of adjacent Length values, e.g.,

    T = {2.45, 3.15, 5.75, 9.0, 15.0, 35.0, 50.5}
    

    At discretization time, you will iterate through all possible T values and evaluate which one obtains the minimum E(A, T, S). That's it.

    See more details in this paper, which also describes other optional methods:

    • ChiMerge Discretization method.
    • Learning Vector Quantization (LVQ) based method
    • Histogram-based method.