python, scikit-learn, random-forest, decision-tree

Interaction between sample_weight and min_samples_split in decision tree


In sklearn.ensemble.RandomForestClassifier, if we set both sample_weight and min_samples_split, do the sample weights affect min_samples_split? For example, if min_samples_split = 20 and every data point has a weight of 2, do 10 data points satisfy the min_samples_split condition?


Solution

  • No, see the source; min_samples_split does not take sample weights into account, it only counts the samples in the node. Compare to min_samples_leaf and its weighted cousin min_weight_fraction_leaf (source).

    Your example suggests an easy experiment to check:

    from sklearn.tree import DecisionTreeClassifier
    import numpy as np
    
    X = np.array([1, 2, 3]).reshape(-1, 1)
    y = [0, 0, 1]
    
    tree = DecisionTreeClassifier()
    tree.fit(X, y)
    print(len(tree.tree_.feature))  # number of nodes
    # 3
    
    tree.set_params(min_samples_split=10)
    tree.fit(X, y)
    print(len(tree.tree_.feature))
    # 1
    
    # min_samples_split is still 10; only the sample weights change
    tree.fit(X, y, sample_weight=[20, 20, 20])
    print(len(tree.tree_.feature))
    # 1; the weights do not make each sample count as
    #    20 samples toward min_samples_split
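
For contrast, here is a minimal sketch of the weighted cousin mentioned above, min_weight_fraction_leaf, reacting to sample weights. The two-point dataset, the 0.4 threshold, and the weights [3, 1] are illustrative choices of mine, not taken from the question; the node counts in the comments are what I would expect from how the threshold is described in the docs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy data (an assumption, not from the question)
X = np.array([1, 2]).reshape(-1, 1)
y = [0, 1]

# Each leaf must hold at least 40% of the total sample weight
tree = DecisionTreeClassifier(min_weight_fraction_leaf=0.4, random_state=0)

# Equal weights: each candidate leaf would hold 50% of the weight,
# so the single possible split is allowed
tree.fit(X, y)
n_uniform = tree.tree_.node_count
print(n_uniform)
# 3

# Weights 3 and 1: the right leaf would hold only 25% of the weight,
# so the split is rejected and the root stays a leaf
tree.fit(X, y, sample_weight=[3, 1])
n_weighted = tree.tree_.node_count
print(n_weighted)
# 1
```

So if the goal is for weights to influence when a node may split, min_weight_fraction_leaf is the parameter to reach for; min_samples_split stays a plain sample count.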