Python Scikit-Learn DecisionTreeClassifier.fit() throws KeyError: 'default'


I have a small dataset and am trying to use sklearn to create a decision tree classifier. I use sklearn.tree.DecisionTreeClassifier as the model and use its .fit() function to fit to the data. Searching around, I could not find anyone else who has run into the same issue.

After loading the data into one array and the labels into another, printing the two arrays (data and labels) gives:

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0.]
 [0. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
 [0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1.]
 [0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0.]
 [0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1.]
 [1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]]

['Alcaligenes_faecalis' 'Bacillus_circulans' 'Bacillus_megaterium'
 'Bacillus_sphaericus' 'Citrobacter_freundii' 'Enterobacter_aerogenes'
 'Escherichia_coli' 'Micrococcus_luteus' 'Proteus_mirabilis'
 'Salmonella_arizonae' 'Serratia_marcescens' 'Staphylococcus_epidermidis'
 'Staphylococcus_saprophyticus']

I defined a function to do the fitting, and I have tried removing the function and directly running the .fit() function:

def decisiontree(data, labels, criterion = "gini", splitter = "default", max_depth = None): #expects *2d data and 1d labels

    model = sklearn.tree.DecisionTreeClassifier(criterion = criterion, splitter = splitter, max_depth = max_depth)
    model = model.fit(data,labels)

    return model

I then called the function:

model = decisiontree(data, labels)

At this point, the KeyError is raised:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-3574397ccfb6> in <module>
----> 1 model = decisiontree(data, labels)

<ipython-input-18-e85883291477> in decisiontree(data, labels, criterion, splitter, max_depth)
      2 
      3     model = sklearn.tree.DecisionTreeClassifier(criterion = criterion, splitter = splitter, max_depth = max_depth)
----> 4     model = model.fit(data,labels)
      5 
      6     return model

~/anaconda3/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    875             sample_weight=sample_weight,
    876             check_input=check_input,
--> 877             X_idx_sorted=X_idx_sorted)
    878         return self
    879 

~/anaconda3/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    333         splitter = self.splitter
    334         if not isinstance(self.splitter, Splitter):
--> 335             splitter = SPLITTERS[self.splitter](criterion,
    336                                                 self.max_features_,
    337                                                 min_samples_leaf,

KeyError: 'default'

The data is stored in data.csv:

Alcaligenes_faecalis,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0
Bacillus_circulans,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0
Bacillus_megaterium,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0,1
Bacillus_sphaericus,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
Citrobacter_freundii,0,1,1,1,1,0,1,1,1,0,1,1,0,1,0,0
Enterobacter_aerogenes,0,1,1,1,1,1,1,1,0,0,1,0,0,1,1,0
Escherichia_coli,0,1,1,1,1,1,1,1,1,0,1,0,1,0,0,0
Micrococcus_luteus,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Proteus_mirabilis,0,1,1,1,0,0,0,0,0,1,1,1,0,1,1,1
Salmonella_arizonae,0,1,1,1,0,0,0,0,1,0,1,1,0,1,0,0
Serratia_marcescens,0,1,1,0,0,0,1,0,0,0,1,0,0,1,0,1
Staphylococcus_epidermidis,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0
Staphylococcus_saprophyticus,1,0,1,0,1,0,1,0,1,1,0,0,0,0,0,0
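The loading code is not shown in the question. As a minimal sketch, assuming each line of data.csv holds a label followed by sixteen 0/1 features (as in the listing above), the file could be split into data and labels like this. The helper name load_rows and the in-memory SAMPLE string are made up for illustration; with the real file you would pass open("data.csv", newline="") instead.

```python
import csv
import io

# Two rows copied from data.csv above, kept in memory so the sketch runs on its own.
SAMPLE = """Alcaligenes_faecalis,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0
Bacillus_circulans,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0
"""

def load_rows(handle):
    """Hypothetical helper: return (data, labels) from a label-first CSV."""
    data, labels = [], []
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        labels.append(row[0])                     # first column is the class label
        data.append([float(v) for v in row[1:]])  # remaining columns are features
    return data, labels

data, labels = load_rows(io.StringIO(SAMPLE))
print(labels)
print(len(data), len(data[0]))
```

Plain lists are enough for .fit(); wrapping them in numpy arrays would give the printed form shown in the question.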

Solution

  • "default" is not a valid value for the splitter parameter of sklearn.tree.DecisionTreeClassifier. The accepted values are "best" and "random", and "best" is the one used when the parameter is omitted, so you can use:

    import sklearn.tree

    def decisiontree(data, labels, criterion="gini", splitter="best", max_depth=None):
        # Expects 2D data and 1D labels.
        model = sklearn.tree.DecisionTreeClassifier(criterion=criterion, splitter=splitter, max_depth=max_depth)
        model = model.fit(data, labels)

        return model
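The traceback in the question shows fit() looking the string up in a SPLITTERS dictionary, which is why an unknown name surfaces as a KeyError. As a quick check, the sketch below fits a tree with each accepted splitter on the first three rows of data.csv and then confirms that "default" is rejected. The variable names are illustrative, and the exception type is an assumption that depends on the installed version: the version in the traceback raises KeyError, while newer scikit-learn releases validate parameters up front and raise a ValueError subclass instead, so both are caught here.

```python
from sklearn.tree import DecisionTreeClassifier

# First three rows of data.csv from the question.
data = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]
labels = ["Alcaligenes_faecalis", "Bacillus_circulans", "Bacillus_megaterium"]

# Both accepted splitter values should fit without error.
fits_training_data = {}
for splitter in ("best", "random"):
    model = DecisionTreeClassifier(criterion="gini", splitter=splitter, random_state=0)
    model.fit(data, labels)
    # An unrestricted tree on distinct rows should reproduce its training labels.
    fits_training_data[splitter] = list(model.predict(data)) == labels

# "default" is not an accepted value; the error is raised inside fit().
try:
    DecisionTreeClassifier(splitter="default").fit(data, labels)
    rejected = False
except (KeyError, ValueError):
    rejected = True

print(fits_training_data, rejected)
```

If you want a "use the library default" option in your own wrapper, leaving the keyword out of the DecisionTreeClassifier call has the same effect as passing splitter="best".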