I have a small dataset and am trying to use sklearn to create a decision tree classifier. I use sklearn.tree.DecisionTreeClassifier as the model and use its .fit() function to fit to the data. Searching around, I could not find anyone else who has run into the same issue.
After loading the data into one array and the labels into another, printing the two arrays (data and labels) gives:
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
[1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0.]
[0. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
[0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1.]
[0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0.]
[0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1.]
[1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]
[1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]]
['Alcaligenes_faecalis' 'Bacillus_circulans' 'Bacillus_megaterium'
'Bacillus_sphaericus' 'Citrobacter_freundii' 'Enterobacter_aerogenes'
'Escherichia_coli' 'Micrococcus_luteus' 'Proteus_mirabilis'
'Salmonella_arizonae' 'Serratia_marcescens' 'Staphylococcus_epidermidis'
'Staphylococcus_saprophyticus']
I defined a function to do the fitting, and I have tried removing the function and directly running the .fit() function:
def decisiontree(data, labels, criterion = "gini", splitter = "default", max_depth = None): #expects *2d data and 1d labels
    model = sklearn.tree.DecisionTreeClassifier(criterion = criterion, splitter = splitter, max_depth = max_depth)
    model = model.fit(data,labels)
    return model
I then called the function:
model = decisiontree(data, labels)
At this point, a KeyError is raised:
KeyError Traceback (most recent call last)
<ipython-input-21-3574397ccfb6> in <module>
----> 1 model = decisiontree(data, labels)
<ipython-input-18-e85883291477> in decisiontree(data, labels, criterion, splitter, max_depth)
2
3 model = sklearn.tree.DecisionTreeClassifier(criterion = criterion, splitter = splitter, max_depth = max_depth)
----> 4 model = model.fit(data,labels)
5
6 return model
~/anaconda3/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
875 sample_weight=sample_weight,
876 check_input=check_input,
--> 877 X_idx_sorted=X_idx_sorted)
878 return self
879
~/anaconda3/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
333 splitter = self.splitter
334 if not isinstance(self.splitter, Splitter):
--> 335 splitter = SPLITTERS[self.splitter](criterion,
336 self.max_features_,
337 min_samples_leaf,
KeyError: 'default'
The data is stored in data.csv:
Alcaligenes_faecalis,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0
Bacillus_circulans,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0
Bacillus_megaterium,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0,1
Bacillus_sphaericus,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
Citrobacter_freundii,0,1,1,1,1,0,1,1,1,0,1,1,0,1,0,0
Enterobacter_aerogenes,0,1,1,1,1,1,1,1,0,0,1,0,0,1,1,0
Escherichia_coli,0,1,1,1,1,1,1,1,1,0,1,0,1,0,0,0
Micrococcus_luteus,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Proteus_mirabilis,0,1,1,1,0,0,0,0,0,1,1,1,0,1,1,1
Salmonella_arizonae,0,1,1,1,0,0,0,0,1,0,1,1,0,1,0,0
Serratia_marcescens,0,1,1,0,0,0,1,0,0,0,1,0,0,1,0,1
Staphylococcus_epidermidis,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0
Staphylococcus_saprophyticus,1,0,1,0,1,0,1,0,1,1,0,0,0,0,0,0
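There is no "default" value for the splitter parameter of sklearn.tree.DecisionTreeClassifier. The accepted values are "best" and "random", and the one used when you omit the argument is "best". The traceback shows the failing lookup, SPLITTERS[self.splitter], which is why the invalid string surfaces as KeyError: 'default'.
So you can use: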
import sklearn.tree

def decisiontree(data, labels, criterion = "gini", splitter = "best", max_depth = None): #expects *2d data and 1d labels
    model = sklearn.tree.DecisionTreeClassifier(criterion = criterion, splitter = splitter, max_depth = max_depth)
    model = model.fit(data,labels)
    return model
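To check the fix end to end, here is a minimal sketch that fits a tree with splitter="best" and then confirms that splitter="default" is rejected. It is only an illustration under a few assumptions: the first three rows of data.csv are inlined so it runs without the file, load_rows is a hypothetical helper (not part of sklearn), and random_state=0 is added only for reproducibility. The exception type depends on the sklearn version: the traceback in the question shows a KeyError, while newer releases, as far as I know, raise a ValueError subclass from parameter validation, so the sketch catches both.

```python
import io

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# First three rows of data.csv from the question, inlined for a self-contained run.
CSV_TEXT = """\
Alcaligenes_faecalis,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0
Bacillus_circulans,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0
Bacillus_megaterium,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0,1
"""


def load_rows(handle):
    """Hypothetical helper: split each CSV line into a label and a float feature row."""
    labels, rows = [], []
    for line in handle:
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        labels.append(fields[0])
        rows.append([float(value) for value in fields[1:]])
    return np.array(rows), np.array(labels)


data, labels = load_rows(io.StringIO(CSV_TEXT))

# "best" is the value sklearn uses when splitter is omitted; "random" is the other option.
model = DecisionTreeClassifier(criterion="gini", splitter="best",
                               max_depth=None, random_state=0)
model.fit(data, labels)
predictions = model.predict(data)

# The invalid string from the question is rejected at fit time.
rejected = False
try:
    DecisionTreeClassifier(splitter="default").fit(data, labels)
except (KeyError, ValueError):
    rejected = True
```

If you want the wrapper function to keep a "use whatever sklearn uses" default, a safer choice is to default the argument to "best" (as above) rather than inventing a placeholder string.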