In the part where we create the trees (iTrees) I don't understand why we are using the following classification line of code (much alike as it is in decision tree classification):
def classify_data(data):
label_column = data.values[:, -1]
unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)
index = counts_unique_classes.argmax()
classification = unique_classes[index]
return classification
We are choosing the last column and an indexed value of the largest unique element? It might make sense for decision trees but I don't understand why we use it in isolation forest?
And the whole iTree code is looking like the following:
def isolation_tree(data,counter=0,
max_depth=50,random_subspace=False):
# End loop if max depth or if isolated
if (counter == max_depth) or data.shape[0]<=1:
classification = classify_data(data)
return classification
else:
# Counter
counter +=1
# Select random feature
split_column = select_feature(data)
# Select random value
split_value = select_value(data,split_column)
# Split data
data_below, data_above = split_data(data,split_column,split_value)
# instantiate sub-tree
question = "{} <= {}".format(split_column,split_value)
sub_tree = {question: []}
# Recursive part
below_answer = isolation_tree(data_below,counter,max_depth=max_depth)
above_answer = isolation_tree(data_above,counter,max_depth=max_depth)
if below_answer == above_answer:
sub_tree = below_answer
else:
sub_tree[question].append(below_answer)
sub_tree[question].append(above_answer)
return sub_tree
Edit: Here is an example of the data and running classify_data:
feat1 feat2
0 3.300000 3.300000
1 -0.519349 0.353008
2 -0.269108 -0.909188
3 -1.887810 -0.555841
4 -0.711432 0.927116
label columns: [ 3.3 0.3530081 -0.90918776 -0.55584138
0.92711613]
unique_classes, counts unique classes: [-0.90918776 -0.55584138
0.3530081 0.92711613 3.3 ] [1 1 1 1 1]
-0.9091877609469025
So I later found out that the classification part was for testing purposes, it is worthless. If you use this code (popular on Medium) please remove the classification function as it serves no purpose.