random-forest, anomaly-detection, isolation, isolation-forest

Isolation Tree algorithm question about classification


In the part where we create the trees (iTrees), I don't understand why we use the following classification function (much like the one in decision tree classification):

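import numpy as np

def classify_data(data):
    # Take the last column of the DataFrame as the "label" column
    label_column = data.values[:, -1]
    unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)

    # Return the most frequent value in that column
    index = counts_unique_classes.argmax()
    classification = unique_classes[index]

    return classification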

We are taking the last column and returning its most frequent value? That might make sense for decision trees, but I don't understand why it is used in an isolation forest.

The whole iTree code looks like this:

def isolation_tree(data, counter=0,
                   max_depth=50, random_subspace=False):
    # End loop if max depth or if isolated
    if (counter == max_depth) or data.shape[0] <= 1:
        classification = classify_data(data)
        return classification

    else:
        # Counter
        counter += 1

        # Select random feature
        split_column = select_feature(data)

        # Select random value
        split_value = select_value(data, split_column)

        # Split data
        data_below, data_above = split_data(data, split_column, split_value)

        # Instantiate sub-tree
        question = "{} <= {}".format(split_column, split_value)
        sub_tree = {question: []}

        # Recursive part
        below_answer = isolation_tree(data_below, counter, max_depth=max_depth)
        above_answer = isolation_tree(data_above, counter, max_depth=max_depth)

        if below_answer == above_answer:
            sub_tree = below_answer
        else:
            sub_tree[question].append(below_answer)
            sub_tree[question].append(above_answer)

        return sub_tree

Edit: Here is an example of the data and running classify_data:

      feat1     feat2
0  3.300000  3.300000
1 -0.519349  0.353008
2 -0.269108 -0.909188
3 -1.887810 -0.555841
4 -0.711432  0.927116
label columns: [ 3.3         0.3530081  -0.90918776 -0.55584138  0.92711613]
unique_classes, counts unique classes: [-0.90918776 -0.55584138  0.3530081   0.92711613  3.3       ] [1 1 1 1 1]
-0.9091877609469025
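
The returned value follows from how NumPy breaks ties: np.unique returns the distinct values sorted in ascending order, and argmax returns the first index holding the maximum count. With continuous features every value is distinct, so all counts are 1 and the function simply returns the smallest value in the last column. A minimal sketch reproducing this with the feat2 values from the example (the variable names mirror classify_data, and the array is typed in by hand rather than read from a DataFrame):

```python
import numpy as np

# Last column (feat2) of the example data shown above.
label_column = np.array([3.3, 0.3530081, -0.90918776, -0.55584138, 0.92711613])

# np.unique returns the distinct values sorted in ascending order.
unique_classes, counts = np.unique(label_column, return_counts=True)

# Every count is 1, so argmax returns index 0 (the first maximum),
# which points at the smallest value in the column.
index = counts.argmax()
classification = unique_classes[index]

# The "classification" is just the column minimum.
print(classification == label_column.min())
```

So on unlabeled numeric data the result carries no class information at all; it is a feature value picked by a tie-breaking rule.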

Solution

  • I later found out that the classification part was only there for testing purposes; it serves no purpose in the isolation forest itself. If you use this code (popular on Medium), remove the classify_data function.
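
For illustration, here is a minimal sketch of the same recursive tree with the classification step removed. It assumes, as in the usual isolation forest formulation, that a leaf only needs to record how many points reached it. The leaf format {"size": n} and the three helper bodies are hypothetical stand-ins, since the helpers from the question are not shown, and the sketch works on a plain NumPy array instead of a DataFrame to stay self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def select_feature(data):
    # Hypothetical helper: pick one column index uniformly at random.
    return int(rng.integers(data.shape[1]))


def select_value(data, split_column):
    # Hypothetical helper: pick a split point uniformly between min and max.
    column = data[:, split_column]
    return rng.uniform(column.min(), column.max())


def split_data(data, split_column, split_value):
    # Hypothetical helper: partition the rows on the chosen feature.
    mask = data[:, split_column] <= split_value
    return data[mask], data[~mask]


def isolation_tree(data, counter=0, max_depth=50):
    # Stop at max depth or when at most one point remains (it is isolated).
    if counter == max_depth or data.shape[0] <= 1:
        return {"size": data.shape[0]}  # leaf: no class label needed

    split_column = select_feature(data)
    split_value = select_value(data, split_column)
    data_below, data_above = split_data(data, split_column, split_value)

    question = "{} <= {}".format(split_column, split_value)
    return {question: [
        isolation_tree(data_below, counter + 1, max_depth),
        isolation_tree(data_above, counter + 1, max_depth),
    ]}


# The five example rows from the question (feat1, feat2).
example = np.array([
    [3.300000, 3.300000],
    [-0.519349, 0.353008],
    [-0.269108, -0.909188],
    [-1.887810, -0.555841],
    [-0.711432, 0.927116],
])
tree = isolation_tree(example)
```

The anomaly score then comes from the depth at which a point ends up in a leaf, averaged over many trees, not from any value stored in the leaf.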