Apologies if this has been asked elsewhere, but it feels like a specific question. I am working through some machine learning code for fun and struggling to figure out how exactly to remove a column with entropy of zero.
    def generate_tree(data, tree, branch="", parent="root"):
        # Entropy of each column of the dataframe
        entropies = [binary_entropy(data.iloc[:, i]) for i in range(len(data.iloc[0]))]
        # Intended to drop a column whose entropy is zero
        if np.argmin(entropies) == 0:
            data = data.drop(data.columns[np.argmin(entropies)])
        # Stopping conditions: no questions left, or a single animal left
        no_column_left = (len(data.columns) == 0)
        one_animal_left = (len(data.index) == 1)
        if no_column_left or one_animal_left:
            tree.create(branch + ", ".join(data.index), parent=parent)
            return
        # Keep the rows where the highest-entropy column equals 1, then drop that column
        data = data.loc[data.loc[:, data.columns[np.argmax(entropies)]] == 1]
        data = data.drop(data.columns[np.argmax(entropies)], axis=1)
        # Recompute entropies on the reduced dataframe and pick the next column to split on
        entropies = [binary_entropy(data.iloc[:, i]) for i in range(len(data.iloc[0]))]
        selected_column = str(data.columns[np.argmax(entropies)])
        node = tree.create_node(branch + selected_column, parent=parent)
        mask = data.columns
        generate_tree(data[mask], tree, branch="+", parent=node)
        generate_tree(data[~mask], tree, branch="-", parent=node)
So as I understand it, entropies computes the entropy of each column. Next, there is an if statement that should catch any value in the entropies array equal to zero and then remove that column from the dataframe; however, I don't believe this is being picked up, and I cannot figure out why.
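For reference, binary_entropy isn't shown here; it is a small helper along these lines (Shannon entropy, in bits, of a 0/1 column, which matches the printed values below):

    import numpy as np

    def binary_entropy(column):
        # Fraction of ones in a 0/1 column
        p = np.mean(column)
        # A column that is all zeros or all ones carries no information
        if p == 0 or p == 1:
            return 0.0
        # Shannon entropy in bits, e.g. p = 1/3 gives 0.91829583
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)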
no_column_left and one_animal_left should register True or False depending on whether each condition has been reached. These are then used in the following if statement to determine whether we've reached the end of the tree. If not, we compute the new entropies of the reduced dataframe with the next three lines of code.
To add the selected column to the tree, its name is converted to a string and attached as a new node.
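For context, tree here is a treelib Tree (an assumption based on the create_node call); treelib's create_node returns the new Node, which can then be passed as the parent of later nodes:

    from treelib import Tree

    tree = Tree()
    root = tree.create_node("root")                       # first node becomes the root
    node = tree.create_node("+does it have hair?", parent=root)
    tree.create_node("-duck, eagle", parent=node)         # a leaf listing the remaining animals
    tree.show()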
Ultimately, my question is: what am I doing wrong such that each reduced dataframe keeps including columns with zero entropy? Any guidance is much appreciated.

I also give some output below so you can see exactly what the problem is.
    [1.         0.91829583 0.91829583 0.81127812 0.97986876 0.97986876]
    [1]
              can it fly?  is it a vertebrate?  is it endangered?  does it live in groups?  does it have hair?
    cat                 0                    1                  0                        0                   1
    duck                1                    1                  0                        1                   0
    eagle               1                    1                  1                        0                   0
    elephant            0                    1                  1                        1                   0
    man                 0                    1                  1                        1                   1
    rabbit              0                    1                  0                        1                   1
    [0.91829583 0.         1.         0.91829583 1.        ]
    [0.91829583 0.         1.         0.91829583 1.        ]
    [1]
              can it fly?  is it a vertebrate?  does it live in groups?  does it have hair?
    eagle               1                    1                        0                   0
    elephant            0                    1                        1                   0
    man                 0                    1                        1                   1
    [0.91829583 0.         0.91829583 0.91829583]
    [0.91829583 0.         0.91829583 0.91829583]
    [0 1 2]
           is it a vertebrate?  does it live in groups?  does it have hair?
    eagle                    1                        0                   0
    [0 0 0]
    [0 0 0]
Entropy measures the uncertainty of a random variable. If you have one class with probability 1, then the entropy reduces to -sum(p * log2(p)) = -1 * log2(1) = 0, so those zeros in your output are correct.
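You can check this directly (base-2 logs, matching the values in your printout):

    import numpy as np

    col = np.array([1, 1, 1])      # e.g. "is it a vertebrate?" for eagle/elephant/man
    p = col.mean()                 # p = 1.0, a single class
    print(-p * np.log2(p))         # -1 * log2(1) = -0.0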
Note also that np.argmin(entropies) returns the index of the smallest entropy, not its value, so np.argmin(entropies) == 0 only fires when the zero-entropy column happens to be the first one. You should be able to keep only the columns with non-zero entropy with:

    entropies = np.array(entropies)       # so the comparison gives a boolean mask
    data = data.loc[:, entropies != 0]
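For example, on the three-animal frame from your output (with a stand-in for your binary_entropy):

    import numpy as np
    import pandas as pd

    def binary_entropy(col):
        p = np.mean(col)
        return 0.0 if p in (0, 1) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    df = pd.DataFrame(
        {
            "can it fly?": [1, 0, 0],
            "is it a vertebrate?": [1, 1, 1],
            "does it live in groups?": [0, 1, 1],
            "does it have hair?": [0, 0, 1],
        },
        index=["eagle", "elephant", "man"],
    )

    entropies = np.array([binary_entropy(df.iloc[:, i]) for i in range(df.shape[1])])
    print(entropies)         # [0.91829583 0.         0.91829583 0.91829583]

    df = df.loc[:, entropies != 0]
    print(list(df.columns))  # "is it a vertebrate?" (entropy 0) is gone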