I have a dataset that contains both numeric and non-numeric values, so I split it into separate tables, where tNonNumeric is the table holding the non-numeric values. My dependent variable is the churn feature, in which 0 indicates non-churner customers and 1 indicates churners.
I need to calculate the information gain for every feature so that I can decide which ones are necessary and which are not. To do that, I created two tables: Kchurn0Table holds the customers with churn = 0, and Kchurn1Table those with churn = 1. After that I calculated the total entropy and Entropy(i), which is meant to be the entropy without feature i:
```matlab
TotEntropy = -1 * ((height(Kchurn0Table)/height(tNonNumeric)) * log2(height(Kchurn0Table)/height(tNonNumeric))) ...
             + (height(Kchurn1Table)/height(tNonNumeric)) * log2(height(Kchurn1Table)/height(tNonNumeric));

for i = 1:width(tNonNumeric)
    dummy_dataset = tNonNumeric;
    dummy_dataset(:,i) = [];
    dummy_churn0 = Kchurn0Table;
    dummy_churn1 = Kchurn1Table;
    dummy_churn0(:,i) = [];
    dummy_churn1(:,i) = [];
    Entropy(i) = -1 * ((height(dummy_churn0)/height(dummy_dataset)) * log2(height(dummy_churn0)/height(dummy_dataset))) ...
                 + (height(dummy_churn1)/height(dummy_dataset)) * log2(height(dummy_churn1)/height(dummy_dataset));
    InfoGain(i) = TotEntropy - Entropy(i);
end
```
But in every iteration I get the same value for Entropy(i), and hence 0 for InfoGain. I could not figure out what I'm missing. Any help would be appreciated.
I think that Information Gain is the same thing as Mutual Information, so maybe you should use the following equation:

MI(X,Y) = H(Y) - H(Y|X)

where H(Y) is the entropy of Y and H(Y|X) is the conditional entropy of Y given X. This means that MI(X,Y) measures the amount of information shared between X and Y.

Note that in your loop, deleting a column with dummy_dataset(:,i) = [] changes the number of columns but not the number of rows, so every height(...) call returns the same value in each iteration, and Entropy(i) never changes. The conditional entropy has to be computed from the values taken by feature i, not by dropping its column.
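A minimal sketch of how H(Y|X) could be computed per feature, assuming the churn labels are available in a numeric vector churnVector (a hypothetical name) aligned with the rows of tNonNumeric, and that the features are categorical so unique/== comparisons work on them:

```matlab
% Total entropy H(Y) of the churn label
n  = height(tNonNumeric);
p1 = height(Kchurn1Table) / n;       % fraction of churners
p0 = 1 - p1;
TotEntropy = -p0*log2(p0) - p1*log2(p1);

InfoGain = zeros(1, width(tNonNumeric));
for i = 1:width(tNonNumeric)
    vals = unique(tNonNumeric{:,i});        % distinct values of feature i
    condEntropy = 0;                        % accumulates H(Y|X_i)
    for v = 1:numel(vals)
        mask = tNonNumeric{:,i} == vals(v); % rows where feature i takes this value
        pv   = sum(mask) / n;               % P(X_i = v)
        p1v  = sum(mask & churnVector == 1) / sum(mask); % churn rate in the subset
        p0v  = 1 - p1v;
        h = 0;                              % entropy of Y within the subset,
        if p0v > 0, h = h - p0v*log2(p0v); end  % treating 0*log2(0) as 0
        if p1v > 0, h = h - p1v*log2(p1v); end
        condEntropy = condEntropy + pv * h; % weight by P(X_i = v)
    end
    InfoGain(i) = TotEntropy - condEntropy; % MI(X_i, Y) = H(Y) - H(Y|X_i)
end
```

Because the subset sizes (sum(mask)) now actually depend on feature i, Entropy varies across features and InfoGain is no longer identically zero.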