Tags: matlab, statistics, entropy, feature-selection

Feature Selection by Entropy and Information Gain in Matlab


I have a dataset that contains both numeric and non-numeric values, so I split it into separate tables; tNonNumeric is the table that holds the non-numeric values.

My dependent variable is the churn feature, in which 0 indicates non-churning customers and 1 indicates churners.

I need to calculate the information gain of every feature so that I can decide which ones are necessary and which ones are not. To do that, I created two tables: Kchurn0Table holds the customers with churn = 0 and Kchurn1Table those with churn = 1. Then I calculated the total entropy, and Entropy(i), which is meant to be the entropy without feature i.

TotEntropy = -1 * ((height(Kchurn0Table)/height(tNonNumeric)) * log2(height(Kchurn0Table)/height(tNonNumeric))) ...
    + (height(Kchurn1Table)/height(tNonNumeric)) * log2(height(Kchurn1Table)/height(tNonNumeric));
for i = 1:width(tNonNumeric)
    dummy_dataset = tNonNumeric;
    dummy_dataset(:,i) = [];
    dummy_churn0 = Kchurn0Table;
    dummy_churn1 = Kchurn1Table;
    dummy_churn0(:,i) = [];
    dummy_churn1(:,i) = [];
    Entropy(i) = -1 * ((height(dummy_churn0)/height(dummy_dataset)) * log2(height(dummy_churn0)/height(dummy_dataset))) ...
        + (height(dummy_churn1)/height(dummy_dataset)) * log2(height(dummy_churn1)/height(dummy_dataset));
    InfoGain(i) = TotEntropy - Entropy(i);
end

But in every iteration I get the same value for Entropy(i), and hence 0 for InfoGain.

I could not figure out what I'm missing. Any help will be appreciated.


Solution

  • The loop gives the same value every time because deleting a column with dummy_churn0(:,i) = [] never changes the number of rows, and Entropy(i) depends only on table heights, which are therefore identical for every i. More fundamentally, Information Gain is the same thing as Mutual Information, so you should use the following equation:

    MI(X,Y)=H(Y)-H(Y|X)

    where H(Y) is the entropy of Y and H(Y|X) is the conditional entropy of Y given X, i.e. the entropy of Y within each value of X, weighted by how often that value occurs.

    This means that MI(X,Y) measures the amount of information shared by X and Y: a feature X with high MI(X, churn) tells you a lot about churn, while one with MI near 0 is a candidate for removal.
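    The equation above can be sketched in Python (the math is language-agnostic, so the same logic ports directly to MATLAB with `groupsummary` or logical indexing). The feature name `contract` and the sample data are hypothetical, not from the question's dataset:

    ```python
    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy H(Y) of a sequence of class labels, in bits."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def conditional_entropy(feature, labels):
        """H(Y|X): entropy of the labels within each feature value, weighted by frequency."""
        n = len(labels)
        groups = {}
        for x, y in zip(feature, labels):
            groups.setdefault(x, []).append(y)
        return sum(len(g) / n * entropy(g) for g in groups.values())

    def info_gain(feature, labels):
        """MI(X, Y) = H(Y) - H(Y|X)."""
        return entropy(labels) - conditional_entropy(feature, labels)

    # Hypothetical churn labels and one categorical feature.
    churn = [0, 0, 1, 1, 0, 1]
    contract = ['month', 'year', 'month', 'month', 'year', 'month']
    print(info_gain(contract, churn))
    ```

    Note that H(Y|X) is computed by partitioning the rows on the values of feature X, not by deleting the feature's column: that is why the column-deletion loop in the question cannot produce a per-feature entropy.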