python, pandas, machine-learning, scikit-learn, categorical-data

Feature selection when both the independent variables and the target variable are categorical


I have presented a small sample of the dataset that I am working on. My original dataset has around 400 'Symptom' columns and one 'Disease' column. The expected output is the top N (maybe 10 or some other number) symptoms that are most significant for a particular disease. My sample dataset is as follows:

fever    headache    sore throat          drowsiness               Disease
    0        0         1                   0                      Fungal infection
    0        0         0                   1                      Fungal infection
    0        1         0                   0                      liver infection
    1        0         0                   1                      diarrhoea
    0        0         1                   1                      common cold
    0        1         1                   0                      diarrhoea
    1        0         0                   0                      flu
    

I have tried using sklearn's SelectKBest but cannot comprehend the results. I also want to know whether pandas' dataframe.corr function can work in this case.
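
For reference, a minimal sketch of the kind of SelectKBest attempt I mean (the chi2 scoring function and k=2 are assumptions here; chi2 is suitable for non-negative/binary features):

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, chi2

    df = pd.DataFrame({
        'fever':      [0, 0, 0, 1, 0, 0, 1],
        'headache':   [0, 0, 1, 0, 0, 1, 0],
        'sorethroat': [1, 0, 0, 0, 1, 1, 0],
        'drowsiness': [0, 1, 0, 1, 1, 0, 0],
        'Disease':    ['Fungal infection', 'Fungal infection', 'liver infection',
                       'diarrhoea', 'common cold', 'diarrhoea', 'flu'],
    })

    X, y = df.drop(columns='Disease'), df['Disease']
    selector = SelectKBest(chi2, k=2).fit(X, y)

    # scores_ holds one chi2 statistic per symptom column; a higher score means
    # the symptom varies more across diseases overall, but it does not say
    # which disease a symptom points to.
    print(dict(zip(X.columns, selector.scores_)))
    print(X.columns[selector.get_support()])  # the k columns that were kept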


Solution

  • One way to address this problem is to use a Naive Bayes classifier with the feature probabilities modelled as Bernoulli distributions. This treats the independent variables not as general categorical variables, as you mention in the question, but simply as binary variables. I think that's a more reasonable assumption, and it seems to me it follows from the construction of your input data, where the input variables appear to be binary.

    A first model pass can be the following (adapting the important_features function from this answer):

    import numpy as np
    import pandas as pd
    from sklearn.naive_bayes import BernoulliNB
    
    def important_features(classifier, feature_names, n=20):
        class_labels = classifier.classes_

        # Print one block per disease class: the n features with the
        # highest log-probability of being present given that class.
        for i, class_label in enumerate(class_labels):
            print("Important features in ", class_label)
            topn_class = sorted(zip(classifier.feature_log_prob_[i], feature_names),
                                reverse=True)[:n]

            for coef, feat in topn_class:
                print(coef, feat)
            print('-----------------------')
    
    # Recreate the sample dataset from the question.
    d = {}
    d['fever'] = np.array([0,0,0,1,0,0,1])
    d['headache'] = np.array([0,0,1,0,0,1,0])
    d['sorethroat'] = np.array([1,0,0,0,1,1,0])
    d['drowsiness'] = np.array([0,1,0,1,1,0,0])
    d['disease'] = ['Fungal infection','Fungal infection','liver infection',
               'diarrhoea','common cold','diarrhoea','flu']
    
    df = pd.DataFrame(d)
    
    X = df[df.columns[:-1]]
    y = df['disease']
    
    clf = BernoulliNB()
    clf.fit(X, y)
    
    important_features(clf, df.columns[:-1])
    

    This should give you the following output, which of course is just for demonstration purposes, as I only used the data you provided above:

    Important features in  Fungal infection
    -0.6931471805599453 sorethroat
    -0.6931471805599453 drowsiness
    -1.3862943611198906 headache
    -1.3862943611198906 fever
    -----------------------
    Important features in  common cold
    -0.4054651081081645 sorethroat
    -0.4054651081081645 drowsiness
    -1.0986122886681098 headache
    -1.0986122886681098 fever
    -----------------------
    Important features in  diarrhoea
    -0.6931471805599453 sorethroat
    -0.6931471805599453 headache
    -0.6931471805599453 fever
    -0.6931471805599453 drowsiness
    -----------------------
    Important features in  flu
    -0.4054651081081645 fever
    -1.0986122886681098 sorethroat
    -1.0986122886681098 headache
    -1.0986122886681098 drowsiness
    -----------------------
    Important features in  liver infection
    -0.4054651081081645 headache
    -1.0986122886681098 sorethroat
    -1.0986122886681098 fever
    -1.0986122886681098 drowsiness
    -----------------------
    

    Naive Bayes of course doesn't account for correlations between the independent variables; e.g. one could be more likely to have a headache if they have a fever anyway, independently of the underlying disease. If this limitation is not an issue for you, then you could go ahead and run the model on all your data. Note that it would probably be very difficult to train a more general model that estimates all the possible correlations from the data.
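
    If you do proceed with this model, here is a minimal sketch of how the fitted clf from above might be used on a new patient (predict_proba and classes_ are standard scikit-learn attributes; the symptom values are made up for illustration):

    # Rank the candidate diseases for a new patient with fever and drowsiness.
    # The column order must match the X the classifier was fitted on.
    new_patient = pd.DataFrame([[1, 0, 0, 1]], columns=X.columns)

    probs = clf.predict_proba(new_patient)[0]  # one probability per class
    for p, disease in sorted(zip(probs, clf.classes_), reverse=True):
        print(f"{disease}: {p:.3f}")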

    Finally, note that pandas' corr method will give you the correlations among the independent variables, but it won't have anything to do with a model predicting the disease from the inputs.
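
    For completeness, a short sketch of what df.corr would look like on the data above (it computes pairwise Pearson correlations over numeric columns only, so the string-valued disease column has to be left out):

    # Pairwise correlations among the binary symptom columns only. This can
    # reveal co-occurring symptoms, but it says nothing about how the
    # symptoms relate to the disease labels.
    symptom_corr = df[df.columns[:-1]].corr()
    print(symptom_corr)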