This is related to a question I posted here but this one is more specific and simpler.
I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited / not relevant). The matrix is mostly NaNs: there are several hundred events and most users were invited to at most a few dozen of them.
I created some extra columns to measure "success", which I define simply as the % attended relative to invites:
# 'invited' = number of non-NaN event entries per user
my_data['invited'] = my_data.count(axis=1)
# sum(axis=1) now also includes the new 'invited' column, so subtract it to get the number of 1's (events attended)
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']
my_data['success'] = my_data['attended'] / my_data['invited']  # fraction of invitations attended
My goal right now is to do feature selection on the events/columns, starting with the most basic variance-based method: remove those with low variance. Then I would look at a linear regression on the events and keep only those with large coefficients and small p-values.
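For concreteness, the variance step I have in mind would be something like the following once the NaNs are handled (just a rough sketch - the 0.01 threshold is a placeholder, not a value I've settled on):

from sklearn.feature_selection import VarianceThreshold

# event columns only, leaving out the derived 'invited' / 'attended' / 'success' columns
event_cols = [c for c in my_data.columns if c not in ('invited', 'attended', 'success')]
X = my_data[event_cols]

selector = VarianceThreshold(threshold=0.01)  # placeholder cutoff
X_reduced = selector.fit_transform(X)
kept_events = [c for c, keep in zip(event_cols, selector.get_support()) if keep]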
But my problem is that I have so many NaNs, and I'm not sure how to deal with them correctly: most scikit-learn methods give me errors because of them. One idea is to replace 'didn't attend' with -1 and 'not invited' with 0, but I'm worried this will alter the significance of the events.
Can anyone suggest the proper way to deal with all these NaNs without altering the statistical significance of each feature/event?
Edit: I'd like to add that I'm happy to change my metric for "success" from the above if there is a reasonable one which will allow me to move forward with feature selection. I am just trying to determine which events are effective in capturing user interest. It's pretty open-ended and this is mostly an exercise to practice feature selection.
Thank you!
If I understand correctly, you would like to clean your data of NaNs without significantly altering the statistical properties within it - so that you can run some analysis afterwards.
I actually came across something similar recently. One simple approach you might be interested in is sklearn's 'Imputer'. As EdChum mentioned earlier, one idea is to replace NaNs with the mean along an axis; other options include replacing with the median, for example.
Something like:
from sklearn.preprocessing import Imputer
# axis=0 imputes column-wise, i.e. each NaN is filled with the mean of its event column
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
cleaned_data = imp.fit_transform(original_data)
This will replace each NaN with the mean along the chosen axis; with axis=0 each missing value is filled with the mean of its event column. You could then round the cleaned data to make sure you get 0's and 1's.
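As a side note, in newer versions of scikit-learn Imputer has been replaced by SimpleImputer (which always imputes column-wise), and you can wrap the result back into a DataFrame to keep your event labels. A rough sketch, assuming none of your event columns is entirely NaN (SimpleImputer drops all-NaN columns, which would break the column mapping below):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# fill each NaN with the mean attendance rate of its event column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
cleaned = pd.DataFrame(imp.fit_transform(original_data),
                       index=original_data.index,
                       columns=original_data.columns)

cleaned_binary = cleaned.round().astype(int)  # optionally snap back to hard 0/1 labels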
I would plot a few histograms of the data, by event, to sanity check whether this preprocessing significantly changes the distributions - we may be introducing too much bias by swapping so many values for the mean / mode / median along each axis.
Link for reference: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
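For that sanity check, something along these lines works (a quick sketch using matplotlib; 'cleaned' is the imputed DataFrame from the sketch above, and spot-checking the first five events is arbitrary):

import matplotlib.pyplot as plt

# compare the attendance distribution before and after imputation for a few events
for col in list(original_data.columns)[:5]:
    fig, axes = plt.subplots(1, 2, sharey=True)
    original_data[col].dropna().hist(ax=axes[0], bins=10)
    axes[0].set_title('%s - original' % col)
    cleaned[col].hist(ax=axes[1], bins=10)
    axes[1].set_title('%s - imputed' % col)
plt.show()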
Taking things one step further (assuming the above is not sufficient), you could alternatively do the following: for each event column, estimate the attendance probability 'p' as attended / invited over the non-NaN entries, then replace the NaNs in that column with random numbers drawn from a Bernoulli distribution with that 'p'. Roughly something like:
import numpy as np
n = 1     # one trial per draw, i.e. a Bernoulli sample
p = 0.25  # estimated attendance probability for this event (replace with attended / invited)
s = np.random.binomial(n, p, 1000)  # 1000 is a placeholder; use the number of NaNs in the column
# s now contains a bunch of random 1's and 0's you can use to replace the NaN values in each column
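To spell out the per-column fill (a minimal sketch - the loop and variable names are just illustrative, and it assumes original_data holds only the event columns):

import numpy as np

filled = original_data.copy()
for col in filled.columns:
    mask = filled[col].isnull()
    invited = filled[col].notnull().sum()
    if invited == 0:
        continue  # nobody was invited to this event, nothing to estimate
    p = filled[col].sum() / float(invited)  # attended / invited for this event
    filled.loc[mask, col] = np.random.binomial(1, p, mask.sum())  # one Bernoulli draw per missing entry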
Again, this in itself is not perfect, as you are still going to end up slightly biasing your data (e.g. a more accurate approach would account for dependencies across events for each user) - but by sampling from a roughly matching distribution this should at least be more robust than replacing everything with mean values etc.
Hope this helps!