Search code examples
pythonnumpypandasnandata-cleaning

Risks re changing NaN responses to zero in Python DataFrame


I have a large-ish survey dataset to clean (300 columns, 30000 rows) and the columns are mixed. I'm using Python with pandas and numpy. Am very much in the learner wheels stage using Python.

  • Some of the columns had Y or N answers to questions (and these are filled "Y" or "N").
  • Some were likert scale questions with 5 possible answers. In the CSV file each answer (agree, disagree etc.) has its own column. This has imported as 1 for a yes and NaN otherwise.
  • Other questions had up to 10 possible answers (e.g. for age) and these have imported as a string in one column - i.e. "a. 0-18" or "b. 19-25" and so on. Changing those will be interesting!

As I go through I'm changing the Y/N answers to 1 or 0. However, for the likert scale columns, I'm concerned that there might be a risk with doing the same thing. Does anyone have a view as to whether it would be preferable to leave the data for those as NaN for now? Gender is the same - there is a separate column for Males and one for Females, both populated with 1 for yes and NaN for no.

I'm intending to use Python for the data analysis/charting (will import matplotlib & seaborn). As this is new to me I'm guessing that changes I make now may have unintended consequences later!

Any guidance you can give would be much appreciated.

Thanks in advance.


Solution

  • If there aren't 0's that mean anything, it's fine to fill the NA's with a value (0 for convenience). It all depends on your data. That said, 300 x 30k isn't that big. Save it off as a CSV and just experiment in IPython Notebook, Pandas can probably read it in under a second, so if you screw anything up, just reload.

    Here's a quick bit of code that can condense whatever multi-column question sets into single columns with some number:

    df = pd.DataFrame({
        1: {'agree': 1}, 
        2: {'disagree': 1}, 
        3: {'whatevs': 1}, 
        4: {'whatevs': 1}}).transpose()
    df
    

    generated DataFrame

    question_sets = {
        'set_1': ['disagree', 'whatevs', 'agree'], # define these lists from 1 to whatever
    }
    
    for setname, setcols in question_sets.items():
        # plug the NaNs with 0
        df[setcols].fillna(0)
    
        # scale each column with 0 or 1 in the question set with an ascending value
        for val, col in enumerate(setcols, start=1):
            df[col] *= val
    
        # create new column by summing all the question set columns
        df[setname] = df[question_set_columns].sum(axis=1)
    
        # delete all the old columns
        df.drop(setcols, inplace=True, axis=1) 
    
    df
    

    resultant DataFrame