Search code examples
pythonpandasdataframenumpydata-cleaning

How to clean survey data in pandas


Input:

enter image description here Output:

enter image description here

here's the data:

d = {'Morning': ["Didn't answer", "Didn't answer", "Didn't answer", 'Morning', "Didn't answer"], 'Afternoon': ["Didn't answer", 'Afternoon', "Didn't answer", 'Afternoon', "Didn't answer"], 'Night': ["Didn't answer", 'Night', "Didn't answer", 'Night', 'Night'], 'Sporadic': ["Didn't answer", "Didn't answer", 'Sporadic', "Didn't answer", "Didn't answer"], 'Constant': ["Didn't answer", "Didn't answer", "Didn't answer", 'Constant', "Didn't answer"]}

         Morning      Afternoon          Night       Sporadic       Constant
0  Didn't answer  Didn't answer  Didn't answer  Didn't answer  Didn't answer
1  Didn't answer      Afternoon          Night  Didn't answer  Didn't answer
2  Didn't answer  Didn't answer  Didn't answer       Sporadic  Didn't answer
3        Morning      Afternoon          Night  Didn't answer       Constant
4  Didn't answer  Didn't answer          Night  Didn't answer  Didn't answer

I want the output to be:

d = {"Time of the day": ["Didn't answer", "['Afternoon', 'Night']", "Sporadic", "['Morning', 'Afternoon', 'Night', 'Constant']", "Night"]}

                                 Time of the day
0                                  Didn't answer
1                         ['Afternoon', 'Night']
2                                       Sporadic
3  ['Morning', 'Afternoon', 'Night', 'Constant']
4                                          Night

so if there's no answer in every column in a row, the value in the new data frame would be "Didn't answer" and if there's at least one answer like "night", the value in the new data frame would be "night" and if there are multiple answers like " Morning", "Night" the value in the new data frame would be a list of the answers


Solution

  • You can use:

    df["ToD"] = (df.replace("Didn't answer", np.nan).stack().groupby(level=0)
                   .apply(lambda x: [i for i in x] if len(x) > 1 else x.iloc[0])
                   .reindex(df.index, fill_value="Didn't answer"))
    

    Output:

    >>> df["ToD"]
    0                            Didn't answer
    1                       [Afternoon, Night]
    2                                 Sporadic
    3    [Morning, Afternoon, Night, Constant]
    4                                    Night
    Name: ToD, dtype: object