Search code examples
pythonpandaspercentile

Pandas ignore missing dates to find percentiles


I have a dataframe. I am trying to find percentiles of datetimes. I am using the function:

Dataframe:

student, attempts, time
student 1,14, 9/3/2019  12:32:32 AM
student 2,2, 9/3/2019  9:37:14 PM
student 3, 5
student 4, 16, 9/5/2019  8:58:14 PM

studentInfo2 = [14, 4, Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(data['time'].notnull(), student2Info[2], 'rank')

where student2Info[2] holds the datetime for a particular student. When I try and do this I get the error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Any ideas on how I can get the percentile to calculate correctly even when there are missing times in the columns?


Solution

  • You need to transform the Timestamps into units that percentileofscore can understand. Also, pd.DataFrame.notnull() returns a boolean list that you may use to filter your DataFrame, it does not return the filtered list, so I've updated that for you. Here is a working example:

    import pandas as pd
    import scipy.stats as stats
    
    data = pd.DataFrame.from_dict({
        "student": [1, 2, 3, 4],
        "attempts": [14, 2, 5, 16],
        "time_0001": [
            "9/3/2019  12:32:32 AM",
            "9/3/2019  9:37:14 PM",
            "",
            "9/5/2019  8:58:14 PM"
        ]
    })
    
    student2Info = [14, 4, pd.Timestamp('2019-09-04 00:26:36')]
    data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
    perc1_first = stats.percentileofscore(data[data['time'].notnull()].time.transform(pd.Timestamp.toordinal), student2Info[2].toordinal(), 'rank')
    print(perc1_first)  #-> 66.66666666666667