Tags: python, pandas, time-series, weighted-average

How can I build a faster decaying average, comparing each row's date field to the dates of the other rows in a data frame?


I am clumsy but adequate with Python. I have referenced Stack Overflow often, but this is my first question. I have built a decaying average function that acts on a pandas data frame with about 10,000 rows, but it takes 40 minutes to run. I would appreciate any thoughts on how to speed it up. Here is a sample of the actual data, simplified a bit.

sub = pd.DataFrame({
        'user_id':[101,101,101,101,101,102,101],
        'class_section':['Modern Biology - B','Spanish Novice 1 - D', 'Modern Biology - B','Spanish Novice 1 - D','Spanish Novice 1 - D','Modern Biology - B','Spanish Novice 1 - D'],
        'sub_skill':['A','A','B','B','B','B','B'],
        'rating' :[2.0,3.0,3.0,2.0,3.0,2.0,2.0],
        'date' :['2019-10-16','2019-09-04','2019-09-04', '2019-09-04','2019-09-13','2019-10-16','2019-09-05']})

For this data frame:

sub
Out[716]: 
   user_id            class_section sub_skill  rating        date
0      101       Modern Biology - B         A     2.0  2019-10-16
1      101     Spanish Novice 1 - D         A     3.0  2019-09-04
2      101       Modern Biology - B         B     3.0  2019-09-04
3      101     Spanish Novice 1 - D         B     2.0  2019-09-04
4      101     Spanish Novice 1 - D         B     3.0  2019-09-13
5      102       Modern Biology - B         B     2.0  2019-10-16
6      101     Spanish Novice 1 - D         B     2.0  2019-09-05

A decaying average gives the most recent event that meets the conditions full weight, and weights each previous event by a multiplier less than one; in this case, the multiplier is 0.667. Previously weighted events are weighted again each time a newer event arrives.

So the decaying average for user 101's rating in Spanish sub_skill B is:

(2.0*0.667^2 + 2.0*0.667^1 + 3.0*0.667^0) / (0.667^2 + 0.667^1 + 0.667^0) = 2.4735
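
Spelled out in plain Python, the same arithmetic checks out:

# sanity check: the most recent rating (3.0) gets full weight,
# and each older rating decays by another factor of 0.667
weights = [0.667 ** 2, 0.667 ** 1, 0.667 ** 0]
ratings = [2.0, 2.0, 3.0]
decaying_avg = sum(w * r for w, r in zip(weights, ratings)) / sum(weights)
print(round(decaying_avg, 4))  # 2.4735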

Here is what I tried, after reading a helpful post on weighted averages:

# convert the string dates to real datetimes so they sort and compare correctly
sub['date'] = pd.to_datetime(sub.date)

def func(date, user_id, class_section, sub_skill):
    # count the events in the same (user, section, sub_skill) group that are
    # more recent than this one; scanning the whole frame for every row is O(n^2)
    return sub.apply(lambda row: row['date'] > date
                     and row['user_id'] == user_id
                     and row['class_section'] == class_section
                     and row['sub_skill'] == sub_skill, axis=1).sum()

# for some reason this next line of code took about 40 minutes to run on 9000 rows:
sub['decay_count']=sub.apply(lambda row: func(row['date'],row['user_id'], row['class_section'], row['sub_skill']), axis=1)

# calculate decay factor:
sub['decay_weight']=sub.apply(lambda row: 0.667**row['decay_count'], axis=1)

# calculate the decaying average contributions (still need to be summed):
g = sub.groupby(['user_id','class_section','sub_skill'])
sub['decay_avg'] = sub.decay_weight / g.decay_weight.transform("sum") * sub.rating

# new dataframe with indicator/course summaries as decaying average (note the sum):
indicator_summary = g.decay_avg.sum().to_frame(name = 'DAvg').reset_index()

I frequently work in pandas and am used to iterating through large datasets, so I would have expected this to take roughly rows-squared time, but it is taking much longer. A more elegant solution, or some advice on speeding it up, would be really appreciated!

Some background on this project: I am trying to automate the conversion from proficiency-based grading to a classic course grade for my school. I already extract data from our Learning Management System into a spreadsheet that computes the decaying average and posts the information to teachers, but I would like to automate the whole process and remove myself from it. The LMS is slow to implement a proficiency-based system and is reluctant to provide a conversion, for good reason. However, we have to communicate both student proficiencies and our conversion to a traditional grade to parents and colleges, since that is a language they speak.


Solution

  • Why not use groupby? The idea is to rank the dates within each group in descending order and subtract 1 (because rank starts at 1). That mirrors the logic of your func above, without having to call apply with a nested apply.

    sub['decay_count'] = sub.groupby(['user_id', 'class_section', 'sub_skill'])['date'].rank(method='first', ascending=False) - 1
    
    sub['decay_weight'] = 0.667 ** sub['decay_count']
    

    Output:

    sub.sort_values(['user_id', 'class_section', 'sub_skill', 'decay_count'])                                      
    
       user_id         class_section sub_skill  rating       date  decay_count  decay_weight
    0      101    Modern Biology - B         A     2.0 2019-10-16          0.0      1.000000
    2      101    Modern Biology - B         B     3.0 2019-09-04          0.0      1.000000
    1      101  Spanish Novice 1 - D         A     3.0 2019-09-04          0.0      1.000000
    4      101  Spanish Novice 1 - D         B     3.0 2019-09-13          0.0      1.000000
    6      101  Spanish Novice 1 - D         B     2.0 2019-09-05          1.0      0.667000
    3      101  Spanish Novice 1 - D         B     2.0 2019-09-04          2.0      0.444889
    5      102    Modern Biology - B         B     2.0 2019-10-16          0.0      1.000000
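
    From here, the question's own final steps apply unchanged. A minimal sketch, reusing the grouping to normalize the weights and sum the weighted ratings:

    g = sub.groupby(['user_id', 'class_section', 'sub_skill'])
    # each row's share of its group's total weight, times its rating
    sub['decay_avg'] = sub['decay_weight'] / g['decay_weight'].transform('sum') * sub['rating']
    # summing the contributions within each group yields the decaying average
    indicator_summary = g['decay_avg'].sum().to_frame(name='DAvg').reset_index()
    # user 101 / Spanish Novice 1 - D / sub_skill B comes out to ~2.4735,
    # matching the hand calculation in the question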