I am clumsy but adequate with python. I have referenced stack often, but this is my first question. I have built a decaying average function to act on a pandas data frame with about 10000 rows, but it takes 40 minutes to run. I would appreciate any thoughts on how to speed it up. Here is a sample of actual data, simplified a bit.
sub = pd.DataFrame({
'user_id':[101,101,101,101,101,102,101],
'class_section':['Modern Biology - B','Spanish Novice 1 - D', 'Modern Biology - B','Spanish Novice 1 - D','Spanish Novice 1 - D','Modern Biology - B','Spanish Novice 1 - D'],
'sub_skill':['A','A','B','B','B','B','B'],
'rating' :[2.0,3.0,3.0,2.0,3.0,2.0,2.0],
'date' :['2019-10-16','2019-09-04','2019-09-04', '2019-09-04','2019-09-13','2019-10-16','2019-09-05']})
For this data frame:
sub
Out[716]:
user_id class_section sub_skill rating date
0 101 Modern Biology - B A 2.0 2019-10-16
1 101 Spanish Novice 1 - D A 3.0 2019-09-04
2 101 Modern Biology - B B 3.0 2019-09-04
3 101 Spanish Novice 1 - D B 2.0 2019-09-04
4 101 Spanish Novice 1 - D B 3.0 2019-09-13
5 102 Modern Biology - B B 2.0 2019-10-16
6 101 Spanish Novice 1 - D B 2.0 2019-09-05
A decaying average weights the most recent event that meets conditions at full weight and weights each previous event with a multiplier less than one. In this case, the multiplier is 0.667. previously weighted events are weighted again.
So the decaying average for user 101's rating in Spanish sub_skill B is:
(2.0*0.667^2 + 2.0*0.667^1 + 3.0*0.667^0)/((0.667^2 + 0.667^1 + 0.667^0) = 2.4735
Here is what I tried, after reading a helpful post on weighted averages
sub['date'] = pd.to_datetime(sub.date_due)
def func(date, user_id, class_section, sub_skill):
return sub.apply(lambda row: row['date'] > date
and row['user_id']==user_id
and row['class_section']== class_section
and row['sub_skill']==sub_skill,axis=1).sum()
# for some reason this next line of code took about 40 minutes to run on 9000 rows:
sub['decay_count']=sub.apply(lambda row: func(row['date'],row['user_id'], row['class_section'], row['sub_skill']), axis=1)
# calculate decay factor:
sub['decay_weight']=sub.apply(lambda row: 0.667**row['decay_count'], axis=1)
# calcuate decay average contributors (still needs to be summed):
g = sub.groupby(['user_id','class_section','sub_skill'])
sub['decay_avg'] = sub.decay_weight / g.decay_weight.transform("sum") * sub.rating
# new dataframe with indicator/course summaries as decaying average (note the sum):
indicator_summary = g.decay_avg.sum().to_frame(name = 'DAvg').reset_index()
I frequently work in pandas and I am used to iterating through large datasets. I would have expected this to take rows-squared time, but it is taking much longer. A more elegant solution or some advice to speed it up would be really appreciated!
Some background on this project: I am trying to automate the conversion from proficiency-based grading into a classic course grade for my school. I have the process of data extraction from our Learning Management System into a spreadsheet that does the decaying average and then posts the information to teachers, but I would like to automate the whole process and extract myself from it. The LMS is slow to implement a proficiency-based system and is reluctant to provide a conversion - for good reason. However, we have to communicate both student proficiencies and our conversion to a traditional grade to parents and colleges since that is a language they speak.
Why not use groupby
? The idea here is that you rank the dates within the group in descending order and subtract 1 (because rank starts with 1). That seems to mirror your logic in func
above, without having to try to call apply with a nested apply.
sub['decay_count'] = sub.groupby(['user_id', 'class_section', 'sub_skill'])['date'].rank(method='first', ascending=False) - 1
sub['decay_weight'] = sub['decay_count'].apply(lambda x: 0.667 ** x)
Output:
sub.sort_values(['user_id', 'class_section', 'sub_skill', 'decay_count'])
user_id class_section sub_skill rating date decay_count decay_weight
0 101 Modern Biology - B A 2.0 2019-10-16 0.0 1.000000
2 101 Modern Biology - B B 3.0 2019-09-04 0.0 1.000000
1 101 Spanish Novice 1 - D A 3.0 2019-09-04 0.0 1.000000
3 101 Spanish Novice 1 - D B 2.0 2019-09-04 0.0 1.000000
6 101 Spanish Novice 1 - D B 2.0 2019-09-05 1.0 0.667000
4 101 Spanish Novice 1 - D B 3.0 2019-09-13 2.0 0.444889
5 102 Modern Biology - B B 2.0 2019-10-16 0.0 1.000000