I have a student table that contains student_id
, course_id
, and exam_time
(10k rows). I pivot on student_id
and exam_time
to get the number of exams in a session or in a day. I am building a timetabling heuristic that changes the times of the examinations one at a time so I need to update this pivot table a lot of times. A change in one course's examination time affects an average of 50 rows in the original dataframe. Is there a way to update the resulting pivot table without recalculating the whole thing in pandas or should I keep track of the changes on the pivot table myself (i.e. by adding and subtracting 1 to the changed slots)?
Edit: Here's how I construct the pivot table. I added a column of ones to count the numbers by np.sum. I couldn't find another function that works faster.
sLength = len(df["student_id"])
df["ones"] = pd.Series(np.ones(sLength))
pivot_table = pd.pivot_table(df, rows = "student_id", cols = "exam_time", values = "ones", aggfunc = np.sum)
And for the changes in examination times, I wrote this (assuming changed_course
is moved from old_slot
to new_slot
)
affected_students = df[df["course_id"] == changed_course]["student_id"]
pivot_table[old_slot][affected_students] -= 1
pivot_table[new_slot][affected_students] += 1
Here is sample code, the idea is update the total pivot table by subtract the pivot table of old rows and add the pivot table of new rows.
So every time you change the data, you call twice pivot_table()
, and one add()
and one sub()
:
import numpy as np
import pandas as pd
### create random data
N = 1000
a = np.random.randint(0, 100, N)
b = np.random.randint(0, 30, N)
c = np.random.randint(0, 10, N)
df = pd.DataFrame({"a":a, "b":b, "c":c})
### calculate pivot sum
res = df.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
### create random rows to change
M = 100
row_index = np.unique(np.random.randint(0, N, M))
old_rows = df.iloc[row_index]
M = old_rows.shape[0]
new_rows = pd.DataFrame({"a":np.random.randint(0, 100, M),
"b":np.random.randint(0, 30, M),
"c":np.random.randint(0, 10, M)})
### update pivot table
sub_df = old_rows.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
add_df = new_rows.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
new_res = res.sub(sub_df, fill_value=0).add(add_df, fill_value=0)
### check result
df.iloc[row_index] = new_rows.values
res2 = df.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
print(new_res.astype(int).equals(res2))