I previously asked the question on how to do a countifs in python across multiple data frames, just like you can do countifs on separate worksheets in Excel. somebody gave me a very creative answer:
python pandas countifs using multiple criteria AND multiple data frames
Thank you for that @AlexG--I tried it, and it worked superbly:
import pandas as pd
import numpy as np
import matplotlib as plt
#import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")
#get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values
#prepare an empty list to hold the results
N_exams_passed = []
#count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
N_exams_passed.append(len(exams[(exams['Student Number']==s_id) &
(exams['Exam Grade Date']>=s_enroll) &
(exams['Exam Grade Date']<=s_qual) &
(exams['Exam Grade']>=70)])
)
#add the results to the original data set
students['Exams Passed'] = N_exams_passed
HOWEVER, it only worked effectively on small data sets. When I ran the data with 100,000s of rows, it wouldn't even be done overnight. It doesn't seem very pythonic.
The SQL way you can do this in seconds is to use a correlated subquery, like this:
SELECT
s.*,
(SELECT COUNT(e.[Exam Grade])
FROM
exams AS e
WHERE
e.[Exam Grade] >= 65
AND e.[Student Number] = s.[Student Number]
AND e.[Exam Grade Date] >= s.[Enrollment Date]
AND e.[Exam Grade Date] <= s.[Detail Date])
AS ExamsPassed
FROM
students AS s;
How do I reproduce such a correlated subquery in pandas or some other pythonic way?
Here are the data frames:
#Students
Student Number Enroll Date Detail Date
1 1/1/2016 2/1/2016
1 1/1/2016 3/1/2016
2 2/1/2016 3/1/2016
3 3/1/2016 4/1/2016
#Exams
Student Number Exam Date Exam Grade
1 1/1/2016 50
1 1/15/2016 80
1 1/28/2016 90
1 2/5/2016 100
1 3/5/2016 80
1 4/5/2016 40
2 2/2/2016 85
2 2/3/2016 10
2 2/4/2016 100
Final data frame should look like this, with a count of 'Passed Exams' at the end:
#FinalResult
Student Number Enroll Date Detail Date Passed Exams
1 1/1/2016 2/1/2016 2
1 1/1/2016 3/1/2016 3
2 2/1/2016 3/1/2016 2
3 3/1/2016 4/1/2016 0
If I understand the structure of your dataframes correctly, I'd suggest merging the two dataframes and then performing the task on the merged data using numpy.where
.
import numpy as np
exams = exams.merge(students, on='Student Number', how='left')
exams['Passed'] = np.where(
(exams['Exam Grade Date'] >= exams['Enrollment Date']) &
(exams['Exam Grade Date'] <= exams['Detail Date']) &
(exams['Grade'] >= 70),
1, 0)
students = students.merge(
exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
left_on=['Student Number', 'Detail Date'],
right_on=['Student Number', 'Detail Date'],
how='left')
students['Passed'] = students['Passed'].fillna(0).astype('int')
Note: you'll need to make sure the date columns are properly stored as datetimes (you can use pandas.to_datetime
to do this).
numpy.where
creates a new array where the values are one way (1
in the example above) if the conditions you specify are met and another (0
) if they aren't met.
The line exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum()
produces a series in which the index is Student Number
and Detail Date
and the values are the counts of passed exams corresponding to that Student Number
and Detail Date
combination. The reset_index()
makes it into a dataframe for merging.