faster way to run countifs in python

I previously asked the question on how to do a countifs in python across multiple data frames, just like you can do countifs on separate worksheets in Excel. somebody gave me a very creative answer:

python pandas countifs using multiple criteria AND multiple data frames

Thank you for that @AlexG--I tried it, and it worked superbly:

import pandas as pd
import numpy as np
import matplotlib as plt

#import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")

#get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values

#prepare an empty list to hold the results
N_exams_passed = []

#count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
N_exams_passed.append(len(exams[(exams['Student Number']==s_id) &
                         (exams['Exam Grade Date']>=s_enroll) &
                         (exams['Exam Grade Date']<=s_qual) &
                         (exams['Exam Grade']>=70)])
                      )

#add the results to the original data set
students['Exams Passed'] = N_exams_passed

HOWEVER, it only worked effectively on small data sets. When I ran the data with 100,000s of rows, it wouldn't even be done overnight. It doesn't seem very pythonic.

The SQL way you can do this in seconds is to use a correlated subquery, like this:

SELECT
   s.*,
   (SELECT COUNT(e.[Exam Grade]) 
 FROM
     exams AS e 
 WHERE
    e.[Exam Grade] >= 65 
    AND e.[Student Number] = s.[Student Number] 
    AND e.[Exam Grade Date] >= s.[Enrollment Date] 
    AND e.[Exam Grade Date] <= s.[Detail Date]) 
    AS ExamsPassed
FROM 
    students AS s;

How do I reproduce such a correlated subquery in pandas or some other pythonic way?

Here are the data frames:

 #Students
 Student Number Enroll Date Detail Date
 1              1/1/2016    2/1/2016
 1              1/1/2016    3/1/2016
 2              2/1/2016    3/1/2016
 3              3/1/2016    4/1/2016

 #Exams
 Student Number Exam Date   Exam Grade
 1              1/1/2016    50
 1              1/15/2016   80
 1              1/28/2016   90
 1              2/5/2016    100
 1              3/5/2016    80
 1              4/5/2016    40
 2              2/2/2016    85
 2              2/3/2016    10
 2              2/4/2016    100

Final data frame should look like this, with a count of 'Passed Exams' at the end:

 #FinalResult
 Student Number Enroll Date Detail Date Passed Exams
 1              1/1/2016    2/1/2016    2
 1              1/1/2016    3/1/2016    3
 2              2/1/2016    3/1/2016    2
 3              3/1/2016    4/1/2016    0

Solution

If I understand the structure of your dataframes correctly, I'd suggest merging the two dataframes and then performing the task on the merged data using numpy.where.

import numpy as np

exams = exams.merge(students, on='Student Number', how='left')
exams['Passed'] = np.where(
    (exams['Exam Grade Date'] >= exams['Enrollment Date']) &
    (exams['Exam Grade Date'] <= exams['Detail Date']) &
    (exams['Grade'] >= 70),
    1, 0)

students = students.merge(
    exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
    left_on=['Student Number', 'Detail Date'],
    right_on=['Student Number', 'Detail Date'],
    how='left')
students['Passed'] = students['Passed'].fillna(0).astype('int')

Note: you'll need to make sure the date columns are properly stored as datetimes (you can use pandas.to_datetime to do this).

numpy.where creates a new array where the values are one way (1 in the example above) if the conditions you specify are met and another (0) if they aren't met.

The line exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum() produces a series in which the index is Student Number and Detail Date and the values are the counts of passed exams corresponding to that Student Number and Detail Date combination. The reset_index() makes it into a dataframe for merging.