Search code examples
pythonpandasfeature-extractioncumulative-sumproportions

How to calculate the no-show rate for appointments based on a patient's previous appointments in a healthcare dataset using Python and Pandas?


'm working with a healthcare dataset from Kaggle (https://www.kaggle.com/joniarroba/noshowappointments) that contains information about medical appointments in Brazil and whether or not the patient showed up. The dataset has columns for appointment ID, patient ID, appointment date and time, scheduled date and time, and several other features.

I want to calculate the no-show rate for each appointment based on the patient's previous appointments. For example, if a patient has had three appointments and showed up for two of them, the no-show rate for their fourth appointment would be 1/3. If the patient is a first-time appointment, the no-show rate would be 0.

I tried using the following code, but it calculates the current show or no-show rate instead of the no-show rate based on previous appointments:

# convert appointment and scheduled dates to datetime format
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])

# sort the DataFrame by PatientId and AppointmentDay
df = df.sort_values(['PatientId', 'AppointmentDay'])

# create a new column with the time difference between scheduled and appointment date
df['time_diff'] = (df['AppointmentDay'].dt.date - df['ScheduledDay'].dt.date).dt.days

# create a new column to store the no-show rate for each appointment
df['no_show_rate'] = 0

# loop through each row of the DataFrame
for index, row in df.iterrows():
    # get the PatientId and AppointmentDay for the current row
    patient_id = row['PatientId']
    appointment_day = row['AppointmentDay']
    
    # select all previous appointments for the current patient
    previous_appointments = df.loc[(df['PatientId'] == patient_id) & (df['AppointmentDay'] < appointment_day)]
    
    # calculate the no-show rate based on the previous appointments
    appointment_count = len(previous_appointments)
    no_show_count = len(previous_appointments.loc[previous_appointments['No-show'] == 'Yes'])
    if appointment_count > 0:
        no_show_rate = no_show_count / appointment_count
    else:
        no_show_rate = 0
    
    # update the 'no_show_rate' column for the current row
    df.at[index, 'no_show_rate'] = no_show_rate
    
# print first 5 rows
print(df.head())

Solution

  • Try this... I think the column-wise approach from pandas is simpler. To get the no-show rate based on previous appointments, we are gonna use pandas.DataFrame.shift on a Groupby of our dataframe by PatientId to get that lagged value:

    import pandas as pd
    
    #import data
    dat = ...
    
    #sort
    dat = dat.sort_values(['PatientId', 'AppointmentDay'])
    
    #create cumcount for total appointments for ea patient. This iterates on the
    #index starting at 0, so if we want the first appointment to show 1 instead
    #of 0, second 2 instead of 1... then we need to add 1
    dat['appt_ncum'] = dat.groupby(['PatientId']).cumcount() + 1
    
    #for this purpose, it's better to recode 'No-show' to Boolean (1/0)
    #in a temp column and do cumsum on that
    dat['cond'] = dat['No-show'] == 'Yes'
    dat['noshow_ncum'] = dat.groupby('PatientId').cond.cumsum()
    dat = dat.drop(['cond'], axis = 1)
    
    #Calculate no-show proportion based on Patient's previous appointment(s)
    #using DataFrame.shift as described above to lag numerator and denominator
    
    dat['noshow_cum_prop'] = dat.groupby(['PatientId'])['noshow_ncum'].shift(1) / dat.groupby(['PatientId'])['appt_ncum'].shift(1)
    
    #A few patients you can check to make sure metrics are calculated correctly
    #dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 476861615941]
    
    #dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 1421991592826]
    
    #dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 933789553426785]
    
    #dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 416755661551767]