Search code examples
pythonpandasdata-cleaning

Aggregating Python dataframe rows based on specific, changing column conditions


I have two dataframes that I want to merge. One contains data on "assessments" done on particular dates for particular clients. The second contains data on different categories of "services" performed for clients on particular dates. See sample code below:

assessments = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
           'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
           'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access']})

ClientID    AssessmentDate  Program
212         2018-01-04      Case Mgmt
212         2018-07-03      Case Mgmt
212         2019-06-10      Case Mgmt
292         2017-08-08      Coordinated Access
292         2017-12-21      Coordinated Access


services = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
           'ServiceDate' : ['2018-01-02', '2018-04-08', '2018-05-23', '2017-09-08', '2017-12-03'],
           'Assistance Navigation' : ['0','1','1','0','1'],
           'Basic Needs' : ['1','0','0','1','2']})

ClientID    ServiceDate Assistance Navigation   Basic Needs
212         2018-01-02  0                       1
212         2018-04-08  1                       0
212         2018-05-23  1                       0
292         2017-09-08  0                       1
292         2017-12-03  1                       2

I want to know how many services of each service type (Assistance Navigation and Basic Needs) occur between consecutive assessments of the same program. In other words, I want to append two columns to the assessments dataframe named 'Assistance Navigation' and 'Basic Needs' that tell me how many Assistance Navigation services and how many Basic Needs services have occurred since the last assessment of the same program. The resulting dataframe would look like this:

assessmentsFinal = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
           'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
           'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access'],
           'Assistance Navigation' : ['0','2','0','0','1'],
           'Basic Needs' : ['0','0','0','0','3']})

ClientID    AssessmentDate  Program             Assistance Navigation   Basic Needs
212         2018-01-04      Case Mgmt           0                       0
212         2018-07-03      Case Mgmt           2                       0
212         2019-06-10      Case Mgmt           0                       0
292         2017-08-08      Coordinated Access  0                       0
292         2017-12-21      Coordinated Access  1                       3

Of course, the real data has many more service categories than just 'Assistance Navigation' and 'Basic Needs' and the number of services and assessments is huge. My current attempt uses loops (which I know is a Pandas sin) and takes a couple of minutes to run, which may pose problems when our dataset gets even larger. Below is the current code for reference. Basically we loop through the assessments dataframe to get the ClientID and the date range and then we go into the services sheet and tally up the service type occurrences. There's got to be a quick and easy way to do this in Pandas but I'm new to the game. Thanks in advance.

servicesDict = {}
prevClient = -1
prevDate = ""
prevProg = ""
categories = ["ClientID","ServiceDate","Family Needs","Housing Navigation","Housing Assistance","Basic Needs","Professional","Education","Financial Aid","Healthcare","Counseling","Contact","Assistance Navigation","Referral","Misc"]

for index, row in assessmentDF.iterrows():
    curClient = row[0]
    curDate = datetime.strptime(row[1], '%m/%d/%y')
    curProg = row[7] 
    curKey = (curClient, curDate)

    if curKey not in servicesDict:
        services = [curClient, curDate, 0,0,0,0,0,0,0,0,0,0,0,0,0]
        servicesDict.update({curKey : services})
    services = servicesDict[curKey]


    #if curDate and prevDate equal each other action required
    if curClient == prevClient and curProg == prevProg:
        boundary = serviceDF[serviceDF['ClientID'] == curClient].index
    
        for x in boundary:
            curRowSer = serviceDF.iloc[x]
            curDateSer = datetime.strptime(curRowSer[1], '%m/%d/%y')
        
            if curDateSer>=prevDate and curDateSer<curDate:
                serviceCategory = curRowSer[5]
                i = categories.index(serviceCategory)
                services[i] = services[i] + 1
                servicesDict.update({curKey : services})
            
                          
    prevClient = curClient
    prevDate = curDate
    prevProg = curProg

servicesCleaned = pd.DataFrame.from_dict(servicesDict, orient = 'index',columns=categories)
#then merge into assessments on clientID and AssessmentDate

Solution

  • One way would be like this. You'll probably have to tweak it for your original dataset, and check the edge cases.

    assessments['PreviousAssessmentDate'] = assessments.groupby(['ClientID', 'Program']).AssessmentDate.shift(1, fill_value='0000-00-00')
    df = assessments.merge(services, on='ClientID', how='left')
    df[df.columns[5:]] = df[df.columns[5:]].multiply((df.AssessmentDate > df.ServiceDate) & (df.PreviousAssessmentDate < df.ServiceDate), axis=0)
    df = df.groupby(['ClientID', 'AssessmentDate', 'Program']).sum().reset_index()
    
      ClientID AssessmentDate             Program  Assistance Navigation  Basic Needs
    0      212     2018-01-04           Case Mgmt                      0            1
    1      212     2018-07-03           Case Mgmt                      2            0
    2      212     2019-06-10           Case Mgmt                      0            0
    3      292     2017-08-08  Coordinated Access                      0            0
    4      292     2017-12-21  Coordinated Access                      1            3
    

    Logic

    1. We shift the AssessmentDate by 1 in order to determine the previous assessment date
    2. We merge the two dataframes on ClientID
    3. We set all service type columns to 0 incase the ServiceDate doesn't fall between PreviousAssessmentDate and the AssessmentDate.
    4. We groupby ClientID, Program and AssessmentDate and do a sum()

    Assumptions

    1. Service type categories are integers
    2. Your data frame is sorted on AssessmentDate (for the shift)