I have two dataframes that I want to merge. One contains data on "assessments" done on particular dates for particular clients. The second contains data on different categories of "services" performed for clients on particular dates. See sample code below:
assessments = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access']})
ClientID AssessmentDate Program
212 2018-01-04 Case Mgmt
212 2018-07-03 Case Mgmt
212 2019-06-10 Case Mgmt
292 2017-08-08 Coordinated Access
292 2017-12-21 Coordinated Access
services = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'ServiceDate' : ['2018-01-02', '2018-04-08', '2018-05-23', '2017-09-08', '2017-12-03'],
'Assistance Navigation' : ['0','1','1','0','1'],
'Basic Needs' : ['1','0','0','1','2']})
ClientID ServiceDate Assistance Navigation Basic Needs
212 2018-01-02 0 1
212 2018-04-08 1 0
212 2018-05-23 1 0
292 2017-09-08 0 1
292 2017-12-03 1 2
I want to know how many services of each service type (Assistance Navigation and Basic Needs) occur between consecutive assessments of the same program. In other words, I want to append two columns to the assessments dataframe named 'Assistance Navigation' and 'Basic Needs' that tell me how many Assistance Navigation services and how many Basic Needs services have occurred since the last assessment of the same program. The resulting dataframe would look like this:
assessmentsFinal = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access'],
'Assistance Navigation' : ['0','2','0','0','1'],
'Basic Needs' : ['0','0','0','0','3']})
ClientID AssessmentDate Program Assistance Navigation Basic Needs
212 2018-01-04 Case Mgmt 0 0
212 2018-07-03 Case Mgmt 2 0
212 2019-06-10 Case Mgmt 0 0
292 2017-08-08 Coordinated Access 0 0
292 2017-12-21 Coordinated Access 1 3
Of course, the real data has many more service categories than just 'Assistance Navigation' and 'Basic Needs' and the number of services and assessments is huge. My current attempt uses loops (which I know is a Pandas sin) and takes a couple of minutes to run, which may pose problems when our dataset gets even larger. Below is the current code for reference. Basically we loop through the assessments dataframe to get the ClientID and the date range and then we go into the services sheet and tally up the service type occurrences. There's got to be a quick and easy way to do this in Pandas but I'm new to the game. Thanks in advance.
servicesDict = {}
prevClient = -1
prevDate = ""
prevProg = ""
categories = ["ClientID","ServiceDate","Family Needs","Housing Navigation","Housing Assistance","Basic Needs","Professional","Education","Financial Aid","Healthcare","Counseling","Contact","Assistance Navigation","Referral","Misc"]
for index, row in assessmentDF.iterrows():
curClient = row[0]
curDate = datetime.strptime(row[1], '%m/%d/%y')
curProg = row[7]
curKey = (curClient, curDate)
if curKey not in servicesDict:
services = [curClient, curDate, 0,0,0,0,0,0,0,0,0,0,0,0,0]
servicesDict.update({curKey : services})
services = servicesDict[curKey]
#if curDate and prevDate equal each other action required
if curClient == prevClient and curProg == prevProg:
boundary = serviceDF[serviceDF['ClientID'] == curClient].index
for x in boundary:
curRowSer = serviceDF.iloc[x]
curDateSer = datetime.strptime(curRowSer[1], '%m/%d/%y')
if curDateSer>=prevDate and curDateSer<curDate:
serviceCategory = curRowSer[5]
i = categories.index(serviceCategory)
services[i] = services[i] + 1
servicesDict.update({curKey : services})
prevClient = curClient
prevDate = curDate
prevProg = curProg
servicesCleaned = pd.DataFrame.from_dict(servicesDict, orient = 'index',columns=categories)
#then merge into assessments on clientID and AssessmentDate
One way would be like this. You'll probably have to tweak it for your original dataset, and check the edge cases.
assessments['PreviousAssessmentDate'] = assessments.groupby(['ClientID', 'Program']).AssessmentDate.shift(1, fill_value='0000-00-00')
df = assessments.merge(services, on='ClientID', how='left')
df[df.columns[5:]] = df[df.columns[5:]].multiply((df.AssessmentDate > df.ServiceDate) & (df.PreviousAssessmentDate < df.ServiceDate), axis=0)
df = df.groupby(['ClientID', 'AssessmentDate', 'Program']).sum().reset_index()
ClientID AssessmentDate Program Assistance Navigation Basic Needs
0 212 2018-01-04 Case Mgmt 0 1
1 212 2018-07-03 Case Mgmt 2 0
2 212 2019-06-10 Case Mgmt 0 0
3 292 2017-08-08 Coordinated Access 0 0
4 292 2017-12-21 Coordinated Access 1 3
AssessmentDate
by 1 in order to determine the
previous assessment dateClientID
0
incase the ServiceDate
doesn't fall between PreviousAssessmentDate
and the AssessmentDate
.ClientID
, Program
and AssessmentDate
and do a sum()