how to create categorical field in data frame based on values in another data frame

I need to create a categorical variable based on a value in another data frame. Consider Table 1 which has hospital visits and patient IDs. Note that a patient can visit a hospital more than once:

+----------+------------+
| visit_id | patient_id |
+----------+------------+
|       10 |          1 |
|       20 |          1 |
|       50 |          2 |
|      100 |          3 |
|      110 |          3 |
+----------+------------+

I need to add a new field with a 1 or 0 indicating if the patient received aspirin during a hospital visit, which is in Table 2:

+----------+------------+---------------+
| visit_id | patient_id |  medication   |
+----------+------------+---------------+
|       10 |          1 | aspirin       |
|       10 |          1 | ibuprofin     |
|       20 |          1 | codine        |
|       50 |          2 | aspirin       |
|      100 |          3 | ibuprofin     |
|      110 |          3 | acetaminophin |
|      110 |          3 | vicodin       |
+----------+------------+---------------+

You can see again the multiple levels - you can receive more than one medicine from a doctor, right? This is just an example, of course.

I tried to merge the tables (inner join), which worked...

tab1 = pd.merge(tab1, tab2, on=['visit_id','patient_id'])

tab1['aspirin_index'] = np.where(tab1['medication'].str.contains('aspirin', 
       flags=re.IGNORECASE, regex=True, na=False),1,0)

...but then I ended up with duplicates for patient 1 who got both aspirin and ibuprofen. I just need to know if they got aspirin at least once.

+----------+------------+---------------+
| visit_id | patient_id | aspirin_index |
+----------+------------+---------------+
|       10 |          1 |             1 |
|       10 |          1 |             0 |
+----------+------------+---------------+

I need to get here...same shape as Table 1 but just with the new index.

+----------+------------+---------------+
| visit_id | patient_id | aspirin_index |
+----------+------------+---------------+
|       10 |          1 |             1 |
|       20 |          1 |             0 |
|       50 |          2 |             1 |
|      100 |          3 |             0 |
|      110 |          3 |             0 |
+----------+------------+---------------+

Solution

First let's set up your sample data.

# setup tab1 & tab2
tab1 = pd.DataFrame([[10, 1], [20, 1], [50, 2], [100, 3], [110, 3]], columns=["visit_id","patient_id"])
tab2 = pd.DataFrame([[10, 1, "aspirin"], [10, 1, "ibuprofin"], [20, 1, "codine"], [50, 2, "aspirin"], [100, 3, "ibuprofin"], [110, 3, "acetominophin"], [110, 3, "vicodin"]], columns=["visit_id","patient_id", "medication"])

There's many ways to do this. One approach might be to filter tab2 down to aspirin only, join it to tab1 using a "left" join, then fill nulls with 0.

# filter tab2 to aspirin only
# change column name
# change to 1/0 instead of text since it now only refers to aspirin
aspirin = tab2.loc[tab2.medication=="aspirin"].copy()
aspirin.columns = ["visit_id", "patient_id", "aspirin_index"]
aspirin["aspirin_index"] = 1

# left-outer merge and fill nulls
tab1 = pd.merge(tab1, aspirin, how="left", on=["visit_id","patient_id"])
tab1.aspirin_index.fillna(0, inplace=True)
tab1["aspirin_index"] = tab1.aspirin_index.astype("int")

#  visit_id  patient_id  aspirin_index
#        10           1              1
#        20           1              0
#        50           2              1
#       100           3              0
#       110           3              0

That gets you one column with an "aspirin_index". So that accomplishes your goal.

But what about doing that same exercise with ALL drugs at once... including aspirin? sklearn has some preprocessing functions that make this easy.

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
lb = preprocessing.LabelBinarizer()

# convert each drug into a column of 1's and 0's
all_drugs = pd.DataFrame(lb.fit_transform(le.fit_transform(tab2.medication)), columns=le.classes_)

# concat with source data, aggregate, and clean up
tab2 = pd.concat((tab2.loc[:,["visit_id", "patient_id"]].copy(), all_drugs), axis=1)
tab2 = tab2.groupby(["visit_id", "patient_id"]).agg(np.sum)
tab2.reset_index(inplace=True)

#  visit_id  patient_id  acetominophin  aspirin  codine  ibuprofin  vicodin
#        10           1              0        1       0          1        0
#        20           1              0        0       1          0        0
#        50           2              0        1       0          0        0
#       100           3              0        0       0          1        0
#       110           3              1        0       0          0        1

This is a pretty common approach to get categorical data as columns of binary features. But it takes up loads of space.

What about sticking with one column that has every drug for that visit in a list? That would let you do text searches and not have dense columns of mostly 0's for rare drugs.

# create tab1 with ALL meds taken on each visit
tab2 = tab2.groupby(["visit_id", "patient_id"]).agg({"medication": list})
tab1 = pd.merge(tab1, tab2, how="left", on=["visit_id","patient_id"])

#  visit_id  patient_id                medication
#        10           1      [aspirin, ibuprofin]
#        20           1                  [codine]
#        50           2                 [aspirin]
#       100           3               [ibuprofin]
#       110           3  [acetominophin, vicodin]

# helper function to extract records for ANY drug
def drug_finder(drug):
  idx = tab1.medication.apply(lambda drugs: drug in drugs)
  return tab1.loc[idx].copy()

# find aspirin
drug_finder("aspirin")

#  visit_id  patient_id            medication
#        10           1  [aspirin, ibuprofin]
#        50           2             [aspirin]

# find ibuprofin
drug_finder("ibuprofin")

#  visit_id  patient_id            medication
#        10           1  [aspirin, ibuprofin]
#       100           3           [ibuprofin]