Search code examples
pythonpandasdictionaryvectorization

Is there a vectorization method to match a df column to a dictionary


I have the following (abbreviated) dictionary:

_d = {
    "pain": 
        ["C0030193",
        "C0150055",
        "C0151825",
        "C0184567"],
    "anxiety": 
        ["C0003467",
        "C0003469",
        "C0027769",
        "C0154587",
        "C0231397",
        "C0231401",
        "C0231402"],
    "depression": 
        ["C0001539",
        "C0005587",
        "C0011579",
        "C0011581",
        "C0024517",
        "C0086132"],
    "fatigue": 
        ["C0015672"]
}

and the following dataframe:

df = 
    cui
1   "C0015672"
2   "C0015634"
3   "C0011579"
4   "C0030193"
5   "C0031193"
6   "C0030193"

I want to match up the column df["cui"], such that if the value of df["cui"] is in any of the value lists in then dictionary then I want a new column "symptom" based on the dictionary key, otherwise it would remain null.

This is the desired output:

df = 
        cui           symptom
    1   C0015672      fatigue 
    2   C0015634      NaN
    3   C0011579      depression 
    4   C0030193      pain
    5   C0031193      NaN
    6   C0030193      pain

I can do this by iterating over each row in the dataframe, but since there are 10s of millions of rows, it's super slow. I'm looking for a way to vectorize this.


Solution

  • You can iterate over the dictionary first, then map the dictionary to cui column

    d = {v: k
        for k, lst in _d.items()
        for v in lst}
    
    df['symptom'] = df['cui'].map(d)