Search code examples
pythonpandasfiltertypeerror

python - How can I use a dictionary as multiple condition in pandas.query()?


I have workers dictionary with names as keys and lists of two dates as values, that form some time interval:

workers = {
    'Alan': [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
    'Ben': [dt.date(2018, 3, 28), dt.date(2021, 5, 7)]
}

And also I have df dataframe with some events, which one is related to some worker:

event_id    person  event_date
1LFDVDX     Alan    2018-10-28
4DLDQVC     Ben     2022-02-01
5PEXVGH     Ben     2019-09-05
9OPCLXD     John    2020-06-15

So I'm trying to filter df with query() using dictionary as multiple condition. Particularly in described case I want to get from df all events for Alan with event_time between 2017-01-15 and 2020-05-04 and also all events for Ben with event_time between 2018-03-28 and 2021-05-07. So expected result should be as below:

event_id    worker  event_date
1LFDVDX     Alan    2018-10-28
5PEXVGH     Ben     2019-09-05

I was trying next one:

df.query('person in @workers and event_date >= @workers[person][0] and event_date <= @workers[person][1]')

But got TypeError: unhashable type: 'Series'

How can I solve this? Thank you in advance.


Solution

  • Problem

    The problem is in this expression:

    person in @workers
    

    Here person is a Series that is a not hashable object, which is being searched in the keys of the dictionary. To solve this particular problem you use isin, but that only answers the question partially.

    Solution

    I suggest you do this instead:

    from collections import defaultdict
    
    # create a defaultdict to be used in map, is useful for given a default values to NaN
    lookup = defaultdict(list, workers)
    
    # create new Series applying the lookup dictionary
    person_map = df["person"].map(lookup)
    
    # convert the series to a DataFrame with the same index as df
    ran = pd.DataFrame(data=person_map.to_list(), columns=["start", "end"], index=person_map.index)
    
    # use a very clean query
    res = df.query("person.isin(@workers) and @ran.start < event_date < @ran.end")
    
    print(res)
    

    Output

      event_id person event_date
    0  1LFDVDX   Alan 2018-10-28
    2  5PEXVGH    Ben 2019-09-05