Search code examples
pythonpandastime-seriesdata-cleaningmulti-index

Pandas - Identifying date of switch in a multiindexed dataframe


I have a very large dataframe that is multiindexed as ('ID','Date'); the column 'Value' tracks an individual's progress in time using a boolean.

I know that each individual starts and ends with Value = True. I've been able to locate the date for the first occurrence of False using df.loc[~df['Value'], :], but what I want to be able to do is locate the date when they switched back to True after one or more periods of False. I've tried using variations on .groupby().diff() but this is extremely slow.

Example: I want to extract "7-22-19" for individual A, below:

ID---Date------Value

A----1-30-19---True

A----3-15-19---False

A----4-1-19-----False

A----7-22-19---True

A----11-13-19--True

B----2-1-19-----True

etc.

As an extra caveat, a solution that is both fast (my dataframe has hundreds of thousands of IDs so no loops, and .groupby().diff() seems to be slow) and that works with non-booleans would be ideal (i.e. if we replace True/False with Drug X/Drug Y).

Thank you!


Solution

  • shift is a nice tool to detect transitions in a column. So you could find transitions from False to True for the same ID with:

    df.loc[df['Value']&((~df['Value']).shift())&(df['ID']==df['ID'].shift())]
    

    With your data, it gives as expected:

      ID     Date  Value
    3  A  7-22-19   True