pandas, group-by, pivot-table, panel-data

Identify change in status due to change in categorical variable in panel data


I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify when a variable changes for each person over time.

Here is the code to generate the data frame:

import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)


df

I am trying to create a column ("change") that flags the date on which an individual's job status changes, e.g. individual 2 moving from A to B on 01/02/2021.

I have tried the following, but it is giving me an error:

df['change']=df.groupby(['id'])['job'].diff().fillna(0)

Solution

  • The error occurs because you call 'diff' on the 'job' column. 'diff' subtracts each value from the previous one, and subtraction is not defined for the strings stored in an 'object' column, so it only works on numeric types.

    Instead, compare each job with the previous one within the same id:

    df["change"] = (
        df.groupby("id")["job"]
        .transform(lambda x: x.ne(x.shift().bfill()))
        .astype(int)
    )
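
The same idea can be sketched end to end without a lambda, as a variation on the answer above rather than a replacement for it. Two things here are assumptions not stated in the question: that the dates are day/month/year (the `format` string), and that rows should be sorted by date within each id before comparing. One behavioural difference is worth noting: with `x.shift().bfill()`, an id that has only a single row has nothing to back-fill from, so its only row is compared against NaN and gets flagged as a change. The version below flags a change only when a previous row actually exists.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)

# Assumption: dates are day/month/year; adjust `format` if they are month/day/year.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Make sure each id's rows are in chronological order before comparing.
df = df.sort_values(["id", "date"]).reset_index(drop=True)

# Previous job within each id (NaN on the first row of every id).
prev_job = df.groupby("id")["job"].shift()

# Flag a change only when a previous row exists and the job differs from it.
df["change"] = (prev_job.notna() & df["job"].ne(prev_job)).astype(int)

print(df["change"].tolist())  # → [0, 0, 0, 1, 0]
```

Only the second row of id 2 is flagged, which is the row where the job moves from A to B.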