python pandas dataframe group-by row

How to apply a function pairwise on rows in a series?


I want something like this: df.groupby("A")["B"].diff()

But instead of diff(), I want to check whether two consecutive rows are different or identical, returning 1 if the current row differs from the previous one and 0 if it is identical.

Moreover, I really would like to use a custom function instead of diff(), so that I can do general pairwise row operations.

I tried using .rolling(2) and .apply() in different places, but I just cannot get it to work.

Edit:

Each row in the dataset is a packet.

The first row in the dataset is the first recorded packet, and the last row is the last recorded packet, i.e., they are ordered by time.

One of the features (columns) is called "ID", and several packets have the same ID. Another column is called "data"; its values are 64-bit binary values stored as strings, i.e., 001011010011001.....10010 (length 64).

I want to create two new features (columns):

Compare the "data" field of the current packet with the "data" field of the previous packet with the same ID, and compute:

  1. Whether they are different (1 or 0)
  2. How different they are (a value between 0 and 1)

Solution

  • Use DataFrameGroupBy.shift to get the previous value within each ID, then compare it with the current value for inequality using Series.ne:

    df["dc"] = df.groupby("ID")["data"].shift().ne(df['data']).astype(int)
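
As a rough illustration of what the line above produces, here is a minimal sketch on made-up data. The toy DataFrame, the 4-bit strings (standing in for the 64-bit values), and the extra column name dc_first0 are assumptions for the example, not part of the original question.

```python
import pandas as pd

# Hypothetical toy data: 4-bit strings stand in for the 64-bit "data" values.
df = pd.DataFrame({
    "ID":   ["a", "b", "a", "a", "b", "b"],
    "data": ["0101", "1111", "0101", "0111", "1111", "0000"],
})

# Previous "data" value within the same ID (missing for the first packet of each ID).
prev = df.groupby("ID")["data"].shift()

# 1 where the current value differs from the previous one, else 0.
df["dc"] = prev.ne(df["data"]).astype(int)

# The first packet of each ID has no predecessor; a missing value never
# compares equal, so those rows get 1. To mark them as 0 instead:
df["dc_first0"] = (prev.ne(df["data"]) & prev.notna()).astype(int)

print(df)
```

Whether the first packet of each ID should count as "different" depends on the use case; the second column shows one way to treat it as unchanged.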
    

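    EDIT: for the correlation between the two Series, use Series.corr. Note that it returns a single scalar for the whole column, so this assigns the same value to every row:

    df["dc"] = df['data'].corr(df.groupby("ID")["data"].shift())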
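
For a per-row "how different" value and for general custom pairwise operations, one possible approach is to pair each row with its shifted predecessor and call an ordinary Python function on the pair. The sketch below is only an illustration: bit_difference, pairwise_by_group, and the diff_frac column are hypothetical names, and using the fraction of differing bit positions as the measure between 0 and 1 is an assumption about what "how different" should mean.

```python
import pandas as pd

def bit_difference(current, previous):
    """Fraction of positions at which two equal-length bit strings differ (0.0 to 1.0)."""
    if pd.isna(previous):
        # No earlier packet with this ID, so there is nothing to compare against.
        return float("nan")
    return sum(c != p for c, p in zip(current, previous)) / len(current)

def pairwise_by_group(df, key, col, func):
    """Apply func(current, previous) to each row and its predecessor within the same group."""
    prev = df.groupby(key)[col].shift()
    return pd.Series(
        [func(c, p) for c, p in zip(df[col], prev)],
        index=df.index,
    )

# Hypothetical toy data: 4-bit strings stand in for the 64-bit "data" values.
df = pd.DataFrame({
    "ID":   ["a", "b", "a", "a", "b", "b"],
    "data": ["0101", "1111", "0101", "0111", "1111", "0000"],
})

df["diff_frac"] = pairwise_by_group(df, "ID", "data", bit_difference)
print(df)
```

Any function taking (current, previous) can be passed in place of bit_difference. The Python-level loop is slower than vectorized operations, so for large captures it may be worth converting the bit strings to integers and vectorizing the comparison.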