Tags: pandas, loops, time-series, sequential

How can I identify a change in a specific column based on the unique identifiers in another column?


Here's my dataframe:

date        sec_ID  code
2015-11-16  abc      0
2015-11-17  abc      0
2015-12-18  abc      1
2016-01-16  def      0
2016-02-16  def      0
2015-11-16  xyz      0
2016-11-17  xyz     -1
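
For anyone who wants to reproduce this, the sample frame can be built as follows (dates are kept as strings here for brevity; in practice they would likely be parsed with pd.to_datetime):

    import pandas as pd

    df = pd.DataFrame({
        'date': ['2015-11-16', '2015-11-17', '2015-12-18', '2016-01-16',
                 '2016-02-16', '2015-11-16', '2016-11-17'],
        'sec_ID': ['abc', 'abc', 'abc', 'def', 'def', 'xyz', 'xyz'],
        'code': [0, 0, 1, 0, 0, 0, -1],
    })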

The dataframe has nearly 4 million rows covering 3,000 securities (3,000 unique sec_IDs). The data is a time series: there is one code per day per security, stored in a separate column.

What I want is to iterate through each sec_ID and detect the first date on which the code changed for that sec_ID, retaining just that row; if the code is 0 throughout for a security, retain its first date instead. It is important that the iteration resets for each sec_ID.

So the desired output would be a dataset of 3,000 rows, one per security, with 1, 0, or -1 in each case, as shown below:

date        sec_ID  code
2015-12-18  abc      1
2016-01-16  def      0
2016-11-17  xyz     -1

I have already sorted the data by security ID and date in ascending order.
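
For reference, that ordering corresponds to a sort along both columns, something like:

    df = df.sort_values(['sec_ID', 'date'])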

So far I have tried:

data['diff'] = data['code'].diff()
data.dropna(inplace=True) # did this because the first row was understandably a NaN. No other NaNs anywhere in the dataset.
data_first_change = data.loc[data['diff'] != 0]
data_first_change = data_first_change.groupby('sec_ID').nth(0).reset_index()

The above doesn't yield the desired output because the diff doesn't reset for each sec_ID: it runs over the whole column, so a change is detected wherever one security ends and the next begins.
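
To see why, compare a plain diff with a per-group diff on the sample frame built above: the plain version carries the last abc value into the first def row, manufacturing a spurious change at every security boundary.

    # plain diff crosses group boundaries: the first 'def' row gets
    # 0 - 1 = -1, a spurious "change" inherited from the last 'abc' row
    print(df['code'].diff())

    # grouped diff restarts (with NaN) for each sec_ID, so boundaries stay clean
    print(df.groupby('sec_ID')['code'].diff())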

Any thoughts please?


Solution

  • Use DataFrameGroupBy.diff to take the difference per group, replace each group's first (missing) value with 0, and compare for inequality with 0; then flag each group's first row via Series.duplicated, join the two selections with concat plus boolean indexing, and remove duplicates with DataFrame.drop_duplicates:

    # rows where the code changed within a security
    mask1 = df.groupby('sec_ID')['code'].diff().fillna(0).ne(0)
    # True for every row after a security's first appearance
    mask2 = df['sec_ID'].duplicated()
    # change rows come first, so drop_duplicates prefers them over first rows
    df = pd.concat([df[mask1], df[~mask2]]).drop_duplicates(['sec_ID']).sort_index()
    print(df)
             date sec_ID  code
    2  2015-12-18    abc     1
    3  2016-01-16    def     0
    6  2016-11-17    xyz    -1
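
If readability matters more than speed, the same rule can also be written as an explicit per-group function. This is just a sketch; on ~4 million rows the vectorized masks above will be considerably faster than groupby.apply (and recent pandas versions may warn about applying to the grouping column, though the result is the same):

    def first_change(g):
        # rows where the code differs from the previous row within this security
        changed = g['code'].diff().fillna(0).ne(0)
        # first change if there is one, otherwise the group's first row
        return g[changed].iloc[0] if changed.any() else g.iloc[0]

    out = df.groupby('sec_ID').apply(first_change).reset_index(drop=True)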