Here's my dataframe:
date | sec_ID | code |
---|---|---|
2015-11-16 | abc | 0 |
2015-11-17 | abc | 0 |
2015-12-18 | abc | 1 |
2016-01-16 | def | 0 |
2016-02-16 | def | 0 |
2015-11-16 | xyz | 0 |
2016-11-17 | xyz | -1 |
The dataframe consists of nearly 4 million rows for 3000 securities (3000 unique sec_IDs). It is time-series data: there is one code per day per security, held in a separate column.
What I want is to go through each sec_ID and retain only the row for the first date on which the code changed for that sec_ID, or the first date for that security if the code is 0 throughout. It is important that the iteration resets for each sec_ID.
So the desired output would be a dataset of 3000 rows, each with a code of 1, 0 or -1, as shown below:
date | sec_ID | code |
---|---|---|
2015-12-18 | abc | 1 |
2016-01-16 | def | 0 |
2016-11-17 | xyz | -1 |
I have already sorted the data by security ID and date in ascending order.
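(For reference, a sort along these lines, assuming the column names shown above:)

data = data.sort_values(['sec_ID', 'date'])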
So far I have tried:
data['diff'] = data['code'].diff()
data.dropna(inplace=True) # did this because the first row was understandably a NaN. No other NaNs anywhere in the dataset.
data_first_change = data.loc[data['diff'] != 0]
data_first_change = data_first_change.groupby('sec_ID').nth(0).reset_index()
The above doesn't yield the desired output, because the diff doesn't reset for each sec_ID.
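To illustrate with the sample rows above (a minimal sketch that rebuilds them), a plain diff() runs straight across sec_ID boundaries, so the first def row is compared against the last abc row:

import pandas as pd

data = pd.DataFrame({
    'date': pd.to_datetime(['2015-11-16', '2015-11-17', '2015-12-18',
                            '2016-01-16', '2016-02-16', '2015-11-16', '2016-11-17']),
    'sec_ID': ['abc', 'abc', 'abc', 'def', 'def', 'xyz', 'xyz'],
    'code': [0, 0, 1, 0, 0, 0, -1],
})

# diff() ignores group boundaries: the first 'def' row gets 0 - 1 = -1,
# which looks like a change even though nothing changed within 'def'
print(data['code'].diff())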
Any thoughts please?
Use DataFrameGroupBy.diff for the difference per group, replace the first missing value in each group and compare for not equal to 0; then get the first row of each group with Series.duplicated, join both selections together with concat and boolean indexing, and remove duplicates with DataFrame.drop_duplicates:
import pandas as pd

# rows where the code changed within a group (the first NaN per group means no change)
mask1 = df.groupby('sec_ID')['code'].diff().fillna(0).ne(0)
# every row except the first of each sec_ID
mask2 = df['sec_ID'].duplicated()
# change rows come first in the concat, so they win over each group's first row
df = pd.concat([df[mask1], df[~mask2]]).drop_duplicates(['sec_ID']).sort_index()
print(df)
date sec_ID code
2 2015-12-18 abc 1
3 2016-01-16 def 0
6 2016-11-17 xyz -1
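A compact alternative (a sketch, not part of the solution above, applied to the original sorted frame before the reassignment): idxmax on a boolean Series per group returns the label of the first True, or the group's first label when no value is True, so the fallback to each group's first row comes for free.

# flag within-group changes; the NaN at the start of each group counts as no change
changed = df.groupby('sec_ID')['code'].diff().fillna(0).ne(0)

# first True per group, or the group's first label if the group is all False
first_idx = changed.groupby(df['sec_ID']).idxmax()

print(df.loc[first_idx].sort_index())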