I have some data like shown in the left side of the image with columns like group, date, open, close, exit.
What I want to do is generate 2 more columns (like shown on the right) with date_open and date_close.
The logic is that for each group, the date_open should be the first entry in the data (for that group) and, for date_close it should be the last entry only when exit is 1 else there is no exit date (0 or NaN).
I tried it using for loop and index and enumerate but could not acheive the output. Is there any way to get the desired output?
group date open close exit
A Jan-22 2 0 0
A Feb-22 2 0 0
A Mar-22 2 0 0
A Apr-22 1 1 0
A May-22 1 1 0
A Jun-22 0 2 1
B Mar-23 3 0 0
B Apr-23 3 0 0
B May-23 3 0 0
B Jun-23 3 1 0
C Sep-21 1 0 0
C Oct-21 1 0 0
C Nov-21 1 0 0
C Dec-21 1 0 0
C Jan-22 0 1 1
In [29]: df = pd.read_csv('data.tsv', delim_whitespace=' ')
In [30]: df
Out[30]:
group date open close exit
0 A Jan-22 2 0 0
1 A Feb-22 2 0 0
2 A Mar-22 2 0 0
3 A Apr-22 1 1 0
4 A May-22 1 1 0
5 A Jun-22 0 2 1
6 B Mar-23 3 0 0
7 B Apr-23 3 0 0
8 B May-23 3 0 0
9 B Jun-23 3 1 0
10 C Sep-21 1 0 0
11 C Oct-21 1 0 0
12 C Nov-21 1 0 0
13 C Dec-21 1 0 0
14 C Jan-22 0 1 1
In [31]: df['date'] = df.date.apply(lambda s: dt.datetime.strptime(s, "%b-%y"))
In [32]: df
Out[32]:
group date open close exit
0 A 2022-01-01 2 0 0
1 A 2022-02-01 2 0 0
2 A 2022-03-01 2 0 0
3 A 2022-04-01 1 1 0
4 A 2022-05-01 1 1 0
5 A 2022-06-01 0 2 1
6 B 2023-03-01 3 0 0
7 B 2023-04-01 3 0 0
8 B 2023-05-01 3 0 0
9 B 2023-06-01 3 1 0
10 C 2021-09-01 1 0 0
11 C 2021-10-01 1 0 0
12 C 2021-11-01 1 0 0
13 C 2021-12-01 1 0 0
14 C 2022-01-01 0 1 1
In [33]: opens = df[['group', 'date']].groupby(by='group').min()
In [34]: opens
Out[34]:
date
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [35]: opens.rename(columns={'date': 'date_open'})
Out[35]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [36]: opens = opens.rename(columns={'date': 'date_open'})
In [37]: opens
Out[37]:
date_open
group
A 2022-01-01
B 2023-03-01
C 2021-09-01
In [38]: closes = df[df.exit==1][['group', 'date']].groupby(by='group').max()
In [39]: closes = closes.rename(columns={'date': 'date_close'})
In [40]: closes
Out[40]:
date_close
group
A 2022-06-01
C 2022-01-01
In [41]: opens.join(df.set_index('group')).join(closes)
Out[41]:
date_open date open close exit date_close
group
A 2022-01-01 2022-01-01 2 0 0 2022-06-01
A 2022-01-01 2022-02-01 2 0 0 2022-06-01
A 2022-01-01 2022-03-01 2 0 0 2022-06-01
A 2022-01-01 2022-04-01 1 1 0 2022-06-01
A 2022-01-01 2022-05-01 1 1 0 2022-06-01
A 2022-01-01 2022-06-01 0 2 1 2022-06-01
B 2023-03-01 2023-03-01 3 0 0 NaT
B 2023-03-01 2023-04-01 3 0 0 NaT
B 2023-03-01 2023-05-01 3 0 0 NaT
B 2023-03-01 2023-06-01 3 1 0 NaT
C 2021-09-01 2021-09-01 1 0 0 2022-01-01
C 2021-09-01 2021-10-01 1 0 0 2022-01-01
C 2021-09-01 2021-11-01 1 0 0 2022-01-01
C 2021-09-01 2021-12-01 1 0 0 2022-01-01
C 2021-09-01 2022-01-01 0 1 1 2022-01-01