Search code examples
pythonpython-3.xpandasdataframedata-analysis

Python - generate columns with values based on a condition


I have some data like shown in the left side of the image with columns like group, date, open, close, exit.

What I want to do is generate 2 more columns (like shown on the right) with date_open and date_close.

The logic is that for each group, the date_open should be the first entry in the data (for that group) and, for date_close it should be the last entry only when exit is 1 else there is no exit date (0 or NaN).

I tried it using for loop and index and enumerate but could not acheive the output. Is there any way to get the desired output?

enter image description here

group   date    open    close   exit
A      Jan-22    2       0       0 
A      Feb-22    2       0       0
A      Mar-22    2       0       0
A      Apr-22    1       1       0
A      May-22    1       1       0
A      Jun-22    0       2       1
B      Mar-23    3       0       0
B      Apr-23    3       0       0
B      May-23    3       0       0
B      Jun-23    3       1       0
C      Sep-21    1       0       0
C      Oct-21    1       0       0
C      Nov-21    1       0       0
C      Dec-21    1       0       0
C      Jan-22    0       1       1


Solution

  • In [29]: df = pd.read_csv('data.tsv', delim_whitespace=' ')
    
    In [30]: df
    Out[30]:
       group    date  open  close  exit
    0      A  Jan-22     2      0     0
    1      A  Feb-22     2      0     0
    2      A  Mar-22     2      0     0
    3      A  Apr-22     1      1     0
    4      A  May-22     1      1     0
    5      A  Jun-22     0      2     1
    6      B  Mar-23     3      0     0
    7      B  Apr-23     3      0     0
    8      B  May-23     3      0     0
    9      B  Jun-23     3      1     0
    10     C  Sep-21     1      0     0
    11     C  Oct-21     1      0     0
    12     C  Nov-21     1      0     0
    13     C  Dec-21     1      0     0
    14     C  Jan-22     0      1     1
    
    In [31]: df['date'] = df.date.apply(lambda s: dt.datetime.strptime(s, "%b-%y"))
    
    In [32]: df
    Out[32]:
       group       date  open  close  exit
    0      A 2022-01-01     2      0     0
    1      A 2022-02-01     2      0     0
    2      A 2022-03-01     2      0     0
    3      A 2022-04-01     1      1     0
    4      A 2022-05-01     1      1     0
    5      A 2022-06-01     0      2     1
    6      B 2023-03-01     3      0     0
    7      B 2023-04-01     3      0     0
    8      B 2023-05-01     3      0     0
    9      B 2023-06-01     3      1     0
    10     C 2021-09-01     1      0     0
    11     C 2021-10-01     1      0     0
    12     C 2021-11-01     1      0     0
    13     C 2021-12-01     1      0     0
    14     C 2022-01-01     0      1     1
    
    In [33]: opens = df[['group', 'date']].groupby(by='group').min()
    
    In [34]: opens
    Out[34]:
                date
    group
    A     2022-01-01
    B     2023-03-01
    C     2021-09-01
    
    In [35]: opens.rename(columns={'date': 'date_open'})
    Out[35]:
           date_open
    group
    A     2022-01-01
    B     2023-03-01
    C     2021-09-01
    
    In [36]: opens = opens.rename(columns={'date': 'date_open'})
    
    In [37]: opens
    Out[37]:
           date_open
    group
    A     2022-01-01
    B     2023-03-01
    C     2021-09-01
    
    In [38]: closes = df[df.exit==1][['group', 'date']].groupby(by='group').max()
    
    In [39]: closes = closes.rename(columns={'date': 'date_close'})
    
    In [40]: closes
    Out[40]:
          date_close
    group
    A     2022-06-01
    C     2022-01-01
    
    In [41]: opens.join(df.set_index('group')).join(closes)
    Out[41]:
           date_open       date  open  close  exit date_close
    group
    A     2022-01-01 2022-01-01     2      0     0 2022-06-01
    A     2022-01-01 2022-02-01     2      0     0 2022-06-01
    A     2022-01-01 2022-03-01     2      0     0 2022-06-01
    A     2022-01-01 2022-04-01     1      1     0 2022-06-01
    A     2022-01-01 2022-05-01     1      1     0 2022-06-01
    A     2022-01-01 2022-06-01     0      2     1 2022-06-01
    B     2023-03-01 2023-03-01     3      0     0        NaT
    B     2023-03-01 2023-04-01     3      0     0        NaT
    B     2023-03-01 2023-05-01     3      0     0        NaT
    B     2023-03-01 2023-06-01     3      1     0        NaT
    C     2021-09-01 2021-09-01     1      0     0 2022-01-01
    C     2021-09-01 2021-10-01     1      0     0 2022-01-01
    C     2021-09-01 2021-11-01     1      0     0 2022-01-01
    C     2021-09-01 2021-12-01     1      0     0 2022-01-01
    C     2021-09-01 2022-01-01     0      1     1 2022-01-01