Search code examples
pythonpandasdataframedata-analysis

How to create a column in a pandas dataframe that verifies if a state transition has occured?


The following is the original dataframe:

 uid    timestamp      state
   1     2015-01-01      fail  
   2     2015-01-07      fail  
   2     2015-03-02      fail  
   1     2015-01-03      pass  
   1     2015-01-02      warn  
   2     2015-03-01      pass  
   1     2015-01-04      pass  
   1     2015-01-07      pass  
   2     2015-01-01      warn  

This is the resulting dataframe that I would like to generate:

 uid     timestamp      state  fail->pass?
   1     2015-01-01      fail  True
   2     2015-01-07      pass  False
   2     2015-03-02      fail  False
   1     2015-01-03      pass  True
   1     2015-01-02      warn  True
   2     2015-03-01      pass  False
   1     2015-01-04      pass  True
   1     2015-01-07      pass  True
   2     2015-01-01      warn  False

The "Fail->Pass?" column is a boolean column that tells you if UID went from a failing state to a passing state. This passing state must be the final state of the UID. The falling state can occur at anytime before the final state. The final state occurs at the latest timestamp for that UID.

What's the most efficient way to create this column? There could potentially be hundreds of state transitions for each UID.


Solution

  • df = pd.DataFrame({'uid': [1, 2, 2, 1, 1, 2, 1, 1, 2],
     'timestamp': ['2015-01-01',
      '2015-01-07',
      '2015-03-02',
      '2015-01-03',
      '2015-01-02',
      '2015-03-01',
      '2015-01-04',
      '2015-01-07',
      '2015-01-01'],
     'state': ['fail',
      'pass',
      'fail',
      'pass',
      'warn',
      'pass',
      'pass',
      'pass',
      'warn'],
     'fail->pass?': [True, False, False, True, True, False, True, True, False]})
    
    
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.sort_values(by='timestamp')
    
    fp = (df[['uid','state']].groupby('uid').last()=='pass').reset_index()
    fp.columns = ['uid','fail->pass?']
    
    df.merge(fp, on='uid').sort_values(by='timestamp')
    

    Output

       uid  timestamp   state   fail->pass?
    0   1   2015-01-01  fail    True
    5   2   2015-01-01  warn    False
    1   1   2015-01-02  warn    True
    2   1   2015-01-03  pass    True
    3   1   2015-01-04  pass    True
    4   1   2015-01-07  pass    True
    6   2   2015-01-07  fail    False
    7   2   2015-03-01  pass    False
    8   2   2015-03-02  fail    False