Search code examples
pythonpandastime-seriesdata-sciencedata-analysis

Python Data Grouping and compare on the basis of date


I have a set of data where I want to do, If the latest st_1 or st_2 are greater than earlier st_1 or st_2 put True or False respectively in another column. How can I do that on the basis of date and id?

id  date                        st_1    st_2
1   2022-02-28 00:00:00+00:00   60.0    6.0
2   2021-10-31 00:00:00+00:00   70.0    0.0
2   2021-12-31 00:00:00+00:00   70.0    4.0
3   2021-10-31 00:00:00+00:00   60.0    0.0
4   2021-06-30 00:00:00+00:00   63.3    2.66
4   2021-08-31 00:00:00+00:00   60.0    3.0
4   2022-02-28 00:00:00+00:00   70.0    2.0
5   2021-06-30 00:00:00+00:00   70.0    3.0
4   2022-02-28 00:00:00+00:00   70.0    2.0
5   2021-06-30 00:00:00+00:00   70.0    3.0
5   2021-08-31 00:00:00+00:00   80.0    2.0
5   2021-10-31 00:00:00+00:00   70.0    3.5

My expected outcome:

id  date                        st_1    st_2  outcome
1   2022-02-28 00:00:00+00:00   60.0    6.0   false
2   2021-10-31 00:00:00+00:00   70.0    0.0   false
2   2021-12-31 00:00:00+00:00   70.0    4.0   true
3   2021-10-31 00:00:00+00:00   60.0    0.0   false
4   2021-06-30 00:00:00+00:00   63.3    2.66  false
4   2021-08-31 00:00:00+00:00   60.0    3.0   true
4   2022-02-28 00:00:00+00:00   70.0    2.0   true
5   2021-06-30 00:00:00+00:00   70.0    3.0   false 
5   2021-08-31 00:00:00+00:00   80.0    2.0   true
5   2021-10-31 00:00:00+00:00   70.0    3.5   true

Solution

  • Update #2: I fixed the sorting to sort first by id and then by date and added column lag_id which is now used to ensure that comparisons are made only within the same id

    Update: I just noticed the the spec is "If the latest st_1 or st_2 are greater than earlier st_1 or st_2" which means that the correct answer is to use "|" instead of the original answer's "&". Corrected.

    Code:

    import io
    import pandas as pd
    string = """id  date st_1 st_2
    1  "2022-02-28 00:00:00+00:00"  60.0    6.0
    2  "2021-10-31 00:00:00+00:00"  70.0    0.0
    2  "2021-12-31 00:00:00+00:00"  70.0    4.0
    3  "2021-10-31 00:00:00+00:00"  60.0    0.0
    4  "2021-06-30 00:00:00+00:00"  63.3    2.66
    4  "2021-08-31 00:00:00+00:00"  60.0    3.0
    4  "2022-02-28 00:00:00+00:00"  70.0    2.0
    5  "2021-06-30 00:00:00+00:00"  70.0    3.0
    4  "2022-02-28 00:00:00+00:00"  70.0    2.0
    5  "2021-06-30 00:00:00+00:00"  70.0    3.0
    5  "2021-08-31 00:00:00+00:00"  80.0    2.0
    5  "2021-10-31 00:00:00+00:00"  70.0    3.5
    """
    data = io.StringIO(string)
    df = pd.read_csv(data, sep="\s+")  # Load df0 from the data string
    df.sort_values(['id', 'date'], inplace=True)  # Sort according to the spec
    print(df)
    
    df['lag_id'] = df['id'].shift(1) # Lag the id column
    
    df['lag_st_1'] = df['st_1'].shift(1)  # Create column lag_st_1 with the st_1 data lagged by 1 row
    df['lag_st_2'] = df['st_2'].shift(1)  # Ditto for st_2
    print(df)
    
    # Create result column with True values where the right conditions are met
    df.loc[(df['id'] == df['lag_id'])
            & (
                  (df['st_1'] > df['lag_st_1'])
                | (df['st_2'] > df['lag_st_2'])
              ), 'result'] = True
    
    # The previous operation fills the rest of the rows with NAs.
    # Here we change the NAs to "False"
    df['result'] = df['result'].fillna(False)
    print(df)
    

    Updated output:

        id                       date  st_1  st_2
    0    1  2022-02-28 00:00:00+00:00  60.0  6.00
    1    2  2021-10-31 00:00:00+00:00  70.0  0.00
    2    2  2021-12-31 00:00:00+00:00  70.0  4.00
    3    3  2021-10-31 00:00:00+00:00  60.0  0.00
    4    4  2021-06-30 00:00:00+00:00  63.3  2.66
    5    4  2021-08-31 00:00:00+00:00  60.0  3.00
    6    4  2022-02-28 00:00:00+00:00  70.0  2.00
    8    4  2022-02-28 00:00:00+00:00  70.0  2.00
    7    5  2021-06-30 00:00:00+00:00  70.0  3.00
    9    5  2021-06-30 00:00:00+00:00  70.0  3.00
    10   5  2021-08-31 00:00:00+00:00  80.0  2.00
    11   5  2021-10-31 00:00:00+00:00  70.0  3.50
        id                       date  st_1  st_2  lag_id  lag_st_1  lag_st_2
    0    1  2022-02-28 00:00:00+00:00  60.0  6.00     NaN       NaN       NaN
    1    2  2021-10-31 00:00:00+00:00  70.0  0.00     1.0      60.0      6.00
    2    2  2021-12-31 00:00:00+00:00  70.0  4.00     2.0      70.0      0.00
    3    3  2021-10-31 00:00:00+00:00  60.0  0.00     2.0      70.0      4.00
    4    4  2021-06-30 00:00:00+00:00  63.3  2.66     3.0      60.0      0.00
    5    4  2021-08-31 00:00:00+00:00  60.0  3.00     4.0      63.3      2.66
    6    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      60.0      3.00
    8    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      70.0      2.00
    7    5  2021-06-30 00:00:00+00:00  70.0  3.00     4.0      70.0      2.00
    9    5  2021-06-30 00:00:00+00:00  70.0  3.00     5.0      70.0      3.00
    10   5  2021-08-31 00:00:00+00:00  80.0  2.00     5.0      70.0      3.00
    11   5  2021-10-31 00:00:00+00:00  70.0  3.50     5.0      80.0      2.00
        id                       date  st_1  st_2  lag_id  lag_st_1  lag_st_2  result
    0    1  2022-02-28 00:00:00+00:00  60.0  6.00     NaN       NaN       NaN   False
    1    2  2021-10-31 00:00:00+00:00  70.0  0.00     1.0      60.0      6.00   False
    2    2  2021-12-31 00:00:00+00:00  70.0  4.00     2.0      70.0      0.00    True
    3    3  2021-10-31 00:00:00+00:00  60.0  0.00     2.0      70.0      4.00   False
    4    4  2021-06-30 00:00:00+00:00  63.3  2.66     3.0      60.0      0.00   False
    5    4  2021-08-31 00:00:00+00:00  60.0  3.00     4.0      63.3      2.66    True
    6    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      60.0      3.00    True
    8    4  2022-02-28 00:00:00+00:00  70.0  2.00     4.0      70.0      2.00   False
    7    5  2021-06-30 00:00:00+00:00  70.0  3.00     4.0      70.0      2.00   False
    9    5  2021-06-30 00:00:00+00:00  70.0  3.00     5.0      70.0      3.00   False
    10   5  2021-08-31 00:00:00+00:00  80.0  2.00     5.0      70.0      3.00    True
    11   5  2021-10-31 00:00:00+00:00  70.0  3.50     5.0      80.0      2.00    True