Search code examples
pythonpandaspysparkdataset

Pandas or pyspark span column creation


I've sample data in the form :

week    year    flag_1  flag_2
26      2022    0        0
27      2022    1        0
28      2022    0        0
2       2023    0        1
3       2023    1        0
4       2023    0        0
5       2023    1        1
6       2023    0        1
7       2023    0        0
8       2023    0        0
9       2023    0        0
10      2023    0        1
11      2023    0        1

I want to create two new columns span_flag_1, span_flag_2.

span_flag_1 will contain 1 if flag_1 == 1, and for all the other rows, span_flag_1 will be max(num weeks since last flag_1 = 1, num weeks till next flag_1 = 1)

How can i do the same?


Solution

  • You can achieve this this way with pandas

    import pandas as pd
    
    data = {
        "week": [26, 27, 28, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        "year": [2022, 2022, 2022, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023],
        "flag_1": [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
        "flag_2": [0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1]
    }
    
    df = pd.DataFrame(data)
    
    df['overall_week'] = df['year'] * 52 + df['week']
    
    df['since_last_flag_1'] = df[df['flag_1'] == 1]['overall_week']
    df['since_last_flag_1'] = df['since_last_flag_1'].fillna(method='ffill')
    df['since_last_flag_1'] = df['overall_week'] - df['since_last_flag_1']
    
    df['till_next_flag_1'] = df[df['flag_1'] == 1]['overall_week']
    df['till_next_flag_1'] = df['till_next_flag_1'].fillna(method='bfill')
    df['till_next_flag_1'] = df['till_next_flag_1'] - df['overall_week']
    
    df['span_flag_1'] = df[['since_last_flag_1', 'till_next_flag_1']].max(axis=1)
    df.loc[df['flag_1'] == 1, 'span_flag_1'] = 1
    
    df.drop(columns=['overall_week', 'since_last_flag_1', 'till_next_flag_1'], inplace=True)
    
    print(df)
    
    

    which gives

       week  year  flag_1  flag_2  span_flag_1
    0     26  2022       0       0          1.0
    1     27  2022       1       0          1.0
    2     28  2022       0       0         27.0
    3      2  2023       0       1         27.0
    4      3  2023       1       0          1.0
    5      4  2023       0       0          1.0
    6      5  2023       1       1          1.0
    7      6  2023       0       1          1.0
    8      7  2023       0       0          2.0
    9      8  2023       0       0          3.0
    10     9  2023       0       0          4.0
    11    10  2023       0       1          5.0
    12    11  2023       0       1          6.0