Search code examples
pythonpandasdummy-variable

Creating dummy variable depending on year and category in pandas


this is my first time posting a question here, so please let me know if my question is lacking anyway.

Let's say I have the following dataframe where "Value" contains only integer 1 or 2. Basically, I want to create a column("Desired") with a dummy variable where 1 happens when firm has remained with Value=1 since the beginning of its appearance. Once the firm has Value=2, the dummy variable should be 0 even if the firm reverts back to Value=1.

Firm_ID Year Value Desired
0000001 2000 1 1
0000001 2001 1 1
0000001 2002 2 0
0000001 2003 2 0
0000001 2004 1 0
0000001 2005 1 0
0000002 2000 2 0
0000002 2001 2 0
0000002 2002 2 0
0000003 2000 1 1
0000003 2001 1 1
0000003 2002 1 1
0000003 2003 1 1
d = {'firm_id': ["0000001" , "0000001","0000001","0000001","0000001","0000001","0000002","0000002","0000002","0000003",
                "0000003","0000003","0000003"], 
     'year': [2000,2001,2002,2003,2004,2005,2000,2001,2002,2000,2001,2002,2003],
    'Value':[1,1,2,2,1,1,2,2,2,1,1,1,1]}
df = pd.DataFrame(data=d)

Currently, the code I am running is the following.

for i in range(df.shape[0]):
    firm = df.loc[i,'firm_id']
    year = df.loc[i,'year']
    temp_df = df[df['firm_id']==firm]
    
    if (temp_df.groupby(['year']).max()[['Value']].max() == 2)[0]: # At some point this firm becomes Value==2
        
        # Get Earliest Year of becoming Value==2
        year_df = temp_df.groupby(['year']).max()[['Value']]
        ch_year = year_df[year_df['Value']==2].index.min() # Year the firm becomes Value==2
        
        if year >= ch_year :
            df.loc[i,'Desired'] = 0
        else : 
            df.loc[i,'Desired'] = 1
    else :# They always remain Value==1
         df.loc[i,'Desired']=1 
            

However, this code is taking too long for the size of my current dataframe. Is there a more efficient way of code that I could use?


Solution

  • # df = df.sort_values(["firm_id", "year"])
    df.Value.ne(1).groupby(df.firm_id).cumsum().lt(1).astype(int)
    #     Value
    # 0    1
    # 1    1
    # 2    0
    # 3    0
    # 4    0
    # 5    0
    # 6    0
    # 7    0
    # 8    0
    # 9    1
    # 10   1
    # 11   1
    # 12   1