this is my first time posting a question here, so please let me know if my question is lacking anyway.
Let's say I have the following dataframe where "Value" contains only integer 1 or 2. Basically, I want to create a column("Desired") with a dummy variable where 1 happens when firm has remained with Value=1 since the beginning of its appearance. Once the firm has Value=2, the dummy variable should be 0 even if the firm reverts back to Value=1.
Firm_ID | Year | Value | Desired |
---|---|---|---|
0000001 | 2000 | 1 | 1 |
0000001 | 2001 | 1 | 1 |
0000001 | 2002 | 2 | 0 |
0000001 | 2003 | 2 | 0 |
0000001 | 2004 | 1 | 0 |
0000001 | 2005 | 1 | 0 |
0000002 | 2000 | 2 | 0 |
0000002 | 2001 | 2 | 0 |
0000002 | 2002 | 2 | 0 |
0000003 | 2000 | 1 | 1 |
0000003 | 2001 | 1 | 1 |
0000003 | 2002 | 1 | 1 |
0000003 | 2003 | 1 | 1 |
d = {'firm_id': ["0000001" , "0000001","0000001","0000001","0000001","0000001","0000002","0000002","0000002","0000003",
"0000003","0000003","0000003"],
'year': [2000,2001,2002,2003,2004,2005,2000,2001,2002,2000,2001,2002,2003],
'Value':[1,1,2,2,1,1,2,2,2,1,1,1,1]}
df = pd.DataFrame(data=d)
Currently, the code I am running is the following.
for i in range(df.shape[0]):
firm = df.loc[i,'firm_id']
year = df.loc[i,'year']
temp_df = df[df['firm_id']==firm]
if (temp_df.groupby(['year']).max()[['Value']].max() == 2)[0]: # At some point this firm becomes Value==2
# Get Earliest Year of becoming Value==2
year_df = temp_df.groupby(['year']).max()[['Value']]
ch_year = year_df[year_df['Value']==2].index.min() # Year the firm becomes Value==2
if year >= ch_year :
df.loc[i,'Desired'] = 0
else :
df.loc[i,'Desired'] = 1
else :# They always remain Value==1
df.loc[i,'Desired']=1
However, this code is taking too long for the size of my current dataframe. Is there a more efficient way of code that I could use?
# df = df.sort_values(["firm_id", "year"])
df.Value.ne(1).groupby(df.firm_id).cumsum().lt(1).astype(int)
# Value
# 0 1
# 1 1
# 2 0
# 3 0
# 4 0
# 5 0
# 6 0
# 7 0
# 8 0
# 9 1
# 10 1
# 11 1
# 12 1