Hello, I'm building a classification model with the stages of a disease as a categorical variable. Here's an example of the value counts:
(Note: NX denotes unmeasured)
I'm turning the stages into dummy variables so that the current stage, as well as every stage already passed, is set to 1.
My question is whether the code I've written for this can be improved. First, I set each column's values with functions:
def N1(row):
    if row['N'] == 'N1':
        return 1
    if row['N'] == 'N2':
        return 1
    if row['N'] == 'N3':
        return 1
    else:
        return 0

def N2(row):
    if row['N'] == 'N2':
        return 1
    if row['N'] == 'N3':
        return 1
    else:
        return 0

def N3(row):
    if row['N'] == 'N3':
        return 1
    else:
        return 0

def NX(row):
    if row['N'] == 'NX':
        return 1
    else:
        return 0
Then using these functions with:
df['N1'] = df.apply(N1, axis=1)
df['N2'] = df.apply(N2, axis=1)
df['N3'] = df.apply(N3, axis=1)
df['NX'] = df.apply(NX, axis=1)
An Example Final Outcome:
Any input on how this process might take less code is appreciated! Thank you.
Create dummies for every category in the column and drop 'N0', since you don't need an indicator for it. Then apply your hierarchy: wherever a higher stage is 1, set the lower stages to 1 as well.
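import pandas as pd

df = pd.DataFrame({'N': ['N0', 'N1', 'NX', 'N2', 'N3']})
# One indicator column per stage; drop the N0 baseline.
# dtype=int gives 0/1 (recent pandas versions default to booleans).
df = pd.concat([df, pd.get_dummies(df['N'], dtype=int).drop(columns='N0')], axis=1)
# Listed from the highest stage down, so each stage is added into the one below it.
hierarchy = ['N3', 'N2', 'N1']
for i in range(len(hierarchy)-1):
    df[hierarchy[i+1]] += df[hierarchy[i]]
print(df)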
    N  N1  N2  N3  NX
0  N0   0   0   0   0
1  N1   1   0   0   0
2  NX   0   0   0   1
3  N2   1   1   0   0
4  N3   1   1   1   0
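If you'd rather skip the dummies altogether, another option is to map each stage to a number and compare against a threshold. This is only a sketch, not the approach above: the `rank` name and the stage-to-number mapping are my own assumptions, and it relies on 'NX' being left out of the mapping so it becomes NaN and fails every `>=` comparison.

```python
import pandas as pd

df = pd.DataFrame({'N': ['N0', 'N1', 'NX', 'N2', 'N3']})

# Hypothetical ordinal mapping; 'NX' is deliberately omitted, so it maps to NaN.
rank = df['N'].map({'N0': 0, 'N1': 1, 'N2': 2, 'N3': 3})

# A stage column is 1 when the row's stage is at or beyond that stage.
# Comparisons against NaN are False, so NX rows get 0 in every stage column.
for k in (1, 2, 3):
    df[f'N{k}'] = (rank >= k).astype(int)

# NX stays a plain indicator, outside the hierarchy.
df['NX'] = (df['N'] == 'NX').astype(int)
```

On the sample frame this should give the same columns as the output above, and adding a further stage only requires extending the mapping and the loop.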