I have a features DataFrame that (let us say) looks like this:
Symptom A | Symptom B |
---|---|
Itching | Rash |
Rash | Itching |
When I run the get_dummies function on this DataFrame, it creates four columns named 'Symptom_A_Itching', 'Symptom_A_Rash', 'Symptom_B_Rash', and 'Symptom_B_Itching'. I don't want the values in the two columns to be treated separately, as this function does.
Is there any way to one-hot encode this DataFrame so that the values of both columns are not treated separately?
Basically, I want to get a DataFrame with the columns 'Symptom_Itching' and 'Symptom_Rash'.
I tried using the columns and prefix arguments of the get_dummies function, but that did not produce the result I wanted. I also tried renaming all the Symptom columns to just 'Symptom' instead of 'Symptom_A', 'Symptom_B', but that didn't work either.
This is the code I have:
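from pandas import DataFrame, get_dummies, read_csv

data_frame: DataFrame = read_csv('dataset.csv')
features: DataFrame = data_frame.iloc[:, 1:]
# fillna returns a new DataFrame, so the result must be assigned back
features = features.fillna('')
x: DataFrame = get_dummies(features)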
You can use stack, then get_dummies and groupby.max():
out = (df
   .stack().str.get_dummies()
   .groupby(level=0).max()
)
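As a self-contained sketch of the approach above, here it is applied to the sample data from the question. The column names 'Symptom_A' and 'Symptom_B' are assumed from the dummy column names quoted in the question; adjust them to your own data.

```python
import pandas as pd

# Sample data mirroring the table in the question
# (column names are an assumption for illustration)
df = pd.DataFrame({
    "Symptom_A": ["Itching", "Rash"],
    "Symptom_B": ["Rash", "Itching"],
})

out = (
    df.stack()                 # long Series indexed by (row, original column)
      .str.get_dummies()       # one indicator column per distinct value
      .groupby(level=0).max()  # collapse back to one row per original row
)
print(out)
```

Because the values are stacked into a single Series before encoding, the original column a value came from no longer affects the resulting column names.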
Or use a trick to give all output columns the same name, then groupby.max() on axis=1:
out = (pd.get_dummies(df.rename(columns=lambda x: ''), prefix_sep='')
         .groupby(level=0, axis=1).max()
      )
Output:
   Itching  Rash
0        1     1
1        1     1
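One caveat on the second variant: as far as I know, the axis argument of DataFrame.groupby is deprecated in recent pandas releases (2.1 and later). A possible workaround, sketched here on the question's sample data with assumed column names, is to transpose, group on the index, and transpose back:

```python
import pandas as pd

# Sample data mirroring the table in the question
# (column names are an assumption for illustration)
df = pd.DataFrame({
    "Symptom_A": ["Itching", "Rash"],
    "Symptom_B": ["Rash", "Itching"],
})

# Blank out the column names so the dummy columns are named after the values only;
# this yields duplicate column labels such as 'Itching', 'Rash', 'Itching', 'Rash'
dummies = pd.get_dummies(df.rename(columns=lambda x: ''), prefix_sep='')

out = (
    dummies.astype(int)        # get_dummies may return booleans; normalise to 0/1
           .T                  # duplicate column labels become duplicate index labels
           .groupby(level=0)   # group rows that share the same label
           .max()              # 1 if the value appeared in any original column
           .T                  # back to one row per original row
)
print(out)
```

The transpose is only there to avoid grouping along axis=1; the result should match the stack-based version.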