This is a common question, but I can't find an answer that fits this particular scenario.
I have a DataFrame
with a column for genres, e.g. "Drama, Western", and one-hot encoded versions of the genres. For a "Drama, Western" row there is a 1 in both columns, but where the genre is just Western there is a 1 in that column and a 0 for Drama.
I want a filtered DataFrame containing only the rows whose genre is Western and nothing else. I'm trying to oversample for a model because Western is a minority class, but I don't want to increase the other genre counts as a byproduct.
There are multiple such rows, so I can't select them by index, and there are multiple genres, so I can't use a condition like df[(df['Western']==1) & (df['Drama']==0)]
without having to account for all 24 genres.
Index | Genre           | Drama | Western | Action | genre 4 |
0     | Drama, Western  | 1     | 1       | 0      | 0       |
1     | Western         | 0     | 1       | 0      | 0       |
3     | Action, Western | 0     | 1       | 1      | 0       |
If I understand your question correctly, you want those rows where only 'Western' is 1, i.e. the genre is only Western, nothing else.
Why do you have to use the encoded columns then? Just use the original 'Genre' column where the data is in string format. No need to overcomplicate things.
new_df = df[df['Genre']=='Western']
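If you would rather stay with the encoded columns, you still don't need to spell out all 24 conditions. Here is a minimal sketch, assuming every column other than 'Genre' is a 0/1 genre flag: a row is Western-only when its Western flag is set and the flags in that row sum to 1. The sample frame below is a trimmed copy of the table in the question (the placeholder "genre 4" column is left out), and the variable names are only illustrative.

```python
import pandas as pd

# Sample data mirroring the table in the question (trimmed; names are illustrative).
df = pd.DataFrame(
    {
        "Genre": ["Drama, Western", "Western", "Action, Western"],
        "Drama": [1, 0, 0],
        "Western": [1, 1, 1],
        "Action": [0, 0, 1],
    },
    index=[0, 1, 3],
)

# Assumption: every column except 'Genre' is a one-hot genre flag.
genre_cols = df.columns.drop("Genre")

# Western-only rows: the Western flag is set and it is the only flag set in the row.
western_only = df[(df["Western"] == 1) & (df[genre_cols].sum(axis=1) == 1)]

# String-based equivalent that tolerates stray whitespace around the genre name.
western_only_str = df[df["Genre"].str.strip() == "Western"]
```

On this sample both filters keep only the row with index 1. If the frame has other non-genre columns (title, year, and so on), list the genre columns explicitly instead of dropping only 'Genre'.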