I basically want to reverse the process posed in this question.
>>> import pandas as pd
>>> example_input = pd.DataFrame({"one" : [0,1,0,1,0],
...                               "two" : [0,0,0,0,0],
...                               "three" : [1,1,1,1,0],
...                               "four" : [1,1,0,0,0]
...                               })
>>> print(example_input)
   one  two  three  four
0    0    0      1     1
1    1    0      1     1
2    0    0      1     0
3    1    0      1     0
4    0    0      0     0
>>> desired_output = pd.DataFrame(["three, four", "one, three, four",
...                                "three", "one, three", ""])
>>> print(desired_output)
                  0
0       three, four
1  one, three, four
2             three
3        one, three
4
There are many questions (examples 1 & 2) about reversing one-hot encoding, but the answers rely on only one binary class being active per row, while my data can have multiple classes active in the same row.
This question comes close to addressing what I need, but its multiple classes are separated on different rows. I need my results to be strings joined by a separator (for example ", "), such that the output has the same number of rows as the input.
Using the ideas found in these two questions (1 & 2), I was able to come up with a solution, but it requires an ordinary Python for loop to iterate through the rows, which I suspect will be slow compared to a solution that uses only pandas.
The input dataframe can use actual Boolean values instead of integer encoding if it makes things easier. The output can be a dataframe or a series; I'm eventually going to add the resulting column to a larger dataframe. I'm also open to using numpy if it allows for a better solution, but otherwise I would prefer to stick with pandas.
With df as your input dataframe (example_input in the question), you can use DataFrame.dot, which is much faster than iterating over all the rows in the dataframe:
df.dot(df.columns + ', ').str.rstrip(', ')
0         three, four
1    one, three, four
2               three
3          one, three
4
dtype: object
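For reference, here is a self-contained sketch of the same idea, run against the example_input from the question and checked against a plain row-by-row loop. The names sep, labels, looped, dotted and from_bools are only illustrative. One deliberate difference from the one-liner above: str.rstrip(', ') strips any trailing run of commas and spaces rather than the exact separator, so this sketch slices off the separator's length instead, which should be safer if a column name happens to end in one of those characters.

```python
import pandas as pd

example_input = pd.DataFrame({
    "one":   [0, 1, 0, 1, 0],
    "two":   [0, 0, 0, 0, 0],
    "three": [1, 1, 1, 1, 0],
    "four":  [1, 1, 0, 0, 0],
})

sep = ", "  # illustrative name for the separator

# Row-by-row baseline: join the names of the columns whose flag is set.
looped = pd.Series(
    [sep.join(col for col, flag in row.items() if flag)
     for _, row in example_input.iterrows()],
    index=example_input.index,
)

# Dot-product version: each 0/1 flag "multiplies" its column label
# (0 * "one, " is "", 1 * "one, " is "one, ") and the pieces are then
# concatenated along each row.
labels = example_input.columns + sep
dotted = example_input.dot(labels).str[:-len(sep)]

# The same call also accepts a Boolean frame, as the question allows.
from_bools = example_input.astype(bool).dot(labels).str[:-len(sep)]

print(dotted)
```

Since the result is a Series sharing the input's index, it can be assigned directly as a new column of a larger dataframe with the same index.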