python pandas numpy dataframe one-hot-encoding

How to convert (Not-One) Hot Encodings to a Column with Multiple Values on the Same Row


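I want to reverse the process posed in this question: given a dataframe of binary indicator columns, produce a single column that lists, for each row, the names of the columns that are active.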

>>> import pandas as pd
>>> example_input = pd.DataFrame({"one"   : [0,1,0,1,0],
...                               "two"   : [0,0,0,0,0],
...                               "three" : [1,1,1,1,0],
...                               "four"  : [1,1,0,0,0]
...                               })
>>> print(example_input)
   one  two  three  four
0    0    0      1     1
1    1    0      1     1
2    0    0      1     0
3    1    0      1     0
4    0    0      0     0
>>> desired_output = pd.DataFrame(["three, four", "one, three, four",
...                                "three", "one, three", ""])
>>> print(desired_output)
                  0
0       three, four
1  one, three, four
2             three
3        one, three
4                  

There are many questions (examples 1 & 2) about reversing one-hot encoding, but the answers rely on only one binary class being active per row, while my data can have multiple classes active in the same row.

This question comes close to addressing what I need, but its multiple classes are separated on different rows. I need my results to be strings joined by a separator (for example ", "), such that the output has the same number of rows as the input.

Using the ideas found in these two questions (1 & 2), I was able to come up with a solution, but it requires an ordinary Python for loop to iterate through the rows, which I suspect will be slow compared to a solution that uses pandas entirely.
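The loop-based solution itself is not shown above, so the following is only a guess at its general shape, not the original code. It walks the rows with itertuples and joins the names of the columns whose flag is set; the variable names columns, joined and result are made up for illustration.

```python
import pandas as pd

example_input = pd.DataFrame({"one"   : [0, 1, 0, 1, 0],
                              "two"   : [0, 0, 0, 0, 0],
                              "three" : [1, 1, 1, 1, 0],
                              "four"  : [1, 1, 0, 0, 0]})

# Assumed shape of the row-by-row approach (hypothetical reconstruction).
columns = list(example_input.columns)
joined = []
for row in example_input.itertuples(index=False):
    # Keep the column names whose flag is truthy (1 or True) in this row.
    active = [name for name, flag in zip(columns, row) if flag]
    joined.append(", ".join(active))

# One string per input row, aligned with the input index.
result = pd.Series(joined, index=example_input.index)
print(result)
```

This gives one string per input row and works for Boolean columns as well, but every row is handled in Python rather than inside pandas.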

The input dataframe can use actual Boolean values instead of integer encoding if it makes things easier. The output can be a dataframe or a series; I'm eventually going to add the resulting column to a larger dataframe. I'm also open to using numpy if it allows for a better solution, but otherwise I would prefer to stick with pandas.


Solution

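  • You can use DataFrame.dot, which avoids iterating over the rows in Python and is typically much faster. Multiplying a string by 0 gives an empty string and by 1 gives the string itself, so the dot product of each row with the column names (each suffixed with ', ') concatenates the names of the active columns; str.rstrip then removes the trailing separator. Here df is the input dataframe (example_input in the question):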

    df.dot(df.columns + ', ').str.rstrip(', ')
    

    0         three, four
    1    one, three, four
    2               three
    3          one, three
    4                    
    dtype: object
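To see why the dot product produces joined strings, it may help to spell out the arithmetic for a single row in plain Python. This is only an illustration of the mechanism, not how pandas implements it internally; the names labels, pieces and joined are made up for the example.

```python
import pandas as pd

example_input = pd.DataFrame({"one"   : [0, 1, 0, 1, 0],
                              "two"   : [0, 0, 0, 0, 0],
                              "three" : [1, 1, 1, 1, 0],
                              "four"  : [1, 1, 0, 0, 0]})

# Each column name with the separator appended, as in df.columns + ', '.
labels = [name + ", " for name in example_input.columns]

# Row 1 has flags one=1, two=0, three=1, four=1.
row = example_input.iloc[1]

# "Multiply" each flag by its label: 1 * s is s, 0 * s is the empty string.
pieces = [int(flag) * label for flag, label in zip(row, labels)]

# "Sum" the products: for strings, addition is concatenation.
joined = "".join(pieces)

print(pieces)               # ['one, ', '', 'three, ', 'four, ']
print(joined.rstrip(", "))  # one, three, four
```

Note that rstrip(', ') strips any trailing run of commas and spaces rather than the exact two-character separator, so a column name that itself ends in a comma or a space would be trimmed as well.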