Search code examples
pythonpandasdataframebinarydummy-variable

Pandas DataFrame: How to convert numeric columns into pairwise categorical data?


Given a pandas dataFrame, how does one convert several numeric columns (where x≠1 denotes the value exists, x=0 denotes it doesn't) into pairwise categorical dataframe? I know it is similar to one-hot decoding but the columns are not exactly one hot.

An example:

 df
id A  B  C  D
0  3  0  0  1
1  4  1  0  0
2  1  7  20 0
3  0  0  0  4
4  0  0  0  0
5  0  1  0  0

The result would be: df id match

 result 
0  A
0  D 
1  A
1  B
2  A
2  B
2  C
3  D
5  B

Solution

  • Use DataFrame.stack with filtering and Index.to_frame:

    s = df.stack()
    
    df = s[s!=0].index.to_frame(index=False).rename(columns={1:'result'})
    print (df)
       id result
    0   0      A
    1   0      D
    2   1      A
    3   1      B
    4   2      A
    5   2      B
    6   2      C
    7   3      D
    8   5      B
    

    Or if performance is important use numpy.where for indices by matched values with DataFrame constructor:

    i, c = np.where(df != 0)
    
    df = pd.DataFrame({'id':df.index.values[i],
                       'result':df.columns.values[c]})
    print (df)
       id result
    0   0      A
    1   0      D
    2   1      A
    3   1      B
    4   2      A
    5   2      B
    6   2      C
    7   3      D
    8   5      B
    

    EDIT:

    For first:

    s = df.stack()
    
    df = s[s!=0].reset_index()
    df.columns= ['id','result','vals']
    print (df)
       id result  vals
    0   0      A     3
    1   0      D     1
    2   1      A     4
    3   1      B     1
    4   2      A     1
    5   2      B     7
    6   2      C    20
    7   3      D     4
    8   5      B     1
    

    For second:

    df = pd.DataFrame({'id':df.index.values[i],
                       'result':df.columns.values[c],
                       'vals':df.values[i,c]})