Search code examples
pythonpandastranspose

Transpose columns to rows, show value_counts of former columns as column value in Pandas


Let's suppose I have this dataset. Each variable can assume values between 1 and 5. The variables represent similar objects, so I would like to compare them across all data sets.

import pandas as pd
df = pd.DataFrame({'dataset': ["ds1", "ds2", "ds3", "ds4", "ds5"], 
                   'var1': [1, 4, 5, 3, 1], 
                   'var2': [2, 1, 1, 2, 5],
                   'var3': [2, 1, 1, 2, 5]})
df

I want to transpose the df using .T and sum up the data as follows:

rows: each representing one variable (var1 to var3) columns 1 - 5 : representing values of these variables, each cell showing the total number of appearances of this value across all datasets (var1.value_counts)

Example: Row1 represents var1. Column1 shows "1" appears two times across all datasets, column2 shows 2 appears 0 times

Expected outcome

         1 2 3 4 5
var1     2 0 1 1 1
var2     2 2 0 0 1
var3     2 2 0 0 1 

Maybe I need to use crosstab or pivot_table? Thanks!


Solution

  • Use DataFrame.melt with DataFrame.pivot_table and DataFrame.rename_axis:

    df1 = (df.melt('dataset')
             .pivot_table(index='variable',
                          columns='value',
                          aggfunc='size',
                          fill_value=0)
             .rename_axis(index=None, columns=None))
    print (df1)
          1  2  3  4  5
    var1  2  0  1  1  1
    var2  2  2  0  0  1
    var3  2  2  0  0  1
    

    Or for all columns without first value_counts with transpose, replaced missing values to 0 and last converting all values to integers:

    df = df.set_index('dataset').apply(pd.value_counts).T.fillna(0).astype(int)
    print (df)
          1  2  3  4  5
    var1  2  0  1  1  1
    var2  2  2  0  0  1
    var3  2  2  0  0  1