Search code examples
pythonpandassklearn-pandas

Pandas how to reshape a dataframe containing duplicated values for columns


This is the dataframe I have:

       chr  value   region
      chr22     1   21-77
       chr6     3   12-65
       chr3     5   73-81
       chr3     8   91-96

And this is what I need:

                       21-77    12-65   73-81   91-96
           chr22        1       0       0       0                           
           chr6         0       3       0       0
           chr3         0       0       5       8

Please note that the first column of the initial dataframe contains duplicate values. (such as chr3)

Could you tell me how I can achieve this please. Thanks in advance.


Solution

  • Looks like the perfect application for pandas pivot_table.

    Worth highlighting that pivot_table uses numpy mean as aggregation function (in case there are multiple observations with same index & column. So it implicitly requires numbers (int/floats) as values by default.

    Let frame be the pandas dataframe containing your data:

    import pandas as pd
    
    cc = ['chr', 'value', 'region']
    vals = [['chr22', 1, '21-77'],
           ['chr6',     3,   '12-65'],
           ['chr3',     5,   '73-81'],
           ['chr3',     8,   '91-96']]
    
    frame = pd.DataFrame(vals, columns = cc)
    
    result = pd.pivot_table(frame,
                            values = 'value', index = ['chr'], columns = ['region'],
                            fill_value = 0)