Search code examples
python-3.xpandasdataframepivotpandas-melt

how to convert pandas dataframe to columns in python


I have a dataset given below:

weekid   A      B      C      D      E     F
    1    10    20      30     40     0     50
    2    70    100     0      0      80    0 

I am trying to convert given first dataset into another format without including missing values (which is 0 in this case):

weekid type amount
1       A    10   
1       B    20
1       C    30
1       D    40
1       F    50

2       A    70
2       E    80
2       B    100

Is there any way to convert first pandas dataframe into second one as a dataframe efficiently? Thanks.


Solution

  • You can use melt and filter the data when value != 0.

    First you want to identify all the columns that you want to use as identifies. So you set id_vars to weekid as that's your primary column to match.

    id_vars: Column(s) to use as identifier variables.

    Then you want to identify column that needs to end up as rows. Set the var_name to that. In this case, that's columns A thru F and you want that to be assigned to Type

    var_name: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

    Then you want to identify all the values in the rows that need to end up as a column. Here you want all the values to be part

    value_name: Name to use for the ‘value’ column.

    You then get the final dataframe. However, you want to remove values that are equal to 0. So you filter them out using != 0.

    Update: Added recommendation from @sammywemmy By using ignore_index=False, we are able to capture the index of the original dataframe. Then sorting it by that index, we retain the order. I am then resetting the index.

    Putting all that together, here's how you can do it.

    c = ['weekid','A','B','C','D','E','F']
    d = [[1,10,20,30,40,0,50],
         [2,70,100,0,0,80,0]]
    import pandas as pd
    df = pd.DataFrame(d,columns=c)
    
    df1= (df.melt(id_vars=["weekid"], 
            var_name="Type", 
            value_name="Amount",
            ignore_index=False)
            .sort_index()
            .reset_index(drop=True))
    
    df1 = df1[df1['Amount'] != 0]
    print (df1)
    

    The output will be:

    Input dataframe:

       weekid   A    B   C   D   E   F
    0       1  10   20  30  40   0  50
    1       2  70  100   0   0  80   0
    

    Output dataframe:

        weekid Type  Amount
    0        1    A      10
    1        1    B      20
    2        1    C      30
    3        1    D      40
    5        1    F      50
    6        2    A      70
    7        2    B     100
    10       2    E      80
    

    An alternate method I found in this StackOverflow link is to set_index with stack for MultiIndex Series and then reset_index

    You can do it this way:

    c = ['weekid','A','B','C','D','E','F']
    d = [[1,10,20,30,40,0,50],
         [2,70,100,0,0,80,0]]
    import pandas as pd
    df = pd.DataFrame(d,columns=c)
    
    df2 = (df.set_index(["weekid"])
             .stack()
             .reset_index(name='Amount')
             .rename(columns={'level_1':'Type'}))
    
    df2 = df2[df2['Amount'] != 0]
    print (df2)
    

    The output will be:

        weekid Type  Amount
    0        1    A      10
    1        1    B      20
    2        1    C      30
    3        1    D      40
    5        1    F      50
    6        2    A      70
    7        2    B     100
    10       2    E      80