Search code examples
python-3.xpandasnumpydataframeanalytics

How to convert column into row?


Assuming I have two rows where for most of the columns the values are same, but not for all. I would like to group these two rows into one where ever the values are same and if the values are different then create an extra column and assign the column name as 'column1'

Step 1: Here assuming I have columns which has same value in both the rows 'a','b','c' and columns which has different values are 'd','e','f' so I am grouping using 'a','b','c' and then unstacking 'd','e','f'

Step 2: Then I am dropping the levels then renaming it to 'a','b','c','d','d1','e','e1','f','f1'

But in my actual case I have 500+ columns and million rows, I dont know how to expand this to 500+ columns where I have constrains like 1) I dont know which all columns will have same values 2) And which all columns will have different values that needs to be converted into new column after grouping with the columns that has same value

df.groupby(['a','b','c']) ['d','e','f'].apply(lambda x:pd.DataFrame(x.values)).unstack().reset_index()

df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']

enter image description here

To be more clear, the below code creates the sample dataframe & expected output

df = pd.DataFrame({'Cust_id':[100,100, 101,101,102,103,104,104], 'gender':['M', 'M', 'F','F','M','F','F','F'], 'Date':['01/01/2019', '02/01/2019','01/01/2019',
                                                                                                                   '01/01/2019','03/01/2019','04/01/2019','03/01/2019','03/01/2019'],
              'Product': ['a','a','b','c','d','d', 'e','e']})



expected_output = pd.DataFrame({'Cust_id':[100, 101,102,103,104], 'gender':['M', 'F','M','F','F'], 'Date':['01/01/2019','01/01/2019','03/01/2019','04/01/2019', '03/01/2019'], 'Date1': ['02/01/2019', 'NA','NA','NA','NA']
                                  , 'Product': ['a', 'b', 'd', 'd','e'], 'Product1':['NA', 'c','NA','NA','NA' ]})

Solution

  • you may do following to get expected_output from df

    s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
    df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
    df1.columns = df1.columns.map(''.join)
    
    Out[57]:
                          Date       Date1 Product Product1
    Cust_id gender
    100     M       01/01/2019  02/01/2019       a        a
    101     F       01/01/2019  01/01/2019       b        c
    102     M       03/01/2019         NaN       d      NaN
    103     F       04/01/2019         NaN       d      NaN
    104     F       03/01/2019  03/01/2019       e        e
    

    Next, replace columns having duplicated values with NA

    df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()
    
    Out[72]:
       Cust_id gender        Date       Date1 Product Product1
    0      100      M  01/01/2019  02/01/2019       a       NA
    1      101      F  01/01/2019          NA       b        c
    2      102      M  03/01/2019          NA       d       NA
    3      103      F  04/01/2019          NA       d       NA
    4      104      F  03/01/2019          NA       e       NA