python-3.x pandas numpy dataframe analytics

How to convert column into row?

Assuming I have two rows where for most of the columns the values are same, but not for all. I would like to group these two rows into one where ever the values are same and if the values are different then create an extra column and assign the column name as 'column1'

Step 1: Here assuming I have columns which has same value in both the rows 'a','b','c' and columns which has different values are 'd','e','f' so I am grouping using 'a','b','c' and then unstacking 'd','e','f'

Step 2: Then I am dropping the levels then renaming it to 'a','b','c','d','d1','e','e1','f','f1'

But in my actual case I have 500+ columns and million rows, I dont know how to expand this to 500+ columns where I have constrains like 1) I dont know which all columns will have same values 2) And which all columns will have different values that needs to be converted into new column after grouping with the columns that has same value

df.groupby(['a','b','c']) ['d','e','f'].apply(lambda x:pd.DataFrame(x.values)).unstack().reset_index()

df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']

To be more clear, the below code creates the sample dataframe & expected output

df = pd.DataFrame({'Cust_id':[100,100, 101,101,102,103,104,104], 'gender':['M', 'M', 'F','F','M','F','F','F'], 'Date':['01/01/2019', '02/01/2019','01/01/2019',
                                                                                                                   '01/01/2019','03/01/2019','04/01/2019','03/01/2019','03/01/2019'],
              'Product': ['a','a','b','c','d','d', 'e','e']})



expected_output = pd.DataFrame({'Cust_id':[100, 101,102,103,104], 'gender':['M', 'F','M','F','F'], 'Date':['01/01/2019','01/01/2019','03/01/2019','04/01/2019', '03/01/2019'], 'Date1': ['02/01/2019', 'NA','NA','NA','NA']
                                  , 'Product': ['a', 'b', 'd', 'd','e'], 'Product1':['NA', 'c','NA','NA','NA' ]})

Solution

you may do following to get expected_output from df

s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
df1.columns = df1.columns.map(''.join)

Out[57]:
                      Date       Date1 Product Product1
Cust_id gender
100     M       01/01/2019  02/01/2019       a        a
101     F       01/01/2019  01/01/2019       b        c
102     M       03/01/2019         NaN       d      NaN
103     F       04/01/2019         NaN       d      NaN
104     F       03/01/2019  03/01/2019       e        e

Next, replace columns having duplicated values with NA

df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()

Out[72]:
   Cust_id gender        Date       Date1 Product Product1
0      100      M  01/01/2019  02/01/2019       a       NA
1      101      F  01/01/2019          NA       b        c
2      102      M  03/01/2019          NA       d       NA
3      103      F  04/01/2019          NA       d       NA
4      104      F  03/01/2019          NA       e       NA