Search code examples
pythonpython-3.xpandasdynamic-allocation

Is there no syntax suger for dynamic creating columns with multiindexed pandas dataframe?


First, I show the pandas dataframe to elucidate my problem.

import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)

this python code creates dataframe(df1) like this:

#input dataframe
lv1  A       B
lv2  c   d   c   d
0    1   2   3   4
1    5   6   7   8
2    9  10  11  12

I want to create columns 'c*d' on lv2 by using df1's data. like this:

#output dataframe after calculation
lv1  A           B        
lv2  c   d  c*d  c    d  c*d
0    1   2    2  3    4   12
1    5   6   30  7    8   56
2    9  10   90  11  12  132

For this problem,I wrote some code like this:

for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")]
df1.sort_index(1,inplace=True)

Although this code almost solved my problem, but I really want to write without 'for' statement like this:

df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]

With this statement,I got Key error that says 'c*d' is missing. Is there no syntax sugar for this calculation? Or can I achieve better performance by other code?


Solution

  • Explanation of jezrael's answer using stack which is may be the most idiomatic way in pandas.

    output = (df1
                 # "Stack" data, by moving the top level ('lv1') of the
                 # column MultiIndex into row index,
                 # now the rows are a MultiIndex and the columns
                 # are a regular Index.
                 .stack(0)
    
                 # Since we only have 2 columns now, 'lv2' ('c' & 'd')
                 # we can multiply them together along the row axis.
                 # The assign method takes key=value pairs mapping new column
                 # names to the function used to calculate them. Here we're
                 # wrapping them in a dictionary and unpacking them using **
                 .assign(**{'c*d': lambda x: x.product(axis=1)})
    
                 # Undos the stack operation, moving 'lv1', back to the
                 # column index, but now as the bottom level of the column index
                 .unstack()
    
                 # This sets the order of the column index MultiIndex levels.
                 # Since they are named we can use the names, you can also use
                 # their integer positions instead. Here axis=1 references
                 # the column index
                 .swaplevel('lv1', 'lv2', axis=1)
    
                 # Sort the values in both levels of the column MultiIndex.
                 # This will order them as c, c*d, d which is not what you
                 # specified above, however having a sorted MultiIndex is required
                 # for indexing via .loc[:, (...)] to work properly
                 .sort_index(axis=1)
              )