Search code examples
pythonpandasmulti-index

store calculated data in new multicolumn of pandas dataframe


I have a pandas dataframe with multiindex columns:

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
      ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

Now I need to divide and store the values from df["bar"] by df["baz"] in the dataframe under the name "new" (with second level index being 1 and 2).

df["bar"] / df["baz"] gives me the correct values, however I don't understand how to store this in the dataframe.

I tried:
df["new"] = df["bar"]/df["baz"] and df.loc[:, ("new", ["one", "two"])] = df["bar"]/df["baz"], but both give errors. Any ideas how to store the data under a new name in the dataframe?


Solution

  • You can add level by MultiIndex.from_product and then use concat:

    a = df["bar"] / df["baz"]
    a.columns = pd.MultiIndex.from_product([['new'], a.columns])
    print (a)
            new          
            one       two
    A -1.080108 -0.876062
    B  0.171536  0.278908
    C  2.045792  0.795082
    
    df1 = pd.concat([df, a], axis=1)
    print (df1)
    first        bar                 baz                 foo                 qux  \
    second       one       two       one       two       one       two       one   
    A      -0.668129 -0.498210  0.618576  0.568692  1.350509  1.629589  0.301966   
    B      -0.345811 -0.315231 -2.015971 -1.130231 -1.111846  0.237851 -0.325130   
    C       1.915676  0.920348  0.936398  1.157552 -0.106208 -0.088752 -0.971485   
    
    first                  new            
    second       two       one       two  
    A       0.449483 -1.080108 -0.876062  
    B       1.944702  0.171536  0.278908  
    C      -0.384060  2.045792  0.795082  
    

    Another solution with selecting by xs and rename, last join to original:

    a = (df.xs("bar", axis=1, level=0, drop_level=False) / df["baz"])
           .rename(columns={'bar':'new'})
    
    df1 = df.join(a)
    print (df1)
    first        bar                 baz                 foo                 qux  \
    second       one       two       one       two       one       two       one   
    A      -0.668129 -0.498210  0.618576  0.568692  1.350509  1.629589  0.301966   
    B      -0.345811 -0.315231 -2.015971 -1.130231 -1.111846  0.237851 -0.325130   
    C       1.915676  0.920348  0.936398  1.157552 -0.106208 -0.088752 -0.971485   
    
    first                  new            
    second       two       one       two  
    A       0.449483 -1.080108 -0.876062  
    B       1.944702  0.171536  0.278908  
    C      -0.384060  2.045792  0.795082 
    

    And solution with reshaping by stack and unstack should be slowier in large df:

    df1 = df.stack()
    df1['new'] = df1["bar"] / df1["baz"]
    df1 = df1.unstack()
    print (df1)
    first        bar                 baz                 foo                 qux  \
    second       one       two       one       two       one       two       one   
    A      -0.668129 -0.498210  0.618576  0.568692  1.350509  1.629589  0.301966   
    B      -0.345811 -0.315231 -2.015971 -1.130231 -1.111846  0.237851 -0.325130   
    C       1.915676  0.920348  0.936398  1.157552 -0.106208 -0.088752 -0.971485   
    
    first                  new            
    second       two       one       two  
    A       0.449483 -1.080108 -0.876062  
    B       1.944702  0.171536  0.278908  
    C      -0.384060  2.045792  0.795082   
    

    Solution with loc:

    a = (df.loc(axis=1)['bar', :] / df["baz"]).rename(columns={'bar':'new'})
    print (a)
    first        new          
    second       one       two
    A      -1.080108 -0.876062
    B       0.171536  0.278908
    C       2.045792  0.795082
    
    df1 = df.join(a)
    print (df1)
    first        bar                 baz                 foo                 qux  \
    second       one       two       one       two       one       two       one   
    A      -0.668129 -0.498210  0.618576  0.568692  1.350509  1.629589  0.301966   
    B      -0.345811 -0.315231 -2.015971 -1.130231 -1.111846  0.237851 -0.325130   
    C       1.915676  0.920348  0.936398  1.157552 -0.106208 -0.088752 -0.971485   
    
    first                  new            
    second       two       one       two  
    A       0.449483 -1.080108 -0.876062  
    B       1.944702  0.171536  0.278908  
    C      -0.384060  2.045792  0.795082  
    

    Setup:

    np.random.seed(456)
    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
    tuples = list(zip(*arrays))
    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
    print (df)
    first        bar                 baz                 foo                 qux  \
    second       one       two       one       two       one       two       one   
    A      -0.668129 -0.498210  0.618576  0.568692  1.350509  1.629589  0.301966   
    B      -0.345811 -0.315231 -2.015971 -1.130231 -1.111846  0.237851 -0.325130   
    C       1.915676  0.920348  0.936398  1.157552 -0.106208 -0.088752 -0.971485   
    
    first             
    second       two  
    A       0.449483  
    B       1.944702  
    C      -0.384060