python pandas multi-index

Python/Pandas - 2 multi-index DataFrames visually the same, but not equal?


I am using 2 ways to feed a multi-index pandas DataFrame:

  • one that initializes it with values,

  • one that fills it row by row (speed is not a concern; I have to do it this way because each new row becomes available after a separate computation, not all at once)

Both DataFrames look the same, but pandas tells me they are not equal. What should I do so that they are the same?

 import attr
 import pandas as pd

 @attr.s(frozen=True)
 class CDE(object):
     name : str = attr.ib()
     def __str__(self):
         return self.name

 # Method 1: filling it row per row.
 rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['CDE','period'])
 c_array = [['Column', 'Column'],['First', 'Last']]
 cmidx = pd.MultiIndex.from_arrays(c_array)
 summary1 = pd.DataFrame(index = rmidx, columns=cmidx)
 mcde1 = CDE(name='hi')
 mcde2 = CDE(name='ho')
 values = [[20,30],[40,50]]
 period = '5m'
 summary1.loc[(mcde1, period),('Column',)]=values[0]
 summary1.loc[(mcde2, period),('Column',)]=values[1]

 # Method 2: all at once.
 columns = summary1.columns
 rows = summary1.index.names
 index_label = [[mcde1,mcde2],[period,period]]
 summary2 = pd.DataFrame(values, index=index_label, columns=columns)
 summary2.index.names = rows

 In [2]:summary1
 Out[2]: 
            Column     
             First Last
 CDE period            
 hi  5m         20   30
 ho  5m         40   50

 In [3]:summary2
 Out[3]: 
            Column     
             First Last
 CDE period            
 hi  5m         20   30
 ho  5m         40   50

But

 summary1.equals(summary2)
 False

Why is that so? Thanks for any advice on that.


Solution

  • From the documentation of DataFrame.equals: "This function requires that the elements have the same dtype as their respective elements in the other Series or DataFrame." But if you do

    summary1.dtypes, summary2.dtypes
    

    You would get

    (Column  First    object
             Last     object
     dtype: object,
     Column  First    int64
             Last     int64
     dtype: object)
    

    This is because when you create an empty data frame

    summary1 = pd.DataFrame(index = rmidx, columns=cmidx)
    

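    the default dtype is object. Therefore, whenever you add a new row, its values are cast to that existing dtype (object) and the columns stay object. On the other hand, if you create a data frame from the data directly, pandas infers the most suitable dtype, in this case int64. The two frames hold the same values but with different dtypes, which is why equals returns False.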
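
    One way to make the two frames equal is to convert the object columns to int64 once all rows are filled in. The snippet below is a minimal sketch, not the asker's exact code: plain strings stand in for the CDE objects, and the row-by-row frame is imitated by building it with dtype=object, which is an assumption made only to keep the example short and reproducible. The names as_object, as_int, converted and inferred are illustrative.

```python
import pandas as pd

# Same column and row layout as in the question (strings replace CDE objects).
columns = pd.MultiIndex.from_arrays([["Column", "Column"], ["First", "Last"]])
index = pd.MultiIndex.from_arrays(
    [["hi", "ho"], ["5m", "5m"]], names=["CDE", "period"]
)
values = [[20, 30], [40, 50]]

# Stand-in for the frame filled row by row: same values, stored as object.
as_object = pd.DataFrame(values, index=index, columns=columns, dtype=object)

# Frame built all at once: pandas infers int64 for the columns.
as_int = pd.DataFrame(values, index=index, columns=columns)

# The frames print identically, but their dtypes differ, so equals() fails.
print(as_object.dtypes)
print(as_int.dtypes)
print(as_object.equals(as_int))

# Option 1: cast explicitly once the frame is complete.
converted = as_object.astype("int64")
print(converted.equals(as_int))

# Option 2: let pandas pick a better dtype for the object columns.
inferred = as_object.infer_objects()
print(inferred.equals(as_int))
```

    astype is the more explicit choice when the target dtype is known in advance; infer_objects is convenient when the columns may hold different types and pandas should decide per column.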