Tags: python, pandas, dataframe, multi-index

Python / Pandas: How to create an empty multi-index DataFrame and then start filling it?


I would like to store a summary of a local set of DataFrames in a "meta DataFrame" using pd.MultiIndex.

Basically, the row axis has two levels, and so does the column axis. In the class managing the set of DataFrames, I define this "meta DataFrame" as a class variable.

import pandas as pd

row_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Period'])
column_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Extrema'])
MD = pd.DataFrame(index=row_axis, columns=column_axis)

It seems to work.

MD.index
>>> MultiIndex([], names=['Data', 'Period'])

MD.columns
>>> MultiIndex([], names=['Data', 'Extrema'])

Now, each time I process an individual DataFrame id, I want to update this "meta DataFrame" accordingly. id has a DatetimeIndex with a period of '5m'.

id.index[0]
>>> Timestamp('2020-01-01 08:00:00')

id.index[-1]
>>> Timestamp('2020-01-02 08:00:00')

For instance, I want to keep the first and last index values of id in MD.

MD.loc[[('id', '5m')],[('Timestamp', 'First')]] = id.index[0]
MD.loc[[('id', '5m')],[('Timestamp', 'Last')]] = id.index[-1]

This doesn't work; I get the following error message:

TypeError: unhashable type: 'list'

In the end, I would like MD to hold the following kind of information (I have other id DataFrames with different periods):

           Timestamp
           First                   Last
id    5m   2020-01-01 08:00:00     2020-01-02 08:00:00
     10m   2020-01-05 08:00:00     2020-01-06 18:00:00

Ultimately, I will also keep the min and max of some columns in id, for instance if id has a column 'Temperature':

           Timestamp                                     Temperature
           First                Last                     Min    Max
id    5m   2020-01-01 08:00:00  2020-01-02 08:00:00      -2.5   10
     10m   2020-01-05 08:00:00  2020-01-06 18:00:00      4      15

These values will be recorded each time I record id.

I am aware that initializing a DataFrame cell by cell is not time efficient, but it will not be done that often.

Besides, I don't see how I could manage this organization of information in a dict, which is why I am considering a multi-level DataFrame. I will then dump it to a csv file to store this "meta data".

Please, what is the right way to initialize each of these values in MD?

I thank you for your help! Bests,


Solution

  • Instead of filling an empty DataFrame, you can store the data in a dict of dicts. A MultiIndex uses tuples as its index values, so we make the keys of each dictionary tuples.

    The outer dictionary uses the column MultiIndex tuples as keys. Each of its values is another dictionary, whose keys are the row MultiIndex tuples and whose values are the cell contents.

    d = {('Score', 'Min'):       {('id1', '5m'): 72, ('id1', '10m'): -18},
         ('Timestamp', 'First'): {('id1', '5m'): 1, ('id1', '10m'): 2},
         ('Timestamp', 'Last'):  {('id1', '5m'): 10, ('id1', '10m'): 20}}
         #        |                     |                            |
         #  column MultiIndex     row MultiIndex                cell value
         #       label                label
    
    pd.DataFrame(d)
    

            Score Timestamp     
              Min     First Last
    id1 5m     72         1   10
        10m   -18         2   20
    

    Creating that dict will depend on how you get the values. You can extend a dict with update.
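
As a minimal sketch of how the dict of dicts could be filled one DataFrame at a time, the example below loops over a small set of frames and records the first and last timestamps plus the temperature extrema. The frames dict, its sample data, and the summarize helper are hypothetical stand-ins for the asker's own objects; the level names come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the individual DataFrames, keyed by their
# row label (Data, Period). Replace with the real frames.
frames = {
    ('id', '5m'): pd.DataFrame(
        {'Temperature': [-2.5, 3.0, 10.0]},
        index=pd.to_datetime(['2020-01-01 08:00', '2020-01-01 20:00',
                              '2020-01-02 08:00']),
    ),
    ('id', '10m'): pd.DataFrame(
        {'Temperature': [4.0, 15.0]},
        index=pd.to_datetime(['2020-01-05 08:00', '2020-01-06 18:00']),
    ),
}


def summarize(df):
    """Hypothetical helper: map each column label (a tuple) to one value."""
    return {
        ('Timestamp', 'First'): df.index[0],
        ('Timestamp', 'Last'): df.index[-1],
        ('Temperature', 'Min'): df['Temperature'].min(),
        ('Temperature', 'Max'): df['Temperature'].max(),
    }


# Outer keys are column tuples, inner keys are row tuples.
d = {}
for row_label, df in frames.items():
    for col_label, value in summarize(df).items():
        d.setdefault(col_label, {})[row_label] = value

# Build the meta DataFrame once, then name the levels as in the question.
MD = pd.DataFrame(d)
MD.index.names = ['Data', 'Period']
MD.columns.names = ['Data', 'Extrema']
```

Because the dict is cheap to extend, the DataFrame can simply be rebuilt from it whenever a new frame is recorded, which avoids enlarging an empty MultiIndex DataFrame cell by cell.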