Search code examples
pythonpandasdataframemulti-index

Is it possible in a multiindexed Pandas dataframe to have a column whose values refer to a higher level index?


Suppose I have such a dataframe in Pandas:

df = pd.DataFrame({'a':[4,4,8,8],'b':[4,5,6,5], 'd':[0,1,2,1]})
multi_idx = pd.MultiIndex.from_arrays([[0,0,1,1],[0,1,0,1]])
df.index= multi_idx

which outputs this shape:

     a  b  d
0 0  4  4  0
  1  4  5  1
1 0  8  6  2
  1  8  5  1

You see that values of columns a are repeated based on the first level index. I am looking for a way to avoid this repetition. One of course would be to split the information across more dataframes, i.e., having a data frame for column a and first level index. However, I was wondering if there is somehow a way using multiindex and multi level columns, to create a columns whose values correspond to all rows of a higher level index.

Visually I would like something like that: is that possible in Pandas?

|    |    |  a  |  b  |  c  | 
|----|----|-----|-----|-----|
|idx1|idx2|     |     |     |
|----|----|-----|-----|-----|
|    |  0 |     |  4  |  0  |
|  0 |----|  4  |-----|-----|
|    |  1 |     |  5  |  1  |
|----|----|-----|-----|-----|
|    |  0 |     |  6  |  2  |
|  1 |----|  8  |-----|-----|
|    |  1 |     |  5  |  1  |
|----|----|-----|-----|-----|

Solution

  • This is unfortunately not possible, pandas doesn't have a concept of merged cells.

    If you want to "merge" contiguous identical cells, you either have to use a MultiIndex to benefit from this display (only on the initial levels though) or to post-process the html output to change the rowspan of those cells manually.

    Example to process the HTML, partially with pandas to compute the size of the rowspan and with BeautifulSoup to modify the HTML:

    from IPython.core.display import HTML
    from IPython.display import display
    from bs4 import BeautifulSoup
    import re
    
    def to_rowspan(s):
        m = s.ne(s.groupby(level=0).shift())
        sz = s.groupby(m.cumsum()).transform('size')
        return [f'[rowspan={size}]{val}' if mask else 'TO_DELETE'
                for size, mask, val in zip(sz, m, s)]
    
    def merge_cells(df):
        reg = re.compile(r'\[rowspan=(\d+)\](.*)')
        soup = BeautifulSoup(df.apply(to_rowspan).to_html())
        
        for row in soup.find_all('td'):
            if row.text == 'TO_DELETE':
                row.decompose()
            elif (m:=reg.search(row.text)):
                a,b = m.groups() ; a = int(a)
                if a>1:
                    row['rowspan'] = a
                row.string = b
        return soup.prettify()
    
    display(HTML(merge_cells(df)))
    

    Output:

    enter image description here