Search code examples
pandasdataframelatex

Multirow in pandas df - one header with multiple values


I have a dataframe like this (but with more columns and rows):

pd.DataFrame({
    "Measure" : ["twitter", "news"],
    "Upper quartile" : [0.1, 0.2],
    "Lower quartile" : [0.05, 0.1],
    "P-value1" : [0.05, 0.02],
    "Upper decile" : [0.11, 0.22],
    "Lower decile" : [0.03, 0.09],
    "P-value2" : [0.04, 0.01]
})

And I need to kind of merge columns "Upper quartile", "Lower quartile" and "P-value1" into one column and "Upper decile", "Lower decile" and "P-value2" to another. In the end I would like to have a table like this (using sort of latex notation):

Measure & Upper quartile & Upper decile \\
{}      & Lower quartile & Lower decile \\
{}      & P-value1       & P-value2 \\
twitter & 0.1            & 0.11 \\
{}      & 0.05           & 0.03 \\
{}      & 0.05           & 0.04 \\
news    & 0.2            & 0.22 \\
{}      & 0.1            & 0.09 \\
{}      & 0.02           & 0.01 \\
{} denotes empty cells
\\ denotes end of line
& is column separator

This part should be my header:

Measure & Upper quartile & Upper decile \\
{}      & Lower quartile & Lower decile \\
{}      & P-value1       & P-value2 \\

I have tried to set Multiindex, but it is not exactly what I need cause each level of multiindex relates to every value in given column, but this is not case of my target table. In my table, every row contains 3 values and only one of three headers relates to each of the values. I have also tried groupby and melt, but did not figure it out.

The reason for doing this: my table (df) is too wide when exported to overleaf, so I am trying to make it narrower so it fits on one page. I could do the table I need manually in overleaf, but I have quite a few tables like this, so I prefer preparing it in pandas and then exporting to latex.


Solution

  • Using a MultiIndex and reshaping, assuming your measurements are sorted and grouped in the desired order (e.g. Upper/Lower/P-Value)

    # number of measurements per final column
    N = 3
    
    tmp = df.set_index('Measure')
    
    M = tmp.shape[1]//N
    cols = pd.MultiIndex.from_arrays(tmp.columns.to_numpy().reshape(-1, N).T,
                                     names=['Measure']+['{}']*(N-1))
    idx = (tmp.index.repeat(N).to_series()
           .mask(lambda x: x.duplicated(), '{}')
           )
    
    def pad(s):
        s = s.astype(str)
        return s.str.ljust(s.str.len().max())
    
    out = '\n'.join(pd.concat([
    cols.to_frame(allow_duplicates=True, index=False).T,
    pd.DataFrame(tmp.to_numpy().reshape(len(df), -1, N).swapaxes(1,2).reshape(-1, M),
                 index=idx),
        ])
        .reset_index().astype(str)
        .apply(pad) # optional
        .agg(' & '.join, axis=1).add(r' \\')
        )
    
    print(out)
    

    Output:

    Measure & Upper quartile & Upper decile \\
    {} & Lower quartile & Lower decile \\
    {} & P-value1 & P-value2 \\
    twitter & 0.1 & 0.11 \\
    {} & 0.05 & 0.03 \\
    {} & 0.05 & 0.04 \\
    news & 0.2 & 0.22 \\
    {} & 0.1 & 0.09 \\
    {} & 0.02 & 0.01 \\
    

    Output with optional padding:

    Measure & Upper quartile & Upper decile \\
    {}      & Lower quartile & Lower decile \\
    {}      & P-value1       & P-value2     \\
    twitter & 0.1            & 0.11         \\
    {}      & 0.05           & 0.03         \\
    {}      & 0.05           & 0.04         \\
    news    & 0.2            & 0.22         \\
    {}      & 0.1            & 0.09         \\
    {}      & 0.02           & 0.01         \\