Search code examples
pythonpandasstatisticsstatsmodelsstandard-deviation

Descriptive statistics in Python /with Pandas with std in parentheses


This question concerns the best-practice to do descriptive statistics in Python with a formatted output that correspond to tables found in academic publications: means with their respective standard deviations in parenthesis below. Final goal is to be able to export it in a Latex tabular format (or an other format, html, etc).

Example (Deucherta & Eugster (2018)):

enter image description here

Pandas:

The classical solution to do descriptive statistics in Pandas is to use the describe() method of a DataFrame.

import numpy as np
import pandas as pd 

# Generate a DataFrame to have an example 
df = pd.DataFrame(
    {"Age" : np.random.normal(20,15,5), 
     "Income": np.random.pareto(1,5)*20_000 }
    )
# The describe method to get means and stds
df.describe().loc[["mean", "std"]].T
>>>
                mean            std
Age        15.322797      13.449727
Income  97755.733510  143683.686484

What I would like to have is the following output:

Age        15.32 
          (13.44)
Income  97755.73  
        (143683.68)

It would be nice to have a solution that works with Multi-index Dataframe:

df2 = pd.DataFrame(
    {"Age" : np.random.normal(20,15,5), 
     "Income": np.random.pareto(1,5)*20_000 }
    )
df_c = pd.concat([df,df2], keys = ["A", "B"])
>>>

and get

                A           B
Age          23.15       21.33
            (11.62)      (9.34)
Income    68415.53    46619.51
         (95612.40)  (64596.10)

My current solution:

idx = pd.IndexSlice
df_desc = (df_c
      ).groupby(level = 0, axis = 0).describe()
df_desc = df_desc.loc[idx[:],idx[:,["mean", "std"]]].T
df_desc.loc[idx[:,["std"]],idx[:]] = df_desc.loc[idx[:,["std"]],idx[:]
                                               ].applymap(
                                               lambda x: "("+"{:.2f}".format(x)+")")
print(df_desc)

>>>
                     A           B
Age    mean     23.1565     21.3359
      std      (11.62)      (9.34)
Income mean     68415.5     46619.5
      std   (95612.40)  (64596.10)
Problem 1 :

I did not find the solution to hide the second index column [mean, std, mean,std].

Then I want to export my df to latex:

df_desc.to_latex()

>>>
\begin{tabular}{llll}
\toprule
       &     &            A &           B \\
\midrule
Age & mean &       5.5905 &     29.5894 \\
       & std &      (16.41) &     (13.03) \\
Income & mean &       531970 &     72653.7 \\
       & std &  (875272.44) &  (79690.18) \\
\bottomrule
\end{tabular}
Problem 2:

The & characters of the table are not aligned which makes it a bit tedious to edit (I use extensions for aligning & in VSCode)

Overall I find this solution tedious and not elegant.

Solution(s) ?

I do not know what I should do to obtain the desired result without complex string manipulation.

I have looked at Pandas styling, but I don't think that it is the best solution.

There is also StatModels Tables, but I did not find a simple solution to my problem. Statsmodels Tables seems the most promising solution. But I do not know how to implement it. There are some descriptive stats functions in StatsModels, but I read on GitHub that they are to some extent deprecated.

So what is the best way to make those tables?


Solution

  • I just ran into a similar problem and found your post, so here's how I dealt with the issues you mentioned.

    Problem 1: Hide second index column

    I prefer solution b), but leave a) here for illustrative purposes.

    a) droplevel & set_index

    df_desc.index.droplevel(level=1)
    
    >>>
    Index(['Age', 'Age', 'Income', 'Income'], dtype='object')
    

    Use this piece of code along with a set_index expression:

    df_desc.set_index(df_desc.index.droplevel(level=1), inplace=True)
    

    This results in:

    print(df_desc)
    
    >>>
                      A           B
    Age         17.1527     23.9678
    Age         (19.73)     (12.01)
    Income       293271     12178.8
    Income  (400059.27)  (14483.35)
    

    which doesn't look satisfying yet, as the index values Age and Income appear twice, respectively.

    That's why I came up with the following.

    b) Create new DataFrame using pd.DataFrame.values and assign index manually

    First, reset index:

    df_desc = df_desc.reset_index(drop=True)
    
    print(df_desc)
    
    >>>
                A           B
    0      17.306      11.425
    1     (14.40)     (16.67)
    2     88016.7     67280.4
    3  (73054.44)  (54953.69)
    

    Second, create new DataFrame specifying the index and column names manually. Note that I used df_desc.values as the data argument (first position).

    df_new = pd.DataFrame(df_desc.values, index=["Age", "", "Income", ""], columns=["A", "B"])
    
    print(df_new)
    
    >>>
                     A           B
    Age        27.7039     20.8031
               (13.99)     (13.92)
    Income     20690.7     7370.44
            (29470.03)  (13279.10)
    

    Problem 2: Align LaTeX table

    Note that running

    df_new.to_latex()
    

    indeed produces a somewhat messy str output:

    >>> 
    '\\begin{tabular}{lll}\n\\toprule\n{} &           A &           B \\\\\n\\midrule\nAge    &     27.7039 &     20.8031 \\\\\n       &     (13.99) &     (13.92) \\\\\nIncome &     20690.7 &     7370.44 \\\\\n       &  (29470.03) &  (13279.10) \\\\\n\\bottomrule\n\\end{tabular}\n'
    

    However, wrapping it inside a print statement produces the desired output:

    print(df_new.to_latex())
    
    >>>
    \begin{tabular}{lll}
    \toprule
    {} &           A &           B \\
    \midrule
    Age    &     27.7039 &     20.8031 \\
           &     (13.99) &     (13.92) \\
    Income &     20690.7 &     7370.44 \\
           &  (29470.03) &  (13279.10) \\
    \bottomrule
    \end{tabular}
    

    Moreover, exporting the table to a LaTeX document is fairly simple.

    As you noted yourself, to_latex() already creates a tabular, so you just need to write that to a file, and use \input in your LaTeX document. Following the example here, do the following:

    i) Save the table as a text file

    with open('mytable.tex','w') as tf:
        tf.write(df_new.to_latex())
    

    ii) Use the exported table in a LaTeX document

    \documentclass{article}
    \usepackage{booktabs}
    \begin{document}
    \input{mytable}
    \end{document}
    

    This example here assumes that mytable.tex and the LaTeX document are in the same folder. The booktabs package is required, since to_latex() uses the booktabs commands for table rules.

    The final pdf output looks like this:

    enter image description here