python, python-3.x, pandas, dataframe, summary

Slicing Pandas Columns to Obtain Summary Statistics


I have a dataframe that looks similar to the following:

ColA  ColB  Year  ...
=====================
1     2     2007
2     5     2007
3     4     2007
4     3     2007
5     2     2008
6     1     2008
7     0     2008
8     9     2008
...

I am using dat[['ColA', 'ColB']].describe(). As expected, this displays summary statistics for both columns over all years. I would like summary statistics for each column by year instead. In the example above, that would give 4 columns of statistics (one for ColA in 2007, one for ColA in 2008, one for ColB in 2007, and one for ColB in 2008). Is there a way to extend describe() to accommodate this?


Solution

  • You can group by year before calling describe():

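    import pandas as pd

    # Toy frame with one value column; the same call works with several columns.
    df_example = pd.DataFrame({"colA": [1, 2, 3, 4, 5, 6, 7, 8],
                               "Year": [2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008]})
    # Group the rows by Year, then compute summary statistics within each group.
    des = df_example.groupby("Year").describe()
    print(des)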
    
     colA                                          
         count mean       std  min   25%  50%   75%  max
    Year                                                
    2007   4.0  2.5  1.290994  1.0  1.75  2.5  3.25  4.0
    2008   4.0  6.5  1.290994  5.0  5.75  6.5  7.25  8.0
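
The question involves two columns, so here is a minimal sketch that applies the same idea to both ColA and ColB and arranges the result as one column of statistics per (column, year) pair. It uses only the eight rows visible in the question (the elided rows are left out), and the names df, per_year and by_col_year are illustrative, not from the original post.

```python
import pandas as pd

# Only the eight rows shown in the question; the elided rows are not reproduced.
df = pd.DataFrame({
    "ColA": [1, 2, 3, 4, 5, 6, 7, 8],
    "ColB": [2, 5, 4, 3, 2, 1, 0, 9],
    "Year": [2007] * 4 + [2008] * 4,
})

# Call describe() on each year's slice separately. The dict keys (the years)
# become the outer level of the column MultiIndex after concatenation.
per_year = pd.concat(
    {year: group[["ColA", "ColB"]].describe() for year, group in df.groupby("Year")},
    axis=1,
)

# Reorder the column levels to (column, year) and sort them, so the four
# columns are ColA/2007, ColA/2008, ColB/2007, ColB/2008.
by_col_year = per_year.swaplevel(axis=1).sort_index(axis=1)
print(by_col_year)
```

A single statistic can then be read with by_col_year.loc["mean", ("ColA", 2007)]. Calling df.groupby("Year")[["ColA", "ColB"]].describe() directly yields the same numbers, only with the years as rows and the statistics as columns.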