Search code examples
pythonpandasaggregatesdescribe

pandas dataframe.describe() obtain aggregates based on index values


I am trying to use the .describe() method on df1 to obtain aggregates. The current index is year. I want to obtain these stats based on each statistics over the 3 year period in the index. I tried using stats_df = df1.groupby('statistics').descirbe().unstack(1)) but I don't get the result that I am looking for.

in df1 =

         statistics  s_values
year
1999  cigarette use       100
1999  cellphone use       310
1999   internet use       101
1999    alcohol use       100
1999       soda use       215
2000  cigarette use       315
2000  cellphone use       317
2000   internet use       325
2000    alcohol use       108
2000       soda use       200
2001  cigarette use       122
2001  cellphone use       311
2001   internet use       112
2001    alcohol use       144
2001       soda use       689
2002  cigarette use       813
2002  cellphone use       954
2002   internet use       548
2002    alcohol use       882
2002       soda use       121

I am trying to achieve an output like this. Please keep in mind these aggregate values are not accurate I just populated them with random numbers to give you the idea of the format. result stats_df =

     statistics  count  unique  top  freq  mean  std  min  20%  40%  50%  60%  80%  max

  cigarette use    32    335    655    54   45    45   1    23   21   12   55   55  999
  cellphone use    92    131    895    49   12    33   6    13   32   55   34   12  933     
  internet use     32    111    123    44   65    31   2    42  544   15   11   54  111    
   alcohol use     32    315    611    33   41    53   3    34   22   34   11   33  555      
     soda use      32    355    655    54   45    45   1    23   21   12   55   55  999      

thank you


Solution

  • I created a sample dataframe and I could get the result with just using groupby().describe(). I am unsure what's wrong with your code, could you also edit your post to show the result you obtained?

    here's mine

    df = pd.DataFrame(index=[1999,1999,1999,1999,1999,2000,2000,2000,2000,2000], columns=['statistics', 's_values'], data=[['cigarette use', 100],['cellphone use', 310],['internet use',
     101],['alcohol use', 100], ['soda use', 215],['cigarette use', 315],['cellphone use', 317],['internet use', 325],['alcohol use', 108],['soda use', 200]])
    
    df.groupby("statistics").describe()
    

    output:

                     s_values
                     count   mean         std    min     25%    50%     75%    max
    statistics
    alcohol use        2.0  104.0    5.656854  100.0  102.00  104.0  106.00  108.0
    cellphone use      2.0  313.5    4.949747  310.0  311.75  313.5  315.25  317.0
    cigarette use      2.0  207.5  152.027958  100.0  153.75  207.5  261.25  315.0
    internet use       2.0  213.0  158.391919  101.0  157.00  213.0  269.00  325.0
    soda use           2.0  207.5   10.606602  200.0  203.75  207.5  211.25  215.0