Search code examples
pythonpandasgroup-bysummary

Pandas fuzzy group summary statistics


I have a data frame defined from a CSV and would like to calculate basic summary statistics e.g. mean, variance, ... for the train part of all the models.

Inserting a model number and grouping by that would work fine - but does not seem to be a good solution. How can I get the summary statistics per model (only for training), as a group_by modelName does not work because of the counter.

df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()

or

df[df.typeOfRun != 'validation'].describe()

do not yield the desired results. pct

AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation

Solution

  • IIUC you can use DataFrameGroupBy.describe:

    print (df.groupby(['modelName', 'typeOfRun']).describe())
    
                                                 f1_R     kappa  
    modelName              typeOfRun                             
    00_testing-validation  validation count  1.000000  1.000000  
                                      mean   0.832731  0.097878  
                                      std         NaN       NaN  
                                      min    0.832731  0.097878  
                                      25%    0.832731  0.097878  
                                      50%    0.832731  0.097878  
                                      75%    0.832731  0.097878  
                                      max    0.832731  0.097878  
    00_testing_1-training  training   count  1.000000  1.000000  
                                      mean   0.831213  0.099879  
                                      std         NaN       NaN  
                                      min    0.831213  0.099879  
                                      25%    0.831213  0.099879  
                                      50%    0.831213  0.099879  
                                      75%    0.831213  0.099879  
                                      max    0.831213  0.099879  
    00_testing_2-training  training   count  1.000000  1.000000  
                                      mean   0.839183  0.101592  
                                      std         NaN       NaN  
    ...
    ...                                  
    

    You can groupby by Series created by split and selected first item of list by str[0]:

    print (df.modelName.str.split('_').str[0])
    0    00
    1    00
    2    00
    3    00
    4    01
    5    01
    6    01
    7    01
    Name: modelName, dtype: object
    
    print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
                        AUC_R  Accuracy  Error;rate  False;negative;rate  \
    modelName                                                              
    00        count  4.000000  4.000000    4.000000             4.000000   
              mean   0.768913  0.722449    0.277551             0.181730   
              std    0.004149  0.004924    0.004924             0.011270   
              min    0.765376  0.718242    0.270494             0.165198   
              25%    0.766368  0.719753    0.276280             0.179275   
              50%    0.767757  0.721024    0.278976             0.186147   
              75%    0.770302  0.723720    0.280247             0.188601   
              max    0.774762  0.729506    0.281758             0.189427   
    01        count  4.000000  4.000000    4.000000             4.000000   
              mean   0.771151  0.733358    0.266642             0.188704   
              std    0.004452  0.003198    0.003198             0.011488   
              min    0.767635  0.730054    0.262587             0.171806   
              25%    0.768771  0.731325    0.264984             0.186674   
              50%    0.769662  0.732983    0.267017             0.192937   
              75%    0.772042  0.735016    0.268675             0.194968   
              max    0.777643  0.737413    0.269946             0.197137   
              ...
              ...