I have a data frame defined from a CSV and would like to calculate basic summary statistics e.g. mean, variance, ... for the train part of all the models.
Inserting a model number and grouping by that would work fine - but does not seem to be a good solution. How can I get the summary statistics per model (only for training), as a group_by modelName does not work because of the counter.
df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()
or
df[df.typeOfRun != 'validation'].describe()
do not yield the desired results.
AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation
IIUC you can use DataFrameGroupBy.describe
:
print (df.groupby(['modelName', 'typeOfRun']).describe())
f1_R kappa
modelName typeOfRun
00_testing-validation validation count 1.000000 1.000000
mean 0.832731 0.097878
std NaN NaN
min 0.832731 0.097878
25% 0.832731 0.097878
50% 0.832731 0.097878
75% 0.832731 0.097878
max 0.832731 0.097878
00_testing_1-training training count 1.000000 1.000000
mean 0.831213 0.099879
std NaN NaN
min 0.831213 0.099879
25% 0.831213 0.099879
50% 0.831213 0.099879
75% 0.831213 0.099879
max 0.831213 0.099879
00_testing_2-training training count 1.000000 1.000000
mean 0.839183 0.101592
std NaN NaN
...
...
You can groupby
by Series
created by split
and selected first item of list by str[0]
:
print (df.modelName.str.split('_').str[0])
0 00
1 00
2 00
3 00
4 01
5 01
6 01
7 01
Name: modelName, dtype: object
print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
AUC_R Accuracy Error;rate False;negative;rate \
modelName
00 count 4.000000 4.000000 4.000000 4.000000
mean 0.768913 0.722449 0.277551 0.181730
std 0.004149 0.004924 0.004924 0.011270
min 0.765376 0.718242 0.270494 0.165198
25% 0.766368 0.719753 0.276280 0.179275
50% 0.767757 0.721024 0.278976 0.186147
75% 0.770302 0.723720 0.280247 0.188601
max 0.774762 0.729506 0.281758 0.189427
01 count 4.000000 4.000000 4.000000 4.000000
mean 0.771151 0.733358 0.266642 0.188704
std 0.004452 0.003198 0.003198 0.011488
min 0.767635 0.730054 0.262587 0.171806
25% 0.768771 0.731325 0.264984 0.186674
50% 0.769662 0.732983 0.267017 0.192937
75% 0.772042 0.735016 0.268675 0.194968
max 0.777643 0.737413 0.269946 0.197137
...
...