I have this df
dataset:
df = pd.DataFrame({'train': {'auc': [0.432, 0.543, 0.523],
'logloss': [0.123, 0.234, 0.345]},
'test': {'auc': [0.456, 0.567, 0.678],
'logloss': [0.321, 0.432, 0.543]}})
Where I'm trying to transform it into this:
And also considering that:
epochs
always have the same order for every cell, but instead of only 3 epochs, it could reach 1.000 or 10.000.f1
instead of logloss
, or val
instead of train
. But no matter the names, in df
each row will always be a metric name, and each column will always be a dataset name.df
could change too. There are some models with 5 datasets, and 7 metrics for example (which would give a df
with 5 columns and 7 rows)datasetname_metricname
So I'm trying to build some generic code transformation where at the same time avoiding brute force transformations. Just if it's helpful, the df
source is:
df = pd.DataFrame(model_xgb.evals_result())
df.columns = ['train', 'test'] # This is the line that can change (and the metrics inside `model_xgb`)
Where model_xgb = xgboost.XGBClassifier(..)
, but after using model_xgb.fit(..)
Here's a generic way to get the result you've specified, irrespective of the number of epochs or the number or labels of rows and columns:
df2 = df.stack().apply(pd.Series)
df2.index = ['_'.join(reversed(x)) for x in df2.index]
df2 = df2.T.assign(epochs=range(1, len(df2.columns) + 1)).set_index('epochs').reset_index()
Output:
epochs train_auc test_auc train_logloss test_logloss
0 1 0.432 0.456 0.123 0.321
1 2 0.543 0.567 0.234 0.432
2 3 0.523 0.678 0.345 0.543
Explanation:
stack()
to convert the input dataframe to a series (of lists) with a multiindex that matches the desired column sequence in the questionapply(pd.Series)
to convert the series of lists to a dataframe with each list converted to a row and with column count equal to the uniform length of the list values in the input series (in other words, equal to the number of epochs)join()
with _
as a separator, then use T
to transpose the dataframe so these index labels (which are the desired column labels) become column labelsassign()
to add a column named epochs
enumerating the epochs beginning with 1
set_index()
followed by reset_index()
to make epochs
the leftmost column.