Search code examples
pythonpandasdataframecorrelationcrosstab

DataFrame with correlation between several columns based on grouping column


I have a dataframe with the following structure:

|file_id|metric_a|metric_b|metric_c|
|'1.xml'| 1      | 0.5    | 50     |
|'1.xml'| 1.5    | 0.55   | 65     | 
|'2.xml'| 2      | 0.7    | 75     |
|'2.xml'| 2.5    | 0.75   | 80     | 

As result I would like to get this table of correlations between 'metric_c' and each other column:

|file_id|correlation_a_c|correlation_b_c|
|'1.xml'| 0.7           |  0.8          |
|'2.xml'| 0.75          |  0.85         | 

I do it with the following code but it looks more than terrible:

metric_a_vs_metric_c_df = source_df.groupby('file_id')[
                                  ['metric_a', 'metric_c']].corr(method='spearman').iloc[0::2,-1].reset_index().rename(
    columns={'metric_a': 'correlation_a_c'}
)
metric_b_vs_metric_c_df = source_df.groupby('file_id')[
                                 ['metric_b', 'metric_c']].corr(method='spearman').iloc[0::2,-1].reset_index().rename(
    columns={'metric_b': 'correlation_b_c'}
)
joined_df = metric_a_vs_metric_c_df.set_index('file_id').join(metric_b_vs_metric_c_df.set_index('file_id'), lsuffix='_caller', rsuffix='_other')
print(joined_df)

Does it exist the way to do it more readable?


Solution

  • Here's a solution. Since the sample data is naive, so are the results - but it would work with real data as well.

    df = df.groupby("file_id").corr().reset_index().melt(id_vars = ["file_id", "level_1"])
    ac = df[(df.level_1 == "metric_a") & (df.variable == "metric_c")]
    bc = df[(df.level_1 == "metric_b") & (df.variable == "metric_c")]
    df = pd.concat([ac, bc])
    df["metrics"] = df.level_1 + "_" + df.variable
    
    df = pd.pivot_table(df, index="file_id", columns="metrics")
    df.columns = [c[1] for c in df.columns]
    

    The result is:

             metric_a_metric_c  metric_b_metric_c
    file_id                                      
    '1.xml'                1.0                1.0
    '2.xml'                1.0                1.0