Tags: python, python-polars

Correlation of one column with all other numeric columns


Starting with

import polars as pl
df = pl.DataFrame({
    'a': [1,2,3],
    'b': [4.,2.,6.],
    'c': ['w', 'a', 'r'],
    'd': [4, 1, 1]
})

How can I get the correlation between column a and all other numeric columns?

Equivalent in pandas:

In [30]: (
    ...:     pd.DataFrame({
    ...:         'a': [1,2,3],
    ...:         'b': [4.,2.,6.],
    ...:         'c': ['w', 'a', 'r'],
    ...:         'd': [4, 1, 1]
    ...:     })
    ...:     .corr(numeric_only=True)
    ...:     .loc['a']
    ...: )
Out[30]:
a    1.000000
b    0.500000
d   -0.866025
Name: a, dtype: float64

I've tried

(
    df.select(pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64))
    .select(pl.corr('a', pl.exclude('a')))
)

but got

DuplicateError: the name 'a' is duplicate

Solution

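  • Polars has DataFrame.corr(), which returns the full correlation matrix; you can then filter it down to column a. The result has no row labels: rows follow the same order as the columns (here a, d, b).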

    df.select(
        pl.col(pl.Int64).cast(pl.Float64), 
        pl.col(pl.Float64)
    ).corr()
    
    shape: (3, 3)
    ┌───────────┬───────────┬─────┐
    │ a         ┆ d         ┆ b   │
    │ ---       ┆ ---       ┆ --- │
    │ f64       ┆ f64       ┆ f64 │
    ╞═══════════╪═══════════╪═════╡
    │ 1.0       ┆ -0.866025 ┆ 0.5 │
    │ -0.866025 ┆ 1.0       ┆ 0.0 │
    │ 0.5       ┆ 0.0       ┆ 1.0 │
    └───────────┴───────────┴─────┘