Starting with
import polars as pl
df = pl.DataFrame({
    'a': [1,2,3],
    'b': [4.,2.,6.],
    'c': ['w', 'a', 'r'],
    'd': [4, 1, 1]
})
how can I get the correlation between a and all other numeric columns?
Equivalent in pandas:
In [30]: (
    ...:     pd.DataFrame({
    ...:         'a': [1,2,3],
    ...:         'b': [4.,2.,6.],
    ...:         'c': ['w', 'a', 'r'],
    ...:         'd': [4, 1, 1]
    ...:     })
    ...:     .corr(numeric_only=True)  # needed in pandas >= 2.0 to skip the string column 'c'
    ...:     .loc['a']
    ...: )
Out[30]:
a    1.000000
b    0.500000
d   -0.866025
Name: a, dtype: float64
I've tried
(
    df.select(pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64))
    .select(pl.corr('a', pl.exclude('a')))
)
but got
DuplicateError: the name 'a' is duplicate
There is a DataFrame.corr() method, which returns the full correlation matrix; you could then filter it down to the row for a.
df.select(
    pl.col(pl.Int64).cast(pl.Float64),
    pl.col(pl.Float64)
).corr()
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ a         ┆ d         ┆ b   │
│ ---       ┆ ---       ┆ --- │
│ f64       ┆ f64       ┆ f64 │
╞═══════════╪═══════════╪═════╡
│ 1.0       ┆ -0.866025 ┆ 0.5 │
│ -0.866025 ┆ 1.0       ┆ 0.0 │
│ 0.5       ┆ 0.0       ┆ 1.0 │
└───────────┴───────────┴─────┘
└───────────┴───────────┴─────┘