I have a Polars dataframe:
import polars as pl

data = {
    "col1": ["a", "b", "c", "d"],
    "col2": [[-0.06066, 0.072485, 0.548874, 0.158507],
             [-0.536674, 0.10478, 0.926022, -0.083722],
             [-0.21311, -0.030623, 0.300583, 0.261814],
             [-0.308025, 0.006694, 0.176335, 0.533835]],
}
df = pl.DataFrame(data)
I want to calculate the cosine similarity between the col2 vectors for every pair of values in col1.
The desired output should be the following:
┌─────────────────┬──────┬──────┬──────┬──────┐
│ col1_col2 ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════════════╪══════╪══════╪══════╪══════╡
│ a ┆ 1.0 ┆ 0.86 ┆ 0.83 ┆ 0.54 │
│ b ┆ 0.86 ┆ 1.0 ┆ 0.75 ┆ 0.41 │
│ c ┆ 0.83 ┆ 0.75 ┆ 1.0 ┆ 0.89 │
│ d ┆ 0.54 ┆ 0.41 ┆ 0.89 ┆ 1.0 │
└─────────────────┴──────┴──────┴──────┴──────┘
Each value is the cosine similarity between the col2 lists of the corresponding row and column labels.
I tried to use the pivot method:
df.pivot(values="col2", index="col1", columns="col1", aggregate_function=cosine_similarity)
However, I'm getting the following error:
'function' object has no attribute '_pyexpr'
I'm using the following cosine similarity function:
from numpy.linalg import norm
cosine_similarity = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
However, I can use any other implementation of it.
You could cross join and filter to get the pairs (i.e. the equivalent of itertools.combinations_with_replacement(..., r=2)).
And use expressions for the similarity calculation:
x = pl.col("col2").flatten()
y = pl.col("col2_right").flatten()
row = pl.first().cum_count()
cosine_similarity = (
    x.dot(y) / (x.pow(2).sum().sqrt() * y.pow(2).sum().sqrt())
).over(row)
(df.join(df, how="cross")
   .filter(pl.col("col1") <= pl.col("col1_right"))
   .select(
       col="col1",
       other="col1_right",
       cosine=cosine_similarity,
   )
)
shape: (10, 3)
┌─────┬───────┬──────────┐
│ col ┆ other ┆ cosine │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╡
│ a ┆ a ┆ 1.0 │
│ a ┆ b ┆ 0.856754 │
│ a ┆ c ┆ 0.827877 │
│ a ┆ d ┆ 0.540282 │
│ b ┆ b ┆ 1.0 │
│ b ┆ c ┆ 0.752199 │
│ b ┆ d ┆ 0.411564 │
│ c ┆ c ┆ 1.0 │
│ c ┆ d ┆ 0.889009 │
│ d ┆ d ┆ 1.0 │
└─────┴───────┴──────────┘
You can then .pivot the result if desired.
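As a cross-check on the numbers above, here is a minimal pure-Python sketch of the same idea: build the pairs with itertools.combinations_with_replacement, compute the cosine for each pair, and mirror each value to fill the full square table. It does not use Polars. The names `vectors`, `cosine` and `matrix` are made up for this illustration, and the nested dict only stands in for the pivoted frame.

```python
from itertools import combinations_with_replacement
from math import sqrt

# Same data as in the question, keyed by col1 (illustrative layout).
vectors = {
    "a": [-0.06066, 0.072485, 0.548874, 0.158507],
    "b": [-0.536674, 0.10478, 0.926022, -0.083722],
    "c": [-0.21311, -0.030623, 0.300583, 0.261814],
    "d": [-0.308025, 0.006694, 0.176335, 0.533835],
}


def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


keys = sorted(vectors)
matrix = {k: {} for k in keys}

# Only the pairs with left <= right are computed (the "cross join + filter"
# step); each value is then mirrored, since cosine similarity is symmetric.
for left, right in combinations_with_replacement(keys, 2):
    value = cosine(vectors[left], vectors[right])
    matrix[left][right] = value
    matrix[right][left] = value

# Print one row per label, rounded like the desired output in the question.
for k in keys:
    print(k, [round(matrix[k][other], 2) for other in keys])
```

Mirroring is also what you would need after a .pivot of the 10-row result, because the filter keeps only the upper triangle and the lower-triangle cells would otherwise be null.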