I am struggling with polars. I have a DataFrame and a NumPy array, and I would like to subtract the array from every row of the DataFrame.
import polars as pl
import numpy as np
df = pl.DataFrame(np.random.randn(6, 4), schema=['#', 'x', 'y', 'z'])
arr = np.array([-10, -20, -30])
df.select(
pl.col(r'^(x|y|z)$') # ^[xyz]$
).map_rows(
lambda x: np.array(x) - arr
)
# ComputeError: expected tuple, got ndarray
But if I calculate the norm, for example, it works:
df.select(
pl.col(r'^(x|y|z)$')
).map_rows(
lambda x: np.sum((np.array(x) - arr)**2)**0.5
)
shape: (6, 1)
┌───────────┐
│ map │
│ --- │
│ f64 │
╞═══════════╡
│ 38.242255 │
│ 37.239545 │
│ 38.07624 │
│ 36.688312 │
│ 38.419194 │
│ 36.262196 │
└───────────┘
# check if it is correct:
np.sum((df.to_pandas()[['x', 'y', 'z']].to_numpy() - arr)**2, axis=1) ** 0.5
>>> array([38.24225488, 37.23954478, 38.07623986, 36.68831161, 38.41919409,
36.2621962 ])
In pandas one can do it like this:
df.to_pandas()[['x', 'y', 'z']] - arr
x y z
0 10.143819 21.875335 29.682364
1 10.360651 21.116404 28.871060
2 9.777666 20.846593 30.325185
3 9.394726 19.357053 29.716592
4 9.223525 21.618511 30.390805
5 9.751234 21.667080 27.393393
One way that works is to do it for each column separately. But that means a lot of repeated code, especially as the number of columns grows:
df.select(
pl.col('x') - arr[0], pl.col('y') - arr[1], pl.col('z') - arr[2]
)
There are a few things going on in this question.
The first is that you really, really don't want to use map_rows unless you need a custom Python function. As the documentation warns: "This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise."
There's not really a polars way to do what you want. When polars sees pl.col(r'^(x|y|z)$').expr, it identifies each column that matches the regex, and then a thread does the work of whatever the rest of the expression is for each of those columns. The expression doesn't know its position among the matched columns. It only knows what its data is and what it's supposed to do. Therefore, there's nothing you can put in the expr for it to know which element of the array to access.
To get what you want, you have to do something like what @ignoring_gravity had, but you can use the re module to keep the regex-based column selection.
import re
df.select(pl.col(col)-arr[i]
for i, col in enumerate(filter(re.compile(r'^(x|y|z)$').match, df.columns)))