Polars subtract numpy 1xn array from n columns


I am struggling with Polars. I have a DataFrame and a NumPy array, and I would like to subtract the array from the DataFrame's columns, one array element per column.

import polars as pl
import numpy as np

df = pl.DataFrame(np.random.randn(6, 4), schema=['#', 'x', 'y', 'z'])

arr = np.array([-10, -20, -30])


df.select(
    pl.col(r'^(x|y|z)$')  # equivalent to ^[xyz]$
).map_rows(
    lambda x: np.array(x) - arr
)

# ComputeError: expected tuple, got ndarray

But if I try to calculate the norm for example, then it works:

df.select(
    pl.col(r'^(x|y|z)$')
).map_rows(
    lambda x: np.sum((np.array(x) - arr)**2)**0.5
)
shape: (6, 1)
┌───────────┐
│ map       │
│ ---       │
│ f64       │
╞═══════════╡
│ 38.242255 │
│ 37.239545 │
│ 38.07624  │
│ 36.688312 │
│ 38.419194 │
│ 36.262196 │
└───────────┘

# check if it is correct:
np.sum((df.to_pandas()[['x', 'y', 'z']].to_numpy() - arr)**2, axis=1) ** 0.5
>>> array([38.24225488, 37.23954478, 38.07623986, 36.68831161, 38.41919409,
       36.2621962 ])

In pandas one can do it like this:

df.to_pandas()[['x', 'y', 'z']] - arr

           x          y          z
0  10.143819  21.875335  29.682364
1  10.360651  21.116404  28.871060
2   9.777666  20.846593  30.325185
3   9.394726  19.357053  29.716592
4   9.223525  21.618511  30.390805
5   9.751234  21.667080  27.393393

One way that works is to handle each column separately. But that means a lot of repeated code, especially as the number of columns increases:

df.select(
    pl.col('x') - arr[0], pl.col('y') - arr[1], pl.col('z') - arr[2]
)

Solution

  • There are a few things going on in this question.

    The first is that you really don't want to use map_rows unless you need a custom Python function that cannot be written with expressions. As the Polars documentation for map_rows warns:

    This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.

    There's not really a way to do what you want inside a single Polars expression. When Polars sees pl.col(r'^(x|y|z)$') followed by the rest of an expression, it identifies each column that matches the regex and expands the expression into one copy per column, each of which can be evaluated on its own thread. An expanded expression doesn't know its position in that order. It only knows what its data is and what it's supposed to do. Therefore, there's nothing you can put in the expression that tells it which element of the array to access.

    To get what you want, you have to build one expression per column yourself, as @ignoring_gravity did, but you can use the re module to pick the columns by regex.

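    import re
    pattern = re.compile(r'^(x|y|z)$')
    # One expression per matching column, paired with its array element by position.
    df.select(pl.col(col) - arr[i]
              for i, col in enumerate(filter(pattern.match, df.columns)))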