Tags: python-3.x, pandas, dataframe, python-polars

How to quickly find values that also appear in another column in polars


I am looking at polars because the performance of pandas is not adequate for my task, but before committing to it I want to know whether it can meet my requirements.

Now I have a dataframe like this:

Column A   Column B   Column C
v1         v3         x
v2         v1         y

I want to find the values in Column A that can also be found in Column B (like v1 in the table above), and then modify the values of other columns in the same row.

My data sets range in size from about (3e+5, 20) to (10e+5, 20), and I need to perform this search on two of the columns (like colA.value == colB.value in a database operation). The search may be repeated ten to thirty times within one function.

In pandas I learned a solution based on pandas.merge from this question: "speed up my function about build bill of materials with pandas".

Each search across two columns takes about 0.5 s on my computer. Could polars perform this operation faster than pandas? If so, how would I do it?

Thx for any help and suggestions


Solution

  • Benchmarking this polars statement

    df.select(pl.col('a').filter(pl.col('a').is_in(pl.col('b'))))
    

    On the sample df

    import numpy as np
    import polars as pl

    # Two columns of 300,000 random integers each
    df = pl.DataFrame({
      'a': np.random.randint(1, 1_000_000_000, size=300_000),
      'b': np.random.randint(1, 1_000_000_000, size=300_000)
    })
    

    I get an average of 9-10 ms per run.