Search code examples
pythonpython-polarsbulkupdate

How can I filter all rows of a polars dataframe that partially match strings in another?


I want to delete all rows of a dataframe that match one or more rows in a filtering dataframe.

Yes, I know about filter by one regex and I also know how join can be leveraged when there is a full match on a column. This isn't a direct match, except through looping the filter dataframe row by row.

It is a relatively trivial problem in sql to apply this filter in bulk, on the server, without looping with client-side code:

given:

data.csv
filename,col2
keep.txt,bar
skip.txt,foo
keep2.txt,zoom
skip3.txt,custom1
discard.txt,custom2
file3.txt,custom3
discard2.txt,custom4
file4.txt,custom5
filter.csv:
skip
discard
skip

Here's the sql using postgres. It will, and that is the key point here, scale very well.

withsql.sql
\c test;

DROP TABLE IF EXISTS data;
DROP TABLE IF EXISTS filter;

CREATE TABLE data (
    filename CHARACTER(50),
    col2 CHARACTER(10),
    skip BOOLEAN DEFAULT FALSE
);

\copy data (filename,col2) FROM './data.csv' WITH (FORMAT CSV);

CREATE TABLE filter (
    skip VARCHAR(20)
);

\copy filter FROM './filter.csv' WITH (FORMAT CSV);

update filter set skip = skip || '%';

update data set skip = TRUE where exists (select 1 from filter s where filename like s.skip);
delete from data where skip = TRUE;

select * from data;

psql -f withsql.sql

this gives as output:

You are now connected to database "test" as user "djuser".
...
UPDATE 4
DELETE 4
                      filename                      |    col2    | skip 
----------------------------------------------------+------------+------
 filename                                           | col2       | f
 keep.txt                                           | bar        | f
 keep2.txt                                          | zoom       | f
 file3.txt                                          | custom3    | f
 file4.txt                                          | custom5    | f
(5 rows)

Now, I can do with polars, but the only thing I can think of is using a loop on the filter.csv:

withpolars.py
import polars as pl

df_data = pl.read_csv("data.csv")
df_filter = pl.read_csv("filter.csv")

for row in df_filter.iter_rows():
    df_data = df_data.filter(~pl.col('filename').str.contains(row[0]))

print("data after:\n", df_data)

The output is correct, but I do this without looping, somehow? And... just curious how some of these bulk sql approaches map to dataframes.

data after:
 shape: (4, 2)
┌───────────┬─────────┐
│ filename  ┆ col2    │
│ ---       ┆ ---     │
│ str       ┆ str     │
╞═══════════╪═════════╡
│ keep.txt  ┆ bar     │
│ keep2.txt ┆ zoom    │
│ file3.txt ┆ custom3 │
│ file4.txt ┆ custom5 │
└───────────┴─────────┘


Solution

  • There is a dedicated partial/substring matching function:

    df_data.filter(
        pl.col("filename")
          .str.contains_any(df_filter.get_column("skip"))
          .not_()
    )
    
    shape: (4, 2)
    ┌───────────┬─────────┐
    │ filename  ┆ col2    │
    │ ---       ┆ ---     │
    │ str       ┆ str     │
    ╞═══════════╪═════════╡
    │ keep.txt  ┆ bar     │
    │ keep2.txt ┆ zoom    │
    │ file3.txt ┆ custom3 │
    │ file4.txt ┆ custom5 │
    └───────────┴─────────┘
    

    As for your current approach, I think the issue is calling .filter each time.

    Instead, you would build a single expression and call it once. (allowing Polars to parallelize the work)

    df_data.filter(
       pl.all_horizontal(
          pl.col("filename").str.contains(row).not_()
          for row in df_filter.get_column("skip")
       )
    )