r stringr grepl apache-arrow

Filter expression not supported for Arrow Datasets


I'm using the arrow package in R. I need to filter rows by string content: for example, out of 700 million rows I want only those that contain "Walmart". When I try, I get the error below.

    FileSystemDataset with 2886 Parquet files
    DatoID: int32
    BanktransaksjonID: int64
    PosteringstypeID: int64
    ForretningskategoriID: int64
    KundeID: string
    KortKildeID: string
    TransaksjonDTM: timestamp[ns]
    Posteringstekst: string
    Kontovaluta: string
    Kontobelop: decimal(38, 18)
    Transaksjonsvaluta: string
    Forretningskategori: string
    Forretningsnr: string
    Posteringstypekode: string
    Posteringstype: string
    KortFlagg: string
    year: int32
    month: int32
    + system.time(ds %>%
    +               filter(year==2019, month=1,  grep('^cl\\.+', Posteringstekst, value=TRUE)) %>%
    +               select(Kontobelop)%>%
    +               collect() %>%
    +               summarise(
    +                 mean = mean(abs(Kontobelop)),
    +                 n = n()) %>%
    +               print())
    Error: Filter expression not supported for Arrow Datasets: grep("^cl\\.+", Posteringstekst, value = TRUE)
    Call collect() first to pull data into R.
    Timing stopped at: 0.01 0 0.01

I have also tried stringr, with the same result. Maybe it is as simple as the error says, "Filter expression not supported". If so, when will these be supported?

Would an SQL-like way to query datasets be a better option in the future?


Solution

  • The error message is correct: as of version 3.0.0, the arrow R package does not support string functions like grep for filtering datasets. https://issues.apache.org/jira/browse/ARROW-10305 is the issue tracking that feature. We hope to have these string functions implemented in the next release (4.0.0).
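
Until string filters are supported, the workaround the error message itself suggests can be sketched roughly as follows: let Arrow apply the filters it does support (here the `year` and `month` partition columns), call `collect()`, and only then apply the string match in R. This is a minimal sketch, not a tuned solution. The toy data frame, the temporary path, and the `"Walmart"` pattern are assumptions made for illustration, and the approach assumes the partition subset fits in memory. Note also that `filter()` needs a logical vector, so the sketch uses `grepl()` rather than `grep(..., value = TRUE)`, and the comparison on `month` is written with `==`.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the real 2886-file dataset.
toy <- data.frame(
  year = c(2019L, 2019L, 2019L, 2020L),
  month = c(1L, 1L, 2L, 1L),
  Posteringstekst = c("Walmart 123", "Other shop", "Walmart 456", "Walmart 789"),
  Kontobelop = c(-10, -20, -30, -40),
  stringsAsFactors = FALSE
)

# Write it as a partitioned Parquet dataset in a temporary directory.
path <- tempfile()
write_dataset(toy, path, format = "parquet", partitioning = c("year", "month"))
ds <- open_dataset(path)

result <- ds %>%
  # Supported by Arrow: comparisons on the partition columns.
  filter(year == 2019, month == 1) %>%
  # Keep only the columns needed after collecting.
  select(Posteringstekst, Kontobelop) %>%
  # Pull the reduced subset into R as a data frame.
  collect() %>%
  # Now in R: grepl() returns the logical vector that filter() expects.
  filter(grepl("Walmart", Posteringstekst, fixed = TRUE)) %>%
  summarise(
    mean = mean(abs(Kontobelop)),
    n = n()
  )

print(result)
```

Filtering on the partition columns before `collect()` matters here because it limits how many rows are read into R; the string match then runs only on that subset.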