r apache-arrow

Proper Syntax for Filtering Expressions for Arrow Datasets in R


I am attempting to use the (relatively recently implemented) Dataset API of the arrow package to read a directory of files into memory, and to leverage the C++ back-end to filter rows and columns. I would like to use the arrow package functions directly, not the wrapper functions for dplyr-style verbs. These functions are very early in their lifecycle as of today, so I'm having a hard time tracking down examples that illustrate the syntax.

In order to understand the syntax, I have created a very minimal example for testing. The first two queries work as expected.

library(arrow) ## version 4.0.0

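## write.csv() does not create missing directories, so make sure it exists first
dir.create("ArrowTest_mtcars", showWarnings = FALSE)
write.csv(mtcars, "ArrowTest_mtcars/mtcars.csv")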
## Define a dataset object
DS <- arrow::open_dataset(sources = "ArrowTest_mtcars", format = "text")

## Generate a basic scanner 
AT <- DS$NewScan()$UseThreads()$Finish()$ToTable()
head(as.data.frame(AT), n = 3)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

## Generate a basic scanner with projection to select columns
AT <- DS$NewScan()$UseThreads()$Project(c("mpg","cyl"))$Finish()$ToTable()
head(as.data.frame(AT), n = 3)
##    mpg cyl
## 1 21.0   6
## 2 21.0   6
## 3 22.8   4

However, I have not yet been able to figure out the proper syntax to implement a filtering expression. I've tried a number of things, but my best guess still isn't working, and causes a segfault when I execute the Filt <- Expression$create(...) line.

## Generate a basic scanner with filtering where column `cyl` = 6    
## My best guess at what might work, but causes a segfault instead
Filt <- Expression$create("==",args = list(Expression$field_ref("cyl"), Scalar$create(6L)))

AT <- DS$NewScan()$UseThreads()$Filter(Filt)$Finish()$ToTable()
head(as.data.frame(AT))

What is the proper syntax to implement row-based filtering?


Solution

  • The documentation is quite awful on this, but a bit of trial and error got me something that might lead you to the right answer. The problems I found were with Scalar$create and with knowing which function name to use:

    Filt = Expression$create('or', 
                             args = list(Expression$field_ref("cyl") == 6L, 
                                         Expression$field_ref('cyl') == 4L))
    
    AT <- DS$NewScan()$UseThreads()$Filter(Filt)$Finish()$ToTable()
    head(as.data.frame(AT))
    

    However, note that for a single condition, using Expression$field_ref(...) == x works directly in Filter():

    AT <- DS$NewScan()$UseThreads()$Filter(Expression$field_ref("cyl") == 6L)$Finish()$ToTable()
    head(as.data.frame(AT))
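
For completeness, here is a minimal sketch of the explicit Expression$create() form next to the operator form. It rests on assumptions that are not confirmed in the question: that the function name has to be an Arrow compute function such as "equal" rather than the R operator "==", that every element of args has to be an Expression (so the literal is wrapped with Expression$scalar() instead of Scalar$create()), and that `|` is overloaded for Expression objects in the same way `==` is. The ds_dir scratch directory is hypothetical, and the CSV is written without row names and read with format = "csv" to keep the sketch simple.

```r
library(arrow)

# Hypothetical scratch directory, used only for this sketch
ds_dir <- file.path(tempdir(), "ArrowTest_mtcars")
dir.create(ds_dir, showWarnings = FALSE)
write.csv(mtcars, file.path(ds_dir, "mtcars.csv"), row.names = FALSE)

DS <- open_dataset(sources = ds_dir, format = "csv")

cyl <- Expression$field_ref("cyl")

# Explicit form: compute function name "equal" (assumed), with both
# arguments supplied as Expression objects
filt_eq <- Expression$create("equal", args = list(cyl, Expression$scalar(6L)))

# Operator form: comparisons combined with `|` (assumed to be overloaded)
filt_or <- (cyl == 6L) | (cyl == 4L)

six <- as.data.frame(
  DS$NewScan()$UseThreads()$Filter(filt_eq)$Finish()$ToTable()
)
four_or_six <- as.data.frame(
  DS$NewScan()$UseThreads()$Filter(filt_or)$Finish()$ToTable()
)
```

If these assumptions hold for your arrow version, six should contain only the rows with cyl equal to 6, and four_or_six the rows with cyl equal to 4 or 6. The operator form is shorter and avoids having to look up compute function names.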