apache-spark, databricks, sparkr

Unable to subset data in SparkR when chaining commands with the pipe operator


I'm working with data that looks like this: [image: dataFrame]

The command I'm running is:

library(magrittr)

# Subset the data for macOS and sort by event_timestamp.
macDF <- eventsDF %>% 
  SparkR::select("device", "event_timestamp") %>%
  SparkR::filter("device = macOS") %>%
  SparkR::arrange("event_timestamp")

display(macDF)

And the error I get is:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'arrange': unable to find an inherited method for function ‘filter’ for signature ‘"character", "missing"’

Any help would be appreciated. Thanks!


Solution

  • I couldn't reproduce your exact error, but I created an example eventsDF data frame in R, converted it to a Spark DataFrame, and adjusted your code slightly.

    Here's an example in the style you started with. Note the call to SparkR::expr, which lets you provide a SQL expression that Spark puts in the WHERE clause it builds. Because the condition is a SQL expression, macOS has to be quoted as a string literal; unquoted, it would be read as a column name:

    library(magrittr)
    
    # Build a small local data frame and convert it to a Spark DataFrame.
    eventsDF <- data.frame(
      device = c("macOS", "redhat", "macOS"),
      event_timestamp = strptime(
        c('2022-01-13 12:19', '2021-11-14 08:02', '2021-12-01 21:33'),
        format = "%Y-%m-%d %H:%M"
      )
    ) %>%
      SparkR::as.DataFrame()
    
    macDF <- eventsDF %>%
      SparkR::select(eventsDF$device, eventsDF$event_timestamp) %>%
      SparkR::filter(SparkR::expr("device='macOS'")) %>%
      SparkR::arrange('event_timestamp')

    # Display separately so that macDF keeps the sorted Spark DataFrame.
    display(macDF)
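
    As a side note, the SparkR documentation describes the condition argument of filter as either a Column expression or a string containing a SQL statement, so the expr() wrapper is not the only option. The sketch below compares the two forms on the same toy data. It assumes a local Spark installation (on Databricks a session already exists), and it uses collect() instead of display() so it can run outside a notebook; the sample values are illustrative only.

```r
library(SparkR)

# Assumption: Spark is installed locally. On Databricks the session already
# exists and this call is not needed.
sparkR.session(master = "local[1]")

# Toy stand-in for eventsDF (made-up values, as in the example above).
localDF <- data.frame(
  device = c("macOS", "redhat", "macOS"),
  event_timestamp = as.POSIXct(
    c("2022-01-13 12:19", "2021-11-14 08:02", "2021-12-01 21:33"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  ),
  stringsAsFactors = FALSE
)
eventsDF <- as.DataFrame(localDF)

# Form 1: SQL string condition. The literal is quoted inside the string;
# an unquoted macOS would be parsed as a column name.
bySql <- filter(eventsDF, "device = 'macOS'")

# Form 2: Column expression. No SQL parsing, so ordinary R quoting applies.
byColumn <- filter(eventsDF, eventsDF$device == "macOS")

# Sort ascending by timestamp and bring the rows back into a local data.frame.
macDF <- collect(arrange(bySql, "event_timestamp"))
```

    Both forms should select the same rows. The Column form tends to be easier to get right because the comparison value is a normal R string rather than a literal embedded in SQL.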
    

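    How I might do it, attaching SparkR after dplyr so that SparkR's select, filter, and arrange are the ones called without a prefix (dplyr is loaded here only for %>%):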

    library(dplyr)
    library(SparkR)
    
    # Build a small local data frame and convert it to a Spark DataFrame.
    eventsDF <- data.frame(
      device = c("macOS", "redhat", "macOS"),
      event_timestamp = strptime(
        c('2022-01-13 12:19', '2021-11-14 08:02', '2021-12-01 21:33'),
        format = "%Y-%m-%d %H:%M"
      )
    ) %>%
      as.DataFrame()
    
    macDF <- eventsDF %>%
      select(c('device', 'event_timestamp')) %>%
      filter(eventsDF$device == 'macOS') %>%
      arrange('event_timestamp')

    # Display separately so that macDF keeps the sorted Spark DataFrame.
    display(macDF)
    

    Results: [screenshot of the filtered, sorted eventsDF]