r, apache-spark, amazon-s3, sparkr, sparklyr

sparklyr: fill `spark_read_parquet` `path` argument from a list


I'd like to process several files in the same s3 bucket in the same way. So, I created a list of the filenames:

dt <- seq(as.Date("2021/12/23"), by = "day", length.out = 5)
dt_ls <- paste0('s3://donuts/date=',dt)

And then I run a for loop over that list:

for (i in seq_along(dt_ls)) {
  df <- spark_read_parquet(sc, "df", path = dt_ls[i])  # read in
  df_tbl <- tbl(sc, "df")                              # convert to a tbl
  # perform whatever operations you like
  rm(df)
}

However, I immediately get one of two errors when trying to assign path = dt_ls[i].

Error in UseMethod("invoke"): no applicable method for 'invoke' applied to an object of class "character"

or:

Error in as.vector(x, "character"): cannot coerce type 'environment' to vector of type 'character'

I see the same errors when running a single line in isolation, e.g.:

tmp <- spark_read_parquet(sc, "tmp", path = dt_ls[1])

My read of these errors is that I cannot pass an s3 filepath saved as an object to spark_read_parquet: because the back end of the command calls invoke, it doesn't resolve to the contents of the list index I've passed. Therefore, I would have to write the path directly into the path argument.

Is that a correct interpretation? Is there a work around so I can automate the opening of all these files?


Solution

    The quotation marks that appear around each character element of the list turned out to be the problem. Removing those quotation marks when passing a list element to the path argument of spark_read_parquet allows the function to run normally.

    So the solution in brief:

    tmp <- spark_read_parquet(sc, "tmp", path = noquote(dt_ls[1]))
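
    Applied to the original loop, that might look like the sketch below. It assumes `sc` is an open Spark connection as in the question; giving each table its own name (`df_1`, `df_2`, ...) is my own addition, so successive reads don't overwrite one another:

    library(sparklyr)

    dt <- seq(as.Date("2021/12/23"), by = "day", length.out = 5)
    dt_ls <- paste0("s3://donuts/date=", dt)

    for (i in seq_along(dt_ls)) {
      # noquote() marks the character element so it is passed without the
      # surrounding quotes that were triggering the invoke error
      df <- spark_read_parquet(sc, paste0("df_", i), path = noquote(dt_ls[i]))
      # ...perform whatever operations you like on df...
      rm(df)  # drops the R handle; the registered Spark table remains
    }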

    And an example of the input causing the issue:

       [1] “s3://donuts/date=2021-12-23”
       [2] “s3://donuts/date=2021-12-24”
       [3] “s3://donuts/date=2021-12-25”
    

    So the filepath passed must resemble:

    [1] s3://donuts/date=2021-12-23
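
    If the stray characters are literal Unicode curly quotes (for example, paths pasted in from a formatted document), another option is to strip them from the strings themselves before reading, rather than suppressing quoting with noquote(). A minimal sketch using base R's gsub():

    # remove left/right double curly quotes (U+201C, U+201D) from each path
    clean_ls <- gsub("[\u201C\u201D]", "", dt_ls)
    tmp <- spark_read_parquet(sc, "tmp", path = clean_ls[1])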