Tags: r, apache-spark, parquet, sparkr

Is there a basePath data source option in SparkR?


I have an explicitly pruned partition directory structure in S3, which causes the following error when I call read.parquet():

Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths
    s3a://leftout/for/security/dashboard/updateddate=20170217
    s3a://leftout/for/security/dashboard/updateddate=20170218
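
Roughly (paths redacted as above), the read that triggers this passes the partition directories directly:

    # reading the pruned partition directories themselves
    df <- read.parquet(c("s3a://leftout/for/security/dashboard/updateddate=20170217",
                         "s3a://leftout/for/security/dashboard/updateddate=20170218"))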

Further down, the (lengthy) error tells me:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table.

I cannot, however, find any documentation on how to do this using SparkR::read.parquet(...). Does anyone know how to do this in R (with SparkR)?

> version

platform       x86_64-redhat-linux-gnu     
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          2.2                         
year           2015                        
month          08                          
day            14                          
svn rev        69053                       
language       R                           
version.string R version 3.2.2 (2015-08-14)
nickname       Fire Safety       

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.6.0   SparkR_2.0.2      DT_0.2            jsonlite_1.2      shinythemes_1.1.1 ggthemes_3.3.0   
 [7] dplyr_0.5.0       ggplot2_2.2.1     leaflet_1.0.1     shiny_1.0.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9       magrittr_1.5      munsell_0.4.3     colorspace_1.3-2  xtable_1.8-2      R6_2.2.0         
 [7] stringr_1.1.0     plyr_1.8.4        tools_3.2.2       grid_3.2.2        gtable_0.2.0      DBI_0.5-1        
[13] sourcetools_0.1.5 htmltools_0.3.5   yaml_2.1.14       lazyeval_0.2.0    digest_0.6.12     assertthat_0.1   
[19] tibble_1.2        htmlwidgets_0.8   mime_0.5          stringi_1.1.2     scales_0.4.1      httpuv_1.3.3             

Solution

  • In Spark 2.1 or later you can pass basePath as a named argument:

    read.parquet(path, basePath="s3a://leftout/for/security/dashboard/")
    

    Arguments captured by the ellipsis are converted with varargsToStrEnv and passed through to the reader as data source options.
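
    Any other named argument is forwarded the same way, so basePath can be combined with additional Parquet reader options; mergeSchema below is included purely as an illustration of passing several options at once:

    # every named argument after path becomes a data source option
    read.parquet(path,
                 basePath="s3a://leftout/for/security/dashboard/",
                 mergeSchema="true")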

    Full session example:

    • write data (Scala):

      Seq(("a", 1), ("b", 2)).toDF("k", "v")
        .write.partitionBy("k").mode("overwrite").parquet("/tmp/data")
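
      On disk this yields one directory per partition value, something like the following (part file names abbreviated):

      /tmp/data/k=a/part-...snappy.parquet
      /tmp/data/k=b/part-...snappy.parquet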
      
    • read data (SparkR):

       Welcome to
          ____              __ 
         / __/__  ___ _____/ /__ 
        _\ \/ _ \/ _ `/ __/  '_/ 
       /___/ .__/\_,_/_/ /_/\_\   version  2.1.0 
          /_/ 
      
      
       SparkSession available as 'spark'.
      
      > paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE)
      > read.parquet(paths, basePath="/tmp/data")
      
      SparkDataFrame[v:int, k:string]
      

      In contrast, without basePath:

      > read.parquet(paths)
      
      SparkDataFrame[v:int]
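
      Without basePath, Spark treats each input path itself as a table root and never parses the k=... directory names, so the partition column is lost and only v remains. A quick check that the column comes back once basePath is set (a sketch; row order may vary):

      > df <- read.parquet(paths, basePath="/tmp/data")
      > collect(df)
        v k
      1 1 a
      2 2 b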