I have an explicitly pruned partition structure in S3, which causes the following error when I call read.parquet():
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths
s3a://leftout/for/security/dashboard/updateddate=20170217
s3a://leftout/for/security/dashboard/updateddate=20170218
Further down, the (lengthy) error message tells me:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table.
I cannot, however, find any documentation on how to do this using SparkR::read.parquet(...). Does anyone know how to do this in R (with SparkR)?
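For reference, the failing call presumably looks something like this (passing the partition directories from the error message directly; the exact paths are illustrative):

# Passing partition directories as input paths, with no basePath --
# this is the shape of call that triggers the assertion above
df <- read.parquet(c("s3a://leftout/for/security/dashboard/updateddate=20170217",
                     "s3a://leftout/for/security/dashboard/updateddate=20170218"))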
> version
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
version.string R version 3.2.2 (2015-08-14)
nickname Fire Safety
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.6.0 SparkR_2.0.2 DT_0.2 jsonlite_1.2 shinythemes_1.1.1 ggthemes_3.3.0
[7] dplyr_0.5.0 ggplot2_2.2.1 leaflet_1.0.1 shiny_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 magrittr_1.5 munsell_0.4.3 colorspace_1.3-2 xtable_1.8-2 R6_2.2.0
[7] stringr_1.1.0 plyr_1.8.4 tools_3.2.2 grid_3.2.2 gtable_0.2.0 DBI_0.5-1
[13] sourcetools_0.1.5 htmltools_0.3.5 yaml_2.1.14 lazyeval_0.2.0 digest_0.6.12 assertthat_0.1
[19] tibble_1.2 htmlwidgets_0.8 mime_0.5 stringi_1.1.2 scales_0.4.1 httpuv_1.3.3
In Spark 2.1 or later you can pass basePath as a named argument:

read.parquet(path, basePath="s3a://leftout/for/security/dashboard/")

Arguments captured by the ellipsis are converted with varargsToStrEnv and used as data source options.
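Applied to the paths from the question (assuming s3a://leftout/for/security/dashboard/ is the table root, as the error output suggests), the fixed call would look like:

# basePath marks the table root, so partition discovery rebuilds the
# updateddate column from the directory names instead of raising the
# "Conflicting directory structures" assertion
df <- read.parquet(c("s3a://leftout/for/security/dashboard/updateddate=20170217",
                     "s3a://leftout/for/security/dashboard/updateddate=20170218"),
                   basePath="s3a://leftout/for/security/dashboard/")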
Full session example:
write data (Scala):
Seq(("a", 1), ("b", 2)).toDF("k", "v")
.write.partitionBy("k").mode("overwrite").parquet("/tmp/data")
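The same data could also be written from SparkR; this is a sketch assuming SparkR 2.3 or later, where write.df accepts a partitionBy argument (in 2.1 the Scala snippet above does the job):

# Create a two-row SparkDataFrame and write it partitioned by k,
# producing /tmp/data/k=a/... and /tmp/data/k=b/...
df <- createDataFrame(data.frame(k = c("a", "b"), v = c(1L, 2L)))
write.df(df, path = "/tmp/data", source = "parquet",
         mode = "overwrite", partitionBy = "k")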
read data (SparkR):
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
SparkSession available as 'spark'.
> paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE)
> read.parquet(paths, basePath="/tmp/data")
SparkDataFrame[v:int, k:string]
In contrast, without basePath:
> read.parquet(paths)
SparkDataFrame[v:int]
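Without basePath, each provided path is treated as its own table root, so partition discovery never sees the k=... directories and the partition column is dropped from the schema. You can confirm the difference with printSchema:

# With basePath the partition column k is recovered from the paths
df <- read.parquet(paths, basePath = "/tmp/data")
printSchema(df)  # lists both v (int) and k (string)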