I am importing parquet files in Databricks using `SparkR` and `sparklyr`.
```r
data1 = SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE)
data1 = sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*")
```
The time difference for import is humongous: 6 seconds for `SparkR` vs. 11 minutes for `sparklyr`!
Is there a way to reduce the time taken in `sparklyr`? I am more familiar with `dplyr` syntax and therefore prefer `sparklyr`.
By default `sparklyr::spark_read_parquet` caches the results (`memory = TRUE`), which forces the data to actually be read and materialized at import time. `SparkR::read.df` is lazy and returns almost immediately, so the two calls are not doing the same amount of work.
Compare the following for cached results:
```r
SparkR::cache(SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE))
sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*")
```
And this for uncached:
```r
SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE)
sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*", memory = FALSE)
```
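If the goal is a fast load while keeping `dplyr` syntax, one option is to read lazily with `memory = FALSE` and cache explicitly only when the table is actually reused. A minimal sketch, assuming an existing connection `sc`; the table name `data1` and the column `some_column` are hypothetical:

```r
library(sparklyr)
library(dplyr)

path <- "dbfs:/.../data202007*"

# Lazy read: only registers the table, so it should return in seconds,
# comparable to SparkR::read.df().
system.time(
  data1 <- spark_read_parquet(sc, name = "data1", path = path, memory = FALSE)
)

# dplyr syntax works against the lazy table as usual;
# `some_column` is a hypothetical column name.
data1 %>%
  group_by(some_column) %>%
  summarise(n = n())

# Cache explicitly once you know the table will be reused;
# this is when the real read cost is paid.
tbl_cache(sc, "data1")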