Tags: r, parquet, databricks, sparkr, sparklyr

Difference in time taken to import Parquet files between SparkR and sparklyr


I am importing Parquet files in Databricks using SparkR and sparklyr.

data1 = SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE)

data1 = sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*")

The difference in import time is huge: about 6 seconds for SparkR versus 11 minutes for sparklyr! Is there a way to reduce the time taken by sparklyr? I am more familiar with dplyr syntax, and therefore with sparklyr.


Solution

  • By default, sparklyr::spark_read_parquet caches the table in memory (memory = TRUE), which forces Spark to scan and materialize the data at read time. SparkR::read.df is lazy and returns almost immediately, so the two calls are not doing the same amount of work.

    Compare the following for cached results:

    SparkR::cache(SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE))
    
    sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*")
    

    And this for uncached:

    SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE)
    
    sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*", memory = FALSE)
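
    If you want to keep the dplyr workflow without paying the caching cost up front, one option is to read lazily with memory = FALSE and cache explicitly only once the table turns out to be reused. A minimal sketch, assuming an existing Databricks connection sc; the table name "data1" is illustrative, and the path is the elided one from the question:

    library(sparklyr)
    library(dplyr)

    # Read lazily: Spark registers the files but does not scan them yet
    data1 <- spark_read_parquet(sc, name = "data1",
                                path = "dbfs:/.../data202007*",
                                memory = FALSE)

    # dplyr verbs build a lazy query; the scan only runs on an action
    # such as count() or collect()
    data1 %>% count()

    # Cache on demand once you know the table will be reused;
    # force = TRUE materializes the cache immediately (it runs a count)
    tbl_cache(sc, "data1", force = TRUE)

    This way the expensive scan is paid once, on demand, rather than at read time, and subsequent queries against data1 hit the in-memory cache.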