Tags: r, apache-spark, sparklyr

Sparklyr, spark_read_csv, do we have to re-import data every time?


I'm using sparklyr to read data on my local machine.

What I did:

library(sparklyr)

spark_install()

config <- spark_config()
spark_dir <- "C:/spark"

# Use C:/spark for Spark's temporary files and give the driver/executor 4 GB each
config$`sparklyr.shell.driver-java-options` <- paste0("-Djava.io.tmpdir=", spark_dir)
config$`sparklyr.shell.driver-memory` <- "4G"
config$`sparklyr.shell.executor-memory` <- "4G"
config$`spark.yarn.executor.memoryOverhead` <- "1g"

sc <- spark_connect(master = "local", config = config)

# memory = FALSE registers the CSV without caching it in Spark memory
my_data <- spark_read_csv(sc, name = "my_data", path = "my_data.csv", memory = FALSE)

After it finished, in the folder C:/Spark I found a file named liblz4-java8352426675436067796.so

What's this file?

If I disconnect the Spark connection, this file is still there. Next time I want to work on my_data.csv, do I need to rerun spark_read_csv? It takes a long time just to read the data.

Or is there some way I could directly use this file liblz4-java8352426675436067796.so?


Solution

  • After it finished, in the folder C:/Spark I found a file named liblz4-java8352426675436067796.so

    What's this file?

    The file is a shared library of Java bindings for liblz4. It is not related to your data.

    If I disconnect the Spark connection, this file is still there. Next time I want to work on my_data.csv, do I need to rerun spark_read_csv?

    Yes, you will have to re-import the data. spark_read_csv creates only a temporary binding, which cannot outlive the corresponding SparkSession.

    If you want to keep the data, you should create a persistent table using the Hive metastore, as in the sketch below.
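
    A minimal sketch of that approach, reusing the connection settings from the question. It assumes the default local metastore that Spark creates in the working directory, and the table name my_data_persistent is only an illustration. spark_write_table() saves the imported data through the metastore, so a later session can load the table without re-parsing the CSV:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local", config = config)

    # Import once, then persist the data as a table in the metastore
    my_data <- spark_read_csv(sc, name = "my_data", path = "my_data.csv", memory = FALSE)
    spark_write_table(my_data, name = "my_data_persistent", mode = "overwrite")

    spark_disconnect(sc)

    # In a later session the temporary view "my_data" is gone,
    # but the persistent table is still available:
    sc <- spark_connect(master = "local", config = config)
    src_tbls(sc)                              # lists "my_data_persistent"
    my_data <- tbl(sc, "my_data_persistent")

    After the first run, only the reconnect and the tbl() call are needed to get the data back as a Spark table.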