I'm using sparklyr to read data on my local machine.

What I did:
library(sparklyr)

spark_install()

config <- spark_config()
spark_dir <- "C:/spark"
config$`sparklyr.shell.driver-java-options` <- paste0("-Djava.io.tmpdir=", spark_dir)
config$`sparklyr.shell.driver-memory` <- "4G"
config$`sparklyr.shell.executor-memory` <- "4G"
config$`spark.yarn.executor.memoryOverhead` <- "1g"

sc <- spark_connect(master = "local", config = config)
my_data <- spark_read_csv(sc, name = "my_data", path = "my_data.csv", memory = FALSE)
After it finished, I found a file in the folder C:/spark named

liblz4-java8352426675436067796.so

What is this file?
If I disconnect the Spark connection, this file is still there. Next time, when I want to work on my_data.csv again, do I need to rerun spark_read_csv? It takes a long time just to read the data. Or is there some way I could directly use this file, liblz4-java8352426675436067796.so?
After it finished, I found a file in the folder C:/spark named liblz4-java8352426675436067796.so. What is this file?
The file is a shared library of Java bindings for liblz4, the LZ4 compression library Spark uses. It is not related to your data: lz4-java extracts its native library to the JVM temp directory at runtime, and it landed in C:/spark because you pointed java.io.tmpdir there.
If I disconnect the Spark connection, this file is still there. Next time, when I want to work on my_data.csv again, do I need to rerun spark_read_csv?
Yes, you will have to re-import the data. spark_read_csv creates only temporary bindings (a temporary view), which cannot outlive the corresponding SparkSession.
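A minimal sketch of that lifetime, assuming my_data.csv sits in the working directory: the view registered in one session is not visible from the next.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
spark_read_csv(sc, name = "my_data", path = "my_data.csv", memory = FALSE)
tbl(sc, "my_data")   # works: the temporary view exists in this session
spark_disconnect(sc)

sc <- spark_connect(master = "local")
tbl(sc, "my_data")   # fails: table or view "my_data" not found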
If you want to keep the data, you should create a persistent table using the Hive metastore.
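A minimal sketch of one way to do that from sparklyr, assuming your local Spark build has Hive support enabled (the table name my_data_persistent is illustrative). Note that local mode uses an embedded Derby metastore in the working directory, so the table is only found again if you reconnect from the same directory.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", config = config)
my_data <- spark_read_csv(sc, name = "my_data", path = "my_data.csv", memory = FALSE)

# Save as a persistent table in the metastore; this survives the session
spark_write_table(my_data, name = "my_data_persistent")
spark_disconnect(sc)

# In a later session, load the table without re-parsing the CSV
sc <- spark_connect(master = "local", config = config)
my_data <- tbl(sc, "my_data_persistent")

Writing the data once to a columnar format with spark_write_parquet and re-reading it with spark_read_parquet is a common alternative; it also avoids the slow CSV parse without involving the metastore.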