Tags: sparklyr, mclapply

mclapply and spark_read_parquet


I am relatively new as an active user of this forum, but first I have to thank you all for your contributions, because I have been finding answers here for years.

Today I have a question that I haven't been able to find an answer to anywhere.

I am trying to read files in parallel from S3 (AWS) into Spark (on my local computer) as part of a test system. I have used mclapply, but when I set more than 1 core, it fails.

Example (the same code works with one core, but fails with 2):

new_rdd_global <- mclapply(seq(file_paths), function(i) {
  spark_read_parquet(sc, name = paste0("rdd_", i), path = file_paths[i])
}, mc.cores = 1)

new_rdd_global <- mclapply(seq(file_paths), function(i) {
  spark_read_parquet(sc, name = paste0("rdd_", i), path = file_paths[i])
}, mc.cores = 2)

Warning message:
In mclapply(seq(file_paths), function(i) { :
  all scheduled cores encountered errors in user code

Any suggestions?

Thanks in advance.


Solution

  • Just read everything into one table with a single spark_read_parquet() call; that way Spark handles the parallelization for you. If you need separate tables, you can split them afterwards, assuming there is a column that tells you which file each row came from. In general you should not need mclapply() when using Spark from R. A sketch of this approach is shown below.
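
A minimal sketch of that suggestion: one spark_read_parquet() call over a wildcard path, then an optional split into per-file tables. The s3a:// path and the source_file column are placeholders for illustration, not names from the original post.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read every parquet file under a common prefix in a single call;
# Spark distributes the read across its own workers.
all_data <- spark_read_parquet(
  sc,
  name = "all_data",
  path = "s3a://my-bucket/data/*.parquet"  # placeholder bucket/prefix
)

# Optional: split into one Spark table per source file afterwards,
# assuming a column (hypothetically "source_file") identifies the origin.
file_ids <- all_data %>%
  distinct(source_file) %>%
  collect() %>%
  pull(source_file)

per_file_tables <- lapply(file_ids, function(f) {
  all_data %>% filter(source_file == f)
})

Because the split uses dplyr verbs on the Spark table, each element of per_file_tables stays lazy inside Spark; nothing is pulled into R until you collect() it.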