Tags: r, apache-spark, amazon-s3, sparklyr, s3-bucket

How to load objects from an S3 bucket into Spark in RStudio?


The object in the S3 bucket is 5.3 GB in size. To read the object into R as data, I used get_object("link to bucket path") from the aws.s3 package, but this leads to memory issues.

So I installed Spark 2.3.0 with sparklyr in RStudio and am trying to load this object directly into Spark, but I don't know the command to do that. So far I have:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

If I could convert the object into a data type R can work with (such as a data.frame/tbl), I would use copy_to to transfer the data from R into Spark, as below:

# Copy data to Spark

spark_tbl <- copy_to(sc, data)
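For concreteness, here is roughly what that in-memory route looks like end to end (a sketch assuming the aws.s3 package, with a hypothetical bucket name and object key); this is the path that fails for a 5.3 GB file, because the whole object has to fit in R's memory before it ever reaches Spark:

library(aws.s3)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read the whole object into an R data.frame (all 5.3 GB end up in R's memory)
df <- s3read_using(FUN = read.csv,
                   object = "path/to/file.csv",   # hypothetical object key
                   bucket = "my-bucket")          # hypothetical bucket name

# Then copy the data.frame from R into Spark
spark_tbl <- copy_to(sc, df, name = "my_data")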

I was wondering how I can read the object directly inside Spark instead?

Relevant links:

  1. https://github.com/cloudyr/aws.s3/issues/170

  2. Sparklyr connection to S3 bucket throwing up error

Any guidance would be sincerely appreciated.


Solution


I was trying to read a 5.3 GB CSV file from an S3 bucket. Reading the whole object into (single-threaded) R at once was giving memory issues (IO exceptions).

However, the solution is to load sparklyr in R (library(sparklyr)) and let Spark read the file itself, so that all the cores on the machine are utilized and the file is not pulled through R's memory.

get_object("link to bucket path") can then be replaced by a call to spark_read_csv pointed at the same bucket path (passing the Spark connection sc). Since Spark reads the file in partitions across all cores, the memory issues go away.
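Putting it together, a minimal sketch (assuming a local Spark 2.3.0 connection, the hadoop-aws package for s3a:// access, and hypothetical bucket, key, and credential settings; the connector version and paths must match your own setup):

library(sparklyr)
library(dplyr)

# Ask Spark to pull in the Hadoop S3 connector (version must match your Hadoop build)
conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"

sc <- spark_connect(master = "local", config = conf)

# Pass AWS credentials down to the underlying Hadoop configuration
hconf <- sc %>% spark_context() %>% invoke("hadoopConfiguration")
hconf %>% invoke("set", "fs.s3a.access.key", Sys.getenv("AWS_ACCESS_KEY_ID"))
hconf %>% invoke("set", "fs.s3a.secret.key", Sys.getenv("AWS_SECRET_ACCESS_KEY"))

# Spark reads the CSV itself, in partitions; nothing is loaded through R's memory
spark_tbl <- spark_read_csv(sc,
                            name = "my_data",
                            path = "s3a://my-bucket/path/to/file.csv",  # hypothetical path
                            header = TRUE,
                            memory = FALSE)  # avoid caching the full 5.3 GB table up front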

Also, depending on the file format and source, you can swap in the matching function: spark_load_table, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text for reading, and spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text for writing.
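For example, the same pattern works with Parquet instead of CSV (again with hypothetical paths, reusing the sc connection from above):

# Read a Parquet dataset straight from S3 into a Spark table
users <- spark_read_parquet(sc, name = "users",
                            path = "s3a://my-bucket/users.parquet")

# ...and write results back to S3 without going through R
spark_write_parquet(users, path = "s3a://my-bucket/users_out.parquet")
spark_write_json(users, path = "s3a://my-bucket/users_out_json")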