r, apache-spark, apache-spark-sql, databricks, sparkr

Is there a size limit in Databricks for converting an R dataframe to a Spark DataFrame?


I am new to Stack Overflow and have tried many ways to solve this error, but without success. My problem: I CAN convert subsets of an R dataframe to a Spark DataFrame, but not the whole dataframe. Similar, but not identical, questions include: Not able to to convert R data frame to Spark DataFrame and Is there any size limit for Spark-Dataframe to process/hold columns at a time?

Here some information about the R dataframe:

library(SparkR)
sparkR.session()
sparkR.version()
[1] "2.4.3"

dim(df)
[1] 101368     25
class(df)
[1] "data.frame"

When I convert this to a Spark DataFrame:

sdf <- as.DataFrame(df)
Error in handleErrors(returnStatus, conn) : Error in handleErrors(returnStatus, conn) : 
Error in handleErrors(returnStatus, conn) : 

However, when I subset the R dataframe, it does NOT result in an error:

sdf_sub1 <- as.DataFrame(df[c(1:50000), ])
sdf_sub2 <- as.DataFrame(df[c(50001:101368), ])

class(sdf_sub1)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

class(sdf_sub2)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

How can I convert the whole dataframe to a Spark DataFrame? (I want to use saveAsTable afterwards.) I suspect a capacity problem, but I have no clue how to solve it.

Thanks a lot!!


Solution

  • In general you'll see poor performance when converting from R dataframes to Spark DataFrames, and vice versa. Objects are represented differently in memory in Spark and R, and the object size expands significantly when converting from one to the other. This often blows out the memory of the driver, making it difficult to copy/collect large objects to/from Spark. Fortunately, you have a couple of options.

    1. Use Apache Arrow to establish a common in-memory format for objects, eliminating the need to copy and convert between the R and Spark representations. The link I provided has instructions on how to set this up on Databricks; a short sketch of enabling it in SparkR follows this list.

    2. Write the dataframe to disk as Parquet (or CSV) and then read it into Spark directly. You can use the arrow package in R to do the writing; see the second sketch after this list.

    3. Increase the size of your driver node to accommodate the memory expansion. On Databricks you can select the driver node type for your cluster (or ask your admin to do it) - make sure you pick one with plenty of memory. For reference, I tested collecting a 2 GB dataset and needed a 30 GB+ driver. With Arrow that requirement comes down dramatically.
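
A minimal sketch of option 1, enabling Arrow-based conversion in SparkR. This assumes Spark 3.0 or later, where the optimization is controlled by spark.sql.execution.arrow.sparkr.enabled; it is not part of the 2.4.3 runtime shown in the question, so you may need a newer Databricks Runtime first.

library(SparkR)

# Restart the SparkR session with Arrow-based R/JVM conversion enabled
# (config key assumes Spark 3.0+; not available on Spark 2.4.x)
sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

# With Arrow enabled, the conversion avoids the slow row-by-row serialization path
sdf <- as.DataFrame(df)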
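
A sketch of option 2, staging the data as Parquet, assuming the arrow R package is installed and the /dbfs FUSE mount is available on your cluster; the staging path and table name are illustrative.

library(arrow)

# Write the R dataframe to Parquet via the /dbfs FUSE mount (path is illustrative)
write_parquet(df, "/dbfs/tmp/df_staging.parquet")

# Read the file straight into a Spark DataFrame, skipping the R-to-JVM copy
sdf <- read.df("dbfs:/tmp/df_staging.parquet", source = "parquet")

# Persist it as a table, as intended in the question (table name is illustrative)
saveAsTable(sdf, "my_table")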