I have an R data frame that I would like to convert into a Spark data frame on a remote cluster. I have decided to write my data frame to an intermediate CSV file that is then read using sparklyr::spark_read_csv(). I am doing this because the data frame is too big to send directly using sparklyr::sdf_copy_to() (which I think is due to a limitation in Livy).
I would like to programmatically transfer the R column types used in the data frame to the new Spark data frame, by writing a function that returns a named vector I can pass to the columns argument of spark_read_csv().
I only have rudimentary knowledge of how R data types (specifically, those returned by the class() function) map to Spark data types. However, the following function seems to work as I expect. Hopefully others will find it useful or improve it:
get_spark_data_types_from_data_frame_types <- function(df) {
  # parallel vectors: r_types[i] maps to spark_types[i]
  r_types <-
    c("logical", "numeric", "integer", "character", "list", "factor")
  spark_types <-
    c("boolean", "double", "integer", "string", "array", "string")
  # take only the first class so multi-class columns (e.g. POSIXct) still yield one type each
  types_in <- vapply(df, function(x) class(x)[[1]], character(1))
  types_out <- spark_types[match(types_in, r_types)]
  types_out[is.na(types_out)] <- "string" # fall back to string for any unmapped R class
  names(types_out) <- names(df)
  return(types_out)
}
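
For completeness, this is a rough sketch of how I use it. The connection object sc, the data frame my_df, and the CSV path are placeholders; on a remote cluster the path needs to point somewhere Spark can actually read from (e.g. HDFS or shared storage), not just a local temp file.

library(sparklyr)
library(readr)

# placeholder path; adjust to a location visible to the cluster
csv_path <- "my_data.csv"
write_csv(my_df, csv_path)

sdf <- spark_read_csv(
  sc,                      # existing spark_connect() connection (e.g. via Livy)
  name         = "my_data",
  path         = csv_path,
  header       = TRUE,
  infer_schema = FALSE,    # use the supplied types instead of letting Spark guess
  columns      = get_spark_data_types_from_data_frame_types(my_df)
)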