Tags: r, apache-spark, dplyr, sparklyr, livy

Function to convert R types to Spark types


I have an R data frame that I would like to convert into a Spark data frame on a remote cluster. I have decided to write my data frame to an intermediate csv file that is then read using sparklyr::spark_read_csv(). I am doing this as the data frame is too big to send directly using sparklyr::sdf_copy_to() (which I think is due to a limitation in Livy).

I would like to programmatically carry the R column types over to the new Spark data frame by writing a function that returns a named vector I can pass to the columns argument of spark_read_csv().
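
To make the goal concrete, here is a rough sketch of the intended workflow; the Livy URL, file path, table name, and column types below are placeholders, not my real setup:

    library(sparklyr)

    # Connect to the remote cluster via Livy (URL is a placeholder)
    sc <- spark_connect(master = "http://livy-server:8998", method = "livy")

    # Write the R data frame to an intermediate csv file at a location
    # the cluster can read from (path is a placeholder)
    write.csv(my_df, "/shared/data/my_df.csv", row.names = FALSE)

    # Read the csv into Spark, supplying the column types explicitly
    # instead of letting Spark infer the schema
    sdf <- spark_read_csv(
        sc,
        name         = "my_df",
        path         = "/shared/data/my_df.csv",
        columns      = c(id = "integer", name = "string", value = "double"),
        infer_schema = FALSE
    )

It is the named vector passed to columns, e.g. c(id = "integer", name = "string", value = "double"), that I would like to generate automatically from the R data frame.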


Solution

  • I only have rudimentary knowledge of mapping R data types (specifically, those returned by the class() function) to Spark data types. However, the following function seems to work as I expect, and a usage sketch follows it. Hopefully others will find it useful or improve on it:

    get_spark_data_types_from_data_frame_types <- function(df) {
    
        # Mapping from R classes (as returned by class()) to Spark column types
        r_types <-
            c("logical", "numeric", "integer", "character", "list", "factor")
        spark_types <-
            c("boolean", "double", "integer", "string", "array", "string")
    
        # Look up each column's R class and translate it to the Spark type
        types_in  <- sapply(df, class)
        types_out <- spark_types[match(types_in, r_types)]
    
        # Default any class without an explicit mapping to "string"
        types_out[is.na(types_out)] <- "string"
    
        names(types_out) <- names(df)
    
        return(types_out)
    }
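
    For example, assuming a sparklyr connection sc and a csv path visible to the cluster (both placeholders), the function can be used to build the columns argument automatically:

        # For a data frame with an integer id, a character name and a numeric value
        # column, this returns c(id = "integer", name = "string", value = "double")
        col_types <- get_spark_data_types_from_data_frame_types(my_df)

        sdf <- spark_read_csv(
            sc,
            name         = "my_df",
            path         = "/shared/data/my_df.csv",
            columns      = col_types,
            infer_schema = FALSE
        )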