Tags: r, apache-spark, databricks, sparkr

Is there a more memory-efficient way to concatenate two columns in SparkR?


I'm trying to concatenate two columns from two Spark data frames using SparkR, e.g.:

Frame1

ID  Data1  Data2
1   0      0
2   1      0
3   1      1
4   0      1
5   1      1

Frame2

OtherID  Data1  Data2
2        0      0
3        1      0
4        1      1
5        0      1
6        1      1

After concatenation:

ID  Data1  Data2
12  0      0
23  1      0
34  1      1
45  0      1
56  1      1

As these are Spark data frames, each column is treated as an S4 object rather than a vector, so attempting a simple paste(Frame1$ID, Frame2$OtherID) doesn't work because R can't coerce the S4 type into a vector. My current solution, then, is to collect the two frames into regular R data frames and then run the paste command:

# Pull both Spark data frames down to the driver as plain R data frames
r_frame1 <- collect(Frame1)
r_frame2 <- collect(Frame2)

# Concatenate the two ID columns row by row
r_frame1$id <- paste(r_frame1$ID, r_frame2$OtherID, sep = "")

This works for some of my data, but I need to do it for several tables, all of which are very large (several million rows), and I'm getting java.lang.OutOfMemoryError errors in my Databricks environment. Is there a more memory-efficient way of doing this, perhaps one that doesn't involve collecting into an R data frame? I have also tried running it as a SQL command using CONCAT, but ran into some other issues (see my previous related question: Databricks SQL CONCAT function adding characters where not requested?)


Solution

  • You can use the function concat:

    Frame1$id = concat(Frame1$ID, Frame2$OtherID)
    

    If you want to concatenate with a separator, you can use concat_ws:

    Frame1$id = concat_ws('_', Frame1$ID, Frame2$OtherID)
    

    Note that if your columns are integers, you probably need to cast them to string before the concatenation, with something like this:

    Frame1$id = concat(cast(Frame1$ID, "string"), cast(Frame2$OtherID, "string"))
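
    One caveat worth making explicit: concat operates on columns of a single SparkDataFrame, so if ID and OtherID really live in two separate frames, you need to bring them into one frame first (for example with a join) before the expressions above will resolve. Below is a minimal, self-contained sketch of the full flow; the joined frame and its sample data are made up for illustration and stand in for the result of whatever join fits your data:

    library(SparkR)
    sparkR.session()

    # Hypothetical stand-in for the result of joining Frame1 and Frame2 --
    # concat() needs both columns in the same SparkDataFrame
    joined <- createDataFrame(data.frame(ID      = c(1L, 2L, 3L),
                                         OtherID = c(2L, 3L, 4L)))

    # Cast the integer columns to string, then concatenate; everything
    # runs on the cluster, so nothing is collected to the driver
    joined$id <- concat(cast(joined$ID, "string"), cast(joined$OtherID, "string"))

    # The same idea with a separator
    joined$id2 <- concat_ws("_", cast(joined$ID, "string"), cast(joined$OtherID, "string"))

    head(joined)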