sparkR - subset values in list


How do I perform the following task for a Spark data frame? In dplyr, I would do this:

library(dplyr)
df1 <- data.frame(x = 1:10, y = 101:110)
df2 <- data.frame(r = 5:10, s = 205:210)
df3 <- df1 %>% filter(x %in% df2$r)

How do I perform the equivalent of filter(x %in% df2$r) on a SparkR DataFrame?


Solution

  • I just had a similar question, and this seemed to work for filtering against a list of values:

    df3 <- filter(df1, "x in ('string1','string2','string3')")
    

    In your case, you might want to consider a join:

    df3 <- drop(join(df1, SparkR::distinct(SparkR::select(df2, 'r')), df1$x == df2$r), 'r')
    

    (This is probably a bit too expensive, though.)

    cheers, anna
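
For comparison, here is a minimal sketch of two other ways to express the `%in%` filter from the question. It assumes Spark 2.0 or later and a local Spark installation that SparkR can start. The app name and the variable names `wanted`, `res_in`, and `res_semi` are made up for illustration. The first option pulls the lookup values to the driver, so it only makes sense when `df2` is small. The second option uses a left semi join, so nothing leaves Spark.

```r
library(SparkR)

# Assumption: a local Spark installation is available to SparkR.
sparkR.session(master = "local[1]", appName = "in-filter-sketch")

# Same data as in the question, created as SparkDataFrames.
df1 <- createDataFrame(data.frame(x = 1:10, y = 101:110))
df2 <- createDataFrame(data.frame(r = 5:10, s = 205:210))

# Option 1: collect the (distinct) lookup values into a plain R vector,
# then use %in% on the Column. Only reasonable when df2 is small.
wanted <- collect(distinct(select(df2, "r")))$r
df3_in <- filter(df1, df1$x %in% wanted)

# Option 2: a left semi join keeps only the rows of df1 that have a match
# in df2, returns only df1's columns, and does not duplicate df1's rows.
df3_semi <- join(df1, df2, df1$x == df2$r, "leftsemi")

# Bring the results back as local data.frames, sorted for easy comparison.
res_in <- collect(arrange(df3_in, "x"))
res_semi <- collect(arrange(df3_semi, "x"))
print(res_in)
print(res_semi)

sparkR.session.stop()
```

Both results should contain the rows of `df1` with `x` from 5 to 10. Because the semi join returns only the columns of `df1`, there is no `r` column to drop afterwards and no need to call `distinct` on `df2` first.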