Search code examples
rapache-sparksparkr

Rowwise calculation using SparkR


Here is my toy dataframe.

library(tibble); library(SparkR)

df <- tibble::tribble(
  ~var1, ~var2, ~maxofvar1var2,
  1L,    1L,    1L,
  2L,    1L,    2L,
  2L,    3L,    3L,
  NA,    2L,    2L,
  1L,    4L,    4L,
  8L,    5L,    8L)

df <- df %>% as.DataFrame()

How can I calculate the rowwise calculation using SparkR to get the max value of var1 and var2 as shown in the 3rd variable in the df above? If there is no rowwise function in SparkR, how can I get the desired output?


Solution

  • To get a maximum value from a set of columns use SparkR::greatest:

    df %>% withColumn("maxOfVars", greatest(df$var1, df$var2))
    

    and in general case higher order functions, like aggregate (Spark 2.4 or later), on assembled data .

    df %>% withColumn("theLastVar", expr("aggregate(array(var1, var2), (x, y) -> y)"))
    

    or (version independent) composition of expressions:

    scols <- c("var1", "var2") %>% purrr::map(column)
    
    sumOfVars <- scols %>%
      purrr::map(function(x) coalesce(x, lit(0)))  %>%
      purrr::reduce(function(x, y) x + y, .init=lit(0))
    
    countOfVars <- scols %>% 
      purrr::map(function(x) ifelse(isNotNull(x), lit(1), lit(0))) %>%
      purrr::reduce(
        function(x, y) x + y, .init=lit(0))
    
    df %>% withColumn("meanOfVars", sumOfVars / countOfVars)