
How to impute missing value with column mean using sparklyr, for selected columns?


For Spark data frames in sparklyr, I know that NA values can be replaced with a fixed number using na.replace(number), and that I can use na.replace(x=something) to target a hard-coded column.

Now I have a vector containing the names of the columns whose missing values I want to impute with the column mean. How can I fill in the mean for all the missing values in these columns?

I looked into using spark_apply to run mice on the data, but haven't figured out a solution yet.

Thank you!


Solution

  • You can use the Imputer feature transformer, available in sparklyr as ft_imputer. Let's say the data looks like this:

    # Assumes library(sparklyr) and library(dplyr) are attached and sc is an open Spark connection.
    df <- copy_to(sc, tibble(id=1:3, x=c(1, NA, 3), y=c(NA, 2, -1)))
    

    The transformer requires input and output column lists:

    input_cols <- c("x", "y")
    output_cols <- paste0(input_cols, "_imp")
    

    and can be applied as shown below:

    df %>% 
      ft_imputer(input_cols=input_cols, output_cols=output_cols, strategy="mean")
    
    # Source:   table<sparklyr_tmp_73a32e74369c> [?? x 5]
    # Database: spark_connection
         id     x     y x_imp y_imp
      <int> <dbl> <dbl> <dbl> <dbl>
    1     1     1   NaN     1   0.5
    2     2   NaN     2     2   2  
    3     3     3    -1     3  -1
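
The question asks for the means to be filled in within the selected columns themselves, whereas the output above keeps the originals next to the new `_imp` columns. One possible way to get there is sketched below: impute into temporary columns, drop the originals, and rename the imputed columns back. This is only a sketch under assumptions that are not in the original answer: a local Spark installation reachable via `master = "local"`, dplyr/dbplyr versions that support `all_of()` on Spark tables, and a result name `imputed` chosen purely for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; adjust master for a real cluster.
sc <- spark_connect(master = "local")

df <- copy_to(
  sc,
  tibble(id = 1:3, x = c(1, NA, 3), y = c(NA, 2, -1)),
  name = "df",
  overwrite = TRUE
)

# Columns to impute, and temporary names for the imputed copies.
input_cols <- c("x", "y")
output_cols <- paste0(input_cols, "_imp")

imputed <- df %>%
  # Write the mean-imputed values to the temporary *_imp columns.
  ft_imputer(input_cols = input_cols, output_cols = output_cols, strategy = "mean") %>%
  # Drop the original columns that still contain missing values.
  select(-all_of(input_cols)) %>%
  # Rename x_imp -> x and y_imp -> y (named vector: new name = old name).
  rename(all_of(setNames(output_cols, input_cols)))

imputed
```

Only the columns listed in `input_cols` are touched, so the same pattern should work with any vector of numeric column names; other columns such as `id` pass through unchanged.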