Search code examples
rapache-sparkdplyrsparklyr

Delete NA Columns in sparklyr


I have a dataframe with 75 columns out of which 12 columns are having all NA's and some with 70% NA's. I want to delete columns having >=70% NA's.

Can anyone help me in this? I tried

df[,! apply( df , 2 , function(x) all(is.na(x)) )

but I am getting exception as:

Error: Unable to retreive a spark_connection from object of class NULL

I also tried:

df[colSums(!is.na(df)) != nrow(df)]

and

df[, colSums(is.na(df)) < nrow(df)]

But I am getting exception as

Error in colSums(!is.na(df)) : 'x' must be an array of at least two dimensions


Solution

  • It seems like a bit tricky in sparklyr, but, we can get the index of the columns that needs to be removed from the local copy of dataset and use select to remove those columns

    j1 <- which(!colSums(!is.na(df)))
    library(sparklyr)
    sc <- spark_connect(master = "local")
    df_tbl <- copy_to(sc, df)
    library(dplyr)
    df_tbl %>% 
             select(-j1)
    # Source:   query [20 x 2]
    #Database: spark connection master=local[4] app=sparklyr #local=TRUE
    
    #    col2        col3
    #   <int>       <dbl>
    #1      1 -1.31690812
    #2      1  0.59826911
    #3      4 -0.76221437
    #4      3 -1.42909030
    #5      3  0.33224445
    #6      5 -0.46906069
    #7      1 -0.33498679
    #8      4  1.53625216
    #9      4  0.60999453
    #10     1  0.51633570
    #11     3 -0.07430856
    #12     2 -0.60515695
    #13     4 -1.70964518
    #14     4 -0.26869311
    #15     1 -0.64859151
    #16     5 -0.09411013
    #17     1 -0.08554095
    #18    NA  0.11953107
    #19     3 -0.11629639
    #20    NA -0.94382724
    

    data

    set.seed(24)
    df <- data.frame(col1 = NA_real_, col2 = sample(c(NA, 1:5), 20, 
                   replace = TRUE), col3 = rnorm(20))