I want to extract only the column names that don't have NULL values in a large dataset in R.
For example, if my table has four columns (id, Price, Product, Status), with Price and Status containing a few NULL values and id and Product containing none, then I would want the output to be: id, Product
If you need an exact answer you'll have to scan the full dataset first to count the missing values:
library(dplyr)
library(sparklyr)

# `sc` is an existing Spark connection, e.g. sc <- spark_connect(master = "local")
df <- copy_to(sc, tibble(
  id = 1:4, Price = c(NA, 3.20, NA, 42),
  Product = c("p1", "p2", "p3", "p4"),
  Status = c(NA, "foo", "bar", NA)))
null_counts <- df %>%
  summarise_all(~ sum(as.numeric(is.na(.)), na.rm = TRUE)) %>%
  collect()

null_counts
# A tibble: 1 x 4
id Price Product Status
<dbl> <dbl> <dbl> <dbl>
1 0 2 0 2
Then determine which columns have a missing count of zero:
cols_without_nulls <- null_counts %>%
  select_if(~ . == 0) %>%
  colnames()
cols_without_nulls
[1] "id" "Product"
and use these to select the columns:

df %>% select(all_of(cols_without_nulls))
# Source: spark<?> [?? x 2]
id Product
<int> <chr>
1 1 p1
2 2 p2
3 3 p3
4 4 p4
There is also a shorter variant:

df %>% select_if(~ sum(as.numeric(is.na(.)), na.rm = TRUE) == 0)
Applying predicate on the first 100 rows
# Source: spark<?> [?? x 2]
id Product
<int> <chr>
1 1 p1
2 2 p2
3 3 p3
4 4 p4
but, as the message shows, it evaluates the predicate on only a sample of the data (the first 100 rows), so it can wrongly keep a column whose missing values occur later in the dataset.
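For comparison, on an ordinary local data frame (no Spark involved) the same selection is exact and cheap because the whole table is already in memory. A minimal base-R sketch, assuming the same four columns as above:

```r
# Local data frame with the same shape as the Spark example
df <- data.frame(
  id = 1:4,
  Price = c(NA, 3.20, NA, 42),
  Product = c("p1", "p2", "p3", "p4"),
  Status = c(NA, "foo", "bar", NA),
  stringsAsFactors = FALSE
)

# Count NAs per column over the full table, keep columns with zero
cols_without_nulls <- names(df)[colSums(is.na(df)) == 0]
cols_without_nulls
# [1] "id"      "Product"

# Subset to those columns
df[, cols_without_nulls]
```

Because `colSums(is.na(df))` touches every row, this is always exact; the sampling caveat above applies only to lazy remote tables.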