Search code examples
rgroup-bynadplyrtidy

Tidy way of NA management by groups?


my issue is the following:

I have a tibble which I want to modify with 3 different cases:

  • All values in group are NA
  • At least one but not all are NA. In this case replace NA with an arbitrary value (ex: 0.5)
  • None are NA

Example: (with group_by ind)

    a1 = c(0.3,0.1,NA,0.7,0.2)
    a2 = rep(NA,5)
    a3 = c(0.1,0.3,0.5,0.7,0.8)
    tibble(ind = c(rep("A",5),rep("B",5),rep("C",5)),
       value = c(a1,a2,a3)

segment of group A should yield c(0.3,0.1,0.5,0.7,0.2)

segment of group B should yield rep(NA,5)

segment of group C should stay the same

I've tried with if, ifelse and case_when statements but I think I'm missing something very obvious. All help is appreciated.


Solution

  • Edit:

    Here's a hack way to do it, though I know there is a more concise way:

    library(dplyr)
    df %>% 
      group_by(ind) %>% 
      mutate_if(is.logical, as.numeric) %>% 
      mutate(a1 = case_when(is.na(a1) & sum(is.na(a1)) < length(a1) ~ 0.5, TRUE ~ a1),
             a2 = case_when(is.na(a2) & sum(is.na(a2)) < length(a2) ~ 0.5, TRUE ~ a2),
             a3 = case_when(is.na(a3) & sum(is.na(a3)) < length(a3) ~ 0.5, TRUE ~ a3))
    

    Edit2: Here's the more concise way

    point_five <- function(x){
      x = case_when(is.na(x) & sum(is.na(x)) < length(x) ~ 0.5, TRUE ~ x)
    }
    
    df %>%
      group_by(ind) %>% 
      mutate_if(is.logical, as.numeric) %>% 
      mutate(across(.cols = c(a1:a3), ~ point_five(.)))
    

    This gives us:

    # A tibble: 5 x 4
    # Groups:   ind [1]
      ind      a1    a2    a3
      <chr> <dbl> <dbl> <dbl>
    1 A       0.3    NA   0.1
    2 A       0.1    NA   0.3
    3 A       0.5    NA   0.5
    4 A       0.7    NA   0.7
    5 A       0.2    NA   0.8
    

    If we have df2, containing two groups for ind, group_by will give us:

       ind      a1    a2    a3
       <chr> <dbl> <dbl> <dbl>
     1 A       0.3    NA   0.1
     2 A       0.5    NA   0.3
     3 A       0.5    NA   0.5
     4 A       0.7    NA   0.7
     5 A       0.2    NA   0.8
     6 B       0.5    NA   0.1
     7 B       0.5    NA   0.3
     8 B       0.5    NA   0.5
     9 B       0.5    NA   0.7
    10 B       0.2    NA   0.8
    

    df2

    structure(list(ind = c("A", "A", "A", "A", "A", "B", "B", "B", 
    "B", "B"), a1 = c(0.3, 0.5, NA, 0.7, 0.2, NA, 0.5, NA, NA, 0.2
    ), a2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), a3 = c(0.1, 
    0.3, 0.5, 0.7, 0.8, 0.1, 0.3, 0.5, 0.7, 0.8)), row.names = c(NA, 
    -10L), class = c("tbl_df", "tbl", "data.frame"))