Search code examples
rdataframetidyrna

fill NAs with set of preceding values


I have a dataframe that looks like this:

d = data.frame(obs = 1:4, comp = c("7_10","5","6_9","1_2_6_9"))

  obs    comp
1   1    7_10
2   2       5
3   3     6_9
4   4 1_2_6_9

The comp column contains information about the elements that can be found in each observation (row). So, the first observation contains the elements 7 and 10, the second, element 5 and so on. Each observation can contain one, two or four out of the possible 10 elements.

I can separate these into distinct columns:

library(tidyverse)
d %>% 
  separate_wider_delim(comp, 
                       delim = "_", 
                       names = c("e1", "e2", "e3", "e4"),
                       too_few = "align_start")

    obs e1    e2    e3    e4   
1     1 7     10    NA    NA   
2     2 5     NA    NA    NA   
3     3 6     9     NA    NA   
4     4 1     2     6     9  

However, instead of NAs, I need the information to repeat, like this:

    obs e1    e2    e3    e4   
1     1 7     10    7     10   # elements 7 and 10 repeat
2     2 5     5     5     5    # element 5 repeats
3     3 6     9     6     9   
4     4 1     2     6     9 

I know how to use pivot_longer and fill to fill in NAs with the preceding value:

d %>% 
  separate_wider_delim(comp, 
                       delim = "_", 
                       names = c("e1", "e2", "e3", "e4"),
                       too_few = "align_start") %>% 
  pivot_longer(2:5,
               names_to = "e",
               values_to = "code") %>% 
  fill(code, .direction = "down") %>% 
  pivot_wider(names_from = e,
              values_from = code)

    obs e1    e2    e3    e4   
1     1 7     10    10    10  
2     2 5     5     5     5    
3     3 6     9     9     9    
4     4 1     2     6     9   

Is there a way to instead fill NAs with the set of preceding values, instead of just the last value?


Solution

  • If you know that comp sequences are either 1, 2, or 4 elements (if 1, then repeat 4 times, if 2 repeat twice, if 4 don't repeat), then you can use rep and indicate a target length for each obs of 4. The row_number is added to show the data in wide format as in the example output above.

    library(tidyverse)
    
    d %>% 
      separate_longer_delim(comp, delim = "_") %>%
      reframe(value = rep_len(comp, length.out = 4), .by = obs) %>%
      mutate(name = row_number(), .by = obs) %>%
      pivot_wider(id_cols = obs, names_from = name, values_from = value, names_prefix = "e")
    

    Output

        obs e1    e2    e3    e4   
      <int> <chr> <chr> <chr> <chr>
    1     1 7     10    7     10   
    2     2 5     5     5     5    
    3     3 6     9     6     9    
    4     4 1     2     6     9