Search code examples
rdplyrtidyversetidyr

Completing a sequence of integers by group with tidyverse in R


Given a dataset which contains a grouping variable and a column of integers which is incomplete (contains NAs) and the beginning and ending integer vary by group and the length of each group varies (and could be NA). How might one fill in the NA integer values by completing the sequence.

The following dataset may be used as an example:

library(dplyr)
set.seed(5112021)
dat1 <- bind_rows(data.frame(Group=1,Seq=(3:20)),
                  data.frame(Group=2,Seq=(-1:25))) %>%
  mutate(rn = rnorm(45,mean=0.5,sd=1),
         Seq = ifelse(rn < 0.4,NA,Seq)) %>% 
  select(-rn) %>% 
  group_by(Group) %>% 
  mutate(Seq = ifelse(Seq==-1,NA,Seq))

dat1
   Group Seq
1      1  NA
2      1  NA
3      1  NA
4      1   6
5      1   7
6      1   8
7      1  NA
8      1  10
9      1  11
10     1  NA
11     1  13
12     1  NA
13     1  15
14     1  NA
15     1  NA
16     1  NA
17     1  NA
18     1  20
19     2  NA
20     2   0
21     2  NA
22     2   2
23     2   3
24     2  NA
25     2   5
26     2   6
27     2   7
28     2   8
29     2  NA
30     2  10
31     2  NA
32     2  12
33     2  NA
34     2  NA
35     2  NA
36     2  16
37     2  17
38     2  NA
39     2  NA
40     2  NA
41     2  NA
42     2  22
43     2  NA
44     2  NA
45     2  NA

One way to do this could be to make use of row_numbers (since they are a sequence of integers) by group and calculate the difference between the non-missing values and the row number (which is a unique value) and then add that value back to the row number.

for example

dat2 <- dat1 %>% 
  group_by(Group) %>% 
  mutate(rn = row_number(),
         diff = mean(Seq-rn,na.rm=T)) %>% 
  mutate(New_Seq = rn+diff) %>% 
  select(-rn,-diff)

dat2
   Group Seq New_Seq
1      1  NA       3
2      1  NA       4
3      1  NA       5
4      1   6       6
5      1   7       7
6      1   8       8
7      1  NA       9
8      1  10      10
9      1  11      11
10     1  NA      12
11     1  13      13
12     1  NA      14
13     1  15      15
14     1  NA      16
15     1  NA      17
16     1  NA      18
17     1  NA      19
18     1  20      20
19     2  NA      -1
20     2   0       0
21     2  NA       1
22     2   2       2
23     2   3       3
24     2  NA       4
25     2   5       5
26     2   6       6
27     2   7       7
28     2   8       8
29     2  NA       9
30     2  10      10
31     2  NA      11
32     2  12      12
33     2  NA      13
34     2  NA      14
35     2  NA      15
36     2  16      16
37     2  17      17
38     2  NA      18
39     2  NA      19
40     2  NA      20
41     2  NA      21
42     2  22      22
43     2  NA      23
44     2  NA      24
45     2  NA      25

While this works, it doesn't seem very elegant and may be slow for very large datasets with many grouping variables. I'm curiouse if there is a more 'Tidyverse' way to do this.


Solution

  • You could do something like:

    df %>% 
      group_by(Group) %>%
      mutate(newseq  = seq(n()) + first(na.omit(Seq)) - which.min(Seq)) %>% 
      ungroup()
    
    # OR
    
    df %>% 
      group_by(Group) %>%
      mutate(newseq  = seq(n()) - 1 + first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) %>% 
      ungroup()
    
    # OR
    
    df %>% 
      group_by(Group) %>%
      mutate(newseq  = seq_along(Group) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) - 1) %>% 
      ungroup()
    
    # OR 
    
    df %>% 
      group_by(Group) %>%
      mutate(newseq  = seq(first(na.omit(Seq)) - sum(cumall(is.na(Seq))), length.out = n())) %>%
      ungroup()
    

    All these do the same thing: shift the start of the sequence by the difference of the first non-NA value and the number of NAs before it.

    Output

       Group   Seq newseq
       <int> <int>  <dbl>
     1     1    NA      3
     2     1    NA      4
     3     1    NA      5
     4     1     6      6
     5     1     7      7
     6     1     8      8
     7     1    NA      9
     8     1    10     10
     9     1    11     11
    10     1    NA     12
    # ... with 35 more rows