Search code examples
rdata.tablenumbersrowmissing-data

How to insert a new row for each missing number of one variable while filling with NA for other variables using data.table in R?


How to insert a new row every time there is a break in the y column numbering, while filling the other columns with NA?

Edit: Iroha's suggestion based on tidyr::complete() works fine, as do the others. However, when I apply it to millions of rows, it runs for hours.
It seems like a data.table approach would be faster, but I never use it and I can't seem to find a solution.
Any help/advice would be greatly appreciated!

Initial data:

x <- c("a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b")
y <- c(1,1,2,4,4,4,7,7,11,1,3,3,3,6,6,8)
dat0 <- data.frame(x,y)  

>dat0
   x  y
1  a  1
2  a  1
3  a  2
4  a  4
5  a  4
6  a  4
7  a  7
8  a  7
9  a 11
10 b  1
11 b  3
12 b  3
13 b  3
14 b  6
15 b  6
16 b  8  

Desired output:

> dat1
      x  y
1     a  1
2     a  1
3     a  2
4  <NA>  3
5     a  4
6     a  4
7     a  4
8  <NA>  5
9  <NA>  6
10    a  7
11    a  7
12 <NA>  8
13 <NA>  9
14 <NA> 10
15    a 11
16    b  1
17 <NA>  2
18    b  3
19    b  3
20    b  3
21 <NA>  4
22 <NA>  5
23    b  6
24    b  6
25 <NA>  7
26    b  8  

Thanks for help


Solution

  • Here is a data.table approach (in both examples below, be sure to start with setDT(dat0)):

    f <- \(k) k[CJ(y=1:max(k$y)), on="y"]
    dat0[, f(.SD), x, .SDcols=c("y", "x")][, c(2,3)]
    

    Output:

            y      x
        <int> <char>
     1:     1      a
     2:     1      a
     3:     2      a
     4:     3   <NA>
     5:     4      a
     6:     4      a
     7:     4      a
     8:     5   <NA>
     9:     6   <NA>
    10:     7      a
    11:     7      a
    12:     8   <NA>
    13:     9   <NA>
    14:    10   <NA>
    15:    11      a
    16:     1      b
    17:     2   <NA>
    18:     3      b
    19:     3      b
    20:     3      b
    21:     4   <NA>
    22:     5   <NA>
    23:     6      b
    24:     6      b
    25:     7   <NA>
    26:     8      b
            y      x
    

    Alternatively, you could do this:

    dat0[, x2:=x][dat0[, .(y=1:max(y)),x], on=.(x,y), .(x=x2, y)]
    

    Output:

             x     y
        <char> <int>
     1:      a     1
     2:      a     1
     3:      a     2
     4:   <NA>     3
     5:      a     4
     6:      a     4
     7:      a     4
     8:   <NA>     5
     9:   <NA>     6
    10:      a     7
    11:      a     7
    12:   <NA>     8
    13:   <NA>     9
    14:   <NA>    10
    15:      a    11
    16:      b     1
    17:   <NA>     2
    18:      b     3
    19:      b     3
    20:      b     3
    21:   <NA>     4
    22:   <NA>     5
    23:      b     6
    24:      b     6
    25:   <NA>     7
    26:      b     8