Bug in Reshape (splitstackshape)?

I'm fairly sure this is a bug, but I wanted to put it to the community first. The example page for the Reshape() function in the splitstackshape package contains the following:

mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"), varA.1 = sample(letters, 6),
                 varA.2 = sample(letters, 6), varA.3 = sample(letters, 6),
                 varB.2 = sample(10, 6), varB.3 = sample(10, 6),
                 varC.3 = rnorm(6))

  id_1 id_2 varA.1 varA.2 varA.3 varB.2 varB.3      varC.3
1    1    A      g      y      r      4      3 -0.04493361
2    2    B      j      q      j      7      4 -0.01619026
3    3    A      n      p      s      8      1  0.94383621
4    4    B      u      b      l      2     10  0.82122120
5    5    A      e      e      p     10      6  0.59390132
6    6    B      s      d      u      1      2  0.91897737


## Note that these data are unbalanced
## reshape() will not work
## Not run: 
reshape(mydf, direction = "long", idvar=1:2, varying=3:ncol(mydf))

## End(Not run)

## The Reshape() function can handle such scenarios

Reshape(mydf, id.vars = c("id_1", "id_2"),
       var.stubs = c("varA", "varB", "varC"))

    id_1 id_2 time varA varB        varC
 1:    1    A    1    g    4 -0.04493361
 2:    2    B    1    j    7 -0.01619026
 3:    3    A    1    n    8  0.94383621
 4:    4    B    1    u    2  0.82122120
 5:    5    A    1    e   10  0.59390132
 6:    6    B    1    s    1  0.91897737
 7:    1    A    2    y    3          NA
 8:    2    B    2    q    4          NA
 9:    3    A    2    p    1          NA
10:    4    B    2    b   10          NA
11:    5    A    2    e    6          NA
12:    6    B    2    d    2          NA
13:    1    A    3    r   NA          NA
14:    2    B    3    j   NA          NA
15:    3    A    3    s   NA          NA
16:    4    B    3    l   NA          NA
17:    5    A    3    p   NA          NA
18:    6    B    3    u   NA          NA

But based on the variable names (the numeric suffixes to be precise) in the wide format, shouldn't the output be:

    id_1 id_2 time varA varB        varC
 1:    1    A    1    g   NA          NA
 2:    2    B    1    j   NA          NA
 3:    3    A    1    n   NA          NA
 4:    4    B    1    u   NA          NA
 5:    5    A    1    e   NA          NA
 6:    6    B    1    s   NA          NA
 7:    1    A    2    y    4          NA
 8:    2    B    2    q    7          NA
 9:    3    A    2    p    8          NA
10:    4    B    2    b    2          NA
11:    5    A    2    e   10          NA
12:    6    B    2    d    1          NA
13:    1    A    3    r    3 -0.04493361
14:    2    B    3    j    4 -0.01619026
15:    3    A    3    s    1  0.94383621
16:    4    B    3    l   10  0.82122120
17:    5    A    3    p    6  0.59390132
18:    6    B    3    u    2  0.91897737

After all, varA was measured at all three time points (1, 2, and 3), varB at time points 2 and 3, and varC only at time point 3. Am I missing something obvious?
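To make the expected pairing concrete, here is a minimal base R sketch of one way to get it: pad the missing stub/time columns with NA so the data become balanced, then call reshape(). This is only an illustration of the suffix-to-time mapping, not what Reshape() does internally. The data frame wide and the helper names stubs, times, and wanted are made up for the example.

```r
# Small wide data set with the same column layout as mydf (values are made up).
wide <- data.frame(id_1 = 1:2, id_2 = c("A", "B"),
                   varA.1 = c("g", "j"), varA.2 = c("y", "q"),
                   varA.3 = c("r", "j"),
                   varB.2 = c(4L, 7L), varB.3 = c(3L, 4L),
                   varC.3 = c(-0.04, -0.02),
                   stringsAsFactors = FALSE)

stubs <- c("varA", "varB", "varC")
times <- 1:3

# Every stub/time combination a balanced data set would contain.
wanted <- as.vector(outer(stubs, times, paste, sep = "."))

# Pad the missing combinations (varB.1, varC.1, varC.2) with NA columns.
for (nm in setdiff(wanted, names(wide))) wide[[nm]] <- NA

# With balanced columns, reshape() pairs each suffix with its time value.
long <- reshape(wide, direction = "long",
                idvar = c("id_1", "id_2"),
                varying = lapply(stubs, function(s) paste(s, times, sep = ".")),
                v.names = stubs, timevar = "time", times = times)

# Sort by time, then id, to match the layout of the tables above.
long <- long[order(long$time, long$id_1), ]
rownames(long) <- NULL
print(long)
```

With this padding, varB is NA only at time 1 and varC is NA at times 1 and 2, which is the layout I would expect from the suffixes.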

The tidyr version seems to get it right:

> library(tidyr)
> mydf %>% gather(key="variable", value="value", varA.1:varC.3) %>%
+   separate(variable, into=c("variable","time")) %>%
+   spread("variable", "value")
   id_1 id_2 time varA varB                varC
1     1    A    1    g <NA>                <NA>
2     1    A    2    y    4                <NA>
3     1    A    3    r    3 -0.0449336090152309
4     2    B    1    j <NA>                <NA>
5     2    B    2    q    7                <NA>
6     2    B    3    j    4 -0.0161902630989461 ...
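One caveat with the gather()/spread() pipeline above: gather() stacks every value column into a single column, so varB and varC come back as character (note the <NA> entries and the long decimal strings). A possible alternative, assuming tidyr 1.0.0 or later is installed, is pivot_longer() with the ".value" sentinel, which keeps each stub as its own column with its original type. The data frame wide below is a made-up subset used only for illustration.

```r
library(tidyr)

# Small wide data set with the same column layout as mydf (values are made up).
wide <- data.frame(id_1 = 1:2, id_2 = c("A", "B"),
                   varA.1 = c("g", "j"), varA.2 = c("y", "q"),
                   varA.3 = c("r", "j"),
                   varB.2 = c(4L, 7L), varB.3 = c(3L, 4L),
                   varC.3 = c(-0.04, -0.02),
                   stringsAsFactors = FALSE)

# Split each column name at the dot: the part before it names the output
# column (".value"), the part after it becomes the "time" column.
long <- pivot_longer(wide,
                     cols = -c(id_1, id_2),
                     names_to = c(".value", "time"),
                     names_sep = "\\.")

# The suffix is parsed as text, so convert it to integer afterwards.
long$time <- as.integer(long$time)
print(long)
```

Stub/time combinations that do not exist in the wide data (varB.1, varC.1, varC.2) are filled with NA.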


  • This has been fixed in version 1.4.4, now available on CRAN. Thanks for reporting the bug.

    After an update.packages(), you should be able to get the following:

    packageVersion("splitstackshape")
    ## [1] ‘1.4.4’
    Reshape(mydf, id.vars = c("id_1", "id_2"), var.stubs = c("varA", "varB", "varC"))
    ##     id_1 id_2 time varA varB        varC
    ##  1:    1    A    1    g   NA          NA
    ##  2:    2    B    1    j   NA          NA
    ##  3:    3    A    1    n   NA          NA
    ##  4:    4    B    1    u   NA          NA
    ##  5:    5    A    1    e   NA          NA
    ##  6:    6    B    1    s   NA          NA
    ##  7:    1    A    2    y    4          NA
    ##  8:    2    B    2    q    7          NA
    ##  9:    3    A    2    p    8          NA
    ## 10:    4    B    2    b    2          NA
    ## 11:    5    A    2    e   10          NA
    ## 12:    6    B    2    d    1          NA
    ## 13:    1    A    3    r    3 -0.04493361
    ## 14:    2    B    3    j    4 -0.01619026
    ## 15:    3    A    3    s    1  0.94383621
    ## 16:    4    B    3    l   10  0.82122120
    ## 17:    5    A    3    p    6  0.59390132
    ## 18:    6    B    3    u    2  0.91897737