Search code examples
rreshapereshape2melt

Reshape wide format, to multi-column long format


I want to reshape a wide format dataset that has multiple tests which are measured at 3 time points:

   ID   Test Year   Fall Spring Winter
    1   1   2008    15      16      19
    1   1   2009    12      13      27
    1   2   2008    22      22      24
    1   2   2009    10      14      20
    2   1   2008    12      13      25
    2   1   2009    16      14      21
    2   2   2008    13      11      29
    2   2   2009    23      20      26
    3   1   2008    11      12      22
    3   1   2009    13      11      27
    3   2   2008    17      12      23
    3   2   2009    14      9       31

into a data set that separates the tests by column but converts the measurement time into long format, for each of the new columns like this:

    ID  Year    Time        Test1 Test2
    1   2008    Fall        15      22
    1   2008    Spring      16      22
    1   2008    Winter      19      24
    1   2009    Fall        12      10
    1   2009    Spring      13      14
    1   2009    Winter      27      20
    2   2008    Fall        12      13
    2   2008    Spring      13      11
    2   2008    Winter      25      29
    2   2009    Fall        16      23
    2   2009    Spring      14      20
    2   2009    Winter      21      26
    3   2008    Fall        11      17
    3   2008    Spring      12      12
    3   2008    Winter      22      23
    3   2009    Fall        13      14
    3   2009    Spring      11      9
    3   2009    Winter      27      31

I have unsuccessfully tried to use reshape and melt. Existing posts address transforming to single column outcome.


Solution

  • Using reshape2:

    # Thanks to Ista for helping with direct naming using "variable.name"
    df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
    df.m <- transform(df.m, Test = paste0("Test", Test))
    dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
    

    Update: Using data.table melt/cast from versions >= 1.9.0:

    data.table from versions 1.9.0 imports reshape2 package and implements fast melt and dcast methods in C for data.tables. A comparison of speed on bigger data is shown below.

    For more info regarding NEWS, go here.

    require(data.table) ## ver. >=1.9.0
    require(reshape2)
    
    dt <- as.data.table(df, key=c("ID", "Test", "Year"))
    dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
    dt.m[, Test := paste0("Test", Test)]
    dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
    

    At the moment, you'll have to write dcast.data.table explicitly as it's not a S3 generic in reshape2 yet.


    Benchmarking on bigger data:

    # generate data:
    set.seed(45L)
    DT <- data.table(ID = sample(1e2, 1e7, TRUE), 
            Test = sample(1e3, 1e7, TRUE), 
            Year = sample(2008:2014, 1e7,TRUE), 
            Fall = sample(50, 1e7, TRUE), 
            Spring = sample(50, 1e7,TRUE), 
            Winter = sample(50, 1e7, TRUE))
    DF <- as.data.frame(DT)
    

    reshape2 timings:

    reshape2_melt <- function(df) {
        df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
    }
    # min. of three consecutive runs
    system.time(df.m <- reshape2_melt(DF))
    #   user  system elapsed 
    # 43.319   4.909  48.932 
    
    df.m <- transform(df.m, Test = paste0("Test", Test))
    
    reshape2_cast <- function(df) {
        dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
    }
    # min. of three consecutive runs
    system.time(reshape2_cast(df.m))
    #   user  system elapsed 
    # 57.728   9.712  69.573 
    

    data.table timings:

    DT_melt <- function(dt) {
        dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
    }
    # min. of three consecutive runs
    system.time(dt.m <- reshape2_melt(DT))
    #   user  system elapsed 
    #  0.276   0.001   0.279 
    
    dt.m[, Test := paste0("Test", Test)]
    
    DT_cast <- function(dt) {
        dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
    }
    # min. of three consecutive runs
    system.time(DT_cast(dt.m))
    #   user  system elapsed 
    # 12.732   0.825  14.006 
    

    melt.data.table is ~175x faster than reshape2:::melt and dcast.data.table is ~5x than reshape2:::dcast.