Search code examples
rdataframereshapesplitstackshape

Creating a long table from a wide table using merged.stack (or reshape)


I have a data frame that looks like this:

ID rd_test_2011 rd_score_2011 mt_test_2011 mt_score_2011 rd_test_2012 rd_score_2012 mt_test_2012 mt_score_2012
1  A            80            XX           100           NA           NA            BB           45 
2  XX           90            NA           NA            AA           80            XX           80

I want to write a script that would, for IDs that don't have NA's in the yy_test_20xx columns, create a new data frame with the subject taken from the column title, the test name, the test score and year taken from the column title. So, in this example ID 1 would have three entries. Expected output would look like this:

ID   Subject    Test        Score        Year
1    rd         A           80           2011
1    mt         XX          100          2012
1    mt         BB          45           2012
2    rd         XX          90           2011
2    rd         AA          80           2012
2    mt         XX          80           2012

I've tried both reshape and various forms of merged.stack which works in the sense that I get an output that is on the road to being right but I can't understand the inputs well enough to get there all the way:

library(splitstackshape)
merged.stack(x, id.vars='id', var.stubs=c("rd_test","mt_test"), sep="_")

I've had more success (gotten closer) with reshape:

y<- reshape(x, idvar="id", ids=1:nrow(x), times=grep("test", names(x), value=TRUE), 
      timevar="year", varying=list(grep("test", names(x), value=TRUE), grep("score",
      names(x), value=TRUE)), direction="long", v.names=c("test", "score"),
      new.row.names=NULL) 

Solution

  • This will get your data into the right format:

    df.long = reshape(df, idvar="ID", ids=1:nrow(df), times=grep("Test", names(df), value=TRUE),
     timevar="Year", varying=list(grep("Test", names(df), value=TRUE), 
    grep("Score", names(df), value=TRUE)), direction="long", v.names=c("Test", "Score"),
    new.row.names=NULL) 
    

    Then omitting NA:

    df.long = df.long[!is.na(df.long$Test),]
    

    Then splitting Year to remove Test_:

    df.long$Year = sapply(strsplit(df.long$Year, "_"), `[`, 2)
    

    And ordering by ID:

    df.long[order(df.long$ID),]
    
       ID Year Test Score
    1   1 2011    A    80
    5   1 2012   XX   100
    2   2 2011   XX    90
    9   2 2013   AA    80
    6   3 2012    A    10
    3   4 2011    A    50
    7   4 2012   XX    60
    10  4 2013   AA    99
    4   5 2011    C    50
    8   5 2012    A    75