Search code examples
rlogistic-regressionreshape2melt

Reshape categorical data in columns to binary data in rows using R


I have tabulations of binomial events (eg numbers of wins and losses) that I want to reshape into long format so that each event is on a separate row.

The data looks something like this:

 #Fake data   
 team      <- c("a","b","c","d")
 wins.home <- c(0,1,0,2); wins.away <- c(1,1,1,1)
 loss.home <- c(0,1,2,1); loss.away <- c(2,1,2,2)
 df <- data.frame(team,wins.home,wins.away,loss.home,loss.away)
 df$games.tot <- apply(df[,-1],1,sum)


#Dataframe in WIDE format
      team  wins.home  wins.away  loss.home  loss.away  games.tot
         a         0         1         0         2         3
         b         1         1         1         1         4
         c         0         1         2         2         5
         d         2         1         1         2         6

I'd like to reshape this wide data to LONG format like this:

team  game where  win
a     1    away   1
a     2    away   0
a     3    away   0
b     1    home   1
b     2    away   1
b     3    home   0
b     4    away   0

EDIT: the game index is arbitrary and is just an index.

Converting this so that there are counts of each category in each row is easy with reshape2::melt but I don't know how to further break down the counts into binary events.


Solution

  • There is probably a shorter solution for this, but as a quick fix:

    library(stringr)
    library(dplyr)
    library(reshape2)
    
    #initial melt
    df2 <- melt(df[-6], id.vars='team')
    #split the variable column to get the 'away and home' column
    df2 <- cbind(df2, str_split_fixed(df2$variable, "\\.", 2))
    #create the wins column
    df2$win <- ifelse(df2$`1` == 'wins',1,0)
    #replicate the number of rows according to value
    df2 <- df2[rep(1:nrow(df2), df2$value),]
    #remove unnecessary columns
    df2$variable <- df2$value <- df2$`1` <- NULL
    #final group by to add the game column
    df2 %>% group_by(team) %>% mutate(game = 1:n()) %>% arrange(team)
    

    Output:

    Source: local data frame [18 x 4]
    Groups: team [4]
    
         team      2   win  game
       (fctr) (fctr) (dbl) (int)
    1       a   away     1     1
    2       a   away     0     2
    3       a   away     0     3
    4       b   home     1     1
    5       b   away     1     2
    6       b   home     0     3
    7       b   away     0     4
    #...and so on