Search code examples
rmultiple-columnscoercion

coerce column names from column factors and populate


I have data like

set.seed(6)
df <- data.frame(t = as.Date("2014/1/1")+seq(0,100.25,.25),
    name = paste(sample(c("Alert_","NonOp_"),402,replace=TRUE),
                  sample(1:10,402,replace=TRUE),sep=""),
    unit = c(rep(1:10,each=40),10,10))
head(df)

Here is some representative data

head(df)
            t     name   unit
1  2014-01-01  NonOp_3      1
2  2014-01-01  NonOp_6      1
3  2014-01-01  Alert_5      1
4  2014-01-01  Alert_7      1
5  2014-01-02  NonOp_4      1
6  2014-01-02  NonOp_2      1

How to generate a table from the names, where the table has columns of unit, t, and then names in the name column gets coerced into columns of name factored without the Alert/NonOp, and the values in the factored names columns should be NA, A (for alert), and N (for NonOp). Here is the type of table I'm looking for, if all the numbers above were for unit 1.

unit               t   name_1 name_2 name_3 name_4 name_5 name_6 name_7 name_8 ...
   1      2014-01-01       NA     NA      N     NA      A      A      A     NA
   1      2014-01-02       NA      N     NA      N     NA     NA     NA     NA

The goal is to the named alerts / nonops into a table ordered by unit / t and write the table to a file. And read the file into excel.


Solution

  • It sounds like the following is what you're looking for:

    library(reshape2)
    newdf <- cbind(df, colsplit(df$name, "_", c("V1", "V2")))
    newdf$V1 <- factor(newdf$V1, c("NonOp", "Alert"), c("N", "A"))
    newdf$V2 <- paste0("name_", newdf$V2)
    head(newdf)
    #            t    name unit V1     V2
    # 1 2014-01-01 NonOp_3    1  N name_3
    # 2 2014-01-01 NonOp_6    1  N name_6
    # 3 2014-01-01 Alert_5    1  A name_5
    # 4 2014-01-01 Alert_7    1  A name_7
    # 5 2014-01-02 NonOp_4    1  N name_4
    # 6 2014-01-02 NonOp_2    1  N name_2
    
    head(dcast(newdf, t ~ V2, value.var = "V1"))
    #            t name_1 name_10 name_2 name_3 name_4 name_5 name_6 name_7 name_8 name_9
    # 1 2014-01-01   <NA>    <NA>   <NA>      N   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
    # 2 2014-01-01   <NA>    <NA>   <NA>   <NA>   <NA>   <NA>      N   <NA>   <NA>   <NA>
    # 3 2014-01-01   <NA>    <NA>   <NA>   <NA>   <NA>      A   <NA>   <NA>   <NA>   <NA>
    # 4 2014-01-01   <NA>    <NA>   <NA>   <NA>   <NA>   <NA>   <NA>      A   <NA>   <NA>
    # 5 2014-01-02   <NA>    <NA>   <NA>   <NA>      N   <NA>   <NA>   <NA>   <NA>   <NA>
    # 6 2014-01-02   <NA>    <NA>      N   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>
    

    Basically, start by splitting the "name" column into two columns, and then use dcast. The other steps are mostly cosmetic.


    Another approach would be to use my cSplit function and dcast.data.table from the "data.table" package.

    Skipping things like creating "name_blah" and replacing "NonOp" with "N" and so on, you can just directly do:

    dcast.data.table(cSplit(df, "name", "_"), t ~ name_2, value.var = "name_1")
    #               t     1 10     2     3     4     5     6     7  8  9
    #   1: 2014-01-01    NA NA    NA NonOp    NA    NA    NA    NA NA NA
    #   2: 2014-01-01    NA NA    NA    NA    NA    NA NonOp    NA NA NA
    #   3: 2014-01-01    NA NA    NA    NA    NA Alert    NA    NA NA NA
    #   4: 2014-01-01    NA NA    NA    NA    NA    NA    NA Alert NA NA
    #   5: 2014-01-02    NA NA    NA    NA NonOp    NA    NA    NA NA NA
    #  ---                                                              
    # 398: 2014-04-10    NA NA    NA    NA    NA    NA NonOp    NA NA NA
    # 399: 2014-04-10    NA NA    NA    NA NonOp    NA    NA    NA NA NA
    # 400: 2014-04-10 NonOp NA    NA    NA    NA    NA    NA    NA NA NA
    # 401: 2014-04-11    NA NA    NA NonOp    NA    NA    NA    NA NA NA
    # 402: 2014-04-11    NA NA Alert    NA    NA    NA    NA    NA NA NA