Search code examples
rsplitvectorization

Split string column to create new binary columns


My data has one column and I am trying to create additional columns with what’s after each “/” in the rows. Here are the first few rows of the data:

mydata <- data.frame(
  ALL = factor(c(
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off",
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov",
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov",
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cni_at.p3x.4",
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cbr_LBL"
  ))
) 

The result should look like this (data frame) with a “1” in the new column if the variable appears in the row and “0” if not:

Result <- data.frame(
  ALL = factor(c(
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off",
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/hi.on/hi.ov",
    "/ca/put/sent_1fe. gr/eq2_on/eq2_off/hi.on/hi.ov",
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cni_at.p3x.4",
    "/ca/put/sent_1/fe.gr/eq2_on/eq2_off/cbr_LBL"
  )),
  ca = rep(1L, 5L),
  put = rep(1L, 5L),
  sent_1 = rep(1L, 5L),
  fe.gr = rep(1L, 5L),
  eq2_on = rep(1L, 5L),
  eq2_off = rep(1L, 5L),
  hi.on = c(0L, 1L, 1L, 0L, 0L),
  hi.ov = c(0L, 1L, 1L, 0L, 0L),
  cni_at.p3x.4 = c(0L, 0L, 0L, 1L, 0L),
  cbr_LBL = rep(0:1, c(4L, 1L))
)

I have tried many functions including strsplit and sapply:

sapply(strsplit(as.character(mydata$ALL), “\\/”), “[[“, 2) #returns "ca"s only

sapply(strsplit(as.character(mydata$ALL), "\\/"), "[[", 3) #returns "put"s only

There are millions of rows and I’d greatly appreciate anything that is quick and efficient.


Solution

  • Using mtabuate from the qdapTools package that I maintain:

    library(qdapTools)
    mtabulate(strsplit(as.character(dat[[1]]), "/"))
    
    ##   V1 ca cbr_LBL cni_at.p3x.4 eq2_off eq2_on fe.gr hi.on hi.ov put sent_1 sent_1fe.gr
    ## 1  1  1       0            0       1      1     1     0     0   1      1           0
    ## 2  1  1       0            0       1      1     1     1     1   1      1           0
    ## 3  1  1       0            0       1      1     0     1     1   1      0           1
    ## 4  1  1       0            1       1      1     1     0     0   1      1           0
    ## 5  1  1       1            0       1      1     1     0     0   1      1           0