Search code examples
rdata.table

R data.table explode multiple rows at once


Similar to this post, I have some columns with comma separated values and I wish to explode them to additional rows with shared by columns (data below). In this case they are aligned, so the comma ordering is preserved. So the exploded version would be 51 total rows, where each column with a comma separated value is split (Model, Pathological Tau, Tau Enzymes).

I can do it for one column but I have to rename the aggregated column names after and unlist the column like so:

mycol = "Model"
f <- e[, .(tstrsplit(.SD, ", ")), by = c(names(e)[names(e) != mycol]), .SDcols = mycol]
setnames(f, old = c(paste0("V", seq_along(mycol))), new = mycol)
f[, (mycol) := unlist(.SD), .SDcols = mycol]

with multiple columns it seems to work and dump them into V1, potentially skipping the setnames step, but I can't get it to extract these list items to columns:

mycol = c("Model", "Pathological Tau", "Tau Enzymes")
f <- e[, .(tstrsplit(.SD, ", ")), by = c(names(e)[!names(e) %in% mycol]), .SDcols = mycol]

# all error
f[, c(mycol) := lapply(V1, unlist), by = .I]
f[, c(mycol) := unlist(V1), by = .I]
f[, (mycol) := unlist(V1), by = .I]

Error in `[.data.table`(f, , `:=`(c(mycol), unlist(V1)), by = .I) :
  Supplied 3 items to be assigned to group 1 of size 1 in column 'Model'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

The error doesn't make sense, the LHS is length 3...

Data below

structure(list(Strategy = c("General neuroprotection", "Immune response",
"Neuromodulator/transmission", "Passive immunization", "Passive immunization",
"Passive immunization", "Passive immunization", "Passive immunization",
"Passive immunization", "Passive immunization", "Passive immunization",
"Passive immunization", "Passive immunization", "Passive immunization",
"Passive immunization", "Passive immunization", "Passive immunization",
"Passive immunization", "Passive immunization", "Proteostasis network",
"Proteostasis network", "Tau aggregation", "Tau aggregation",
"Tau aggregation", "Tau enzyme/PTM"), Page = c(5L, 4L, 4L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L,
3L, 2L, 2L, 2L, 2L), Model = c("hTau.P301S, THY-Tau22 (G272V/P301S)",
"Tau.P301L, hTau.P301S", "Tau-ΔK280, TauRD-ΔK280, rTg4510 (P301L)",
"JNPL3 (P301L), hTau.P301S", "JNPL3 (P301L), PS19 (P301S)", "JNPL3 (P301L), Tau.P301L",
"PS19 (P301S), hTau", "K3 (K396I), pR5 (P301L)", "hTau, JNPL3 (P301L)",
"hTau, JNPL3 (P301L)", "JNPL3 (P301L), rTg4510 (P301L)", "JNPL3 (P301L), rTg4510 (P301L)",
"Tau609, Tau784", "JNPL3 (P301L), rTg4510 (P301L)", "JNPL3 (P301L), hTau.P301S",
"JNPL3 (P301L), rTg4510 (P301L)", "K3 (K396I), pR5 (P301L)",
"K3 (K396I), pR5 (P301L)", "JNPL3 (P301L), hTau.P301S", "rTg4510, PS19 (P301S)",
"hTau.P301S, rTg4510 (P301L)", "hTau40(296–390aa) line 1 (L1), line 66 (P301S/G335D)",
"Tau-ΔK280, TauRD-ΔK280", "hTau40(296–390aa) line 1 (L1), line 66 (P301S/G335D)",
"pR5 (P301L), K3 (K369I)"), `Pathological Tau` = c("NT, NT",
"DE, NT", "NT, NT, NT", "NT, NT", "DE, DE", "NT, NT", "NT, NT",
"NE, NE", "NT, NT", "NT, NT", "DE, DE", "NE, NE", "DE, NE", "DE, DE",
"DE, DE", "DE, NE", "NE, NE", "DE, NE", "DE, DE", "NE, DE", "DE, DE",
"DE, DE", "DE, NE", "DE, DE", "DE, DE"), `Tau Enzymes` = c("NT, NT",
"NT, NT", "NT, NT, NT", "NT, NT", "NT, NT", "NT, NT", "NT, NT",
"NT, NT", "NT, NT", "NT, NT", "NT, NT", "NT, NT", "NT, NT", "NT, NT",
"NT, NT", "NT, NT", "NT, NT", "NT, NT", "NT, NT", "NT, NT", "NT, NT",
"NT, NT", "NT, NT", "NT, NT", "NT, NE")), class = c("data.table",
"data.frame"), row.names = c(NA, -25L), sorted = "Strategy")

Edit:

thanks @jblood94 for the answer, the final dynamic version that will grab a range (or vector) of column names is

# explode columns with multiple entries
mycol <- cols[5:length(cols)]
d <- b[, lapply(.SD, \(x) unlist(tstrsplit(x, ", "))), 
  by = c(names(b)[!names(b) %in% mycol]), 
  .SDcols = mycol
]

Solution

  • Is this what you're after? (Truncated to 10 rows).

    dt[, lapply(.SD, \(x) unlist(tstrsplit(x, ","))), .(Strategy, Page),
       .SDcols = c("Model", "Pathological Tau", "Tau Enzymes")][1:10]
    #>                        Strategy  Page                    Model Pathological Tau Tau Enzymes
    #>                          <char> <int>                   <char>           <char>      <char>
    #>  1:     General neuroprotection     5               hTau.P301S               NT          NT
    #>  2:     General neuroprotection     5  THY-Tau22 (G272V/P301S)               NT          NT
    #>  3:             Immune response     4                Tau.P301L               DE          NT
    #>  4:             Immune response     4               hTau.P301S               NT          NT
    #>  5: Neuromodulator/transmission     4                Tau-ΔK280               NT          NT
    #>  6: Neuromodulator/transmission     4              TauRD-ΔK280               NT          NT
    #>  7: Neuromodulator/transmission     4          rTg4510 (P301L)               NT          NT
    #>  8:        Passive immunization     1            JNPL3 (P301L)               NT          NT
    #>  9:        Passive immunization     1            JNPL3 (P301L)               DE          NT
    #> 10:        Passive immunization     1            JNPL3 (P301L)               NT          NT