Consider the following dataset:
dt <- structure(list(lllocatie = structure(c(1L, 6L, 2L, 4L, 3L), .Label = c("Assen", "Oosterwijtwerd", "Startenhuizen", "t-Zandt", "Tjuchem", "Winneweer"), class = "factor"),
lat = c(52.992, 53.32, 53.336, 53.363, 53.368),
lon = c(6.548, 6.74, 6.808, 6.765, 6.675),
mag.cat = c(3L, 2L, 1L, 2L, 2L),
places = structure(c(2L, 4L, 5L, 6L, 3L), .Label = c("", "Amen,Assen,Deurze,Ekehaar,Eleveld,Geelbroek,Taarlo,Ubbena", "Eppenhuizen,Garsthuizen,Huizinge,Kantens,Middelstum,Oldenzijl,Rottum,Startenhuizen,Toornwerd,Westeremden,Zandeweer", "Loppersum,Winneweer", "Oosterwijtwerd", "t-Zandt,Zeerijp"), class = "factor")),
.Names = c("lllocatie", "lat", "lon", "mag.cat", "places"),
class = c("data.table", "data.frame"),
row.names = c(NA, -5L))
When I want to split the strings in the last column into separate rows, I use (with data.table
version 1.9.5+):
dt.new <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=list(lllocatie,lat,lon,mag.cat)]
However, when I use:
dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=lllocatie]
I get the the same result except that all columns are forced into character variables. The problem is that for small datasets it is not a big problem to specify the variables that do not have to split in the by
argument, but for datasets with many columns/variables it is. I know it is possible to do this with the splitstackshape
package (as is mentioned by @ColonelBeauvel in his answer), but I'm looking for a data.table
solution as i want to chain more operations to this.
How can I prevent that without manually specifying the variables that do not have to be split in the by
argument?
Two solutions with data.table
:
1: Use the type.convert=TRUE
argument inside tstrsplit()
as proposed by @Arun:
dt.new1 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE, type.convert=TRUE))), by=lllocatie]
2: Use setdiff(names(dt),"places")
in the by
argument as proposed by @Frank:
dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=setdiff(names(dt),"places")]
Both approaches give the same result:
> identical(dt.new1,dt.new2)
[1] TRUE
The advantage of the second solution is that when you have more thanone columns with string values, only the one you specify in setdiff(names(dt),"places")
is being split (supposing you want only that specific one, in this case places
, to split). The splitstackshape
package also offers this advantage.