I have a data.table with a column comprising occupational title names. I want to find out repeated occupations but that are written in a reverse order (e.g. writer advertising and advertising writer). Here is a simplified version of my data and the result that I would like to get
data = data.table(
ID = as.character(c("advertisings writer","writer advertisings","setter","drill setter","setter drill","agent claims","claims agent","engineer"))
)
data_result = data.table(
ID = as.character(c("advertisings writer","setter","drill setter","agent claims","engineer"))
)
Here is the code that I've been using.
data[,b:= strsplit(ID," ")]
data <- data[,.(b=unlist(b)),by = setdiff(names(data),'b')]
setorderv(data,cols=c("ID","b"))
data <- data[,bb:=list(list(unique(b))),by="ID"][,.SD[1],by=c("ID"),.SDcols=c("bb")]
data[,b:=lapply(bb,paste,collapse=' ')]
data[,b:=unlist(b)]
unique(data,by="b")
Since I am working with quite a large dataset this approach is very time consuming.
Thanks
A possible solution with data.table
:
library(data.table)
data[,ID:=sapply(sapply(stringr::str_split(ID,' '),sort),function(x) paste(x,collapse=' '))]
unique(data)
ID
1: advertisings writer
2: setter
3: drill setter
4: agent claims
5: engineer