I have the following sample data from a large data.table:
ddf = structure(list(id = 1:5, country = c("United States of America",
"United Kingdom", "United Arab Emirates", "Saudi Arabia", "Brazil"
), area = c("North America", "Europe", "Arab", "Arab", "South America"
), city = c("first", "second", "second", "first", "third")), .Names = c("id",
"country", "area", "city"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L))
ddf
   id                  country          area   city
1:  1 United States of America North America  first
2:  2           United Kingdom        Europe second
3:  3     United Arab Emirates          Arab second
4:  4             Saudi Arabia          Arab  first
5:  5                   Brazil South America  third
I need to write a function that accepts a variable number of text arguments, performs an AND search on the data, and returns every row that contains all of the search strings. Different search strings may match in different columns.
For example, searchfn(ddf, 'brazil', 'third') should print only the last row.
The search must be case-insensitive.
The data is large, so the search needs to be fast (hence the use of data.table).
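To make the requirement concrete, here is a minimal sketch of the behaviour I am after, written in base R. The name `search_all` is hypothetical, the sample is rebuilt as a plain data.frame so the snippet runs without data.table, and search terms are assumed to be literal text rather than regular expressions. It is not tuned for speed.

```r
# Sample data rebuilt as a plain data.frame (assumption: same columns as ddf).
ddf <- data.frame(
  id = 1:5,
  country = c("United States of America", "United Kingdom",
              "United Arab Emirates", "Saudi Arabia", "Brazil"),
  area = c("North America", "Europe", "Arab", "Arab", "South America"),
  city = c("first", "second", "second", "first", "third"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows in which every term occurs in some column.
search_all <- function(df, ...) {
  terms <- tolower(c(...))
  # Paste each row into one lowercase string; the newline separator keeps
  # a term from matching across two adjacent columns.
  row_text <- tolower(do.call(paste, c(unname(as.list(df)), sep = "\n")))
  # One logical vector per term; fixed = TRUE means literal matching.
  hits <- lapply(terms, grepl, x = row_text, fixed = TRUE)
  # AND the vectors together; with no terms, every row is kept.
  keep <- Reduce(`&`, hits, rep(TRUE, nrow(df)))
  df[keep, ]
}

search_all(ddf, "brazil", "third")   # the row with id 5
search_all(ddf, "united", "second")  # the rows with id 2 and 3
```

Pasting the rows once and reusing that vector for every term avoids scanning each column separately for each term, but whether that is fast enough on large data is exactly what I am unsure about.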
I tried:
searchfn = function(ddf, ...){
  ll = list(...)
  print(sapply(ll, function(x) grep(x, ddf, ignore.case=T)))
}
It picks up all the search strings and prints some indices, but those indices do not identify the matching rows.
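As far as I can tell, the reason is that grep() coerces a list-like table with as.character(), which turns each column into a single string, so the result refers to columns rather than rows. A small check, with the sample rebuilt as a plain data.frame (an assumption made so the snippet runs without data.table):

```r
# Sample data rebuilt as a plain data.frame (assumption: same columns as ddf).
ddf <- data.frame(
  id = 1:5,
  country = c("United States of America", "United Kingdom",
              "United Arab Emirates", "Saudi Arabia", "Brazil"),
  area = c("North America", "Europe", "Arab", "Arab", "South America"),
  city = c("first", "second", "second", "first", "third"),
  stringsAsFactors = FALSE
)

# Each column is matched as one string, so these are column positions.
col_hits <- grep("arab", ddf, ignore.case = TRUE)
col_hits           # 2 3, i.e. the country and area columns
names(ddf)[col_hits]

# Matching against a single column gives row positions instead.
row_hits <- grep("arab", ddf$country, ignore.case = TRUE)
row_hits           # 3 4
```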
This seems to work, but I doubt it's an optimal solution:
searchfn = function(ddf, ...){
  ll = list(...)
  pat <- paste(unlist(ll), collapse = "|")
  X <- do.call(paste, ddf)
  Y <- regmatches(X, gregexpr(pat, X, ignore.case = TRUE))
  ddf[which(vapply(Y, function(x) length(unique(x)), 1L) == length(ll)), ]
}
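One thing I am not sure about in this version: unique() compares the matched text case-sensitively, so a row that contains the same term in two different cases could be counted as two distinct hits. A small sketch on a made-up string (not taken from my data) shows the effect, with tolower() as one possible way around it:

```r
# Made-up row text containing the same word in two different cases.
X <- "Arab arab first"
pat <- paste(c("arab", "first"), collapse = "|")

# All matches of either term, found case-insensitively.
Y <- regmatches(X, gregexpr(pat, X, ignore.case = TRUE))

n_raw <- length(unique(Y[[1]]))             # 3: "Arab", "arab", "first"
n_folded <- length(unique(tolower(Y[[1]]))) # 2: "arab", "first"
```

With two search terms, the raw count of 3 would fail the equality check against length(ll) even though both terms are present, whereas the lowercased count of 2 would pass.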
Here are some tests to try out:
searchfn(ddf, 'brazil', 'third')
searchfn(ddf, 'arab', 'first')
searchfn(ddf, "united", "second")
searchfn(ddf, "united", "second", "2")
searchfn(ddf, "united", "second", "Euro")