
Variable argument fast text search function in R


I have the following sample data from a large data.table:

ddf = structure(list(id = 1:5, country = c("United States of America", 
 "United Kingdom", "United Arab Emirates", "Saudi Arabia", "Brazil"
 ), area = c("North America", "Europe", "Arab", "Arab", "South America"
 ), city = c("first", "second", "second", "first", "third")), .Names = c("id", 
 "country", "area", "city"), class = c("data.table", "data.frame"
 ), row.names = c(NA, -5L))

ddf
   id                  country          area   city
1:  1 United States of America North America  first
2:  2           United Kingdom        Europe second
3:  3     United Arab Emirates          Arab second
4:  4             Saudi Arabia          Arab  first
5:  5                   Brazil South America  third

I need to write a function that accepts a variable number of text arguments, performs an AND search on the data, and returns every row that contains all of the search strings. Different search strings may match in different columns.

For example, searchfn(ddf, 'brazil', 'third') should print only the last row.

The search should be case-insensitive.

The data is large, so the search needs to be fast (hence the use of data.table).

I tried:

searchfn = function(ddf, ...){
    ll = list(...)
    print(sapply(ll, function(x) grep(x, ddf, ignore.case=T)))
}

It picks up all the search strings and prints some indices, but these are not the rows that match.
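A likely reason the attempt misbehaves: when grep() is given a whole table instead of a character vector, it coerces it with as.character(), which produces one string per column. The indices that come back therefore refer to columns, not rows. A minimal sketch of the difference, assuming a plain data.frame copy of the sample data (the names df, col_hits and row_hits are made up for illustration):

```r
# Assumption: a plain data.frame stands in for the data.table in the question.
df <- data.frame(
  id = 1:5,
  country = c("United States of America", "United Kingdom",
              "United Arab Emirates", "Saudi Arabia", "Brazil"),
  area = c("North America", "Europe", "Arab", "Arab", "South America"),
  city = c("first", "second", "second", "first", "third"),
  stringsAsFactors = FALSE
)

# Passing the whole table: grep() sees one deparsed string per column,
# so the result is an index into the columns.
col_hits <- grep("brazil", df, ignore.case = TRUE)

# Passing a single column: one TRUE/FALSE per row of the table.
row_hits <- grepl("brazil", df$country, ignore.case = TRUE)
```

Here col_hits should point at the country column, while which(row_hits) gives the row that actually contains the term.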


Solution

  • This seems to work, but I doubt it's an optimal solution:

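    searchfn = function(ddf, ...){
      ll = list(...)
      # One alternation pattern covering every search term
      pat <- paste(unlist(ll), collapse = "|")
      # Collapse each row into a single string so a term can match in any column
      X <- do.call(paste, ddf)
      # All case-insensitive matches found in each row string
      Y <- regmatches(X, gregexpr(pat, X, ignore.case = TRUE))
      # Keep rows whose distinct matches cover every term; lowercase first so
      # that "Arab" and "arab" are not counted as two different terms
      ddf[which(vapply(Y, function(x) length(unique(tolower(x))), 1L) == length(ll)), ]
    }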
    

    Here are some tests to try out:

    searchfn(ddf, 'brazil', 'third')
    searchfn(ddf, 'arab', 'first')
    searchfn(ddf, "united", "second")
    searchfn(ddf, "united", "second", "2")
    searchfn(ddf, "united", "second", "Euro")
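
An alternative sketch, in case counting distinct matches turns out to be fragile (for example, when one search term is a prefix of another): test each term against each column separately and AND the per-term results together. The function name searchfn_cols is hypothetical, and it treats the terms as literal substrings (fixed = TRUE after lowercasing) rather than as regular expressions, which is an assumption about what is needed. It has not been benchmarked against the version above.

```r
# Hypothetical alternative: per-term, per-column matching combined with AND.
searchfn_cols <- function(ddf, ...) {
  terms <- tolower(c(...))
  # Lowercased character copy of every column, computed once
  cols <- lapply(as.list(ddf), function(col) tolower(as.character(col)))
  keep <- rep(TRUE, nrow(ddf))
  for (term in terms) {
    # TRUE for rows where this term occurs in at least one column
    hit <- Reduce(`|`, lapply(cols, function(col) grepl(term, col, fixed = TRUE)))
    # A row survives only if every term so far has been found in it
    keep <- keep & hit
  }
  ddf[keep, ]
}

# Assumption: a plain data.frame copy of the sample data, so the sketch
# runs without loading data.table.
df <- data.frame(
  id = 1:5,
  country = c("United States of America", "United Kingdom",
              "United Arab Emirates", "Saudi Arabia", "Brazil"),
  area = c("North America", "Europe", "Arab", "Arab", "South America"),
  city = c("first", "second", "second", "first", "third"),
  stringsAsFactors = FALSE
)

res <- searchfn_cols(df, "united", "second")
```

Because each column is searched on its own, a term cannot match across the boundary between two pasted columns, and the result does not depend on how the regex engine resolves overlapping alternatives.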