Tags: r, nlp, text-mining, r-package, named-entity-recognition

Entity extraction based on a customized list in R


I have a list of texts and a list of entities.

The list of texts is typically a character vector.

The list of entities is a bit more complex. Some entities can be listed exhaustively, such as the main cities of the world. Others cannot be listed exhaustively but can be captured by a regex pattern.


list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 133.001.00.00 ...', 'Lorem ipsum 12-01-2021, Copenhagen www.stackoverflow.com swimming', ...)

entity_city <- c('Copenhagen', 'Paris', 'New York', ...)

entity_IP_address <- c('regex code for IP address')

entity_URL <- c('regex code for URL')

entity_verb <- c('verbs')

Given list_of_text and the lists of entities, I want to find the matching entities for each text.

For example, the text c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...') contains c('eat', 'drink', 'sleep') for entity_verb, c('133.001.00.00') for entity_IP_address, and so on.


res <- extract_entity(text = c('Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00 ...'),
                      entities = c(entity_verb, entity_IP_address, entity_city))

res[['verb']]
c('eat', 'drink', 'sleep')

res[['IP']]
c('133.001.00.00')

res[['city']]
c('Copenhagen')

Is there an R package I can leverage for this?


Solution

  • Take a look at the maps and qdapDictionaries packages. For world cities, I subset to cities with a population greater than 1M; otherwise the match fails with the error 'regular expression is too large'.

    library(maps)             # provides the world.cities data set
    library(qdapDictionaries) # provides the action.verbs word list

    list_of_text <- c('Lorem ipsum 12-01-2021 eat, Copenhagen 192.41.196.888',
                      '192.41.199.888',
                      'Lorem ipsum 12-01-2021, Copenhagen www.stackoverflow.com swimming')

    # Four dot-separated groups of 1-3 digits. This does not check that each
    # group is <= 255. gregexpr() returns every match, not just the first.
    ipRegex <- "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b"
    unlist(regmatches(x = list_of_text, m = gregexpr(ipRegex, list_of_text, perl = TRUE)))

    # Join the verbs into one alternation. The \\b word boundaries keep a verb
    # such as 'eat' from matching inside a longer word such as 'create'.
    verbRegex <- paste0("\\b(?:", paste(unlist(action.verbs), collapse = "|"), ")\\b")
    unlist(regmatches(x = list_of_text, m = gregexpr(verbRegex, list_of_text, perl = TRUE)))

    # Keep only cities with a population above 1M; the full list fails with
    # 'regular expression is too large'.
    bigCities <- world.cities[world.cities$pop > 1000000, 'name']
    citiesRegex <- paste0("\\b(?:", paste(bigCities, collapse = "|"), ")\\b")
    unlist(regmatches(x = list_of_text, m = gregexpr(citiesRegex, list_of_text, perl = TRUE)))
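
The pieces above could be wrapped into a single function shaped like the `extract_entity()` call in the question. What follows is only a sketch in base R, not a function from any package: the argument names `dictionaries`, `patterns`, and `chunk_size` are assumptions made for illustration. Dictionary terms are quoted with `\Q...\E` so they match literally (this assumes no term contains `\E` and that terms begin and end with word characters), and they are matched in chunks, which is one possible way to stay under the 'regular expression is too large' limit without dropping smaller cities.

```r
# Hypothetical helper (not from any package), base R only.
# dictionaries: named list of character vectors holding literal terms.
# patterns:     named character vector of Perl-compatible regular expressions.
# Matches are pooled over all elements of `text` and de-duplicated.
extract_entity <- function(text, dictionaries = list(), patterns = character(),
                           chunk_size = 500) {
  # Match literal terms as whole words, a chunk at a time, so that no single
  # regular expression has to contain the entire dictionary.
  match_terms <- function(terms) {
    chunks <- split(terms, ceiling(seq_along(terms) / chunk_size))
    hits <- lapply(chunks, function(chunk) {
      quoted <- paste0("\\Q", chunk, "\\E")  # \Q...\E: treat the term literally
      rx <- paste0("\\b(?:", paste(quoted, collapse = "|"), ")\\b")
      unlist(regmatches(text, gregexpr(rx, text, perl = TRUE)))
    })
    as.character(unique(unlist(hits, use.names = FALSE)))
  }

  # Apply a user-supplied regular expression as-is.
  match_pattern <- function(rx) {
    as.character(unique(unlist(regmatches(text, gregexpr(rx, text, perl = TRUE)))))
  }

  # One result element per entity name.
  c(lapply(dictionaries, match_terms), lapply(patterns, match_pattern))
}

# Usage with small, made-up dictionaries and the IP pattern used above.
text <- "Lorem ipsum 12-01-2021 eat drink sleep, Copenhagen 133.001.00.00"
res <- extract_entity(
  text,
  dictionaries = list(verb = c("eat", "drink", "sleep"),
                      city = c("Copenhagen", "Paris", "New York")),
  patterns = c(IP = "\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b")
)
res[["verb"]]
res[["IP"]]
res[["city"]]
```

Because the helper pools matches over every element of `text`, per-text results would need one call per element, for example `lapply(list_of_text, extract_entity, dictionaries = ..., patterns = ...)`. In practice the dictionaries could be filled from `action.verbs` and `world.cities$name` as in the answer above.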