Tags: r, string, wikipedia-api, page-title

How to check if a string is the title of a Wikipedia article with R?


Suppose I have a vector of strings:

strings <- c("dog", "cat", "animal", "bird", "birds", "bqpohd", "ohphha", "mqphihpha", "aphhphohpa", "pohha")

I would like to check if these strings are the titles of Wikipedia articles.

Here is a solution, but I presume it is not the quickest way to do this task for long lists:

library(httr)
library(xml2)

CheckIfAStringIsTheTitleOfAWikipediaArticle <- function(strings){
  start_time <- Sys.time()

  # Query the MediaWiki API for one title and return its page ID
  # (NA if the page does not exist).
  GetPageID <- function(string){
    query <- paste0("https://en.wikipedia.org/w/api.php?",
                    "action=query&format=xml&titles=",
                    utils::URLencode(string, reserved = TRUE))
    answer <- httr::GET(query)
    page.xml <- xml2::read_xml(answer)
    nodes <- xml2::xml_find_all(page.xml, ".//query//pages//page")
    xml2::xml_attr(nodes, "pageid", default = NA_character_)
  }

  # A string is a valid page name if the API returns a page ID for it.
  Check <- function(string){
    !is.na(GetPageID(string))
  }

  validTitle <- unlist(lapply(strings, Check))
  results.df <- data.frame(strings, validTitle)
  print(Sys.time() - start_time)
  return(results.df)
}

results.df <- CheckIfAStringIsTheTitleOfAWikipediaArticle(strings)
View(results.df)
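One way to reduce the HTTP overhead, if you stay with the API approach: the MediaWiki query API accepts multiple titles per request (up to 50 for ordinary clients), joined with "|", so many strings can be checked in a single round trip. Below is a minimal sketch of building such a batched query URL; `BatchQueryURL` is a hypothetical helper name, and parsing the combined XML response would follow the same `.//query//pages//page` pattern as in the function above.

```r
# Sketch: build one API query for many titles at once. The MediaWiki
# "titles" parameter accepts titles separated by "|"; pages that do not
# exist come back with a "missing" attribute instead of a "pageid".
# URL-encoding is needed because "|" is a reserved character.
BatchQueryURL <- function(strings){
  paste0("https://en.wikipedia.org/w/api.php?",
         "action=query&format=xml&titles=",
         utils::URLencode(paste(strings, collapse = "|"), reserved = TRUE))
}

BatchQueryURL(c("dog", "cat", "bqpohd"))
# "https://en.wikipedia.org/w/api.php?action=query&format=xml&titles=dog%7Ccat%7Cbqpohd"
```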

Thank you very much for your help!


Solution

  • Here's a base R approach.

    Download all of the English titles from Wikipedia into a temp file, then scan them into memory. It's about 1.2 GB.

    I assume you don't care about case, so we'll need to change the titles to all lower case with tolower. Then just use %in%. (Note that multi-word titles in the dump use underscores instead of spaces, so multi-word strings would need the same treatment.)

    strings <- c("dog", "cat", "animal", "bird", "birds", "bqpohd", "ohphha", "mqphihpha", "aphhphohpa", "pohha")
    
    url <- "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz"
    tmp <- tempfile()
    download.file(url,tmp)
    titles <- scan(gzfile(tmp),character())
    titles <- tolower(titles)
    strings[strings %in% titles]
    [1] "dog"    "cat"    "animal" "bird"   "birds" 
    
    #Reasonably fast
    system.time(strings[strings %in% titles])
       user  system elapsed 
      1.494   0.029   1.525
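    If you want the data-frame shape the question asked for, rather than just the matching subset, the same `%in%` test can be stored directly as the logical column. A sketch below, using a small stand-in vector in place of the full downloaded `titles`:

```r
strings <- c("dog", "cat", "animal", "bird", "birds", "bqpohd")

# Stand-in for the full title dump; in practice use the tolower()-ed
# `titles` vector scanned from the downloaded file.
titles <- c("dog", "cat", "animal", "bird", "birds")

# validTitle is TRUE for the first five strings, FALSE for "bqpohd".
results.df <- data.frame(strings, validTitle = strings %in% titles)
```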