Tags: r, utf-8, character-encoding, diacritics, mediawiki-api

How to encode accents to query Mediawiki API in R?


I have no problem querying the MediaWiki API of French Wikipedia for strings without accents:

string <- 'chien'
string <- stringi::stri_enc_toutf8(string, is_unknown_8bit = FALSE, validate = FALSE)
apiQuery <- paste0('https://fr.wikipedia.org/w/api.php?action=query&format=xml&titles=', string)
page <- xml2::read_xml(apiQuery)

{xml_document} [1] \n \n \n \n \n <page _idx="2736914" pageid="2736914 ...

but I do have a problem with strings that contain accents:

string <- 'être'
string <- stringi::stri_enc_toutf8(string, is_unknown_8bit = FALSE, validate = FALSE)
apiQuery <- paste0('https://fr.wikipedia.org/w/api.php?action=query&format=xml&titles=', string)
page <- xml2::read_xml(apiQuery)

I receive the following error:

Error in open.connection(x, "rb") : HTTP error 400.


Solution

  • You need to percent-encode the query URL (URL encoding, not HTML escapes):

    page <- xml2::read_xml(URLencode(apiQuery))
    

    This changes the "ê" to "%C3%AA", the percent-encoded form of its two UTF-8 bytes.
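
A slightly more defensive variant is to encode only the title rather than the whole URL. The sketch below assumes the title may contain reserved characters such as "&", which URLencode() leaves untouched by default (reserved = FALSE) and which would otherwise be read as a parameter separator. The helper names encode_title() and build_query_url() are hypothetical, not part of any package.

```r
# Percent-encode a single page title for use as a query-string value.
# encode_title() and build_query_url() are hypothetical helper names.
encode_title <- function(title) {
  # URLencode() escapes the string's raw bytes, so convert to UTF-8 first.
  title <- enc2utf8(title)
  # reserved = TRUE also escapes characters such as "&", "?" and "/".
  utils::URLencode(title, reserved = TRUE)
}

build_query_url <- function(title) {
  paste0(
    "https://fr.wikipedia.org/w/api.php?action=query&format=xml&titles=",
    encode_title(title)
  )
}

api_query <- build_query_url("\u00eatre")  # "\u00ea" is "ê"
print(api_query)

# Fetching the page needs network access and the xml2 package:
# page <- xml2::read_xml(api_query)
```

Encoding only the title keeps the "?", "&" and "=" of the base URL intact while still escaping those characters if they occur inside a title.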