Tags: html, r, xml, web-scraping, rvest

Web scraping using R: I want to extract some table-like data from a website


I'm having some problems scraping data from a website. I do not have a lot of experience with web scraping. My plan is to scrape some data using R from the following website: https://www.fatf-gafi.org/countries/

More precisely, I want to extract the list of countries with some sort of sanctions.

library(XML)
url <- "https://www.fatf-gafi.org/countries/"
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")

But this doesn't return the information I'm after, because it is not in a table but in nested divs.
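
One way to check this kind of thing is to parse the markup and look for table nodes and script references. The sketch below runs on a small made-up HTML snippet, not on the live page, so the markup, the class names and the script path are all assumptions for illustration only:

```r
library(XML)

# Hypothetical stand-in for a page whose data is filled in by a script
html <- '<html><body>
  <div class="countries"><div class="list"></div></div>
  <script src="/js/example-data.js"></script>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Count <table> nodes; with none present, table-oriented helpers have nothing to extract
n_tables <- length(getNodeSet(doc, "//table"))

# List external scripts, which are a likely place for the data to live
script_srcs <- xpathSApply(doc, "//script[@src]", xmlGetAttr, "src")

print(n_tables)
print(script_srcs)
```

If the table count is zero and the page references external scripts, those scripts are worth inspecting before trying to walk the divs.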


Solution

  • One option, mostly as a test of how JavaScript evaluation works with V8, the "Embedded JavaScript and WebAssembly Engine" for R:
    https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html

    Create a V8 context, evaluate the downloaded JavaScript, and read the value of the countries variable back from V8. It arrives as a nested data frame, hence the unnest(); the last row is filled with NAs, hence the filter().

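    library(httr)
    library(V8)
    library(dplyr)
    library(tidyr)

    # JavaScript file that defines the `countries` variable
    url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
                  'js/country-data-multi-lang.js')
    js_content <- content(GET(url), as = 'text', encoding = 'UTF-8')

    # Create a V8 context and evaluate the script inside it
    ct <- v8()
    ct$eval(js_content)

    # Pull `countries` back into R and flatten the nested `groups` column
    ct$get("countries") %>%
      unnest(cols = c(groups)) %>%
      select(c(1:2, 4:14, 16)) %>%  # keep 14 columns, selected by position
      filter(!is.na(name))          # drop the last row, which is all NA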
    
    #> # A tibble: 209 × 14
    #>    name       code  FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF
    #>    <chr>      <chr> <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>   
    #>  1 Afghanist… AF    ""    "mbr" ""    "obs" ""      ""    ""      ""    ""      
    #>  2 Albania    AL    ""    ""    ""    ""    ""      ""    ""      ""    ""      
    #>  3 Algeria    DZ    ""    ""    ""    ""    ""      ""    ""      ""    "mbr"   
    #>  4 Andorra    AD    ""    ""    ""    ""    ""      ""    ""      ""    ""      
    #>  5 Angola     AO    ""    ""    ""    ""    "mbr"   ""    ""      ""    ""      
    #>  6 Anguilla   AI    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
    #>  7 Antigua a… AG    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
    #>  8 Argentina  AR    "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"   
    #>  9 Armenia    AM    ""    ""    ""    "obs" ""      ""    ""      ""    ""      
    #> 10 Aruba Kin… AW    "els" ""    "mbr" ""    ""      ""    ""      ""    ""      
    #> # … with 200 more rows, and 3 more variables: MONEYVAL <chr>,
    #> #   jurisdiction <chr>, id <chr>
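
The eval/get/unnest steps above can also be tried without touching the network. The following is only a minimal sketch: the JavaScript string, the country names and the two group columns are made up, and the last record deliberately has no name so that it plays the role of the NA row mentioned above.

```r
library(V8)
library(dplyr)
library(tidyr)

# Made-up JavaScript standing in for the downloaded file (not real FATF data)
js_content <- '
var countries = [
  {name: "Exampleland", code: "EX", groups: {FATF: "mbr", APG: ""}},
  {name: "Testonia",    code: "TS", groups: {FATF: "",    APG: "obs"}},
  {code: "", groups: {FATF: "", APG: ""}}
];'

ct <- v8()
ct$eval(js_content)

# The array of objects is converted to a data frame; `groups` is a nested data frame
nested <- ct$get("countries")

flat <- nested %>%
  unnest(cols = c(groups)) %>%
  select(name, code, FATF, APG) %>%  # selected by name rather than by position
  filter(!is.na(name))               # drops the record that has no name

print(flat)
```

Selecting columns by name, as in this sketch, is one way to avoid depending on the column order of the real file, at the cost of having to list the names.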