How to scrape data with filters from the website when the URL doesn't change?


I've scraped data from this list in R, but the result doesn't reflect the website filters (List = Oxford 3000 and CEFR level = A1) that I had applied, and as far as I can see there are no variables I can use to filter the data in R.

Is there some other way I can get just the data I want? The URL doesn't appear to change with filtering.

Here is my code:

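library(rvest)  # read_html(), html_nodes(), html_text(), %>%
library(purrr)  # map()

url <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000"

# Collect the text of every level tag, part-of-speech tag, and link on the page
url %>%
  map(. %>%
    read_html() %>%
    html_nodes(".belong-to , .pos , a") %>%
    html_text()
  ) %>%
  unlist() -> ox3ka1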

Solution

  • To select only the words tagged a1, filter the list items by their data-ox5000 attribute:

    library(rvest)

    df = 'https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000' %>%
      read_html() %>%
      html_nodes('.top-g') %>%
      html_nodes("li[data-ox5000 = 'a1']") %>%
      html_text()
    
    head(df)
    [1] "   a   indefinite articlea1      " "   about   adverba1      "         "   about   prepositiona1      "    "   above   adverba1      "        
    [5] "   above   prepositiona1      "    "   across   adverba1      "   
    

    For further reference, see: How do I use html_nodes to select nodes with "attribute = x" in R?
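
The text returned above still has the word, part of speech, and level run together with padding. A minimal sketch of one way to get a tidy table instead is shown here. It runs on a small mock HTML fragment rather than the live page, and the markup is an assumption: the `.top-g`, `.pos`, and `.belong-to` classes come from the code in this thread, while the `data-hw` and `data-ox3000` attributes and the exact nesting are hypothetical and should be checked against the page source. If the page does carry a `data-ox3000` attribute, that may be the closer match for the Oxford 3000 filter.

```r
library(rvest)

# Mock fragment imitating the word-list markup.
# Structure is assumed for illustration, not copied from the site.
html <- '
<ul class="top-g">
  <li data-hw="a" data-ox3000="a1" data-ox5000="a1">
    <a href="#">a</a> <span class="pos">indefinite article</span>
    <span class="belong-to">a1</span>
  </li>
  <li data-hw="abandon" data-ox3000="b2" data-ox5000="b2">
    <a href="#">abandon</a> <span class="pos">verb</span>
    <span class="belong-to">b2</span>
  </li>
  <li data-hw="about" data-ox3000="a1" data-ox5000="a1">
    <a href="#">about</a> <span class="pos">adverb</span>
    <span class="belong-to">a1</span>
  </li>
</ul>'

page <- read_html(html)

# Keep only the entries whose (assumed) Oxford 3000 attribute is "a1"
a1_nodes <- page %>%
  html_nodes(".top-g") %>%
  html_nodes("li[data-ox3000='a1']")

# Build one row per entry; html_node() returns a single match per <li>,
# and trim = TRUE strips the surrounding whitespace
ox3k_a1 <- data.frame(
  word  = a1_nodes %>% html_node("a") %>% html_text(trim = TRUE),
  pos   = a1_nodes %>% html_node(".pos") %>% html_text(trim = TRUE),
  level = a1_nodes %>% html_attr("data-ox3000"),
  stringsAsFactors = FALSE
)

print(ox3k_a1)
```

To try this against the real list, replace `read_html(html)` with `read_html(url)` and adjust the attribute name in the selector to whatever the page actually uses.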