
Filter options in URL are ignored by a read_html and rvest call


I'm trying to scrape https://www.yachtfocus.com/boten-te-koop.html#price=10000%7C30000&length=9.2%7C&super_cat_nl=Zeil. I'm using the R package rvest and its read_html function, with this code:

library('rvest')
#scrape yachtfocus
url <- "https://www.yachtfocus.com/boten-te-koop.html#price=10000|30000&length=9.2|&super_cat_nl=Zeil"
webpage <- read_html(url)

#Use a CSS selector to grab the element holding the number of results
amount_results_html <- html_node(webpage,".res_number")

#create text
amount_results <- html_text(amount_results_html)

This does not return the value I expect for the filters given in the URL. Instead it returns the "unfiltered" value, the same as if I had used:

url <- "https://www.yachtfocus.com/boten-te-koop.html"
webpage <- read_html(url)

Can I "force" read_html to execute the filter parameters correctly?


Solution

  • The issue is that the site turns the anchor link into an asynchronous POST request, retrieves JSON and then dynamically builds the page.
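
    As a rough illustration of why read_html() returns the unfiltered page: everything after the # is a URL fragment, which HTTP clients keep to themselves and do not send to the server. The base-R sketch below only splits the question's URL to show what is actually requested; the variable names requested and fragment are mine, not part of any package.

    ```r
    # The URL from the question, including the "#..." filter part
    url <- "https://www.yachtfocus.com/boten-te-koop.html#price=10000|30000&length=9.2|&super_cat_nl=Zeil"

    # What goes over the wire: the URL with the fragment stripped
    requested <- sub("#.*$", "", url)

    # What stays on the client: the fragment the site's JavaScript reads
    fragment <- sub("^[^#]*#", "", url)

    print(requested)
    print(fragment)
    ```

    So the server receives the same request for both URLs in the question, and the filtering only happens later, in the browser.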

    You can open Developer Tools in the browser and reload the page to see the request.

    If you right-click that request and choose "Copy as cURL", you can use the curlconverter package to automagically turn it into an httr call:

    httr::POST(
      url = "https://www.yachtfocus.com/wp-content/themes/yachtfocus/search/",
      body = list(
        hash = "#price=10000%7C30000&length=9.2%7C&super_cat_nl=Zeil"
      ),
      encode = "form"
    ) -> res
    
    dat <- jsonlite::fromJSON(httr::content(res, "text"))
    

    This is what you get (you still need to parse some HTML):

    str(dat)
    ## List of 8
    ##  $ content   : chr " <!-- <div class=\"list_part\"> <span class=\"list_icon\"><a href=\"#\">lijst</a></span> <span class=\"foto\"><"| __truncated__
    ##  $ top       : chr " <h3 class=\"res_number\">317 <em>boten\tgevonden</em></h3> <p class=\"filters_list red_border\"> <span>prijs: "| __truncated__
    ##  $ facets    :List of 5
    ##   ..$ categories_nl    :List of 15
    ##   .. ..$ 6u3son : int 292
    ##   .. ..$ 1v3znnf: int 28
    ##   .. ..$ 10opzfl: int 27
    ##   .. ..$ 1mrn15c: int 23
    ##   .. ..$ qn3nip : int 3
    ##   .. ..$ 112l5mh: int 2
    ##   .. ..$ 1xjlw46: int 1
    ##   .. ..$ ci62ni : int 1
    ##   .. ..$ 1x1x806: int 0
    ##   .. ..$ 1s9bgxg: int 0
    ##   .. ..$ 1i7r9mm: int 0
    ##   .. ..$ qlys89 : int 0
    ##   .. ..$ 1wwlclv: int 0
    ##   .. ..$ 84qiky : int 0
    ##   .. ..$ 3ahnnr : int 0
    ##   ..$ material_facet_nl:List of 11
    ##   .. ..$ 911206 : int 212
    ##   .. ..$ c9twlr : int 53
    ##   .. ..$ 1g88z3 : int 23
    ##   .. ..$ fwfz2d : int 14
    ##   .. ..$ gvrlp6 : int 5
    ##   .. ..$ 10i8nq1: int 4
    ##   .. ..$ h98ynr : int 4
    ##   .. ..$ 1qt48ef: int 1
    ##   .. ..$ 1oxq1p2: int 1
    ##   .. ..$ 1kc1p0j: int 0
    ##   .. ..$ 10dkoie: int 0
    ##   ..$ audience_facet_nl:List of 13
    ##   .. ..$ 71agu9 : int 69
    ##   .. ..$ eb9lzb : int 63
    ##   .. ..$ o40emg : int 55
    ##   .. ..$ vd2cm9 : int 41
    ##   .. ..$ tyffgj : int 24
    ##   .. ..$ icsp53 : int 20
    ##   .. ..$ aoqm1  : int 11
    ##   .. ..$ 1puyni5: int 6
    ##   .. ..$ 1eyfin8: int 5
    ##   .. ..$ 1920ood: int 4
    ##   .. ..$ dacmg4 : int 4
    ##   .. ..$ e7bzw  : int 3
    ##   .. ..$ offcbq : int 3
    ##   ..$ memberships      :List of 7
    ##   .. ..$ 137wtpl: int 185
    ##   .. ..$ 17vn92y: int 166
    ##   .. ..$ wkz6oe : int 109
    ##   .. ..$ 1mdn78e: int 87
    ##   .. ..$ aklw3a : int 27
    ##   .. ..$ 1d9qtvu: int 20
    ##   .. ..$ zqsmlf : int 3
    ##   ..$ super_cat_nl     :List of 3
    ##   .. ..$ 2xl9ac : int 271
    ##   .. ..$ glli8c : int 317
    ##   .. ..$ 1key6o0: int 0
    ##  $ filter    :List of 3
    ##   ..$ brand  : chr "<label><input type=\"checkbox\" name=\"yfilter[brand][Dehler]\" data-solr=\"brand\" value=\"Dehler\" class=\"cu"| __truncated__
    ##   ..$ brokers: chr "<label><input type=\"checkbox\" name=\"yfilter[brokers][Scheepsmakelaardij Goliath]\" data-solr=\"brokers\" val"| __truncated__
    ##   ..$ land_nl: chr "<label><input type=\"checkbox\" name=\"yfilter[land_nl][Nederland]\" data-solr=\"land_nl\" value=\"Nederland\" "| __truncated__
    ##  $ hash      : chr "&price=10000|30000&length=9.2|&super_cat_nl=Zeil"
    ##  $ ifield    :List of 3
    ##   ..$ y_price_min : chr "10000"
    ##   ..$ y_price_max : chr "30000"
    ##   ..$ y_length_min: chr "9.2"
    ##  $ rcfield   :List of 1
    ##   ..$ y_glli8c: chr "1"
    ##  $ session_id: chr "spghrfb8urv50u2kfg6bp3hejm"
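
    A minimal sketch of that remaining parsing step, assuming the top element still contains the res_number heading shown in the str() output. To keep it runnable offline, top_html is a hand-copied stand-in for the visible start of dat$top; with a live response you would pass dat$top instead.

    ```r
    library(rvest)

    # Stand-in for dat$top (assumption: copied from the truncated str() output)
    top_html <- "<h3 class=\"res_number\">317 <em>boten\tgevonden</em></h3>"

    # Parse the HTML fragment and select the same node as in the question
    node <- html_node(read_html(top_html), ".res_number")
    txt <- html_text(node)

    # Pull the first run of digits out of the heading text
    amount_results <- as.integer(regmatches(txt, regexpr("[0-9]+", txt)))
    print(amount_results)
    ```

    The same approach should work for dat$content, which holds the HTML for the listings themselves.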
    

    Note that this is a super common problem that's been covered many times on SO. Each situation requires finding the right URL in the XHR requests, but that's usually the only difference. If you're going to scrape the web, you should spend some time reading up on how to do so (even 10 minutes of searching on SO would likely have solved this for you).

    If you don't want to do this type of page introspection, you need to use RSelenium or splashr or decapitated. Again, the use of those tools in the context of a problem like this is a well-covered topic on SO.