How to extract the tag code itself using R and rvest


I would like to scrape links to content from this website:

https://www.forklift-international.com/en/for-sale/forklift-battery

The links appear to be generated by JavaScript, but after inspecting the page source I can see that the URL pattern is present there. Here is the important part of the code for the first link:

<div class="card cardhighlight0 cpointer" itemscope="" itemtype="http://schema.org/Product" onclick="window.location='/en/e/Battery-used-RoyPow-S2450-12003t'"> 

In this particular example, I need to extract this path, which I can use to assemble the final URL:

/en/e/Battery-used-RoyPow-S2450-12003t 

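The problem is that I am not able to read this in. I do the following:

library(httr)   # GET()
library(rvest)  # read_html(), html_elements(), html_text2()

response <- GET("https://www.forklift-international.com/en/for-sale/forklift-battery")
page <- read_html(response)
# html_text2() returns only the visible text of each node, not its attributes
page_links <- page %>% html_elements(".card.cardhighlight0") %>% html_text2()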

In the next steps I would like to extract the pattern from the text with a regex, but I never got to that point, because the needed pattern is not included in the parsed text. As I understand it, the pattern is part of the tag itself and is not picked up by rvest. Any suggestions on how to deal with this, please?


Solution

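  • html_text2() returns only the text content of a node, while the path you need is stored in the onclick attribute. You can access element attributes with rvest::html_attr():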

    library(rvest)
    page <- read_html("https://www.forklift-international.com/en/for-sale/forklift-battery")
    page %>% 
      html_elements(".card.cardhighlight0") %>% 
      html_attr("onclick")
    #>  [1] "window.location='/en/e/Battery-used-RoyPow-S2450-12003t'"               
    #>  [2] "window.location='/en/e/Battery-used-RoyPow-S24160-12002t'"              
    #>  [3] "window.location='/en/e/Battery-used-RoyPow-F80420A-12001t'"             
    #>  [4] "window.location='/en/e/Battery-used-RoyPow-F48560X-12000t'"             
    #>  [5] "window.location='/en/e/Battery-used-RoyPow-F24160-11999t'"              
    #>  [6] "window.location='/en/e/Battery-used-GRUMA-48-Volt-4-PzS-620-Ah-11998t'" 
    #>  [7] "window.location='/en/e/Battery-used-IBB-24-Volt-3-PzB-225-Ah-11996t'"   
    #>  [8] "window.location='/en/e/Battery-used-Linde-24-Volt-3-PzS-375-Ah-11997t'" 
    #>  [9] "window.location='/en/e/Battery-used-Hoppecke-48V-4-HPzS-500-11993t'"    
    #> [10] "window.location='/en/e/Battery-used-IBH-IBG-Smart-Low-Antimon-11992t'"  
    #> [11] "window.location='/en/e/Battery-used-GRUMA-24-Volt-8-PzS-1000-Ah-11991t'"
    #> [12] "window.location='/en/e/Battery-used-Hoppecke-24V-3-HPzS-375-11990t'"    
    #> [13] "window.location='/en/e/Battery-used-IBV-24-Volt-4-PzS-620-Ah-11987t'"   
    #> [14] "window.location='/en/e/Battery-used-IBV-24-Volt-4-PzS-620-Ah-11988t'"   
    #> [15] "window.location='/en/e/Battery-used-IBV-24-Volt-4-PzS-620-Ah-11989t'"   
    #> [16] "window.location='/en/e/Battery-used-GRUMA-48-Volt-4-PzS-620-Ah-11985t'" 
    #> [17] "window.location='/en/e/Battery-used-%5Bdiv%5D-3-EPzS-465-11853t'"       
    #> [18] "window.location='/en/e/Battery-used-GRUMA-48-Volt-5-PzS-625-Ah-11981t'" 
    #> [19] "window.location='/en/e/Battery-used-AIM-48-Volt-5-PzS-775-Ah-11980t'"   
    #> [20] "window.location='/en/e/Battery-used-GRUMA-24-Volt-2-PzS-250-Ah-11979t'"
    

    Created on 2024-01-12 with reprex v2.0.2
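
    To get from the onclick values to the final URLs, the path still has to be pulled out of the window.location='...' wrapper and joined to the site root. The following is only a minimal sketch in base R: it uses two strings copied from the output above in place of the live html_attr() result, and the names onclick, paths, base_url and urls are illustrative, not part of the original answer. It also assumes every onclick value has exactly the window.location='...' form shown above.

```r
# Sample values copied from the output above; in practice this would be the
# character vector returned by html_attr("onclick").
onclick <- c(
  "window.location='/en/e/Battery-used-RoyPow-S2450-12003t'",
  "window.location='/en/e/Battery-used-RoyPow-S24160-12002t'"
)

# Keep only what sits between the single quotes after window.location=
paths <- sub("^window\\.location='([^']*)'$", "\\1", onclick)

# Prepend the site root (assumed from the page URL) to get absolute URLs
base_url <- "https://www.forklift-international.com"
urls <- paste0(base_url, paths)

print(urls)
```

    If some cards carry a different onclick form, sub() leaves those strings unchanged, so it may be worth checking the result with startsWith(paths, "/") before building the URLs.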