Page item not scrape-able with rvest


I am getting into web scraping with R and have been working through some exercises. I am currently experimenting with local eBay listings, where I was able to scrape the text of an individual listing. I have tried several approaches to also scrape the listing's view count, but none of them returns the number shown on the page.

The page link is:

https://www.ebay-kleinanzeigen.de/s-anzeige/zahnpflege-fuer-hunde-und-katzen-extra-stark-gegen-mundgeruch/1281544930-313-3170

The view count is displayed below the image, on the right (currently 00044 views).

I was able to retrieve the text with this code:

library(rvest)

# Note: despite its name, pageURL holds the parsed HTML document, not the URL
pageURL <- read_html("https://www.ebay-kleinanzeigen.de/s-anzeige/zahnpflege-fuer-hunde-und-katzen-extra-stark-gegen-mundgeruch/1281544930-313-3170")
input <- pageURL %>%
  html_nodes(xpath="/html/body/div[1]/div[2]/div/section[1]/section/section/article/section[1]/section/dl") %>%
  html_text()
write.csv2(input, "example_listing.csv")

As far as I can tell, the node holding the view count is no different from the others, yet neither its XPath nor its full XPath returns anything.


Solution

  • The problem is that the text in the element you are trying to scrape does not exist in the HTML you are parsing. You can check this as follows:

    library(magrittr)
    library(httr)
    
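    # paste0() joins the pieces without a separator; paste(..., collapse = "")
    # would still put spaces between the arguments and break the URL
    url <- paste0("https://www.ebay-kleinanzeigen.de/s-anzeige/",
                  "zahnpflege-fuer-hunde-und-katzen-extra-stark",
                  "-gegen-mundgeruch/1281544930-313-3170")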
    
    page <- url %>% GET %>% content("text")
    # These character offsets are specific to this particular download of the page
    substr(page, 72144, 72177)
    #> [1] "<span id=\"viewad-cntr-num\"></span>"
    

    Yet if you inspect this element in the developer tools in Chrome or Firefox, you can see that it does contain a number:

    <span id="viewad-cntr-num">00047</span>
    

    When you use a web browser, the page you request contains JavaScript, which the browser runs automatically. In this case, the script sends a further request to the server for extra information and inserts the result into the page.

    However, when you use rvest or similar tools, only the original HTML page is downloaded and the JavaScript is not run. The subsequent request is therefore never made, so the field stays empty and there is nothing in it to scrape.
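
    The same effect can be reproduced offline. The sketch below parses a small, made-up HTML string (`static_html` is a hypothetical stand-in, not the real page) in which the counter span is empty and the script that would fill it is never run:

    ```r
    library(rvest)

    # Hypothetical stand-in for the downloaded page: the counter span is empty
    # because nothing executes the script that would fill it.
    static_html <- paste0(
      "<html><body>",
      "<span id=\"viewad-cntr-num\"></span>",
      "<script>/* a browser would request the counter here */</script>",
      "</body></html>"
    )

    # Select the span by its id and read its text
    views <- static_html %>%
      read_html() %>%
      html_nodes("#viewad-cntr-num") %>%
      html_text()

    views
    #> [1] ""
    ```

    The node is found, but its text is empty, which is why every XPath variant returns nothing useful.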

    In this case, it is quite easy to find the link that returns the number of page views, since that link is embedded in the HTML page you downloaded:

    url2 <- strsplit(strsplit(page, "viewAdCounterUrl: '")[[1]][2], "'")[[1]][1]
    url2
    #> [1] "https://www.ebay-kleinanzeigen.de/s-vac-inc-get.json?adId=1281544930&userId=50592093"
    page_views <- url2 %>% GET %>% content("text")
    page_views
    #> [1] "{\"numVisits\":52,\"numVisitsStr\":\"00052\"}"
    

    You can see that the server has returned a short JSON string containing the number you were looking for. You can manually do what the JavaScript does and insert that number into the page like this:

    page_views <- strsplit(strsplit(page_views, "\":\"")[[1]][2], "\"")[[1]][1]
    tag <- "<span id=\"viewad-cntr-num\">"
    page <- sub(tag, paste0(tag, page_views), page)
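
    The string splitting above depends on the exact order of the fields in the response. As a possibly sturdier alternative, here is a minimal sketch that parses the JSON with the jsonlite package (an assumption, it is not used elsewhere in this answer) and inserts the value with a literal match. `snippet` is a hypothetical one-line stand-in for the downloaded page:

    ```r
    library(jsonlite)

    # The response text shown above, copied verbatim
    counter_json <- "{\"numVisits\":52,\"numVisitsStr\":\"00052\"}"
    counter <- fromJSON(counter_json)

    # Hypothetical one-line stand-in for the downloaded page
    snippet <- "<span id=\"viewad-cntr-num\"></span>"
    tag <- "<span id=\"viewad-cntr-num\">"

    # fixed = TRUE makes sub() treat the tag as literal text, not as a regex
    filled <- sub(tag, paste0(tag, counter$numVisitsStr), snippet, fixed = TRUE)
    filled
    #> [1] "<span id=\"viewad-cntr-num\">00052</span>"
    ```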
    

    Now you can parse the modified page with rvest:

    library(rvest)

    input <- page %>% 
      read_html() %>%
      html_nodes(xpath = "//section[@class=\"l-container\"]") %>%
      html_text() %>%
      extract(1)  # magrittr::extract() keeps only the first match
    

    This gives you the text you were looking for, including the number of page views.
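
    If you need the count for more than one listing, the lookup can be wrapped in small helpers. This is only a sketch: `extract_counter_url()` and `fetch_view_count()` are hypothetical names, jsonlite is an assumed extra dependency, and the code assumes every listing page embeds the counter link as `viewAdCounterUrl: '<url>'`, as this one did. The site may change that at any time.

    ```r
    library(httr)
    library(jsonlite)

    # Hypothetical helper: pull the counter URL out of the raw page text.
    # Assumes the page embeds it as  viewAdCounterUrl: '<url>'
    extract_counter_url <- function(page_text) {
      m <- regexpr("(?<=viewAdCounterUrl: ')[^']+", page_text, perl = TRUE)
      if (as.integer(m) == -1L) return(NA_character_)
      regmatches(page_text, m)
    }

    # Hypothetical helper: download a listing and return its view count.
    # Needs network access, so it is defined here but not called.
    fetch_view_count <- function(listing_url) {
      page_text <- content(GET(listing_url), "text", encoding = "UTF-8")
      counter_url <- extract_counter_url(page_text)
      if (is.na(counter_url)) stop("counter URL not found in page")
      counter <- fromJSON(content(GET(counter_url), "text", encoding = "UTF-8"))
      counter$numVisits
    }

    # Offline check on a fragment shaped like the script text in the page
    sample_text <- paste0(
      "viewAdCounterUrl: 'https://www.ebay-kleinanzeigen.de/",
      "s-vac-inc-get.json?adId=1281544930&userId=50592093',"
    )
    counter_url <- extract_counter_url(sample_text)
    counter_url
    #> [1] "https://www.ebay-kleinanzeigen.de/s-vac-inc-get.json?adId=1281544930&userId=50592093"
    ```

    Keeping the URL extraction separate from the download makes it possible to test the pattern on saved page text without hitting the server.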