Tags: r, web-scraping, screen-scraping, rvest

How to pull a product link from a customer profile page on Amazon


I'm trying to get the product link from a customer's profile page using R's rvest package.

I've looked at various questions on Stack Overflow, including "could not read webpage with read_html using rvest package from r", but nothing I've tried returns the correct result.

For example on this profile page:

https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8

I'd like to be able to return this link, with the end goal of extracting the product ID: B01A51S9Y2

https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp

library(dplyr)
library(rvest)
library(stringr)
library(httr)

# Request the profile page with a custom user-agent header
url <- 'https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)

page %>%
  html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
  html_text()

# I tested whether the ASIN appears anywhere in the page text, with no luck

test <- page %>%
  html_nodes("#a-page") %>%
  html_text()

grepl("B01A51S9Y2", test)

Thanks for the tip on RSelenium, @Qharr. That is helpful, but I'm still unsure how to extract the link or ASIN.

library(RSelenium)

driver <- rsDriver(browser = c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")
prod <- rd$findElement(using = "css selector", '.profile-at-product-box-link')
# Note the parentheses: without them R returns the method itself instead of calling it
prod$getElementText()

This doesn't really return anything.

After switching to getElementAttribute('href'), I was able to get the link:

prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')

# seq_along() is safe even when no elements were found
for (link in seq_along(prod)) {
  print(prod[[link]]$getElementAttribute('href'))
}


Solution

  • That information is loaded dynamically by a POST request the page makes after the initial load, which your rvest request doesn't capture. That subsequent request returns JSON containing the ASINs, the product links, and so on.

    You can find it in the Network tab of the browser dev tools (F12). Press F5 to refresh the page, then examine the network traffic.

    It is not a simple POST request to mimic, so I would just use RSelenium to let the page render and then use the CSS selector

    .profile-at-product-box-link

    to gather a collection of web elements that you can loop over, extracting the href attribute from each.
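
As a rough sketch of the approach described above, the two helpers below collect the href of every matching element and then pull the ASIN out of each link. The function names `get_product_links` and `extract_asin` are hypothetical, made up for illustration. The sketch also assumes `rd` is an RSelenium client that has already navigated to the profile page and that the page has finished rendering, and that the ASIN is the 10-character code following `/dp/`, as in the example link from the question.

```r
# Hypothetical helper: pull the 10-character ASIN out of Amazon product URLs.
# Assumes the ASIN directly follows a "/dp/" path segment.
extract_asin <- function(urls) {
  # regexec() keeps the capture group; each list entry is c(full_match, asin)
  m <- regmatches(urls, regexec("/dp/([A-Z0-9]{10})", urls))
  # Entries with no match are character(0); map those to NA
  vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
         character(1))
}

# Hypothetical helper: collect hrefs from an already-navigated RSelenium
# client `rd`, using the CSS selector suggested in the answer.
get_product_links <- function(rd, css = ".profile-at-product-box-link") {
  elems <- rd$findElements(using = "css selector", value = css)
  hrefs <- vapply(elems, function(el) {
    href <- el$getElementAttribute("href")  # returned as a list
    if (length(href) == 0 || is.null(href[[1]])) {
      NA_character_
    } else {
      as.character(href[[1]])
    }
  }, character(1))
  unique(hrefs[!is.na(hrefs)])
}

# Usage, assuming `rd` was created and navigated as in the question:
# links <- get_product_links(rd)
# asins <- extract_asin(links)
```

Keeping the ASIN extraction separate from the browser step means it can be checked against a known URL without starting Selenium. If the product boxes load slowly, a short pause or retry before calling `get_product_links` may be needed.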