In R/rvest, how to get the href (the link behind the clickable text)


In R/rvest, with the code below I can run html_text(), but when I try to get the link behind each piece of text, web %>% html_node("div.p13n-desktop-grid") %>% html_attr(name='href') fails. Can anyone help? Thanks!

library(rvest)
url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
web <- rvest::read_html(url)
web %>% html_node("div.p13n-desktop-grid") %>% html_text() %>% strsplit("#") # ok
web %>% html_node("div.p13n-desktop-grid") %>% html_attr(name='href') # want the link behind each clickable text, but this fails

Solution

  • For (shortened) product links and link texts:

    library(rvest)
    library(dplyr)
    
    url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
    web <- rvest::read_html(url)
    
    # "div.p13n-desktop-grid a[tabindex] + a" : 
    # text links are adjacent siblings of image links & image links have tabindex attribute
    
    prod_links <- web %>% html_elements("div.p13n-desktop-grid a[tabindex] + a")
    tibble(
      # shorten links, keep only the /dp/item_id/ part
      href =  prod_links %>% html_attr(name='href') %>% sub('.*(/dp/\\w*/).*','www.amazon.com\\1', .),
      descr = prod_links %>% html_text2()
    )
    #> # A tibble: 30 × 2
    #>    href                          descr                                          
    #>    <chr>                         <chr>                                          
    #>  1 www.amazon.com/dp/B07BR3F9N6/ Official Creality Ender 3 3D Printer Fully Ope…
    #>  2 www.amazon.com/dp/B07FFTHMMN/ Official Creality Ender 3 V2 3D Printer Upgrad…
    #>  3 www.amazon.com/dp/B09QGTTQKG/ ANYCUBIC Kobra 3D Printer Auto Leveling, FDM 3…
    #>  4 www.amazon.com/dp/B07GYRQVYV/ Official Creality Ender 3 Pro 3D Printer with …
    #>  5 www.amazon.com/dp/B083GTS8XJ/ ANYCUBIC Wash and Cure Station, Newest Upgrade…
    #>  6 www.amazon.com/dp/B09FXYSFBV/ ANYCUBIC Photon Mono 4K 3D Printer, 6.23'' Mon…
    #>  7 www.amazon.com/dp/B07J9QGP7S/ ANYCUBIC Mega-S New Upgrade 3D Printer with Hi…
    #>  8 www.amazon.com/dp/B07Z9C9T42/ ELEGOO 5PCs FEP Release Film Mars LCD 3D Print…
    #>  9 www.amazon.com/dp/B08SPXYND4/ Voxelab Aquila 3D Printer with Full Alloy Fram…
    #> 10 www.amazon.com/dp/B07DYL9B2S/ Official Creality Ender 3 S1 3D Printer with D…
    #> # … with 20 more rows
    

    Created on 2022-06-16 by the reprex package (v2.0.1)
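
    The selector and the link shortening can also be tried offline. The following is only a sketch: the markup is hypothetical and merely mimics the structure assumed above (an image link with a tabindex attribute followed by a sibling text link), and the product IDs are made up. It also shows why the original attempt fails: the div itself has no href attribute, so html_attr() returns NA for it.

    ```r
    library(rvest)

    # Hypothetical markup (an assumption, not Amazon's real HTML):
    # each product has an image link with tabindex, then a sibling text link.
    page <- minimal_html('
      <div class="p13n-desktop-grid">
        <div>
          <a tabindex="-1" href="/Some-Product/dp/B000000001/ref=zg_1"><img alt="img"></a>
          <a href="/Some-Product/dp/B000000001/ref=zg_1">First product</a>
        </div>
        <div>
          <a tabindex="-1" href="/Other-Product/dp/B000000002/ref=zg_2"><img alt="img"></a>
          <a href="/Other-Product/dp/B000000002/ref=zg_2">Second product</a>
        </div>
      </div>')

    # The <div> carries no href attribute, so this is NA
    div_href <- page %>% html_element("div.p13n-desktop-grid") %>% html_attr("href")

    # Select the text links first, then read one href per link
    links <- page %>% html_elements("div.p13n-desktop-grid a[tabindex] + a")
    hrefs <- html_attr(links, "href")

    # Keep only the /dp/item_id/ part of each link
    short <- sub(".*(/dp/\\w*/).*", "www.amazon.com\\1", hrefs)
    texts <- html_text2(links)
    ```

    With this made-up input, short holds the two shortened links and texts holds the two link texts; on the live page the result depends on the markup Amazon serves at the time.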

    There are 50 products per page, but only the first 30 are included in the grid; the rest are loaded in small chunks as you scroll down. Unless descriptions are needed, it's a bit easier to just collect all IDs from the data-client-recs-list attribute and build the links from those:

    library(rvest)
    library(dplyr)
    library(jsonlite)
    
    url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
    web <- rvest::read_html(url)
    client_recs_list <- web %>% 
      html_element("div.p13n-desktop-grid") %>% 
      html_attr(name='data-client-recs-list') %>% 
      fromJSON(flatten = TRUE) %>% 
      tibble()
    
    client_recs_list %>% 
      select(1,2) %>%
      mutate(prod_link = paste0('www.amazon.com/dp/', id, '/'), .after = id)
    #> # A tibble: 50 × 3
    #>    id         prod_link                     metadataMap.render.zg.rank
    #>    <chr>      <chr>                         <chr>                     
    #>  1 B07BR3F9N6 www.amazon.com/dp/B07BR3F9N6/ 1                         
    #>  2 B07FFTHMMN www.amazon.com/dp/B07FFTHMMN/ 2                         
    #>  3 B09Y54CWXY www.amazon.com/dp/B09Y54CWXY/ 3                         
    #>  4 B07GYRQVYV www.amazon.com/dp/B07GYRQVYV/ 4                         
    #>  5 B09L81S4L7 www.amazon.com/dp/B09L81S4L7/ 5                         
    #>  6 B09JNMRS7R www.amazon.com/dp/B09JNMRS7R/ 6                         
    #>  7 B09WHW8YXS www.amazon.com/dp/B09WHW8YXS/ 7                         
    #>  8 B09W5CSFZQ www.amazon.com/dp/B09W5CSFZQ/ 8                         
    #>  9 B09KXNYJLH www.amazon.com/dp/B09KXNYJLH/ 9                         
    #> 10 B09R4QDVY5 www.amazon.com/dp/B09R4QDVY5/ 10                        
    #> # … with 40 more rows
    

    Created on 2022-06-17 by the reprex package (v2.0.1)
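
The attribute-based approach can be checked without hitting the site. This is a minimal sketch under stated assumptions: the JSON below is a hypothetical, heavily simplified stand-in for the real data-client-recs-list value (only an id and a nested rank field, with made-up IDs), chosen so that flattening yields the same column name as in the output above.

```r
library(rvest)
library(jsonlite)

# Hypothetical attribute value (an assumption; the real one has more fields)
recs_json <- paste0(
  '[{"id":"B000000001","metadataMap":{"render.zg.rank":"1"}},',
  '{"id":"B000000002","metadataMap":{"render.zg.rank":"2"}}]'
)

# Minimal page with the JSON stored in a data attribute of the grid <div>
page <- minimal_html(paste0(
  "<div class='p13n-desktop-grid' data-client-recs-list='", recs_json, "'></div>"
))

# Read the attribute as a string and parse it into a flat data frame
recs <- page %>%
  html_element("div.p13n-desktop-grid") %>%
  html_attr("data-client-recs-list") %>%
  fromJSON(flatten = TRUE)

# Build the product links from the IDs
recs$prod_link <- paste0("www.amazon.com/dp/", recs$id, "/")
```

Here html_attr() works on the div because data-client-recs-list really is an attribute of that element, unlike href, which belongs to the a elements inside it.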