I'm trying to scrape prices from Amazon. This used to work, but now it doesn't, and I don't know whether Amazon has added some protection or whether I'm using rvest incorrectly.
I'm trying to scrape with this code:
library(rvest)
library(httr) # user_agent() comes from httr
my_url <- "https://www.amazon.com/s?k=reusable+straws"
# Named `ua` so the variable does not shadow the user_agent() function
ua <- user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120")
my_session <- session(my_url, ua)
my_session %>%
  html_elements(".a-offscreen")
I can scrape the <a class> element above the price just fine, and I can scrape the <span class="a-size-base a-color-secondary"> below it, but none of the price spans.
Any ideas?
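Consider using a tool like SelectorGadget to identify the correct HTML elements to scrape. The code below selects each product box first, then extracts the title, price and rating from within that box: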
library(tidyverse)
library(rvest)

"https://www.amazon.com/s?k=reusable+straws" %>%
  read_html() %>%
  html_elements(".puis-card-border") %>% # Select each product box
  map_dfr(~ tibble( # Map over every box to extract info
    title = html_element(.x, ".a-color-base.a-text-normal") %>%
      html_text2(),
    price = html_element(.x, ".a-price") %>%
      html_text2(),
    rating = html_element(.x, ".aok-align-bottom") %>%
      html_text2()
  ))
# A tibble: 60 x 3
title price rating
<chr> <chr> <chr>
1 "HSHIJYA 18 Pack Reusable Stainless Steel Straws w~ $18.~ 4.7 o~
2 "Piteno\u00ae 16-Pack Reusable Glass Straws, Clear~ $6.9~ 4.7 o~
3 "Softy Straws Premium Reusable Stainless Steel Dri~ $12.~ 4.7 o~
4 "15 FITS ALL TUMBLERS STRAWS - Reusable Silicone S~ $14.~ 4.6 o~
5 "Tronco Set of 6 Stainless Steel Reusable Metal St~ $9.9~ 4.6 o~
6 "Hiware 12-Pack Reusable Stainless Steel Metal Str~ $6.2~ 4.8 o~
7 "24 PCS, Reusable Straws with 4 Brushes, 10.5\" Lo~ $5.9~ 4.6 o~
8 "Kynup Reusable Straws, 4Pack Collapsible Portable~ $9.9~ 4.6 o~
9 "Ello Impact Reusable Hard Plastic Straws with Cle~ $3.4~ 4.7 o~
10 "ALINK 10.5 in Long Rainbow Colored Reusable Trita~ $4.9~ 4.7 o~
# i 50 more rows
# i Use `print(n = ...)` to see more rows
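
To see why extracting fields per product box helps, here is a minimal offline sketch of the same idea. It runs against a small inline HTML snippet rather than Amazon, and the class names (.card, .title, .price) are made up for illustration. It also calls html_element() directly on the whole set of boxes instead of going through map_dfr(), which gives the same one-row-per-box result. The final price cleanup step is an assumption about what you might want next, not part of the code above.

```r
library(rvest)

# Hypothetical markup standing in for a results page; the class names
# .card, .title and .price are invented for this example.
page <- minimal_html('
  <div class="card"><span class="title">Steel straws</span><span class="price">$18.99</span></div>
  <div class="card"><span class="title">Glass straws</span></div>
  <div class="card"><span class="title">Silicone straws</span><span class="price">$1,014.50</span></div>
')

# One node per product box
cards <- html_elements(page, ".card")

# html_element() (singular) returns exactly one result per box, so a box
# without a price gives NA instead of shifting the later prices up a row.
products <- data.frame(
  title = html_text2(html_element(cards, ".title")),
  price_text = html_text2(html_element(cards, ".price"))
)

# Drop everything except digits and the decimal point, then convert.
# NA stays NA for boxes that had no price element.
products$price <- as.numeric(gsub("[^0-9.]", "", products$price_text))

print(products)
```

Selecting ".price" across the whole page with html_elements() would return only two values here, with no way to tell which product is missing one. Going box by box keeps titles and prices aligned.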