I'd like to scrape Amazon customer reviews. My code works fine as long as no information is "missing", but converting the scraped data to a data frame fails when parts of the data are missing (error: arguments imply differing number of rows).
Here is an example:
library(rvest)

url <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent")

get_reviews <- function(url) {
  title <- url %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text()
  author <- url %>%
    html_nodes(".author") %>%
    html_text()
  df <- data.frame(title, author, stringsAsFactors = FALSE)
  return(df)
}

results <- get_reviews(url)
In this case, "missing" means that no author information is provided for several customer reviews ("Ein Kunde" simply means "A customer" in German).
Does anyone have an idea on how to fix this? Any help is appreciated. Thanks in advance!
I would say the answer to your question is here (link).
Loop over each review node matching
'div[id*=customer_review]'
and then check whether it has a value for the author or not.
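A minimal sketch of that idea, assuming each review sits in its own 'div[id*=customer_review]' container. The inline HTML and the names in it are made up for illustration (the real Amazon markup may differ), and the function argument is renamed to `page` here. The point is that `html_node()` (singular), applied to a set of review nodes, returns exactly one result per review, so a review without an author yields `NA` instead of shortening the vector:

```r
library(rvest)

# Hypothetical stand-in for a review page: the second review has no author node.
page <- read_html('<html><body><div id="cm_cr-review_list">
  <div id="customer_review-1">
    <a class="a-color-base">Great book</a>
    <a class="author">Anna</a>
  </div>
  <div id="customer_review-2">
    <a class="a-color-base">Not bad</a>
  </div>
</div></body></html>')

get_reviews <- function(page) {
  # One node per review, so every field is looked up inside the same parent.
  reviews <- html_nodes(page, "div[id*=customer_review]")
  data.frame(
    # html_node() keeps one entry per review; html_text() gives NA where no match exists.
    title  = html_text(html_node(reviews, ".a-color-base")),
    author = html_text(html_node(reviews, ".author")),
    stringsAsFactors = FALSE
  )
}

results <- get_reviews(page)
```

Because both columns are built from the same list of review nodes, they always have the same length, and the missing author in the second review shows up as `NA` in `results$author`. On the real page the title selector may match several elements per review; `html_node()` then takes the first one, so the selector might need tightening.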