For purely educational purposes, I'm trying to scrape reviews from a Dutch retail website using RSelenium (link to website). However, I'm struggling to extract the review information in the right format. Ultimately, my goal is to loop over all the reviews and extract only the pieces of information I need from each one (e.g. just the reviewer's location).
This is the HTML for the reviews (piece 1) and the actual info within a specific review (piece 2):
Now I have saved the list of reviews like so:
library(RSelenium)
rdriver <- rsDriver(browser = "chrome",
                    chromever = "101.0.4951.15",
                    port = 2232L
)
driver <- rdriver[["client"]]
reviews <- driver$findElements(using = 'xpath', '//*[@class="review js-review"]')
review <- reviews[[1]]
review$getElementText()
The final command gives me all the text in the first review: the title of the review, the reviewer's name, age and location, the actual review text, and so on:
1 "Zoek niet verder als je een tv zoek met deze grote en alle laatste Sma\nGer1965rotterdam 60-69 jaar Rotterdam 18 april 2022 Heeft dit artikel gekocht\nIk raad dit product aan\nGoede beeldkwaliteit\nEenvoudig in gebruik\nJuiste formaat\nHeeft alles wat een tv moet hebben onder ander Sat.tv.ontvanger en alle nieuwste Smart Mogelijkheden hij is eind februari 2022 op de Hollandse markt gekomen dus nieuwer kan het niet !!!!!!!!!\nVond je dit een nuttige review?\n2 0"
But I would actually like to fetch only certain parts of the review, for instance just the location of the reviewer, in this case 'Rotterdam' at the end of the first line.
I tried:
check <- review$findElement(using = 'xpath', './/*[@data-test="review-author-city"]')
check$getElementText()
But it still gives me the entire piece of text like before, and not just 'Rotterdam'. Does anyone know what I'm doing wrong? I've searched online a lot to resolve this issue, but can't seem to find an answer. It should be possible to loop over a list of web elements and extract only certain pieces of info from them, right? Like I said, I'm doing this for educational purposes, so I'm pretty new to the material.
Any help is greatly appreciated!
I think the problem with your code is that your reviews list contains webElement objects. As far as I know, you cannot use findElement on a webElement object.
To get the locations for all reviews, you could select them directly with:
driver$findElements(using = 'xpath', '//li[@data-test="review-author-city"]')
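Note that findElements returns a list of elements, not their text. A minimal sketch of turning that list into a character vector could look like the following; element_texts is a hypothetical helper name (not part of RSelenium), and the snippet assumes driver is the client from the question and that the page is already loaded.

```r
# Hypothetical helper: collect the visible text of each element in a list.
# getElementText() returns a list, so [[1]] pulls out the string itself.
element_texts <- function(elements) {
  vapply(elements, function(el) el$getElementText()[[1]], character(1))
}

# Only runs when a Selenium client named `driver` exists (as in the question).
if (exists("driver")) {
  city_elements <- driver$findElements(
    using = "xpath",
    value = '//li[@data-test="review-author-city"]'
  )
  cities <- element_texts(city_elements)
}
```

One caveat with this approach: if some reviews have no city, the resulting vector will be shorter than the list of reviews, so the values can no longer be matched back to individual reviews.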
Update: I tried it out myself in RSelenium and found that there is a method called findChildElement. You can find more information about it here: https://rdrr.io/cran/RSelenium/man/webElement-class.html
In your case this should work:
driver <- rdriver[["client"]]
reviews <- driver$findElements(using = 'xpath', '//*[@class="review js-review"]')
review <- reviews[[1]]
check <- review$findChildElement(using = 'xpath', './/*[@data-test="review-author-city"]')
check$getElementText()
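Since the goal is to loop over all reviews, the same call can be applied to every element in the list. The following is only a sketch: extract_city is a hypothetical helper name, and it assumes reviews is the list created above. It returns NA when a review has no city element, on the assumption that findChildElement raises an error in that case.

```r
# Hypothetical helper: return the reviewer's city for one review element,
# or NA if the child element cannot be found.
extract_city <- function(review) {
  tryCatch({
    el <- review$findChildElement(
      using = "xpath",
      value = './/*[@data-test="review-author-city"]'
    )
    el$getElementText()[[1]]  # getElementText() returns a list
  }, error = function(e) NA_character_)
}

# Only runs when `reviews` exists (the list of review elements from above).
if (exists("reviews")) {
  cities <- vapply(reviews, extract_city, character(1))
}
```

Because every review yields exactly one value, cities stays aligned with reviews, which makes it easier to combine with other fields extracted the same way.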