I am using rvest to scrape some information off websites as a little hobby project. However, for one particular node I try to extract, the text comes back with CSS styling code prepended to it and mixed into it.
URL <- 'https://www.thepioneerwoman.com/food-cooking/recipes/a41138141/apple-pie-cookies-recipe/'
recipe <- rvest::read_html(URL)
directions <- rvest::html_nodes(recipe, ".et3p2gv0") %>%
  rvest::html_text() %>%
  trimws()
This is what appears in the output:
[1] ".css-dt22uw{display:none;visibility:hidden;}Step .css-6ds1rq{border-right:thin solid #b20039;height:1rem;left:-3rem;position:absolute;top:0.45rem;width:1.4rem;}1.css-1baulvz{display:inline-block;}Melt the butter in a medium saucepan over medium-high heat. Add the apples and cook until they start to soften, 3 to 4 minutes. Stir in the brown sugar and lemon juice, bring to a simmer and cook until the apples are soft and the liquid is starting to reduce, 3 to 4 more minutes. Whisk the apple juice and cornstarch in a small bowl and add it to the pan. Cook, stirring, until the mixture thickens, about 1 more minute. Remove from the heat and let cool. "
I have tried a variety of different nodes and CSS selectors, but the styling code still appears in the output.
I might end up just using gsub() to remove it from the string, but would rather not.
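XPath text() is quite handy at times. It selects only the text nodes that are direct children of an element, so the contents of nested <style> elements are skipped. You can mix and match it with CSS selectors, or rewrite the whole selector as XPath: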
library(rvest)  # attaches the %>% pipe used below
URL <- 'https://www.thepioneerwoman.com/food-cooking/recipes/a41138141/apple-pie-cookies-recipe/'
recipe <- rvest::read_html(URL)
# get a list of <li> elements with a CSS selector, then extract the text nodes of each element with XPath
directions_1 <- rvest::html_elements(recipe, "ol.et3p2gv0 li") %>%
  rvest::html_elements(xpath = "./text()") %>%
  rvest::html_text() %>%
  trimws()
# or use only XPath
directions_2 <- rvest::html_elements(recipe, xpath = '//ol[contains(@class, "et3p2gv0")]/li/text()') %>%
  rvest::html_text() %>%
  trimws()
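To see why this works without hitting the live site, here is a small offline sketch. The markup is a hypothetical, simplified stand-in for the page (the <style> rules and step text are made up for illustration, not copied from the site), and it assumes rvest 1.0.0 or later for minimal_html():

```r
library(rvest)

# Hypothetical markup: <style> elements nested inside each <li>,
# roughly imitating what the question's output suggests
html <- minimal_html(paste0(
  '<ol class="css-abc et3p2gv0">',
  '<li><style>.css-a{display:none;}</style>',
  '<span>Step <style>.css-b{height:1rem;}</style>1</span>',
  '<style>.css-c{display:inline-block;}</style>Melt the butter.</li>',
  '<li><style>.css-a{display:none;}</style>',
  '<span>Step 2</span>Roll out the dough.</li>',
  '</ol>'
))

items <- html_elements(html, "ol.et3p2gv0 li")

# html_text() concatenates every descendant text node,
# so the contents of the nested <style> elements are included
all_text <- html_text(items)

# ./text() keeps only the text nodes that are direct children of each <li>
direct_text <- items %>%
  html_elements(xpath = "./text()") %>%
  html_text() %>%
  trimws()

print(all_text)
print(direct_text)
```

One thing to keep in mind: ./text() also skips text nested in other child elements (for example <strong> or <a>), so if a step contains inline formatting, that part of the sentence would be missing from the result.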