This newspaper website puts each paragraph of an article in a separate <p> element whose class attribute value starts with the word article.
How can I get all the paragraphs whose class attribute starts with article from the tz2 object?
require(rvest)
url = 'http://taz.de/Kongo-Kunst-im-Bruesseler-Afrikamuseum/!5563620/'
tz = read_html(url)
tz2 = tz %>%
  xml_nodes(xpath = "//*[@class='sectbody']") %>%
  xml_children()
My attempts:
# get one paragraph by class attribute
tz2 %>%
  xml_nodes(xpath = "//p[@class='article first odd Initial']") %>%
  xml_text()
# regex-like get all 'article' paragraphs
tz2 %>%
  xml_nodes(xpath = "//p[@starts-with(@class, 'article')]") %>%
  xml_text()
CSS selectors are a tad simpler than XPath. For classes, the general syntax is tag.class, and if the tag is omitted, any tag matches, so .article matches every element with the class article. A space between two selectors means: among the descendants (not just the direct children) of elements matching the first selector, look for elements matching the second. So:
library(rvest)
tz <- read_html('http://taz.de/Kongo-Kunst-im-Bruesseler-Afrikamuseum/!5563620/')
paragraphs <- tz %>% html_nodes('.sectbody p.article') %>% html_text()
str(paragraphs)
#> chr [1:20] "TERVUREN taz | Wer dieses Jahr Belgiens berühmtes Afrikamuseum in Tervuren vor den Toren Brüssels besucht, kom"| __truncated__ ...
paragraphs[1]
#> [1] "TERVUREN taz | Wer dieses Jahr Belgiens berühmtes Afrikamuseum in Tervuren vor den Toren Brüssels besucht, kommt ins Staunen. Wo früher das Musée royal d’Afrique Centrale (MRAC) alte Kolonialsammlungen darbot, zelebriert heute das renovierte „Africa Museum“, wie es jetzt heißt, den Reichtum des Kongo mit all seinen hellen und dunklen Seiten."
Note that this works because classes in HTML are separated by spaces, so an element with class="class1 class2" is matched by both .class1 and .class2. Here's a great tutorial if you'd like to learn more about CSS selectors.
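If you would rather stay with XPath, the second attempt in the question is close: starts-with() is a function, not an attribute, so it should not carry a leading @. Also, an expression that begins with // searches the whole document even when it is applied to a node set, so use .// to stay inside the selected nodes. The sketch below is only an illustration under assumptions: the inline HTML is a made-up stand-in for the article page (so it runs without network access), and the variable names are hypothetical.

```r
library(rvest)

# Hypothetical stand-in for the article page, so the example runs offline.
html <- '<div class="sectbody">
  <p class="article first odd Initial">First paragraph.</p>
  <p class="caption">Not part of the article.</p>
  <p class="article even">Second paragraph.</p>
</div>
<p class="article odd">Outside sectbody.</p>'
page <- read_html(html)

# CSS: each space-separated class token is matched on its own.
css_text <- page %>%
  html_nodes(".sectbody p.article") %>%
  html_text()

# XPath: starts-with() is called as a function, without a leading "@".
xpath_text <- page %>%
  html_nodes(xpath = "//*[@class='sectbody']/p[starts-with(@class, 'article')]") %>%
  html_text()

# Relative XPath: ".//" searches only below the nodes selected so far.
relative_text <- page %>%
  html_nodes(".sectbody") %>%
  html_nodes(xpath = ".//p[starts-with(@class, 'article')]") %>%
  html_text()

print(css_text)
print(identical(css_text, xpath_text))
print(identical(css_text, relative_text))
```

All three calls should return the two article paragraphs inside sectbody and skip both the caption and the paragraph outside it. One difference to keep in mind: starts-with() compares the raw attribute string, so an element with class="first article" would be matched by the CSS selector but not by this XPath expression.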