I'm not sure how to describe the problem, so I will go directly to the example.
I have an HTML document (html_doc) that looks like:
<div class="main">
<h2>A</h2>
<div class="route">
X<br />
</div>
<div class="route">
Y<br />
</div>
</div>
<div class="main">
<h2>B</h2>
<div class="route">
Z<br />
</div>
</div>
Inside each main, there are more elements besides title and route, so I'm looking for a scalable solution. The classes inside main are always the same.
I would like to get a tibble that looks like:
id | title | route
1 | A | X
1 | A | Y
2 | B | Z
My current attempt gives an error because title and route have different numbers of rows. I also don't know how to index the main class.
tibble(
title = html_doc %>% html_nodes("h2") %>% html_text(),
route = html_doc %>% html_nodes(".route") %>% html_text()
)
This follows a similar strategy to the one in your previous question. The trick is to loop through each parent node, create a separate data frame of the title and routes for that node, and then combine the individual data frames into the final result.
This solution depends on having only one title per parent node, so that the single title can be recycled across all of that node's routes.
library(rvest)
library(dplyr)
page <- read_html('<div class="main">
<h2>A</h2>
<div class="route">
X<br />
</div>
<div class="route">
Y<br />
</div>
</div>
<div class="main">
<h2>B</h2>
<div class="route">
Z<br />
</div>
</div>')
# find all of the parent nodes
mainnodes <- page %>% html_nodes("div.main")
# loop through each parent node and extract the info from its children
dfs <- lapply(seq_along(mainnodes), function(id) {
  # assume a single title node per parent (it is recycled across the routes)
  title <- mainnodes[id] %>% html_nodes("h2") %>% html_text() %>% trimws()
  # extract the text of every route node within this parent
  route <- mainnodes[id] %>% html_nodes("div.route") %>% html_text() %>% trimws()
  tibble(id, title, route)
})
# stack the per-parent tibbles into one result
answer <- bind_rows(dfs)
answer
# A tibble: 3 x 3
     id title route
  <int> <chr> <chr>
1     1 A     X
2     1 A     Y
3     2 B     Z
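If some main blocks might contain no route at all, or more than one h2, the loop above can be hardened a little. The following is only a sketch under those assumptions: extract_main is a hypothetical helper name (not part of rvest or dplyr), and the sample HTML is made up for illustration. It uses html_node() (singular) so that exactly one title is returned per block, and it falls back to NA when a block has no routes.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: one row per route in a single "main" node,
# or a single row with route = NA when the node has no routes.
extract_main <- function(node, id) {
  # html_node() returns only the first match, so title always has length 1
  title <- node %>% html_node("h2") %>% html_text() %>% trimws()
  route <- node %>% html_nodes("div.route") %>% html_text() %>% trimws()
  if (length(route) == 0) route <- NA_character_
  tibble(id = id, title = title, route = route)
}

# Made-up sample: the second block deliberately has no route
page <- read_html('<div class="main"><h2>A</h2>
<div class="route">X<br /></div>
<div class="route">Y<br /></div>
</div>
<div class="main"><h2>B</h2></div>')

mainnodes <- page %>% html_nodes("div.main")
result <- bind_rows(
  lapply(seq_along(mainnodes), function(i) extract_main(mainnodes[[i]], i))
)
result
```

With this sample, block B should still appear in the result as one row with a missing route instead of being dropped.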