Tags: r, purrr, rvest, tibble

rvest: handling different number of nested classes


I'm not sure how to describe the problem, so I will go directly to the example.
I have an HTML document (html_doc) that looks like:

<div class="main">
   <h2>A</h2>
   <div class="route">
      X<br />
   </div>
   <div class="route">
      Y<br />
   </div>
</div>
<div class="main">
   <h2>B</h2>
   <div class="route">
      Z<br />
   </div>
</div>

Inside each main there are more elements besides the title and route, so I'm looking for a scalable solution. The classes inside main are always the same.
I would like to get a tibble that looks like:

id | title | route
1  | A     | X
1  | A     | Y
2  | B     | Z 

My current attempt gives an error because title and route have a different number of rows. I also don't know how to get an index for each main div.

tibble(
  title = html_doc %>% html_nodes("h2") %>% html_text(), 
  route = html_doc %>% html_nodes(".route") %>% html_text()
  ) 

Solution

  • This follows a strategy similar to the one in your previous question. The trick is to loop through each parent node, create a separate data frame of the title and routes for that node, and then combine all of the individual data frames into the final result.
    This solution depends on there being only one title per parent node.

    library(rvest)
    library(dplyr)
    
    page<-read_html('<div class="main">
       <h2>A</h2>
       <div class="route">
          X<br />
       </div>
       <div class="route">
          Y<br />
       </div>
    </div>
    <div class="main">
       <h2>B</h2>
       <div class="route">
          Z<br />
       </div>
    </div>')
    
    #find all of the parent nodes
    mainnodes <- page %>% html_nodes("div.main")
    
    #loop through each parent node and extract the info from its children
    dfs<-lapply(seq_along(mainnodes), function(id){
      #assume a single title node (recycled across the routes) or the same number of titles as routes
      title <- mainnodes[id] %>% html_nodes("h2") %>% html_text() %>% trimws()
      #extract every route node under this parent
      route <- mainnodes[id] %>% html_nodes("div.route") %>% html_text() %>% trimws()
    
      tibble(id, title, route)
    })
    
    answer<-bind_rows(dfs)
    answer
    
    # A tibble: 3 x 3
         id title route
      <int> <chr> <chr>
    1     1 A     X    
    2     1 A     Y    
    3     2 B     Z
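
Since the question is tagged purrr, the same per-parent loop can also be sketched with purrr::map_dfr(), which builds one tibble per main node and row-binds them in a single step. This is only a variation on the answer above, not a different method. It assumes rvest 1.0.0 or later for html_element() and html_elements(); html_element() (singular) returns exactly one match per parent, or a missing node whose text is NA, which makes the "one title per node" assumption explicit. map_dfr() still works but is marked as superseded in recent purrr releases, where map() followed by list_rbind() is the suggested replacement.

```r
library(rvest)
library(purrr)
library(tibble)

# Same sample markup as in the question
page <- read_html('<div class="main">
   <h2>A</h2>
   <div class="route">
      X<br />
   </div>
   <div class="route">
      Y<br />
   </div>
</div>
<div class="main">
   <h2>B</h2>
   <div class="route">
      Z<br />
   </div>
</div>')

# Find all of the parent nodes
mainnodes <- html_elements(page, "div.main")

# Build one tibble per parent node and row-bind them
answer <- map_dfr(seq_along(mainnodes), function(id) {
  node <- mainnodes[[id]]
  tibble(
    id = id,
    # html_element() returns a single match, so title always has length 1
    title = node %>% html_element("h2") %>% html_text(trim = TRUE),
    # html_elements() returns every route under this parent
    route = node %>% html_elements("div.route") %>% html_text(trim = TRUE)
  )
})

answer
```

To pull further fields out of each main, add another column inside tibble() using the matching selector. Fields that occur once per parent are recycled across that parent's routes, as title is here.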