Correct div.class combination when scraping with rvest


I want to pull out the list of constituencies in the table on this website using rvest: https://labour.org.uk/activist-hub/governance-and-legal-hub/selections/parliamentary-candidate-application-form/

This is what I have so far, which kind of gets me what I want:

library(rvest)

open <- read_html('https://labour.org.uk/activist-hub/governance-and-legal-hub/selections/parliamentary-candidate-application-form/')

open %>%
   html_nodes("div.col")

Which currently returns this:

{xml_nodeset (14)}
 [1] <div class="col" style="width:30%">\n        \t\t\t\t\t\t\t\tSeat        \t\t\t\t\t\t\t</div>
 [2] <div class="col" style="width:70%">\n        \t\t\t\t\t\t\t\tDeadline        \t\t\t\t\t\t\t</div>
 [3] <div class="col" style="width:30%">\n\t        \t\t\t\t\t\t\t\t\t\t<h6>Blackpool North and Fleetwood</h6>\n\t        \t\t\t\t\t\t\t\t\t</div>
 [4] <div class="col" style="width:70%">\n\t        \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t        \t\t\t\t\t\t\t\t\t</ul>\n</div>
 [5] <div class="col" style="width:30%">\n\t        \t\t\t\t\t\t\t\t\t\t<h6>Caerfryddin</h6>\n\t        \t\t\t\t\t\t\t\t\t</div>
 [6] <div class="col" style="width:70%">\n\t        \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Friday 7 July</li>\t\t        \t\t\t\t\t\t\t\t\t</ul>\n</div>
 [7] <div class="col" style="width:30%">\n\t        \t\t\t\t\t\t\t\t\t\t<h6>Great Yarmouth</h6>\n\t        \t\t\t\t\t\t\t\t\t</div>
 [8] <div class="col" style="width:70%">\n\t        \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t        \t\t\t\t\t\t\t\t\t</ul>\n</div>
 [9] <div class="col" style="width:30%">\n\t        \t\t\t\t\t\t\t\t\t\t<h6>Hemel Hempstead</h6>\n\t        \t\t\t\t\t\t\t\t\t</div>
[10] <div class="col" style="width:70%">\n\t        \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t        \t\t\t\t\t\t\t\t\t</ul>\n</div>
[11] <div class="col" style="width:30%">\n\t        \t\t\t\t\t\t\t\t\t\t<h6>Leeds South West and Morley</h6>\n\t        \t\t\t\t\t\t\t\t\t</div>
[12] <div class="col" style="width:70%">\n\t        \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t        \t\t\t\t\t\t\t\t\t</ul>\n</div>
[13] <div class="col" style="width:30%">\n\t        \t\t\t\t\t\t\t\t\t\t<h6>South Derbyshire</h6>\n\t        \t\t\t\t\t\t\t\t\t</div>
[14] <div class="col" style="width:70%">\n\t        \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t        \t\t\t\t\t\t\t\t\t</ul>\n</div>

What I want is a clean tibble with the constituency name in one column and the deadline in a second column. Can anyone give me some pointers on the correct CSS selector please?


Solution

  • You have already done the hard part. What remains is to separate the nodes with style="width:30%" (the seats) from the nodes with style="width:70%" (the deadlines). There are a couple of ways to do this.
    Below I read each node's "style" attribute and use it to split the nodes into two vectors, which I then combine into the final data frame. The first row is dropped because it holds the column headings ("Seat" and "Deadline").

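    library(dplyr)
    library(rvest)

    open <- read_html('https://labour.org.uk/activist-hub/governance-and-legal-hub/selections/parliamentary-candidate-application-form/')

    # Every cell of the table is a <div class="col">
    nodes <- open %>% html_nodes("div.col")

    # Split the cells by inline style: 30% wide = seat, 70% wide = deadline
    seat <- nodes[html_attr(nodes, "style") == "width:30%" ] %>% html_text() %>% trimws()
    deadline <- nodes[html_attr(nodes, "style") == "width:70%" ] %>% html_text() %>% trimws()

    # Drop the header row ("Seat", "Deadline") and renumber the rows from 1
    result <- data.frame(seat, deadline)[-1,]
    rownames(result) <- NULL
    result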
                               seat                 deadline
    1 Blackpool North and Fleetwood 12 Noon, Thursday 6 July
    2                   Caerfryddin   12 Noon, Friday 7 July
    3                Great Yarmouth 12 Noon, Thursday 6 July
    4               Hemel Hempstead 12 Noon, Thursday 6 July
    5   Leeds South West and Morley 12 Noon, Thursday 6 July
    6              South Derbyshire 12 Noon, Thursday 6 July
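
Since the question asks for a CSS selector and a tibble, another way is to put the style test into the selector itself with an attribute selector. The sketch below is only an illustration: it runs against a small inline HTML snippet modelled on the output printed in the question (the real page may differ, and the `div.row` wrapper is an assumption), and the names `html`, `page` and `result` are mine. It uses `html_elements()`, the current name for `html_nodes()` in rvest 1.0 and later.

```r
library(rvest)
library(tibble)

# Trimmed stand-in for the live page, based on the nodes printed in the question.
# The <div class="row"> wrappers are an assumption, not taken from the real site.
html <- '
<div class="row">
  <div class="col" style="width:30%">Seat</div>
  <div class="col" style="width:70%">Deadline</div>
</div>
<div class="row">
  <div class="col" style="width:30%"><h6>Blackpool North and Fleetwood</h6></div>
  <div class="col" style="width:70%"><ul><li>12 Noon, Thursday 6 July</li></ul></div>
</div>
<div class="row">
  <div class="col" style="width:30%"><h6>Caerfryddin</h6></div>
  <div class="col" style="width:70%"><ul><li>12 Noon, Friday 7 July</li></ul></div>
</div>'

page <- minimal_html(html)

# The attribute selector [style='...'] matches the exact inline style string
seat <- page %>%
  html_elements("div.col[style='width:30%']") %>%
  html_text(trim = TRUE)

deadline <- page %>%
  html_elements("div.col[style='width:70%']") %>%
  html_text(trim = TRUE)

# Build the tibble and drop the first row, which holds the column headings
result <- tibble(seat, deadline)[-1, ]
result
```

To run this against the live site, replace `minimal_html(html)` with the `read_html()` call used above. The selector only matches when the style attribute is exactly `width:30%` or `width:70%`, so it would need adjusting if the site changes its inline styles.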