I want to pull out the list of constituencies in the table on this website using rvest: https://labour.org.uk/activist-hub/governance-and-legal-hub/selections/parliamentary-candidate-application-form/
This is what I have so far, which kind of gets me what I want:
library(rvest)
open <- read_html('https://labour.org.uk/activist-hub/governance-and-legal-hub/selections/parliamentary-candidate-application-form/')
open %>%
  html_nodes("div.col")
Which currently returns this:
{xml_nodeset (14)}
[1] <div class="col" style="width:30%">\n \t\t\t\t\t\t\t\tSeat \t\t\t\t\t\t\t</div>
[2] <div class="col" style="width:70%">\n \t\t\t\t\t\t\t\tDeadline \t\t\t\t\t\t\t</div>
[3] <div class="col" style="width:30%">\n\t \t\t\t\t\t\t\t\t\t\t<h6>Blackpool North and Fleetwood</h6>\n\t \t\t\t\t\t\t\t\t\t</div>
[4] <div class="col" style="width:70%">\n\t \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t \t\t\t\t\t\t\t\t\t</ul>\n</div>
[5] <div class="col" style="width:30%">\n\t \t\t\t\t\t\t\t\t\t\t<h6>Caerfryddin</h6>\n\t \t\t\t\t\t\t\t\t\t</div>
[6] <div class="col" style="width:70%">\n\t \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Friday 7 July</li>\t\t \t\t\t\t\t\t\t\t\t</ul>\n</div>
[7] <div class="col" style="width:30%">\n\t \t\t\t\t\t\t\t\t\t\t<h6>Great Yarmouth</h6>\n\t \t\t\t\t\t\t\t\t\t</div>
[8] <div class="col" style="width:70%">\n\t \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t \t\t\t\t\t\t\t\t\t</ul>\n</div>
[9] <div class="col" style="width:30%">\n\t \t\t\t\t\t\t\t\t\t\t<h6>Hemel Hempstead</h6>\n\t \t\t\t\t\t\t\t\t\t</div>
[10] <div class="col" style="width:70%">\n\t \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t \t\t\t\t\t\t\t\t\t</ul>\n</div>
[11] <div class="col" style="width:30%">\n\t \t\t\t\t\t\t\t\t\t\t<h6>Leeds South West and Morley</h6>\n\t \t\t\t\t\t\t\t\t\t</div>
[12] <div class="col" style="width:70%">\n\t \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t \t\t\t\t\t\t\t\t\t</ul>\n</div>
[13] <div class="col" style="width:30%">\n\t \t\t\t\t\t\t\t\t\t\t<h6>South Derbyshire</h6>\n\t \t\t\t\t\t\t\t\t\t</div>
[14] <div class="col" style="width:70%">\n\t \t\t\t\t\t\t\t\t\t\t<ul>\n<li>12 Noon, Thursday 6 July</li>\t\t \t\t\t\t\t\t\t\t\t</ul>\n</div>
What I want is a clean tibble with the constituency name in one column and the deadline in a second column. Can anyone give me some pointers on the correct CSS selector, please?
You did all of the hard work. Now it is just a matter of separating the 'style="width:30%"' nodes from the 'style="width:70%"' nodes. There are a couple of ways of doing this.
Below, I use each node's "style" attribute to split the nodes into two vectors, then build the final data frame from them. The first row is dropped because it holds the column headers ("Seat" and "Deadline") rather than data.
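library(dplyr)
library(rvest)
open <- read_html('https://labour.org.uk/activist-hub/governance-and-legal-hub/selections/parliamentary-candidate-application-form/')
# Select every table cell; seat and deadline cells share the "col" class
nodes <- open %>% html_nodes("div.col")
# Split the cells by inline style: 30%-wide cells hold seats, 70%-wide cells hold deadlines
seat <- nodes[html_attr(nodes, "style") == "width:30%" ] %>% html_text() %>% trimws()
deadline <- nodes[html_attr(nodes, "style") == "width:70%" ] %>% html_text() %>% trimws()
# Drop the first element of each vector (the "Seat"/"Deadline" header row)
data.frame(seat = seat[-1], deadline = deadline[-1])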
                           seat                 deadline
1 Blackpool North and Fleetwood 12 Noon, Thursday 6 July
2                   Caerfryddin   12 Noon, Friday 7 July
3                Great Yarmouth 12 Noon, Thursday 6 July
4               Hemel Hempstead 12 Noon, Thursday 6 July
5   Leeds South West and Morley 12 Noon, Thursday 6 July
6              South Derbyshire 12 Noon, Thursday 6 July
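Since the question asks about CSS selectors and a tibble, here is one possible alternative, offered as a sketch rather than a definitive answer. It assumes that every seat sits in an `h6` and every deadline in a single `li` inside a `div.col`, as in the node set printed in the question. To keep it runnable offline, it parses a small inline copy of two rows with `minimal_html()` instead of fetching the live page; the name `page` and the trimmed HTML are illustrative only.

```r
library(rvest)
library(tibble)

# Inline copy of the header row and two data rows from the question's output,
# used here in place of the live page (assumption: the real markup matches).
page <- minimal_html('
  <div class="col" style="width:30%">Seat</div>
  <div class="col" style="width:70%">Deadline</div>
  <div class="col" style="width:30%"><h6>Blackpool North and Fleetwood</h6></div>
  <div class="col" style="width:70%"><ul><li>12 Noon, Thursday 6 July</li></ul></div>
  <div class="col" style="width:30%"><h6>Caerfryddin</h6></div>
  <div class="col" style="width:70%"><ul><li>12 Noon, Friday 7 July</li></ul></div>
')

# Descendant selectors: the header cells contain no <h6> or <li>,
# so they are skipped without any manual row removal.
seat     <- page %>% html_elements("div.col h6") %>% html_text2()
deadline <- page %>% html_elements("div.col li") %>% html_text2()

result <- tibble(seat = seat, deadline = deadline)
result
```

On the real site, `page` would instead come from `read_html()` with the URL in the question. Note that this pairing relies on each seat having exactly one `li`; if a seat ever listed several deadlines, the two vectors would differ in length and `tibble()` would raise an error, in which case splitting on the "style" attribute as above is the safer route.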