I am trying to obtain data from a website and, thanks to a helper, I arrived at the following script:
require(httr)
require(rvest)
res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do",
                  body = list(page = "advancedSearch",
                              AttachmentExist = "",
                              family = "",
                              placeOfPub = "",
                              genus = "Arctodupontia",
                              yearPublished = "",
                              species = "scleroclada",
                              author = "",
                              infraRank = "",
                              infraEpithet = "",
                              selectedLevel = "cont"),
                  encode = "form")
pg <- content(res, as = "parsed")
lnks <- html_attr(html_node(pg, "td"), "href")
However, in some cases, like the example above, it does not retrieve the right link because, for some reason, html_attr does not find URLs ("href") within the node detected by html_node. So far, I have tried different CSS selectors, such as "td", "a.onwardnav" and ".plantname", but none of them produces an object that html_attr can handle correctly. Any hints?
You are really close to the answer you were expecting. If you would like to pull the links from the desired page, then:
lnks <- html_attr(html_nodes(pg,"a"), "href")
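will return a character vector with the "href" attribute of every "a" tag on the page (NA for any "a" tag that has none). Notice the function is html_nodes and not html_node: html_node returns only the first match, while html_nodes returns all of them. There are multiple "a" tags, thus the plural. Also note that "href" belongs to the "a" tag, not to the enclosing "td", so selecting "td" on its own gives NA.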
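To illustrate the difference between html_node and html_nodes without hitting the server, here is a minimal sketch on an inline HTML snippet. The markup is a made-up stand-in (the "plantname" and "onwardnav" classes are taken from the question, but the table layout and the href values are assumptions, not the real page):

```r
library(rvest)

# Hypothetical stand-in for the result page; NOT the real WCSP markup
html <- '<table>
  <tr><td class="plantname"><a class="onwardnav" href="detail?id=1">Name one</a></td></tr>
  <tr><td class="plantname"><a class="onwardnav" href="detail?id=2">Name two</a></td></tr>
</table>'
pg <- read_html(html)

# html_node() keeps only the first "td"; a "td" has no href, so this is NA
first_td <- html_attr(html_node(pg, "td"), "href")

# html_nodes() keeps every "a"; each one carries its own href
all_links <- html_attr(html_nodes(pg, "a"), "href")

# Restrict to links nested inside the cells of interest
cell_links <- html_attr(html_nodes(pg, "td.plantname a"), "href")
```

If only some links are wanted, a descendant selector such as "td.plantname a" narrows the result, provided the real page actually uses those classes.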
If you are looking for the information from the table in the body of the page, then try this:
html_table(pg, fill=TRUE)
#or this
html_nodes(pg,"tr")
The second option, html_nodes(pg, "tr"), will return the 9 rows of the table as a node set, which one could then parse to obtain the row names ("th") and/or row values ("td").
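As a rough sketch of that parsing step, assuming each row holds one "th" label and one "td" value (the two-row table below is invented for illustration and reuses only the genus and species from the question):

```r
library(rvest)

# Hypothetical table with one th/td pair per row; the real page may differ
html <- '<table>
  <tr><th>Genus</th><td> Arctodupontia </td></tr>
  <tr><th>Species</th><td> scleroclada </td></tr>
</table>'
pg <- read_html(html)

rows <- html_nodes(pg, "tr")

# html_node() applied to a node set returns one match per row
row_names  <- html_text(html_node(rows, "th"), trim = TRUE)
row_values <- html_text(html_node(rows, "td"), trim = TRUE)

# Named character vector: label -> value
result <- setNames(row_values, row_names)
```

Using html_node (singular) on the row set keeps the names and values aligned, because a row missing a "th" or "td" yields NA instead of being dropped.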
Hope this helps.