Tags: css, r, web-scraping, rvest, httr

Identify the correct CSS selector for a URL in an R script


I am trying to obtain data from a website and, thanks to some help, I arrived at the following script:

require(httr)
require(rvest)

res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do",
                  body = list(page = "advancedSearch",
                              AttachmentExist = "",
                              family = "",
                              placeOfPub = "",
                              genus = "Arctodupontia",
                              yearPublished = "",
                              species = "scleroclada",
                              author = "",
                              infraRank = "",
                              infraEpithet = "",
                              selectedLevel = "cont"),
                  encode = "form")
pg <- content(res, as = "parsed")
lnks <- html_attr(html_node(pg, "td"), "href")

However, in some cases, like the example above, it does not retrieve the right link because, for some reason, html_attr does not find URLs ("href") within the node detected by html_node. So far, I have tried different CSS selectors, such as "td", "a.onwardnav" and ".plantname", but none of them produces an object that html_attr can handle correctly. Any hints?


Solution

  • You are very close to the answer you were expecting. If you would like to pull the links off the desired page, use:

    lnks <- html_attr(html_nodes(pg,"a"), "href") 
    

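    This returns a character vector with the "href" value of every "a" tag on the page (NA for any "a" tag without one). Notice the command is html_nodes and not html_node: html_node returns only the first match, and the page has multiple "a" tags, hence the plural. Also note that a "td" tag has no "href" attribute of its own, which is why selecting "td" did not return a link.
    If you are looking for the information from the table in the body of the page, then try this: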

    html_table(pg, fill=TRUE)
    #or this
    html_nodes(pg,"tr")
    

    The second line returns the 9 rows of the table, which you can then parse to obtain the row headers ("th") and/or row values ("td").
    Hope this helps.
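
To make the difference concrete without hitting the live site, here is a minimal sketch that runs the same calls against a small inline HTML snippet. The markup, the "onwardnav" link and its href value are made-up stand-ins for the real result page, which may be structured differently.

```r
library(rvest)

# Hypothetical stand-in for the search-result page; the real markup may differ.
html <- '
<table>
  <tr><th>Name</th><td><a class="onwardnav" href="namedetail.do?name_id=1">Arctodupontia scleroclada</a></td></tr>
  <tr><th>Status</th><td>Accepted</td></tr>
</table>'
pg <- read_html(html)

# html_node() keeps only the first match, and a <td> has no href of its own,
# so this yields NA, as in the question.
first_td <- html_attr(html_node(pg, "td"), "href")

# Select the <a> elements inside the cells, then read their href attribute.
lnks <- html_attr(html_nodes(pg, "td a"), "href")

# Row-wise parsing: one header (<th>) and one value (<td>) per <tr>.
rows    <- html_nodes(pg, "tr")
headers <- html_text(html_node(rows, "th"))
values  <- html_text(html_node(rows, "td"))

print(first_td)
print(lnks)
print(data.frame(header = headers, value = values))
```

On the real page, a narrower selector such as "td a" or "a.onwardnav" may be preferable to a bare "a", since it skips navigation links elsewhere on the page; whether those selectors match depends on the actual markup.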