R remove characters surrounding wildcard in string


I have a vector of the HTML patterns on a website that contain URLs. In each pattern, the URL and the link text are represented by the wildcard group ([^<]*). So far, I've been able to pull the matching lines into the data frame I need, but I'm having trouble cleaning them up so the links can be accessed.

How do I remove all the tags without affecting the URL?

# Vector of HTML tags surrounding URL
x <- c('\t\t\t<div><a href=\"([^<]*)\">([^<]*)</a></div>','\t\t</tr><tr><td><a href=\"([^<]*)\">([^<]*)</a></td>','\t\t\t<td><a href=\"([^<]*)\">([^<]*)</a></td>')

Input:

URL <- "https://www.atf.gov/resource-center/data-statistics"
html <- paste(readLines(URL))

Desired output:

Link Title
"https://www.atf.gov/file/144871/download" Canada 2014-2019
"https://www.atf.gov/node/79436" 2019

Code I'm currently working with:

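# Collect every line of the page that matches one of the patterns in x
dl_all <- data.frame()
for (i in x) {
  datalines <- grep(i, html, value = TRUE)
  # Newly matched lines are placed above the rows collected so far
  dl_all <- rbind(data.frame(datalines), dl_all)
}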

Solution

  • Similar to Wiktor Stribiżew's approach, using rvest with the native pipe and lambda syntax of R >= 4.1:

    library(rvest)
    url <- "https://www.atf.gov/resource-center/data-statistics"
    df <- read_html(url) |> html_nodes("a") |> 
      {\(x) data.frame(
        Link = x |> html_attr("href"),
        Title = x |> html_text())
      }()
    

    Giving:

    tail(df)
                                                            Link                              Title
    203    https://www.justice.gov/jmd/eeo-program-status-report                        No Fear Act
    204 https://oig.justice.gov/hotline/whistleblower-protection Whistleblower Rights & Protections
    205                        https://www.atf.gov/home/site-map                           Site Map
    206 https://www.atf.gov/resource-center/accessibility-policy           Accessibility & Plug-Ins
    207                              https://www.atf.gov/<front>                            ATF.gov
    208                                  https://www.justice.gov         U.S. Department of Justice
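
If a regex-only route is preferred, as in the question, the two capture groups can be pulled out directly with base R instead of stripping the surrounding tags. The following is a minimal sketch under stated assumptions: the `html` vector holds made-up sample lines standing in for `readLines(URL)`, the single `pattern` is a simplification of the three patterns in `x`, and the href group uses `[^"]*` rather than the question's `[^<]*` so that it stops at the closing quote. Note that `regexec()` only reports the first match on each line.

```r
# Hypothetical sample lines standing in for readLines(URL); not copied from the real page
html <- c(
  '\t\t\t<div><a href="https://www.atf.gov/file/144871/download">Canada 2014-2019</a></div>',
  '\t\t\t<td><a href="https://www.atf.gov/node/79436">2019</a></td>',
  '<p>No link on this line</p>'
)

# One pattern with two capture groups: the href value and the link text
pattern <- '<a href="([^"]*)">([^<]*)</a>'

# regexec() records the position of the full match and of each capture group;
# regmatches() returns them as character vectors (length 0 when a line has no match)
matches <- regmatches(html, regexec(pattern, html))

# Keep only lines that matched: full match + 2 groups = 3 elements
matches <- matches[lengths(matches) == 3]

links <- data.frame(
  Link  = vapply(matches, `[`, character(1), 2),
  Title = vapply(matches, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(links)
```

For anything beyond simple, regular markup, an HTML parser such as rvest is generally the more robust choice, since a pattern like this breaks as soon as the anchor tag carries extra attributes or spans several lines.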