I have a vector of regex patterns for the HTML fragments on a website that contain URLs; in each pattern the URL and the link text are matched by the capture group ([^<]*). So far, I've been able to pull the matching lines into a data frame, but I'm having trouble cleaning them up so that the links can be accessed.
How do I remove all the tags without affecting the URL?
# Vector of HTML tags surrounding URL
x <- c('\t\t\t<div><a href=\"([^<]*)\">([^<]*)</a></div>','\t\t</tr><tr><td><a href=\"([^<]*)\">([^<]*)</a></td>','\t\t\t<td><a href=\"([^<]*)\">([^<]*)</a></td>')
Input:
URL <- "https://www.atf.gov/resource-center/data-statistics"
html <- paste(readLines(URL))
Desired output:
Link | Title |
---|---|
"https://www.atf.gov/file/144871/download" | Canada 2014-2019 |
"https://www.atf.gov/node/79436" | 2019 |
Code I'm currently working with:
dl_all <- data.frame()
for (i in x) {
  # Keep the lines of html that match the current pattern
  datalines <- grep(i, html, value = TRUE)
  dl_all <- rbind(data.frame(datalines), dl_all)
}
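The loop above keeps whole HTML lines, tags included. One way to strip the tags, sketched here in base R, is to let the capture groups do the work: regexec() together with regmatches() returns the full match plus each group, so the href and the link text can be read off directly. The sample lines and the single simplified pattern are assumptions for illustration, standing in for readLines(URL) and the three patterns in x; only the first anchor on each line is picked up.

```r
# Hypothetical sample lines standing in for readLines(URL); the real page may differ.
html <- c(
  '\t\t\t<div><a href="https://www.atf.gov/file/144871/download">Canada 2014-2019</a></div>',
  '\t\t\t<td><a href="https://www.atf.gov/node/79436">2019</a></td>',
  '<p>No link on this line</p>'
)

# One simplified pattern (an assumption) with two capture groups:
# group 1 is the href value, group 2 is the link text.
pattern <- '<a href="([^"<]*)">([^<]*)</a>'

# regexec() records the positions of the full match and of each group;
# regmatches() converts them to character vectors: c(full match, group 1, group 2).
matches <- regmatches(html, regexec(pattern, html))

# Lines without a match yield character(0), so keep only complete matches.
matches <- matches[lengths(matches) == 3]

links <- data.frame(
  Link  = vapply(matches, `[`, character(1), 2),
  Title = vapply(matches, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(links)
```

Regular expressions are fragile against changes in the page markup, so this is best treated as a quick fix rather than a robust parser.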
Similar to Wiktor Stribiżew's answer, but using the native pipe and lambda syntax from R >= 4.1:
library(rvest)
url <- "https://www.atf.gov/resource-center/data-statistics"
df <- read_html(url) |>
  html_nodes("a") |>
  {\(x) data.frame(
    Link = x |> html_attr("href"),
    Title = x |> html_text()
  )}()
Giving:
tail(df)
Link Title
203 https://www.justice.gov/jmd/eeo-program-status-report No Fear Act
204 https://oig.justice.gov/hotline/whistleblower-protection Whistleblower Rights & Protections
205 https://www.atf.gov/home/site-map Site Map
206 https://www.atf.gov/resource-center/accessibility-policy Accessibility & Plug-Ins
207 https://www.atf.gov/<front> ATF.gov
208 https://www.justice.gov U.S. Department of Justice
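The result above contains every anchor on the page. If only the data links from the question are wanted, a possible follow-up is to resolve relative hrefs and then filter on the path. This is a sketch under assumptions: the inline HTML passed to minimal_html() stands in for the live page, and the /file/ and /node/ filter is a guess based on the two rows of the desired output, not a rule taken from the site.

```r
library(rvest)

# Hypothetical stand-in for read_html(url); the real page has many more anchors.
page <- minimal_html('
  <div><a href="/file/144871/download">Canada 2014-2019</a></div>
  <div><a href="/node/79436">2019</a></div>
  <div><a href="/home/site-map">Site Map</a></div>
')

anchors <- html_elements(page, "a")

df <- data.frame(
  # url_absolute() resolves relative hrefs against the site root.
  Link  = xml2::url_absolute(html_attr(anchors, "href"), "https://www.atf.gov/"),
  Title = html_text2(anchors),
  stringsAsFactors = FALSE
)

# Assumed filter: keep only links whose path contains /file/ or /node/.
wanted <- df[grepl("/(file|node)/", df$Link), ]
print(wanted)
```

On the live site, page would be replaced by read_html(url), and the filter pattern adjusted to whichever links are actually needed.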