So I have a website, https://ais.sbarc.org/logs_delimited/, which has a bunch of links, and within each link are 24 links to .txt files.
I'm new to R, and I'm able to loop through one link to get its 24 text files into a dataframe. But I can't figure out how to loop over the whole directory.
I was able to loop the 24 hourly files using hours.list, but year.list and trip.list wouldn't work. I apologize if this is similar to other web-scraping questions or if I'm missing something really simple, but I'd appreciate any help.
library(readr)
library(magrittr)

get_ais_text = function(ais_text){
  hours.list = c(0:23)
  hours.list_1 = sprintf('%02d', hours.list)
  year.list = c(2018:2022)
  year.list_1 = sprintf('%d', year.list)
  trip.list = c(190101:191016)
  trip.list_1 = sprintf("%d", trip.list)
  ais_text = tryCatch(
    lapply(paste0('https://ais.sbarc.org/logs_delimited/2019/190101/AIS_SBARC_190101-', hours.list_1, '.txt'),
           function(url){
             url %>%
               # ais_col_types is a column spec defined elsewhere in my script
               read_delim(";", col_names = sprintf("X%d", 1:25), col_types = ais_col_types)
           }),
    error = function(e){NA}
  )
  DF = do.call(rbind.data.frame, ais_text)
  return(DF)
}
get_ais_text()
Here's a function that works recursively to get all the links, starting from the home directory. Note that it takes a while to run:
library(xml2)
library(magrittr)
.get_link <- function(u){
  node <- xml2::read_html(u)
  # All <a> links except the parent-directory ("../") back link
  hrefs <- xml2::xml_find_all(node, ".//a[not(contains(@href,'../'))]") %>% xml_attr("href")
  urls <- xml2::url_absolute(hrefs, xml2::xml_url(node))
  if(!all(tools::file_ext(urls) == "txt")){
    # Still in a directory listing: recurse into each link
    lapply(urls, .get_link)
  } else {
    # Leaf level: these are the .txt files
    urls
  }
}
What this is doing is basically starting with a url, reading its contents, and finding all the links (<a> tags) using an xpath selector that says "all links whose href does not contain ../", i.e. excluding the back link to the parent directory. If a link leads to more links, it loops through and collects all of those as well; once the links are the final .txt files, we're done.
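If you want to see what the selector matches on a single page, here's a minimal check against the top-level listing (which, per the directory structure above, should return the year links):
page <- xml2::read_html("https://ais.sbarc.org/logs_delimited/")
# Hrefs matched by the selector on one directory listing
xml2::xml_find_all(page, ".//a[not(contains(@href,'../'))]") %>%
  xml2::xml_attr("href")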
As an example, cheating a bit and starting only at 2018:
a <- .get_link("https://ais.sbarc.org/logs_delimited/2018/")
> a[[1]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-01.txt"
> length(a)
[1] 365
> a[[365]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-01.txt"
To get everything, you would simply start with https://ais.sbarc.org/logs_delimited/ as the url input, and then add something like data.table::fread to digest the data, which I would suggest doing in a separate iteration. Something like this works:
lapply(seq_along(a), function(i){
  lapply(a[[i]], data.table::fread)
})
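That leaves you with a nested list of data.tables. A minimal way to flatten it into one table, assuming data.table::rbindlist with fill = TRUE can reconcile the varying field counts across files:
# Read every file, then bind the nested results into one data.table.
# fill = TRUE pads rows that have fewer fields than others.
tables <- lapply(unlist(a), function(u) data.table::fread(u, sep = ";", fill = TRUE))
all_dt <- data.table::rbindlist(tables, fill = TRUE)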
The first thing to notice here is that there are 11,636 files. That's a lot of links to hit on someone's server at once, so I'm going to sample a few of them and show how to do it. If you do crawl everything, I would suggest adding a Sys.sleep call into yours (see the sketch after the next block)...
# This gets all the urls
a <- .get_link("https://ais.sbarc.org/logs_delimited/")
# This unlists and gives us a unique vector of the urls
b <- unique(unlist(a))
# I'm sampling b here; you would just use `b` instead of `b[...]`
a_dfs <- jsonlite::rbind_pages(lapply(b[sample(seq_along(b), 20)], function(i){
  df <- data.table::fread(i, sep = ";") %>% as.data.frame()
  # Keeping the file path makes debugging easier later if needed
  df$file_path <- i
  df
}))
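For the full pull, a minimal polite-fetch sketch; the half-second pause is an arbitrary choice, adjust to taste:
# Same as above, over all of `b`, with a pause between requests
a_dfs_all <- jsonlite::rbind_pages(lapply(b, function(i){
  Sys.sleep(0.5)  # arbitrary delay to avoid hammering the server
  df <- data.table::fread(i, sep = ";") %>% as.data.frame()
  df$file_path <- i
  df
}))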
> a_dfs %>% head()
17:00:00:165 24 0 338179477 LAUREN SEA V8 V9 V15 V16 V17 V18 V19 V20 V21 V22 V23 file_path V1 V2 V3 V4
1 17:00:00:166 EUPHONY ACE 79 71.08 1 371618000 0 254.0 253 52 0 0 0 0 5 NA https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
2 17:00:01:607 SIMONE T BRUSCO 31 32.93 3 367593050 15 255.7 97 55 0 0 1 0 503 0 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
3 17:00:01:626 POLARIS VOYAGER 89 148.80 1 311000112 0 150.0 151 53 0 0 0 0 0 22 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
4 17:00:01:631 SPECTRE 60 25.31 1 367315630 5 265.1 511 55 0 0 1 0 2 20 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
5 17:00:01:650 KEN EI 70 73.97 1 354162000 0 269.0 269 38 0 0 0 0 1 84 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
6 17:00:02:866 HANNOVER BRIDGE 70 62.17 1 372104000 0 301.1 300 56 0 0 0 0 3 1 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
V5 V6 V7 V10 V11 V12 V13 V14 02:00:00:489 338115994 1 37 SRTG0$ 10 7 4 17:00:00:798 BROADBILL 16.84 269 18 367077090 16.3 -119.981493 34.402530 264.3 511 40
1 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
Obviously there's some cleaning to do, but this is how I'd get at it.
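For instance, one easy bit of that cleaning is dropping the columns that came back entirely NA:
# Drop columns that are all NA
a_dfs_clean <- a_dfs[, colSums(!is.na(a_dfs)) > 0, drop = FALSE]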
I actually like this better, though: read the raw data in with readLines, split the strings yourself, and build the dataframe explicitly, so fread's header guessing can't mangle it:
a_dfs <- jsonlite::rbind_pages(lapply(b[sample(seq_along(b), 20)], function(i){
  raw <- readLines(i)
  # Split each line on ";" into a character matrix
  str_matrix <- stringi::stri_split_regex(raw, "\\;", simplify = TRUE)
  # Convert empty strings to NA, then build the dataframe
  as.data.frame(apply(str_matrix, 2, function(j){
    ifelse(!nchar(j), NA, j)
  })) %>% dplyr::mutate(file_name = i)
}))
> a_dfs %>% head
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
1 09:59:57:746 STAR CARE 77 75.93 135 1 0 566341000 0 0 16.7 1 -118.839933 33.562167 321 322 50 0 0 0 0 6 19 <NA> <NA>
2 10:00:00:894 THALATTA 70 27.93 133.8 1 0 229710000 0 251 17.7 1 -119.366765 34.101742 283.9 282 55 0 0 0 0 7 <NA> <NA> <NA>
3 10:00:03:778 GULF GLORY 82 582.3 256 1 0 538007706 0 0 12.4 0 -129.345783 32.005983 87 86 54 0 0 0 0 2 20 <NA> <NA>
4 10:00:03:799 MAGPIE SW 70 68.59 123.4 1 0 352597000 0 0 10.9 0 -118.747970 33.789747 119.6 117 56 0 0 0 0 0 22 <NA> <NA>
5 10:00:09:152 CSL TECUMSEH 70 66.16 269.7 1 0 311056900 0 11 12 1 -120.846763 34.401482 105.8 106 56 0 0 0 0 6 21 <NA> <NA>
6 10:00:12:870 RANGER 85 60 31.39 117.9 1 0 367044250 0 128 0 1 -119.223133 34.162953 360 511 56 0 0 1 0 2 21 <NA> <NA>
file_name V26 V27
1 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
2 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
3 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
4 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
5 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
6 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>