Tags: r, web-scraping, rvest, rcurl

Webscrape text files using R, rvest or rcurl


I have a website, https://ais.sbarc.org/logs_delimited/, which contains a set of links, and each of those links leads to 24 links to .txt files.

I'm new to R, but I'm able to loop through one link and get its 24 text files into a data frame. What I can't figure out is how to loop over the whole directory.

I was able to loop over the 24 links using hours.list, but year.list and trip.list wouldn't work. I apologize if this is similar to other web-scraping questions or if I'm missing something really simple, but I'd appreciate any help.

get_ais_text = function(ais_text){

    hours.list = c(0:23)
    hours.list_1 = sprintf('%02d', hours.list)

    year.list = c(2018:2022)
    year.list1 = sprintf('%d', year.list)

    trip.list = c(190101:191016)
    trip.list_1 = sprintf("%d", trip.list)

ais_text = tryCatch(    
lapply(paste0('https://ais.sbarc.org/logs_delimited/2019/190101/AIS_SBARC_190101-', hours.list_1,'.txt'),
                    function(url){
                      url %>% 
                        read_delim(";", col_names = sprintf("X%d", 1:25), col_types = ais_col_types)                   
                    }),
      error = function(e){NA}
    )
  DF = do.call(rbind.data.frame, ais_text)
  return(DF)
}

get_ais_text()

Solution

  • Here's a function that works recursively to get all the links, starting from the home directory. Note that it takes a while to run:

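    library(xml2)
    library(magrittr)
    # Recursively collect the .txt URLs reachable from the directory listing at `u`
    .get_link <- function(u){
      node <- xml2::read_html(u)
      # All <a> links except the "../" parent-directory link
      hrefs <- xml2::xml_find_all(node, ".//a[not(contains(@href,'../'))]") %>% xml2::xml_attr("href")
      urls <- xml2::url_absolute(hrefs, xml2::xml_url(node))
      if(!all(tools::file_ext(urls) == "txt")){
        # Not yet at the .txt files: descend into each link
        lapply(urls, .get_link)
      }else {
        # Every link is a .txt file, so this is a leaf directory
        return(urls)
      }
    }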
    

    What this does is start with a url, read its contents, and find every link (<a ...>) using an XPath selector that says "all links that are not ../", i.e. everything except the back link to the parent directory. If the links on the page are not all .txt files, it loops through them and collects the links under each one as well. If they are all .txt files, we have reached the final links and are done.

    Example, taking a shortcut and starting only at 2018:
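To see what the XPath selector and the file-extension test do without touching the server, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for a directory listing, and the base URL passed to url_absolute is only assumed for illustration.

```r
library(xml2)

# Hypothetical directory listing, standing in for a real index page
listing <- '<html><body>
  <a href="../">Parent Directory</a>
  <a href="180101/">180101/</a>
  <a href="180102/">180102/</a>
</body></html>'

page <- read_html(listing)

# Select every <a> whose href does not contain "../"
links <- xml_find_all(page, ".//a[not(contains(@href, '../'))]")
hrefs <- xml_attr(links, "href")

# Resolve the relative hrefs against an assumed base URL
urls <- url_absolute(hrefs, "https://ais.sbarc.org/logs_delimited/2018/")

# None of these end in .txt, so .get_link would recurse into each of them
is_txt <- tools::file_ext(urls) == "txt"
print(urls)
```

The parent link is dropped by the selector, and because the remaining links are directories rather than .txt files, the recursive branch would be taken.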

    a <- .get_link("https://ais.sbarc.org/logs_delimited/2018/")
    > a[[1]][1:2]
    [1] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-00.txt"
    [2] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-01.txt"
    > length(a)
    [1] 365
    > a[[365]][1:2]
    [1] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-00.txt"
    [2] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-01.txt"
    

    To crawl everything, you would simply start with https://ais.sbarc.org/logs_delimited/ as the url input, and then add something like data.table::fread to read the data, which I would suggest doing in a separate iteration. Something like this works:

    lapply(1:length(a), function(i){
        lapply(a[[i]], data.table::fread)
    })
    

    For reading in the data:

    The first thing to notice here is that there are 11,636 files. That's a lot of links to hit on someone's server at once, so I'm going to sample a few and show how to do it. I would suggest adding a Sys.sleep call to yours.

    # This gets all the urls
    a <- .get_link("https://ais.sbarc.org/logs_delimited/")
    # This unlists and gives us a unique array of the urls
    b <- unique(unlist(a))
    # I'm sampling b, but you would just use `b` instead of `b[...]`
    a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
        df <- data.table::fread(i, sep = ";") %>% as.data.frame()
        # Giving the file path for debug later if needed seems helpful
        df$file_path <- i
        df
    }))
    
    > a_dfs %>% head()
      17:00:00:165              24  0 338179477 LAUREN SEA        V8 V9   V15 V16 V17 V18 V19 V20 V21 V22 V23                                                                file_path   V1   V2 V3 V4
    1 17:00:00:166     EUPHONY ACE 79     71.08          1 371618000  0 254.0 253  52   0   0   0   0   5  NA https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
    2 17:00:01:607 SIMONE T BRUSCO 31     32.93          3 367593050 15 255.7  97  55   0   0   1   0 503   0 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
    3 17:00:01:626 POLARIS VOYAGER 89    148.80          1 311000112  0 150.0 151  53   0   0   0   0   0  22 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
    4 17:00:01:631         SPECTRE 60     25.31          1 367315630  5 265.1 511  55   0   0   1   0   2  20 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
    5 17:00:01:650          KEN EI 70     73.97          1 354162000  0 269.0 269  38   0   0   0   0   1  84 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
    6 17:00:02:866 HANNOVER BRIDGE 70     62.17          1 372104000  0 301.1 300  56   0   0   0   0   3   1 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
      V5 V6 V7 V10 V11 V12 V13 V14 02:00:00:489 338115994  1 37 SRTG0$ 10  7  4 17:00:00:798 BROADBILL 16.84 269   18 367077090 16.3 -119.981493 34.402530 264.3 511 40
    1 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
    2 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
    3 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
    4 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
    5 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
    6 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
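The Sys.sleep suggestion above could be sketched roughly as follows. The helper name read_politely, its arguments, and the default delay are hypothetical, not part of the original answer. It pauses before each request and wraps each read in its own tryCatch, so one failing file yields NULL for that file instead of discarding the whole run.

```r
# Hypothetical helper: apply `reader` to each URL, pausing between requests
read_politely <- function(urls, reader, delay = 1) {
  lapply(urls, function(u) {
    # Wait `delay` seconds before each request to go easy on the server
    Sys.sleep(delay)
    # Return NULL for a URL that fails instead of aborting everything
    tryCatch(reader(u), error = function(e) NULL)
  })
}
```

Assuming b holds the vector of URLs from above, a call might look like read_politely(b, function(u) data.table::fread(u, sep = ";"), delay = 0.5), followed by dropping the NULL entries before binding the results.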
    

    Obviously there is some cleaning to do, but this is how I'd approach it.
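Part of the mess in the output above comes from fread treating the first record of a file as a header, which is why data values show up as column names. One possible way to avoid that, as a sketch rather than a tested fix for these particular files, is to pass header = FALSE and fill = TRUE. The two sample lines below are made up to mimic the semicolon-delimited format, with the second one deliberately shorter.

```r
library(data.table)

# Made-up sample lines in the same ";"-delimited shape; the second is shorter
lines <- c("17:00:00:165;EUPHONY ACE;79;71.08",
           "17:00:01:607;SPECTRE;60")

# header = FALSE keeps the first record as data;
# fill = TRUE pads short rows with NA instead of failing
dt <- fread(text = lines, sep = ";", header = FALSE, fill = TRUE)
print(dt)
```

With a real file, the text argument would be replaced by the URL, as in the code above.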

    Edit 2

    I actually like this better: read the raw lines in, then split each string and force the result into a data frame:

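    a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
        raw <- readLines(i)
        # Split every line on ";" into a character matrix (short rows are padded with "")
        str_matrix <- stringi::stri_split_regex(raw, "\\;", simplify = TRUE)
        # Turn empty strings into NA column by column, then record the source file
        as.data.frame(apply(str_matrix, 2, function(j){
            ifelse(!nchar(j), NA, j)
        })) %>% dplyr::mutate(file_name = i)
    }))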
    
    > a_dfs %>% head
                V1           V2 V3    V4    V5 V6 V7        V8 V9 V10  V11 V12         V13       V14   V15 V16 V17 V18 V19 V20 V21 V22  V23  V24  V25
    1 09:59:57:746    STAR CARE 77 75.93   135  1  0 566341000  0   0 16.7   1 -118.839933 33.562167   321 322  50   0   0   0   0   6   19 <NA> <NA>
    2 10:00:00:894     THALATTA 70 27.93 133.8  1  0 229710000  0 251 17.7   1 -119.366765 34.101742 283.9 282  55   0   0   0   0   7 <NA> <NA> <NA>
    3 10:00:03:778   GULF GLORY 82 582.3   256  1  0 538007706  0   0 12.4   0 -129.345783 32.005983    87  86  54   0   0   0   0   2   20 <NA> <NA>
    4 10:00:03:799    MAGPIE SW 70 68.59 123.4  1  0 352597000  0   0 10.9   0 -118.747970 33.789747 119.6 117  56   0   0   0   0   0   22 <NA> <NA>
    5 10:00:09:152 CSL TECUMSEH 70 66.16 269.7  1  0 311056900  0  11   12   1 -120.846763 34.401482 105.8 106  56   0   0   0   0   6   21 <NA> <NA>
    6 10:00:12:870    RANGER 85 60 31.39 117.9  1  0 367044250  0 128    0   1 -119.223133 34.162953   360 511  56   0   0   1   0   2   21 <NA> <NA>
                                                                     file_name  V26  V27
    1 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
    2 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
    3 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
    4 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
    5 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
    6 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
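To make the split-and-pad step in Edit 2 easier to follow, here is a small offline sketch on two made-up lines of unequal length. With simplify = TRUE, stri_split_regex returns a character matrix and pads the shorter row with empty strings, which the ifelse step then turns into NA.

```r
library(stringi)

# Made-up lines: the first has an empty middle field, the second is one field short
raw <- c("a;;c", "d;e")

# Character matrix with one row per line; missing cells are padded with ""
str_matrix <- stri_split_regex(raw, ";", simplify = TRUE)

# Replace empty strings with NA, column by column, and build the data frame
df <- as.data.frame(apply(str_matrix, 2, function(j) {
  ifelse(!nchar(j), NA, j)
}))
print(df)
```

Every column comes out as text here, so numeric fields would still need converting afterwards, for example with type.convert.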