Tags: xml, r, ftp, httr, rvest

How to get all pages downstream of an FTP address in R


I want to retrieve a list of all the pages downstream of an FTP site:

say I have a site:

ftp://example.gov/  # (not real)

which contains all the pages/files:

ftp://example.gov/dir1  
ftp://example.gov/dir1/file1.txt  
ftp://example.gov/dir2  
ftp://example.gov/dir2/thing.txt  
ftp://example.gov/dir3  
ftp://example.gov/dir3/another  
ftp://example.gov/dir3/another/other.txt

so if I start with:

base_site <- "ftp://example.gov/"

I want a list of the site's "paths"; that is, the output should be an R object containing all of the example links above as character strings. The output can be nested or tidy.


Solution

  • library(RCurl)
    url <- "ftp://ftp2.census.gov/"
    # names-only listing, one entry per line
    alldir <- getURL(url, ftp.use.epsv = FALSE, ftplistonly = TRUE, crlf = TRUE)
    alldir <- paste(url, strsplit(alldir, "\r*\n")[[1]], sep = "")
    head(alldir)
    [1] "ftp://ftp2.census.gov/AHS"                      "ftp://ftp2.census.gov/AOA"                     
    [3] "ftp://ftp2.census.gov/CTPP_2006_2010"           "ftp://ftp2.census.gov/EEO_2006_2010"           
    [5] "ftp://ftp2.census.gov/EEO_Disability_2008-2010" "ftp://ftp2.census.gov/Econ2001_And_Earlier"  
    

    For details, see ?getURL in the RCurl package.
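Note that the snippet above lists only the top level of the site. To get every path downstream, as the question asks, you would have to recurse into each directory. A rough sketch of one way to do that, assuming the server returns Unix-style "ls -l" listings where a leading "d" marks a directory (not all FTP servers do), with parse_ftp_listing and list_ftp_recursive as hypothetical helper names, not RCurl functions:

```r
library(RCurl)

# Parse a Unix-style "ls -l" FTP listing into entry names and a
# directory flag. Assumes the last whitespace-separated field on
# each line is the file or directory name.
parse_ftp_listing <- function(listing) {
  lines <- strsplit(listing, "\r*\n")[[1]]
  lines <- lines[nzchar(lines)]
  data.frame(
    name   = vapply(strsplit(lines, "\\s+"),
                    function(f) f[length(f)], character(1)),
    is_dir = substr(lines, 1, 1) == "d",
    stringsAsFactors = FALSE
  )
}

# Recursively collect every path under `url` (which should end in "/").
list_ftp_recursive <- function(url) {
  entries <- parse_ftp_listing(
    getURL(url, ftp.use.epsv = FALSE, crlf = TRUE))
  paths <- paste0(url, entries$name)
  out <- paths
  for (p in paths[entries$is_dir]) {
    out <- c(out, list_ftp_recursive(paste0(p, "/")))
  }
  out
}

# e.g. list_ftp_recursive("ftp://example.gov/")
```

Be polite with this on a large site such as ftp2.census.gov: it issues one request per directory, so a deep tree means many round trips, and you may want to add a Sys.sleep() between calls.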