Tags: html, r, tm

Read multiple local HTML files from a folder in R


I have several HTML files in a folder on my PC. I would like to read them into R while keeping the original formatting as much as possible. The files contain only text, by the way. I have tried two approaches, both of which failed miserably:

    ## first approach
    library(tm)
    cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
    docs <- Corpus(DirSource(cname))

    ## second approach
    list_files_path <- list.files(path = './gazzetes.presihtml')
    a <- paste0(list_files_path, names) # the vector names contains the names of the files with the .HTML extension
    rawHTML <- readLines(a)

Any ideas? All the best.


Solution

  • Your second approach is close to working, but readLines accepts only a single connection (or file path), and you are passing it a vector of several files. You can use lapply to call readLines once per file. Here is an example:

    # generate vector of html files
    files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')
    
    # readLines for each file and put them in a list
    lineList <- lapply(files, readLines)
    
    # create a character vector that contains all lines from all files
    lineVector <- unlist(lineList)
    
    # collapse the character vector into a single string
    html <- paste(lineVector, collapse = '\n')
    
    # print the string with original formatting
    cat(html)
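
If you would rather not type the file paths by hand, or want to keep each file's text separate, the same idea can be sketched with list.files. This is only a minimal sketch under some assumptions: html_dir, the two demo files, and the helper read_html_text are hypothetical names made up for illustration (the demo files are written to a temporary folder so the snippet runs on its own), and you would point html_dir at your own folder instead. Note that list.files returns bare file names unless full.names = TRUE is set, which may be one reason readLines could not find the files in the second approach.

```r
# Hypothetical demo folder with two small files so the example is runnable;
# replace html_dir with the path to your own folder of HTML files.
html_dir <- file.path(tempdir(), "html_demo")
dir.create(html_dir, showWarnings = FALSE)
writeLines(c("<html><body>", "<p>First file</p>", "</body></html>"),
           file.path(html_dir, "a.html"))
writeLines(c("<html><body>", "<p>Second file</p>", "</body></html>"),
           file.path(html_dir, "b.html"))

# full.names = TRUE returns paths that readLines can open directly;
# the pattern keeps only .html/.htm files, ignoring case.
files <- list.files(html_dir, pattern = "\\.html?$", full.names = TRUE,
                    ignore.case = TRUE)

# Hypothetical helper: read one file and collapse its lines into one string.
read_html_text <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# One string per file, named by file name rather than full path.
html_by_file <- vapply(files, read_html_text, character(1))
names(html_by_file) <- basename(files)

# Print one file with its original line breaks.
cat(html_by_file[["a.html"]])
```

Keeping one string per file, rather than calling unlist on everything, makes it possible to tell afterwards which text came from which file.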