I have several HTML files in a folder on my PC. I would like to read them into R, keeping the original formatting as much as possible. The files contain only text, by the way. I have tried two approaches, both of which failed miserably:
##first approach
library (tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))
## second approach
list_files_path<- list.files(path = './gazzetes.presihtml')
a<- paste0(list_files_path, names) # the vector names contains the file names, including the .HTML extension
rawHTML <- readLines(a)
Any ideas? All the best.
Your second approach is close to working, but readLines only accepts a single connection or file path, and you are passing it a vector of several files. You can use lapply with readLines to read each file in turn. Here is an example:
# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')
# readLines for each file and put them in a list
lineList <- lapply(files, readLines)
# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)
# collapse the character vector into a single string
html <- paste(lineVector , collapse = '\n')
# print the string with original formatting
cat(html)
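If you would rather keep each file separate and not type the paths by hand, the same idea can be combined with list.files. The following is only a sketch: read_html_folder is a hypothetical helper name, and it assumes the files end in .html or .htm (in any letter case). Note full.names = TRUE, which makes list.files return paths that readLines can open directly rather than bare file names.

```r
# Hypothetical helper (not part of any package): read every .html/.htm file
# in a folder into a named character vector, one string per file.
read_html_folder <- function(dir) {
  # full.names = TRUE returns usable paths instead of bare file names
  files <- list.files(path = dir, pattern = "\\.html?$",
                      full.names = TRUE, ignore.case = TRUE)
  # read each file and collapse its lines into a single string
  pages <- vapply(files, function(f) {
    paste(readLines(f, warn = FALSE), collapse = "\n")
  }, character(1))
  # label each entry with its file name rather than the full path
  names(pages) <- basename(files)
  pages
}

# Usage, assuming the folder from the question is in the working directory:
# pages <- read_html_folder("./gazzetes.presihtml")
# cat(pages[[1]])
```

Keeping one string per file means you can still tell the documents apart later, and you can always collapse them with paste(pages, collapse = '\n') if you want a single string.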