r text corpus

Creating a corpus from multiple HTML text files


I have a set of HTML files: I took some texts from the web and read them in with read_html.

My file names look like this:

a1 <- read_html(link of the text) 
a2 <- read_html(link of the text) 
.
.
. ## until:
a100 <- read_html(link of the text)

I am trying to create a corpus from these.

Any ideas how I can do it?

Thanks.


Solution

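  • You could allocate a list beforehand (read_html returns a parsed document object, not a single value, so a list with [[ ]] is needed rather than an atomic vector):

    text <- vector("list", 100)
    text[[1]] <- read_html(link1)
    ...
    text[[100]] <- read_html(link100)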
    

    Even better is to organize your links as a vector. Then you can use lapply, as suggested in the comments:

    text <- lapply(links, read_html)
    

    (Here, links is a character vector of the URLs.)

    It would be rather bad coding style to use assign:

    # not a good idea
    for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))
    

    since this is rather slow and leaves you with 100 separate variables that are hard to process further.
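
To get from the parsed pages to an actual corpus, one possible approach is sketched below. It is only a sketch under assumptions: it uses the tm package for the corpus (the question does not say which corpus package is wanted), it takes all visible text of each page rather than a specific element, and the two inline HTML strings in links are hypothetical stand-ins for the real URLs so that the example runs without network access.

```r
library(rvest)  # read_html() and html_text()
library(tm)     # VCorpus() and VectorSource(); assumed choice of corpus package

# Hypothetical stand-ins for the real pages.
# In practice this would be a character vector of URLs.
links <- c(
  "<html><body><p>First example text.</p></body></html>",
  "<html><body><p>Second example text.</p></body></html>"
)

# Parse every page; the result is a list of parsed documents.
pages <- lapply(links, read_html)

# Extract the text of each page as one string per document.
texts <- vapply(pages, html_text, character(1), trim = TRUE)

# Build a corpus with one document per page.
corp <- VCorpus(VectorSource(texts))
```

Because pages and texts are a list and a vector rather than 100 separate variables, every later step (cleaning, tokenizing, building a document-term matrix) can operate on the whole collection at once. If only part of each page is wanted, html_text can be applied to the nodes selected from each page instead of to the whole document.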