tm: combine a list of corpora


I have a list of URLs whose web content I have fetched and loaded into tm corpora:

library(tm)
library(XML)

link <- c(
  "http://www.r-statistics.com/tag/hadley-wickham/",
  "http://had.co.nz/",
  "http://vita.had.co.nz/articles.html",
  "http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
  "http://www.analyticstory.com/hadley-wickham/"
)

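# Build one corpus per URL, with one document per <p> element.
create.corpus <- function(url.name) {
  doc <- htmlParse(url.name)
  parag <- xpathSApply(doc, "//p", xmlValue)
  # Fall back to a placeholder document when the page has no paragraphs.
  if (length(parag) == 0) {
    parag <- "empty"
  }
  cc <- Corpus(VectorSource(parag))
  meta(cc, "link") <- url.name  # record the source URL as metadata
  return(cc)
}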

# Apply the function to every URL, giving a list of corpora.
cc <- lapply(link, create.corpus)

This gives me a "large list" of corpora, one for each URL. Combining them one by one works:

x <- cc[[1]]
y <- cc[[2]]
z <- c(x, y, recursive = TRUE)  # preserves metadata
x; y; z
# A corpus with 8 text documents
# A corpus with 2 text documents
# A corpus with 10 text documents

But this becomes infeasible for a list of a few thousand corpora. How can a list of corpora be merged into a single corpus while preserving the metadata?


Solution

  • You can use do.call to call c with every corpus in the list as an argument:

    do.call(function(...) c(..., recursive = TRUE), cc)
    # A corpus with 155 text documents
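
This works because do.call(f, args) calls f with the elements of the list args as its arguments, so the call above should be equivalent to writing c(cc[[1]], cc[[2]], ..., recursive = TRUE) by hand. Below is a minimal offline sketch of the same pattern. The sample texts, the make.corpus helper and the "source" tag are made up for illustration. It also calls VCorpus() directly instead of Corpus(), which is an assumption made here so the sketch does not depend on which corpus class Corpus() returns in a given tm version.

```r
library(tm)

# Hypothetical stand-in for the scraped pages: one character vector per page.
texts <- list(
  page1 = c("first paragraph", "second paragraph"),
  page2 = "only paragraph",
  page3 = c("one", "two", "three")
)

# Hypothetical helper mirroring create.corpus(), without any network access.
make.corpus <- function(name) {
  corp <- VCorpus(VectorSource(texts[[name]]))
  meta(corp, "source") <- name  # tag every document with its origin
  corp
}

# A plain (unnamed) list of corpora, like cc in the question.
corpora <- lapply(names(texts), make.corpus)

# do.call() passes each list element to c() as a separate argument.
merged <- do.call(function(...) c(..., recursive = TRUE), corpora)

length(merged)  # total number of documents across all corpora
meta(merged)    # per-document metadata of the merged corpus
```

With the list from the question, the same call with cc in place of corpora gives the merged corpus.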