R tm: setting TextDocument-level metadata appears to be very inefficient


I'm loading text documents from a database, creating a corpus from them, and finally setting a prefixed ID on each document (I need the prefix because I have documents of several types).

rs <- dbSendQuery(con,"SELECT id::TEXT, content FROM entry")
entry.d = data.table(fetch(rs,n=-1))
entry.vs = VectorSource(entry.d$content)
entry.vc = VCorpus(entry.vs, readerControl = list(language = "pl"))
meta(entry.vc, tag = 'id', type = 'local') = paste0("e:",entry.d$id)

The last step is very slow. It takes 6 minutes, whereas

tm_map(entry.vc, tm_reduce, tmFuns = funs, mc.cores=1)

where funs is a list of 6 functions, takes only 2 minutes more.

Is there any way to do it faster?
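
The cost of the metadata step can be measured in isolation with system.time. The following is only a minimal sketch: it assumes the tm package is installed, and it uses a small made-up data frame in place of the database result, so the absolute numbers will not match the 6 minutes reported above.

```r
library(tm)

# Hypothetical stand-in for the rows fetched from the database.
n <- 200
entry.d <- data.frame(
  id = seq_len(n),
  content = paste("dokument numer", seq_len(n)),
  stringsAsFactors = FALSE
)

# Build the corpus the same way as in the question.
entry.vc <- VCorpus(VectorSource(entry.d$content),
                    readerControl = list(language = "pl"))

# Time only the assignment of the prefixed IDs.
timing <- system.time(
  meta(entry.vc, tag = "id", type = "local") <- paste0("e:", entry.d$id)
)
print(timing[["elapsed"]])

# Check that the first document received its prefixed ID.
first_id <- meta(entry.vc[[1]], "id")
```

Increasing n shows how the elapsed time grows with the number of documents.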


Solution

  • I changed my code to set the IDs when the VCorpus is constructed, instead of assigning them afterwards.

    rs <- dbSendQuery(con,"SELECT ('e:'||id) AS id, content, 'pl'::TEXT AS language FROM entry")
    entry.d = data.table(fetch(rs,n=-1))
    entry.dfs = DataframeSource(entry.d)
    reader <- readTabular(mapping=list(content="content", id="id", language='language'))
    entry.vc = VCorpus(entry.dfs, readerControl = list(reader = reader))
    

    Now it takes only 2.5 minutes to generate the VCorpus with custom IDs.
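
On newer tm releases the same idea can probably be written without readTabular. The sketch below assumes tm 0.7 or later, where, as far as I know, DataframeSource expects the first two columns to be named doc_id and text and uses doc_id as each document's ID. The sample rows are made up and stand in for the database result.

```r
library(tm)

# Hypothetical stand-in for the rows fetched from the database.
entry.d <- data.frame(
  id = c(1, 2, 3),
  content = c("pierwszy dokument", "drugi dokument", "trzeci dokument"),
  stringsAsFactors = FALSE
)

# Assumption: tm >= 0.7, which requires the columns doc_id and text.
# The prefix is added here; it could equally be added in the SQL query.
entry.df <- data.frame(
  doc_id = paste0("e:", entry.d$id),
  text = entry.d$content,
  stringsAsFactors = FALSE
)

# The IDs are set while the corpus is constructed, so no meta<- call is needed.
entry.vc <- VCorpus(DataframeSource(entry.df),
                    readerControl = list(language = "pl"))

# Collect the document IDs to verify the result.
ids <- unname(vapply(entry.vc, meta, character(1), tag = "id"))
print(ids)
```

Any further columns in the data frame should end up as document-level metadata, so per-document attributes can travel with the same query.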