
R: text2vec DTM document count does not match the original document count


I am a student who uses text2vec very often.

Until last year, I used this package without any problems.

But today, when I built the DTM using the parallel functions, the number of documents in the DTM did not match the number of original documents.

The DTM's document count equals the original document count divided by the number of registered cores, so I suspect that the results are not merged after the parallel processing.

Attached is the code that I tested:

library(stringr)
library(text2vec)
library(data.table)
library(parallel)
library(doParallel)

N <- detectCores()
cl <- makeCluster(N)
registerDoParallel(cl)

data("movie_review")

setDT(movie_review)
setkey(movie_review, id)

## number of documents is 5000
IT <- itoken_parallel(movie_review$review,
                      ids         = movie_review$id,
                      tokenizer   = word_tokenizer,
                      progressbar = F)

VOCAB <- create_vocabulary(
    IT,
    ngram = c(1, 1)) %>%
    prune_vocabulary(term_count_min = 3)

VOCAB.order <- VOCAB[order(VOCAB$term_count, decreasing = T), ]

VECTORIZER <- vocab_vectorizer(VOCAB)

DTM <- create_dtm(IT,
                  VECTORIZER,
                  distributed = F)

## DTM dimension is not 5000: 5000 / 4 (number of cores) = 1250
dim(DTM)

I checked the text2vec itoken function in the vignette. I found the example below, which tests parallel processing with itoken, and it runs without error.

In this process, how do I apply stop words and a minimum frequency filter?

N_WORKERS = 1 # change 1 to number of cores in parallel backend
if(require(doParallel)) registerDoParallel(N_WORKERS)
data("movie_review")
it = itoken_parallel(movie_review$review[1:100], n_chunks = N_WORKERS)
system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'dgTMatrix'))

I sincerely look forward to your answers.

Thank you for your attention.


Solution

  • Hi, please remove distributed = F. It is a bug (distributed = F gets captured by the ellipsis argument here). I will fix it. Thanks for the report! A corrected call is sketched below.

    Regarding the second question - there is no good solution. You can compute frequent/infrequent words (actually hash buckets) manually with the colSums function, as in the second sketch below, but I don't recommend going this way.

    UPD - fixed now.
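
    For reference, a minimal sketch of the corrected call, reusing the IT and VECTORIZER objects from the question; the dimension comment is what the fix should produce:

    DTM <- create_dtm(IT, VECTORIZER)

    ## all 5000 documents should now be present
    dim(DTM)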
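
    And a rough illustration of the colSums workaround for the hashing pipeline. Note that it prunes hash buckets rather than actual terms, which is part of why I don't recommend it; the threshold of 3 here is just an arbitrary stand-in for term_count_min = 3:

    library(text2vec)
    library(Matrix)
    if (require(doParallel)) registerDoParallel(1)

    data("movie_review")
    it <- itoken_parallel(movie_review$review[1:100], n_chunks = 1)
    dtm_hash <- create_dtm(it, hash_vectorizer(2^16), type = "dgTMatrix")

    ## total count of each hash bucket across all documents
    bucket_count <- Matrix::colSums(dtm_hash)

    ## keep only buckets seen at least 3 times in the corpus
    dtm_pruned <- dtm_hash[, bucket_count >= 3]
    dim(dtm_pruned)

    For stop words there is no equivalent trick with hashing. With the vocabulary-based pipeline from the question this is not a problem: stop words can be passed via the stopwords argument of create_vocabulary, and the minimum frequency is exactly prune_vocabulary(term_count_min = ...), as your own code already does.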