Tags: r, nlp, lda, stm

Unable to fit new documents without running out of memory in STM topic modeling


I'm trying to label new texts based on a previously trained topic model, using the fitNewDocuments() function from the stm package in R.

I've tried fitting 10 new documents against topic models trained on 20,000, 10,000, and 3,000 documents, and the function always ends up using far too much memory (anywhere from 20 GB to 50 GB), crashing the R session.

I can't find anything online about using fitNewDocuments() properly. I'm following the documentation to the letter, but the process just never finishes. One thing I have noticed: the documentation says the origData argument should be out$meta, but supplying that returns an error, and I have to pass out instead.

That said, I'm able to reproduce the example in the documentation using the Gadarian data; it only fails with my own data.
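For reference, here is roughly the documented workflow I can reproduce (a sketch based on the Gadarian example in ?fitNewDocuments; the K value and prevalence formula come from that example, not from my own data):

```r
library(stm)
data(gadarian)  # example data shipped with the stm package

# Preprocess and prepare the corpus as in the stm documentation
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
out  <- prepDocuments(temp$documents, temp$vocab, temp$meta)

# Train a small topic model with a prevalence formula
mod <- stm(out$documents, out$vocab, K = 3,
           prevalence = ~treatment + s(pid_rep), data = out$meta)

# Fit a handful of "new" documents against the trained model
fitNewDocuments(model = mod,
                documents = out$documents[1:5],
                newData = out$meta[1:5, ],
                origData = out$meta,  # the help page says to pass the metadata here
                prevalence = ~treatment + s(pid_rep),
                prevalencePrior = "Covariate")
```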

I could share code, but it would be useless without access to the data, which sadly I can't provide.


Solution

  • After trying a million things, I was somehow able to fix this by removing the prevalencePrior = "Covariate" argument from fitNewDocuments(); with that change the new documents were fit properly against the trained models (see the sketch below).
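A minimal sketch of the working call, assuming mod is the trained stm model, out is the prepDocuments() output, and new.docs / new.meta hold the new documents and their metadata (placeholder names, not from the question):

```r
# Omitting prevalencePrior lets it fall back to the default ("Average"),
# which in this case avoided the runaway memory use seen with "Covariate".
fit <- fitNewDocuments(model = mod,
                       documents = new.docs,  # new texts, indexed to mod's vocabulary
                       newData = new.meta,
                       origData = out$meta,
                       prevalence = ~treatment + s(pid_rep))

fit$theta  # estimated topic proportions for the new documents
```

If the new texts were processed separately and don't share the model's vocabulary, stm's alignCorpus() can align them to the trained model's vocab before calling fitNewDocuments().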