Tags: document-classification, stanford-nlp

Using CoreNLP ColumnDataClassifier for document classification with a large corpus


I'm trying to use the CoreNLP ColumnDataClassifier to classify a large number of documents. I have a little more than 1 million documents and a label set of about 20,000.
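For context, the way I'm using the classifier is essentially the standard train-then-classify pattern sketched below (`my.prop`, `train.tsv`, and `test.tsv` are placeholders, and exact method signatures may vary slightly between CoreNLP versions):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import edu.stanford.nlp.classify.Classifier;
import edu.stanford.nlp.classify.ColumnDataClassifier;
import edu.stanford.nlp.ling.Datum;

public class TrainAndClassify {
  public static void main(String[] args) throws Exception {
    // Feature options (n-grams, split words, ...) live in the .prop file.
    ColumnDataClassifier cdc = new ColumnDataClassifier("my.prop");
    // Training data is tab-separated: label in column 0, document text in column 1.
    Classifier<String, String> cl =
        cdc.makeClassifier(cdc.readTrainingExamples("train.tsv"));
    // Classify unseen documents line by line.
    try (BufferedReader in = new BufferedReader(new FileReader("test.tsv"))) {
      for (String line; (line = in.readLine()) != null; ) {
        Datum<String, String> d = cdc.makeDatumFromLine(line);
        System.out.println(cl.classOf(d) + "\t" + line);
      }
    }
  }
}
```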

Is this even feasible in terms of memory requirements? (I currently have only 16 GB of RAM.)

Is it somehow possible to train the classifier in an iterative way, splitting the input into many smaller files?


Solution

  • As an experiment I ran:

    1.) 500,000 documents, each with 100 random words
    2.) a label set of 10,000
    

    This crashed with a memory error even when I gave it 40 GB of RAM.

    I also ran:

    1.) same 500,000 documents
    2.) a label set of 6
    

    This ran successfully to completion with 16 GB of RAM.

    I'm not sure at exactly what label-set size the crash sets in, but memory presumably grows with the number of labels because the underlying linear classifier keeps a weight for every (feature, label) pair. My advice would be to shrink the possible label set as much as you can and experiment; a sketch for generating a synthetic corpus along these lines follows below.
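
If you want to run this kind of test on your own hardware, here is a minimal sketch for generating a synthetic tab-separated training file in the layout described above (label in column 0, a document of random words in column 1). All sizes, the vocabulary size, and the file name are just illustrative:

```java
import java.io.PrintWriter;
import java.util.Random;

public class SyntheticCorpus {
  public static void main(String[] args) throws Exception {
    int numDocs = 500_000;
    int numLabels = 10_000;   // shrink this (e.g. to 6) to probe memory behaviour
    int wordsPerDoc = 100;
    int vocabSize = 50_000;   // assumed vocabulary for the random words
    Random rng = new Random(42);
    try (PrintWriter out = new PrintWriter("synthetic.train", "UTF-8")) {
      for (int i = 0; i < numDocs; i++) {
        // Build one document of random "words".
        StringBuilder doc = new StringBuilder();
        for (int w = 0; w < wordsPerDoc; w++) {
          if (w > 0) doc.append(' ');
          doc.append('w').append(rng.nextInt(vocabSize));
        }
        // Tab-separated line: label, then document text.
        out.println("label" + rng.nextInt(numLabels) + "\t" + doc);
      }
    }
  }
}
```

The resulting `synthetic.train` can then be passed to the classifier as its `trainFile` while you vary `numLabels` to see where memory becomes the limiting factor on your machine.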