Search code examples
nlpstanford-nlpcrf

Stanford CoreNLP Train custom NER model


I was making some tests by training custom models with crf, and since i don't have a proper training file i would like to make by myself a list of 5 tags and maybe 10 words only to start with and the plan is to keep improving the model with more incoming data in the future. but the results i get are plenty of false positives (it tags many words which have nothing to do with the original one in the training file) i imagine since the models created are probabilistic and take into considerarion more than just separate words

Let's say i want to train corenlp to detect a small list of words without caring about the context are there some special settings for that? if not, is there a way to calculate how much data is needed to get an accurate model?


Solution

  • After some tests and research find out a really good option for my case is RegexNER which works in a deterministic way and can also be combined with NER. So far tried with smaller set of rules and does the job pretty well. Next step is to determine how scalable and usable is in a high traffic stress scenario (the one i'm interested of) and compare with other solutions based in python