python spacy-3

Training NER in spaCy v3 needs dev.spacy at the command line


I am trying to prepare a custom NER model in spaCy v3. From a training perspective, v3 has changed significantly compared to v2.

I am using the default config with en_core_web_lg. I have prepared the training data (training.spacy) using the convert command. However, the training command also needs a dev.spacy file.

I am not sure what data is expected in dev.spacy. Is it asking for a plain-text corpus to go with the training.spacy file? If so, is there a way to convert a plain-text file into the spaCy format?

Command from the spaCy site:

    python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

Can someone please explain how to prepare dev.spacy?


Solution

  • train.spacy is a placeholder for your collection of 'training' data - a file, or a directory of files, usually created with the spaCy convert utility. dev.spacy is a placeholder for your collection of 'validation' data - same format as the training files, but used as a validation sample during training (for NER, it is used to compute precision, recall and F-score each time the model is evaluated during training). The commonly suggested size for the validation sample is 10 to 20% of the training sample. I tend to use 20% because my data has a large variation - but a larger validation sample adds training overhead.
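If all the annotated data is already in one converted file, one way to get a dev set is to hold out part of it. The sketch below is only an illustration under some assumptions: training.spacy is a single DocBin file, a random 80/20 split is acceptable for the data, and the helper names split_train_dev and split_docbin and the file paths are made up for this example (they are not part of spaCy).

```python
import random


def split_train_dev(items, dev_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, dev) lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one item for validation.
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def split_docbin(src_path, train_path, dev_path, dev_fraction=0.2, seed=0):
    """Hypothetical helper: split one .spacy file into train and dev files."""
    # spaCy is imported here so that split_train_dev works without it.
    import spacy
    from spacy.tokens import DocBin

    # A blank pipeline is enough; only its vocab is needed to load the docs.
    nlp = spacy.blank("en")
    docs = list(DocBin().from_disk(src_path).get_docs(nlp.vocab))
    train_docs, dev_docs = split_train_dev(docs, dev_fraction, seed)
    DocBin(docs=train_docs).to_disk(train_path)
    DocBin(docs=dev_docs).to_disk(dev_path)
    return len(train_docs), len(dev_docs)
```

Calling split_docbin("./training.spacy", "./train.spacy", "./dev.spacy") would then produce the two files to pass as --paths.train and --paths.dev. If the data has natural groups (for example, several sentences from the same document), splitting by group rather than by individual doc may give a more honest validation score.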