I'm training a custom named entity recognition model. I created the config.cfg and train.spacy files. Among other settings, the config uses en_core_web_lg as the source of pre-trained vectors:
[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null
I then train the model with:
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy
This works and I can see the output model.
Then I try to train another NER model that has nothing to do with the previous one (same code, different data), and I get this error:
Error: [E884] The pipeline could not be initialized because the vectors could not be found at 'en_core_web_lg'.
If your pipeline was already initialized/trained before, call 'resume_training' instead of 'initialize', or initialize only the components that are new.
It looks like training modified the base en_core_web_lg model, which would be a problem for me because I use it for several different models, some fine-tuned and others straight out of the box.
How can I train this NER model while making sure the downloaded en_core_web_lg model is not modified? And would that ensure I can train several models without them interfering with each other?
When you use a model as a source of vectors, or for that matter a source for any other part of a pipeline, spaCy will not modify it under any circumstances. Something else is going on.
Are you perhaps using a different virtualenv? Does spacy.load("en_core_web_lg") work?
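As a rough way to check the virtualenv question, the sketch below reports which Python interpreter is running and whether en_core_web_lg is importable from it. It uses only the standard library; the function name describe_environment is made up for illustration. Note that in a notebook, `!python` is resolved by the shell, so it may not be the same interpreter as the kernel, and it is worth running this check in both places.

```python
import importlib.util
import sys


def describe_environment(package_name="en_core_web_lg"):
    """Report the running interpreter and whether the package is importable from it.

    describe_environment is a hypothetical helper, not part of spaCy.
    """
    spec = importlib.util.find_spec(package_name)
    return {
        "interpreter": sys.executable,
        "installed": spec is not None,
        # Where the package was found, if it was found at all.
        "location": spec.origin if spec is not None else None,
    }


print(describe_environment())
```

If "installed" is False in the environment that runs the training command, the E884 error would be explained by a missing package rather than a modified one.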
One other thing that could be happening (though it seems less likely): some fields accept either the name of an installed pipeline (found through entry points) or a local path. If there is a directory named en_core_web_lg in the folder where you are training, that directory might be checked first and picked up instead of the installed package.
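To rule that out, a small check like the following looks for a file or directory named en_core_web_lg in the training folder. This is only a sketch: find_shadowing_path is a hypothetical helper, and it assumes the training command is run from the directory passed as workdir.

```python
from pathlib import Path


def find_shadowing_path(name="en_core_web_lg", workdir="."):
    """Return the resolved local path called `name` inside `workdir`, or None.

    find_shadowing_path is a hypothetical helper, not part of spaCy.
    """
    candidate = Path(workdir) / name
    return candidate.resolve() if candidate.exists() else None


# Run this from the folder where `spacy train` is invoked.
print(find_shadowing_path())
```

If this prints a path rather than None, renaming or moving that directory and retraining would show whether it was the cause.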