
spaCy models with different word2vec embeddings give the same results


I am trying to improve the performance of my spaCy NER model by using my own pretrained vectors. I created the vectors with word2vec from different texts and saved them as .txt files. However, I get exactly the same scores with each set of vectors, which doesn't seem right.

Here are the steps I have been following for one file with custom pretrained embeddings:

!python -m spacy init vectors en /content/drive/MyDrive/MODELS_W2V/JSTOR_uncleaned_sents_model.txt ./uncl_txt --name JSTOR_unlceaned_sents_model

import spacy

nlp = spacy.load("./uncl_txt")
nlp.add_pipe("ner")
nlp.to_disk("./uncl_txt")

!python -m spacy train /content/uncl_txt/config.cfg --paths.train ./Spacy/train.spacy --paths.dev ./Spacy/dev.spacy --output ./uncl_txt --paths.vectors ./uncl_txt

!python -m spacy evaluate /content/uncl_txt/model-best ./Spacy/eval.spacy --output ner_with_uncleaned_sents_vectors.jsonl

Here are the steps for the other embeddings file:

!python -m spacy init vectors en /content/drive/MyDrive/MODELS_W2V/JSTOR_abs_model.txt ./abs --name JSTOR_abs_model

nlp = spacy.load("./abs")
nlp.add_pipe("ner")
nlp.to_disk("./abs")

!python -m spacy train /content/abs/config.cfg --paths.train ./Spacy/train.spacy --paths.dev ./Spacy/dev.spacy --output ./abs/ --paths.vectors ./abs

!python -m spacy evaluate ./abs/model-best ./Spacy/eval.spacy --output ner_with_abs_vectors.jsonl

Am I doing something wrong? Should I add something in the config file?


Solution

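  • The ner component created with nlp.add_pipe("ner") uses the default model config, which does not have static (pretrained) vectors enabled as features. The vectors are stored in the pipeline, but the model does not read them, which is why swapping the vectors file does not change the scores.

    The easiest way to create a config for ner with static vectors enabled is to run spacy init config with -o accuracy: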

    spacy init config -p ner -o accuracy ner.cfg
    

    And then train with:

    spacy train ner.cfg --paths.train train.spacy --paths.dev dev.spacy --paths.vectors ./vectors
    

    (You can also enable the vectors with custom config settings via nlp.add_pipe("ner", config=...), but this requires digging into the details of the internal default model config, which can change between spaCy versions, so spacy init config is easier to use.)
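
One quick way to confirm whether a given config actually uses static vectors is to look for the include_static_vectors setting in it. The following is a minimal sketch using only the Python standard library. The helper name static_vector_settings and the sample config text are illustrative assumptions, not part of spaCy, and the section name in the sample resembles what spacy init config generates but may differ between spaCy versions.

```python
import configparser

# Hypothetical sample resembling a config generated with "-o accuracy";
# section names may differ between spaCy versions.
SAMPLE_CFG = """
[paths]
vectors = null

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
include_static_vectors = true

[initialize]
vectors = ${paths.vectors}
"""


def static_vector_settings(cfg_text):
    """Map each section that sets include_static_vectors to its boolean value."""
    # interpolation=None leaves ${...} references as plain strings.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(cfg_text)
    found = {}
    for section in parser.sections():
        if parser.has_option(section, "include_static_vectors"):
            value = parser.get(section, "include_static_vectors")
            found[section] = value.strip().lower() == "true"
    return found


if __name__ == "__main__":
    print(static_vector_settings(SAMPLE_CFG))
```

To check a real file, pass its contents, for example static_vector_settings(open("ner.cfg", encoding="utf8").read()). An empty result, or only False values, would suggest that the model is not reading the vectors.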