Tags: python, spacy, fasttext

What is the difference between spacy's "--base-model" and "--vectors" arguments when using custom embeddings for NER?


I trained fasttext embeddings and saved them as a .vec file. I want to use these for my spacy NER model. Is there a difference between

python -m spacy train en [new_model] [train_data] [dev_data] --pipeline ner --base-model embeddings.vec

and

python -m spacy train en [new_model] [train_data] [dev_data] --pipeline ner --vectors embeddings.vec

Both methods produce nearly identical training loss, F-score, etc.


Solution

  • If you need to initialize a spacy model with vectors, use spacy init-model like this, where lg is the language code:

    spacy init-model lg model_dir -v embeddings.vec -vn my_custom_vectors
    
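
    As a side note, the .vec file that init-model reads is the plain word2vec text format: a header line "num_rows num_dims" followed by one "token v1 v2 ... v_dim" row per word. A minimal sketch of that format with a toy parser (the file contents and helper name here are illustrative, not part of spacy or fasttext):

    ```python
    # Toy .vec contents in word2vec text format:
    # header "num_rows num_dims", then one row per token.
    vec_text = """2 3
    king 0.1 0.2 0.3
    queen 0.4 0.5 0.6
    """

    def read_vec(text):
        """Parse word2vec-style text into {token: [floats]} (illustrative helper)."""
        lines = text.strip().splitlines()
        n_rows, n_dims = (int(x) for x in lines[0].split())
        vectors = {}
        for line in lines[1:]:
            parts = line.split()
            word, values = parts[0], [float(v) for v in parts[1:]]
            assert len(values) == n_dims  # every row must match the header dim
            vectors[word] = values
        assert len(vectors) == n_rows  # row count must match the header
        return vectors

    vectors = read_vec(vec_text)
    ```

    If init-model rejects your file, checking it against this shape (correct header counts, consistent row width) is a quick first diagnostic.
    
    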

    Once you have the vectors saved as part of a spacy model:

    • --vectors loads the vectors from the provided model, so the initial model is spacy.blank("lg") + vectors
    • --base-model loads everything (tokenizer, pipeline components, vectors) from the provided model, so the initial model is spacy.load(model)

    If the provided model doesn't have any pipeline components in it, the only potential difference is in the tokenizer settings, since spacy.blank("lg") uses defaults that can vary slightly between individual spacy versions.
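
    The distinction above can be sketched conceptually (this models the described behavior with plain dicts; it is not spacy's internal API):

    ```python
    # Conceptual sketch: what each flag contributes to the initial model
    # before training starts, per the explanation above.

    def initial_model(provided_model, flag):
        """Return the starting state for training given one of the two flags."""
        if flag == "--vectors":
            # spacy.blank("lg") + vectors: only the vectors are reused.
            return {
                "lang": provided_model["lang"],
                "tokenizer": "blank-model defaults",
                "pipeline": [],
                "vectors": provided_model["vectors"],
            }
        if flag == "--base-model":
            # spacy.load(model): tokenizer, components, and vectors all reused.
            return dict(provided_model)
        raise ValueError(flag)

    # A model produced by init-model: vectors, but no pipeline components.
    saved = {
        "lang": "en",
        "tokenizer": "saved settings",
        "pipeline": [],
        "vectors": "my_custom_vectors",
    }

    a = initial_model(saved, "--vectors")
    b = initial_model(saved, "--base-model")
    # With no saved components, both starting points share the same vectors
    # and empty pipeline; only the tokenizer settings can differ -- which is
    # why the two runs produce nearly identical scores.
    assert a["vectors"] == b["vectors"] and a["pipeline"] == b["pipeline"]
    ```
    
    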