I can see that in the English spaCy models, the medium model performs better than the small one, and the large model outperforms the medium one, though only marginally. However, the model descriptions say that they have all been trained on OntoNotes, the exception being the vectors of the md and lg models, which were trained on Common Crawl. So if all models were trained on the same dataset (OntoNotes), and the only difference is the vectors, why is there a performance difference on tasks that don't require vectors? I would love to find out more about each model and the settings it was trained with, but that information doesn't seem to be readily available.
"So if all models were trained on the same dataset (OntoNotes), and the only difference is the vectors, why then is there a performance difference for the tasks that don't require vectors?"
I think the missing piece you're looking for is this one: If models are initialised with vectors, those vectors will be used as features during training. Depending on the vectors, this can give the statistical model components you train a significant boost in accuracy.
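A quick way to see this difference yourself is to inspect the vector table each pipeline ships with. This is only a minimal sketch, assuming the en_core_web_sm and en_core_web_md packages are installed:

```python
import spacy

# Compare the static vector tables bundled with the small and medium pipelines.
for name in ("en_core_web_sm", "en_core_web_md"):
    nlp = spacy.load(name)
    n_vectors, width = nlp.vocab.vectors.shape
    print(f"{name}: {n_vectors} vectors of width {width}")
    # The sm pipeline typically reports an empty table, while md/lg
    # report a large pretrained table - those are the vectors that were
    # also available as features when the md/lg components were trained.
```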
However, vectors can be quite large, so you typically want to find the best trade-off between model size and accuracy. If vectors were used during training, the same vectors also need to be available at runtime, and you can't easily swap them out – otherwise, the model will perform much worse. The sm model, which wasn't trained with vectors, allows you to load in your own vectors for, say, similarity comparisons, without affecting the predictions of the pre-trained statistical components.
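For example, here's a hedged sketch of that idea – copying the md vector table into the sm pipeline is just one illustration, and it assumes both packages are installed:

```python
import spacy

nlp_sm = spacy.load("en_core_web_sm")   # trained without static vectors
nlp_md = spacy.load("en_core_web_md")   # ships with pretrained vectors

# Reuse the md vector table in the sm pipeline for similarity comparisons.
# The sm components were trained without vectors, so their predictions
# (tagging, parsing, NER) shouldn't depend on this table.
nlp_sm.vocab.vectors = nlp_md.vocab.vectors

doc1 = nlp_sm("I like cats")
doc2 = nlp_sm("I like dogs")
print(doc1.similarity(doc2))
```

Doing the same thing with the md or lg models would be risky: their statistical components were trained with their original vectors as features, so replacing the table changes the inputs those components see at runtime.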
TL;DR: spaCy's sm, md and lg core models were all trained on the same data under the same conditions. The only difference is the vectors that are included, which are used as features and thus have an impact on the model's accuracy.