I am aware that training a spaCy model (say, for Named Entity Recognition) normally requires running some commands from the CLI. However, I need to train a spaCy model inside a Vertex AI Pipeline component (which can simply be considered a "pure Python script"), so training from the CLI is NOT an option for my use case. My current attempt looks like this:
# train.py
# IMPORTANT: Assume all the necessary files are already available in the same directory as this script
import spacy
import subprocess
subprocess.run(["python", "-m", "spacy", "init", "fill-config", "base_config.cfg", "config.cfg"])
subprocess.run(["python", "-m", "spacy", "train", "config.cfg",
"--output", "my_model",
"--paths.train", "train.spacy",
"--paths.dev", "dev.spacy"])
This lets me carry on with the training, although it is not quite stable at times. But I don't know whether this is the best implementation, or whether there is something better or more recommended (once again, NOT involving the CLI).
IMPORTANT: As a Python script, it should run without a problem when I execute it via python train.py.
Any ideas?
I think I have a functional solution now:
from pathlib import Path
from spacy.cli.download import download
from spacy.cli.init_config import fill_config
from spacy.cli.train import train
download('en_core_web_lg')
fill_config(Path("config.cfg"), Path("base_config.cfg"))
train(Path("config.cfg"), Path("my_model"), overrides={"paths.train": "train.spacy", "paths.dev": "dev.spacy"})
With it, I have managed to successfully train a spaCy NER model from a Python script (i.e., via python train.py).
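In case it helps, here is a minimal sketch of how the same calls could be wrapped in a reusable function that fails early when a corpus file is missing. The names build_overrides and train_ner are hypothetical helpers of my own, not part of spaCy; only fill_config and train come from spaCy, called the same way as above. The download step is left out here on the assumption that the pretrained package is already installed.

```python
from pathlib import Path
from typing import Dict


def build_overrides(train_path: Path, dev_path: Path) -> Dict[str, str]:
    """Hypothetical helper: build the config overrides, checking the corpus files first."""
    for path in (train_path, dev_path):
        if not path.is_file():
            raise FileNotFoundError(f"Corpus file not found: {path}")
    return {"paths.train": str(train_path), "paths.dev": str(dev_path)}


def train_ner(base_config: Path, output_dir: Path,
              train_path: Path, dev_path: Path) -> None:
    """Hypothetical wrapper around the same spaCy calls used in the script above."""
    # spaCy is imported lazily so build_overrides can be used (and tested) without it.
    from spacy.cli.init_config import fill_config
    from spacy.cli.train import train

    # Validate the inputs before spending time on config filling or training.
    overrides = build_overrides(train_path, dev_path)

    # Write the filled config next to the base config (output file first, base second).
    filled_config = base_config.with_name("config.cfg")
    fill_config(filled_config, base_config)

    train(filled_config, output_dir, overrides=overrides)
```

Calling train_ner(Path("base_config.cfg"), Path("my_model"), Path("train.spacy"), Path("dev.spacy")) should then behave like the script above, except that a missing train.spacy or dev.spacy raises a FileNotFoundError before training starts.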
For more details, please check this thread.
Thanks.