Tags: python, google-cloud-vertex-ai, spacy-3

Train spaCy NER model from Python script (not CLI)


I am aware that training a spaCy model (say, for Named Entity Recognition) normally requires running commands from the CLI. However, I need to train a spaCy model inside a Vertex AI Pipeline component (which can be thought of as a pure Python script), so training from the CLI is not an option for my use case. My current attempt looks like this:


# train.py
# IMPORTANT: Assume all the necessary files are already available in the same directory as this script

import spacy
import subprocess

subprocess.run(["python", "-m", "spacy", "init", "fill-config", "base_config.cfg", "config.cfg"])

subprocess.run(["python", "-m", "spacy", "train", "config.cfg",
                "--output", "my_model",
                "--paths.train", "train.spacy",
                "--paths.dev", "dev.spacy"])

This lets me carry on with the training, although it is not always stable. I don't know whether this is the best implementation or whether there is a better or more recommended approach (once again, not involving the CLI).

IMPORTANT: Since it is a Python script, running it via python train.py should work without a problem.

Any ideas?


Solution

  • I think I have a functional solution now:

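    from pathlib import Path
    from spacy.cli.download import download
    from spacy.cli.init_config import fill_config
    from spacy.cli.train import train

    # Download the en_core_web_lg pipeline (what `python -m spacy download` does)
    download("en_core_web_lg")
    # Fill base_config.cfg with defaults and write it to config.cfg (output file comes first)
    fill_config(Path("config.cfg"), Path("base_config.cfg"))
    # Train, overriding the data paths as --paths.train / --paths.dev would on the CLI
    train(Path("config.cfg"), Path("my_model"), overrides={"paths.train": "train.spacy", "paths.dev": "dev.spacy"})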
    

    With it, I have successfully trained a spaCy NER model from a Python script (i.e., via python train.py).

    For more details, please check this thread.

    Thanks.
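
For use inside a pipeline component, the calls from the solution above could be wrapped in a single function. What follows is only a minimal sketch, not a definitive implementation: the names `build_overrides`, `train_ner`, and `filled_config` are hypothetical and do not come from spaCy or from the original answer. It assumes a spaCy 3 version in which `spacy.cli.train.train` accepts `overrides`, as in the solution. The spaCy imports are placed inside the function so that the module can be imported before spaCy is installed in the component's environment.

```python
from pathlib import Path
from typing import Dict, Union

PathLike = Union[str, Path]


def build_overrides(train_path: PathLike, dev_path: PathLike) -> Dict[str, str]:
    """Map the data files to the dotted config keys used in the solution (hypothetical helper)."""
    return {"paths.train": str(train_path), "paths.dev": str(dev_path)}


def train_ner(base_config: PathLike, output_dir: PathLike,
              train_path: PathLike, dev_path: PathLike,
              filled_config: PathLike = "config.cfg") -> Path:
    """Fill the base config, run training, and return the output directory (hypothetical wrapper)."""
    # Imported lazily: spaCy is only required once training is actually requested
    from spacy.cli.init_config import fill_config
    from spacy.cli.train import train

    # fill_config takes the output file first, then the base config
    fill_config(Path(filled_config), Path(base_config))
    # Same call as in the solution, with the data paths passed as config overrides
    train(Path(filled_config), Path(output_dir),
          overrides=build_overrides(train_path, dev_path))
    return Path(output_dir)
```

With the file names from the question, the call would be `train_ner("base_config.cfg", "my_model", "train.spacy", "dev.spacy")`. Keeping the overrides in a separate helper makes the path mapping easy to check without starting a training run.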