
spaCy CLI debug shows 0 train/dev docs in CLI-formatted JSON converted by


I am trying to run the spaCy CLI but my training data and dev data seem somehow to be incorrect as seen when I run debug:

| => python3 -m spacy debug-data en ./CLI_train_randsplit_anno191022.json ./CLI_dev_randsplit_anno191022.json --pipeline ner --verbose

=========================== Data format validation =========================== 

✔ Corpus is loadable 

=============================== Training stats =============================== 

Training pipeline: ner 

Starting with blank model 'en' 

0 training docs 

0 evaluation docs 

✔ No overlap between training and evaluation data 

✘ Low number of examples to train from a blank model (0) 

It's recommended to use at least 2000 examples (minimum 100) 

============================== Vocab & Vectors ============================== 

ℹ 0 total words in the data (0 unique) 

10 most common words: 

ℹ No word vectors present in the model 

========================== Named Entity Recognition ========================== 

ℹ 0 new labels, 0 existing labels 

0 missing values (tokens with '-' label) 

✔ Good amount of examples for all labels 

✔ Examples without occurrences available for all labels 

✔ No entities consisting of or starting/ending with whitespace 

================================== Summary ================================== 

✔ 5 checks passed 

✘ 1 error 

Trying to train anyway yields:

| => python3 -m spacy train en ./models/CLI_1 ./CLI_train_randsplit_anno191022.json ./CLI_dev_randsplit_anno191022.json -n 150 -p 'ner' --verbose 

dropout_from = 0.2 by default 

dropout_to = 0.2 by default 

dropout_decay = 0.0 by default 

batch_from = 100.0 by default 

batch_to = 1000.0 by default 

batch_compound = 1.001 by default 

Training pipeline: ['ner'] 

Starting with blank model 'en' 

beam_width = 1 by default 

beam_density = 0.0 by default 

beam_update_prob = 1.0 by default 

Counting training words (limit=0) 

learn_rate = 0.001 by default 

optimizer_B1 = 0.9 by default 

optimizer_B2 = 0.999 by default 

optimizer_eps = 1e-08 by default 

L2_penalty = 1e-06 by default 

grad_norm_clip = 1.0 by default 

parser_hidden_depth = 1 by default 

subword_features = True by default 

conv_depth = 4 by default 

bilstm_depth = 0 by default 

parser_maxout_pieces = 2 by default 

token_vector_width = 96 by default 

hidden_width = 64 by default 

embed_size = 2000 by default 

Itn  NER Loss   NER P   NER R   NER F   Token %  CPU WPS 

---  ---------  ------  ------  ------  -------  ------- 

✔ Saved model to output directory 


Traceback (most recent call last): 

  File "/usr/local/lib/python3.7/site-packages/spacy/cli/", line 389, in train 

    scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) 

  File "/usr/local/lib/python3.7/site-packages/spacy/", line 673, in evaluate 

    docs, golds = zip(*docs_golds) 

ValueError: not enough values to unpack (expected 2, got 0) 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 

  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/", line 193, in _run_module_as_main 

    "__main__", mod_spec) 

  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/", line 85, in _run_code 

    exec(code, run_globals) 

  File "/usr/local/lib/python3.7/site-packages/spacy/", line 35, in <module>[command], sys.argv[1:]) 

  File "/usr/local/lib/python3.7/site-packages/", line 328, in call 

    cmd, result = parser.consume(arglist) 

  File "/usr/local/lib/python3.7/site-packages/", line 207, in consume 

    return cmd, self.func(*(args + varargs + extraopts), **kwargs) 

  File "/usr/local/lib/python3.7/site-packages/spacy/cli/", line 486, in train 

    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names) 

  File "/usr/local/lib/python3.7/site-packages/spacy/cli/", line 548, in _collate_best_model 

    bests[component] = _find_best(output_path, component) 

  File "/usr/local/lib/python3.7/site-packages/spacy/cli/", line 567, in _find_best 

    accs = srsly.read_json(epoch_model / "accuracy.json") 

  File "/usr/local/lib/python3.7/site-packages/srsly/", line 50, in read_json 

    file_path = force_path(location) 

  File "/usr/local/lib/python3.7/site-packages/srsly/", line 21, in force_path 

    raise ValueError("Can't read file: {}".format(location)) 

ValueError: Can't read file: models/CLI_1/model0/accuracy.json 
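The first traceback shows the error Python raises when a `zip(*...)` result is unpacked into two names but the input sequence is empty, which is consistent with the `0 evaluation docs` reported by debug-data. A minimal sketch of that behaviour, independent of spaCy (the function and variable names are illustrative, not spaCy's internals):

```python
def unpack_pairs(docs_golds):
    """Split (doc, gold) pairs into two tuples using the zip(*pairs) idiom."""
    docs, golds = zip(*docs_golds)
    return docs, golds


# With at least one pair, the unpacking works.
print(unpack_pairs([("doc1", "gold1"), ("doc2", "gold2")]))

# With no pairs, zip() yields nothing and the two-name unpacking fails.
error_message = None
try:
    unpack_pairs([])
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The second traceback then looks like a follow-on failure: no evaluation ran, so no accuracy.json was written for the saved model.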

My training and dev docs were generated using, saved as json files using the function:

def make_CLI_json(mock_docs, CLI_out_file_path): 
    CLI_json = docs_to_json(mock_docs) 
    with open(CLI_out_file_path, 'w') as json_file: 
        json.dump(CLI_json, json_file) 

I verified them both to be valid json at

I created the docs from which these json originated using the function:

from spacy import displacy
from spacy.lang.en import English

def import_from_doccano(jx_in_file_path, view=True):
    annotations = load_jsonl(jx_in_file_path)
    mock_nlp = English()
    sentencizer = mock_nlp.create_pipe("sentencizer")
    unlabeled = 0
    DATA = []
    mock_docs = []

    for anno in annotations:
        # get DATA (as used in spacy inline training)
        if "labels" in anno.keys():
            ents = [tuple([label[0], label[1], label[2]])
                    for label in anno["labels"]]
        else:
            ents = []

        DATUM = (anno["text"], {"entities": ents})
        DATA.append(DATUM)

        # mock a doc for viz in displacy
        mock_doc = mock_nlp(anno["text"])
        if "labels" in anno.keys():
            entities = anno["labels"]
            if not entities:
                unlabeled += 1
            ents = [(e[0], e[1], e[2]) for e in entities]
            spans = [mock_doc.char_span(s, e, label=L) for s, e, L in ents]
            # _cleanup_spans() is the helper from the linked example
            mock_doc.ents = _cleanup_spans(spans)

            if view:
                displacy.render(mock_doc, style='ent')

        # run the sentencizer only after the entities have been written
        sentencizer(mock_doc)
        mock_docs.append(mock_doc)

    print(f'Unlabeled: {unlabeled}')
    return DATA, mock_docs

I wrote the function above to return the examples in both the format required for inline training (e.g. as shown at as well as to form these kinds of “mock” docs so that I can use displacy and/or the CLI. For the latter purpose, I followed the code shown at with a couple of notable differences. The _cleanup_spans() function is identical to the one in the example. I did not use minibatch() but made a separate doc for each of my labeled annotations. Also (perhaps critically?), I found that using the sentencizer ruined many of my annotations, possibly because the spans get shifted in a way that the _cleanup_spans() function fails to repair properly. Removing the sentencizer causes the docs_to_json() function to throw an error. In my function (unlike in the linked example) I therefore run the sentencizer on each doc after the entities are written to it, which preserves my annotations properly and allows the docs_to_json() function to run without complaints.

The function load_jsonl called within import_from_doccano() is defined as:

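import json

def load_jsonl(input_path):
    data = []
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            # str.replace() does not interpret regex patterns such as '\n|\r',
            # so strip the line ending directly and skip blank lines
            line = line.strip()
            if line:
                data.append(json.loads(line, strict=False))
    print('Loaded {} records from {}'.format(len(data), input_path))
    return data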

My annotations are each of length ~10000 characters or less. They are exported from doccano as JSONL using the format:

{"id": 1, "text": "EU rejects ...", "labels": [[0,2,"ORG"], [11,17, "MISC"], [34,41,"ORG"]]}
{"id": 2, "text": "Peter Blackburn", "labels": [[0, 15, "PERSON"]]}
{"id": 3, "text": "President Obama", "labels": [[10, 15, "PERSON"]]}
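As a minimal sketch of how one record in this format maps onto the `(text, {"entities": [...]})` tuples used for inline training, the snippet below parses a single JSONL line with the standard library only. The helper name `doccano_record_to_example` is hypothetical, and treating a missing "labels" key as "no entities" is an assumption.

```python
import json


def doccano_record_to_example(line):
    """Convert one doccano JSONL line into (text, {"entities": [...]})."""
    record = json.loads(line)
    # "labels" holds [start, end, label] triples; a missing key is treated
    # as an unlabeled record (assumption).
    entities = [(start, end, label)
                for start, end, label in record.get("labels", [])]
    return record["text"], {"entities": entities}


sample = '{"id": 2, "text": "Peter Blackburn", "labels": [[0, 15, "PERSON"]]}'
text, annotations = doccano_record_to_example(sample)

# Slice the text with each character span to check the offsets by eye.
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))
```

Slicing the text with each span like this is a quick way to spot off-by-one offsets before any docs are built.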

The data are split into train and test sets using the function:

import random

def test_train_split(DATA, mock_docs, n_train):
    # shuffle the pairs together so DATA and mock_docs stay aligned
    L = list(zip(DATA, mock_docs))
    random.shuffle(L)
    DATA, mock_docs = zip(*L)
    DATA = list(DATA)
    mock_docs = list(mock_docs)
    TRAIN_DATA = DATA[:n_train]
    train_docs = mock_docs[:n_train]
    TEST_DATA = DATA[n_train:]
    test_docs = mock_docs[n_train:]
    return TRAIN_DATA, TEST_DATA, train_docs, test_docs

And finally each is written to json using the following function:

def make_CLI_json(mock_docs, CLI_out_file_path):
    CLI_json = docs_to_json(mock_docs)
    with open(CLI_out_file_path, 'w') as json_file:
        json.dump(CLI_json, json_file)

I do not understand why debug-data shows 0 training docs and 0 evaluation docs, or why the train command fails. The JSON looks correct as far as I can tell. Is my data formatted incorrectly, or is there something else going on? Any help or insights would be greatly appreciated.

This is my first question on SE, so apologies in advance if I've failed to follow some guideline or other. There are a lot of components involved, so I'm not sure how I might produce a minimal code example that would replicate my problem.


Mac OS 10.15 Catalina. Everything is pip3-installed into the user path. No virtual environment.

| => python3 -m spacy info --markdown

## Info about spaCy

* **spaCy version:** 2.2.1
* **Platform:** Darwin-19.0.0-x86_64-i386-64bit
* **Python version:** 3.7.4


  • This is a legitimately confusing aspect of the API. For internal/historical reasons, produces a dict that still needs to be wrapped in a list to get to the final training format. Try:

    srsly.write_json(filename, [])

    spacy debug-data doesn't have proper schema checks yet, so this is more frustrating/confusing than it should be.
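A minimal sketch of the wrapping step, using only the standard library so it runs without spaCy. It assumes the dict in question is the one returned by docs_to_json() in the question's make_CLI_json(). The helper names and the `{"id": 0, "paragraphs": []}` stand-in are hypothetical placeholders for illustration, not real converter output.

```python
import json
import os
import tempfile


def write_training_json(converted, path):
    """Write converted docs so that the top level of the file is a list."""
    # Assumption: `converted` is the dict produced by the conversion step.
    # Wrap it in a list unless it already is one.
    payload = converted if isinstance(converted, list) else [converted]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)


def count_top_level_entries(path):
    """Return the number of top-level entries, or 0 if the file is not a list."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data) if isinstance(data, list) else 0


# Hypothetical stand-in for the converter's output.
fake_converted = {"id": 0, "paragraphs": []}
out_path = os.path.join(tempfile.mkdtemp(), "train.json")
write_training_json(fake_converted, out_path)
print(count_top_level_entries(out_path))
```

Because debug-data does not validate the schema, checking the top-level type of the written file like this is a cheap sanity check before training.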