bert-language-model named-entity-recognition deeppavlov

Training on deeppavlov for NER keeps failing


I have been trying to train a DeepPavlov model for NER using the training syntax given in their docs, and it keeps failing with the error message below:

/opt/anaconda3/envs/py36/lib/python3.6/site-packages/deeppavlov/dataset_readers/conll2003_reader.py in parse_ner_file(self, file_name)
    104                     items = line.split()
    105                     if len(items) < expected_items:
--> 106                         raise Exception(f"Input is not valid {line}")
    107                     tokens.append(items[0])
    108                     tags.append(items[-1])

Exception: Input is not valid aio-pika==6.4.1

I used the following code to train the DeepPavlov model. It works on their sample dataset, but when I created my own dataset following their training sample guide, I keep getting the error above. NER training code:

from deeppavlov import configs, train_model, build_model
from deeppavlov.core.commands.utils import parse_config
import json


with configs.ner.ner_ontonotes_bert_mult.open(encoding='utf8') as f:
    ner_config = json.load(f)

ner_config['dataset_reader']['data_path'] = '/Users/smankari001/deeppavlov'  # directory with train.txt, valid.txt and test.txt files
ner_config['metadata']['variables']['NER_PATH'] = '/Users/smankari001/deeppavlov'
ner_config['metadata']['download'] = [ner_config['metadata']['download'][-1]]  # do not download the pretrained ontonotes model

ner_model = train_model(ner_config, download=True)

Input train.txt file:

What    O
kind    O
of  O
memory  O
?   O

We  O
respectfully    O
invite  O
you O
to  O
watch   O
a   O
special O
edition O
of  O
Across  B-ORG
China   I-ORG
.   O

WW  B-WORK_OF_ART
II  I-WORK_OF_ART
Landmarks   I-WORK_OF_ART
on  I-WORK_OF_ART
the I-WORK_OF_ART
Great   I-WORK_OF_ART
Earth   I-WORK_OF_ART
of  I-WORK_OF_ART
China   I-WORK_OF_ART
:   I-WORK_OF_ART
Eternal I-WORK_OF_ART
Memories    I-WORK_OF_ART
of  I-WORK_OF_ART
Taihang I-WORK_OF_ART
Mountain    I-WORK_OF_ART

Standing    O
tall    O
on  O
Taihang B-LOC
Mountain    I-LOC
is  O
the B-WORK_OF_ART
Monument    I-WORK_OF_ART
to  I-WORK_OF_ART
the I-WORK_OF_ART
Hundred I-WORK_OF_ART
Regiments   I-WORK_OF_ART
Offensive   I-WORK_OF_ART
.   O

It  O
is  O
composed    O
of  O
a   O
primary O
stele   O
,   O
secondary   O
steles  O
,   O
a   O
huge    O
round   O
sculpture   O
and O
beacon  O
tower   O
,   O
and O
the B-WORK_OF_ART
Great   I-WORK_OF_ART
Wall    I-WORK_OF_ART
,   O
among   O
other   O
things  O
.   O

A   O
primary O
stele   O
,   O
three   B-CARDINAL
secondary   O
steles  O
,   O
and O
two B-CARDINAL
inscribed   O
steles  O
.   O

The B-EVENT
Hundred I-EVENT
Regiments   I-EVENT
Offensive   I-EVENT
was O
the O
campaign    O
of  O
the O
largest O
scale   O
launched    O
by  O
the B-ORG
Eighth  I-ORG
Route   I-ORG
Army    I-ORG
during  O
the B-EVENT
War I-EVENT
of  I-EVENT
Resistance  I-EVENT
against I-EVENT
Japan   I-EVENT
.   O

This    O
campaign    O
broke   O
through O
the O
Japanese    B-NORP
army    O
's  O
blockade    O
to  O
reach   O
base    O
areas   O
behind  O
enemy   O
lines   O
,   O
stirring    O
up  O
anti-Japanese   B-NORP
spirit  O
throughout  O
the O
nation  O
and O
influencing O
the O
situation   O
of  O
the O
anti-fascist    O
war O
of  O
the O
people  O
worldwide   O
.   O


Solution

  • ner_config['dataset_reader']['data_path'] must point to a folder that contains only the dataset files (train/valid/test).

    This error:

    Exception: Input is not valid aio-pika==6.4.1
    

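    means that the DatasetReader started reading lines from a requirements.txt file: "aio-pika==6.4.1" is a pip requirement line, not a token/tag pair, so the folder given as data_path contains more than just the dataset files.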
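
One way to apply this is to copy the three dataset files into a folder of their own and to check them before training. The sketch below is only an illustration: the helper names make_clean_data_dir and find_invalid_lines are hypothetical (they are not part of DeepPavlov), and the check of at least two whitespace-separated items per line is an assumption based on the len(items) < expected_items test visible in the traceback, whose actual value is not shown there.

```python
import shutil
from pathlib import Path

# File names taken from the comment in the question's training code.
DATASET_FILES = ("train.txt", "valid.txt", "test.txt")


def make_clean_data_dir(source_dir, target_dir):
    """Copy only the dataset files from source_dir into target_dir.

    Hypothetical helper: the result is a folder holding nothing but
    train.txt, valid.txt and test.txt.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in DATASET_FILES:
        src = source / name
        if not src.is_file():
            raise FileNotFoundError(f"Missing dataset file: {src}")
        shutil.copy(src, target / name)
    return target


def find_invalid_lines(path, expected_items=2):
    """Return (line_number, text) for non-empty lines with too few items.

    Assumption: a valid line has at least `expected_items`
    whitespace-separated fields (token first, tag last). Empty lines
    separate sentences and are skipped.
    """
    bad = []
    with open(path, encoding="utf8") as f:
        for number, line in enumerate(f, start=1):
            stripped = line.strip()
            if stripped and len(stripped.split()) < expected_items:
                bad.append((number, stripped))
    return bad
```

With these helpers, the folder returned by make_clean_data_dir can be passed as ner_config['dataset_reader']['data_path'] (converted with str()), and find_invalid_lines can be run on each dataset file beforehand to spot lines that would likely trigger the same "Input is not valid" exception.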