Tags: python, nlp, spacy, named-entity-recognition, spacy-3

spaCy, preparing training data: doc.char_span returning 'None'


I'm following the instructions in spaCy's documentation to prepare my own training data (here).

My problem begins at this line:

span = doc.char_span(start, end, label=label)

For entities which I'm labelling as an organization ('ORG'), it seems to work fine, i.e. it returns a span object. However, for entities which I'm labelling as money ('MONEY'), it returns None.

Here are two examples from my training set:

('Payments from the Guardian, Kings Place, 90 York Way, London N1 9GU, for articles:', [(18, 26, 'ORG')]) // Returns a span object for 'Guardian'

('24 July 2020, received £100. Hours: 1 hr. (Registered 02 February 2021)', [(24, 28, 'MONEY')]) // Returns None for '£100'

Note: the Â appears in the console, but it's not in the original JSON text file. I'm leaving it in, in case it's somehow part of the issue.

Does anyone please have any suggestions where I'm going wrong?

[I'm very new to spacy (started learning last week), so please ELI5!]

UPDATE: As it seems the  could be the problem, below is how I'm loading the data. How do I get rid of the Â's? (which aren't visible in the original file)

with open('training_data.json') as train_data:
    train_data_json = json.load(train_data)

Solution

  • You have an encoding problem when opening the file. The MONEY annotations most likely fail because of it: doc.char_span returns None when the character offsets do not line up with token boundaries, and with the stray Â in the text the token no longer starts at £.

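    It is not clear what encoding the file uses, so try the most common candidates first: utf-8, then cp1252 or latin1 (iso-8859-1 is just another name for latin1). A stray Â in front of £ is the typical result of UTF-8 bytes being decoded as latin1 or cp1252, so utf-8 is the most likely fix.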

    import json

    with open('training_data.json', encoding='utf-8') as train_data:
        train_data_json = json.load(train_data)
    

    If the Â's are still there, replace the encoding with one of the other candidates.
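To see where the stray Â comes from and how it shifts the character offsets, here is a minimal sketch using only the standard library. It assumes the JSON file is saved as UTF-8 and was read with a platform default such as cp1252; the question does not state the platform, so that part is an assumption. The temporary file and the variable names are illustrative and not from the original post.

```python
import json
import os
import tempfile

POUND = "\u00a3"  # the pound sign

# Sample record from the question, with the text as it should look.
record = [
    "24 July 2020, received " + POUND + "100. Hours: 1 hr. "
    "(Registered 02 February 2021)",
    [[24, 28, "MONEY"]],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_data.json")

    # Save as UTF-8 with the raw pound sign in the file (no \u escapes).
    with open(path, "w", encoding="utf-8") as f:
        json.dump([record], f, ensure_ascii=False)

    # Assumption: this mimics open() falling back to a cp1252 default.
    with open(path, encoding="cp1252") as f:
        garbled_text = json.load(f)[0][0]

    # Reading with the encoding the file was actually saved in.
    with open(path, encoding="utf-8") as f:
        clean_text = json.load(f)[0][0]

start, end, label = record[1][0]

print(repr(garbled_text[22:29]))    # text around the amount, mis-decoded
print(repr(clean_text[22:28]))      # same region, correctly decoded
print(repr(garbled_text[start:end]))  # what (24, 28) selects when mis-decoded
print(repr(clean_text[start:end]))    # what (24, 28) selects when decoded correctly
```

One thing to check after fixing the encoding: in the sample from the question, the offsets (24, 28) only cover the amount in the mis-decoded text. Once the Â is gone the text is one character shorter, and £100 sits at (23, 27). If the annotations were produced against the garbled text, they may need to be recomputed; slicing with text[start:end] is a quick way to verify each one before calling doc.char_span.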