python, machine-learning, nlp, spacy, named-entity-recognition

How to train a spaCy model with line number as a feature?


I'm new to NLP and spaCy, and I'm working on a project that extracts person and company names from business cards.

To extract the text I use an OCR function I wrote, which gives me output like this:

Sunny J. Mistry
Product Design Engineer

Apple
5 Infinite Loop, MS 305-1PH
Cupertino, CA 95014

T 408 974-5339
M 925 548-4585
sjmistry@apple.com
www.apple.com

At first I tried to process the text line by line with the default English NER model, and soon realized that it's not enough.

Eventually I decided to train my own custom NER model that also uses information about the position of the text.
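To make "position of the text" concrete, here is a minimal sketch in plain Python (no spaCy involved) that records, for each non-empty OCR line, its line number and character span. The names `LineSpan` and `index_lines` are hypothetical helpers made up for illustration, and counting only non-empty lines is an assumption, not something the post specifies.

```python
from typing import List, NamedTuple


class LineSpan(NamedTuple):
    line_no: int  # 0-based index among non-empty lines (an assumed convention)
    start: int    # character offset of the line's first character
    end: int      # character offset one past the line's last character
    text: str


def index_lines(ocr_text: str) -> List[LineSpan]:
    """Record the line number and character span of every non-empty line."""
    spans = []
    offset = 0
    line_no = 0
    for raw in ocr_text.split("\n"):
        if raw.strip():
            spans.append(LineSpan(line_no, offset, offset + len(raw), raw))
            line_no += 1
        offset += len(raw) + 1  # +1 for the newline consumed by split
    return spans


card = (
    "Sunny J. Mistry\n"
    "Product Design Engineer\n"
    "\n"
    "Apple\n"
    "5 Infinite Loop, MS 305-1PH"
)
lines = index_lines(card)
for span in lines:
    print(span.line_no, (span.start, span.end), span.text)
```

The character spans line up with the offsets a token or entity would have in the full text, so a line number could later be looked up for any entity by checking which span contains it.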

I haven't found anything in the official documentation on how to add custom features such as line numbers to the training data, but I did find this answer and example by Matthew Honnibal, which suggest using a multi-task objective to train a model with a custom feature.

I'm still not sure:

  1. What should the training data look like?

  2. How do I use spaCy's API to add a custom feature to the training process?

  3. Is a multi-task objective the right tool for training this kind of model?


Solution

  • Answering my own question:

    I didn't find an official way to implement this kind of task. In the end I decided to train a model on an ordinary business card data set of 200 images. I extracted the text from each image using Google OCR and annotated it with the tool described in this post.

    It worked like a charm.
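The answer doesn't show what the annotated data looked like. As a rough sketch, assuming the character-offset format commonly used for spaCy NER training examples, `(text, {"entities": [(start, end, label)]})`, one annotated card might be built like this. The helper `annotate` and the labels `PERSON` and `ORG` are illustrative assumptions, not taken from the post or from the annotation tool it mentions.

```python
def annotate(text, labelled):
    """Build one (text, {"entities": [...]}) training example.

    `labelled` maps an exact substring of `text` to its entity label.
    Only the first occurrence of each substring is annotated.
    """
    entities = []
    for phrase, label in labelled.items():
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    entities.sort()  # order entities by their start offset
    return text, {"entities": entities}


card = (
    "Sunny J. Mistry\n"
    "Product Design Engineer\n"
    "\n"
    "Apple\n"
    "5 Infinite Loop, MS 305-1PH\n"
    "Cupertino, CA 95014"
)
example = annotate(card, {"Sunny J. Mistry": "PERSON", "Apple": "ORG"})
print(example[1])
```

Keeping the whole card as one text, newlines included, lets the model see each name in the context of the surrounding lines, which may be part of why training on plain annotated cards was enough without an explicit line-number feature.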