machine-learning nlp artificial-intelligence spacy

Training Data Format with spaCy

I am trying to build an NLP application with spaCy, but I am having trouble formatting the training data. I want my app to recognize both entities and intents. For example, in "I want to place an order for pizza", the intent would be "place_order" and the entity would be "pizza". How do I format the training data for BOTH entities and intents in spaCy?

Solution

It depends on how you cast the problem as NLP challenges. You could try to recognize entities like "pizza" with a Named Entity Recognizer, but beware that this model is designed mainly for truly named entities, i.e. names that refer to a unique entity in the real world, like London or Google.

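Nevertheless, we've seen use cases where the NER model works reasonably well for non-named entities. You can follow the training guide in the spaCy documentation and format your data as (text, annotations) pairs, where each entity is a (start_char, end_char, label) tuple with an exclusive end offset: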

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
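For the sentence in the question, the same format could be produced with a small helper that looks up the character offsets instead of counting them by hand. This is only a sketch: the helper name `make_ner_example` and the label "FOOD" are assumptions made for illustration, not part of spaCy.

```python
def make_ner_example(text, spans):
    """Build one (text, annotations) pair for NER training.

    `spans` is a list of (substring, label) pairs. Only the first
    occurrence of each substring is annotated.
    """
    entities = []
    for substring, label in spans:
        start = text.index(substring)  # raises ValueError if the substring is absent
        entities.append((start, start + len(substring), label))
    return (text, {"entities": entities})


# "FOOD" is a hypothetical label chosen for this example.
TRAIN_DATA = [
    make_ner_example("I want to place an order for pizza", [("pizza", "FOOD")]),
]
```

Slicing the text with the stored offsets, e.g. `text[29:34]`, should give back "pizza", which is a quick way to check annotations before training.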

Another potential approach for this "pizza" entity is rule-based matching or dictionary lookup, depending on how large a variety of terms you expect. The spaCy documentation describes its rule-based matching strategies in more detail. Note that this approach wouldn't require training data, but you'd need to carefully craft the rules.
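A minimal sketch of the dictionary-lookup idea, assuming spaCy v3 and its `PhraseMatcher`. The term list and the "FOOD" label are made up for illustration; a blank English pipeline is used so no trained model is needed.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank pipeline: tokenizer only, nothing to download or train.
nlp = spacy.blank("en")

# attr="LOWER" makes the lookup case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical dictionary of menu items.
food_terms = ["pizza", "pasta", "garlic bread"]
matcher.add("FOOD", [nlp.make_doc(term) for term in food_terms])

doc = nlp("I want to place an order for Pizza")

# Convert token-level matches into (start_char, end_char, label) tuples,
# the same shape used for NER annotations.
found = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    found.append((span.start_char, span.end_char, nlp.vocab.strings[match_id]))
```

Because the output has the same shape as the NER annotations, a lookup like this could also be used to pre-annotate training data for later manual review.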

For "intent", again you have a few options. You could cast it as an NER challenge to find the verb phrase "place an order", with the same caveat that this is not a real named entity. Perhaps a better approach would be to treat it as a text classification challenge and predict the "intent" label for the whole sentence. The spaCy documentation covers text classification; the data format needs a "cats" dictionary in which each potential label gets a 1.0 or a 0.0:

TRAIN_DATA = [
    ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
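Applied to intents, the labels would be intent names rather than sentiments. A sketch of how such examples could be generated, assuming mutually exclusive intents; the intent names and the helper `make_textcat_example` are hypothetical:

```python
# Hypothetical set of mutually exclusive intents.
INTENTS = ["place_order", "cancel_order", "track_order"]


def make_textcat_example(text, intent):
    """Build one (text, annotations) pair with a 1.0 for the gold intent."""
    if intent not in INTENTS:
        raise ValueError(f"Unknown intent: {intent!r}")
    cats = {label: 1.0 if label == intent else 0.0 for label in INTENTS}
    return (text, {"cats": cats})


TRAIN_DATA = [
    make_textcat_example("I want to place an order for pizza", "place_order"),
    make_textcat_example("Where is my delivery?", "track_order"),
]
```

Every label appears in every example, so the classifier sees explicit negatives as well as the one positive intent.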

Finally, a more complex approach is to use the dependency parser for your intent classification, as spaCy's intent-parser code example does. While this seems more difficult to get started with, and to annotate data for, it could also be the most powerful option.
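For the parser-based route, the annotations are per-token "heads" (the index of each token's head, with the root pointing to itself) and "deps" (the relation labels). The sketch below shows what one example might look like for the sentence in the question. The label scheme ("ORDER", "ITEM", "-") and the head choices are invented for illustration, and the checker assumes whitespace tokenization, which happens to hold for this sentence.

```python
# Hypothetical semantic relations; "-" marks tokens that carry no relation.
TRAIN_DATA = [
    (
        "I want to place an order for pizza",
        {
            # tokens:  I  want to place an order for pizza
            "heads": [3, 3, 3, 3, 5, 3, 7, 5],
            "deps": ["-", "-", "-", "ROOT", "-", "ORDER", "-", "ITEM"],
        },
    ),
]


def check_parse_example(text, annotations):
    """Sanity-check one example: aligned lengths, valid heads, a single root."""
    tokens = text.split()  # assumption: whitespace tokens match spaCy's tokens
    heads, deps = annotations["heads"], annotations["deps"]
    if not (len(tokens) == len(heads) == len(deps)):
        raise ValueError("tokens, heads and deps must have the same length")
    if any(h < 0 or h >= len(tokens) for h in heads):
        raise ValueError("head index out of range")
    roots = [i for i, h in enumerate(heads) if h == i]
    if len(roots) != 1 or deps[roots[0]] != "ROOT":
        raise ValueError("expected exactly one self-headed ROOT token")
    return True


all_valid = all(check_parse_example(text, ann) for text, ann in TRAIN_DATA)
```

Here "pizza" attaches to "order" as the ITEM and "order" attaches to the root verb "place", so the intent and its argument can be read off the predicted tree.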