Tags: python, speech-recognition, chatbot, spacy, rasa-nlu

How to recognize entities not in training examples


I am working on a customer relations chatbot. The user can input a greeting, an initial_query, or a query related to a product. The initial_query is when the user gives their user_id to the chatbot, which is then used to filter results from the database.

I created a few training examples to help the chatbot distinguish initial_query from the other intents. The problem is that the chatbot cannot recognize a user_id as an entity unless that exact value appears in the training data. For example:

## intent:initial_query
- My name is [Karthik](name) and my user ID is [0234](UserID)

This is one such example for initial_query. Here the user ID specified is 0234, but the database contains many more users, each with a unique user ID, and it is not possible for me to add every ID to the training examples.

What should I do to make the bot understand when a user id is specified? I saw somewhere that lookup tables can be used. But when I tried using lookup tables, it still did not recognize ids not part of the training examples.

This is the link I used to try lookup tables in my code.

intent_entity_featurizer_regex does not seem to work for me. I am stuck here, as this is a crucial part of the bot. If lookup tables are not the best solution to this problem, I am also open to other ideas.

Thank you


Solution

  • I'm going to get a bad rap for always saying you need more training data, but I would imagine that is playing a part here as well.

    I believe you have a few possible courses of action:

    • Provide more training data. I've never seen a good intent with fewer than 10 training examples. The number needed grows with every permutation of an intent's phrasing and with the number of similar intents.
    • Use a pre-built entity recognizer like Duckling or spaCy. They won't necessarily know that 1234 is a userId, but they can extract numbers automatically.
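One way to act on the first option without typing every example by hand is to generate annotated lines from a handful of templates. The sketch below is only an illustration: the templates, the names, and the assumption that user IDs are four zero-padded digits are all hypothetical and should be replaced with phrasings and ID formats that match the real users. The goal is variety of wording and of ID values, not coverage of every ID in the database.

```python
import random

# Hypothetical templates and names; replace with phrasings real users produce.
TEMPLATES = [
    "My name is [{name}](name) and my user ID is [{uid}](UserID)",
    "Hi, I am [{name}](name), user id [{uid}](UserID)",
    "My user ID is [{uid}](UserID)",
    "[{uid}](UserID) is my id, this is [{name}](name)",
    "You can look me up under ID [{uid}](UserID)",
]
NAMES = ["Karthik", "Maria", "John", "Aisha", "Wei"]


def make_examples(n, seed=0):
    """Return n annotated training lines in the markdown style used above."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        # Assumption: IDs are four digits, zero-padded (like 0234).
        uid = f"{rng.randrange(10000):04d}"
        # Templates without a {name} slot simply ignore the name argument.
        lines.append("- " + template.format(name=rng.choice(NAMES), uid=uid))
    return lines


def to_markdown(examples, intent="initial_query"):
    """Prefix the examples with the intent header."""
    return "\n".join([f"## intent:{intent}", *examples])


print(to_markdown(make_examples(20)))
```

The printed block can be pasted into the training file next to the hand-written examples. Generated data is no substitute for real utterances, so it is worth mixing in actual user messages as they become available.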

    If you are using ner_crf with Rasa, it is important to realize that it learns the pattern of utterances and recognizes entities by what surrounds them rather than by their actual value.
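To make the point about context concrete, here is a minimal sketch of the kind of per-token features a CRF tagger works with. The feature set is a simplified, hypothetical one chosen for illustration and is not the exact set Rasa uses. It shows that an ID seen in training and an unseen ID in the same sentence position produce identical features, which is why varied context matters more than listing every value.

```python
def token_features(tokens, i):
    """Illustrative features for tokens[i]: shape plus neighbouring words."""
    word = tokens[i]
    return {
        "is_digit": word.isdigit(),
        "length": len(word),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "prev2_word": tokens[i - 2].lower() if i > 1 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


seen = "my user ID is 0234".split()      # ID that appeared in training
unseen = "my user ID is 9981".split()    # ID the model never saw
other = "please check 9981 for me".split()  # same ID, different context

# Same context and shape: the two IDs are indistinguishable to the model.
print(token_features(seen, 4) == token_features(unseen, 4))
# Different context: the features differ even though the value is the same.
print(token_features(unseen, 4) == token_features(other, 2))
```

If the training data only ever shows the ID after "my user ID is", a message phrased like the third one gives the model nothing familiar to hold on to.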

    Also, you could use regex with Rasa, but the regex featurizer isn't just a lookup tool. It adds a flag to the CRF indicating whether or not the token matches the pattern. Given this, it still needs sufficient training data to learn that a matching token is important for that entity.
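The flag idea can be sketched in a few lines. This is not Rasa's implementation, just an illustration under the assumption that a user ID is exactly four digits; the function name and the `pattern_user_id` key are made up for the example.

```python
import re

# Assumption: a user ID is exactly four digits.
USER_ID_PATTERN = re.compile(r"\d{4}")


def add_pattern_flag(tokens):
    """Attach a boolean 'matches the pattern' feature to every token."""
    return [
        {"text": t, "pattern_user_id": bool(USER_ID_PATTERN.fullmatch(t))}
        for t in tokens
    ]


for sentence in ("my user ID is 0234", "I bought it in 2019"):
    flagged = [f["text"] for f in add_pattern_flag(sentence.split())
               if f["pattern_user_id"]]
    print(sentence, "->", flagged)
```

Both 0234 and 2019 get the flag, even though only the first is a user ID. The flag is just one more feature, so the model still needs enough annotated examples to learn when a flagged token really is the entity.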