machine-learning rasa-nlu rasa-core

Rasa NLU - Understanding Training Data


I am having a hard time understanding training data in Rasa NLU. Say I want training data for a case where a user tells the bot which animal they would like to buy. For clarity, I'll use the markdown format:

Say the user is hypothetically responding to a question:

"What kind of animal would you like to buy?"

There are only so many different ways of saying you want to buy something. So take the below example:

##intent:inform
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

Would I need to repeat this for every type of animal I intended to handle? Like below?

##intent:inform
- [cat](animal)
- [dog](animal)
- [parrot](animal)
- buy [cat](animal)
- buy [dog](animal)
- buy [parrot](animal)
- I would like to buy a [cat](animal)
- I would like to buy a [dog](animal)
- I would like to buy a [parrot](animal)

Also, I noticed that in Rasa's restaurant bot, the same example is sometimes repeated over and over, up to seven times, like below:

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

Why is that necessary? What effect does this have on the understanding? How would more occurrences of the same single word in the same position indicate that it is an appropriate response, especially if you had something like the below, where a different value of the same entity is repeated the same number of times?

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- buy [dog](animal)
- I would like to buy a [dog](animal)

Thank you, any advice is appreciated.


Solution

    "There are only so many different ways of saying you want to buy something."

    You may be surprised:

    • Can I buy a dog?
    • I want to buy a dog.
    • I really want a dog.
    • I'd love it if I owned a dog.
    • I'm looking for a pet, maybe a dog.
    • purchase dog
    • adopt dog
    • get a dog
    • take a dog home with me

    and I am sure the list continues for many more examples. That being said, Rasa NLU should be able to learn and generalize from a handful of examples, with some exceptions: "adopt", for example, may not have a strong relationship to "buy", so it could be important to include it as an example.

    "Would I need to repeat this for every type of animal I intended to handle? Like below?"

    No, that is not necessary. Each animal value is an entity, and Rasa by default uses a CRF (conditional random field) for entity recognition, which is what you are asking about here. The CRF relies more on the structure of the sentence than on the value of the word itself. You can see the features the CRF looks at in the docs and code:

      # Available features are:
      # ``low``, ``title``, ``suffix5``, ``suffix3``, ``suffix2``,
      # ``suffix1``, ``pos``, ``pos2``, ``prefix5``, ``prefix2``,
      # ``bias``, ``upper`` and ``digit``
      features: [["low", "title"], ["bias", "suffix3"], ["upper", "pos", "pos2"]]
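
To make the "structure over value" point concrete, here is a minimal sketch of how a [before, word, after] feature window could be computed for one token. It is a hand-rolled illustration, not Rasa's actual implementation: the function name `token_features`, the key format, and the BOS/EOS markers are assumptions, and the `pos`/`pos2` features are omitted because they would need a part-of-speech tagger.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] from a [before, word, after] window.

    Hypothetical helper for illustration only; it mirrors the layout of the
    config above but is not Rasa's code.
    """
    # One feature list per window position: word before, current word, word after.
    layout = [["low", "title"], ["bias", "suffix3"], ["upper"]]
    extractors = {
        "low": lambda w: w.lower(),
        "title": lambda w: w.istitle(),
        "upper": lambda w: w.isupper(),
        "suffix3": lambda w: w[-3:],
        "bias": lambda w: "bias",
    }
    feats = {}
    for offset, names in zip((-1, 0, 1), layout):
        j = i + offset
        if j < 0:
            feats["BOS"] = True  # no word before: beginning of sentence
        elif j >= len(tokens):
            feats["EOS"] = True  # no word after: end of sentence
        else:
            for name in names:
                feats[f"{offset}:{name}"] = extractors[name](tokens[j])
    return feats


cat = token_features("I would like to buy a cat".split(), 6)
parrot = token_features("I would like to buy a parrot".split(), 6)

# Features that are identical for both entity tokens.
shared = {k for k in cat if parrot.get(k) == cat[k]}
print(sorted(shared))
```

In this sketch the only feature that differs between "cat" and "parrot" is the suffix of the word itself; everything derived from the surrounding words is shared, which is why a new value in a familiar sentence structure has a reasonable chance of being picked up.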
    

    That being said, using different values for the entity can be a good way to get extra training data. You can use a tool like Chatito to generate the training data from patterns. But be careful about repeating patterns: you can overfit the model to the point where it cannot generalize beyond the patterns you trained on.
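
As a rough sketch of pattern-based generation, the snippet below expands a few sentence templates over a list of entity values and prints them in the markdown training format used in the question. It is a plain-Python stand-in and does not use Chatito or its DSL; `expand`, `templates`, and `animals` are hypothetical names chosen for this example.

```python
from itertools import product


def expand(templates, entity, values):
    """Fill every template with every value, annotated as [value](entity)."""
    return [
        template.format(**{entity: f"[{value}]({entity})"})
        for template, value in product(templates, values)
    ]


templates = ["{animal}", "buy {animal}", "I would like to buy a {animal}"]
animals = ["cat", "dog", "parrot"]

examples = expand(templates, "animal", animals)

print("##intent:inform")
for example in examples:
    print(f"- {example}")
```

A full cross product like this is exactly the kind of repeated pattern that can lead to overfitting, so in practice it may be better to keep only a sample of the generated lines and to vary the templates themselves.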

    "they sometimes repeat the same example over and over again"

    Did you see this in a Rasa data set? I looked at the default restaurant bot training data and I don't see any repeats.

    Repeating a single sentence over and over reinforces to the model that its format and words are important; this is a form of oversampling. It can be a good thing if you have very little training data or highly unbalanced training data. It can be a bad thing if you want to handle many different ways of asking to buy a pet, because it can overfit the model, as I mentioned above.
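
To illustrate what oversampling does to an unbalanced data set, here is a minimal sketch that duplicates examples of under-represented intents until every intent has as many examples as the largest one. This is not a Rasa feature; `oversample`, the `greet` intent, and the sample data are made up for the illustration.

```python
import random
from collections import Counter


def oversample(examples, seed=0):
    """Duplicate minority-intent examples until all intents are the same size.

    `examples` is a list of (intent, text) pairs.
    """
    rng = random.Random(seed)
    by_intent = {}
    for intent, text in examples:
        by_intent.setdefault(intent, []).append(text)

    target = max(len(texts) for texts in by_intent.values())

    balanced = []
    for intent, texts in by_intent.items():
        # Draw random repeats from the intent's own examples to fill the gap.
        extra = [rng.choice(texts) for _ in range(target - len(texts))]
        balanced.extend((intent, text) for text in texts + extra)
    return balanced


data = [
    ("inform", "[cat](animal)"),
    ("inform", "buy [cat](animal)"),
    ("inform", "I would like to buy a [cat](animal)"),
    ("inform", "[dog](animal)"),
    ("greet", "hello"),
]

balanced = oversample(data)
counts = Counter(intent for intent, _ in balanced)
print(counts)
```

Here the single `greet` example ends up repeated four times, which is the same effect as writing the line several times by hand in the training file. It balances the intents but adds no new phrasing, so the overfitting caveat above still applies.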