Entity Extraction fails for Sinhala Language


I am trying to develop a chatbot for the Sinhala language using Rasa NLU.

My config.yml

pipeline:
- name: "WhitespaceTokenizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"

In data.json I have added sample data as shown below. When I train the NLU model and try a sample input, expecting "සිංහලෙන්" to be extracted as medium, the output contains only the intent and the entity value, not the entity name. What am I doing wrong?

{
  "text": "සිංහලෙන් දේශන පවත්වන්නේ නැද්ද?",
  "intent": "ask_medium",
  "entities": [{
    "start": 0,
    "end": 8,
    "value": "සිංහලෙන්",
    "entity": "medium"
  }]
},
{
  "text": "සිංහලෙන් lectures කරන්නේ නැද්ද?",
  "intent": "ask_medium",
  "entities": [{
    "start": 0,
    "end": 8,
    "value": "සිංහලෙන්",
    "entity": "medium"
  }]
}
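
When debugging entity annotations like these, one thing worth ruling out is a mismatch between the start/end offsets and the annotated value. The following is a minimal sketch, not part of Rasa: `find_misaligned_entities` is a hypothetical helper, and it assumes the offsets count Python string positions (Unicode code points), which is how a string slice behaves.

```python
def find_misaligned_entities(examples):
    """Return (text, entity) pairs whose start/end slice differs from the value."""
    problems = []
    for example in examples:
        text = example["text"]
        for entity in example.get("entities", []):
            # Slice by code point, the same way Python indexes str objects.
            span = text[entity["start"]:entity["end"]]
            if span != entity["value"]:
                problems.append((text, entity))
    return problems


examples = [
    {
        "text": "සිංහලෙන් දේශන පවත්වන්නේ නැද්ද?",
        "intent": "ask_medium",
        "entities": [
            {"start": 0, "end": 8, "value": "සිංහලෙන්", "entity": "medium"}
        ],
    },
]

# An empty list means every annotated span matches its value.
print(find_misaligned_entities(examples))
```

If the list is empty, the offsets are consistent and the cause lies elsewhere, for example in the amount of training data.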

The response I get when testing the nlu model is

{'intent': {'name': 'ask_langmedium', 'confidence': 0.9747527837753296},
 'entities': [{'start': 10,
               'end': 18,
               'value': 'සිංහලෙන්',
               'entity': '-',
               'confidence': 0.5970129041418675,
               'extractor': 'CRFEntityExtractor'}],
 'intent_ranking': [
     {'name': 'ask_langmedium', 'confidence': 0.9747527837753296},
     {'name': 'ask_langmedium_request_possibility', 'confidence': 0.07433460652828217}],
 'text': 'උගන්නන්නේ සිංහලෙන් ද ?'}
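
A response like the one above can be checked programmatically for entities whose label is not one of the labels used in training. This is only a sketch: `unexpected_entities` is a hypothetical helper, and it assumes the parse result is a dict with the shape shown in the response.

```python
def unexpected_entities(parse_result, known_labels):
    """Return extracted entities whose label is not among the trained labels."""
    return [
        entity
        for entity in parse_result.get("entities", [])
        if entity.get("entity") not in known_labels
    ]


# Trimmed copy of the response above, keeping only the fields used here.
result = {
    "entities": [
        {"start": 10, "end": 18, "entity": "-", "extractor": "CRFEntityExtractor"}
    ]
}

for entity in unexpected_entities(result, known_labels={"medium"}):
    print("Unexpected label:", repr(entity["entity"]), "at", entity["start"], entity["end"])
```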

Solution

  • If this is your complete dataset, I am not sure how you were able to train the model, because Rasa requires at least two intents. I added a second intent, hello, and otherwise replicated your data in my own setup. It worked well, and this is the output I got:

    Enter a message: උගන්නන්නේ සිංහලෙන් ද?
    {
      "intent": {
        "name": "ask_medium",
        "confidence": 0.9638749361038208
      },
      "entities": [
        {
          "start": 10,
          "end": 18,
          "value": "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd\u0dd9\u0db1\u0dca",
          "entity": "medium",
          "confidence": 0.7177257810884379,
          "extractor": "CRFEntityExtractor"
        }
      ]
    }
    

    This is my full code:

    DataSet.json

    {
        "rasa_nlu_data": {
            "common_examples": [
                {
                    "text": "හෙලෝ",
                    "intent": "hello",
                    "entities": []
                },
                {
                    "text": "සිංහලෙන් දේශන පවත්වන්නේ නැද්ද?",
                    "intent": "ask_medium",
                    "entities": [{
                          "start":0,
                          "end":8,
                          "value": "සිංහලෙන්",
                          "entity": "medium"
                    }]
                },
                {
                    "text": "සිංහලෙන් lectures කරන්නේ නැද්ද?",
                    "intent": "ask_medium",
                    "entities": [{
                          "start":0,
                          "end":8,
                          "value": "සිංහලෙන්",
                          "entity": "medium"
                    }]
                }
            ],
            "regex_features" : [],
            "lookup_tables"  : [],
            "entity_synonyms": []
        }
    }
    

    nlu_config.yml

    pipeline: "supervised_embeddings"
    

    Training Command

    python -m rasa_nlu.train -c ./config/nlu_config.yml --data ./data/sh_data.json -o models --fixed_model_name nlu --project current --verbose
    

    testing.py

    import json

    from rasa_nlu.model import Interpreter

    # Load the trained model written by the training command above.
    interpreter = Interpreter.load('./models/current/nlu')


    def predict_intent(text):
        """Parse the text and print the predicted intent and entities."""
        results = interpreter.parse(text)
        print(json.dumps({
            "intent": results["intent"],
            "entities": results["entities"]
        }, indent=2))


    # Keep prompting until the user types 'exit'.
    while True:
        text = input('Enter a message: ')
        if text == 'exit':
            break
        predict_intent(text)
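
The point made in this answer about needing at least two intents can be checked before training by counting the intents in the dataset. The sketch below is not part of Rasa: `count_intents` is a hypothetical helper, and it assumes the `rasa_nlu_data` / `common_examples` layout used in DataSet.json above. The example texts are ASCII placeholders.

```python
from collections import Counter


def count_intents(nlu_data):
    """Count training examples per intent in a rasa_nlu_data-style dict."""
    examples = nlu_data["rasa_nlu_data"]["common_examples"]
    return Counter(example["intent"] for example in examples)


# Same structure as DataSet.json, with placeholder texts.
dataset = {
    "rasa_nlu_data": {
        "common_examples": [
            {"text": "placeholder 1", "intent": "hello", "entities": []},
            {"text": "placeholder 2", "intent": "ask_medium", "entities": []},
            {"text": "placeholder 3", "intent": "ask_medium", "entities": []},
        ]
    }
}

counts = count_intents(dataset)
if len(counts) < 2:
    print("Only one intent found; add at least one more before training.")
else:
    print(dict(counts))
```

To check a real file, the dict can be loaded with `json.load` from the path passed to `--data`.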