Entity Extraction fails for Sinhala Language

I am trying to build a chatbot for the Sinhala language using Rasa NLU.

My config.yml

pipeline:
- name: "WhitespaceTokenizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"

In data.json I have added sample data as shown below. When I train the NLU model and try a sample input, expecting "සිංහලෙන්" to be extracted as medium, the output contains only the intent and the entity value, not the entity name (it comes back as '-'). What am I doing wrong?

{
  "text": "සිංහලෙන් දේශන පවත්වන්නේ නැද්ද?",
  "intent": "ask_medium",
  "entities": [{
    "start": 0,
    "end": 8,
    "value": "සිංහලෙන්",
    "entity": "medium"
  }]
},
{
  "text": "සිංහලෙන් lectures කරන්නේ නැද්ද?",
  "intent": "ask_medium",
  "entities": [{
    "start": 0,
    "end": 8,
    "value": "සිංහලෙන්",
    "entity": "medium"
  }]
}
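
To rule out wrong annotation offsets, here is a minimal sketch that checks each entity's start/end against its value. The helper name find_offset_mismatches is made up for illustration, and the example text is a shortened stand-in written with \u escapes so the code point count is unambiguous. It assumes the offsets are counted in Unicode code points, which is how Python 3 indexes strings.

```python
def find_offset_mismatches(example):
    """Return entities whose text[start:end] slice differs from their value."""
    text = example["text"]
    return [
        ent for ent in example.get("entities", [])
        if text[ent["start"]:ent["end"]] != ent["value"]
    ]


# The entity word from the samples, written as escapes: 8 code points.
medium = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd\u0dd9\u0db1\u0dca"

# Shortened, hypothetical example in the same shape as the training data.
example = {
    "text": medium + " lectures",
    "intent": "ask_medium",
    "entities": [{"start": 0, "end": 8, "value": medium, "entity": "medium"}],
}

mismatches = find_offset_mismatches(example)
print(len(medium), mismatches)
```

An empty list means every annotated span lines up with its value, so the offsets themselves do not look like the problem.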

The response I get when testing the nlu model is

{'intent': {'name': 'ask_langmedium', 'confidence': 0.9747527837753296},
 'entities': [{'start': 10,
               'end': 18,
               'value': 'සිංහලෙන්',
               'entity': '-',
               'confidence': 0.5970129041418675,
               'extractor': 'CRFEntityExtractor'}],
 'intent_ranking': [{'name': 'ask_langmedium', 'confidence': 0.9747527837753296},
                    {'name': 'ask_langmedium_request_possibility', 'confidence': 0.07433460652828217}],
 'text': 'උගන්නන්නේ සිංහලෙන් ද ?'}

Solution

If this is your complete dataset, I am not sure how you were able to train a model, because Rasa requires at least two intents. I added a second intent, hello, and otherwise replicated your data in my own code. It worked, and this is the output I got:

Enter a message: උගන්නන්නේ සිංහලෙන් ද?
{
  "intent": {
    "name": "ask_medium",
    "confidence": 0.9638749361038208
  },
  "entities": [
    {
      "start": 10,
      "end": 18,
      "value": "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd\u0dd9\u0db1\u0dca",
      "entity": "medium",
      "confidence": 0.7177257810884379,
      "extractor": "CRFEntityExtractor"
    }
  ]
}
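
The \u0dc3... sequences in "value" are the same string as සිංහලෙන්; json.dumps escapes non-ASCII characters by default. A small sketch using only the standard library, showing that ensure_ascii=False prints the text as-is. The result dict here is a trimmed, hand-written stand-in for the parse output, not real model output.

```python
import json

# Hand-written stand-in for the entity part of the parse result above.
result = {
    "value": "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd\u0dd9\u0db1\u0dca",
    "entity": "medium",
}

escaped = json.dumps(result)                       # default: non-ASCII becomes \uXXXX
readable = json.dumps(result, ensure_ascii=False)  # keeps the Sinhala characters

print(escaped)
print(readable)

# Both forms decode back to the same data.
assert json.loads(escaped) == json.loads(readable) == result
```

If you prefer readable output in the test script, passing ensure_ascii=False to json.dumps there should be enough, assuming your terminal can display Sinhala.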

This is my full code:

DataSet.json

{
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "හෙලෝ",
                "intent": "hello",
                "entities": []
            },
            {
                "text": "සිංහලෙන් දේශන පවත්වන්නේ නැද්ද?",
                "intent": "ask_medium",
                "entities": [{
                      "start":0,
                      "end":8,
                      "value": "සිංහලෙන්",
                      "entity": "medium"
                }]
            },
            {
                "text": "සිංහලෙන් lectures කරන්නේ නැද්ද?",
                "intent": "ask_medium",
                "entities": [{
                      "start":0,
                      "end":8,
                      "value": "සිංහලෙන්",
                      "entity": "medium"
                }]
            }
        ],
        "regex_features" : [],
        "lookup_tables"  : [],
        "entity_synonyms": []
    }
}
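
To check the two-intent point against your own file, here is a rough sketch that counts examples per intent. It assumes the rasa_nlu_data / common_examples layout shown above; count_intents is a hypothetical helper name, and the inline data is a cut-down stand-in with placeholder texts rather than the real file.

```python
import json
from collections import Counter


def count_intents(training_data):
    """Count examples per intent in a rasa_nlu_data-style dict."""
    examples = training_data["rasa_nlu_data"]["common_examples"]
    return Counter(ex["intent"] for ex in examples)


# Cut-down stand-in for DataSet.json; in practice use json.load() on the file.
sample = json.loads("""
{
  "rasa_nlu_data": {
    "common_examples": [
      {"text": "hello", "intent": "hello", "entities": []},
      {"text": "example one", "intent": "ask_medium", "entities": []},
      {"text": "example two", "intent": "ask_medium", "entities": []}
    ]
  }
}
""")

counts = count_intents(sample)
print(dict(counts))
if len(counts) < 2:
    print("Only one intent found; add at least one more before training.")
```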

nlu_config.yml

pipeline: "supervised_embeddings"

Training Command

python -m rasa_nlu.train -c ./config/nlu_config.yml --data ./data/sh_data.json -o models --fixed_model_name nlu --project current --verbose

testing.py

from rasa_nlu.model import Interpreter
import json

interpreter = Interpreter.load('./models/current/nlu')


def predict_intent(text):
    results = interpreter.parse(text)
    print(json.dumps({
        "intent": results["intent"],
        "entities": results["entities"]
    }, indent=2))


# Keep prompting until the user types 'exit'.
while True:
    text = input('Enter a message: ')
    if text == 'exit':
        break
    predict_intent(text)