I have about 1,000 examples and 25 intents in my NLU file; 710 of those examples contain an entity (most contain only one). Training takes about 30-40 minutes without a GPU (and about 6 minutes when I test on Google Colab with a Tesla T4), which feels quite long. Is the problem that I have too much data, or the way I chose the pipeline?
Here is my pipeline:
```yaml
language: vi
pipeline:
  - name: "WhitespaceTokenizer"
  - name: "RegexFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "LexicalSyntacticFeaturizer"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: "EntitySynonymMapper"

policies:
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    state_featurizer:
      - name: FullDialogueTrackerFeaturizer
    attn_shift_range: 2
    embed_dim: 20
    constrain_similarities: True
  - name: FallbackPolicy
    core_threshold: 0.5
    nlu_threshold: 0.4
  - name: FormPolicy
  - name: MappingPolicy
```
Rasa version: 2.3.2.

Does anyone know where the problem is? Please help!
There are a few ideas running through my mind. The second CountVectorsFeaturizer, with its char_wb analyzer and an n-gram range of (1, 4), is likely generating many more character-level features for Vietnamese than it would in any English demo. You may find that narrowing the n-gram settings to be between (2, 3) is sufficient.

There are many things that might be worth investigating here, and to do that in detail it might be better to post this question on the Rasa forum. That way there can be a back and forth between you and folks with ideas. You can find it here. In particular, when you ask your question you'll want to ping @koaning; he works on non-English tools and might be able to help you further.
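To make that concrete, here is a minimal sketch of the adjustment to the second featurizer entry; treat the (2, 3) range as a starting point to experiment with, not a verified optimum for Vietnamese:

```yaml
# Narrow the character n-gram window so the char_wb
# CountVectorsFeaturizer produces far fewer sparse features.
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 2   # was 1
  max_ngram: 3   # was 4
```

After a change like this, timing `rasa train nlu` before and after should tell you whether the feature count was the bottleneck.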