rasa

NLU training takes a long time


I have about 1000 examples and 25 intents in my NLU file, and 710 of the examples contain an entity (most have only one). Training takes about 30-40 minutes without a GPU (and about 6 minutes when tested in Google Colab with a Tesla T4). That seems quite long. Is it because I have too much data, or because of the pipeline I chose?

Here is my pipeline:

language: vi

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "RegexFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "LexicalSyntacticFeaturizer"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer" 
    analyzer: "char_wb" 
    min_ngram: 1 
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: "EntitySynonymMapper"

policies:
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    state_featurizer:
      - name: FullDialogueTrackerFeaturizer
    attn_shift_range: 2
    embed_dim: 20
    constrain_similarities: True
  - name: FallbackPolicy
    core_threshold: 0.5
    nlu_threshold: 0.4
  - name: FormPolicy
  - name: MappingPolicy

Rasa version: 2.3.2

Does anyone know where the problem is? Please help!


Solution

  • A few ideas come to mind.

    • You’re training for 100 epochs. That’s fine, but maybe you don’t need 100; sometimes you get convergence within 50. This advice is a bit of a “hack” because it risks a sub-optimal model, but it’s worth a mention.
    • You're using the CRFEntityExtractor, but DIET is already detecting entities here, so you might not need it. You should get a training-speed boost if you remove that component from your pipeline.
    • I see that you're using Vietnamese. I wonder if something about that language is having an effect. Since the alphabet is different (more accents, to my understanding), the CountVectorsFeaturizer may be generating many more features than it would in an English demo.
    • With that in mind, you might try reducing the ngrams that are being generated. Especially if the dataset contains long words, we may be generating a lot of features for DIET that could be tuned down. Perhaps setting the char_wb ngram range to (2, 3) is sufficient.
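    Taken together, a trimmed version of the pipeline might look like the sketch below. The epoch count of 50 and the (2, 3) character-ngram range are assumptions to validate against your own test set, not known-good values:

    ```yaml
    language: vi

    pipeline:
      - name: "WhitespaceTokenizer"
      - name: "RegexFeaturizer"
      # CRFEntityExtractor removed: DIET below already extracts entities
      - name: "LexicalSyntacticFeaturizer"
      - name: "CountVectorsFeaturizer"
      - name: "CountVectorsFeaturizer"
        analyzer: "char_wb"
        min_ngram: 2   # narrower ngram range -> fewer features for DIET
        max_ngram: 3
      - name: DIETClassifier
        epochs: 50     # try fewer epochs; check that accuracy holds
      - name: "EntitySynonymMapper"
    ```

    After a change like this, running `rasa test nlu` against a held-out set will tell you whether the faster model still performs acceptably.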

    There are many things that might be worth investigating here. To do that in detail, it might be better to post this question on the Rasa forum, where there can be a back and forth between you and folks with ideas. You can find it at forum.rasa.com. When you ask your question, you'll want to ping @koaning; he's working on non-English tools and may be able to help you further.
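    On the first point (how many epochs you actually need): if memory serves, Rasa 2.x can log training metrics to TensorBoard via a `tensorboard_log_directory` option on DIETClassifier, which makes it easy to see where the loss curve flattens. A sketch, with directory name and evaluation settings as placeholder assumptions:

    ```yaml
      - name: DIETClassifier
        epochs: 100
        # Write loss/accuracy curves so you can see when training converges;
        # the directory name is arbitrary.
        tensorboard_log_directory: "./tensorboard"
        evaluate_every_number_of_epochs: 5
        evaluate_on_number_of_examples: 200
    ```

    Then `tensorboard --logdir ./tensorboard` shows the curves; if accuracy plateaus well before epoch 100, you can lower `epochs` with more confidence.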