
(OpenNMT) Spanish to English Model Improvement

I’m currently trying to train a Spanish to English model using YAML config files. My data set is pretty big, but just for starters I’m trying to get a 10,000-sentence training set and a 1,000-2,000-sentence validation set working well first. However, after trying for days, I think I need help: my validation accuracy goes down the more I train, while my training accuracy goes up.

My data comes from the ES-EN coronavirus commentary data set from ModelFront, found here. I found the parallel sentences to be pretty accurate. I’m using the first 10,000 parallel lines from the dataset, skipping sentences that contain any digits. I then take the next 1,000 or 2,000 for my validation set and the next 1,000 for my test set, again keeping only sentences without digits. The data looks clean, and the sentences are lined up with each other on the respective lines.
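For reference, the split described above could be sketched roughly as follows. The function names and the toy sentences are made up for illustration, and the sketch assumes "no digits" means dropping a pair when either side contains a digit; the real script may differ.

```python
import re
from itertools import islice

DIGIT = re.compile(r"\d")

def digit_free_pairs(src_lines, tgt_lines):
    """Yield aligned (src, tgt) pairs in which neither side contains a digit."""
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not DIGIT.search(src) and not DIGIT.search(tgt):
            yield src, tgt

def split_pairs(pairs, n_train=10_000, n_valid=2_000, n_test=1_000):
    """Take consecutive, non-overlapping train/valid/test slices from the pairs."""
    it = iter(pairs)  # one shared iterator, so each slice continues where the last ended
    train = list(islice(it, n_train))
    valid = list(islice(it, n_valid))
    test = list(islice(it, n_test))
    return train, valid, test

# Toy data only, to show the behaviour on a handful of lines.
src = ["hola", "tengo 2 gatos", "adiós", "gracias", "sí"]
tgt = ["hello", "I have 2 cats", "goodbye", "thanks", "yes"]
train, valid, test = split_pairs(digit_free_pairs(src, tgt), n_train=2, n_valid=1, n_test=1)
```

Because the three slices are taken from one iterator, the validation and test sentences can never overlap with the training sentences by position, although duplicate sentences in the raw data would still leak across the sets.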

I then use sentencepiece to build a subword model. Using the spm_train command, I feed in my English and Spanish training sets as a comma-separated --input argument and output a single esen.model. I chose the unigram model type and a vocab size of 16000.

As for my YAML configuration file, here is what I specify:

My source and target training data (the 10,000 I extracted for English and Spanish with “sentencepiece” in the transforms [])

My source and target validation data (2,000 for English and Spanish with “sentencepiece” in the transforms [])

My sentencepiece model esen.model as both the source and target subword model

Encoder: rnn Decoder: rnn Type: LSTM Layers: 2 bidir: true

Optim: Adam Learning rate: 0.001

Training steps: 5000 Valid steps: 1000

Other logging data.

Upon starting the training with onmt_train, my training accuracy starts off at 7.65 and goes into the low 70s by the time 5000 steps are over. But in that time frame, my validation accuracy goes from 24 to 19.

I then use BLEU to score my test set, which gets a brevity penalty (BP) of ~0.67.
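To put the BP figure in context, here is a small sketch of the standard corpus-level BLEU brevity penalty, BP = exp(1 - r/c) when the total hypothesis length c is shorter than the total reference length r, and 1 otherwise. The post does not say which BLEU tool was used, so this assumes the standard definition; the function names are illustrative.

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty for total hypothesis and reference lengths."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

def length_ratio_for_bp(bp):
    """Invert BP = exp(1 - r/c) to get the ratio c/r, for 0 < bp <= 1."""
    return 1.0 / (1.0 - math.log(bp))

# A BP of about 0.67 corresponds to a hypothesis/reference length ratio of roughly 0.71.
ratio = length_ratio_for_bp(0.67)
```

Under that assumption, a BP of ~0.67 means the translations are only about 71% as long as the references in total, so the model is producing noticeably short output.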

I noticed that when I tried SGD with a learning rate of 1, my validation accuracy kept increasing, but the validation perplexity started going back up at the end.

I’m wondering if I’m doing anything wrong that would make my validation accuracy go down while my training accuracy goes up? Do I just need to train more? Can anyone recommend anything else to improve this model? I’ve been staring at it for a few days. Anything is appreciated. Thanks.

!spm_train --input=data/spanish_train,data/english_train --model_prefix=data/esen --character_coverage=1 --vocab_size=16000 --model_type=unigram

## Where the samples will be written
save_data: en-sp/run/example

## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt

## Where the model will be saved
save_model: drive/MyDrive/ESEN/model3_bpe_adam_001_layer2/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:  # corpus name assumed here; the original key did not survive
        path_src: data/spanish_train
        path_tgt: data/english_train
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: data/spanish_valid
        path_tgt: data/english_valid
        transforms: [sentencepiece]

skip_empty_level: silent
src_subword_model: data/esen.model
tgt_subword_model: data/esen.model

# General opts
report_every: 100
train_steps: 5000
valid_steps: 1000
save_checkpoint_steps: 1000
world_size: 1
gpu_ranks: [0]

# Optimizer
optim: adam
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
layers: 2
rnn_type: LSTM
bidir_edges: True

# Logging
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/log-file.txt
verbose: True
attn_debug: True
align_debug: True
global_attention: general
global_attention_function: softmax

onmt_build_vocab -config en-sp.yaml -n_sample -1

onmt_train -config en-sp.yaml

Step 1000/ 5000; acc:  27.94; ppl: 71.88; xent: 4.27; lr: 0.00100; 13103/12039 tok/s;    157 sec
Validation perplexity: 136.446
Validation accuracy: 24.234


Step 4000/ 5000; acc:  61.25; ppl:  5.28; xent: 1.66; lr: 0.00100; 13584/12214 tok/s;    641 sec
Validation accuracy: 22.1157
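One way to read these log lines: the training ppl and xent columns are consistent with perplexity being the exponential of the per-token cross-entropy (exp(4.27) is about 71.5, close to the logged 71.88 once rounding of xent is allowed for). Assuming the validation perplexity is computed the same way, the two can be put on the same scale with a short sketch; the helper names are illustrative.

```python
import math

def ppl_from_xent(xent):
    """Perplexity as the exponential of per-token cross-entropy (in nats)."""
    return math.exp(xent)

def xent_from_ppl(ppl):
    """Per-token cross-entropy (in nats) recovered from a perplexity."""
    return math.log(ppl)

# Values taken from the step-1000 log lines above.
train_xent = 4.27
valid_ppl = 136.446

train_ppl_check = ppl_from_xent(train_xent)   # about 71.5, versus 71.88 in the log
valid_xent = xent_from_ppl(valid_ppl)         # about 4.92, versus 4.27 on training data
```

So at step 1000 the validation loss is already clearly higher than the training loss, and by step 4000 the training xent has dropped to 1.66 while validation accuracy has fallen.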



  • my validation accuracy goes down the more I train while my training accuracy goes up.

    It sounds like overfitting.

    10K sentences is just not a lot. So what you are seeing is expected. You can just stop training when the results on the validation set stop improving.

    That same basic dynamic can happen at greater scale too; it'll just take a lot longer.
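The "stop when validation stops improving" rule can be sketched in a few lines. This is a generic illustration, not OpenNMT's own implementation; the function name and the patience parameter are made up for the example, and the only numbers used are the two validation accuracies quoted in the question.

```python
def pick_checkpoint(valid_scores, patience=1):
    """Scan (step, score) pairs in training order, where higher scores are better.

    Returns (best_step, best_score, stop_step). stop_step is the step at which
    `patience` consecutive validations failed to beat the best score, or None
    if that never happened.
    """
    best_step, best_score = None, float("-inf")
    bad_rounds = 0
    for step, score in valid_scores:
        if score > best_score:
            best_step, best_score, bad_rounds = step, score, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                return best_step, best_score, step
    return best_step, best_score, None

# Validation accuracies reported in the question at steps 1000 and 4000.
history = [(1000, 24.234), (4000, 22.1157)]
best_step, best_score, stop_step = pick_checkpoint(history, patience=1)
```

With those two points the best checkpoint is the one saved at step 1000, so that is the model to keep. I believe OpenNMT-py also has a built-in early_stopping training option, but check the documentation for your version before relying on it.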

    If your goal is to train your own reasonably good model, I see a few options:

    1. increase the training set size to 1M lines or so
    2. start with a pretrained model and fine-tune
    3. both

    For 1, there are at least 1M lines of English:Spanish you can get from ModelFront even after filtering out the noisiest.

    For 2, I know the team at YerevaNN got winning results at WMT20 starting with a Fairseq model and using about 300K translations. And they were able to do that with fairly limited hardware.