python-3.x, huggingface-transformers, language-model, simpletransformers

Fine tuning a pretrained language model with Simple Transformers


In his article 'Language Model Fine-Tuning For Pre-Trained Transformers' (https://medium.com/skilai/language-model-fine-tuning-for-pre-trained-transformers-b7262774a7ee), Thilina Rajapakse provides the following code snippet for fine-tuning a pre-trained model with the simpletransformers library:

from simpletransformers.language_modeling import LanguageModelingModel
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
}

model = LanguageModelingModel('bert', 'bert-base-cased', args=train_args)

# Train on the combined training file and evaluate against the held-out test file
model.train_model("data/train.txt", eval_file="data/test.txt")

model.eval_model("data/test.txt")

He then adds:

We assume that you have combined all the text in your dataset into two text files train.txt and test.txt, which can be found in the data/ directory.

I have 2 questions:

Question 1

Does the highlighted sentence above imply that the entire corpus will be merged into one text file? So, assuming the training corpus comprises 1,000,000 text files, are we supposed to merge them all into one text file with code like this?

import fileinput

# outfilename: path of the merged output file; filenames: iterable of the input file paths
with open(outfilename, 'w') as fout, fileinput.input(filenames) as fin:
    for line in fin:
        fout.write(line)

Question 2

I presume that I can use the pretrained model bert-base-multilingual-cased. Correct?


Solution

Question 1

Yes, the input to the train_model() and eval_model() methods needs to be a single file.

Dynamically loading from multiple files will likely be supported in the future.
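
For a corpus that is spread over many files, something along the lines of the fileinput snippet in the question works fine. Here is a minimal sketch of that merge step, assuming a hypothetical corpus/ directory holding the raw *.txt files and an already existing data/ directory:

from pathlib import Path

# Hypothetical locations; adjust to your own layout
corpus_files = sorted(Path("corpus").glob("*.txt"))

# Stream the files one by one so memory use stays flat, even with millions of inputs
with open("data/train.txt", "w", encoding="utf-8") as fout:
    for path in corpus_files:
        with open(path, encoding="utf-8") as fin:
            for line in fin:
                fout.write(line)

The resulting data/train.txt is the single file you pass to train_model(); a held-out subset written out the same way can serve as data/test.txt.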

Question 2

Yes, you can use the bert-base-multilingual-cased model.
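
A minimal sketch of the only change needed, reusing the constructor call and training arguments from the question (the data/ file layout is assumed to be the same as above):

from simpletransformers.language_modeling import LanguageModelingModel

train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
}

# Same call as in the question; only the pretrained checkpoint name changes
model = LanguageModelingModel('bert', 'bert-base-multilingual-cased', args=train_args)

model.train_model("data/train.txt", eval_file="data/test.txt")
model.eval_model("data/test.txt")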

You will find a much more detailed, updated guide on language model training here.

Source: I am the creator of the library.