
How to do language model training on BERT


I want to train BERT on a target corpus. I am looking at this HuggingFace implementation. They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?


Solution

  • The .raw extension only indicates that they use the raw version of WikiText; the files themselves are regular text files containing the raw text:

    We're using the raw WikiText-2 (no tokens were replaced before the tokenization).

    The description of the data file options also says that they are text files. From run_language_modeling.py - L86-L88:

    train_data_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a text file)."}
    )
    

    Therefore you can just point the script at your .txt files.
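
    Since .raw and .txt are both plain text, only the file's content matters, not its extension. A minimal sanity check (the file names here are hypothetical, just for illustration) shows that the same corpus reads identically under either extension:

    ```python
    from pathlib import Path

    # Hypothetical corpus file for illustration.
    txt = Path("corpus.txt")
    txt.write_text("The first training sentence.\nThe second training sentence.\n")

    # The same content saved under a .raw name, as in the WikiText example.
    raw = Path("corpus.raw")
    raw.write_text(txt.read_text())

    # The extension changes nothing: both files hold identical plain text.
    assert txt.read_text() == raw.read_text()
    print("identical plain text")
    ```

    When invoking the script, the `train_data_file` field above is parsed by `HfArgumentParser` into a command-line argument, so passing something like `--train_data_file=corpus.txt` should work as-is.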