
How to do language model training on BERT


I want to train BERT on a target corpus. I am looking at this HuggingFace implementation. They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?


Solution

  • The .raw extension only indicates that they use the raw version of WikiText; the files themselves are regular text files containing the raw text:

    We're using the raw WikiText-2 (no tokens were replaced before the tokenization).

    The description of the data file options also says that they are text files. From run_language_modeling.py - L86-L88:

    train_data_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a text file)."}
    )
    

    Therefore you can just point the script at your .txt files.
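
    Since .raw and .txt are both plain text, only the file's content matters, not its extension. A minimal sanity check (the file names here are hypothetical, just for illustration) shows that the same corpus reads identically under either extension:

    ```python
    from pathlib import Path

    # Hypothetical corpus file for illustration.
    txt = Path("corpus.txt")
    txt.write_text("The first training sentence.\nThe second training sentence.\n")

    # The same content saved under a .raw name, as in the WikiText example.
    raw = Path("corpus.raw")
    raw.write_text(txt.read_text())

    # The extension changes nothing: both files hold identical plain text.
    assert txt.read_text() == raw.read_text()
    print("identical plain text")
    ```

    When invoking the script, the `train_data_file` field above is parsed by `HfArgumentParser` into a command-line argument, so passing something like `--train_data_file=corpus.txt` should work as-is.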