
How can I create and fit a vocab.bpe file (OpenAI GPT and GPT-2 models) with my own text corpus?


This question is for those familiar with the OpenAI GPT or GPT-2 models, in particular with the encoding step (Byte-Pair Encoding). This is my problem:

I would like to know how I could create my own vocab.bpe file.

I have a Spanish text corpus that I would like to use to fit my own BPE encoder. I have succeeded in creating the encoder.json with the python-bpe library, but I have no idea how to obtain the vocab.bpe file. I have reviewed the code in gpt-2/src/encoder.py but have not been able to find any hint. Any help or ideas?

Thank you so much in advance.


Solution

  • Check out the subword-nmt package, which provides the learn_bpe script; you can create an equivalent vocab.bpe with the following command:

    python learn_bpe.py -o ./vocab.bpe -i dataset.txt --symbols 50000
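To see what learn_bpe actually produces, here is a minimal self-contained sketch of the BPE learning loop: it repeatedly counts adjacent symbol pairs across the corpus and records the most frequent pair as a merge rule, and each rule becomes one line of a vocab.bpe-style file. The function name `learn_merges` and the toy corpus are illustrative, not part of GPT-2 or subword-nmt:

```python
from collections import Counter

def learn_merges(corpus_words, num_merges):
    """Learn BPE merge rules from a list of words (sketch, not subword-nmt itself)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter()
    for word in corpus_words:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: fuse every occurrence of the best pair into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_merges(["hola", "hola", "holas", "bola"], num_merges=3)
# Each line of a vocab.bpe-style file is one merge rule: "left right".
bpe_lines = ["{} {}".format(a, b) for a, b in merges]
print(bpe_lines)  # → ['o l', 'ol a', 'h ola']
```

Note that GPT-2's released vocab.bpe was learned at the byte level (bytes mapped to printable characters first); this sketch works directly on characters to keep it short.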
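On the loading side, gpt-2/src/encoder.py parses vocab.bpe by skipping the first line (a "#version" comment) and reading one space-separated merge pair per remaining line, so a file written in that layout can be fed to its Encoder. A short sketch of that parsing step, with the file contents inlined as a toy string for illustration:

```python
# How gpt-2/src/encoder.py reads vocab.bpe: drop the leading "#version"
# line, then turn each remaining line into a (left, right) merge pair.
bpe_text = "#version: 0.2\no l\nol a\nh ola\n"  # toy stand-in for open('vocab.bpe').read()
merge_lines = bpe_text.split('\n')[1:-1]
bpe_merges = [tuple(line.split()) for line in merge_lines]
print(bpe_merges)  # → [('o', 'l'), ('ol', 'a'), ('h', 'ola')]
```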