word-embedding, bert-language-model, huggingface-transformers

How to use my own corpus with the word embedding model BERT


I am trying to create a question-answering model with Google's word embedding model BERT. I am new to this and would really like to use my own corpus for the training. At first I used an example from the Hugging Face site, and that worked fine:

from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)

qa_pipeline({
    'context': "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland.",
    'question': "Wat is de hoofdstad van Nederland?"})

Output:

> {'answer': 'Amsterdam', 'end': 9, 'score': 0.825619101524353, 'start': 0}

So, I created a .txt file to test whether I could replace the sentence passed in the context parameter with the exact same sentence read from a .txt file.

with open('test.txt') as f:
    lines = f.readlines()

qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)

qa_pipeline({
    'context': lines,
    'question': "Wat is de hoofdstad van Nederland?"})

But this gave me the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-2bae0ecad43e> in <module>()
     10 qa_pipeline({
     11     'context': lines,
---> 12     'question': "Wat is de hoofdstad van Nederland?"})

5 frames
/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py in _is_whitespace(c)
     84 
     85 def _is_whitespace(c):
---> 86     if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
     87         return True
     88     return False

TypeError: ord() expected a character, but string of length 66 found
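
Printing the types makes the mismatch visible (a quick check, assuming test.txt contains the context sentence):

with open('test.txt') as f:
    lines = f.readlines()

print(type(lines))     # <class 'list'> -- readlines() returns a list of lines
print(type(lines[0]))  # <class 'str'>  -- each element is a whole line

# The pipeline's SQuAD preprocessing walks the context character by
# character; iterating over a list yields whole lines instead of single
# characters, so ord() receives a multi-character string and raises TypeError.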

I was just experimenting with ways to read and use a .txt file, but I couldn't find a different solution. I did some research on the Hugging Face pipeline() function, and this is what the documentation says about the question and context parameters:

[Screenshot of the Hugging Face pipeline() documentation describing the question and context parameters]


Solution

  • Got it! The solution was really easy. I assumed that the variable 'lines' was already a str, but that wasn't the case: readlines() returns a list of strings. Just by casting it to a string, the question-answering model accepted my test.txt file.

    So, from:

    with open('test.txt') as f:
        lines = f.readlines()
    

    to:

    with open('test.txt') as f:
        lines = str(f.readlines())
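
    Note that str(f.readlines()) also embeds the list brackets, quotes, and \n escape sequences in the context string. A cleaner sketch, assuming the whole file should serve as one context, is to read it as a single string with f.read():

    with open('test.txt') as f:
        context = f.read()  # the whole file as one str, no list wrapper

    qa_pipeline({
        'context': context,
        'question': "Wat is de hoofdstad van Nederland?"})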