I am trying to build a question-answering model with BERT, the language model from Google. I am new to this and would really like to train it on my own corpus. I started with an example from the Hugging Face site, which worked fine:
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
    'context': "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland.",
    'question': "Wat is de hoofdstad van Nederland?"})
output
> {'answer': 'Amsterdam', 'end': 9, 'score': 0.825619101524353, 'start': 0}
Next, I created a .txt file to test whether I could replace the sentence in the context parameter with the exact same sentence read from a file:
with open('test.txt') as f:
    lines = f.readlines()

qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
    'context': lines,
    'question': "Wat is de hoofdstad van Nederland?"})
But this gave me the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-2bae0ecad43e> in <module>()
10 qa_pipeline({
11 'context': lines,
---> 12 'question': "Wat is de hoofdstad van Nederland?"})
5 frames
/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py in _is_whitespace(c)
84
85 def _is_whitespace(c):
---> 86 if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
87 return True
88 return False
TypeError: ord() expected a character, but string of length 66 found
I was just experimenting with ways to read and use a .txt file, but I can't find an approach that works. I did some research on the Hugging Face pipeline() function, and this is what was written about the question and context parameters:
Got it! The solution was really easy. I assumed that the variable 'lines' was already a str, but it wasn't: f.readlines() returns a list of strings. Once I converted it to a string, the question-answering model accepted my test.txt file.
So, from:
with open('test.txt') as f:
    lines = f.readlines()
to:
with open('test.txt') as f:
    lines = str(f.readlines())
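
A note on why this works, as a minimal sketch rather than a definitive explanation: f.readlines() returns a list with one string per line, so iterating over it yields whole lines instead of single characters, which appears to be what ord() complains about in the traceback. str() on that list does give a str, but it is the printed form of the list, including the brackets, quotes and an escaped newline. f.read() returns the text of the file directly. The temporary file below is only a stand-in for test.txt, and the variable names are my own.

```python
import os
import tempfile

sentence = "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland."

# Throwaway file standing in for test.txt (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sentence + "\n")

with open(path, encoding="utf-8") as f:
    as_list = f.readlines()        # list of str, one item per line

with open(path, encoding="utf-8") as f:
    as_repr = str(f.readlines())   # str, but the printed form of the list

with open(path, encoding="utf-8") as f:
    as_text = f.read().strip()     # str holding only the text of the file

# Iterating a list yields whole lines; iterating a str yields characters.
first_item = next(iter(as_list))
first_char = next(iter(as_text))

print(type(as_list).__name__, len(first_item))
print(as_repr)
print(as_text, ord(first_char))
```

Passing as_text as the 'context' should then behave like the inline sentence in the first example. With str(f.readlines()) the model still runs, but the context it sees starts with [' and so the 'start' and 'end' offsets would presumably be counted against that longer string.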