My Question: How can I run my question-answering model on a large .txt file (more than 512 tokens)?
Context: I am building a question-answering model on top of Google's BERT. The model works fine when I load a .txt file with a few sentences, but once the .txt file exceeds the 512-token limit that BERT accepts as context, the model no longer answers my questions.
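For reference, the 512 limit is counted in tokens after tokenization, not in words or bytes. A quick way to see how large the context really is (a minimal sketch, assuming the same tokenizer used below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("henryk/bert-base-multilingual-cased-finetuned-dutch-squad2")
with open("test.txt", "r") as f:
    text = f.read()

# Anything beyond 512 tokens is truncated away, so answers in the tail are never seen.
print(len(tokenizer.encode(text)))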
My Attempt to resolve the issue: I set a max_length in the encoding step, but that does not seem to solve the problem (my attempt is in the code below).
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

max_seq_length = 512

tokenizer = AutoTokenizer.from_pretrained("henryk/bert-base-multilingual-cased-finetuned-dutch-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("henryk/bert-base-multilingual-cased-finetuned-dutch-squad2")

with open("test.txt", "r") as f:
    text = f.read()

questions = [
    "Wat is de hoofdstad van Nederland?",       # "What is the capital of the Netherlands?"
    "Van welk automerk is een Cayenne?",        # "Which car brand makes a Cayenne?"
    "In welk jaar is pindakaas geproduceerd?",  # "In which year was peanut butter produced?"
]

for question in questions:
    inputs = tokenizer.encode_plus(question,
                                   text,
                                   add_special_tokens=True,
                                   max_length=max_seq_length,
                                   truncation=True,
                                   return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)
    answer_start = torch.argmax(answer_start_scores)   # most likely start of the answer
    answer_end = torch.argmax(answer_end_scores) + 1   # most likely end of the answer
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
Code-result:
> Question: Wat is de hoofdstad van Nederland?
> Answer: [CLS]
>
> Question: Van welk automerk is een Cayenne?
> Answer: [CLS]
>
> Question: In welk jaar is pindakaas geproduceerd?
> Answer: [CLS]
As you can see, the model only returns the [CLS] token, which is the special token added at the start of the sequence during the tokenizer's encoding step.
EDIT: I figured out that the way to solve this is to iterate through the .txt file, so that the model can find the answer by going through the text piece by piece. The reason the model answers with [CLS] is that it could not find the answer within the truncated 512-token context; it has to look further into the text.
By creating a loop like this:
with open("sample.txt", "r") as a_file:
for line in a_file:
text = line.strip()
print(text)
it is possible to pass each iterated piece of text into encode_plus, as sketched below.
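Putting the two pieces together, a rough sketch of what that could look like (an assumption on my part, not a definitive solution): run the model on each line, skip chunks where the predicted span starts at [CLS] (i.e. no answer found in that chunk), and keep the highest-scoring span. It reuses the tokenizer and model loaded above.

import torch

def answer_from_file(question, path="sample.txt"):
    # Hypothetical helper: scan the file line by line and keep the best-scoring answer span.
    best_answer, best_score = None, float("-inf")
    with open(path, "r") as a_file:
        for line in a_file:
            text = line.strip()
            if not text:
                continue
            inputs = tokenizer.encode_plus(question,
                                           text,
                                           add_special_tokens=True,
                                           max_length=512,
                                           truncation=True,
                                           return_tensors="pt")
            input_ids = inputs["input_ids"].tolist()[0]
            start_scores, end_scores = model(**inputs, return_dict=False)
            answer_start = torch.argmax(start_scores)
            answer_end = torch.argmax(end_scores) + 1
            if answer_start == 0:
                continue  # span starts at [CLS]: no answer in this chunk
            score = (start_scores[0, answer_start] + end_scores[0, answer_end - 1]).item()
            if score > best_score:
                best_score = score
                best_answer = tokenizer.convert_tokens_to_string(
                    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    return best_answer

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {answer_from_file(question)}\n")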