I have been working on a question answering model that uses BERT to answer my questions. I would like to plot something like this:
The problem is that I don't know how to represent part of the context in a plot. I have two variables, answer_start and answer_end, which indicate which part of the context the model took its answer from. Can someone help me out and tell me which variables I need to pass to pyplot?
Below is my code:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import numpy as np
import pandas as pd

max_seq_length = 512
tokenizer = AutoTokenizer.from_pretrained("henryk/bert-base-multilingual-cased-finetuned-dutch-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("henryk/bert-base-multilingual-cased-finetuned-dutch-squad2")

questions = [
    "Welke soorten gladiatoren waren er?",
    "Wat is een provocator?"
]

for question in questions:  # for every question, iterate over all lines
    print(f"Question: {question}")
    with open("test.txt", "r") as f:
        for line in f:
            text = str(line)  # the line that may contain the answer must be a string
            # tokenize and encode the question together with the line
            inputs = tokenizer.encode_plus(question,
                                           text,
                                           add_special_tokens=True,
                                           max_length=max_seq_length,
                                           truncation=True,
                                           return_tensors="pt")
            input_ids = inputs["input_ids"].tolist()[0]
            # TODO: look up what the ** does
            answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)
            # index of the token with the highest start score
            answer_start = torch.argmax(answer_start_scores)
            # same for the end token; +1 because the slice end is exclusive
            answer_end = torch.argmax(answer_end_scores) + 1
            answer = tokenizer.convert_tokens_to_string(
                tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
            # skip lines for which the answer is [CLS] or empty
            if answer in ('[CLS]', ''):
                continue
            print(f"Answer: {answer}")
            print(f"Answer start: {answer_start}")
            print(f"Answer end: {answer_end}")
            break  # stop at the first line that yields an answer
Also the output:
> Question: Welke soorten gladiatoren waren er?
> Answer: de thraex, de retiarius en de murmillo
> Answer start: 24
> Answer end: 37
> Question: Wat is een provocator?
> Answer: telemachus
> Answer start: 87
> Answer end: 90
I'm not sure I fully understand the problem, but to make a plot similar to the one in the figure, I would do something like this:
import numpy as np
import matplotlib.pyplot as plt

plt.rcdefaults()

# one entry per word; the commas matter, without them Python joins the strings into one
sentence = ('list', 'of', 'words', 'that', 'make', 'up', 'the', 'sentence', 'in', 'which', 'the', 'answer', 'is', 'found')
y_pos = np.arange(len(sentence))
# placeholder values, one per word; replace them with the scores from the model
probability = [0.1, 0.2, 0.1, 0.8, 0.6, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

plt.bar(y_pos, probability, align='center', alpha=0.5)
plt.xticks(y_pos, sentence)
plt.ylabel('Answer probability')
plt.title('Words of the sentence')
plt.show()
So, assuming that the answer lies within a larger sentence or paragraph, I would put all the words of that sentence or paragraph on the x axis of a bar plot (the variable sentence, which I suppose comes from test.txt), and on the y axis the probability that a particular word is the first or last word of the answer (the variable probability). The two variables sentence and probability must have the same length: the first word in sentence corresponds to the first value in probability, and so on.
For instance, the tokens that torch.argmax picks from answer_start_scores and answer_end_scores are the ones with the highest scores, so their bars will be the tallest in the plot (the highest values in probability).
Finally, answer_start_scores and answer_end_scores hold one score per token for being the start or the end of the answer. These are raw scores (logits) rather than probabilities, so apply a softmax if you want values between 0 and 1.
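To make the link between the model scores and the bar plot concrete, here is a minimal sketch. It assumes the scores are available as a plain list of floats with one value per token; softmax and plot_token_probabilities are hypothetical helper names, not part of transformers or matplotlib, and the commented lines at the bottom only suggest how the variables from the question might be fed in.

```python
import math


def softmax(scores):
    """Turn a list of raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def plot_token_probabilities(tokens, probabilities, title="Answer start probability"):
    """Draw one bar per token (hypothetical helper for illustration)."""
    # imported here so that softmax() can be used without matplotlib installed
    import matplotlib.pyplot as plt

    if len(tokens) != len(probabilities):
        raise ValueError("tokens and probabilities must have the same length")
    positions = list(range(len(tokens)))
    plt.figure(figsize=(max(6, len(tokens) * 0.4), 4))
    plt.bar(positions, probabilities, align="center", alpha=0.5)
    plt.xticks(positions, tokens, rotation=90)
    plt.ylabel("Probability")
    plt.title(title)
    plt.tight_layout()
    plt.show()


# Assumed wiring with the variables from the question (not run here):
# tokens = tokenizer.convert_ids_to_tokens(input_ids)
# start_probs = softmax(answer_start_scores.detach()[0].tolist())
# plot_token_probabilities(tokens, start_probs)
```

Note that the x axis then shows the tokens BERT actually saw (question, context and special tokens such as [CLS] and [SEP]), not whitespace-separated words, so the bar positions line up with answer_start and answer_end.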
EDIT: You could also make two separate bar plots, one for the first word of the answer and one for the last word, and then join them together by adding the two percentages for each word.
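As a rough sketch of that last idea, the two lists can be added element by element before plotting. The name combine_scores is a hypothetical helper and the numbers are made up for four tokens; real values would come from the model.

```python
def combine_scores(start_probs, end_probs):
    """Add start and end probabilities token by token (hypothetical helper)."""
    if len(start_probs) != len(end_probs):
        raise ValueError("both lists must have one value per token")
    return [s + e for s, e in zip(start_probs, end_probs)]


# made-up values for a four-token example
start_probs = [0.1, 0.7, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.2, 0.6]
combined = combine_scores(start_probs, end_probs)
```

The combined list can be passed to plt.bar in place of probability. If both inputs are probability distributions, the combined values add up to 2 rather than 1, so the y axis is better read as a relative score than as a probability.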