I'm using bert pre-trained model for question and answering. It's returning correct result but with lot of spaces between the text
The code is below :
def get_answer_using_bert(question, reference_text):
bert_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
bert_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
input_ids = bert_tokenizer.encode(question, reference_text)
input_tokens = bert_tokenizer.convert_ids_to_tokens(input_ids)
sep_location = input_ids.index(bert_tokenizer.sep_token_id)
first_seg_len, second_seg_len = sep_location + 1, len(input_ids) - (sep_location + 1)
seg_embedding = [0] * first_seg_len + [1] * second_seg_len
model_scores = bert_model(torch.tensor([input_ids]),
token_type_ids=torch.tensor([seg_embedding]))
ans_start_loc, ans_end_loc = torch.argmax(model_scores[0]), torch.argmax(model_scores[1])
result = ' '.join(input_tokens[ans_start_loc:ans_end_loc + 1])
result = result.replace('#', '')
return result
Followed by code below :
reference_text = 'Mukesh Dhirubhai Ambani was born on 19 April 1957 in the British Crown colony of Aden (present-day Yemen) to Dhirubhai Ambani and Kokilaben Ambani. He has a younger brother Anil Ambani and two sisters, Nina Bhadrashyam Kothari and Dipti Dattaraj Salgaonkar. Ambani lived only briefly in Yemen, because his father decided to move back to India in 1958 to start a trading business that focused on spices and textiles. The latter was originally named Vimal but later changed to Only Vimal His family lived in a modest two-bedroom apartment in Bhuleshwar, Mumbai until the 1970s. The family financial status slightly improved when they moved to India but Ambani still lived in a communal society, used public transportation, and never received an allowance. Dhirubhai later purchased a 14-floor apartment block called Sea Wind in Colaba, where, until recently, Ambani and his brother lived with their families on different floors.'
question = 'What is the name of mukesh ambani brother?'
get_answer_using_bert(question, reference_text)
And the output is :
'an il am ban i'
Can anyone help me how to fix this issue. It would be really helpful.
You can just use the tokenizer decode function:
bert_tokenizer.decode(input_ids[ans_start_loc:ans_end_loc +1])
Output:
'anil ambani'
In case you do not want to use decode, you can use:
result.replace(' ##', '')