What would be the best huggingface model to fine-tune for this type of task:
Example input 1:
If there's one person you don't want to interrupt in the middle of a sentence it's a judge.
Example output 1:
sentence
Example input 2:
A good baker will rise to the occasion, it's the yeast he can do.
Example output 2:
yeast
This looks like a Question Answering type of task, where the input is a sentence and the output is a span from the input sentence.
In transformers, this corresponds to the AutoModelForQuestionAnswering class.
See the illustration of the question-answering setup in the original BERT paper, where the input is a Question, a [SEP] token, and a Paragraph. The only difference in your case is that the input will be composed of the "question" only; in other words, you won't have the [SEP] token or the Paragraph.
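Concretely, tokenizing a single sentence gives you a [CLS] sentence [SEP] layout rather than the two-segment [CLS] question [SEP] paragraph [SEP] one. A quick check with bert-base-uncased:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("A good baker will rise to the occasion")["input_ids"]
print(tokenizer.decode(ids))  # [CLS] a good baker will rise to the occasion [SEP]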
Without knowing too much about your task, you might want to model this as a Token Classification type of task instead.
Here, the word(s) making up your output would be labelled with a positive tag and every other word with a negative tag. If this makes more sense for you, have a look at the AutoModelForTokenClassification class.
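For instance, a training example under that framing might look something like this (the binary word-level tagging scheme here is my own assumption, just to illustrate):

# Hypothetical labels: 1 = part of the answer, 0 = everything else
words = "A good baker will rise to the occasion, it's the yeast he can do.".split()
labels = [1 if w.strip(".,") == "yeast" else 0 for w in words]
# words:  ['A', 'good', 'baker', ..., 'yeast', 'he', 'can', 'do.']
# labels: [0, 0, 0, ..., 1, 0, 0, 0]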
I will base the rest of my discussion on question-answering, but these concepts can be easily adapted.
Since it seems that you're dealing with English sentences, you can probably use a pre-trained model such as bert-base-uncased.
Depending on your data's distribution (domain, casing, language), a different pre-trained model may be a better fit.
Not sure what your exact task is, but unless there's already a fine-tuned model that does it (you can try searching the HuggingFace model hub), you're going to have to fine-tune your own. To do so, you need a dataset of sentences labelled with the start and end indices of the answer span. See the documentation for more information on how to train.
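For example, here is a minimal sketch of turning a labelled sentence into the start/end token positions the QA head expects during fine-tuning. This assumes your labels are answer strings; the offset-mapping logic is my own, not taken from the transformers docs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "A good baker will rise to the occasion, it's the yeast he can do."
answer = "yeast"

# Character offsets of the answer in the raw sentence
char_start = sentence.index(answer)
char_end = char_start + len(answer)

# Map character offsets to token indices (requires a fast tokenizer)
enc = tokenizer(sentence, return_offsets_mapping=True)
start_position = end_position = None
for i, (s, e) in enumerate(enc["offset_mapping"]):
    if s <= char_start < e:
        start_position = i
    if s < char_end <= e:
        end_position = i
# start_position and end_position are the labels used to fine-tune the QA head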
Once you have a fine-tuned model, you just need to run your test sentences through it to extract answers. The following code, adapted from the HuggingFace documentation, does that (replace name with the path or hub ID of your fine-tuned model):
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

name = "path/to/your/fine-tuned-model"  # placeholder: your fine-tuned checkpoint

model = AutoModelForQuestionAnswering.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

text = "A good baker will rise to the occasion, it's the yeast he can do."
inputs = tokenizer(text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

with torch.no_grad():
    outputs = model(**inputs)

# Logits over token positions for the start and end of the answer span
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Most likely start and end positions (end is exclusive for slicing)
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[start_index:end_index])
)  # "yeast", hopefully!
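One caveat: taking the argmax of the start and end logits independently can occasionally yield an invalid span (an end position before the start). If that happens in practice, you may want to search over valid (start, end) pairs instead of argmax-ing each one separately.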