I think the documentation only explains how to use the model through an API, which does not allow much flexibility or automation. For example, I do not know how to test my model against some popular benchmarks from HuggingFace.
The general flow of fine-tuning OpenAI models consists of creating an account, obtaining a valid API key, and then uploading the data for fine-tuning using the CLI tool, as described here: https://beta.openai.com/docs/guides/fine-tuning
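As a concrete illustration of the data-preparation step, here is a minimal sketch (the file name and example contents are placeholders, not from the original post) of writing training data in the prompt/completion JSONL format that the fine-tuning guide expects:

```python
import json

# Each training example is one JSON object per line with "prompt" and
# "completion" keys, as described in the fine-tuning guide.
# These example pairs are placeholders for your own data.
examples = [
    {"prompt": "Question: What is the capital of France?\nAnswer:",
     "completion": " Paris"},
    {"prompt": "Question: Who wrote Hamlet?\nAnswer:",
     "completion": " William Shakespeare"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# The resulting file is then uploaded for fine-tuning with the CLI, e.g.:
#   openai api fine_tunes.create -t train.jsonl -m curie
```

The base model name (`curie` here) is just an example; pick whichever base model suits your task.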
Then, to test against question-answering benchmarks such as SQuAD, you simply download the dataset and create a script that takes the questions (see the JSON snippet below) and feeds them to your model by calling the API as described here (using curl): https://beta.openai.com/docs/api-reference/making-requests
```json
{
  "question": "What century did the Normans first gain their separate identity?",
  "id": "56ddde6b9a695914005b962c",
  "answers": [
    { "text": "10th century", "answer_start": 671 },
    { "text": "the first half of the 10th century", "answer_start": 649 },
    { "text": "10th", "answer_start": 671 },
    { "text": "10th", "answer_start": 671 }
  ],
  "is_impossible": false
}
```
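The scripted evaluation described above can be sketched as follows. This is a sketch under assumptions, not a definitive implementation: `build_prompt` and the prompt wording are hypothetical helpers (the prompt format should match whatever you used for fine-tuning), and the model name is a placeholder. Only the completions endpoint and headers from the API docs are taken as given.

```python
import json
import urllib.request

def build_prompt(question):
    # Hypothetical prompt template; it should mirror the format of your
    # fine-tuning data.
    return f"Question: {question}\nAnswer:"

def ask_model(question, api_key, model="curie"):
    # POST to the completions endpoint described in the API reference.
    # The model name and sampling parameters here are placeholders.
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(question),
        "max_tokens": 32,
        "temperature": 0,
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://api.openai.com/v1/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"].strip()

# Example: take the question from a SQuAD-style record like the snippet above
# and turn it into a prompt (the API call itself requires a valid key).
record = {
    "question": "What century did the Normans first gain their separate identity?",
    "answers": [{"text": "10th century", "answer_start": 671}],
    "is_impossible": False,
}
prompt = build_prompt(record["question"])
```

From there you would loop over all records in the downloaded SQuAD file, collect the model's answers, and compare them against the `answers` spans with the benchmark's exact-match or F1 scoring.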