Prepare a JSON file for GPT


I would like to create a dataset for fine-tuning GPT-3. According to the fine-tuning guide at https://beta.openai.com/docs/guides/fine-tuning, the dataset should look like this:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...

For this reason, I am creating the dataset in the following way:

import json

# Data to be written
dictionary = {
    "prompt": "<text1>", "completion": "<text to be generated1>"}, {
    "prompt": "<text2>", "completion": "<text to be generated2>"}

with open("sample2.json", "w") as outfile:
    json.dump(dictionary, outfile)

However, when I load the file back, it looks like this, which is not the format I want:

import json
 
# Opening JSON file
with open('sample2.json', 'r') as openfile:
 
    # Reading from json file
    json_object = json.load(openfile)
 
print(json_object)
print(type(json_object))

>> [{'prompt': '<text1>', 'completion': '<text to be generated1>'}, {'prompt': '<text2>', 'completion': '<text to be generated2>'}]
<class 'list'>

Could you please let me know how I can solve this problem?


Solution

  • The format you want is JSON Lines: write a newline character (\n) after each JSON object, so that every line of the file is a separate JSON object. (The jsonlines link gave me a "server not found" error.)

    You have these options:

    1. Write \n after each JSON object:
    import json
    # `dictionary` is the tuple of dicts defined in the question
    with open("sample2_op1.json", "w") as outfile:
        for e_json in dictionary:
            json.dump(e_json, outfile)
            outfile.write('\n')
    # Read the file: each line is one JSON object, so parse it line by line
    with open("sample2_op1.json", "r") as file:
        for line in file:
            obj = json.loads(line)
            print(obj, type(obj))
    
    2. Use the jsonlines module, which can read the file as well as write it. Install it with !pip install jsonlines (the leading ! is for notebooks):
    import jsonlines
    #write to file
    with jsonlines.open('sample2_op2.jsonl', 'w') as outfile:
        outfile.write_all(dictionary)
    #read the file
    with jsonlines.open('sample2_op2.jsonl') as reader:
        for obj in reader:
            print(obj)
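
For reference, the two options above can be combined into one self-contained sketch that uses only the standard library. The helper names write_jsonl and read_jsonl, the records list, and the file name sample2_sketch.jsonl are hypothetical and chosen here for illustration; they are not part of the original answer or of any library. The sketch writes one JSON object per line, reads the file back, and checks that the round trip preserves the data.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write each record as one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as outfile:
        for record in records:
            # Without indent, json.dumps escapes newlines inside strings,
            # so each record occupies exactly one line of the file.
            outfile.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as infile:
        return [json.loads(line) for line in infile if line.strip()]


# Placeholder records mirroring the prompt/completion layout from the question
records = [
    {"prompt": "<text1>", "completion": "<text to be generated1>"},
    {"prompt": "<text2>", "completion": "<text to be generated2>"},
]

# Assumed output location: a file in the system temp directory
path = os.path.join(tempfile.gettempdir(), "sample2_sketch.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
print(loaded == records)
```

Unlike json.dump on the whole collection, which produces a single JSON array, this keeps every prompt/completion pair on its own line, which is the layout shown at the top of the question.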