Tags: python, json, pandas, openai-api

Load a JSONL file with OpenAI API request results into a pandas DataFrame


I have a large data set containing around 500k observations. It has a string variable that I want to create an embedding for. I used the OpenAI API to create the embeddings, and because of the large number of observations I used their script for parallel requests:

https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

Everything worked fine, but I'm struggling to load the results into a pandas DataFrame. The JSONL file with the results has the following structure, with each line corresponding to one of the 500k observations:

[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"}, {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}]

[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 2"}, {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}]

Now I want to read these results into a pandas DataFrame with the following structure: one variable that contains the "INPUT STRING" and 1536 additional variables that contain the embedding values.

I'm new to Python and JSON files. I usually work with CSV files and R.

I tried the read_json function from pandas, but that did not give me the structure I want:

import pandas as pd
openai_results = pd.read_json("results.jsonl", lines=True)

But this gives me a data set with only 2 variables. For the first observation, for example, the first variable contains {"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"} and the second variable contains {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}


Solution

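  • You can read the file with read_json as before and then pull the nested values out of the two resulting columns:

    import pandas as pd

    df = pd.read_json('your_file.json', lines=True)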
    df
    '''
       0                                                  1
    0  {'model': 'text-embedding-ada-002', 'input': '...  {'object': 'list', 'data': [{'object': 'embedd...
    1  {'model': 'text-embedding-ada-002', 'input': '...  {'object': 'list', 'data': [{'object': 'embedd...
    '''
    

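    Extract the nested values (the .str accessor also indexes into dicts and lists element-wise):

    # Column 0 holds the request dict, column 1 the response dict
    df["input"] = df[0].str["input"]
    # Equivalent: df[1].apply(lambda x: x["data"][0]["embedding"])
    df["embedding"] = df[1].str["data"].str[0].str["embedding"]
    df = df[["input", "embedding"]]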
    

    Output:

                   input           embedding
    0  INPUT STRING NR 1  [1, 2, 3, 4, 1536]
    1  INPUT STRING NR 2  [1, 2, 3, 4, 1536]
    

    If you want one row per embedding value (long format), use explode() on the embedding column:

    df = df.explode("embedding")
    df
    '''
                   input embedding
    0  INPUT STRING NR 1         1
    0  INPUT STRING NR 1         2
    0  INPUT STRING NR 1         3
    0  INPUT STRING NR 1         4
    0  INPUT STRING NR 1      1536
    1  INPUT STRING NR 2         1
    1  INPUT STRING NR 2         2
    1  INPUT STRING NR 2         3
    1  INPUT STRING NR 2         4
    1  INPUT STRING NR 2      1536
    '''
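
The question asks for one column per embedding value rather than one row per value. Starting from the two-column frame (before calling explode()), a minimal sketch of that wide layout could look like the following. The sample lines, the 3-value embeddings, and the `emb_` column prefix are assumptions made for illustration; the real file has 1536 values per line. The sketch also parses the lines with the standard json module instead of read_json, which is just one possible way to build the same two-column frame.

```python
import io
import json

import pandas as pd

# Two sample lines in the same [request, response] shape as the results file.
# Embeddings are shortened to 3 made-up values for illustration.
sample = io.StringIO(
    '[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"}, '
    '{"object": "list", "data": [{"object": "embedding", "index": 0, '
    '"embedding": [0.1, 0.2, 0.3]}]}]\n'
    '[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 2"}, '
    '{"object": "list", "data": [{"object": "embedding", "index": 0, '
    '"embedding": [0.4, 0.5, 0.6]}]}]\n'
)

# Each line is a JSON array: [request, response]. Skip blank lines.
records = [json.loads(line) for line in sample if line.strip()]

# Same two-column frame as in the answer: input string plus embedding list.
two_col = pd.DataFrame({
    "input": [request["input"] for request, response in records],
    "embedding": [response["data"][0]["embedding"] for request, response in records],
})

# Spread each embedding list across its own columns (emb_0, emb_1, ...).
wide = pd.DataFrame(two_col["embedding"].tolist(), index=two_col.index)
wide = wide.add_prefix("emb_")

# Put the input string next to the embedding columns.
result = pd.concat([two_col[["input"]], wide], axis=1)
print(result)
```

To try this on the real file, replace `sample` with an open file handle, for example `open("results.jsonl")`. With 500k lines of 1536 values each, the resulting frame is large, so memory use is worth keeping an eye on.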