
Converting a text document into a jsonl (json lines) format


I want to convert a text file into JSON Lines format using Python. This needs to work for a text file of any length (in characters or words).

As an example, I want to convert the following text:

A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so. 

These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics.

To this:

{"text": "A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so."}
{"text": "These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics."}

I tried this:

text = ""
with open("text.txt", encoding="utf8") as f:
    for line in f:
        text = {"text": line}

But no luck.


Solution

  • The basic idea of your for loop was correct, but the line text = {"text": line} overwrites the previous value on every iteration, so only the last line survives. What you want is to collect every line in a list.

    Try the following:

    import json
    
    # Generate a list of dictionaries
    lines = []
    with open("text.txt", encoding="utf8") as f:
        for line in f.read().splitlines():
            if line:
                lines.append({"text": line})
    
    # Serialize each dictionary to its own JSON string
    json_lines = [json.dumps(entry) for entry in lines]
    
    # Join the JSON strings with newlines and save to a .jsonl file
    json_data = "\n".join(json_lines)
    with open("my_file.jsonl", "w", encoding="utf8") as f:
        f.write(json_data)
    

    splitlines() removes the \n characters, and the if line: check skips empty lines. Lines that contain only whitespace are not empty strings, so they are still written.
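
Since the question asks for something that works on a file of any length, here is a minimal streaming sketch of the same idea. It writes each record as it is read rather than building the whole list in memory. The function name text_to_jsonl and its key parameter are made up for illustration. Unlike the code above, it strips surrounding whitespace from each line, skips whitespace-only lines, and ends every record with a newline.

```python
import json


def text_to_jsonl(src_path, dst_path, key="text"):
    """Write one JSON object per non-blank line of src_path to dst_path.

    Hypothetical helper: reads and writes line by line, so memory use
    stays roughly constant regardless of the input size.
    Returns the number of records written.
    """
    count = 0
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:
            # Assumption: surrounding whitespace is not wanted in the output
            line = line.strip()
            if not line:
                continue  # skip empty and whitespace-only lines
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            dst.write(json.dumps({key: line}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling text_to_jsonl("text.txt", "my_file.jsonl") should produce the two-line output shown in the question. To check the result, read the file back with json.loads applied to each line.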