
Parse only selected records from empty-line separated file


I have a file with the following structure:

SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz

Records (i.e., blocks) are separated by an empty line. Each line in a block starts with an SE tag. The text tag always occurs in the first line of each block.

I wonder how to properly extract only blocks with a relation tag, which is not necessarily present in each block. My attempt is pasted below:

from itertools import groupby
with open('test.txt') as f:
    for nonempty, group in groupby(f, bool):
        if nonempty:
            process_block() ## ?

Desired output is a json dump:

{
    "result": [
        {
            "text": "Baz", 
            "relation": ["Bla","Foo"]
        },
        {
            "text": "Zoo", 
            "relation": ["Bla","Baz"]
        }
    ]
}

Solution

  • Here is a proposed solution in pure Python that keeps a block if any of its lines has the relation tag. This could most likely be done more elegantly with a library such as pandas.

    from pprint import pprint

    fname = 'ex.txt'

    # extract blocks: a blank line starts a new block
    with open(fname, 'r') as f:
        blocks = [[]]
        for line in f:
            # strip() also handles '\r\n' endings and whitespace-only lines
            if not line.strip():
                blocks.append([])
            else:
                blocks[-1].append(line.strip().split('|'))

    # remove blocks that don't contain 'relation'
    blocks = [block for block in blocks
              if any('relation' == x[1] for x in block)]

    pprint(blocks)
    # [[['SE', 'text', 'Baz'],
    #   ['SE', 'entity', 'Bla'],
    #   ['SE', 'relation', 'Bla'],
    #   ['SE', 'relation', 'Foo']],
    #  [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']]]


    # To export to JSON, the following can be done
    import pandas as pd
    import json
    results = []
    for block in blocks:
        df = pd.DataFrame(block)
        json_dict = {}
        # column 1 holds the tag, column 2 holds the value
        json_dict['text'] = list(df[2][df[1] == 'text'])
        json_dict['relation'] = list(df[2][df[1] == 'relation'])
        results.append(json_dict)
    print(json.dumps(results))
    # [{"text": ["Baz"], "relation": ["Bla", "Foo"]}, {"text": ["Zoo"], "relation": ["Bla", "Baz"]}]
    # Note: unlike the desired output, "text" is a list here and there is no top-level "result" key.
    

    Let's go through it:

    1. Read the file line by line, starting a new block at each blank line and splitting every other line into columns on the | character.
    2. Go through the blocks and drop any that does not contain a relation line.
    3. For each remaining block, collect the text and relation values and print the result as JSON.
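
For comparison, here is a minimal standard-library sketch built on the groupby idea from the question. Note that groupby(f, bool) on its own does not split the blocks: lines read from a file keep their trailing newline, so a blank line is "\n", which is truthy. Stripping each line first fixes that. The function name parse_blocks and its wanted_tag parameter are made up for this illustration, and the sketch assumes every record line has the form SE|tag|value; lines that do not are ignored.

```python
import json
from itertools import groupby


def parse_blocks(lines, wanted_tag="relation"):
    """Hypothetical helper: one dict per block that has a `wanted_tag` line."""
    results = []
    # Raw lines keep their trailing newline, so strip before testing emptiness.
    stripped = (line.strip() for line in lines)
    for nonempty, group in groupby(stripped, key=bool):
        if not nonempty:
            continue  # a run of blank separator lines
        # Assumed record format: "SE|tag|value".
        fields = [line.split("|", 2) for line in group]
        values = [f[2] for f in fields if len(f) == 3 and f[1] == wanted_tag]
        if not values:
            continue  # this block has no line with the wanted tag
        # Take the first text value in the block (None if there is none).
        text = next((f[2] for f in fields if len(f) == 3 and f[1] == "text"), None)
        results.append({"text": text, wanted_tag: values})
    return {"result": results}


sample = """SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""

output = parse_blocks(sample.splitlines())
print(json.dumps(output, indent=4))
```

To run it on the file from the question, pass the open file object instead of the sample string: with open('test.txt') as f: output = parse_blocks(f). Unlike the pandas export above, this keeps "text" as a single string and wraps the list in a "result" key, matching the desired output.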