Tags: python, string, parsing, line, block

Extract lines containing specific words


Input:

ID   aa
AA   Homo sapiens
DR   ac
BB   ad
FT   ae
//
ID   ba
AA   mouse
DR   bc
BB   bd
FT   be
//
ID   ca
AA   Homo sapiens
DR   cc
BB   cd
FT   ce
//

Expected output:

DR   ac
FT   ae
//
DR   cc
FT   ce
//

Code:

word = 'Homo sapiens'
with open(input_file, 'r') as txtin, open(output_file, 'w') as txtout:
    
    for block in txtin.read().split('//\n'):   # reading a file in blocks
        if word in block:   # extracted block containing the word, 'Homo sapiens'
            extracted_block = block + '//\n'

            for line in extracted_block.strip().split('\n'):   # divide each block into lines
                if line.startswith('DR   '):
                    dr = line 

                elif line.startswith('FT   '):
                    ft = line

I read input_file in blocks delimited by '//' and keep each block that contains the word 'Homo sapiens'. Within a kept block, I store the line starting with 'DR   ' in dr and the line starting with 'FT   ' in ft. How should I write the output using dr and ft to get the expected output?


Solution

  • You can write a simple line-by-line parser with a flag. When you reach an AA line that contains the word, set the flag to True so that the following fields of interest (DR, FT, and the // terminator) are written out. When you reach the end of a block (//), reset the flag.

    word = 'Homo sapiens'
    
    with open(input_file, 'r') as txtin, open(output_file, 'w') as txtout:
        keep = False
        for line in txtin:
            if keep and line.startswith(('DR', 'FT', '//')):
                txtout.write(line)
            if line.startswith('//'):
                keep = False # reset the flag on record end
            elif line.startswith('AA') and word in line:
                keep = True
    

    Output:

    DR   ac
    FT   ae
    //
    DR   cc
    FT   ce
    //
    

    NB. This requires the AA line to come before the fields you want to save. If it does not, you have to parse block by block (keeping each block's data in memory) with similar logic.
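
For the case where AA may come after DR or FT, here is a minimal sketch of that block-by-block variant. The function name `extract_fields` and its `selector` and `wanted` parameters are made up for illustration and are not part of the question. It buffers only the wanted lines of the current block and emits them once the `//` terminator is reached, provided a matching AA line was seen somewhere in the block.

```python
def extract_fields(lines, word, selector="AA", wanted=("DR", "FT")):
    """Yield the wanted lines (plus the '//' terminator) of every block
    whose selector line contains `word`, whatever the field order.

    `lines` is any iterable of lines that keep their trailing newline,
    for example an open text file.
    """
    block = []       # wanted lines collected for the current block
    matched = False  # True once a selector line containing `word` is seen
    for line in lines:
        if line.startswith("//"):
            # End of block: emit the buffered lines only if the block matched.
            if matched:
                yield from block
                yield line
            block, matched = [], False
        elif line.startswith(selector) and word in line:
            matched = True
        elif line.startswith(wanted):
            block.append(line)
    # A final block without a closing '//' is dropped (assumption: every
    # record is terminated, as in the sample input).


if __name__ == "__main__":
    # Here AA comes after DR, which the flag-based version would miss.
    sample = (
        "ID   aa\nDR   ac\nAA   Homo sapiens\nBB   ad\nFT   ae\n//\n"
        "ID   ba\nAA   mouse\nDR   bc\nBB   bd\nFT   be\n//\n"
    )
    print("".join(extract_fields(sample.splitlines(keepends=True), "Homo sapiens")), end="")
```

With the files from the question, this could be used as `txtout.writelines(extract_fields(txtin, word))` inside the same `with` block. Memory use stays bounded by the size of one block, since nothing is kept after the terminator.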