Tags: python, python-3.x, bioinformatics, fasta

Reading a file block by block using a specified delimiter in Python


I have an input_file.fa file like this (FASTA format):

> header1 description
data data
data
>header2 description
more data
data
data

I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1:

> header1 description
data data
data

Of course I could just read in the file like this and split:

with open("input_file.fa") as f:
    for block in f.read().split(">"):
        pass

But I want to avoid reading the whole file into memory, because the files are often large.

I can read in the file line by line of course:

with open("input_file.fa") as f:
    for line in f:
        pass

But ideally what I want is something like this:

with open("input_file.fa", newline=">") as f:
    for block in f:
        pass

But I get an error:

ValueError: illegal newline value: >

I've also tried using the csv module, but with no success.

I did find this post from 3 years ago, which provides a generator-based solution to this issue, but it doesn't seem that compact. Is this really the only/best solution? It would be neat if the generator could be created in a single line rather than as a separate function, something like this pseudocode:

with open("input_file.fa") as f:
    blocks = magic_generator_split_by_>
    for block in blocks:
        pass

If this is impossible, then I guess you could consider my question a duplicate of the other post, but if that is so, I hope people can explain to me why the other solution is the only one. Many thanks.


Solution

  • A general solution here is to write a generator function that yields one group at a time. This way you will be storing only one group at a time in memory.

    def get_groups(seq, group_by):
        data = []
        for line in seq:
            # Here the `startswith()` logic can be replaced with other
            # condition(s) depending on the requirement.
            if line.startswith(group_by):
                if data:
                    yield data
                    data = []
            data.append(line)
    
        if data:
            yield data
    
    with open('input_file.fa') as f:
        for i, group in enumerate(get_groups(f, ">"), start=1):
            print("Group #{}".format(i))
            print("".join(group))
    

    Output:

    Group #1
    > header1 description
    data data
    data
    
    Group #2
    >header2 description
    more data
    data
    data
    

    For FASTA files in general, I would recommend using the Biopython package.
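
Regarding the more compact form the question asks about: one possible variant, offered as a sketch rather than a canonical recipe, lets `itertools.groupby` do the bookkeeping. The names `iter_blocks` and `block_key` are made up for this illustration, and an in-memory `io.StringIO` stands in for the open file. It still reads lazily, so only one block is held in memory at a time.

```python
import io
from itertools import groupby


def iter_blocks(lines, marker=">"):
    """Lazily yield one block (header line plus its data lines) as a string."""
    block_no = 0

    def block_key(line):
        # Bump the counter on every header line, so all lines of one
        # block share the same key and groupby() keeps them together.
        nonlocal block_no
        if line.startswith(marker):
            block_no += 1
        return block_no

    for _, group in groupby(lines, key=block_key):
        yield "".join(group)


# Stand-in for an open file; with a real file, pass the handle from open().
sample = io.StringIO(
    "> header1 description\ndata data\ndata\n"
    ">header2 description\nmore data\ndata\ndata\n"
)
for i, block in enumerate(iter_blocks(sample), start=1):
    print("Group #{}".format(i))
    print(block)
```

As with `get_groups`, any lines that appear before the first header come out as a block of their own, so a caller that wants strict FASTA records may want to check that each block starts with the marker.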