Search code examples
pythonnewlinefastafile-read

Python: a way to ignore/account for newlines with read()


So I am having a problem extracting text from a larger (>GB) text file. The file is structured as follows:

>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere

Now I have the information that in the entry with header2 I need to extract the text from position X to position Y (the A's in this example), starting with 1 as the first letter in the line below the header.

BUT: the positions do not account for newline characters. So basically when it says from 1 to 95 it really means just the letters from 1 to 80 and the following 15 of the next line.

My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when that stretches over newline(s) I get to few characters extracted.

Is there a way to solve this with another python-function than read() maybe? I thought about just replacing all newlines with empty strings but the file maybe quite large (millions of lines).

I also tried to account for the newlines by taking extractLength // 80 as added length, but this is problematic in cases like the example when eg. of 95 characters it's 2-80-3 over 3 lines I actually need 2 additional positions but 95 // 80 is 1.

UPDATE:

I modified my code to use Biopython:

for s in SeqIO.parse(sys.argv[2], "fasta"): 
        #foundClusters stores the information for substrings I want extracted
        currentCluster = foundClusters.get(s.id)

        if(currentCluster is not None):

            for i in range(len(currentCluster)):

                outputFile.write(">"+s.id+"|cluster"+str(i)+"\n")

                flanking = 25

                start = currentCluster[i][0]
                end = currentCluster[i][1]
                left = currentCluster[i][2]

                if(start - flanking < 0):
                    start = 0
                else:
                    start = start - flanking

                if(end + flanking > end + left):
                    end = end + left
                else:
                    end = end + flanking

                #for debugging only
                print(currentCluster)
                print(start)
                print(end)

                outputFile.write(s.seq[start, end+1])

But I get the following error:

[[1, 55, 2782]]
0
80
Traceback (most recent call last):
  File "findClaClusters.py", line 92, in <module>
    outputFile.write(s.seq[start, end+1])
  File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
   return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers

UPDATE2:

Changed outputFile.write(s.seq[start, end+1]) to:

outRecord = SeqRecord(s.seq[start: end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")

and its working :)


Solution

  • With Biopython:

    from Bio import SeqIO
    X = 66
    Y = 130
    for s in in SeqIO.parse("test.fst", "fasta"):
        if "header2" == s.id:
             print s.seq[X: Y+1]
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    

    Biopython let's you parse fasta file and access its id, description and sequence easily. You have then a Seq object and you can manipulate it conveniently without recoding everything (like reverse complement and so on).