
How to read files with separators in Python and append characters at the end?


I have a file format that looks like this

 >1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC

Note: "$$$" divides the file: anything before the first $$$ is Set 1, anything after it is Set 2, anything after the next $$$ is Set 3, etc.

I have to do the following:

(1) Concatenate the sequences that follow each ">" label. So I have to join "ATGC", "TTTT", "ATGC" and store the result as Set 1; join "ATCG", "TT-G", "TTCG", "TT-G", "TTCG", "TTCG" and store it as Set 2; and concatenate again and store the result as Set 3.

The output should be a list that looks like:

("ATGCTTTTATGC","ATCGTT-GTTCGTT-GTTCGTTCG","TTTTATGC")

(2) Then I find the set that has the maximum length, here Set 2.

(3) If the length of Set i is not equal to that of Set 2, I add "Z" at the end of Set i until its length equals the length of Set 2.

(4) I replace every "-" with "Z".

The output should look like:

 ("ATGCTTTTATGCZZZZZZZZZZZZ",
 "ATCGTTZGTTCGTTZGTTCGTTCG",
 "TTTTATGCZZZZZZZZZZZZZZZZ")

Here is the code I have attempted:

in_file = open('c:/test.txt','r')
org = []
seqlist = []
seqstring = ""

for line in in_file:
    if line.startswith("$$$"):
         if seqstring!= "":
            seqlist.append(seqstring)
            seqstring = ""
         org.append(line.rstrip("\n"))
    elif line.startswith(">"):
        seqstring += line.rstrip("\n")
seqlist.append(seqstring)

setdraft = seqlist
maxsetlength = max(len(seqlist))

setdraft2 =[]  

for i in setdraft:
     if len(i) != maxsetlength:
         setdraft2 = i.append("Z")

setfinal =[]

for j in setdraft2:
     if j in setdraft2 =="-":
         setfinal = j.insert ("Z")

The above script does not work; it gives me multiple errors. E.g., when I print setdraft, it gives me the output

['>1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC']

which is the same as the input, and then the script stops with:

Traceback (most recent call last):
  File "C:/Users/ACER/Desktop/trial.py", line 25, in <module>
    maxsetlength = max(len(seqlist))
TypeError: 'int' object is not iterable

Solution

  • It's unclear how regular your data set is, but if it follows the pattern above (namely, in each ">" record the sequence is everything after the first digit of the label), then you can parse it with a couple of split()s, pad the columns with itertools.zip_longest (izip_longest on Python 2), and zip the result back into rows to append the Z:

    >>> import itertools as it
    >>> import string
    >>> def digit_index(s):
    ...     for i, c in enumerate(s):
    ...         if c in string.digits:
    ...             return i
    ...     return 0
    ...
    >>> s = '>1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC'
    >>> parsed = [''.join(y[digit_index(y)+1:].replace('-', 'Z') for y in x.split('>')) for x in s.split('$$$')]
    >>> parsed
    ['ATGCTTTTATGC', 'ATCGTTZGTTCGTTZGTTCGTTCG', 'TTTTATGC']
    >>> [''.join(x) for x in zip(*it.zip_longest(*parsed, fillvalue='Z'))]
    ['ATGCTTTTATGCZZZZZZZZZZZZ',
     'ATCGTTZGTTCGTTZGTTCGTTCG',
     'TTTTATGCZZZZZZZZZZZZZZZZ']
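
The zip_longest/zip round trip above transposes the strings twice just to pad them. As a possibly simpler sketch of the same padding step, assuming the list holds the already-parsed strings shown above, str.ljust pads each string to the longest length directly. The name pad_to_longest is made up for this example:

```python
def pad_to_longest(seqs, fill="Z"):
    """Right-pad every string in seqs with fill up to the longest length."""
    # default=0 keeps max() from raising on an empty list
    longest = max((len(s) for s in seqs), default=0)
    return [s.ljust(longest, fill) for s in seqs]

parsed = ['ATGCTTTTATGC', 'ATCGTTZGTTCGTTZGTTCGTTCG', 'TTTTATGC']
padded = pad_to_longest(parsed)
print(padded)
```

Note that max() is given the individual lengths here, not len() of the whole list, which is what raised the TypeError in the question.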
    

    If a list of character tuples is acceptable, you can skip join()ing each row back into a string:

    >>> list(zip(*it.zip_longest(*parsed, fillvalue='Z')))
    [('A', 'T', 'G', 'C', 'T', 'T', 'T', 'T', 'A', 'T', 'G', 'C', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z'), 
     ('A', 'T', 'C', 'G', 'T', 'T', 'Z', 'G', 'T', 'T', 'C', 'G', 'T', 'T', 'Z', 'G', 'T', 'T', 'C', 'G', 'T', 'T', 'C', 'G'),
     ('T', 'T', 'T', 'T', 'A', 'T', 'G', 'C', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z')]
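
As for the code in the question: the printed list suggests the whole file is one line, so line.startswith("$$$") never matches and everything ends up in a single string. max(len(seqlist)) then fails because len() returns one int rather than something to iterate over, and strings have no append() or insert() methods. Below is a minimal sketch that reads the whole file and splits on the separators instead. It assumes labels look like those in the sample (optional letters followed by digits) and that sequences contain only letters and "-"; parse_sets, load_sets, and LABEL are hypothetical names, and unlike digit_index above, a record with no digit in it is kept whole.

```python
import re

# Assumed label shape: optional letters, then one or more digits (e.g. "1", "B2")
LABEL = re.compile(r"^[A-Za-z]*\d+")

def parse_sets(text, sep="$$$", fill="Z"):
    """Split text into sets, join each set's sequences, pad to equal length."""
    text = "".join(text.split())  # drop newlines and any other whitespace
    sets = []
    for chunk in text.split(sep):
        # Each record follows a ">"; strip its label and keep the sequence
        seqs = [LABEL.sub("", rec) for rec in chunk.split(">") if rec]
        joined = "".join(seqs).replace("-", fill)
        if joined:  # skip empty sets, e.g. after a trailing separator
            sets.append(joined)
    longest = max((len(s) for s in sets), default=0)
    return [s.ljust(longest, fill) for s in sets]

def load_sets(path):
    """Read the whole file at path (supplied by the caller) and parse it."""
    with open(path) as handle:
        return parse_sets(handle.read())

sample = ">1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC"
result = parse_sets(sample)
print(result)
```

Because whitespace is removed before splitting, the same function should work whether the file is a single line or has one record per line.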