I have a file format that looks like this:
>1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC
Note: "$$$" divides the file, such that anything before the first $$$ is Set 1, anything between the first and second $$$ is Set 2, anything after the second $$$ is Set 3, etc.
I have to do the following:
(1) Concatenate the sequences following each ">" label. So I have to join "ATGC", "TTTT" and "ATGC" and store the result as Set 1; concatenate "ATCG", "TT-G", "TTCG", "TT-G", "TTCG" and "TTCG" and store the result as Set 2; then concatenate again and store the result as Set 3.
The output should be a list that looks like:
("ATGCTTTTATGC","ATCGTT-GTTCGTT-GTTCGTTCG","TTTTATGC")
(2) Then I find the Set that has the maximum length => here Set 2
(3) If the length of Set i is not equal to the length of Set 2, I add "Z" at the end of Set i until its length equals the length of Set 2
(4) I replace all "-" with "Z"
The output should look like:
("ATGCTTTTATGCZZZZZZZZZZZZ",
"ATCGTTZGTTCGTTZGTTCGTTCG",
"TTTTATGCZZZZZZZZZZZZZZZZ")
Here is the code I have attempted:
    in_file = open('c:/test.txt','r')
    org = []
    seqlist = []
    seqstring = ""
    for line in in_file:
        if line.startswith("$$$"):
            if seqstring!= "":
                seqlist.append(seqstring)
                seqstring = ""
            org.append(line.rstrip("\n"))
        elif line.startswith(">"):
            seqstring += line.rstrip("\n")
    seqlist.append(seqstring)
    setdraft = seqlist
    maxsetlength = max(len(seqlist))
    setdraft2 =[]
    for i in setdraft:
        if len(i) != maxsetlength:
            setdraft2 = i.append("Z")
    setfinal =[]
    for j in setdraft2:
        if j in setdraft2 =="-":
            setfinal = j.insert ("Z")
The above script does not work. It gives me multiple errors.
E.g. when I print setdraft, it gives me the output
['>1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC']
which is the same as the input. The max() line then fails with:
    Traceback (most recent call last):
      File "C:/Users/ACER/Desktop/trial.py", line 25, in <module>
        maxsetlength = max(len(seqlist))
    TypeError: 'int' object is not iterable
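It's unclear how fragile your data set is, but if it follows the pattern above (namely, the last 4 characters of each record are the ones you are looking for), then you can use a couple of split()s to pull the sequences out, itertools.zip_longest with fillvalue='Z' to pad the shorter ones, and zip to turn the padded columns back into rows. Note that the code below actually takes everything after the first digit in each record, rather than counting 4 characters from the end: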
>>> import itertools as it
>>> import string
>>> def digit_index(s):
...     for i, c in enumerate(s):
...         if c in string.digits:
...             return i
...     return 0
...
>>> s = '>1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC'
>>> parsed = [''.join(y[digit_index(y)+1:].replace('-', 'Z') for y in x.split('>')) for x in s.split('$$$')]
>>> parsed
['ATGCTTTTATGC', 'ATCGTTZGTTCGTTZGTTCGTTCG', 'TTTTATGC']
>>> [''.join(x) for x in zip(*it.zip_longest(*parsed, fillvalue='Z'))]
['ATGCTTTTATGCZZZZZZZZZZZZ',
'ATCGTTZGTTCGTTZGTTCGTTCG',
'TTTTATGCZZZZZZZZZZZZZZZZ']
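If the double transpose feels too clever, the padding step can also be done directly with str.ljust. This is only a sketch under the same assumptions as above; pad_sets is a name I made up, and it expects the already-concatenated strings from step (1). It also shows the fix for the TypeError in the question: len(seqlist) is a single int, so max() has nothing to iterate over, and you need the maximum over the individual lengths instead.

```python
def pad_sets(seqs, fill="Z"):
    """Replace '-' with `fill` and right-pad every string to the longest length."""
    # max over each string's length, not max(len(seqs)), which is just one int
    longest = max(map(len, seqs), default=0)
    # a one-character swap keeps the length, so padding to `longest` is still right
    return [s.replace("-", fill).ljust(longest, fill) for s in seqs]


padded = pad_sets(["ATGCTTTTATGC", "ATCGTT-GTTCGTT-GTTCGTTCG", "TTTTATGC"])
for row in padded:
    print(row)
```

Since "-" and "Z" are both one character, it doesn't matter whether you replace before or after padding.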
If you don't mind it as a list of tuples, then you can avoid join()ing it back to strings:
>>> list(zip(*it.zip_longest(*parsed, fillvalue='Z')))
[('A', 'T', 'G', 'C', 'T', 'T', 'T', 'T', 'A', 'T', 'G', 'C', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z'),
('A', 'T', 'C', 'G', 'T', 'T', 'Z', 'G', 'T', 'T', 'C', 'G', 'T', 'T', 'Z', 'G', 'T', 'T', 'C', 'G', 'T', 'T', 'C', 'G'),
('T', 'T', 'T', 'T', 'A', 'T', 'G', 'C', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z')]
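One caveat on digit_index: it cuts right after the first digit, so a label with more than one digit (say >B10ATCG) would leave a stray "0" at the front of the sequence. If your real labels can look like that, here is a rough sketch of step (1) that strips the whole label with a regular expression instead. I'm assuming a label is any run of non-digits followed by one or more digits, and that the sequences themselves never contain digits; concat_sets is a hypothetical name.

```python
import re


def concat_sets(text, sep="$$$"):
    """Split `text` on `sep`, strip each record's label, and join per set."""
    sets = []
    for chunk in text.split(sep):
        # records are separated by '>'; empty pieces simply contribute nothing
        records = chunk.split(">")
        # assumption: label = optional non-digits followed by one or more digits
        seqs = [re.sub(r"^\D*\d+", "", rec.strip()) for rec in records]
        sets.append("".join(seqs))
    return sets


s = '>1ATGC>2TTTT>3ATGC>$$$>B1ATCG>B2TT-G>3TTCG>B4TT-G>B5TTCG>B6TTCG$$$>C1TTTT>C2ATGC'
print(concat_sets(s))
```

On the sample line this gives the three concatenated strings from step (1), with the "-" characters still in place. The strip() call is there so a trailing newline from reading the file doesn't end up in the last sequence.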