This code comes from a previous post. I'm trying to adapt it to our data, but it doesn't work. Here is an example of our file:
read:1424:2165 TGACCA/1:2165 TGACCA/2
1..100 +chr1:3033296..3033395 #just this line
1..100 -chr1:3127494..3127395
1..100 +chr1:3740372..3740471
1 concordant read:1483:2172 TGACCA/1:2172 TGACCA/2
1..100 -chr7:94887644..94887545 #and just this line
The code should do the following:
When a "read:" line is followed by several coordinate lines such as "-chr1:no..no", only the first one should be kept.
Unfortunately, I cannot figure out how to get this to work.
import re

infile = 'myfile.txt'
outfile = 'outfile.txt'
pat1 = re.compile(r'read:')
pat2 = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')
with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    for line in in_f.readlines():
        if '\t' not in line.rstrip():
            continue
        a = pat1.search(line)
        if a:
            m = pat2.search(line)
            out_f.write(' '.join(m.groups()) + '\n')
        if not a:
            continue
The output should look like this:
1 3033296 3033395
7 94887644 94887545
Somebody throw me a bone please
Updated From Answer Below
Alright, I'm posting the slightly modified version of Tim McNamara's code that I use. It works well, but the output doesn't handle two-digit numbers after "chr" (only the last digit is kept), and it prints a string after the last number.
with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    lines = [line for line in in_f.readlines()]
    for i, line in enumerate(lines):
        if 'read' in line:
            data = lines[i+1].replace(':', '..').split('..')
            try:
                # Here I tried to remove data[3] to keep "start" out of the output file, but it didn't work
                out_f.write('{} {} {}\n'.format(data[1][-1], data[2], data[3]))
            except IndexError:
                continue
Here is the output obtained with this code:
6 140302505 140302604 start # 'start' is a string in our data after this number
5 46605561 46605462 start # I don't understand why it grabs it, though
5 46605423 46605522 start # I tried to modify the code to avoid this, but it didn't work out
6 29908310 29908409 start
6 29908462 29908363 start
4 12712132 12712231 start
How can I fix these two errors?
Make sure you iterate over the lines of 'in_f', for example with readlines() (looping over the file object directly works as well):
with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    for line in in_f.readlines():
        ...
However, that whole section of code can probably be tidied up quite a bit.
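with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    lines = [line for line in in_f.readlines()]
    for i, line in enumerate(lines):
        if 'read' in line:
            try:
                # Kept inside the try so a 'read' line at the very end of the file is skipped
                data = lines[i+1].replace(':', '..').split('..')
                # Everything after 'chr', so two-digit chromosomes are kept whole
                a = data[1].split('chr')[-1]
                b = data[2]
                # First whitespace-separated token only, which drops a trailing 'start'
                c = data[3].split()[0]
                out_f.write('{} {} {}\n'.format(a, b, c))
            except IndexError:
                pass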
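
If you would rather stay with regular expressions, as in the question, here is a minimal sketch of the same idea. It assumes that every header line contains "read:" and that coordinate lines look like the ones in the sample; the names HEADER, COORD and first_hits are made up for this illustration. It works on a list of lines so it can be tried without touching the real file.

```python
import re

# Assumed formats, taken from the sample in the question:
# header lines contain "read:", coordinate lines contain e.g. "+chr1:3033296..3033395".
HEADER = re.compile(r'read:')
COORD = re.compile(r'[+-]chr([^:\s]+):(\d+)\.\.(\d+)')


def first_hits(lines):
    """Yield (chromosome, start, end) for the first coordinate line after each header."""
    want_hit = False
    for line in lines:
        if HEADER.search(line):
            want_hit = True      # the next coordinate line belongs to this read
            continue
        match = COORD.search(line)
        if want_hit and match:
            yield match.groups()
            want_hit = False     # ignore further hits until the next header


sample = """\
read:1424:2165 TGACCA/1:2165 TGACCA/2
1..100 +chr1:3033296..3033395
1..100 -chr1:3127494..3127395
1..100 +chr1:3740372..3740471
1 concordant read:1483:2172 TGACCA/1:2172 TGACCA/2
1..100 -chr7:94887644..94887545
"""

result = [' '.join(hit) for hit in first_hits(sample.splitlines())]
print('\n'.join(result))
```

Because the chromosome is captured as everything between "chr" and the colon, two-digit names are kept whole, and because the end coordinate is matched with \d+, anything after the number (such as "start") is left out. To use it on the real file, pass the open file object in place of sample.splitlines() and write each joined line to out_f.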