python regex bioinformatics fasta complement

Matching parts of lines in a file (python)

I currently have a list of genes in a file. Each line has a chromosome with it's information. Such an entry appears as:

NM_198212 chr7 + 115926679 115935830 115927071 11593344 2 115926679,'115933260', 115927221,'115935830',

The sequence for the chromosome starts at base 115926679 and continues up to(but not including) base 115935830

If we want the spliced sequence, we use the exons.The first extends from 115926679 to 155927221, and the second goes from '115933260' to '115935830'

However, I have run across a problem when on a complementary sequence such as:

NM_001005286 chr1 - 245941755 245942680 245941755 245942680 1 245941755, '245942680'

Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement to the strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense stand, I need to try to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence and then reverse-complement it.

That said, I have only been programming for about half a year and and not sure how to starts going about doing this.

I have written a regular expression:

'(NM_\d+)\s+(chr\d+)([(\+)|(-)])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+),(\d+),s+(\d+),(\d+),'

but am now unsure as to how to start this function... If anyone can help me get started at all on this, perhaps making me see how to do this, I would very much appreciate it.

OK: suppose this is chromosome 25:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG

(there are 10 of each character).

Now: if I am looking for an unspliced gene on: chr25 + 10 20

Then the gene starts on position 10 (starting from 0), and goes up to but not including position 20. So its:

CCCCCCCCCC

This is easy. It matches python string slicing really well.

Its more confusing if I give you:

chr25 - 10 20

What you have is the positive strand. But this gene is on the negative (complementary) strand. Remember what the chromosome looks like as a double-strand:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG
TTTTTTTTTTGGGGGGGGGGAAAAAAAAAACCCCCCCCCC

We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right. Number the top strand from the left, and the bottom strand from the right. So what I want here is AAAAAAAAAA.

The catch is that I'm only giving you the top strand. I'm not giving you the bottom strand. (You could generate yourself from the top strand — but given how large it is, I advise against that.)

So you need to convert coordinates. On the bottom strand, base 0 (the right-most C) is opposed to base 39 on the top strand. Base 1 is against base 38. Base 2 is against case 37. (Important point: notice what happens when you add these two numbers up — every time.) So base 10 is against base 29, and base 19 is against base 20.

So: if I want to find base 10-20 on the bottom strand, I can look at base 20-29 on the top (and then reverse-complement it).

I need to figure out how to translate to coordinates on the bottom strand to the equivalent coordinates on the bottom strand. Yes: it is very confusing

I have tried weronika's original answer:

fields = line.split(' \t')
geneID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
    start,end = -(start + 1), -(end + 1) # this part was changed from original answer.

which is on the right track, but it's not enough. This would take the 10 and 20, and turn it into a 20 and 10.

And I know I can reverse complement the string by doing this:

r = s[::-1]
bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
l = list(r)
o = [bc[base] for base in l]
     return ''.join(o)

Edited! Does this look correct?!

fp2 = open('chr22.fa', 'r')
fp = open('chr22.fa', 'r')
for line in fp2:
    newstring = ''
    z = line.strip()
    newstring += z
for line in fp:
    fields = line.split('\t')
    gene_ID, chr, strand = fields[:2]
    start = int(fields[3])
    end = int(fields[4])
    bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'a': 't', 't': 'a', 'c':'g', 'g':'c', 'N':'N', 'n':'n'}
    l = list(newstring)        
    if strand == '+':
        geneseq = ''.join([bc[base] for base in l[start:end]]) 
    if strand == '-':
        newstart, newend = -(start + 1), -(end + 1)
        genseq = ''.join([bc[base] for base in l[newstart:newend]])

Solution

If you slice a string with a negative number, Python counts backward from the end.

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
s = "AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG"
"".join(complement[c] for c in s[-20:-10])

EDIT:

Your edit looks about right to me, yes. I am very bad at checking for fencepost errors, but you're better placed to see those than I am anyway!

I've rewritten your code to be more Pythonic, but not changed anything substantial.

bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N':'N'}

f = open('chr22.fa')
l = ''.join(line.strip() for line in f)
f.seek(0)

for line in f:
    fields = line.split('\t')
    gene_ID, chr, strand = fields[:2]

    start = int(fields[3])
    end = int(fields[4])

    if strand == '-':
        start, end = -(start + 1), -(end + 1)

    geneseq = ''.join(bc[base.upper()] for base in l[start:end])