python regex samtools

regex to match words of length specified within string


I am trying to parse the text output from samtools mpileup. I start with a string

s = '.$......+2AG.+2AG.+2AGGG'

Whenever I have a + followed by an integer n, I would like to select the n characters following that integer and replace the whole thing (the +, the integer, and those n characters) with *. So for this test case I would have

'.$......+2AG.+2AG.+2AGGG' ---> '.$......*.*.*GG' 

I tried the regex \+[0-9]+[ACGTNacgtn]+ but the character class is greedy, so the output is .$......*.*.* and the trailing G's are lost as well. How do I select n characters when n is not known ahead of time but is specified in the string itself?


Solution

  • The repl argument of re.sub can be a string or a function.

    When it is a function, it is called with each match object and its return value is used as the replacement, so the replacement can depend on the matched text:

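    import re

    def removechars(m):
        x = m.group()  # whole match, e.g. '+2AGGG'
        n = re.match(r'\+(\d+)', x).group(1)  # digit part, as a string
        # skip '+', the digits, and the int(n) counted characters; keep the rest
        return '*' + x[1 + len(n) + int(n):]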
    

    This solves your problem:

    >>> re.sub(r'\+[0-9]+[ACGTNacgtn]+', removechars, s)
    '.$......*.*.*GG'
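
A variant of the same idea, as a rough sketch: if the count and the letters are captured as separate groups, the callback can read them straight from the match object instead of running a second re.match. The names INSERTION and collapse_insertions are made up for this example, and it only handles the + case from the question.

```python
import re

# Hypothetical names for illustration.
# Group 1 = the count digits, group 2 = the greedy run of base letters.
INSERTION = re.compile(r'\+(\d+)([ACGTNacgtn]+)')

def collapse_insertions(pileup):
    """Replace each '+n' and the n letters after it with '*'."""
    def repl(m):
        n = int(m.group(1))
        bases = m.group(2)
        # Only the first n letters belong to the '+n' run; keep the rest.
        return '*' + bases[n:]
    return INSERTION.sub(repl, pileup)

s = '.$......+2AG.+2AG.+2AGGG'
print(collapse_insertions(s))  # .$......*.*.*GG
```

The pattern still matches greedily, as in the question; the callback simply hands back whatever was matched beyond the first n letters.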