Regular Expression - how to capture a number of characters specified in the string

I'm trying to use regular expressions to extract a number, together with as many of the following characters as that number specifies, from a string. This is for analyzing pileup summary output from samtools mpileup (see here). I'm doing this in Python.

As an example, let's say I have the following string:

.....+3AAAT.....

I am trying to extract the +3AAA from the string, leaving us with:

.....T.....

Note that the T remains, because I only wanted to extract 3 characters (because the string indicated that 3 should be extracted).

I could do the following:

re.sub(r"\+[0-9]+[ACGTNacgtn]+", "", ".....+3AAAT.....")

But this would cut out the T as well, leaving us with:

..........

Is there a way to use information from the string itself to adjust the pattern in a regular expression? I could work around this without regular expressions, but if regular expressions can do it, I'd rather do it that way.


  • You can pass a lambda to re.sub():

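    import re
    def replace(string):
      replaced = re.sub(
        r'\+([0-9]+)([ACGTNacgtn]+)',
        # group(1) = '3', group(2) = 'AAAT'
        # keep only the bases beyond the first int(group(1)) characters
        lambda match: match.group(2)[int(match.group(1)):],
        string)
      return replaced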

    Try it:

    string = '.....+3AAAT.....'
    print(replace(string))  # '.....T.....'
    string = '.....+10AAACCCGGGGTN.....'
    print(replace(string))  # '.....TN.....'
    string = '.....+0AN.....'
    print(replace(string))  # '.....AN.....'
    string = '.....+5CAGN.....'
    print(replace(string))  # '..........'
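
If the removed insertions are needed as well as the cleaned string, the same callback idea can record them as it goes. The following is only a minimal sketch under that assumption; the names `INSERTION`, `strip_insertions` and `keep_tail` are hypothetical and do not come from the answer above, and the pattern is the one from the question with two capture groups added.

```python
import re

# Pattern from the question, with capture groups added (hypothetical name):
# group(1) is the declared length, group(2) the run of bases after it.
INSERTION = re.compile(r"\+([0-9]+)([ACGTNacgtn]+)")

def strip_insertions(bases):
    """Return (cleaned_string, list_of_removed_insertions)."""
    removed = []

    def keep_tail(match):
        length = int(match.group(1))
        run = match.group(2)
        removed.append(run[:length])  # the bases the count covers
        return run[length:]           # anything past the count stays in place

    cleaned = INSERTION.sub(keep_tail, bases)
    return cleaned, removed

cleaned, removed = strip_insertions(".....+3AAAT.....")
print(cleaned)   # '.....T.....'
print(removed)   # ['AAA']
```

Because re.sub() calls the function once per match, the list fills in left-to-right order, so each recorded insertion lines up with its position in the original string.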