
Find the shortest match between two occurrences of a pattern


I'm using the pattern \\n(((?!\.g).)*?\.vcf\.gz)\\r to match the desired substring in a string. In the following example string, the match is in the middle of the string, enclosed between two \r\n sequences.

"\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."

Using the pattern above yields the desired string 1115492_23181_0_0.vcf.gz as well as 0.
My question is what would be the proper regular expression to get only the desired string.
Thanks.


Solution

  • Your matches are whole lines, so match each whole line that does not contain .g anywhere before the .vcf.gz extension:

    import re
    text = "\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
    m = re.search(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", text, re.M)
    if m:
        print(m.group(1)) # => 1115492_23181_0_0.vcf.gz
    


    Details:

    • ^ - start of a line
    • ((?:(?!\.g).)*\.vcf\.gz) - Group 1:
      • (?:(?!\.g).)* - any char other than a newline (\n), zero or more occurrences but as many as possible, where no matched char starts a .g char sequence
      • \.vcf\.gz - a .vcf.gz string
    • \r? - an optional CR (carriage return)
    • $ - end of a line.
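
As for the extra 0 in the question: the inner parentheses around the tempered dot, ((?!\.g).), form a second capturing group, and a repeated group keeps only the last char it matched. Below is a minimal sketch comparing the three variants on the sample string; it assumes the results were collected with re.findall (the question does not say which function was used), and the names original, fixed and anchored are only illustrative.

```python
import re

text = (
    "\r\n1115492_23181_0_0.g.vcf.gz.tbi"
    "\r\n1115492_23181_0_0.vcf.gz"
    "\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
)

# Pattern from the question: two capturing groups, so findall returns
# (group 1, group 2) tuples; group 2 holds the last char it consumed.
original = re.findall(r"\n(((?!\.g).)*?\.vcf\.gz)\r", text)

# Same pattern with the inner group made non-capturing: one group only.
fixed = re.findall(r"\n((?:(?!\.g).)*?\.vcf\.gz)\r", text)

# Line-anchored pattern from the solution, collecting every matching line.
anchored = [
    m.group(1)
    for m in re.finditer(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", text, re.M)
]

print(original)  # [('1115492_23181_0_0.vcf.gz', '0')]
print(fixed)     # ['1115492_23181_0_0.vcf.gz']
print(anchored)  # ['1115492_23181_0_0.vcf.gz']
```

So turning the inner group into (?:...) is enough to drop the stray 0, while the ^...$ version with re.M also avoids hard-coding the surrounding \n and \r.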