I'm using the pattern \\n(((?!\.g).)*?\.vcf\.gz)\\r to match the desired substring in a string. In the following example string, the match is in the middle of the string, surrounded by two \r\n sequences.
"\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
Using the pattern above yields the desired string 1115492_23181_0_0.vcf.gz as well as 0.
My question is what would be the proper regular expression to get only the desired string.
Thanks.
The strings you want are whole lines, so match entire lines that do not contain .g anywhere before the .vcf.gz extension:
import re
text = "\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
m = re.search(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", text, re.M)
if m:
    print(m.group(1))  # => 1115492_23181_0_0.vcf.gz
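If the listing can contain several plain .vcf.gz entries, the same pattern can be passed to re.findall to collect all of them. This is only a sketch: the file names a.* and b.* and the variable names listing, pattern and names are made up for illustration and do not come from the question.

```python
import re

# Hypothetical directory listing with CRLF line endings (illustration only).
listing = "a.g.vcf.gz\r\na.vcf.gz\r\na.vcf.gz.tbi\r\nb.vcf.gz\r\n"

# Same pattern as above: with a single capturing group,
# findall returns a flat list of the group 1 strings.
pattern = re.compile(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", re.M)
names = pattern.findall(listing)

print(names)  # ['a.vcf.gz', 'b.vcf.gz']
```

The .g.vcf.gz and .tbi entries are skipped because the line must end right after .vcf.gz and must not contain .g before it.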
Details:
^ - start of a line
((?:(?!\.g).)*\.vcf\.gz) - Group 1:
    (?:(?!\.g).)* - any char other than a newline, zero or more but as many as possible occurrences, that does not start a .g char sequence
    \.vcf\.gz - a .vcf.gz string
\r? - an optional CR (carriage return)
$ - end of a line.
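As for the stray 0 in the question: the question does not say which function was called, but assuming it was re.findall, the extra value most likely comes from the inner capturing group. When a pattern has more than one group, findall returns a tuple per match, and a repeated group keeps only its last iteration, here the final 0 before .vcf.gz. A small sketch comparing the two forms (the variable names capturing and non_capturing are illustrative):

```python
import re

text = "\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."

# Inner group is capturing: two groups, so findall yields (group 1, group 2) tuples.
capturing = re.findall(r"\n(((?!\.g).)*?\.vcf\.gz)\r", text)

# Inner group made non-capturing with (?:...): only group 1 is returned.
non_capturing = re.findall(r"\n((?:(?!\.g).)*?\.vcf\.gz)\r", text)

print(capturing)      # [('1115492_23181_0_0.vcf.gz', '0')]
print(non_capturing)  # ['1115492_23181_0_0.vcf.gz']
```

So switching the inner group to (?:...) is already enough to get rid of the 0; anchoring with ^ and $ under re.M additionally removes the dependence on the surrounding \n and \r characters.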