[Follow-up to my old question, with a better description and links]
I am trying to match any character (including newlines, tabs, spaces, etc.) between two symbols, including the symbols themselves.
For example:
foobar89\n\nfoo\tbar; '''blah blah blah'8&^"'''
need to match
'''blah blah blah'8&^"'''
and
fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^"'''
need to match
'''blah\n blah\n\t\t blah\n'8&^"'''
My Python code (taken and adapted from here), against which I am testing the regexes:
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('BOTH', r'([\'"]{3}).*?\2'),  # for both triple-single quotes and triple-double quotes
        ('SINGLE', r"('''.*?''')"),    # triple-single quotes
        ('DOUBLE', r'(""".*?""")'),    # triple-double quotes
        # regexes which match OK
        ('COM', r'#.*'),
        ('NEWLINE', r'\n'),            # Line endings
        ('SKIP', r'[ \t]+'),           # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    test_regexes = ['COM', 'BOTH', 'SINGLE', 'DOUBLE']
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            pass
        else:
            if kind in test_regexes:
                print(kind, value)
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

f = r'C:\path_to_python_file_with_examples_to_match'
with open(f) as sfile:
    content = sfile.read()
for t in tokenize(content):
    pass  #print(t)
where file_with_examples_to_match is:
import csv, urllib

class Q():
    """
    This class holds lhghdhdf hgh dhghd hdfh ghd fh.
    """
    def __init__(self, l, lo, d, m):
        self.l= l
        self.lo= longitude
        self.depth = d
        self.m= m
    def __str__(self):
        # sdasda fad fhs ghf dfh
        d= self.d
        if d== -1:
            d= 'unknown'
        m= self.m
        if m== -1:
            d= 'unknown'

        return (m, d, self.l, self.lo)

foobar89foobar; '''blah qsdkfjqsv,;sv
vqùlvnqùv
dqvnq
vq
v
blah blah'8&^"'''
fjfdaslfdj; '''blah blah
blah
'8&^"'''
Following this answer, I tried r"('''.*?''')|"r'(""".*?""")' to match both triple single quotes and triple double quotes, without success. The same happens with r'([\'"]{3}).*?\2'.
I have set up an online regex tester where some of these regexes match as intended, but they fail when used in the code above.
I would like to understand Python's regular expressions better, so I would appreciate both a solution (ideally a regex that does the desired matching in my code) and a brief explanation, so that I can see where I went wrong.
You're probably missing the flag that makes the dot (.) match newlines as well:
re.finditer(tok_regex, code, flags = re.DOTALL)
In this case the output is
('BOTH', '"""\n This class holds lhghdhdf hgh dhghd hdfh ghd fh.\n """')
('COM', '# sdasda fad fhs ghf dfh\n d= self.d\n if d== -1:\n d= \'unknown\'\n m= self.m\n if m== -1:\n d= \'unknown\'\n\n return (m, d, self.l, self.lo)\n\nfoobar89foobar; \'\'\'blah qsdkfjqsv,;sv\n vq\xc3\xb9lvnq\xc3\xb9v \n dqvnq\n vq\n v\n\nblah blah\'8&^"\'\'\'\nfjfdaslfdj; \'\'\'blah blah\n blah\n \'8&^"\'\'\'')
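To see the effect in isolation, here is a minimal sketch on a made-up sample string (the string and the variable names are only for illustration and are not taken from the file above). It uses \1 instead of \2 because the pattern is not wrapped in a named group here.

```python
import re

# Hypothetical sample: one comment and one multi-line triple-quoted string.
sample = "x = 1  # note\ns = '''a\nb'''\n"

triple = r'([\'"]{3}).*?\1'   # \1 refers to the opening quotes
comment = r'#.*'

# Without flags, '.' stops at a newline, so the multi-line literal is not found.
no_flags = re.search(triple, sample)
print(no_flags)

# With re.DOTALL, '.' also matches '\n', so the whole literal is matched.
with_dotall = re.search(triple, sample, flags=re.DOTALL).group(0)
print(repr(with_dotall))

# Side effect: the greedy comment pattern now runs to the end of the text.
greedy_comment = re.search(comment, sample, flags=re.DOTALL).group(0)
print(repr(greedy_comment))
```

The last result is the same over-matching that shows up for COM in the output above.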
COM is now matching far too much, since with re.DOTALL the dot (.) consumes newlines too and the greedy .* runs to the end of the file. If we modify this pattern a bit to make it less greedy
('COM', r'#.*?$')
we can now use re.MULTILINE, so that $ matches at the end of each line and the comment stops there:
re.finditer(tok_regex, code, flags = re.DOTALL | re.MULTILINE)
The output now is
('BOTH', '"""\n This class holds lhghdhdf hgh dhghd hdfh ghd fh.\n """')
('COM', '# sdasda fad fhs ghf dfh')
('BOTH', '\'\'\'blah qsdkfjqsv,;sv\n vq\xc3\xb9lvnq\xc3\xb9v \n dqvnq\n vq\n v\n\nblah blah\'8&^"\'\'\'')
('BOTH', '\'\'\'blah blah\n blah\n \'8&^"\'\'\'')
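The combination of both flags can be tried on a cut-down version of the tokenizer. This is only a sketch: the sample text is invented, and the OTHER group is a stand-in I introduced for the NEWLINE, SKIP and MISMATCH entries of the original specification.

```python
import re

# Hypothetical sample text standing in for the file being tokenized.
sample = (
    'class Q():\n'
    '    """\n'
    '    Docstring.\n'
    '    """\n'
    '    # a comment\n'
    '    d = 1\n'
    "x; '''blah\n"
    "blah'8&^\"'''\n"
)

token_specification = [
    ('BOTH', r'([\'"]{3}).*?\2'),  # \2: group 1 is the named wrapper (?P<BOTH>...)
    ('COM', r'#.*?$'),             # lazy, stops at the first end of line
    ('OTHER', r'.'),               # any other character (newlines too, under DOTALL)
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

# DOTALL lets '.' cross newlines; MULTILINE lets '$' match at every line end.
flags = re.DOTALL | re.MULTILINE
found = [
    (mo.lastgroup, mo.group(mo.lastgroup))
    for mo in re.finditer(tok_regex, sample, flags=flags)
    if mo.lastgroup != 'OTHER'
]
for kind, value in found:
    print(kind, repr(value))
```

The docstring and the triple-single-quoted text each come out as one BOTH token, and the comment ends at its own line.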
If you definitely don't want to use flags, you can use a kind of 'hack' to do without the dot (.), since this metacharacter matches almost everything except newlines. You can write a negated character class that matches everything but one character which is highly unlikely to be present in the files you parse, for example the character with ASCII code 0. The regex for that character is \x00, and the corresponding class [^\x00] matches every character (even newlines) except the one with ASCII code 0 (that's why it's a hack: without flags you still can't match every character). Keep the initial regex for COM, and for BOTH use
('BOTH', r'([\'"]{3})[^\x00]*?\2')
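A quick way to check the flag-free variant is the sketch below, again on an invented sample string and with \1 in place of \2 because there is no surrounding named group.

```python
import re

# Hypothetical sample: one comment and one multi-line triple-quoted string.
sample = "x = 1  # note\ns = '''a\nb'''\n"

# [^\x00] matches any character except NUL, so it crosses newlines without re.DOTALL.
triple = r'([\'"]{3})[^\x00]*?\1'
# Unchanged comment pattern: without DOTALL, '.' still stops at the end of the line.
comment = r'#.*'

triple_match = re.search(triple, sample).group(0)
comment_match = re.search(comment, sample).group(0)
print(repr(triple_match))
print(repr(comment_match))
```

Because no flags are passed, the original COM pattern keeps working as it did before, and only BOTH needs the changed character class.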
Online tools that explain a regex step by step, such as regex101, are highly recommended when working with regular expressions.
For more complex cases of quote matching you'll need to write a parser. See, for example, "Can the csv format be defined by a regex?" and "When you should NOT use Regular Expressions?".