Regular expression to match anything between combination of quotes


[Follow-up to my old question, with a better description and links]

I am trying to match any character (including newlines, tabs, and other whitespace) between two delimiters, including the delimiters themselves.

For example:

foobar89\n\nfoo\tbar; '''blah blah blah'8&^"'''

needs to match

'''blah blah blah'8&^"'''

and

fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^"'''

needs to match

'''blah\n blah\n\t\t blah\n'8&^"'''

My Python code (taken and adapted from here), on which I am testing the regexes:

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('BOTH',      r'([\'"]{3}).*?\2'), # for both triple-single quotes and triple-double quotes
        ('SINGLE',    r"('''.*?''')"),     # triple-single quotes 
        ('DOUBLE',    r'(""".*?""")'),     # triple-double quotes 
        # regexes which match OK
        ('COM',       r'#.*'),
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH',r'.'),            # Any other character
    ]

    test_regexes = ['COM', 'BOTH', 'SINGLE', 'DOUBLE']

    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            pass
        else:
            if kind in test_regexes:
                print(kind, value)
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

f = r'C:\path_to_python_file_with_examples_to_match'

with open(f) as sfile:
    content = sfile.read()

for t in tokenize(content):
    pass #print(t)

where the file_with_examples_to_match is:

import csv, urllib

class Q():
    """
    This class holds lhghdhdf hgh dhghd hdfh ghd fh.
    """

    def __init__(self, l, lo, d, m):
        self.l= l
        self.lo= longitude
        self.depth = d
        self.m= m

    def __str__(self):
        # sdasda fad fhs ghf dfh
        d= self.d
        if d== -1:
            d= 'unknown'
        m= self.m
        if m== -1:
            d= 'unknown'

        return (m, d, self.l, self.lo)

foobar89foobar; '''blah qsdkfjqsv,;sv
                   vqùlvnqùv 
                   dqvnq
                   vq
                   v

blah blah'8&^"'''
fjfdaslfdj; '''blah blah
     blah
    '8&^"'''

Following this answer, I tried r"('''.*?''')|"r'(""".*?""")' to match both triple single-quotes and triple double-quotes, without success. The same happens with r'([\'"]{3}).*?\2'.

I have set up an online regex tester where some of the regexes match as they are supposed to, but they fail when used in the code above.

I am interested in understanding Python's regular expressions better, so I would appreciate both a solution (perhaps a valid regex that does the desired matching in my code) and a brief explanation so I can see where I went wrong.


Solution

  • You're probably missing the flag that makes . match newlines as well:

    re.finditer(tok_regex, code, flags = re.DOTALL)
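The effect of the flag can be checked in isolation. The following is a minimal sketch, not part of the tokenizer: the names text and pattern are made up for illustration, and the backreference is \1 rather than \2 because here the quote group is not wrapped in an outer named group.

```python
import re

# Hypothetical sample: a triple-quoted string that spans two lines.
text = 'x = """first line\nsecond line"""'

# Same idea as the BOTH pattern; \1 refers to the opening triple quote.
pattern = r'([\'"]{3}).*?\1'

# Without DOTALL, '.' cannot cross the newline, so the closing quotes are never reached.
no_flag = re.search(pattern, text)
print(no_flag)

# With DOTALL, '.' also matches '\n'; the lazy '.*?' stops at the first closing delimiter.
with_flag = re.search(pattern, text, flags=re.DOTALL)
print(repr(with_flag.group(0)))
```

The same flag, applied to the tokenizer's re.finditer call as written above, changes what every . in tok_regex can match, not only the one in BOTH.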
    

    In this case the output is

    ('BOTH', '"""\n    This class holds lhghdhdf hgh dhghd hdfh ghd fh.\n    """')
    ('COM', '# sdasda fad fhs ghf dfh\n        d= self.d\n        if d== -1:\n            d= \'unknown\'\n        m= self.m\n        if m== -1:\n            d= \'unknown\'\n\n        return (m, d, self.l, self.lo)\n\nfoobar89foobar; \'\'\'blah qsdkfjqsv,;sv\n                   vq\xc3\xb9lvnq\xc3\xb9v \n                   dqvnq\n                   vq\n                   v\n\nblah blah\'8&^"\'\'\'\nfjfdaslfdj; \'\'\'blah blah\n     blah\n    \'8&^"\'\'\'')
    

    COM now matches far too much, because . now consumes everything up to the end of the file. If we make this pattern non-greedy and anchor it with $

    ('COM',       r'#.*?$')
    

    we can add re.MULTILINE, so that $ matches at the end of every line and the comment stops there

    re.finditer(tok_regex, code, flags = re.DOTALL | re.MULTILINE)
    

    The output now is

    ('BOTH', '"""\n    This class holds lhghdhdf hgh dhghd hdfh ghd fh.\n    """')
    ('COM', '# sdasda fad fhs ghf dfh')
    ('BOTH', '\'\'\'blah qsdkfjqsv,;sv\n                   vq\xc3\xb9lvnq\xc3\xb9v \n                   dqvnq\n                   vq\n                   v\n\nblah blah\'8&^"\'\'\'')
    ('BOTH', '\'\'\'blah blah\n     blah\n    \'8&^"\'\'\'')
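The interaction of the two flags can be reproduced on a smaller input. This is a sketch under assumptions: sample and token_regex are illustrative names, and only the BOTH and COM alternatives from the question are kept.

```python
import re

# Hypothetical input: a comment, a multi-line triple-quoted string, another comment.
sample = 'a = 1  # a comment\nb = """doc\nstring"""  # trailing\n'

# BOTH is group 1, its inner quote group is group 2, hence the \2 backreference.
token_regex = r'(?P<BOTH>([\'"]{3}).*?\2)|(?P<COM>#.*?$)'

# DOTALL lets '.' cross newlines; MULTILINE lets '$' match at each line end,
# so the lazy '#.*?$' stops at the end of the comment's own line.
flags = re.DOTALL | re.MULTILINE

tokens = [(m.lastgroup, m.group(m.lastgroup))
          for m in re.finditer(token_regex, sample, flags)]
for kind, value in tokens:
    print(kind, repr(value))
```

Note that a # inside a triple-quoted string is not treated as a comment here only because BOTH is listed first and consumes the whole string before COM is tried at that position.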
    

    If you definitely don't want to use flags, you can use a kind of 'hack' that avoids ., since without flags this metacharacter matches everything except newlines. Instead, use a negated character class that excludes a single character which is highly unlikely to appear in the files you parse, for example the character with ASCII code 0. The regex for that character is \x00, and the corresponding class [^\x00] matches every character (including newlines) except the one with ASCII code 0. That is why it is a hack: without flags you still cannot match every possible character this way. Keep the initial regex for COM, and for BOTH use

    ('BOTH',      r'([\'"]{3})[^\x00]*?\2')
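A quick way to check the flag-free variant is to run it on one of the question's examples. This is a sketch with illustrative names (text, hack_pattern, any_char_pattern); the [\s\S] form is a commonly used alternative that is not part of the answer above, and \1 is used because the quote group is the first group in these standalone patterns.

```python
import re

# Second example from the question: the quoted part starts at index 12.
text = "fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^\"'''"

# The hack from the answer: everything except NUL, so newlines are included.
hack_pattern = r'([\'"]{3})[^\x00]*?\1'

# Assumed alternative: whitespace-or-non-whitespace, which covers every character.
any_char_pattern = r'([\'"]{3})[\s\S]*?\1'

hack_match = re.search(hack_pattern, text)        # no flags needed
any_char_match = re.search(any_char_pattern, text)

print(repr(hack_match.group(0)))
print(hack_match.group(0) == any_char_match.group(0))
```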
    

    Online tools that explain regular expressions, such as regex101, are highly recommended when working with regexes.

    For more complex cases of quote matching you'll need to write a parser. See, for example, the questions "Can the csv format be defined by a regex?" and "When you should NOT use Regular Expressions?".