Search code examples
pythonperformancepyparsing

Constructing large character ranges with pyparsing.srange is extremely slow


I have some of these constructions in my code:

from pyparsing import Char, srange

_non_ascii = Char(srange('[\x80-\U0010FFFF]'))

The generation of the ranges is extremely slow, taking 6-8 seconds (huh?) even with Python 3.12 on a relatively decent machine.

Why is this happening and what should I replace those with?


Solution

  • This is what the srange() function looks like (taken from GitHub):

    _charRange = Group(_singleChar + Suppress("-") + _singleChar)
    _reBracketExpr = (
        Literal("[")
        + Opt("^").set_results_name("negate")
        + Group(OneOrMore(_charRange | _singleChar)).set_results_name("body")
        + Literal("]")
    )
    
    def srange(s: str) -> str:
        _expanded = (
            lambda p: p
            if not isinstance(p, ParseResults)
            else "".join(chr(c) for c in range(ord(p[0]), ord(p[1]) + 1))
        )
        try:
            return "".join(_expanded(part) for part in _reBracketExpr.parse_string(s).body)
        except Exception as e:
            return ""
    

    As you (or I) can see, pyparsing is parsing the argument using _reBracketExpr, which is itself a pyparsing expression, then pass the result of that to _expanded. _expanded will create a new string if p is a ParseResult (range) or return p unchanged otherwise (single character). In the end, all those strings are joined, creating an even bigger string.

    In your (or my) case, each Char(srange()) creates a 0x10FFFF - 0x80 + 1 = ~1.1m character string. Fortunately, since there is only one element, that string is reused instead of joined to create the eventual result. But that's not it; not yet. Word, which gets passed the string as Char's superclass, is also doing something on its own:

    class Word(Token):
    
        def __init__(self, init_chars: str = "", ..., *, initChars: str = "", ...):
            initChars = initChars or init_chars
            ...
            # One full iteration
            initChars_set = set(initChars)
            ...
            self.initChars = initChars_set
            # O(n log n) sorting, then another join
            self.initCharsOrig = "".join(sorted(initChars_set))
            
            ...
            
            # __or__() creates a new set, which requires another full iteration.
            if " " not in (self.initChars | self.bodyChars):  
                if len(self.initChars) == 1:
                    ...
                else:
                    # New string, again.
                    # This function is slow too, explained below
                    re_leading_fragment = f"[{_collapse_string_to_ranges(self.initChars)}]"
    
                if self.bodyChars == self.initChars:
                    ...
                else:
                    ...
                    # Yet another new string
                    self.reString = f"{re_leading_fragment}{re_body_fragment}{repeat}"
                ...
                self.re = re.compile(self.reString)
                ...
    
    def _collapse_string_to_ranges(
        s: Union[str, Iterable[str]], re_escape: bool = True
    ) -> str:
        ...
    
        ret = []
    
        # Haven't we been here before?
        s = "".join(sorted(set(s)))
    
        if len(s) > 3:
            # Yet another full iteration
            for _, chars in itertools.groupby(s, key=is_consecutive):
                ...  # Itertools dark magic to append to ret.
                # Mind you, list.append() is amortized O(1).
        else:
            ...
    
        return "".join(ret)  # A third .join()
    

    I removed unreachable sections not applicable to the specific problem at hand (no as_keyword and such) but some of them also create new strings.

    All in all, this is a lengthy and time-consuming process just to create a native re.Pattern instance (anyone counted how many times it iterated the original 1.1m character string?). That said, I switched to Regex which directly makes use of re:

    _any_char = Regex(r'[\x80-\U0010FFFF]')  # With and without r are both fine.
    

    No giant strings created, and is more or less the same what a normal Char was doing.