I'm working on parsing XML, and we discovered that the XML parser spent a lot of time needlessly checking for UTF-8 validity. For example, let's say I'm parsing something akin to:
<root><ß❤></ß❤></root>
In our flamegraphs, we'd spend a lot of time checking whether root or ß❤ was valid UTF-8.
One way to avoid this check is to have a precondition that the XML input is already a valid Rust &str. Since the input is then known to be valid UTF-8 and the delimiters are ASCII, in theory slicing between any two ASCII delimiters should yield a valid &str, which we wouldn't need to re-check. Is this a safe assumption? Or even better, is there a crate that does something similar (e.g. for CSV)?
I imagine that most XML parsers check for valid UTF-8 encoding at the level of the input stream as a whole, and then do further checks at a higher level that "root" and "ß❤" are valid XML names. You're certainly right to observe that these checks can be costly and that there are opportunities for optimization; one of these opportunities is to take advantage of the fact that UTF-8 encoding principles ensure that the octet 0x3C never occurs in a UTF-8 stream except as a representation of the character "<" (every byte of a multi-byte sequence has its high bit set, so an ASCII byte can never appear inside one).
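As a sketch of how a parser might exploit that property (the function name names_between_delimiters is purely illustrative, and this assumes the whole input was validated as UTF-8 exactly once up front): a raw byte scan for 0x3C and 0x3E can only ever land on real delimiters, so the bytes between them can be handed back as &str without further validation:

```rust
// Sketch only: scanning already-validated UTF-8 bytes for ASCII delimiters.
fn names_between_delimiters(input: &str) -> Vec<&str> {
    let bytes = input.as_bytes();
    let mut names = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        // 0x3C can only be '<': bytes of multi-byte sequences are >= 0x80.
        if bytes[i] == b'<' {
            let start = i + 1;
            // Likewise, 0x3E can only be a real '>'.
            if let Some(off) = bytes[start..].iter().position(|&b| b == b'>') {
                let end = start + off;
                // Safe because `input` is already valid UTF-8 and both slice
                // boundaries sit on ASCII bytes, i.e. char boundaries.
                // (A parser holding only a once-validated &[u8] could use
                // str::from_utf8_unchecked here to skip the check entirely.)
                names.push(&input[start..end]);
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    names
}

fn main() {
    let doc = "<root><ß❤></ß❤></root>";
    assert_eq!(
        names_between_delimiters(doc),
        vec!["root", "ß❤", "/ß❤", "/root"]
    );
}
```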
You can also reduce the cost of parsing by cutting out some of the checks altogether. A parser that doesn't detect all errors isn't conformant with W3C standards, but that doesn't make it useless. However, beware of getting obsessed with performance at the expense of everything else: for 95% of your users, producing good error messages is probably worth at least a 10% performance overhead.