html regex xhtml web character-entities

What are the longest and shortest HTML character entity names?


There are a million cheat sheets all around the tubes that enumerate, with varying degrees of completeness, the character entities defined by the various versions of HTML. I don't want to trust any particular one of them, so I figure I'll toss the question out here and see if anyone posts a more authoritative answer.

So, let's assume that I want to match any and all character references and entities using a regular expression. I'd start with /&(?:#(?:x[0-9a-f]+|[0-9]+)|[a-z]{???,???});/i. But what would go in place of the ???s? I can think of entities that are two characters long, like lt and gt, but are there any one-letter entities in any HTML specification? Likewise, what is the longest entity name? Finally, those three syntaxes (hexadecimal, decimal, and named) are the only ways of expressing literal characters in HTML aside from just typing them directly, are they not?


Solution

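  • The longest name in HTML5 is &CounterClockwiseContourIntegral; (31 characters between the & and the ;), and there are no one-letter names, so the shortest are two characters long, like lt and gt.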

    But note that named entity references don't work the way your pattern assumes. Some named character references are recognized without the trailing semicolon (legacy names such as &amp and &lt), so a regex that requires the ; won't cut the mustard.
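
If you would rather check the numbers than trust a cheat sheet, here is a minimal sketch in Python. It assumes the standard library's html.entities.html5 table is a faithful copy of the HTML5 named character reference list; the variable names are just illustrative. It also shows a second problem with [a-z]: some names contain digits.

```python
import html
from html.entities import html5

# Keys of html5 look like "gt;" and, for legacy names, also plain "gt".
names = {key.rstrip(";") for key in html5}

longest = max(names, key=len)
shortest_len = min(len(name) for name in names)
shortest = sorted(name for name in names if len(name) == shortest_len)

# Names that the table also lists without a trailing semicolon.
legacy = sorted(key for key in html5 if not key.endswith(";"))

# Names are not purely alphabetic, so [a-z] alone would miss these.
with_digits = sorted(name for name in names if any(c.isdigit() for c in name))

print(longest, len(longest))
print(shortest_len, shortest[:5])
print(len(legacy), legacy[:5])
print(with_digits[:5])

# html.unescape accepts the semicolon-less legacy forms as well.
print(html.unescape("&copy 2024 &amp &lt;b&gt;"))
```

For actually decoding text, letting html.unescape (or a real HTML parser) do the work is probably safer than a hand-rolled pattern, since it already knows which names may drop the semicolon.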