regex pcre regular-language

Numbers between 99 and 9999999 regular expression


I am trying to generate a regular expression that will match any numbers within the range of 99 and 9999999. I have trouble understanding how generating number ranges generally works. I managed to find a range generator online that does the job for me, but I want to understand how it actually works.

My attempt to do this range is as follows:

(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])

This is supposed to match 99, any 3 digit number or any 4 digit number, but it does not work as expected. When tested it matches only numbers 99 and 3 digit numbers. Four digit numbers are not matched at all. If I only write the part for 4 digit numbers on its own as

[1-9][0-9][0-9][0-9]

It matches 4 digit numbers, but when I construct it as in the first example it does not work. Can someone clarify how this actually works, and how to write a working regular expression for the range 99 to 9999999?

Link to demo - Here


Solution

  • So you want to know how this works...

    Regexes have no real understanding of the numeric values in your string; they only care how those values are written as characters, which is why looking for numbers in a range seems more awkward than it should be. The only reason your regex engine can understand a range in a character class like [0-9] at all is the characters' positions in the character set (a range like [&-~] is just as valid, and just as understandable to it).
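A small Python sketch of that point, assuming Python's `re` module as a stand-in for the engine (the question is tagged pcre, but character ranges behave the same way here): the range is defined by code points, not by numeric value.

```python
import re

# A character-class range is bounded by code points, not by numeric meaning.
# '&' is U+0026 (38) and '~' is U+007E (126), so [&-~] covers everything between.
pattern = re.compile(r"[&-~]")

print(ord("&"), ord("~"))            # 38 126
print(bool(pattern.fullmatch("7")))  # True: '7' (55) lies inside the range
print(bool(pattern.fullmatch("A")))  # True: 'A' (65) lies inside the range
print(bool(pattern.fullmatch("!")))  # False: '!' (33) lies below '&'
```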

    So, to match a range like 99-9999999, ya gotta spell out what that looks like: literal "99", or three digits without a leading zero, or four digits without a leading zero, and so on.
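To make the "spell it out" idea concrete, here is a minimal Python sketch that builds the verbose alternation one digit length at a time. The helper name `spelled_out_range` is made up for illustration. The check uses `re.fullmatch`, which requires the whole string to match, so the order of the alternates does not matter in this particular test.

```python
import re

def spelled_out_range(max_digits: int = 7) -> str:
    """Build the verbose pattern for 99 up to the largest max_digits-digit number."""
    # Literal 99 first, then one alternate per length from 3 digits upward,
    # each starting with [1-9] so a leading zero is never accepted.
    alternates = ["99"]
    for length in range(3, max_digits + 1):
        alternates.append("[1-9]" + "[0-9]" * (length - 1))
    return "(" + "|".join(alternates) + ")"

pattern = spelled_out_range()
print(pattern)

for text in ("98", "99", "100", "9999999", "10000000", "0123"):
    print(text, bool(re.fullmatch(pattern, text)))
```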

    But this is what your demo did, right? And it didn't work. From your test string "9293", your regex only matched "929". What happened is that the regex engine is eager to return a complete match: it tries the alternates from left to right and returns as soon as one succeeds, even though a better/longer match might have been possible with a later alternate.


    Here's how that match happened. (I'll skip some details like grouping, as they're not super relevant here.)

    Step 1.

    The engine compares the first token in the regex (the first 9 of the literal 99) with the first character in the string (9).

    (99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])

    9293

    Success, they match.

    Step 2.

    The engine then advances both to the next token in the regex (the second 9 of the literal 99) and the next character in the string (2) and compares them.

    (99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])

    9293

    Failure, no match. The engine would stop and return the failure here, but you're using alternation via |, so it knows there's an alternate expression to try.

    Step 3.

    The engine advances to the first token of the next alternate expression in the regex ([1-9]), and rewinds the position in the string to the first character (9).

    (99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])

    9293

    Success, they match.

    Step 4.

    Continuing on: the next token, [0-9], is compared with the second character, 2.

    (99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])

    9293

    Match.

    Step 5.

    And again: the last [0-9] of this alternate is compared with the third character, 9.

    (99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])

    9293

    Success. The complete expression matches. There's no need to try the remaining alternate. The match here returned is:

    929

    As you've probably figured out, if your input string was instead "9923" then step 2 would've matched and the engine there would've stopped and returned "99".
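The walkthrough can be reproduced with a short Python sketch. Python's `re` module is used here as an assumption; like PCRE, it is a backtracking engine that tries alternates from left to right.

```python
import re

# The pattern from the question, alternates ordered shortest to longest.
original = re.compile(r"(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])")

for text in ("9293", "9923"):
    match = original.search(text)
    print(text, "->", match.group())
# 9293 -> 929   (second alternate succeeds; the four-digit one is never tried)
# 9923 -> 99    (first alternate succeeds immediately)
```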

    As you've also probably figured out, if you rearrange your alternate expressions from longest to shortest

    ([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)
    

    the longest would be attempted first, which would match and return your expected "9293".
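A quick check of the reordered pattern, again sketched with Python's `re` module as a stand-in engine: shorter inputs still match because the engine falls through to the later alternates when the longer ones fail.

```python
import re

# Same alternates as before, now ordered longest to shortest.
longest_first = re.compile(r"([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)")

for text in ("9293", "929", "99"):
    print(text, "->", longest_first.search(text).group())
# 9293 -> 9293
# 929 -> 929
# 99 -> 99
```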


    Simplifying

    It's still pretty wordy though, especially as you crank up the number of digits in your range. There are a couple things you can do to simplify it.

    The character class [0-9] can be represented by the shorthand character class \d.

    ([1-9]\d\d\d|[1-9]\d\d|99)
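One caveat if you try this outside PCRE's default mode: in some engines \d is wider than [0-9]. The sketch below assumes Python 3, where \d on a text pattern also matches non-ASCII decimal digits unless the `re.ASCII` flag is given.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit

print(bool(re.fullmatch(r"\d", arabic_three)))            # True: Unicode-aware by default
print(bool(re.fullmatch(r"\d", arabic_three, re.ASCII)))  # False: restricted to 0-9
print(bool(re.fullmatch(r"[0-9]", arabic_three)))         # False: explicit ASCII range
```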
    

    And instead of repeating them use a quantifier in curly brackets like so:

    ([1-9]\d{3}|[1-9]\d{2}|99)
    

    As it happens, quantifiers can also take the form {min,max} (written without a space, since many engines treat a space inside the braces as literal text), so you can combine the two similar alternates:

    ([1-9]\d{2,3}|99)
    

    You might expect this to land you back returning "929" again, the engine being eager and all, but quantifiers are by default greedy so they'll try to pick up as much as they can. This lends itself well to your larger desired range:

    ([1-9]\d{2,6}|99)
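Here is a small Python sketch of the greedy behaviour and of the final range pattern. The lazy variant `{2,3}?` is included only for contrast, and the helper `regex_says_in_range` is a made-up name for illustration.

```python
import re

# Greedy (the default): \d{2,3} takes three digits when it can.
print(re.search(r"([1-9]\d{2,3}|99)", "9293").group())   # 9293
# Lazy variant for contrast: \d{2,3}? stops at the two-digit minimum.
print(re.search(r"([1-9]\d{2,3}?|99)", "9293").group())  # 929

in_range = re.compile(r"([1-9]\d{2,6}|99)")

def regex_says_in_range(n: int) -> bool:
    """True if the decimal form of n matches the pattern in its entirety."""
    return in_range.fullmatch(str(n)) is not None

for n in (98, 99, 100, 9999999, 10000000):
    print(n, regex_says_in_range(n))
# 98 False, 99 True, 100 True, 9999999 True, 10000000 False
```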
    

    Finishing up

    What you do with it from here depends on what you need the regex to do. As it stands the parentheses are superfluous: there's no point in creating a capturing group around the entire regex. However, a decision comes when you've got an input string like:

    You will likely be eaten by 1000 grue.

    If you're trying to pluck out how many grue are about to eat you, you might use

    [1-9]\d{2,6}|99
    

    which will return 1000.
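In Python that extraction might look like the sketch below. The second sentence, with several numbers in it, is invented here purely to show `re.findall` skipping values outside the range.

```python
import re

number = re.compile(r"[1-9]\d{2,6}|99")

sentence = "You will likely be eaten by 1000 grue."
found = number.search(sentence)
print(found.group())  # 1000

# Hypothetical extra input: 5 is below the range, so it is skipped.
several = "99 bottles, 5 cats, 1000 grue"
print(number.findall(several))  # ['99', '1000']
```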

    However that sorta runs back into the original problem with your demo. If it's "12345678 grue", which is out of range, this'll match "1234567" which might not be what you want. You can make sure the number you've matched isn't immediately followed by (or preceded by) another digit by using negative lookarounds.

    (?<!\d)([1-9]\d{2,6}|99)(?!\d)
    

    (?<!\d) means "from this position, the prior character is not a digit" while (?!\d) means "from this position, the next character is not a digit."
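A side-by-side sketch of the bare and the guarded pattern, assuming Python's `re` module (which supports these fixed-width lookarounds). The names `bare`, `guarded` and `first_match` are illustrative only.

```python
import re

bare = re.compile(r"[1-9]\d{2,6}|99")
guarded = re.compile(r"(?<!\d)([1-9]\d{2,6}|99)(?!\d)")

def first_match(pattern, text):
    """Return the matched text, or None when nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None

for text in ("12345678 grue", "1000 grue"):
    print(text, "| bare:", first_match(bare, text), "| guarded:", first_match(guarded, text))
# 12345678 grue | bare: 1234567 | guarded: None
# 1000 grue | bare: 1000 | guarded: 1000
```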

    The parentheses around the alternates are back because they're necessary for grouping here. Alternation has the lowest precedence, so without them the lookbehind would belong only to the first alternate and the lookahead only to the second.
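To see what goes wrong without the group, this Python sketch compares the two forms. The test inputs "12345678" and "0099" are chosen here for illustration: in the ungrouped pattern the first alternate has no lookahead and the second has no lookbehind.

```python
import re

ungrouped = re.compile(r"(?<!\d)[1-9]\d{2,6}|99(?!\d)")
grouped = re.compile(r"(?<!\d)([1-9]\d{2,6}|99)(?!\d)")

def first_match(pattern, text):
    """Return the matched text, or None when nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None

for text in ("12345678", "0099"):
    print(text, "| ungrouped:", first_match(ungrouped, text), "| grouped:", first_match(grouped, text))
# 12345678 | ungrouped: 1234567 | grouped: None
# 0099 | ungrouped: 99 | grouped: None
```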

    On the other hand if you're trying to make sure the entire string only consists of a number in your range you'll want to instead use the anchors ^ and $ (start of string and end of string, respectively):

    ^([1-9]\d{2,6}|99)$
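A short validation sketch in Python follows. One detail worth knowing: by default `$` also matches just before a single trailing newline (this holds for Python and for PCRE without its dollar-endonly option), so `fullmatch` is shown as a stricter alternative in Python.

```python
import re

anchored = re.compile(r"^([1-9]\d{2,6}|99)$")

for text in ("99", "1000", "98", "12345678", "1000 grue"):
    print(repr(text), bool(anchored.search(text)))
# '99' True, '1000' True, '98' False, '12345678' False, '1000 grue' False

# A trailing newline slips past $, but not past fullmatch.
print(bool(anchored.search("1000\n")))     # True
print(bool(anchored.fullmatch("1000\n")))  # False
```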
    

    And finally you can trade the capturing group out for a non-capturing group (?:...), so:

    ^(?:[1-9]\d{2,6}|99)$
    

    or

    (?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)
    

    You'll still grab the number as the match, it just won't be repeated in a group capture. (Lookarounds are already non-capturing, no need to worry about those.)
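The difference is easy to see in Python, where `group(0)` is the overall match and `groups()` lists the captures. This is only a sketch using the grue sentence from earlier.

```python
import re

text = "You will likely be eaten by 1000 grue."

capturing = re.search(r"(?<!\d)([1-9]\d{2,6}|99)(?!\d)", text)
non_capturing = re.search(r"(?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)", text)

print(capturing.group(0), capturing.groups())          # 1000 ('1000',)
print(non_capturing.group(0), non_capturing.groups())  # 1000 ()
```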