
Javascript Regex for Javascript Regex and Digits


The title might seem a bit recursive, and indeed it is.

I am working on a JavaScript script that highlights/colors JavaScript code displayed in HTML. In the browser, comments will be turned green, keywords (for, if, while, etc.) will be turned dark blue and italic, numbers will be red, and so on for other elements. However, the coloring is not all that important.

I am trying to figure out two different regular expressions which have started to cause a minor headache.

1. Finding a regular expression using a regular expression

I want to find regular expressions within the script tags of HTML using JavaScript, such as:

    match(/findthis/i);

where the regex part is, of course, "/findthis/i".

The rules are as follows:

  1. Finding multiple occurrences (/g) is not important.
  2. It must be on the same line (not /m).
  3. Case-insensitive (/i).
  4. If a backslash (the escape character) is followed directly by a forward slash, "/", that forward slash is part of the expression, not its closing delimiter. E.g.: /itdoesntstop\/untilnow:/
  5. Two forward slashes right next to each other (//) mean: (A) At the beginning: Not a regex; it's a comment. (B) Later on: The first slash is the end of the regex and the second slash is nothing but a character.
  6. The regex continues until a line break or the end of input (\n|$), or until the closing delimiter (the second forward slash, i.e. one that is not escaped as in rule 4) is encountered. However, alphabetic characters directly following the closing slash are also considered part of the regex. E.g.: /aregex/allthisispartoftheregex

So far what I've got is this:

    '\\/(?:[^\\/\\\\]|\\/\\*)*\\/([a-zA-Z]*)?'

However, it isn't consistent. Any suggestions?

2. Find digits (alphanumeric, floating) using a regular expression

Finding digits on their own is simple. However, finding floating-point numbers (possibly with multiple periods) that may also contain letters and underscores is more of a challenge.

All of the below are considered numbers (a new number starts after each space):

3 3.1 3.1.4 3a 3.A 3.a1 3_.1

The rules:

  1. Finding multiple occurrences (/g) is not important.
  2. It must be on the same line (not /m).
  3. Case-insensitive (/i).
  4. A number must begin with a digit. However, the number can be preceded or followed by a non-word (\W) character. E.g.: "=9.9;" where "9.9" is the actual number. "a9" is not a number. A period before the number, ".9", is not considered part of the number and thus the actual number is "9".
  5. Allowed characters: [a-zA-Z0-9_.]

What I've got:

    '(^|\\W)\\d([a-zA-Z0-9_.]*?)(?=([^a-zA-Z0-9_.]|$))'

It doesn't work quite the way I want it to.


Solution

  • For the first part, I think you are quite close. Here is what I would use (as a regex literal, to avoid all the double escapes):

    /\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i
    

    I don't know what you intended with your second alternative after the character class. Here, the second alternative consumes a backslash together with whatever character follows it. Consuming that following character is important: it is what lets you recognize the end of a regex such as /backslash\\/, where the final slash comes after an escaped backslash and is therefore not escaped itself. The ? at the end of your regex was redundant, because the * inside the group can already match nothing. Otherwise this should be fine.
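To make the behaviour concrete, here is a minimal sketch that runs the pattern above against the examples from the question. The helper name `findRegexLiteral` and its return shape are assumptions made for illustration only; they are not part of the original answer.

```javascript
// Pattern from the answer: opening slash, a body made of plain characters
// or backslash-escape pairs, closing slash, then optional letters (flags).
const regexLiteral = /\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i;

// Hypothetical helper: returns the first regex-looking literal in a line,
// or null when nothing matches.
function findRegexLiteral(line) {
  const m = regexLiteral.exec(line);
  return m ? { literal: m[0], flags: m[1] } : null;
}

// String.raw keeps backslashes exactly as typed in the test inputs.
const samples = [
  "match(/findthis/i);",                 // literal "/findthis/i", flags "i"
  String.raw`/itdoesntstop\/untilnow:/`, // escaped slash stays inside the body
  String.raw`/backslash\\/`,             // escaped backslash, then closing slash
  "/aregex/allthisispartoftheregex",     // trailing letters are captured as flags
  "// just a comment",                   // null: the body needs at least one character
];

for (const s of samples) {
  console.log(s, "=>", findRegexLiteral(s));
}
```

One caveat worth knowing: the pattern is not anchored, so a comment line such as `// see a/b` still produces a match ("/ see a/b", starting at the second slash). A highlighter would probably need to deal with comments before applying this pattern.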


  • Your second regex is just fine for your specification. There are a few redundant elements though. The main thing you might want to do is capture everything but the possible first character:

    /(?:^|\W)(\d[\w.]*)/i
    

    Now the actual number (without the first character) will be in capturing group 1. Note that I removed the ungreediness and the lookahead, because greediness alone does exactly the same.
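As a quick check, the following sketch applies that pattern to the examples from the question and reads the number out of capturing group 1. The helper name `extractNumber` is hypothetical, and splitting the sample line on spaces is only a convenience for this demonstration.

```javascript
// Pattern from the answer: start of input or one non-word character,
// then a digit followed by any run of word characters and periods.
const numberToken = /(?:^|\W)(\d[\w.]*)/i;

// Hypothetical helper: returns the number found in the text
// (capturing group 1), or null when there is none.
function extractNumber(text) {
  const m = numberToken.exec(text);
  return m ? m[1] : null;
}

console.log(extractNumber("=9.9;")); // "9.9"
console.log(extractNumber(".9"));    // "9"  (the leading period is not captured)
console.log(extractNumber("a9"));    // null (a letter precedes the digit)

// Every token listed in the question is matched in full.
const tokens = "3 3.1 3.1.4 3a 3.A 3.a1 3_.1".split(" ");
console.log(tokens.map(extractNumber));
```

Because the leading character sits in a non-capturing group, the caller never has to strip it off; only group 1 is read.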
