
Tokenizing strings using regular expressions in JavaScript


Suppose I have a long string containing newlines and tabs, such as:

var x = "This is a long string.\n\t This is another one on next line.";

How can I split this string into tokens using a regular expression?

I don't want to use .split(' ') because I want to learn JavaScript's regular expressions.

A more complicated string could be this:

var y = "This @is a #long $string. Alright, lets split this.";

Now I want to extract only the valid words from this string, without special characters or punctuation, i.e. I want these:

var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];

var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];

Solution

  • Here is a jsFiddle example of what you asked for: http://jsfiddle.net/ayezutov/BjXw5/1/

    Basically, the code is very simple:

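    var y = "This @is a #long $string. Alright, lets split this.";
    // [^\s]+ matches a run of one or more non-whitespace characters (same as \S+);
    // the g flag returns every match in the string, not just the first one.
    var regex = /[^\s]+/g;

    // match() returns null when nothing matches, so fall back to an empty array.
    var match = y.match(regex) || [];
    for (var i = 0; i < match.length; i++)
    {
        document.write(match[i]);
        document.write('<br>');
    }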
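
    To see what the non-whitespace pattern returns for both strings from the question, here is a minimal sketch. The variable names nonSpace, xTokens and yTokens are made up for illustration, and console.log is used instead of document.write so it also runs outside a browser. Note that punctuation and symbols stay attached to the tokens, which is why the updates below narrow the pattern.

```javascript
// Both sample strings from the question.
var x = "This is a long string.\n\t This is another one on next line.";
var y = "This @is a #long $string. Alright, lets split this.";

// One or more non-whitespace characters, matched globally.
var nonSpace = /[^\s]+/g;

// match() returns null when nothing matches, so fall back to an empty array.
var xTokens = x.match(nonSpace) || [];
var yTokens = y.match(nonSpace) || [];

console.log(xTokens); // the newline and tab are skipped, but "string." and "line." keep their dots
console.log(yTokens); // "@is", "#long", "$string." and "Alright," keep their symbols
```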
    

    UPDATE: You can also exclude punctuation by adding more separator characters to the negated character class: http://jsfiddle.net/ayezutov/BjXw5/2/

    var regex = /[^\s\.,!?]+/g;
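
    A short sketch of how the separator list behaves on the second string from the question. The names basic and extended are illustrative, and adding @, # and $ to the class is an assumption about which symbols should count as separators, not part of the original fiddle.

```javascript
var y = "This @is a #long $string. Alright, lets split this.";

// Equivalent to the class above; inside brackets the dot needs no escape.
var basic = /[^\s.,!?]+/g;
// Assumed extension: also treat @, # and $ as separators.
var extended = /[^\s.,!?@#$]+/g;

var basicTokens = y.match(basic) || [];
var extendedTokens = y.match(extended) || [];

console.log(basicTokens);    // "@is", "#long" and "$string" still carry their symbols
console.log(extendedTokens); // plain words only
```

    Every new symbol has to be listed by hand with this approach, so it suits input where the set of separators is known in advance.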
    

    UPDATE 2: To match only word characters (letters, digits and the underscore) every time: http://jsfiddle.net/ayezutov/BjXw5/3/

    var regex = /\w+/g;
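
    Putting it together, a sketch that applies \w+ to both strings from the question and yields the word lists asked for. The helper name tokenize and the sample string are hypothetical. The last part contrasts \w, which also accepts digits and underscores, with [A-Za-z], which covers ASCII letters only.

```javascript
// Hypothetical helper: returns every match of `pattern` in `text`, or [] if none.
// The pattern is expected to carry the g flag; without it match() returns a
// single match result instead of the full list.
function tokenize(text, pattern) {
    return text.match(pattern) || [];
}

var x = "This is a long string.\n\t This is another one on next line.";
var y = "This @is a #long $string. Alright, lets split this.";

var xwords = tokenize(x, /\w+/g);
var ywords = tokenize(y, /\w+/g);

// \w is [A-Za-z0-9_], so digits and underscores count as word characters.
var sample = "item_1 costs 20 dollars";
var wordChars = tokenize(sample, /\w+/g);          // keeps "item_1" and "20"
var lettersOnly = tokenize(sample, /[A-Za-z]+/g);  // keeps ASCII letters only

console.log(xwords);
console.log(ywords);
console.log(wordChars);
console.log(lettersOnly);
```

    Neither pattern treats an apostrophe as part of a word, so a contraction such as "let's" would come back as two tokens.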