javascript tokenize

How to tokenize a sentence splitting on spaces, except treat quoted segments as a single token?


For example, I want to split the following sentence:

(Quick brown "fox jumps (over)") the lazy dog and (looks for food)

Expected Output Array:

["(Quick","brown","fox jumps (over)",")the","lazy","dog","and","(looks","for","food)"]

I have tried this simple function in the TypeScript Playground:

const tokenizeSentenceText = (sentence: any = '') => {
  let wordList = [];

  wordList = sentence.match(/\\?.|^$/g).reduce((p: any, c: any) => {
    if (c === '"') {
      p.quote ^= 1;
    } else if (!p.quote && c === ' ') {
      p.a.push('');
    } else {
      p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1");
    }
    return p;
  }, { a: [''] }).a;

  return wordList;
};

This produces the following output:

["(Quick", "brown", "fox jumps (over))", "the", "lazy", "dog", "and", "(looks", "for", "food)"]

As you can see, the closing bracket that follows the closing double quote ends up inside the quoted token, giving "fox jumps (over))" instead of "fox jumps (over)". That bracket should instead go to the next word, giving ")the".

Note: Anything written inside double quotes " " should be treated as a single word. Multiple spaces and brackets may appear inside the double quotes.

Thanks for your help in advance.


Solution

  • You can actually use

    const tokenizeSentenceText = (sentence) => {
      return sentence.match(/"[^"]*"|[^\s"]+/g);
    }
    // If the double quotes need removing 
    const tokenizeSentenceTextNoQuotes = (sentence) => {
      return Array.from(sentence.matchAll(/"([^"]*)"|[^\s"]+/g), (x) => x[1] ?? x[0]);
    }
    
    const text = '(Quick brown "fox     jumps (over)") the lazy dog and (looks for food)';
    console.log(tokenizeSentenceText(text))
    console.log(tokenizeSentenceTextNoQuotes(text))

    The regex matches

    • "([^"]*)" - a " char, then zero or more chars other than " (captured into Group 1), and then a " char
    • | - or
    • [^\s"]+ - one or more chars other than whitespace and " chars.
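To make the two alternatives concrete, here is a minimal sketch that labels each token with the alternative that produced it. The names TOKEN_RE and describeTokens are made up for this illustration and are not part of the answer above.

```javascript
// Sketch only: TOKEN_RE and describeTokens are hypothetical names.
const TOKEN_RE = /"([^"]*)"|[^\s"]+/g;

const describeTokens = (sentence) =>
  Array.from(sentence.matchAll(TOKEN_RE), (m) => ({
    token: m[0],
    // Group 1 is only defined when the quoted alternative matched.
    quoted: m[1] !== undefined,
  }));

const sample = '(Quick brown "fox jumps (over)") the lazy dog';
for (const { token, quoted } of describeTokens(sample)) {
  console.log(quoted ? 'quoted  ' : 'unquoted', token);
}
```

Note that with this pattern the ) right after the closing quote becomes its own token, followed by the, because the sample input has whitespace between them. It is not merged into ")the" as in the question's expected output.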

    In Array.from(sentence.matchAll(/"([^"]*)"|[^\s"]+/g), (x) => x[1] ?? x[0]), the mapping function (x) => x[1] ?? x[0] returns the Group 1 value if the quoted alternative matched; otherwise it returns the whole match (the text matched by [^\s"]+).
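A few edge cases may be worth checking before relying on this. The sketch below uses a hypothetical helper, tokenizeSafe (not from the answer above), and assumes non-string input should simply be coerced with String().

```javascript
// Sketch only: TOKEN_PATTERN and tokenizeSafe are hypothetical names.
const TOKEN_PATTERN = /"([^"]*)"|[^\s"]+/g;

// Always returns an array, with the surrounding quotes stripped.
const tokenizeSafe = (sentence = '') =>
  Array.from(String(sentence).matchAll(TOKEN_PATTERN), (m) => m[1] ?? m[0]);

// matchAll yields nothing for an empty string, so the result is an empty array.
console.log(tokenizeSafe(''));                 // []

// An empty quoted segment is kept as '' because ?? only falls back on
// null/undefined; with || it would fall back to the whole match '""'.
console.log(tokenizeSafe('a "" b'));           // ['a', '', 'b']

// An unterminated quote matches neither alternative and is skipped.
console.log(tokenizeSafe('say "hello world')); // ['say', 'hello', 'world']

// By contrast, String.prototype.match with the g flag returns null
// (not an empty array) when nothing matches.
console.log('   '.match(/"[^"]*"|[^\s"]+/g));  // null
```

If an unterminated quote should be reported rather than silently dropped, the pattern would need an extra alternative or a separate check; that choice depends on the input you expect.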