javascript, node.js, regex, tokenize

Why is my program only detecting integer tokens in NodeJS?


I decided to try writing a language tokenizer (I don't even know if that's a real word). My first version had around 4 tokens and successfully tokenized a full program with line breaks, multiple spaces, and so on. I have since started over from scratch and am running into a problem. I currently have two tokens, int and variableSet. The program being read is just the test input 1 sv 1 2, and the tokenizer returns an array of int, int, int, int, with sv reported as an int with a value of 1.

const code = `1 sv 1 2`

var validTokens = require("./tokens"); // just an object with the structure tokenName: RegExp object

function reverseTokenSearch(regex){
    for (const [index, [key, value]] of Object.entries(Object.entries(validTokens))) {
        if (value === regex){
            return key;
        }
    }
    return false;
}

function throughTokens (code,lastidx=0) {
    for (const tokentype in validTokens){ // loop through all of the valid tokens
        validTokens[tokentype].lastIndex = lastidx;
        const searchresult = validTokens[tokentype]
        const tokenresult = searchresult.exec(code.toString());
        if (tokenresult) {
            return [searchresult, tokenresult[0], tokenresult.index, lastidx+tokenresult[0].length+1, tokenresult.groups]
        }
    }
}

function resetIndexes (){
    for (const tt in validTokens){
        validTokens[tt].lastidx = 0;
    }
}
resetIndexes();
var lst = 0
var tokens = []
var res = 1;
console.log("\ntokenizer; original input:\n"+code+"\n");
while (lst !== undefined && lst !== null){
    if (lst > code.length){
        console.error("Fatal error: tokenizer over-reached program length.")
        process.exit(1)
    }
    const res = throughTokens(code,lst);
    if(res){
        console.log(res,lst)
        const current = []
        current[0] = reverseTokenSearch(res[0])
        current[1] = res[1]
        const currentidx = 2
        for (const x in res[4]) {
            current[currentidx] = x;
        }
        tokens.push(current)
        lst = res[3]
    } else {
        lst = null
    }
}
console.log(tokens)
// What outputs:
/*
tokenizer; original input:
1 sv 1 2

[ /\d+/g { lastidx: 0 }, '1', 0, 2, undefined ] 0
[ /\d+/g { lastidx: 0 }, '1', 5, 4, undefined ] 2
[ /\d+/g { lastidx: 0 }, '1', 5, 6, undefined ] 4
[ /\d+/g { lastidx: 0 }, '2', 7, 8, undefined ] 6
[ [ 'int', '1' ], [ 'int', '1' ], [ 'int', '1' ], [ 'int', '2' ] ]
*/

I think it's because of the order of the array, but I have no idea where to start fixing it and would greatly appreciate a push in the right direction. (edit): I tried removing the "g" flag from the RegExp objects, and all it did was send the program into an infinite loop.


Solution

  • The problem is that you are silently assuming that every match found by the regex starts at lastidx, which is not always the case. If you log lastidx and tokenresult before returning from throughTokens, you will see:

    0
    [ '1', index: 0, input: '1 sv 1 2', groups: undefined ] 
    2
    [ '1', index: 5, input: '1 sv 1 2', groups: undefined ]
    4
    [ '1', index: 5, input: '1 sv 1 2', groups: undefined ]
    6
    [ '2', index: 7, input: '1 sv 1 2', groups: undefined ]
    

    In the second iteration, the match is at index 5, but you assume it is at index 2 (and, as a result, you also incorrectly advance lastidx to 4). At the end of throughTokens you also assume that every match is followed by a space, which is not true for the last token.
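
To see why the second match lands at index 5, here is a minimal sketch of how exec behaves on a regex with the g flag. It assumes the int token is /\d+/g, as the logged output in the question suggests; the variable names are only for illustration.

```javascript
// Assumption: the `int` token is /\d+/g, as the question's log output suggests.
const intToken = /\d+/g;
const code = "1 sv 1 2";

// With the g flag, exec() starts searching at lastIndex,
// but it is free to find a match further to the right.
intToken.lastIndex = 2;
const match = intToken.exec(code);

console.log(match[0], match.index); // the match starts at index 5, not at 2
console.log(intToken.lastIndex);    // exec() moved lastIndex past the match, to 6
```

So lastIndex is only a lower bound for where a global regex may match, not an anchor. Without the g (or y) flag, lastIndex is ignored entirely and exec always searches from the start of the string.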

    The simplest way to fix this code is to change the condition in throughTokens

    //if (tokenresult) { // replace in throughTokens with below
    if (tokenresult && tokenresult.index === lastidx) {
    

    so that a match only counts if it starts at the right place, and then to change the main loop condition

    //while (lst !== undefined && lst !== null){ // replace with below
    while (lst !== undefined && lst !== null && lst < code.length){
    

    to handle the end of the input correctly.

    With these changes, the log output that we added earlier becomes

    0
    [ '1', index: 0, input: '1 sv 1 2', groups: undefined ]
    2
    [ 'sv', index: 2, input: '1 sv 1 2', groups: undefined ]
    5
    [ '1', index: 5, input: '1 sv 1 2', groups: undefined ]
    7
    [ '2', index: 7, input: '1 sv 1 2', groups: undefined ]
    

    which is correct, and the final output is

    [
        [ 'int', '1' ],
        [ 'variableSet', 'sv' ],
        [ 'int', '1' ],
        [ 'int', '2' ]
    ]
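
As an alternative to comparing tokenresult.index with lastidx, the sticky flag y makes exec match only at lastIndex. The following is a sketch of that swapped-in technique, not the original code. The token table (including the /sv/y pattern for variableSet) and the tokenize function are assumptions, because the contents of ./tokens are not shown.

```javascript
// Hypothetical token table; the real ./tokens module is not shown in the question.
const stickyTokens = {
    int: /\d+/y,
    variableSet: /sv/y,
};

// Tokenize by trying every pattern at the current position only.
function tokenize(code) {
    const tokens = [];
    let pos = 0;
    while (pos < code.length) {
        // Skip whitespace between tokens instead of assuming exactly one space.
        if (/\s/.test(code[pos])) {
            pos++;
            continue;
        }
        let matched = false;
        for (const [type, regex] of Object.entries(stickyTokens)) {
            regex.lastIndex = pos; // a sticky regex can match only at this index
            const result = regex.exec(code);
            if (result) {
                tokens.push([type, result[0]]);
                pos = regex.lastIndex; // exec() advanced it past the match
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new SyntaxError(`Unexpected character at index ${pos}`);
        }
    }
    return tokens;
}

console.log(tokenize("1 sv 1 2"));
```

With a sticky regex there is no need to check the match index or to guess how many characters follow a token, since lastIndex after a successful exec already points just past the match.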
    

    Recommendations

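    There are several other logical and programming problems in this code that I will not go into in detail (for example, resetIndexes sets a property named lastidx, whereas the property that RegExp objects actually use is lastIndex). My advice is to go through every piece of the code and understand what it does and whether it could be done in a simpler way.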

    In general, instead of returning an array of positional data [d1, d2, d3, ...], return an object with named properties { result: d1, index: d2, ... }. That makes it much easier for someone else to understand your code. Also review the naming of your functions.
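
For example, throughTokens could return an object instead of a positional array. This is only a sketch: the function name matchTokenAt, the property names (type, text, index, nextIndex), and the token table are assumptions, since ./tokens is not shown.

```javascript
// Hypothetical token table standing in for ./tokens.
const validTokens = {
    int: /\d+/g,
    variableSet: /sv/g,
};

// Return the token that matches exactly at `start`, or null if none does.
function matchTokenAt(code, start = 0) {
    for (const [type, regex] of Object.entries(validTokens)) {
        regex.lastIndex = start;
        const result = regex.exec(code);
        if (result && result.index === start) {
            return {
                type,                                       // token name, e.g. "int"
                text: result[0],                            // matched source text
                index: result.index,                        // where the match starts
                nextIndex: result.index + result[0].length, // first index after the match
            };
        }
    }
    return null;
}

const token = matchTokenAt("1 sv 1 2", 2);
console.log(token.type, token.text, token.nextIndex);
```

Returning the token name directly also removes the need for a reverse lookup such as reverseTokenSearch.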

    As for this approach: if you know that there will be a space after each token, extract only the current token and pass it to throughTokens. That makes the function both more efficient and more robust against errors.
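
A minimal sketch of that idea, assuming tokens are always separated by whitespace: split the input first and classify each piece with anchored patterns. The patterns and the classify and tokenizeBySplitting helpers are hypothetical names, not part of the original code.

```javascript
// Hypothetical anchored patterns; ^...$ forces the whole piece to match.
const tokenPatterns = {
    int: /^\d+$/,
    variableSet: /^sv$/,
};

// Return the name of the first pattern matching the whole piece, or null.
function classify(piece) {
    for (const [type, regex] of Object.entries(tokenPatterns)) {
        if (regex.test(piece)) {
            return type;
        }
    }
    return null;
}

// Split on runs of whitespace, drop empty pieces, and label each one.
function tokenizeBySplitting(code) {
    return code
        .split(/\s+/)
        .filter((piece) => piece.length > 0)
        .map((piece) => [classify(piece), piece]);
}

console.log(tokenizeBySplitting("1 sv 1 2"));
```

Because these patterns have no g or y flag, they carry no lastIndex state between calls. The trade-off is that this only works while no token can itself contain whitespace (a string literal, for instance, would break it).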