javascript regex string-parsing

Regular Expressions - Matching IRC-like parameters?


I am looking to create an IRC-like command format:

/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5

Which would (ideally) give me a list of parameters:

parameter1
parameter 2
parameter "3"
parameter"4
parameter\"5

From what I have read, this isn't at all trivial with a regular expression and might be better done some other way.

Thoughts?

Below is C# code that does the job I need:

    public List<string> ParseIrcCommand(string command)
    {
        command = command.Trim();
        command = command.TrimStart(new char[] { '/' });
        command += ' ';

        List<string> Tokens = new List<string>();

        int tokenStart = 0;
        bool inQuotes = false;
        bool inToken = true;
        string currentToken = "";
        for (int i = tokenStart; i < command.Length; i++)
        {
            char currentChar = command[i];
            char nextChar = (i + 1 >= command.Length ? ' ' : command[i + 1]);

            if (!inQuotes && inToken && currentChar == ' ')
            {
                Tokens.Add(currentToken);
                currentToken = "";
                inToken = false;
                continue;
            }

            if (inQuotes && inToken && currentChar == '"')
            {
                Tokens.Add(currentToken);
                currentToken = "";
                inQuotes = false;
                inToken = false;
                if (nextChar == ' ') i++;
                continue;
            }

            if (inQuotes && inToken && currentChar == '\\' && nextChar == '"')
            {
                i++;
                currentToken += nextChar;
                continue;
            }

            if (!inToken && currentChar != ' ')
            {
                inToken = true;
                tokenStart = i;
                if (currentChar == '"')
                {
                    tokenStart++;
                    inQuotes = true;
                    continue;
                }
            }

            currentToken += currentChar;
        }

        return Tokens;
    }

Solution

  • You have shown your code, which is good, but it seems that you haven't thought about whether it is reasonable to parse the command like that:

    • Firstly, your code allows newline characters inside the command name and parameters. It would be more reasonable to assume that a newline character can never appear there.
    • Secondly, \ also needs to be escapable, just like ". Otherwise there is no way to specify a single \ at the end of a quoted parameter without it being read as escaping the closing quote.
    • Thirdly, it is a bit odd to parse the command name the same way as the parameters. Command names are usually predetermined and fixed, so there is no need to allow flexible ways of specifying them.
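
The second point can be illustrated with a minimal sketch. The helper name `unquote` is made up for this example; it accepts a single quoted parameter in which both \" and \\ are treated as escapes, and returns null for anything else.

```javascript
// Hypothetical helper for illustration only: strips the surrounding quotes
// and resolves the escapes \" and \\ inside one quoted parameter.
// Returns null when the text is not a well-formed quoted parameter.
function unquote(text) {
    var m = /^"((?:[^\\"]|\\[\\"])*)"$/.exec(text);
    if (m === null) {
        return null;
    }
    return m[1].replace(/\\([\\"])/g, "$1");
}

// Raw input "dir\\" (two backslashes before the closing quote)
// decodes to dir\ with a single trailing backslash.
console.log(unquote('"dir\\\\"'));

// Raw input "dir\" is rejected: the \" is read as an escaped quote,
// so the parameter is never closed.
console.log(unquote('"dir\\"'));
```

If \\ were not an escape, a quoted parameter ending in a backslash could not be written at all, because its last two characters would always be read as an escaped quote.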

    I cannot think of a one-line solution in JavaScript that is general. JavaScript regex lacks \G, which asserts the position where the previous match ended. So my solution has to make do with the beginning-of-string assertion ^ and chomp off the front of the string each time a token is matched.

    (There is not much code here, mostly comments)

    function parseCommand(str) {
        /*
         * Trim() in C# trims off all whitespace characters, and \s in a
         * JavaScript regex also matches any whitespace character.
         * However, the two sets of characters considered as whitespace
         * might not be equivalent.
         * You can be sure that \r, \n, \t and space (ASCII 32) are included.
         * 
         * However, allowing all those whitespace characters in the command
         * is questionable.
         */
        str = str.replace(/^\s*\//, "");
    
        /* The look-ahead (?!") is needed to prevent a quoted parameter with a
         * missing closing quote from being matched as a non-quoted token.
         * It is needed because your code does not backtrack, while the regex
         * engine does. A possessive quantifier could prevent the
         * backtracking, but it is not supported by JavaScript RegExp.
         *
         * We emulate the effect of \G by using ^ and repeatedly chomping off
         * the string.
         *
         * The regex will match 2 cases:
         * (?!")([^ ]+)
         * This will match non-quoted tokens, which are not allowed to 
         * contain spaces
         * The token is captured into capturing group 1
         *
         * "((?:[^\\"]|\\[\\"])*)"
         * This will match quoted tokens, which consist of 0 or more:
         * non-quote-or-backslash [^\\"] OR escaped quote \"
         * OR escaped backslash \\
         * The text inside the quote is captured into capturing group 2
         */
        var regex = /^ *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/;
        var tokens = [];
        var arr;
    
        while ((arr = str.match(regex)) !== null) {
            if (arr[1] !== void 0) {
                // Non-space token
                tokens.push(arr[1]);
            } else {
                // Quoted token, needs extra processing to
                // convert escaped character back
                tokens.push(arr[2].replace(/\\([\\"])/g, '$1'));
            }
    
            // Remove the matched text
            str = str.substring(arr[0].length);
        }
    
        // Test that the leftover consists of only space characters
        if (/^ *$/.test(str)) {
            return tokens;
        } else {
            // We reach here when a quoted token is never closed, or when it
            // contains a backslash that is not followed by \ or ".
            // Your code returns the tokens successfully parsed so far,
            // but I think it is better to signal an error here.
            return null;
        }
    }
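
As a side note, engines that implement the ES2015 sticky flag `y` can anchor each match at `lastIndex`, which behaves much like \G and avoids chomping the string. The following is only a sketch of that variant under the same token grammar as above; the name `parseCommandSticky` is made up for this example.

```javascript
// Hypothetical variant using the sticky flag "y" (requires ES2015+).
function parseCommandSticky(str) {
    // Skip optional leading whitespace followed by the slash, if present.
    var lead = /^\s*\//.exec(str);
    var pos = lead === null ? 0 : lead[0].length;

    // Same token pattern as before, with "y" taking the place of "^":
    // every exec() call must match exactly at regex.lastIndex.
    var regex = / *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/y;
    var tokens = [];
    var arr;

    regex.lastIndex = pos;
    while ((arr = regex.exec(str)) !== null) {
        if (arr[1] !== undefined) {
            // Non-quoted token
            tokens.push(arr[1]);
        } else {
            // Quoted token: convert escaped characters back
            tokens.push(arr[2].replace(/\\([\\"])/g, "$1"));
        }
        pos = regex.lastIndex; // remember where the last token ended
    }

    // A failed sticky match resets lastIndex to 0, hence the separate pos.
    // Anything other than trailing spaces means a malformed quoted token.
    return /^ *$/.test(str.slice(pos)) ? tokens : null;
}

console.log(parseCommandSticky('/join #chan "hello world"'));
// → [ 'join', '#chan', 'hello world' ]
```

Like the function above, this sketch returns the command name as the first token and returns null when a quoted token is malformed.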