
Processing a Comma Separated List Before Shunting-Yard


I'm processing math expressions from XML strings using the Shunting-Yard algorithm. The twist is that I want to allow random values to be generated from comma-separated lists. For example:

( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 ) )

I've already got a basic Shunting-Yard processor working, but I want to pre-process the string so that one value is randomly picked from each list before the expression is evaluated. I might end up with:

( ( 3 + 4 ) * 12 ) * 4 )

The Shunting-Yard setup is already pretty complicated, as far as my understanding is concerned, so I'm hesitant to alter it to handle this; doing that with error checking sounds like a nightmare. As such, I'm assuming it would make sense to look for that pattern beforehand? I was considering a regular expression, but I'm not one of "those" people (though I wish I was), and while I've found some examples, I'm not sure how to modify them to check for the parentheses first. I'm also not confident that this would be the best solution.

As a side note, if the solution is a regex, it should also match strings (just characters, no symbols) in the comma-separated list, since my Shunting-Yard implementation will process specific strings as values.

Thanks for your thoughts in advance.


Solution

  • This can be solved with two regexes. The first, applied to the overall text, matches each parenthesized list of comma-separated values. The second, applied to each matched list, matches the individual values in it. Here is a PHP script with a function that, given input text containing multiple lists, replaces each list with one of its values chosen at random:

    <?php // test.php 20110425_0900
    
    function substitute_random_value($text) {
        $re = '/
            # Match parenthesized list of comma separated words.
            \(           # Opening delimiter.
            \s*          # Optional whitespace.
            \w+          # Required first value.
            (?:          # Group for additional values.
              \s* , \s*  # Comma with optional surrounding whitespace.
              \w+        # Next value.
            )+           # One or more additional values.
            \s*          # Optional whitespace.
            \)           # Closing delimiter.
            /x';
        // Match each parenthesized list and replace with one of the values.
        $text = preg_replace_callback($re, '_srv_callback', $text);
        return $text;
    }
    function _srv_callback($matches_paren) {
        // Grab all word options in parenthesized list into $matches.
        $count = preg_match_all('/\w+/', $matches_paren[0], $matches);
        // Randomly pick one of the matches and return it.
        return $matches[0][rand(0, $count - 1)];
    }
    
    // Read input text
    $data_in = file_get_contents('testdata.txt');
    
    // Process text multiple times to verify random replacements.
    $data_out  = "Run 1:\n". substitute_random_value($data_in);
    $data_out .= "Run 2:\n". substitute_random_value($data_in);
    $data_out .= "Run 3:\n". substitute_random_value($data_in);
    
    // Write output text
    file_put_contents('testdata_out.txt', $data_out);
    
    ?>
    

    The substitute_random_value() function calls PHP's preg_replace_callback(), which finds each list and invokes _srv_callback() on it. The callback randomly picks one of the list's values and returns it as the replacement.

    Given this input test data (testdata.txt):

    ( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 ) )
    ( ( 3 + 4 ) * 12 ) * ( 12, 13) )
    ( ( 3 + 4 ) * 12 ) * ( 22, 23, 24) )
    ( ( 3 + 4 ) * 12 ) * ( 32, 33, 34, 35 ) )

    Here is the output from one example run of the script:

    Run 1:
    ( ( 3 + 4 ) * 12 ) * 5 )
    ( ( 3 + 4 ) * 12 ) * 13 )
    ( ( 3 + 4 ) * 12 ) * 22 )
    ( ( 3 + 4 ) * 12 ) * 35 )
    Run 2:
    ( ( 3 + 4 ) * 12 ) * 3 )
    ( ( 3 + 4 ) * 12 ) * 12 )
    ( ( 3 + 4 ) * 12 ) * 22 )
    ( ( 3 + 4 ) * 12 ) * 33 )
    Run 3:
    ( ( 3 + 4 ) * 12 ) * 3 )
    ( ( 3 + 4 ) * 12 ) * 12 )
    ( ( 3 + 4 ) * 12 ) * 23 )
    ( ( 3 + 4 ) * 12 ) * 32 )

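    Note that this solution uses \w+ to match values consisting of "word" characters, i.e. [A-Za-z0-9_]. As a consequence, a list containing decimals such as 3.5 or negative numbers is left untouched. The value pattern can easily be changed if this does not meet your requirements.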
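As a rough sketch of how the value pattern might be changed, the JavaScript below swaps \w+ for a pattern that accepts either a word or an unsigned number with an optional decimal part. The names VALUE, LIST_RE, VALUE_RE and substituteRandomValueLoose, and the optional random parameter, are all hypothetical and not part of the original answer; adjust the pattern to whatever your Shunting-Yard tokenizer actually accepts.

```javascript
// Assumed value pattern (not from the original answer): an unsigned number
// with an optional decimal part, or a word starting with a letter or underscore.
const VALUE = String.raw`(?:\d+(?:\.\d+)?|[A-Za-z_]\w*)`;

// A list is "(" value ("," value)+ ")" with optional whitespace between parts.
const LIST_RE = new RegExp(String.raw`\(\s*${VALUE}(?:\s*,\s*${VALUE})+\s*\)`, "g");
const VALUE_RE = new RegExp(VALUE, "g");

// `random` defaults to Math.random; passing another function makes the
// choice reproducible, which is handy when testing.
function substituteRandomValueLoose(text, random = Math.random) {
    return text.replace(LIST_RE, (list) => {
        // Collect every value inside the matched list.
        const values = list.match(VALUE_RE);
        // Pick one of them and use it as the replacement.
        return values[Math.floor(random() * values.length)];
    });
}
```

With this pattern a list such as ( 1.5, 2.25, x ) is replaced, while an ordinary sub-expression such as ( 3 + 4 ) is still left alone because it contains no comma-separated values.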

    Edit: Here is a JavaScript version of the substitute_random_value() function:

    function substitute_random_value(text) {
        // Replace each parenthesized list with one of the values.
        return text.replace(/\(\s*\w+(?:\s*,\s*\w+)+\s*\)/g,
            function (m0) {
                // Capture all word values in parenthesized list into values.
                var values = m0.match(/\w+/g);
                // Randomly pick one of the matches and return it.
                return values[Math.floor(Math.random() * values.length)];
            });
    }
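To try the JavaScript version without relying on chance, one option is to let the caller supply the random number source. The sketch below restates the function so that it runs on its own; the camelCase name substituteRandomValue and the optional random parameter are assumptions added here for illustration, and the regexes are the same as above.

```javascript
// Same regexes as the answer's function. The optional `random` parameter is
// an assumption added here so the pick can be made deterministic.
function substituteRandomValue(text, random = Math.random) {
    return text.replace(/\(\s*\w+(?:\s*,\s*\w+)+\s*\)/g, (list) => {
        const values = list.match(/\w+/g);
        return values[Math.floor(random() * values.length)];
    });
}

const input = "( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 )";

// A source that returns 0 selects the first value; one that returns a
// number just below 1 selects the last.
const first = substituteRandomValue(input, () => 0);
const last = substituteRandomValue(input, () => 0.999);
console.log(first); // ( ( 3 + 4 ) * 12 ) * 2
console.log(last);  // ( ( 3 + 4 ) * 12 ) * 5

// Lists of words are handled the same way.
console.log(substituteRandomValue("speed * ( slow, fast )", () => 0)); // speed * slow
```

Called with a single argument, the function falls back to Math.random and behaves like the version above, so the result can then be handed to the Shunting-Yard processor unchanged.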