Tags: regex, string, bash, text-parsing

Matching optional parameters with non-capturing groups in Bash regular expressions


I want to parse strings similar to the following into separate variables using regular expressions from within Bash:

Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";

or

Category: resource;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Resource";rel="http://schemas.ogf.org/occi/core#entity";attributes="occi.core.summary";

The part before "title" is common to all strings; the remaining parameters, such as title, rel and attributes, are optional.

I managed to extract the mandatory parameters common to all strings, but I have trouble with the optional parameters, which are not present in every string. As far as I can tell, Bash doesn't support non-capturing parentheses, which I would use for this purpose.

Here is what I have so far:

CATEGORY_REGEX='Category:\s*([^;]*);scheme="([^"]*)";class="([^"]*)";'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo "${BASH_REMATCH[0]}"
echo "${BASH_REMATCH[1]}"
echo "${BASH_REMATCH[2]}"
echo "${BASH_REMATCH[3]}"

The regular expression I would like to use (and which is working for me in Ruby) would be:

CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(?:title="([^"]*)";)?\s*(?:rel="([^"]*)";)?\s*(?:location="([^"]*)";)?\s*(?:attributes="([^"]*)";)?\s*(?:actions="([^"]*)";)?'

Is there another way to parse the string with command-line tools, without having to fall back on Perl, Python or Ruby?


Solution

  • Bash's =~ operator uses POSIX extended regular expressions, which have no non-capturing groups, so your options are to use a scripting language or to remove the ?: from all of the (?:...) groups and just be careful about which groups you reference, for example:

    CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(title="([^"]*)";)?\s*(rel="([^"]*)";)?\s*(location="([^"]*)";)?\s*(attributes="([^"]*)";)?\s*(actions="([^"]*)";)?'
    category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
    [[ $category_string =~ $CATEGORY_REGEX ]]
    echo "full:       ${BASH_REMATCH[0]}"
    echo "category:   ${BASH_REMATCH[1]}"
    echo "scheme:     ${BASH_REMATCH[2]}"
    echo "class:      ${BASH_REMATCH[3]}"
    echo "title:      ${BASH_REMATCH[5]}"
    echo "rel:        ${BASH_REMATCH[7]}"
    echo "location:   ${BASH_REMATCH[9]}"
    echo "attributes: ${BASH_REMATCH[11]}"
    echo "actions:    ${BASH_REMATCH[13]}"
    

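    Note that, starting with the optional parameters, we need to skip a group each time: the even-numbered groups from 4 on (4, 6, 8, 10 and 12) contain the parameter name as well as the value (if the parameter is present), while the value alone is in the odd-numbered group that follows. For a parameter that is absent, both of its groups are empty.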
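
If keeping track of the group numbers becomes error-prone, one alternative is to match each parameter with its own small regex instead of one long pattern. The following is only a sketch under a few assumptions: get_param is a hypothetical helper name (not part of Bash), every parameter has the form key="value"; and is preceded by a semicolon, and values never contain a double quote. It uses [[:space:]] rather than \s, since \s is an extension that POSIX extended regular expressions do not define.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the value of one key="value"; parameter.
# Returns non-zero (and prints nothing) when the parameter is absent.
get_param() {
    local key=$1 string=$2
    # Require a preceding ';' so that e.g. "rel" does not match inside another key.
    local re=";[[:space:]]*${key}=\"([^\"]*)\";"
    [[ $string =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

category_string='Category: resource;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Resource";rel="http://schemas.ogf.org/occi/core#entity";attributes="occi.core.summary";'

# Optional parameters can be queried in any order; missing ones are reported.
for key in title rel location attributes actions; do
    if value=$(get_param "$key" "$category_string"); then
        printf '%s: %s\n' "$key" "$value"
    else
        printf '%s: (not present)\n' "$key"
    fi
done
```

Because each parameter is matched independently, this approach does not depend on the order in which the optional parameters appear, at the cost of running one match per key.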