bash awk nested brackets

Can this parsing problem be nicely done with awk?


I am trying to process a whitelist in bash, but I would like to process it with awk. whitelist.txt uses the format module:function(arguments). A single line can specify multiple functions or multiple arguments, each as a comma-separated list. Multiple functions can be enclosed in square brackets, while arguments are enclosed in round brackets, like this: some_module_name:[func1(arg1,arg2),func2] Arguments that contain white space are enclosed in double quotes, like this: some_module:[function-alpha(arg1,"argument 2 has spaces"),function-beta]

At the right-hand end of any line there can be flags such as offline_install=true and paralell_run=true, separated by white space. Those will be stored separately in an array afterwards.

How can I use awk to transform the input file format:

module1:function1(alpha,beta),function2 offline_install=true paralell_run=true
module2:[function-alpha(arg1,"argument 2 is inside quotes",arg3),function-beta]

into the output file format:

module1:function1(alpha)
module1:function1(beta)
module1:function2
module2:function-alpha(arg1)
module2:function-alpha("argument 2 is inside quotes")
module2:function-alpha(arg3)
module2:function-beta

Specifically, I want to:

  • Expand each comma-separated list of functions (usually inside square brackets []) into separate lines, each starting with the corresponding module name.
  • Print each argument inside round brackets () on its own line, starting with the corresponding module and function name.
  • Preserve any quoted arguments with spaces inside the quotes.

Here's another example, if it helps. whitelist.txt:

control_service:(stop=apache,"disable=apache (everywhere)")
database:add_redo_log_groups
database:check_data_consistency paralell_run=true
sas_certificate
p/a/g/config.sh:[function1(status,"start firewall [1]","stop [all] firewall"),func2] offline_install=true paralell_run=true
control_server:stop,disable
database:kill_sessions paralell_run=true
my_module
my_module:[func1,func2]

output_file.txt:

control_service:(stop=apache)
control_service:("disable=apache (everywhere)")
database:add_redo_log_groups
database:check_data_consistency paralell_run=true
sas_certificate
p/a/g/config.sh:function1(status) offline_install=true paralell_run=true
p/a/g/config.sh:function1("start firewall [1]") offline_install=true paralell_run=true
p/a/g/config.sh:function1("stop [all] firewall") offline_install=true paralell_run=true
p/a/g/config.sh:func2 offline_install=true paralell_run=true
control_server:stop
control_server:disable
database:kill_sessions paralell_run=true
my_module
my_module:func1
my_module:func2

I've tried different approaches, but so far, I have not been able to generate the correct output. Any help with the awk script would be greatly appreciated.


Solution

  • Here is a start showing you how to encode/decode the problematic characters inside the quoted strings so you can then identify and/or split on those characters outside the quoted strings:

    $ cat tst.awk
    {
        print "---------"
        printf "$0 = %s\n", $0
    
        module = gensub(/:.*/,"",1,$0)
        fns_args = substr($0,length(module)+2)
    
        printf "module = %s\n",  module
        printf "raw fns_args = %s\n", fns_args
    
        encoded_fns_args = encode(fns_args)
    
        printf "encoded fns_args = %s\n", encoded_fns_args
    
        if ( match(encoded_fns_args,/\[(.*)]\s*(.*)/,a) ) {
            encoded_args = a[2]
            decoded_args = decode(encoded_args)
            printf "decoded_args = %s\n", decoded_args
    
            n = split(a[1],encoded_fns,/,/)
            for ( i=1; i<=n; i++ ) {
                encoded_fn = encoded_fns[i]
                decoded_fn = decode(encoded_fn)
                printf "decoded_fn = %s\n", decoded_fn
            }
        }
    }
    
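    # Replace delimiter characters with @-codes, but only inside double-quoted
    # strings, so they cannot be mistaken for real delimiters later.
    function encode(str,    a) {
        gsub(/[@]/,"@A",str)    # escape literal "@" first so the codes stay unambiguous
        # Each pass encodes one quoted string, quotes included, so the loop terminates.
        while ( match(str,/([^"]*)("[^"]*")(.*)/,a) ) {
            gsub(/[[]/,"@B",a[2])
            gsub(/[]]/,"@C",a[2])
            gsub(/[(]/,"@D",a[2])
            gsub(/[)]/,"@E",a[2])
            gsub(/[,]/,"@F",a[2])
            gsub(/["]/,"@G",a[2])
            str = a[1] a[2] a[3]
        }
        return str
    }
    
    # Reverse encode(); "@A" is restored last so a literal "@" is never misread as a code.
    function decode(str) {
        gsub(/@G/,"\"",str)
        gsub(/@F/,",",str)
        gsub(/@E/,")",str)
        gsub(/@D/,"(",str)
        gsub(/@C/,"]",str)
        gsub(/@B/,"[",str)
        gsub(/@A/,"@",str)
        return str
    }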
    

    $ awk -f tst.awk whitelist.txt
    ---------
    $0 = control_service:(stop=apache,"disable=apache (everywhere)")
    module = control_service
    raw fns_args = (stop=apache,"disable=apache (everywhere)")
    encoded fns_args = (stop=apache,@Gdisable=apache @Deverywhere@E@G)
    ---------
    $0 = database:add_redo_log_groups
    module = database
    raw fns_args = add_redo_log_groups
    encoded fns_args = add_redo_log_groups
    ---------
    $0 = database:check_data_consistency paralell_run=true
    module = database
    raw fns_args = check_data_consistency paralell_run=true
    encoded fns_args = check_data_consistency paralell_run=true
    ---------
    $0 = sas_certificate
    module = sas_certificate
    raw fns_args =
    encoded fns_args =
    ---------
    $0 = p/a/g/config.sh:[function1(status,"start firewall [1]","stop [all] firewall"),func2] offline_install=true paralell_run=true
    module = p/a/g/config.sh
    raw fns_args = [function1(status,"start firewall [1]","stop [all] firewall"),func2] offline_install=true paralell_run=true
    encoded fns_args = [function1(status,@Gstart firewall @B1@C@G,@Gstop @Ball@C firewall@G),func2] offline_install=true paralell_run=true
    decoded_args = offline_install=true paralell_run=true
    decoded_fn = function1(status
    decoded_fn = "start firewall [1]"
    decoded_fn = "stop [all] firewall")
    decoded_fn = func2
    ---------
    $0 = control_server:stop,disable
    module = control_server
    raw fns_args = stop,disable
    encoded fns_args = stop,disable
    ---------
    $0 = database:kill_sessions paralell_run=true
    module = database
    raw fns_args = kill_sessions paralell_run=true
    encoded fns_args = kill_sessions paralell_run=true
    ---------
    $0 = my_module
    module = my_module
    raw fns_args =
    encoded fns_args =
    ---------
    $0 = my_module:[func1,func2]
    module = my_module
    raw fns_args = [func1,func2]
    encoded fns_args = [func1,func2]
    decoded_args =
    decoded_fn = func1
    decoded_fn = func2
    

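    The above uses GNU awk for several extensions (gensub(), the third array argument to match(), and \s in regexps). It is not intended to be the full script you need; it's just a [big] start showing you a way to solve the problem. Note that the plain split() on commas still breaks function1(status,...) apart at the unquoted commas between its arguments, as the decoded_fn lines show, so the remaining step is to split the function list only on commas that are outside the round brackets.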
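
As one possible way to finish the job, here is a minimal sketch built on the same encode/decode idea. It is not the answer's own script. The names expand_whitelist and emit and the extra codes @H (space) and @I (tab) are made up for this illustration. It assumes that unquoted arguments never contain blanks, that flags start at the first unquoted blank, and that flags should be repeated on every output line, as in the second example in the question (the first example omits them). Malformed lines are not validated. It avoids the GNU-only features so that it should also run under other POSIX awks.

```shell
# Hypothetical helper: expand a whitelist into one line per function/argument.
expand_whitelist() {
    awk '
    {
        # Pass through lines that have no "module:" prefix.
        if ( !match($0, /^[^: \t]+:/) ) { print; next }
        module = substr($0, 1, RLENGTH - 1)
        rest = encode(substr($0, RLENGTH + 1))
        sub(/[ \t]+$/, "", rest)

        # Everything after the first unquoted blank is the flag list.
        flags = ""
        if ( match(rest, /[ \t]+/) ) {
            flags = " " decode(substr(rest, RSTART + RLENGTH))
            rest = substr(rest, 1, RSTART - 1)
        }

        # Drop the optional [ ] around a function list.
        if ( rest ~ /^\[.*\]$/ )
            rest = substr(rest, 2, length(rest) - 2)
        if ( rest == "" ) print

        # Peel off "name" or "name(args)" items one at a time.
        while ( rest != "" ) {
            match(rest, /^[^,(]*(\([^)]*\))?/)
            len = RLENGTH                   # emit() calls match() again
            emit(module, substr(rest, 1, len), flags)
            rest = substr(rest, len + 2)    # also skips the separating comma
        }
    }

    # Print one line per argument, or a single line if there are none.
    function emit(module, item, flags,    name, args, n, i) {
        if ( match(item, /\(.*\)$/) ) {
            name = substr(item, 1, RSTART - 1)
            n = split(substr(item, RSTART + 1, RLENGTH - 2), args, ",")
            for ( i = 1; i <= n; i++ )
                print module ":" name "(" decode(args[i]) ")" flags
            if ( n == 0 ) print module ":" item flags
        } else
            print module ":" decode(item) flags
    }

    # Hide delimiters (and blanks) that occur inside double-quoted strings.
    function encode(str,    out, q) {
        gsub(/@/, "@A", str)
        while ( match(str, /"[^"]*"/) ) {
            q = substr(str, RSTART, RLENGTH)
            gsub(/\[/, "@B", q); gsub(/\]/, "@C", q)
            gsub(/\(/, "@D", q); gsub(/\)/, "@E", q)
            gsub(/,/,  "@F", q); gsub(/"/,  "@G", q)
            gsub(/ /,  "@H", q); gsub(/\t/, "@I", q)
            out = out substr(str, 1, RSTART - 1) q
            str = substr(str, RSTART + RLENGTH)
        }
        return out str
    }

    # Undo encode(); "@A" goes last so a literal "@" is restored correctly.
    function decode(str) {
        gsub(/@I/, "\t", str); gsub(/@H/, " ", str)
        gsub(/@G/, "\"", str); gsub(/@F/, ",", str)
        gsub(/@E/, ")", str);  gsub(/@D/, "(", str)
        gsub(/@C/, "]", str);  gsub(/@B/, "[", str)
        gsub(/@A/, "@", str)
        return str
    }
    ' "$@"
}
```

It could then be called as expand_whitelist whitelist.txt > output_file.txt. Because the quoted strings are encoded before anything is split, the commas, brackets and blanks inside them never take part in the splitting and are restored only when each line is printed.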