
gsub for substituting translations not working


I have a dictionary file, dict, with records separated by ":" and fields separated by newlines, for example:

:one
1
:two
2
:three
3
:four
4

Now I want awk to replace every occurrence of each record's key in the input file with its value, e.g.

onetwotwotwoone
two
threetwoone
four

My first awk script looked like this and works just fine:

BEGIN { RS = ":" ; FS = "\n" }
NR == FNR {
    rep[$1] = $2
    next
}
{
    for (key in rep)
        gsub(key, rep[key])
    print
}

giving me:

12221
2
321
4

Unfortunately, another dict file contains characters that are special in regular expressions, so I have to escape them in my script. After copying key and rep[key] into variables (so that the metacharacters can be escaped), the script only substitutes the second record in the dict. Why? And how can I solve this?

Here's the current second part of the script:

{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}

All scripts are run with awk -f translate.awk dict input

Thanks in advance!


Solution

  • Your fundamental problem is that you're using strings in regexp and backreference contexts when you want literal strings, and then trying to escape the metacharacters to disable the very behavior that those contexts enable. If you want literal strings, use them in string contexts; that's all.

    You don't want this:

    gsub(regexp,backreference-enabled-string)
    

    You want something more like this:

    index(...,string) substr(string)
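
To make the difference concrete, here is a minimal sketch. The key a.c and the replacement x&y are made-up values, not taken from the question's dictionary. With gsub() the dot in the key matches any character and the & in the replacement expands to the matched text, while index() and substr() treat both as plain strings:

```shell
# Regexp context: "a.c" is used as a pattern, "&" as a backreference.
out_gsub=$(printf 'abc a.c\n' | awk '{
    key = "a.c"; val = "x&y"      # hypothetical dictionary entry
    gsub(key, val)
    print
}')

# String context: only the literal text "a.c" is replaced.
out_literal=$(printf 'abc a.c\n' | awk '{
    key = "a.c"; val = "x&y"      # same hypothetical entry
    head = ""; tail = $0
    while ( start = index(tail, key) ) {
        head = head substr(tail, 1, start-1) val
        tail = substr(tail, start+length(key))
    }
    print head tail
}')

echo "$out_gsub"
echo "$out_literal"
```

Under those assumptions the first command should print xabcy xa.cy and the second abc x&y.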
    

    I think this is what you're trying to do:

    $ cat tst.awk
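    BEGIN { FS = ":" }
    NR == FNR {                     # first file: the dictionary
        if ( NR%2 ) {
            key = $2                # odd lines look like ":one", so $2 is the key
        }
        else {
            rep[key] = $0           # even lines hold the replacement text
        }
        next
    }
    {
        for ( key in rep ) {
            head = ""
            tail = $0
            # replace each literal occurrence of key, scanning left to right
            while ( start = index(tail,key) ) {
                head = head substr(tail,1,start-1) rep[key]
                tail = substr(tail,start+length(key))
            }
            $0 = head tail
        }
        print
    }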
    
    $ awk -f tst.awk dict file
    12221
    2
    321
    4
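
As for the "why" in the question, one thing worth checking is that the for loop in the second script has no braces. In awk only the single statement that follows for (...) is the loop body, so orig=key runs for every key while the remaining lines run once per input line, using whichever key the loop happened to visit last. That would explain why only one record is substituted. A minimal sketch of the effect, with made-up array contents and a hypothetical counter named count:

```shell
no_braces=$(awk 'BEGIN {
    rep["one"] = 1; rep["two"] = 2; rep["three"] = 3
    for (key in rep)
        orig = key        # the only statement inside the loop
    count++               # runs once, after the loop has finished
    print count
}')

with_braces=$(awk 'BEGIN {
    rep["one"] = 1; rep["two"] = 2; rep["three"] = 3
    for (key in rep) {
        orig = key
        count++           # now runs once per key
    }
    print count
}')

echo "$no_braces $with_braces"
```

This should print 1 3. Adding braces would make the escaping approach loop over every key, but the replacement string would still be subject to & and backslash handling, which is why the index()/substr() version above is the more robust choice.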