
awk - numbering the match occurrence is not working correctly


I'm trying to highlight and number each occurrence of a matching word, using the awk command below.

Input:

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Command:

python -c 'import this' |  awk -v t="better" ' { gsub(t,"[" "&-" ++a["&"] "]" ,$0); print } ' 

But it looks like gsub() is not working correctly. The input contains 8 occurrences of "better", yet for the last one the command above prints "better-18". How can I fix this?

Although never is often [better-18] than *right* now.   # wrong, should be 8

Expected output:

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

I would like an extensible solution that can handle more words, i.e. words read from another text file, in a single pass over the input file. Each word must match as a whole word, not as a partial match or substring.

match.txt

better
idea

Solution

  • You can't do:

    gsub(t,"[" "&-" ++a["&"] "]" ,$0)
    

    because it does a partial regexp match instead of a full string match and, more importantly, the arguments to gsub() are evaluated before gsub() is called, so ++a["&"] increments the element of a[] indexed by the literal single-character string "&" once per call, i.e. once per input line whether or not that line contains a match. That is why the match on line 18 of the input is numbered 18. It's exactly the same as if you wrote:

    foo=(++a["&"]); gsub(t,"[" "&-" foo "]" ,$0)
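
A quick way to see this is to run the same replacement on a two-line input where only the second line contains the word. This is only a sketch: the `demo` function name and the sample text are made up for illustration, and the parentheses around `++a["&"]` are added just to make the grouping explicit.

```shell
# Hypothetical two-line input: only the second line contains "better".
demo() {
    printf '%s\n' 'no match here' 'this is better' |
    awk -v t="better" '{
        # The replacement string is built (and the counter incremented)
        # before gsub() looks at the line at all.
        gsub(t, "[" "&-" (++a["&"]) "]")
        print
    }'
}

demo
```

The single occurrence comes out as [better-2], because the counter was already incremented once for the first line, which has no match at all.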
    

    This might be what you're trying to do, using GNU awk for patsplit():

    $ cat tst.awk
    NR==FNR {
        words[$1]
        next
    }
    {
        n = patsplit($0,flds,/[[:alnum:]_]+/,seps)
        out = seps[0]
        for (i=1; i<=n; i++) {
            word = flds[i]
            if ( word in words ) {
                 word = "[" word "-" (++cnt[word]) "]"
            }
            out = out word seps[i]
        }
        print out
    }
    

    $ awk -f tst.awk match.txt file
    The Zen of Python, by Tim Peters
    
    Beautiful is [better-1] than ugly.
    Explicit is [better-2] than implicit.
    Simple is [better-3] than complex.
    Complex is [better-4] than complicated.
    Flat is [better-5] than nested.
    Sparse is [better-6] than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is [better-7] than never.
    Although never is often [better-8] than *right* now.
    If the implementation is hard to explain, it's a bad [idea-1].
    If the implementation is easy to explain, it may be a good [idea-2].
    Namespaces are one honking great [idea-3] -- let's do more of those!
    

    The above assumes your definition of a "word" is any sequence of alphanumeric-or-underscore (i.e. "word-constituent") characters - if not then just change [[:alnum:]_]+ to whatever matches your definition of a "word".

    You can do the same in any POSIX awk with a while (match(..,/[[:alnum:]_]+/)) substr(... loop - left as an exercise if you want that.
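
For reference, here is one way that POSIX loop might look. It is a sketch rather than the answerer's code: the `number_words` wrapper, the `rest` variable, and the small sample files are assumptions made up for illustration.

```shell
# number_words WORDS_FILE INPUT_FILE
# WORDS_FILE holds one word per line; INPUT_FILE is the text to annotate.
number_words() {
    awk '
    NR==FNR {                 # first file: remember the words to number
        words[$1]
        next
    }
    {
        rest = $0             # part of the line not yet processed
        out  = ""             # annotated line built so far
        while ( match(rest, /[[:alnum:]_]+/) ) {
            word = substr(rest, RSTART, RLENGTH)
            out  = out substr(rest, 1, RSTART-1)    # separator before the word
            if ( word in words ) {
                word = "[" word "-" (++cnt[word]) "]"
            }
            out  = out word
            rest = substr(rest, RSTART+RLENGTH)     # continue after the word
        }
        print out rest        # append any trailing separator text
    }
    ' "$1" "$2"
}

# Small made-up sample, not the full input from the question.
tmp=$(mktemp -d)
printf '%s\n' better idea > "$tmp/match.txt"
printf '%s\n' 'Now is better than never.' \
              'A better idea beats a bad idea.' \
              'Not betterment.' > "$tmp/file"
number_words "$tmp/match.txt" "$tmp/file"
```

On this sample the loop numbers each listed word independently ([better-1], [better-2], [idea-1], [idea-2]) and leaves "betterment" alone, since the whole token is compared against the word list rather than a substring of it.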

    If you instead want to handle a single word passed as a variable assignment:

    $ cat tst.awk
    {
        n = patsplit($0,flds,/[[:alnum:]_]+/,seps)
        out = seps[0]
        for (i=1; i<=n; i++) {
            word = flds[i]
            if ( word == t ) {
                 word = "[" word "-" (++cnt) "]"
            }
            out = out word seps[i]
        }
        print out
    }
    

    $ awk -v t='better' -f tst.awk file
    The Zen of Python, by Tim Peters
    
    Beautiful is [better-1] than ugly.
    Explicit is [better-2] than implicit.
    Simple is [better-3] than complex.
    Complex is [better-4] than complicated.
    Flat is [better-5] than nested.
    Sparse is [better-6] than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is [better-7] than never.
    Although never is often [better-8] than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!