How do I count multiple overlapping strings and get the total occurrences per line (awk or anything else)?


I have an input file like this:

315secondbin    x12121321211332123x
315firstbin 3212212121x
315thirdbin 132221312
316firstbin 121
316secondbin    1212

What I want to do is count how many instances of a few different strings (say "121" and "212") exist in each line, counting overlapping matches. So my expected output would be:

6
5
0
1
2

So I slightly modified some awk from another thread to use the OR operator in hopes that it would count up everything that meets either condition:

{
count = 0
$0 = tolower($0)
while (length() > 0) {
    m = match($0, /212/ || /121/)
    if (m == 0)
         break
    count++
    $0 = substr($0, m + 1)
}
print count
}

Unfortunately, my output is this:

8
4
0
2
3

But if I leave out the OR it counts perfectly. What am I doing wrong?

Also, I run the script on the file ymaz.txt by running:

 cat ymaz.txt | awk -v "pattern=" -f count3.awk

As an alternate approach I tried this:

{
count = 0
$0 = tolower($0)
while (length() > 0) {
    m = match($0, /212/)
    y = match($0, /121/)
    if ((m == 0) && (y == 0))
         break
    count++
    $0 = substr($0, (m + 1) + (y + 1))
}
print count
}

but my output was this:

1
1
0
1
1

What am I doing wrong? I know I should be understanding the code and not cutting and pasting stuff together, but that's my skill level at this point.

BTW, when I don't have the OR in there (i.e. I'm just searching for one string) it works perfectly.


Solution

  • You're making it too complicated:

    {
        count=0
        while ( match($0,/121|212/) ) {
            count++
            $0=substr($0,RSTART+1)
        }
        print count
    }
    
    $ awk -f tst.awk file
    6
    5
    0
    1
    2
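
If you would rather not overwrite $0 (say you want to keep the label in the first column around), the same loop can be wrapped in a function. This is only a sketch: the names count_matches and count_overlaps are made up for illustration, and it assumes the digits are always in the 2nd whitespace-separated field, as they are in the sample input.

```shell
# Hypothetical wrapper: count overlapping matches of 121 or 212 in field 2.
count_overlaps() {
    awk '
    # count_matches is an illustrative helper, not a built-in.
    # n is a local variable (awk idiom: extra parameter).
    function count_matches(str, re,    n) {
        n = 0
        while (match(str, re)) {
            n++
            # Restart one character past the start of this match so
            # overlapping matches are found too.
            str = substr(str, RSTART + 1)
        }
        return n
    }
    { print count_matches($2, "121|212") }
    ' "$@"
}

# Usage, with the sample input from the question:
printf '%s\n' \
    '315secondbin    x12121321211332123x' \
    '315firstbin 3212212121x' \
    '315thirdbin 132221312' \
    '316firstbin 121' \
    '316secondbin    1212' | count_overlaps
```

On the sample input this should print the same 6, 5, 0, 1, 2 as the script above, since the first column never contains 121 or 212 anyway.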
    

    Your fundamental problem is that you were confusing a condition with a regexp. A regexp can be compared with a string to form a condition, and when the string in question is $0 you can leave it out and use just the regexp as shorthand for $0 ~ regexp, but what's being tested in that context is still a condition. The 2nd arg for match() is a regexp, not a condition. | is the "or" operator in a regexp, while || is the "or" operator in a condition. /.../ are the regexp delimiters.

    /foo/ is a regexp

    $0 ~ /foo/ is a condition

    /foo/ in a conditional context is shorthand for $0 ~ /foo/ but in any other context is just a regexp.
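
To see that a bare regexp and the explicit comparison are the same test when used as conditions, here is a small sketch (the shell function names are made up for illustration). As pattern-only awk programs, all three should print the same lines, namely every line that contains 121 or 212:

```shell
# Illustrative helpers: three ways of writing the same condition.
lines_bare()        { awk '/121/ || /212/' "$@"; }            # shorthand conditions joined by ||
lines_explicit()    { awk '$0 ~ /121/ || $0 ~ /212/' "$@"; }  # the same thing spelled out
lines_alternation() { awk '/121|212/' "$@"; }                 # one regexp using | inside it

# Usage: the third sample line contains neither string, so it is dropped.
printf '%s\n' \
    '315firstbin 3212212121x' \
    '315thirdbin 132221312' \
    '316firstbin 121' | lines_bare
```

Only the last form, a single regexp, is something you can hand to match() as its 2nd arg.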

    /foo/ || /bar/ in a conditional context is shorthand for $0 ~ /foo/ || $0 ~ /bar/ but as the 2nd arg to match() awk actually assumes you intended to write:

    match($0,($0 ~ /foo/ || $0 ~ /bar/))
    

    i.e. it will test the current record against foo or bar, and that condition evaluates to 1 if either matches (0 otherwise). That number is then given to match() as its 2nd arg, where it is treated as the regexp "1" (or "0").

    Look:

    $ echo foo | gawk 'match($0,/foo/||/bar/)'        
    $ echo foo | gawk '{print /foo/||/bar/}'  
    1
    $ echo 1foo | gawk 'match($0,/foo/||/bar/)'       
    1foo
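
The same thing explains the counts in the question. A rough sketch (the function name show_or_arg is made up) that prints the value of the condition next to what match() returns when it is handed that condition:

```shell
# Hypothetical helper: show the condition value and the match() result
# for one input line passed as the first argument.
show_or_arg() {
    printf '%s\n' "$1" | awk '{
        cond = (/212/ || /121/)            # 1 or 0: a number, not a regexp
        pos  = match($0, /212/ || /121/)   # so this searches for "1" or "0"
        print cond, pos
    }'
}

show_or_arg '315firstbin 3212212121x'   # prints "1 2": match() found the "1" in "315"
show_or_arg 'x0y'                       # prints "0 2": match() found the literal "0"
```

So the loop in the question was really counting 1s for as long as the remaining text still contained 121 or 212, which is where the 8 for the first line comes from (the 1 in 315 gets counted too).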
    

    Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.