I have an input file like this:
315secondbin x12121321211332123x
315firstbin 3212212121x
315thirdbin 132221312
316firstbin 121
316secondbin 1212
What I want to do is count how many times any of a few different strings (say "121" and "212") occur in each line, counting overlapping matches. So my expected output would be:
6
5
0
1
2
So I slightly modified some awk from another thread to use the OR operator in hopes that it would count up everything that meets either condition:
{
    count = 0
    $0 = tolower($0)
    while (length() > 0) {
        m = match($0, /212/ || /121/)
        if (m == 0)
            break
        count++
        $0 = substr($0, m + 1)
    }
    print count
}
Unfortunately, my output is this:
8
4
0
2
3
But if I leave out the OR it counts perfectly. What am I doing wrong?
Also, I run the script on the file ymaz.txt by running:
cat ymaz.txt | awk -v "pattern=" -f count3.awk
As an alternate approach I tried this:
{
    count = 0
    $0 = tolower($0)
    while (length() > 0) {
        m = match($0, /212/)
        y = match($0, /121/)
        if ((m == 0) && (y == 0))
            break
        count++
        $0 = substr($0, (m + 1) + (y + 1))
    }
    print count
}
but my output was this:
1
1
0
1
1
What am I doing wrong? I know I should be understanding the code and not cutting and pasting stuff together, but that's my skill level at this point.
BTW when I don't have the OR in there (i.e. I'm just searching for one string) it works perfectly.
You're making it too complicated:
{
    count=0
    while ( match($0,/121|212/) ) {    # | is alternation inside the regexp
        count++
        $0=substr($0,RSTART+1)         # restart one char past the match start so overlaps count
    }
    print count
}
$ awk -f tst.awk file
6
5
0
1
2
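If the strings to count need to change between runs (the -v "pattern=" in your command line suggests that was the plan), one possible variation is to pass the regexp in as a string and let match() use it as a dynamic regexp. This is only a sketch: the function name count_overlaps and the variable name pat are made up for illustration, and it works on a copy of the record instead of modifying $0.

```shell
# count_overlaps is a hypothetical wrapper; pat is an assumed variable name.
count_overlaps() {
    awk -v pat="$1" '
    {
        count = 0
        s = $0                          # work on a copy so $0 is left intact
        while (match(s, pat)) {         # pat is used as a dynamic regexp
            count++
            s = substr(s, RSTART + 1)   # step one character past the match start
        }
        print count
    }'
}

# Example: the first and fourth lines of the sample input
printf '%s\n' '315secondbin x12121321211332123x' '316firstbin 121' |
    count_overlaps '121|212'
```

With the sample lines above this should print 6 and then 1. Note that backslash escapes in a value passed with -v are processed by awk, so a pattern containing backslashes would need extra care.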
Your fundamental problem is that you were confusing a condition with a regexp. A regexp can be compared with a string to form a condition, and when the string in question is $0 you can leave it out and just use regexp as a shorthand for $0 ~ regexp, but in that context what's being tested is still a condition. The 2nd arg for match() is a regexp, not a condition. | is the "or" operator in a regexp while || is the "or" operator in a condition. /.../ are the regexp delimiters.
/foo/ is a regexp.
$0 ~ /foo/ is a condition.
/foo/ in a conditional context is shorthand for $0 ~ /foo/ but in any other context is just a regexp.
/foo/ || /bar/ in a conditional context is shorthand for $0 ~ /foo/ || $0 ~ /bar/ but as the 2nd arg to match() awk actually assumes you intended to write:
match($0,($0 ~ /foo/ || $0 ~ /bar/))
i.e. it will test the current record against foo or bar, and if either matches then that condition evaluates to 1 and that 1 is then given to match() as its 2nd arg.
Look:
$ echo foo | gawk 'match($0,/foo/||/bar/)'
$ echo foo | gawk '{print /foo/||/bar/}'
1
$ echo 1foo | gawk 'match($0,/foo/||/bar/)'
1foo
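That also accounts for the counts you were getting. As a rough illustration (the function name broken_count is made up, and the condition is written out explicitly here rather than relying on how a given awk parses /212/ || /121/ inside match()), the loop below counts "1" characters for as long as what is left of the record still contains 212 or 121, and then searches for "0" once it no longer does:

```shell
# broken_count is a hypothetical name; it spells out what the original script ends up doing.
broken_count() {
    awk '
    {
        count = 0
        while (length($0) > 0) {
            cond = ($0 ~ /212/ || $0 ~ /121/)   # evaluates to 1 or 0
            m = match($0, cond "")              # so the regexp searched for is "1" or "0"
            if (m == 0)
                break
            count++
            $0 = substr($0, m + 1)
        }
        print count
    }'
}

printf '%s\n' '315secondbin x12121321211332123x' '315firstbin 3212212121x' |
    broken_count
```

For those two lines this should print 8 and 4, the same wrong numbers as in the question. The 1 in the leading 315 is counted too, which is why some results come out higher than expected.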
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.