I am writing a gawk program that examines the string STRING to see where it matches SUBSTRING. One problem I have run into is that the match function only reports the leftmost match in the string. My current thought is to use gsub to find out how many times SUBSTRING is present, and then call match repeatedly on the remainder of the string, substr(STRING, RSTART+1), to find the true start position of each occurrence (with some adjustments to the code, of course). Is there an easier way than this, or a built-in function that gives all the RSTART values?
Example:
STRING=DDDADDCDFFDFGSDD
SUBSTRING=D
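The approach described above (match, then match again on the remainder while tracking how much of the string has been skipped) can be sketched roughly as follows. This is only an illustration: all_starts is a hypothetical helper name, and the script uses nothing beyond the standard match and substr functions, so it should behave the same in gawk and other awks.

```shell
# all_starts STRING REGEX: print every 1-based position at which REGEX matches STRING.
# (all_starts is a hypothetical helper name, not something awk provides.)
all_starts() {
    awk -v str="$1" -v re="$2" 'BEGIN {
        offset = 0                      # characters of str already skipped
        rest = str
        out = ""
        while (length(rest) > 0 && match(rest, re)) {
            pos = offset + RSTART       # RSTART is relative to rest, so add the offset
            out = out (out == "" ? "" : " ") pos
            offset = pos                # resume one character after this match start
            rest = substr(str, offset + 1)
        }
        print out
    }'
}

all_starts DDDADDCDFFDFGSDD D
```

For the example string this prints the positions 1 2 3 5 6 8 11 15 16. Because the search resumes one character after each match start, overlapping matches are reported as well.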
EDIT:
I looked at the array argument of match (thanks for pointing me to more up-to-date documentation than I had been reading). This still doesn't do what I need: it lets you capture several subexpressions in one search, but it still only gives the leftmost location of each of them.
For example:
$ echo DDDADDCDFFDFGSDD | gawk '{match($0,/D/,a); for (i in a) print i,a[i]}'
0start 1
0length 1
0 D
It does work for finding the leftmost match of several subexpressions:
$ echo gDDDADDCDFFDFGSDD | gawk '{match($0,/(D)(A)/,a); for (i in a) print i,a[i]}'
0start 4
0length 2
1start 4
2start 5
2length 1
1length 1
0 DA
1 D
2 A
So we are still only finding the leftmost match (which is what the documentation says it will do).
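I have not found a native way to do this, so I wrote the function below. It only works with versions of gawk that support arrays of arrays (gawk 4.0 and later); adapting it to older versions of awk would be simple, though parsing the results afterwards would be more difficult.
The function searches through the string for the regex and populates the global array MM. It returns -1 if there was an error, 0 if no matches were found, and otherwise the number of matches found. Each search resumes one character after the start of the previous match, so overlapping matches are reported. Note that MM[n]["RSTART"] is relative to the remainder that was searched, not to the whole string; the position in the whole string is the length of the prefix in MM[n]["STR"] plus one.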
function multiMatch(string, subs,    t, s, found, n) {
    split("", MM)    # clear results left over from a previous call
    RLENGTH = 0
    RSTART = 0
    t = 0            # number of leading characters already skipped
    n = 0            # matches recorded so far (also the next index into MM)
    if (length(string) == 0 || length(subs) == 0) {
        print "Must have string and Regex to look for"
        return -1
    }
    while (1) {
        # resume the search one character after the previous match start
        t = RSTART + t
        s = substr(string, t + 1)
        if (length(s) == 0) {
            break
        }
        match(s, subs)
        if (RLENGTH == -1) {
            break
        }
        # prefix, match, and remainder of the whole string, joined by "-"
        found = substr(string, 1, t + RSTART - 1) "-" substr(string, t + RSTART, RLENGTH) "-" substr(string, t + RSTART + RLENGTH)
        MM[n]["RSTART"] = RSTART
        MM[n]["RLENGTH"] = RLENGTH
        MM[n]["STR"] = found
        n++
    }
    return n
}
Example
echo 'doogggogogggggggooogggogggggooogoooggoooo g*o' | gawk '
# paste the multiMatch() definition from above here
BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"}
{
    print "Found "multiMatch($1,$2)" Matches"
    for (x in MM) {
        print x,MM[x]["RSTART"],MM[x]["RLENGTH"],MM[x]["STR"]
    }
}'
OUTPUT
Found 40 Matches
0 2 1 d-o-ogggogogggggggooogggogggggooogoooggoooo
1 1 1 do-o-gggogogggggggooogggogggggooogoooggoooo
2 1 4 doo-gggo-gogggggggooogggogggggooogoooggoooo
3 1 3 doog-ggo-gogggggggooogggogggggooogoooggoooo
4 1 2 doogg-go-gogggggggooogggogggggooogoooggoooo
5 1 1 dooggg-o-gogggggggooogggogggggooogoooggoooo
6 1 2 doogggo-go-gggggggooogggogggggooogoooggoooo
7 1 1 doogggog-o-gggggggooogggogggggooogoooggoooo
8 1 8 doogggogo-gggggggo-oogggogggggooogoooggoooo
9 1 7 doogggogog-ggggggo-oogggogggggooogoooggoooo
10 1 6 doogggogogg-gggggo-oogggogggggooogoooggoooo
11 1 5 doogggogoggg-ggggo-oogggogggggooogoooggoooo
12 1 4 doogggogogggg-gggo-oogggogggggooogoooggoooo
13 1 3 doogggogoggggg-ggo-oogggogggggooogoooggoooo
14 1 2 doogggogogggggg-go-oogggogggggooogoooggoooo
15 1 1 doogggogoggggggg-o-oogggogggggooogoooggoooo
16 1 1 doogggogogggggggo-o-ogggogggggooogoooggoooo
17 1 1 doogggogogggggggoo-o-gggogggggooogoooggoooo
18 1 4 doogggogogggggggooo-gggo-gggggooogoooggoooo
19 1 3 doogggogogggggggooog-ggo-gggggooogoooggoooo
20 1 2 doogggogogggggggooogg-go-gggggooogoooggoooo
21 1 1 doogggogogggggggoooggg-o-gggggooogoooggoooo
22 1 6 doogggogogggggggooogggo-gggggo-oogoooggoooo
23 1 5 doogggogogggggggooogggog-ggggo-oogoooggoooo
24 1 4 doogggogogggggggooogggogg-gggo-oogoooggoooo
25 1 3 doogggogogggggggooogggoggg-ggo-oogoooggoooo
26 1 2 doogggogogggggggooogggogggg-go-oogoooggoooo
27 1 1 doogggogogggggggooogggoggggg-o-oogoooggoooo
28 1 1 doogggogogggggggooogggogggggo-o-ogoooggoooo
29 1 1 doogggogogggggggooogggogggggoo-o-goooggoooo
30 1 2 doogggogogggggggooogggogggggooo-go-ooggoooo
31 1 1 doogggogogggggggooogggogggggooog-o-ooggoooo
32 1 1 doogggogogggggggooogggogggggooogo-o-oggoooo
33 1 1 doogggogogggggggooogggogggggooogoo-o-ggoooo
34 1 3 doogggogogggggggooogggogggggooogooo-ggo-ooo
35 1 2 doogggogogggggggooogggogggggooogooog-go-ooo
36 1 1 doogggogogggggggooogggogggggooogooogg-o-ooo
37 1 1 doogggogogggggggooogggogggggooogoooggo-o-oo
38 1 1 doogggogogggggggooogggogggggooogoooggoo-o-o
39 1 1 doogggogogggggggooogggogggggooogoooggooo-o-
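As mentioned above, the function could be adapted to awks without arrays of arrays. One possible sketch, under the assumption that comma-joined (SUBSEP) subscripts such as MM[n, "START"] are an acceptable substitute for MM[n]["..."], is shown here. The names multi_match_demo and multiMatchFlat and the keys "START" and "LENGTH" are hypothetical, and unlike the function above this variant stores the position in the whole string rather than the relative RSTART.

```shell
# multi_match_demo STRING REGEX: print "start length" for every match, one per line.
# (multi_match_demo and multiMatchFlat are hypothetical names used for illustration.)
multi_match_demo() {
    awk -v text="$1" -v pattern="$2" '
    # Same scan as multiMatch, but results go into MM[n, "START"] and
    # MM[n, "LENGTH"] (SUBSEP-joined keys) so no arrays of arrays are needed.
    function multiMatchFlat(string, regex,    t, s, n) {
        split("", MM)                       # clear results from a previous call
        n = 0
        t = 0                               # characters of string already skipped
        while (t < length(string)) {
            s = substr(string, t + 1)
            if (!match(s, regex)) break     # no further match in the remainder
            MM[n, "START"]  = t + RSTART    # position within the whole string
            MM[n, "LENGTH"] = RLENGTH
            t += RSTART                     # resume one character after this start
            n++
        }
        return n
    }
    BEGIN {
        count = multiMatchFlat(text, pattern)
        for (i = 0; i < count; i++)
            print MM[i, "START"], MM[i, "LENGTH"]
    }'
}

multi_match_demo gDDDADDCDFFDFGSDD DA
```

For the string and pattern from the question's second example this prints 4 2, which agrees with the 0start and 0length values that match reported there. Iterating with a counter instead of for (x in MM) also avoids any dependence on PROCINFO["sorted_in"] for ordering.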