
Find RSTART for multiple matches in a line


I am writing a gawk program that examines the string STRING to find where it matches SUBSTRING. The problem I have run into is that the match function only reports the leftmost match in the string. My current thought is to use gsub to count how many times SUBSTRING is present, and then call match repeatedly on the remainder, substr(STRING, RSTART+1), adjusting the offsets to recover the true start position of each match. Is there an easier way than this, or a built-in function that returns all of the RSTART values?

Example:
STRING=DDDADDCDFFDFGSDD
SUBSTRING=D

EDIT:

I looked at the array form of match, its optional third argument (thanks for pointing me to more up-to-date documentation than I had been reading). This still doesn't do what I need: it lets you capture several parenthesized subexpressions in the same string, but it still reports only the leftmost location of each.

For example:

$ echo DDDADDCDFFDFGSDD | gawk '{match($0,/D/,a); for (i in a) print i,a[i]}'
0start 1
0length 1
0 D

It does find the leftmost occurrence of each of several subexpressions:

$ echo gDDDADDCDFFDFGSDD | gawk '{match($0,/(D)(A)/,a); for (i in a) print i,a[i]}'
0start 4
0length 2
1start 4
2start 5
2length 1
1length 1
0 DA
1 D
2 A

So we are still only finding the leftmost match (which is what the documentation says it will do).


Solution

  • I have not found a native way to do this, so I wrote the function below. It requires a version of gawk that supports multidimensional arrays (arrays of arrays, gawk 4.0 and later). Adapting it to older versions of awk would be simple, although parsing the results afterwards would be more difficult.

    The function searches through the string for the regex and populates an array MM. It returns -1 if there was an error, 0 if there were no matches found, else it returns the number of matches found.

    function multiMatch(string, subs,    s, t, n, found) {
        # Results go in the global array MM; s, t, n and found are locals.
        split("", MM)
        RLENGTH = 0
        RSTART = 0
        t = 0
        n = 0
        if (length(string) == 0 || length(subs) == 0) {
                print "Must have string and Regex to look for" > "/dev/stderr"
                return -1
        }
        while (1) {
                # Resume one character past the start of the previous match,
                # so overlapping matches are reported too.
                t = RSTART + t
                s = substr(string, t + 1)
                if (length(s) == 0) {
                        break
                }
                match(s, subs)
                if (RLENGTH == -1) {
                        break
                }
                # RSTART is relative to s; the position in string is t + RSTART.
                found = substr(string, 1, t + RSTART - 1) "-" substr(string, t + RSTART, RLENGTH) "-" substr(string, t + RSTART + RLENGTH)
                MM[n]["RSTART"] = RSTART
                MM[n]["RLENGTH"] = RLENGTH
                MM[n]["STR"] = found
                n++
        }
        return n
    }


    Example

    echo doogggogogggggggooogggogggggooogoooggoooo 'g*o' | gawk '
    # paste the multiMatch definition from above here
    BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"}
    {
            print "Found "multiMatch($1,$2)" Matches"
            for (x in MM) {
                    print x,MM[x]["RSTART"],MM[x]["RLENGTH"],MM[x]["STR"]
            }
    }'


    OUTPUT

    Found 40 Matches
    0 2 1 d-o-ogggogogggggggooogggogggggooogoooggoooo
    1 1 1 do-o-gggogogggggggooogggogggggooogoooggoooo
    2 1 4 doo-gggo-gogggggggooogggogggggooogoooggoooo
    3 1 3 doog-ggo-gogggggggooogggogggggooogoooggoooo
    4 1 2 doogg-go-gogggggggooogggogggggooogoooggoooo
    5 1 1 dooggg-o-gogggggggooogggogggggooogoooggoooo
    6 1 2 doogggo-go-gggggggooogggogggggooogoooggoooo
    7 1 1 doogggog-o-gggggggooogggogggggooogoooggoooo
    8 1 8 doogggogo-gggggggo-oogggogggggooogoooggoooo
    9 1 7 doogggogog-ggggggo-oogggogggggooogoooggoooo
    10 1 6 doogggogogg-gggggo-oogggogggggooogoooggoooo
    11 1 5 doogggogoggg-ggggo-oogggogggggooogoooggoooo
    12 1 4 doogggogogggg-gggo-oogggogggggooogoooggoooo
    13 1 3 doogggogoggggg-ggo-oogggogggggooogoooggoooo
    14 1 2 doogggogogggggg-go-oogggogggggooogoooggoooo
    15 1 1 doogggogoggggggg-o-oogggogggggooogoooggoooo
    16 1 1 doogggogogggggggo-o-ogggogggggooogoooggoooo
    17 1 1 doogggogogggggggoo-o-gggogggggooogoooggoooo
    18 1 4 doogggogogggggggooo-gggo-gggggooogoooggoooo
    19 1 3 doogggogogggggggooog-ggo-gggggooogoooggoooo
    20 1 2 doogggogogggggggooogg-go-gggggooogoooggoooo
    21 1 1 doogggogogggggggoooggg-o-gggggooogoooggoooo
    22 1 6 doogggogogggggggooogggo-gggggo-oogoooggoooo
    23 1 5 doogggogogggggggooogggog-ggggo-oogoooggoooo
    24 1 4 doogggogogggggggooogggogg-gggo-oogoooggoooo
    25 1 3 doogggogogggggggooogggoggg-ggo-oogoooggoooo
    26 1 2 doogggogogggggggooogggogggg-go-oogoooggoooo
    27 1 1 doogggogogggggggooogggoggggg-o-oogoooggoooo
    28 1 1 doogggogogggggggooogggogggggo-o-ogoooggoooo
    29 1 1 doogggogogggggggooogggogggggoo-o-goooggoooo
    30 1 2 doogggogogggggggooogggogggggooo-go-ooggoooo
    31 1 1 doogggogogggggggooogggogggggooog-o-ooggoooo
    32 1 1 doogggogogggggggooogggogggggooogo-o-oggoooo
    33 1 1 doogggogogggggggooogggogggggooogoo-o-ggoooo
    34 1 3 doogggogogggggggooogggogggggooogooo-ggo-ooo
    35 1 2 doogggogogggggggooogggogggggooogooog-go-ooo
    36 1 1 doogggogogggggggooogggogggggooogooogg-o-ooo
    37 1 1 doogggogogggggggooogggogggggooogoooggo-o-oo
    38 1 1 doogggogogggggggooogggogggggooogoooggoo-o-o
    39 1 1 doogggogogggggggooogggogggggooogoooggooo-o-
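
If all that is needed is the absolute start position of each non-overlapping match, as in the question's DDDADDCDFFDFGSDD example, the loop can be shortened. The following is only a minimal sketch and not the function above: `all_starts`, `pos`, and `re` are names made up for illustration, and it assumes each search should resume right after the end of the previous match. It uses only `match`, `substr`, and `split`, so it should run under any POSIX awk, gawk included.

```shell
# all_starts is a hypothetical helper, not a gawk built-in.
result=$(echo DDDADDCDFFDFGSDD | awk -v re='D' '
# Fill starts[1..n] with the absolute start position of every
# non-overlapping match of re in str; return n.
function all_starts(str, re, starts,    off, n, step) {
    split("", starts)                 # clear any previous results
    off = 0; n = 0
    while (length(str) > 0 && match(str, re)) {
        starts[++n] = off + RSTART    # RSTART is relative to the remaining text
        # Skip past this match, moving at least one character so that
        # an empty match cannot stall the loop.
        step = RSTART - 1 + (RLENGTH > 0 ? RLENGTH : 1)
        off += step
        str = substr(str, step + 1)
    }
    return n
}
{
    n = all_starts($0, re, pos)
    for (i = 1; i <= n; i++)
        printf "%s%s", pos[i], (i < n ? " " : "\n")
}')
echo "$result"   # prints: 1 2 3 5 6 8 11 15 16
```

Because every search restarts on the remaining text, an anchor such as `^` would match at each restart point rather than only at the beginning of the original string, so this sketch is best suited to unanchored patterns.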