awk string-matching

Partial matches in 2 columns following exact match


I need to do an exact match followed by a partial match and retrieve the strings from two columns. I would ideally like to do this with awk.

Input:

k141_18046_1    k141_18046_1
k141_18046_1    k141_18046_2
k141_18046_2    k141_18046_1
k141_12033_1    k141_18046_2
k141_12033_1    k141_12033_1
k141_12033_2    k141_12033_2
k141_2012_1     k141_2012_1
k141_2012_1     k141_2012_2
k141_2012_2     k141_2012_1
k141_21_1     k141_2012_2
k141_21_1       k141_21_1
k141_21_2       k141_21_2

Expected output:

k141_18046_1    k141_18046_2
k141_18046_2    k141_18046_1
k141_2012_1     k141_2012_2
k141_2012_2     k141_2012_1

In both columns, the first part of the ID (everything before the last underscore) is the same. I need the rows where the two IDs differ only in the last part, i.e. where either ID_1 and ID_2 or ID_2 and ID_1 appear together in a single row.

Thank you, Susheel


Solution

  • Updated based on comments:

    $ awk '
    $1!=$2 {                     # consider only rows where the two IDs differ
        n=split($1,a,/_/)        # split each ID on underscores
        m=split($2,b,/_/)
        if(m==n) {               # both IDs must have the same number of parts
            for(i=1;i<n;i++)
                if(a[i]!=b[i])   # every part except the last must match
                    next         # otherwise skip the row
        } else
            next
        print                    # all checks passed: print the row
    }' file
    

    Output:

    k141_18046_1    k141_18046_2
    k141_18046_2    k141_18046_1
    k141_2012_1     k141_2012_2
    k141_2012_2     k141_2012_1
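
To see why the loop compares every part except the last, it can help to look at what split() returns for a single ID. The following is only an illustrative sketch; show_parts is a hypothetical helper name and is not part of the answer above.

```shell
# show_parts: hypothetical helper, used only for this illustration.
# Prints the array that split() builds from one ID.
show_parts() {
    printf '%s\n' "$1" | awk '{
        n = split($1, a, /_/)    # n = number of underscore-separated parts
        for (i = 1; i <= n; i++)
            printf "a[%d]=%s%s", i, a[i], (i < n ? " " : "\n")
    }'
}

show_parts k141_18046_1
```

For k141_18046_1 this prints a[1]=k141 a[2]=18046 a[3]=1, so a[1] and a[2] form the shared prefix and a[3] is the part that is allowed to differ.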
    

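    Another awk, using match(). The text up to and including the last underscore must be equal in both columns, and the text after it must differ: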

    $ awk '
    substr($1,match($1,/^.*_/),RLENGTH) == substr($2,match($2,/^.*_/),RLENGTH) && 
    substr($1,match($1,/[^_]*$/),RLENGTH) != substr($2,match($2,/[^_]*$/),RLENGTH)
    ' file
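
A possible third variant, sketched here on the assumption that each ID ends in an underscore followed by a final part, strips that final part with sub() and compares what is left. The function name same_prefix_diff_suffix and the variables p1 and p2 are made up for this example.

```shell
# same_prefix_diff_suffix: hypothetical wrapper name for this sketch.
# Reads from the files given as arguments, or from stdin if none are given.
same_prefix_diff_suffix() {
    awk '
    {
        p1 = $1; p2 = $2            # work on copies so $0 is printed unchanged
        sub(/_[^_]*$/, "", p1)      # drop the last underscore and what follows
        sub(/_[^_]*$/, "", p2)
    }
    p1 == p2 && $1 != $2            # same prefix, different full ID: print row
    ' "$@"
}
```

It would be called as same_prefix_diff_suffix file. Because the copies are modified rather than the fields themselves, the original spacing of each printed row is preserved.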