I need to do an exact match followed by a partial match and retrieve the strings from two columns. I would ideally like to do this with awk.
Input:
k141_18046_1 k141_18046_1
k141_18046_1 k141_18046_2
k141_18046_2 k141_18046_1
k141_12033_1 k141_18046_2
k141_12033_1 k141_12033_1
k141_12033_2 k141_12033_2
k141_2012_1 k141_2012_1
k141_2012_1 k141_2012_2
k141_2012_2 k141_2012_1
k141_21_1 k141_2012_2
k141_21_1 k141_21_1
k141_21_2 k141_21_2
Expected output:
k141_18046_1 k141_18046_2
k141_18046_2 k141_18046_1
k141_2012_1 k141_2012_2
k141_2012_2 k141_2012_1
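In both columns, the first part of the ID (everything before the last underscore) is the same. I need the rows where one column holds ID_1 and the other holds ID_2 of the same ID, in either order, i.e. the first part matches exactly and only the final suffix differs.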
Thank you, Susheel
Updated based on comments:
$ awk '
$1!=$2 {                          # consider only rows whose two IDs differ
    n=split($1,a,/_/)             # split both IDs on underscores
    m=split($2,b,/_/)
    if(m==n) {                    # both must have the same number of parts
        for(i=1;i<n;i++)
            if(a[i]!=b[i])        # every part except the last must match
                next              # otherwise skip the row
    } else
        next
    print                         # all checks passed
}' file
Output:
k141_18046_1 k141_18046_2
k141_18046_2 k141_18046_1
k141_2012_1 k141_2012_2
k141_2012_2 k141_2012_1
Another awk, using match()
$ awk '
substr($1,match($1,/^.*_/),RLENGTH) == substr($2,match($2,/^.*_/),RLENGTH) &&
substr($1,match($1,/[^_]*$/),RLENGTH) != substr($2,match($2,/[^_]*$/),RLENGTH)
' file
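The match() version above reads RLENGTH in the same argument list that calls match(), so it depends on the arguments being evaluated left to right. That holds in the awks I have tried, but I would not assume it everywhere. Here is a minimal sketch of the same test with the two pieces pulled into helper functions. The names same_prefix_pairs, prefix and suffix are made up for illustration and are not part of the answer above, and the sample file is simply the input from the question.

```shell
# Recreate the input from the question in a file named "file".
cat > file <<'EOF'
k141_18046_1 k141_18046_1
k141_18046_1 k141_18046_2
k141_18046_2 k141_18046_1
k141_12033_1 k141_18046_2
k141_12033_1 k141_12033_1
k141_12033_2 k141_12033_2
k141_2012_1 k141_2012_1
k141_2012_1 k141_2012_2
k141_2012_2 k141_2012_1
k141_21_1 k141_2012_2
k141_21_1 k141_21_1
k141_21_2 k141_21_2
EOF

# same_prefix_pairs (hypothetical name): print rows whose two IDs share
# everything up to the last underscore but differ in what follows it.
same_prefix_pairs() {
    awk '
    # prefix (hypothetical helper): text up to and including the last underscore,
    # or the empty string when there is no underscore at all
    function prefix(s) { return match(s, /^.*_/) ? substr(s, 1, RLENGTH) : "" }

    # suffix (hypothetical helper): text after the last underscore
    function suffix(s) { match(s, /[^_]*$/); return substr(s, RSTART) }

    # a pattern with no action prints the matching row
    prefix($1) == prefix($2) && suffix($1) != suffix($2)
    ' "$1"
}

same_prefix_pairs file
```

On the sample input this prints the same four rows as the expected output in the question. Each helper finishes its match() call before RSTART or RLENGTH is read, so the result no longer depends on argument evaluation order.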