
GREP Locating target letters in identical positions


I have a file with lines containing pairs of letter strings such as

ABXF\\CDYG

and a pair of target letters, for example X and Y (the target letters may vary). I would like to find all the lines where the two target letters occupy the same position in their respective letter strings (in this example, both are at position 3). The position could be anywhere, including the very first or the very last. The two letter strings always have the same length.

How could I do such a search with regular expressions (here, with Perl-style grep)?


Solution

  • If that's okay with you, here's a shell script that might do the job.

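    #! /bin/sh
    
    # Usage: ./scriptname XY INPUTFILE
    Target="${1:?missing target letters}"
    
    File="${2:?missing input filename}"
    
    # Last unpaired grep hit, stored as "line:offset:letter"
    Previous=''
    
    test "${#Target}" -eq 2   ||   { echo 'please provide two target letters' >&2; exit 1; }
    
    test -r "$File"   ||   { echo "cannot read file \"$File\"" >&2; exit 1; }
    
    # List every occurrence of either target letter as "line:offset:letter"
    grep -n -b -o -e "${Target%?}\\|${Target#?}" "$File" \
      | while read -r Line
        do    if test "${Line%%:*}" != "${Previous%%:*}"
                 then Previous="$Line"    # first hit on a new line: remember it
              else
                 # second hit on the same line: compare it with the stored one
                 printf '%s:%s\n' "$Previous" "$Line" \
                   | { IFS=':' read -r Line Pos1 Char1 _ Pos2 Char2
                       # 6 = length of one letter string (4) + separator (2)
                       test "$(( Pos1 == Pos2 - 6))" -eq 1 \
                         && test "$Char1" != "$Char2"      \
                         && echo "match at line $Line"
                     }
                 Previous=''
              fi
        done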
    

    Based on the following input data:

    ABXF\\CDYG
    ZETX\\FCBA
    XHCB\\YEIH
    BYCT\\ABCD
    CYTZ\\AXVH
    ABXZ\\CDXV
    

    when you invoke the script like this:

    ./scriptname XY INPUTFILE
    

    it produces this output:

    match at line 1
    match at line 3
    match at line 5
    

    Explanation

    The script uses grep's -n, -b and -o options.

    • '-n' prefixes each match with its line number
    • '-b' prefixes each match with its byte offset, counted from the start of the file
    • '-o' prints only the matched text, one output line per occurrence

    Thus, for the input above, grep -n -b -o -e 'X\|Y' INPUTFILE produces output that begins:

    1:2:X
    1:8:Y
    2:14:X
    3:22:X
    3:28:Y
    4:34:Y
    

    (line:offset:matched expression)
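    To make the parsing step more concrete, here is a small sketch of how two hits from the same line can be joined and split on ':' into their fields. The function name split_pair is only an illustration and is not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: joins two "line:offset:letter" hits and splits them
# on ':' the same way the script does, then prints the parsed fields.
split_pair() {
    # $1 = first hit, $2 = second hit (both from the same input line)
    printf '%s:%s\n' "$1" "$2" \
      | { IFS=':' read -r Line Pos1 Char1 _ Pos2 Char2
          # "_" swallows the repeated line number of the second hit
          echo "line=${Line} first=${Char1}@${Pos1} second=${Char2}@${Pos2} distance=$(( Pos2 - Pos1 ))"
        }
}
```

    For the first two hits shown above, split_pair '1:2:X' '1:8:Y' prints line=1 first=X@2 second=Y@8 distance=6.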

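    The script simply parses that output, assuming that:

    • IF PreviousLine == CurrentLine
    • AND PreviousOffset + 6 == CurrentOffset
    • AND the matched letters are different
    • THEN there's a match

    The value 6 is the length of one letter string (4) plus the length of the \\ separator (2), so it has to be adjusted for letter strings of a different length.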
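
    The fixed distance in the Pos2 - 6 test ties the script to 4-letter strings. If that is too restrictive, the same check can be sketched with awk, comparing the letters position by position. This is only an alternative sketch, not the script above: the function name same_position_lines is made up, and it assumes every line is two equal-length strings separated by a literal \\ .

```shell
#!/bin/sh
# Hypothetical helper: prints "match at line N" for every line where the two
# target letters sit at the same position, in either order.
# Assumes each line is two equal-length strings separated by two backslashes.
same_position_lines() {
    # $1 = first target letter, $2 = second target letter, $3 = input file
    awk -v a="$1" -v b="$2" '
    {
        len = (length($0) - 2) / 2           # length of one letter string
        for (i = 1; i <= len; i++) {
            c1 = substr($0, i, 1)            # letter i of the first string
            c2 = substr($0, i + len + 2, 1)  # letter i of the second string
            if ((c1 == a && c2 == b) || (c1 == b && c2 == a)) {
                print "match at line " NR
                break                        # report each line only once
            }
        }
    }' "$3"
}
```

    Called as same_position_lines X Y INPUTFILE on the sample data, it should report lines 1, 3 and 5, like the script. Because it looks at every position, it also copes with lines that contain a target letter more than twice.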

    Tested under Debian 11 with GNU grep.

    Hope that helps.