linux unix awk sed comm

Unix - Want records from file 2 that are not in file 1 by matching on the first 91 characters


I want to compare file2 to file1 by matching on the first 91 characters of each record, and output the full record from file2 to file3. I'm new to Unix commands and just can't seem to figure this out.

Thanks in advance, Jeff


Solution

  • I generated dummy files as follows:

    file1

    A012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
    B012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
    C012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
    D012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
    E012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
    F012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
    

    file2

    Z012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 1
    B012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 2
    T012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 3
    D012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 4
    E012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 5
    F012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 6
    

    Then I think you want this:

    awk '
       # Processing for file1, basically create associative array entry indexed by leftmost 91 characters
       FNR==NR { f1[substr($0,1,91)]++; next }
    
       # Processing for second file
       f1[substr($0,1,91)] > 0
    
       ' file1 file2
    

    Sample Output

    B012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 2
    D012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 4
    E012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 5
    F012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789 Line 6
    

    Actually, I now think you might want precisely the other lines, that is, the records in file2 whose first 91 characters do not appear in file1. If so, change this:

    f1[substr($0,1,91)] > 0
    

    to this:

    ! f1[substr($0,1,91)]
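
    Putting the negated test together with the redirection to file3 that the question asks for, here is a minimal sketch of the whole command wrapped in a shell function. The function name not_in_first and the awk names n and seen are made up for illustration. It uses awk's "in" operator rather than the counter above, which should give the same result here. It assumes file1 is not empty, because FNR==NR cannot tell the two files apart when the first one has no records.

```shell
# Hypothetical helper (name is illustrative, not from the question).
# Prints the records of FILE2 whose first N characters do not occur
# as the first N characters of any record in FILE1.
# Usage: not_in_first N FILE1 FILE2
not_in_first() {
    awk -v n="$1" '
        # First file: remember each key (the leftmost n characters)
        FNR == NR { seen[substr($0, 1, n)]; next }

        # Second file: print the record only if its key was never seen
        !(substr($0, 1, n) in seen)
    ' "$2" "$3"
}

# For the files in the question:
#   not_in_first 91 file1 file2 > file3
```

    With the dummy files above, file3 should end up holding the "Line 1" (Z...) and "Line 3" (T...) records, since those are the only keys in file2 that are missing from file1.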