Tags: linux, shell, awk, text-processing

Removing duplicates in a fixed-width file with multiple key columns, keeping only the last occurrence


I have a fixed-width file like the one below, where columns 1-9 and 18-21 form the key. Based on this key, I am trying to produce an output file without duplicates, keeping only the last occurrence of each key.

In File
12345ABCD78.90200ABCD
12345ABCD90.45300ABCD
11111EFGH56.75100ABCD
12345ABCD34.45400ABCD
11111EFGH75.90200ABCD

Out File
12345ABCD34.45400ABCD
11111EFGH75.90200ABCD

I have tried using awk as below, but I am not able to keep the last occurrence of the duplicates. Can anyone help me with this?

awk -v df=Duplicates_File.dat -v of=Output_wdout_Duplicate.dat '
(substr($0, 1, 18),substr($0, 174, 3)) in key {
        print > df
        next
}
{       key[substr($0, 1, 18),substr($0, 174, 3)]
        print > of
}' Inputfile

Solution

  • Please try the following awk code, written and tested with the shown samples.

    awk '{arr[substr($0,1,9),substr($0,18,4)]=$0} END{for(i in arr){print arr[i]}}' Input_file
    

    Explanation: create arr indexed by the first 9 characters and characters 18 through 21 of each line, storing the current line as the value. Each new line with the same key overwrites the previous one, so once the whole Input_file has been read, every index holds that key's last occurrence. The END block then prints all elements of the array, which yields only the last occurrence of each key.
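
    Note that for (i in arr) does not guarantee any particular output order. If the surviving lines must also keep their original input order, a two-pass sketch (reading Input_file twice; only checked against the shown samples) could look like this:

    awk 'NR==FNR{last[substr($0,1,9),substr($0,18,4)]=FNR; next}
         FNR==last[substr($0,1,9),substr($0,18,4)]' Input_file Input_file

    The first pass records the line number of each key's last occurrence; the second pass prints a line only when its line number matches that record, so the output preserves the file's original order.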



    2nd solution: Using GNU awk's FIELDWIDTHS option, you can try the following.

    awk 'BEGIN{FIELDWIDTHS = "9 8 4 *"} {arr[$1,$3]=$0} END{for(i in arr){print arr[i]}}' Input_file
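
    Here FIELDWIDTHS = "9 8 4 *" makes $1 the first part of the key (columns 1-9), $3 the second part (columns 18-21), and $2 and $4 the characters in between and after; the * ("rest of record") width needs GNU awk 4.2 or newer. A quick sanity check on one sample line:

    echo '12345ABCD78.90200ABCD' |
    awk 'BEGIN{FIELDWIDTHS = "9 8 4 *"} {printf "$1=%s $2=%s $3=%s\n", $1, $2, $3}'

    which prints $1=12345ABCD $2=78.90200 $3=ABCD.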