Search code examples
linuxsortingcomparison

How to compare the records in linux of a file and print the latest


I have the below file which has 4 columns, pipe separated.

09252|20|30|20200426|
09252|20|10|20200406|
09252|30|10|20200427|
09252|10|20|20200327|
12112|10|20|20200530|
04442|20|10|20200612|
52552|20|10|20200614|
04442|10|20|20200530|
52552|10|20|20200530|
12112|20|10|20200613|

I have sorted the file as below using sort command.

sort input.unl

04442|10|20|20200530|                                                                                                       
04442|20|10|20200612|                                                                                                       
09252|10|20|20200327|
09252|20|10|20200406|
09252|20|30|20200426|
09252|30|10|20200427|
12112|10|20|20200530|
12112|20|10|20200613|
52552|10|20|20200530|
52552|20|10|20200614|

I want to print the line into another file which has the latest date(column 4) for every entry in column 1. For example, 09252 has 4 entries in file, output file should only contain 09252|30|10|20200427| because this record has the latest date among 4 entries of 09252.

My expected output is as below

04442|20|10|20200612|
09252|30|10|20200427|
12112|20|10|20200613|
52552|20|10|20200614|

Looking forward to your help,


Solution

  • First, sort reverse by date. Then, sort and unique by the first column, keeping the order stable, so that only the last line from the previous sort, i.e. the max date, is kept for each id.

    sort -t'|' -rk4,4 file | sort -t'|' -su -k1,1
    
    • -t tells sort what the delimiter is. We need single quotes as bare | is a pipeline control operator.
    • -k4,4 uses the 4th column for sorting, i.e. the date
    • -r means sort backwards, i.e. the latest date will go first
    • -k1,1 means sort by the 1st column, i.e. the id
    • -u means "unique", i.e. only the first occurrence of each id will be printed
    • -s means "stable", it means sort won't shuffle the dates for the same id