Search code examples
perlawk

sort and extract certain number of rows from a file containing dates


i have in a txt file, date like:

yyyymmdd

raw data are like:

20171115
20171115
20180903
...
20201231

They are more than 100k rows. i am trying to keep in one file the "newest" 10k lines, and in a separate file the 10k "oldest" 10k lines.

I guess this must be a two steps process:

  1. sort lines,

  2. then extract the 10k rows that are on top, the "newest = most recent dates" and the 10k rows that are towards the end of the file ie the "oldest = most ancient dates"

How could i achieve it using awk?

I even tried with perl no luck though, so a perl one liner would be highly accepted as well.

Edit: i would prefer a clean clever solution so that i learn from, and not an optimization of my attempts.

example with perl

@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $c <=> $d;
}
print "@ordered\n";

Solution

  • Given that your lines with dates will sort lexicographically, it is simple. Just use sort then split.

    Given:

    cat infile
    20171115
    20171115
    20180903
    20100101
    20211118
    20201231
    

    You can sort then split that input file into files of 3 lines each:

    split  -l3 <(sort -r infile) dates
    # -l10000 for a 10,000 line split
    

    The result:

    for fn in dates*; do echo "${fn}:"; cat "$fn"; done
    datesaa:
    20211118
    20201231
    20180903
    datesab:
    20171115
    20171115
    20100101
    
    # files are names datesaa, datesab, datesac, ... dateszz
    # if you only want two blocks of 10,000 dates, 
    # just throw the remaining files away.
    

    Given you may have significantly more lines than you are interested in, you can also sort to a intermediate file then use head and tail to get the newest and oldest respectively:

    sort -r infile >dates_sorted
    head -n10000 dates_sorted >newest_dates
    tail -n10000 dates_sorted >oldest_dates