time delay after printing lines with sed or awk in large file


I have a large file (1 GB) and I need to extract a few lines from it by record number. I wrote my script with sed and, as it took too much time, I decided to investigate. It turns out that when I run something like sed -n '15689,15696p' filename the printing is quick, but there is a delay afterwards, and this makes my script really slow. Doing the same task with awk, the delay is smaller, but it's still there! The command line I used for awk was: awk 'NR>=15689 && NR<=15696' filename

I tried to print just one line (sed -n '15689p' filename) and the same problem appears!

I'm wondering if anyone has seen this before and knows how to get rid of this stupid delay. It seems to me this is a big problem, because the delay occurs after the printing task! I already searched this and other forums and haven't seen a question about this issue. Can someone help me? Thanks


Solution

  • Avoid using sed -n '15689,15696p': after printing the range, sed keeps reading through to the end of the file, which is the delay you are seeing. The fastest way I know is this:

    head -15696 filename | tail -8
    

    I benchmarked it, and it runs way faster:

    $ seq 1 100000000 > file
    
    $ time (head -50000000 file | tail -10) > /dev/null
    real    0m0.694s
    user    0m0.830s
    sys     0m0.333s
    
    $ time (sed -n '49999991,50000000p' file) > /dev/null
    real    0m6.018s
    user    0m5.863s
    sys     0m0.160s
    
    $ time (sed -n '49999991,50000000p;50000000q' file) > /dev/null
    real    0m3.197s
    user    0m3.153s
    sys     0m0.043s
    
    $ time (awk 'NR>=49999991 && NR<=50000000' file) > /dev/null
    real    0m12.665s
    user    0m12.543s
    sys     0m0.123s
    
    $ time (awk 'NR>=49999991 && NR<=50000000{print} NR==50000001{exit}' file)
    real    0m9.104s
    user    0m9.010s
    sys     0m0.100s
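The approaches benchmarked above can be wrapped in small helpers so the line count does not have to be worked out by hand. This is only a sketch: the function names print_lines, print_lines_sed and print_lines_awk are hypothetical, and it assumes the file has at least LAST lines (with a shorter file, the head/tail version returns the final lines of whatever exists). Note that in the sed and awk variants the print happens before the quit, otherwise the last requested line would be dropped.

```shell
#!/bin/sh
# Hypothetical helpers: print lines FIRST..LAST (inclusive) of FILE
# without reading the rest of the file.

# head/tail variant: head stops reading at line LAST,
# tail keeps only the final COUNT lines of that output.
print_lines() {
    first=$1
    last=$2
    file=$3
    # Inclusive range, e.g. 15689..15696 is 8 lines.
    count=$((last - first + 1))
    head -n "$last" "$file" | tail -n "$count"
}

# sed variant: print the range, then quit at the last wanted line.
# With -n, q does not print the current line, so p must come first.
print_lines_sed() {
    sed -n "${1},${2}p;${2}q" "$3"
}

# awk variant: print from FIRST onward and exit right after line LAST.
print_lines_awk() {
    awk -v first="$1" -v last="$2" 'NR >= first { print } NR == last { exit }' "$3"
}
```

For the range in the question, the call would be print_lines 15689 15696 filename. All three stop at the last requested line, so the time spent depends on how far into the file the range sits rather than on the total file size.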