linux bash grep cut cat

How do I determine the slowest component of my shell pipeline?


I have an extremely long and complicated shell pipeline that grabs 2.2 GB of data and processes it. It currently takes 45 minutes to run. The pipeline is a chain of cut, grep, sort, uniq, grep and awk commands. I suspect the grep portion is what takes so much time, but I have no way of confirming it.

Is there any way to "profile" the entire pipeline from end to end to determine which component is the slowest, and whether it is CPU or I/O bound, so it can be optimised?

Unfortunately I cannot post the entire command here, as that would mean posting proprietary information, but from watching it in htop I suspect it is the following bit:

    grep -v ^[0-9]

Solution

  • I found the problem myself after some further experimentation. It appears to be caused by the character-encoding support in grep. The following stage was what hung the pipeline:

    grep -v ^[0-9]
    

    I replaced it with sed as follows and it finished in under 45 seconds!

    sed '/^[0-9]/d'
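
To confirm which stage is slow rather than guess, one option is to run each candidate stage on its own against a slice of the input and compare timings. The sketch below is only an illustration: time_stage is a hypothetical helper (not part of the original pipeline), and it assumes the slowdown is locale-related, as suspected above, so it lets you run the same command under different LC_ALL values. It reports whole seconds via date +%s, which is coarse but should be enough for stages that run for minutes.

```shell
# Sketch: run one filter stage in isolation and print its elapsed wall-clock seconds.
# time_stage is a hypothetical helper name, not something from the original pipeline.
# Usage: time_stage LOCALE INPUT_FILE COMMAND [ARGS...]
time_stage() (
    locale=$1
    input=$2
    shift 2
    start=$(date +%s)
    # Run the stage with LC_ALL forced to the requested locale, discard its output,
    # and ignore its exit status (grep returns 1 when it selects no lines).
    LC_ALL=$locale "$@" < "$input" > /dev/null || :
    end=$(date +%s)
    echo "$((end - start))"
)

# Example calls (sample.txt and en_US.UTF-8 are placeholders, adjust to your system):
#   time_stage en_US.UTF-8 sample.txt grep -v '^[0-9]'
#   time_stage C           sample.txt grep -v '^[0-9]'
#   time_stage C           sample.txt sed '/^[0-9]/d'
```

If the grep run under the C locale turns out to be fast while the UTF-8 run is slow, prefixing that grep with LC_ALL=C in the real pipeline may be an alternative to switching to sed. Note that in the C locale [0-9] matches ASCII digits only, which is presumably what is wanted here.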