Tags: regex, performance, bash, sed

Is there a difference in efficiency between pipelined sed invocations and multiple sed expressions?


I have a question about sed efficiency in bash. I have a pipelined series of sed statements, e.g.:

var1="Some string of text"

var2=$(echo "$var1" | sed 's/pattern1/replacement1/g' | sed 's/pattern2/replacement2/g' | sed 's/pattern3/replacement3/g' | sed 's/pattern4/replacement4/g' | sed 's/pattern5/replacement5/g')

Assuming no inputs depend on edited output from an earlier sed pipe, am I better off scripting the above with expression statements instead? For example:

var2=$(echo "$var1" | sed -e's/pattern1/replacement1/g' -e's/pattern2/replacement2/g' -e's/pattern3/replacement3/g' -e's/pattern4/replacement4/g' -e's/pattern5/replacement5/g')

Is there any efficiency to be gained here?


Solution

  • Short Answer

    Using multiple expressions will be faster than using multiple pipelines, because there's additional overhead in creating the pipes and forking a separate sed process for each stage. However, it's rarely enough of a difference to matter in practice.
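To make the process-count difference concrete, here is a minimal sketch of three equivalent forms. The patterns (`Some`, `string`, `text`) and their replacements are made up for illustration and are not from the question. The semicolon-separated form is assumed to work for plain `s` commands, which is the case in the common GNU and BSD sed implementations.

```shell
#!/usr/bin/env bash
# Hypothetical input and patterns, chosen only so the substitutions are visible.
var1="Some string of text"

# Three sed processes connected by pipes: one fork/exec per stage.
piped=$(printf '%s\n' "$var1" | sed 's/Some/Any/g' | sed 's/string/line/g' | sed 's/text/words/g')

# One sed process; the -e expressions are applied in order to each line.
multi=$(printf '%s\n' "$var1" | sed -e 's/Some/Any/g' -e 's/string/line/g' -e 's/text/words/g')

# One sed process; the same commands separated by semicolons in a single script.
semi=$(printf '%s\n' "$var1" | sed 's/Some/Any/g; s/string/line/g; s/text/words/g')

printf '%s\n' "$piped" "$multi" "$semi"
```

All three produce the same result because each later substitution sees the output of the earlier ones, whether the stages are separate processes or commands within one script.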

    Benchmarks

    With your example, the multiple-expression form averaged only about two-thousandths of a second faster (0.005s versus 0.007s of real time), which is not enough to get excited about.

    # Average run with multiple pipelines.
    $ time {
        echo "$var1" | 
        sed 's/pattern1/replacement1/g' |
        sed 's/pattern2/replacement2/g' |
        sed 's/pattern3/replacement3/g' |
        sed 's/pattern4/replacement4/g' |
        sed 's/pattern5/replacement5/g'
    }
    Some string of text
    
    real        0m0.007s
    user        0m0.000s
    sys         0m0.004s
    

    # Average run with multiple expressions.
    $ time {
        echo "$var1" | sed \
        -e 's/pattern1/replacement1/g' \
        -e 's/pattern2/replacement2/g' \
        -e 's/pattern3/replacement3/g' \
        -e 's/pattern4/replacement4/g' \
        -e 's/pattern5/replacement5/g'
    }
    Some string of text
    
    real        0m0.005s
    user        0m0.000s
    sys         0m0.000s
    

    Granted, this isn't testing against a large input file, thousands of input files, or running in a loop with tens of thousands of iterations. Still, it seems safe to say that the difference is small enough to be irrelevant for most common situations.

    Uncommon situations are a different story. In such cases, benchmarking will help you determine whether replacing pipes with in-line expressions is a valuable optimization for that use case.
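For those uncommon situations, one way to benchmark is to repeat each variant many times so that per-process overhead adds up to something measurable. The following is a rough sketch, not a rigorous benchmark: the function names `run_pipes`, `run_exprs`, and `bench`, and the `iterations` default, are hypothetical, and the timings will vary by machine and sed implementation.

```shell
#!/usr/bin/env bash
var1="Some string of text"
iterations=${iterations:-100}   # hypothetical default; raise it for steadier numbers

# Variant 1: one sed process per substitution.
run_pipes() {
    printf '%s\n' "$var1" |
    sed 's/pattern1/replacement1/g' |
    sed 's/pattern2/replacement2/g' |
    sed 's/pattern3/replacement3/g' |
    sed 's/pattern4/replacement4/g' |
    sed 's/pattern5/replacement5/g'
}

# Variant 2: a single sed process with five expressions.
run_exprs() {
    printf '%s\n' "$var1" | sed \
        -e 's/pattern1/replacement1/g' \
        -e 's/pattern2/replacement2/g' \
        -e 's/pattern3/replacement3/g' \
        -e 's/pattern4/replacement4/g' \
        -e 's/pattern5/replacement5/g'
}

# bench LABEL FUNC: time $iterations calls of FUNC, discarding its output.
bench() {
    local label=$1 func=$2 i
    echo "== $label ($iterations iterations)"
    time for ((i = 0; i < iterations; i++)); do "$func" >/dev/null; done
}

bench "multiple pipelines" run_pipes
bench "multiple expressions" run_exprs
```

Run it with, say, `iterations=10000 bash bench.sh` (the file name is arbitrary) and compare the two `real` figures. If the workload is a few large files rather than many small invocations, time the commands against those files instead, since process startup then matters far less.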