Search code examples
bashpipepipelinebuffering

Piping sometimes does not lead to immediate output


I observed a few times now that A | B | C may not lead to immediate output, although A is constantly producing output. I have no idea how this even may be possible. From my understanding all three processes ought to be working on the same time, putting their output into the next pipe (or stdout) and taking from the previous pipe when they are finished with one step.

Here's an example where I am currently experiencing that:

tcpflow -ec -i any port 8340 | tee second.flow | grep -i "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'

What is supposed to happen:

I look at one port for tcp packages. If something comes it should be a certain XML format and I want to grep the Manufacturer and the Serialnumber from these packages. I would also like to get the full, unmodified output in a text file "second.flow", for later reference.

What happens:

Everything as desired, but instead of getting output every 10 seconds (I'm sure I get these outputs every ten seconds!) I have to wait for a long time and then a lot is printed at once. It's like one of the tools gobbles up everything in a buffer and only prints it if the buffer is full. I don't want that. I want to get each line as fast as possible.

If I replace tcpflow ... with a cat second.flow it works immediately. Can someone describe what's going on? And in case that it's obvious would there be another way to achieve the same result?


Solution

  • Every layer in a series of pipes can involve buffering; by default, tools that don't specify buffering behavior for stdout will use line buffering when outputting to a terminal, and block buffering when outputting anywhere else (including piping to another program or a file). In a chained pipe, all but the last stage will see their output as not going to the terminal, and will block buffer.

    So in your case, tcpflow might be producing output constantly, and if it's doing so, tee should be producing data almost at the same rate. But grep is going to limit that flow to a trickle, and won't produce output until that trickle exceeds the size of the output buffer. It's already performed the filtering and called fwrite or puts or printf, but the data is waiting for enough bytes to build up behind it before sending it along to awk, to reduce the number of (expensive) system calls.

    cat second.flow produces output immediately because as soon as cat finishes producing output, it exits, flushing and closing its stdout in the process, which cascades, when each step finds its stdin to be at EOF, it exits, flushing and closing its stdout. tcpflow isn't exiting, so the cascade of EOFs and flushing isn't happening.

    For some programs, in the general case, you can change the buffering behavior by using stdbuf (or unbuffer, though that can't do line buffering to balance efficiency, and has issues with piped input). If the program is using internal buffering, this still might not work, but it's worth a shot.

    In your specific case, though, since it's likely grep that's causing the interruption (by only producing a trickle of output that is sticking in the buffer, where tcpflow and tee are producing a torrent, and awk is connected to stdout and therefore line buffered by default), you can just adjust your command line to:

    tcpflow -ec -i any port 8340 | tee second.flow | grep -i --line-buffered "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
    

    At least for Linux's grep (not sure if switch is standard), that makes grep change its own output buffering to line-oriented buffering explicitly, which should remove the delay. If tcpflow itself is not producing enough output to flush regularly (you implied it did, but you could be wrong), you'd use stdbuf on it (but not tee, which, per stdbuf man page notes, manually changes its buffering, so stdbuf doesn't do anything) to make them line buffered:

    stdbuf -oL tcpflow -ec -i any port 8340 | tee second.flow | grep -i --line-buffered "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
    

    Update from comments: It looks like some flavors of awk block buffer prints to stdout, even when connected to a terminal. For mawk (the default on many Debian based distros), you can non-portably disable it by passing the -Winteractive switch at invocation. Alternatively, to work portably, you can just call system("") after each print, which portably forces output flushing on all implementations of awk. Sadly, the obvious fflush() is not portable to older implementations of awk, but if you only care about modern awk, just use fflush() to be obvious and mostly portable.