Search code examples
bashwaitrace-conditionflushdiskcache

Running into a race-condition, even with a 'wait'


I'm facing a strange race condition in my bash program. I tried duplicating it via a simple enough demo program but, obviously, as true for all/most timing-related race demonstration attempts, I couldn't.

Here's an abstracted version of the program that DOES NOT duplicate the issue, but let me still explain:

# Abstracted version of the original program
# that is NOT able to demo the race.
#
function foo() {
    local instance=$1

    # [A lot of logic here -
    #  all foreground commands, nothing in the background.]

    echo "$instance: test" > /tmp/foo.$instance.log        
    echo "Instance $instance ended"
}

# Launch the process in background...
#
echo "Launching instance 1"
foo 1 &

# ... and wait for it to complete.
#
echo "Waiting..."
wait
echo "Waiting... done.  (wait exited with: $?)"

# This ls command ALWAYS fails in the real
# program in the 1st while-iteration, complaining about 
# missing files, but works in the 2nd iteration!
#
# It always works in the very 1st while-iteration of the
# abstracted version.
#
while ! ls -l /tmp/foo.*; do
    :
done

In my original program (and NOT in the above abstracted version), I do see Waiting... done. (wait exited with: 0) on stdout, just as I see in the above version. Yet, the ls -l always fails in the original, but always works in the above abstracted version in the very first while loop iteration.

Also, the ls command fails despite seeing the Instance 1 ended message on stdout. The output is:

$ ./myProgram
Launching instance 1
Waiting...
Waiting... done. (wait exited with: 0)
Instance 1 ended
ls: cannot access '/tmp/foo.*': No such file or directory
/tmp/foo.1
$

I noticed that the while loop can be safely done away with if I put a sleep 1 right before ls in my original program, like so:

# This too works in the original program:
sleep 1
ls -l /tmp/foo.*

Question: Why isn't wait working as expected in my original program? Any suggestions to at least help troubleshoot the problem?

I'm using bash 4.4.19 on Ubuntu 18.04.

EDIT: I just also verified that the call to wait in the original, failing program is exiting with a status code of 0.

EDIT 2: Shouldn't the Instance 1 ended message appear BEFORE Waiting... done. (wait exited with: 0)? Could this be a 'flushing problem' with OS' disk-buffer/cache when dealing with background processes in bash?

EDIT 3: If instead of the while loop or sleep 1 hacks, I issue a sync command, then, voila, it works! But why should I have to do a sync in one program but the other?


Solution

  • I noticed that each the following three hacks work, but not quite sure why:

    Hack 1

    while ! ls -l /tmp/foo.*; do
        :
    done
    

    Hack 2

    sleep 1
    ls -l /tmp/foo.*
    

    Hack 3

    sync
    ls -l /tmp/foo.*
    

    Could this be a 'flushing problem' with OS' disk-buffer/cache, especially when dealing with background processes, especially in bash? In other words, the call to wait seems to returning BEFORE it flushes the diskcache (or, BEFORE the OS, on its own realizes and, is done flushing the diskcache).

    EDIT Thanks to @Jon, his was a very close guess and got me thinking in the right direction, along with the age-old, bit-wise tweaking advice from @chepner.

    The Real Problem: I was starting foo, not directly/plainly as shown in my inaccurate abstracted version in my original question, but via another launchThread function that, after doing some bookkeeping, would also say foo 1 & in its body. And the call to launchThread was itself suffixed with an &! So, my wait was really waiting on launchThread and not on foo! The sleep, sync, and while just were helping buy more time for foo to complete, which is why introducing them worked. The following is a more accurate demonstration of the problem, even though you may or may not be able to duplicate it on your own system (due to scheduling/timing variance across systems):

    #!/bin/bash -u
    
    function now() {
        date +'%Y-%m-%d %H:%M:%S'
    }
    
    function log() {
        echo "$(now) - $@" >> $logDir/log # Line 1
    }
    
    function foo() {
        local msg=$1
        log "$msg"
        echo "  foo ended"
    }
    
    function launchThread() {
        local f=$1
        shift
        "$f" "$@" &  # Line 2
    }
    
    logDir=/tmp/log
    
    /bin/rm -rf "$logDir"
    mkdir -p "$logDir"
    
    echo "Launching foo..."
    launchThread foo 'message abc' &  # Line 3
    
    echo "Waiting for foo to finish..."
    wait
    echo "Waiting for foo to finish... done. (wait exited with: $?)"
    
    ls "$logDir"/log*
    

    Output of the above buggy program:

    Launching foo...
    Waiting for foo to finish...
    Waiting for foo to finish... done. (wait exited with: 0)
      foo ended
    ls: cannot access '/tmp/log/log*': No such file or directory
    

    If I remove the & from EITHER Line 2 OR from Line 3, the program works correctly, with the following as output:

    Launching foo...
    Waiting for foo to finish...
      foo ended
    Waiting for foo to finish... done. (wait exited with: 0)
    /tmp/log/log
    

    The program also works correctly if I remove the $(now) part from Line 1.