Tags: bash, centos, centos7

Bash: process blocks because a read loop sometimes fails to read more data from a file descriptor, but a manual cat /proc/<pid>/fd/<fd> works


Setup:

I have a complex script that starts a postgres instance and connects its output to file descriptor 3:

exec 3< <(su -l postgres -c "/usr/local/pgsql/bin/postmaster -p '$port' -d 3 -D '$dataDir' 2>&1 & echo \$!")
            # Explanation:
            #
            # exec: apply the redirection to the current shell
            # 3<: open file descriptor 3 for reading; it is later read by the "read" command
            # <(...): process substitution; run the command in () and let 3< read its output
            # su -l postgres -c "...": execute the command in "..." as user postgres
            # /usr/.../postmaster ... -D '$dataDir': start postgres
            # 2>&1: redirect stderr to stdout, so both are captured
            # &: run postmaster in the background (this is backgrounding, not the && "run if the first succeeded" operator)
            # echo \$!: echo the PID of the backgrounded postmaster
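To see the plumbing in isolation, here is a minimal sketch of the same pattern, with sleep as a hypothetical stand-in for postmaster: the process substitution backgrounds the command and echoes its PID, which then arrives as the first line readable on fd 3.

```shell
# `sleep` stands in for postmaster (hypothetical).
# The backgrounded command's PID is echoed into the pipe, so it is the
# first line the parent shell can read from fd 3.
exec 3< <(sleep 0.1 & echo $!)
read -u 3 pid
echo "background pid: $pid"
exec 3<&-   # close fd 3 again
```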

Following that, I read the data that postgres writes to that file descriptor:

while true
do
  read -u 3 line

  # Do something or break

done
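As a side note, read accepts a timeout, which keeps such a loop from blocking forever on a silent descriptor. A small self-contained sketch (with printf standing in for the real writer):

```shell
# The writer exits after two lines; -t bounds each read, so the loop
# ends on EOF (or on timeout) instead of hanging on a stalled fd.
exec 3< <(printf 'line1\nline2\n')
while read -r -u 3 -t 5 line
do
    echo "got: $line"
done
exec 3<&-
```

read returns nonzero both on EOF and on timeout; an exit status greater than 128 distinguishes the timeout case.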

Problem:

Most of the time this just works. But sometimes everything gets stuck until I manually run cat /proc/<pid>/fd/3 on the command line, which resolves the issue and makes the script continue.

My Analysis so far:

My assumption about what happens is that, for whatever reason, read stops emptying the buffer of fd 3 and instead keeps reading the same content into $line; postgres then stalls, as it is blocked trying to write to fd 3, which seems to be full, leading to a deadlock.

This always happens when postgres is shutting down, where it logs a lot of debug info at once.

When I run cat /proc/<pid>/fd/3, the buffer is seemingly emptied, postgres continues and shuts down, and my script, which reads the logs and also checks whether postgres has exited, continues as well.

I tried to increase the file descriptor buffer, but am not sure how.

sysctl -w net.unix.max_dgram_qlen=20480

This did not solve the issue.
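For what it's worth, the buffer in question here is a pipe buffer, so net.unix.max_dgram_qlen (a UNIX datagram socket setting) does not affect it. On Linux a pipe holds 64 KiB by default, and a process can grow its own pipe with fcntl(F_SETPIPE_SZ) up to a system-wide cap. A sketch of where that cap lives (paths assume Linux):

```shell
# Per-pipe capacity defaults to 65536 bytes on Linux. The largest size an
# unprivileged process may request via fcntl(fd, F_SETPIPE_SZ, n) is:
cat /proc/sys/fs/pipe-max-size
```

Bash itself offers no way to call F_SETPIPE_SZ on a process-substitution pipe, which is why draining the fd, or decoupling through a log file, ends up being the practical fix.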

Questions:

1: Why does read fail while a manual cat works?
2: How can I increase the buffer for a file descriptor?

Additional info:

CentOS 7: 3.10.0-1160.15.2.el7.x86_64 (it now also happens on the latest Manjaro)

This seems to happen more often if multiple such scripts are running in parallel (against different postgres data directories).

It started to happen after a bigger update of the server, but that could be a coincidence, as at that time we also increased the number of parallel scripts.

UPDATE: I did another deep investigation, and this time the behavior seems to have changed; what happens now makes a bit more sense, at least on Manjaro.

Instead of read hanging in a loop and always returning the same content, the hang now occurs while I pause the loop briefly to execute an SQL statement on the postgres instance.

So what happens is this: read is not emptying the buffer, and postgres gets stuck writing to it before the SQL statement is executed; the SQL statement then gets stuck as well, so reading never resumes => deadlock.

So if I can somehow increase the buffer size of this file descriptor that I open, this would solve everything. Anyone in for the bounty?

Solution:

Besides the accepted answer, which would totally work, in my case it was enough to make sure fd 3 was completely emptied before issuing the SQL statement to postgres.

    echo "Empty postgres output buffer before executing sql command"
    lastLine=""
    line=""
    while true
    do
            # -t 1: give up after one second so an empty pipe does not block us
            if ! read -u 3 -t 1 line
            then
                    echo "Nothing in buffer"
                    break
            fi
            # in the buggy state read kept returning the same content, so a
            # repeated line means there is nothing new to drain
            if [ "$lastLine" == "$line" ]
            then
                    echo "Nothing new in buffer: $line"
                    break
            fi
            lastLine=$line
            echo "Postgres: $line"
    done

Then there was enough buffer left not to block postgres writing to fd 3 while the SQL command was executing.
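An alternative to the repeated-line heuristic is a sketch like the following, assuming bash >= 4, where read -t 0 reports, without consuming any input, whether data is already available on the descriptor:

```shell
# Drain fd 3 only while data is actually pending; `read -u 3 -t 0`
# succeeds (without reading anything) only when input is available.
exec 3< <(printf 'a\nb\n'; sleep 0.3)
sleep 0.1                     # give the writer time to fill the pipe
while read -u 3 -t 0
do
    read -u 3 line
    echo "Postgres: $line"
done
exec 3<&-
```

This stops as soon as the pipe is empty, even if the writer is still alive and will produce identical lines later.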


Accepted answer

  • For troubleshooting purposes, can you run this?

    # export variables port and dataDir
    
    exec 3< <(su -l postgres -c "echo /usr/local/pgsql/bin/postmaster -p '$port' -d 3 -D '$dataDir' 2>&1")
    
    while read -u 3 line
    do
        echo "$line"
    done
    

    and see if you still get problems.

    Notice I left out & echo \$!.

    Can you consider this structure of script?

    #!/usr/bin/env bash
    
    # export variables port and dataDir
    
    log_file=/tmp/postmaster.log
    su -l postgres -c "/usr/local/pgsql/bin/postmaster -p '$port' -d 3 -D '$dataDir'" &>$log_file & echo $!
    
    exec 3< <(tail -f $log_file)
    
    while read -u 3 line
    do
        echo "$line"
    done
    

    so that you decouple the two processes and don't depend on the buffer size.
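One refinement to this decoupled layout (a sketch, assuming GNU tail): the --pid flag makes tail -f exit once the watched process dies, so the read loop terminates instead of waiting forever on an idle log file. Below, a short sleep stands in for the postmaster.

```shell
log_file=$(mktemp)
sleep 0.2 &                    # hypothetical stand-in for the postmaster
pg_pid=$!
echo "starting up" >> "$log_file"

# --pid: tail exits when $pg_pid disappears, closing fd 3 and ending the loop
exec 3< <(tail --pid="$pg_pid" -f "$log_file")
while read -u 3 line
do
    echo "Postgres: $line"
done
exec 3<&-
rm -f "$log_file"
```

tail polls for the watched PID at its regular follow interval (about once a second by default), so the loop ends shortly after the writer exits.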