
Any specific problems running (linux) BCP on "too many" threads?


Are there any specific problems with running Microsoft's BCP utility (on CentOS 7, https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp?view=sql-server-2017) on multiple threads? Googling did not turn up much, but I am looking at a problem that seems to be related to just that.

Copying a set of large TSV files from HDFS to a remote MSSQL Server with some code of the form

bcpexport() {
    filename=$1
    TO_SERVER_ODBCDSN=$2
    DB=$3
    TABLE=$4
    USER=$5
    PASSWORD=$6
    RECOMMENDED_IMPORT_MODE=$7
    DELIMITER=$8

    echo -e "\nRemoving header from TSV file $filename"
    echo -e "Current head:\n"
    head -n 1 "$filename"
    echo "$(tail -n +2 "$filename")" > "$filename"
    echo "First line of file is now..."
    head -n 1 "$filename"

    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause the script to hang; workaround implemented below throws an error if the timeout is reached
    timeout 30 sleep 5

    echo -e "\nReplacing null literal values with empty chars"
    NULL_WITH_TAB="null\t" # WARN: assumes the first field is the primary key, so never null
    TAB="\t"
    sed -i -e "s/$NULL_WITH_TAB/$TAB/g" "$filename"
    echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" "$filename")"

    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause the script to hang; workaround implemented below
    timeout 30 sleep 5

    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U "$USER" -P "$PASSWORD" \
        -d "$DB" \
        $RECOMMENDED_IMPORT_MODE \
        -t "$DELIMITER" \
        -e "${filename}.bcperror.log"
}

export -f bcpexport
parallel -q -j 7 bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMENDED_IMPORT_MODE $DELIMITER \
    ::: $DATAFILES/$TARGET_GLOB 

where $DATAFILES/$TARGET_GLOB constructs a glob that lists a set of files in a directory.
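For concreteness, a minimal sketch of how those two variables might be set (the values here are hypothetical, not from the actual script):

DATAFILES=/path/to/exported/tsvs    # directory holding the TSVs pulled from HDFS
TARGET_GLOB="*.tsv"                 # pattern matching the files to import
# $DATAFILES/$TARGET_GLOB then expands to the list of files for parallel to distribute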

When running this code for a set of TSV files, I find that sometimes some (but not all) of the parallel BCP threads fail; i.e., some files successfully copy to MSSQL Server

Starting copy...

5397376 rows copied.

Network packet size (bytes): 4096

Clock Time (ms.) Total : 154902 Average : (34843.8 rows per sec.)

while others output the error message

Starting copy...

BCP copy in failed

Usually, I see this pattern: a few successful BCP copy-in operations from the first few threads that return, then a bunch of failing threads returning their output until we run out of files (GNU Parallel holds each job's output until the whole job is done, so the output appears the same as if it had run sequentially).

Notice that the code passes the -e option to produce an error file for each BCP copy-in operation (see https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017#e). When examining these files after observing the failing behavior, they are all blank, with no error messages.
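Since those error files stay empty, per-job exit status has to come from GNU Parallel itself; its standard --joblog option records one line per job (the log path here is arbitrary):

parallel -q -j 7 --joblog /tmp/bcpexport.joblog bcpexport {} "$TO_SERVER_ODBCDSN" \
    $DB $TABLE $USER $PASSWORD $RECOMMENDED_IMPORT_MODE $DELIMITER \
    ::: $DATAFILES/$TARGET_GLOB
# the joblog records start time, runtime, and exit code for each input file,
# making it easy to see exactly which files failed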

I have only seen this with the number of threads >= 10 (and only for certain sets of data, so presumably it has something to do with the total number of files or the file sizes, and yet...); no errors seen so far when using ~7 threads, which further makes me suspect this has something to do with multi-threading.

Monitoring system resources (via free -mh) shows that generally ~13GB of RAM is always available.

It may be helpful to note that the data bcp is trying to copy in may be ~500,000-1,000,000 records long, with an upper limit of ~100 columns per record.
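As a rough sense of scale (the ~20 bytes per field is just an illustrative guess, not a measured value):

# back-of-envelope TSV size per file:
#   1,000,000 rows * 100 columns * ~20 bytes/field  ~=  2 GB
du -h $DATAFILES/$TARGET_GLOB    # compare against the actual on-disk sizes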

Does anyone have any idea what could be going on here? Note, I am pretty new to using BCP as well as GNU Parallel and multi-threading.


Solution

  • TLDR: Adding more threads to concurrently run bcp copy-ins seems to have the effect of overwhelming the endpoint MSSQL Server with write instructions, causing the bcp threads to fail (maybe timing out?). The point at which the number of threads becomes too many seems to depend on the size of the files being copied in by bcp (i.e., both the number of records in each file and the width of each record (i.e., the number of columns)).

    Long version (more reasons for my theory):

    1. When running a larger number of bcp threads and looking at the processes started on the machine (using https://clustershell.readthedocs.io/en/latest/tools/clush.html) with

    ps -aux | grep bcp
    

    I see a bunch of sleeping processes (notice the S; see https://askubuntu.com/a/360253/760862), as shown below (newlines added for readability):

    me 135296 14.5 0.0 77596 6940 ? S 00:32 0:01

    /opt/mssql-tools/bin/bcp TABLENAME in /path/to/tsv/1_16_0.tsv -D -S MyMSSQLServer -U myusername -P -d myDB -c -t \t -e /path/to/logfile

    These threads appear to sleep for a very long time. Further debugging into why these threads are sleeping suggests that they may in fact be doing their intended job (which would further imply that the problem may be coming from BCP itself; see https://stackoverflow.com/a/52748660/8236733). From https://unix.stackexchange.com/a/47259/260742 and https://unix.stackexchange.com/a/36200/260742:

    A process in S state is usually in a blocking system call, such as reading or writing to a file or the network, or waiting for another called program to finish.

    (e.g. writing to the MSSQL Server endpoint destination given to bcp in the ODBCDSN)

    Your process will be in S state when it is doing reads and possibly writes that are blocking. Can also happen while waiting on semaphores or other synchronization primitives... This is all normal and expected, and not usually a problem... you don't want it to waste CPU while it's waiting for user input.
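    To confirm what a given sleeping bcp process is actually blocked on, something along these lines works on Linux (the PID is just the one from the ps output above):

    # show state and kernel wait channel for every bcp process
    ps -o pid,stat,wchan:32,cmd -C bcp
    # or attach to one of them and watch its (blocking) network syscalls
    strace -p 135296 -e trace=network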

    2. When running different sets of files of varying record-amount-per-file (e.g. ranges of 500000 - 1000000 rows/file) and record-width-per-file (~10 - 100 columns/row), I found that in cases with either very large data widths or amounts, running a fixed number of bcp threads would fail.

    E.g. for a set of ~33 TSVs with ~500000 rows each, each row being ~100 columns wide, a set of 30 threads would write the first few files OK, but then all the rest would start returning failure messages. Incorporating a bit from @jamie's answer, the fact that the failure messages returned are "BCP copy in failed" errors does not necessarily mean it has to do with the content of the data in question. As for there being no actual content written into the -e error-log files from my process, @jamie's post says this:

    Regarding the "-e" option, it is meant to output data errors. Login errors, bad table names... many other errors are not reported in the file created with the -e option. When you get output using the "-e" option, you'll see info like "value truncated" and such... it will give you line numbers and sample data that was at issue.

    Meanwhile, a set of ~33 TSVs with ~500000 rows each, but with each row being only ~10 columns wide, and still using 30 bcp threads, would complete quickly and without error (and would also be faster when reducing the number of threads or the file set). The only difference here is the overall size of the data being bcp copy-in'ed to the MSSQL Server.

    All this while

    free -mh 
    

    still showed that the machine running the threads had ~15GB of free RAM remaining in each case (which is again why I suspect that the problem has to do with the remote MSSQL Server endpoint rather than with the code or the local machine itself).
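    If it really is the aggregate write volume that overwhelms the server, one knob worth experimenting with (untested here, just a guess) is bcp's -b flag, which commits the import in batches of the given number of rows instead of the whole file as one batch:

    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U "$USER" -P "$PASSWORD" \
        -d "$DB" \
        $RECOMMENDED_IMPORT_MODE \
        -t "$DELIMITER" \
        -b 50000 \
        -e "${filename}.bcperror.log"
    # -b 50000 commits every 50000 rows rather than the entire file at once,
    # which should shorten each transaction the server has to hold open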

    3. When running some of the tests from (2), I found that manually killing the parallel process (via CTRL+C) and then trying to remotely truncate the test table being written to with /opt/mssql-tools/bin/sqlcmd -Q "truncate table mytable" on the local machine would take a very long time (as opposed to manually logging into the MSSQL Server and executing truncate table mytable in the DB). Again this makes me think that this has something to do with the MSSQL Server having too many connections and just being overwhelmed.
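    For completeness, the full remote invocation looked something like this (the server and credential values here are placeholders):

    /opt/mssql-tools/bin/sqlcmd -S MyMSSQLServer -U myusername -P mypassword \
        -d myDB -Q "truncate table mytable"
    # issued from the local machine; during the overload this hung for a long time,
    # while the same statement run directly on the server returned quickly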

    ** Anyone with any MSSQL Mgmt Studio experience reading this (I have basically none): if you see anything here that makes you think that my theory is incorrect, please let me know your thoughts.