Tags: bash, http, io-redirection

Why is my bash network redirection not sending the entire request body?


I am using bash network redirection to upload a file to HDFS via the webhdfs API.

However, with a fairly large file (a bit below 1 MB), the entire content of the file is not sent to HDFS unless I make the program sleep.

The code is the following:

content_length=$(wc -c $local_dump_file)
content_length=${content_length% *}

exec 3<> $socket_file
content="PUT $path HTTP/1.1\n"
content="${content}Host: $url:$port\nAccept: */*\nContent-Length: $content_length\nContent-Type: application/x-www-form-urlencoded; charset=utf-8\n\n$(cat $local_dump_file)"
echo -e "$content" >&3
sleep 120
exec 3<&-

Note: all the variables are correct, and this works for small amounts of data (from 1 byte to 1000 bytes there is no truncation problem).

The snippet above has a sleep for 2 minutes. Afterwards, the file on HDFS is complete. However, without the sleep, the content of the file on HDFS is not complete, and is actually not the same at each run (the end of the file is truncated).

For example, one run uploaded 500 KB, the next 615 KB, and so on.

It seems the transfer is "truncated", as if the command echo -e "$content" >&3 unexpectedly shut down without any warning or error.

  • Why does it behave like this?
  • How do I force the program to send all the data without an arbitrary sleep?

Solution

  • Issues with your code include:

    • in Bash, "\n" inside double quotes expands to a backslash and a lower-case 'n' (two characters), whereas I guess you were expecting a newline (one character); there is a quick check after this list.

    • Moreover, the HTTP protocol requires header lines to be terminated by a carriage return / newline (CRLF) pair, not by a bare newline.

    • The script does not wait for or consume an HTTP response. This is a plausible reason for the remote side to truncate large files, and for adding a sleep to work around that.

    • It's unnecessary and inefficient to interpolate the content into a Bash variable and then send that, when you could instead send it directly to the remote side.

    • The names of the url and path variables do not line up with the kinds of data I expect at those positions, based on the webhdfs documentation.
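
    A quick way to check the first two points in an interactive shell (nothing here is specific to webhdfs; the expected output is noted in the comments):

    # "\n" inside double quotes stays as two characters: a backslash and an 'n'
    printf '%s' "\n" | wc -c        # prints 2
    # $'\n' is a real newline (one character); $'\r\n' is the CRLF pair HTTP expects
    printf '%s' $'\n' | wc -c       # prints 1
    printf '%s' $'\r\n' | od -c     # shows \r \n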

    If I were going to do this in Bash

    It would probably go something more like this (untested):

    # Determine the content length
    content_length=$(wc -c "${local_dump_file}")
    content_length=${content_length%% *}
    
    # Open the socket for reading and writing
    exec 3<> "${socket_file}"
    
    # Send the message header
    cat <<EOH | sed $'s/$/\r/' >&3
    PUT ${path} HTTP/1.1
    Host: ${url}:${port}
    Accept: */*
    Content-Length: ${content_length}
    Content-Type: application/x-www-form-urlencoded; charset=utf-8
    
    EOH
    
    # Send the message body
    cat "${local_dump_file}" >&3
    
    # Wait for a response (or the first line of one, anyway)
    head -n1 <&3 >/dev/null
    
    # Close the socket file
    exec 3<&-
    

    Notes:

    • To get C-style escapes in Bash, use the $'...' style of quoting. In particular, $'s/$/\r/' expands to s/$/<cr>/ (where <cr> is a convention by which this answer represents a single carriage-return character).

    • Thus, sed $'s/$/\r/' is a substitute for unix2dos that can take its input from the standard input and write the converted result to its standard output. The above script uses it to convert the header lines from LF-terminated (as it assumes they will be) to CRLF-terminated.

    • The file content is transferred via a separate cat command to avoid it being subject to line-terminator translation. This is necessary regardless of the actual format of the data, because the computation of the content length does not take any kind of translation into account.

    • I'm transferring the same data that the original script does, whether correct or not. See my remarks above about the url and path variables.

    • The head command is used to consume one line from the socket, which is expected to be the first line of an HTTP response. Receiving the response is how you should satisfy yourself that the server has received and fully processed the request. However, you might find that you need or want to handle that in a more protocol-aware manner (a rough sketch follows these notes). See next.
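
    If you want something a bit more protocol-aware than a single head, a sketch along the following lines would read the status line and every response header. It makes the same assumptions as the script above (the socket is still open on file descriptor 3) and is just as untested:

    # Read the status line and headers up to the blank line that ends them
    while IFS= read -r line <&3; do
        line=${line%$'\r'}            # strip the trailing carriage return
        [[ -z ${line} ]] && break     # a blank line ends the header section
        printf '%s\n' "${line}"       # or inspect the status code here instead
    done

    From there you could check that the status line reports success (or a redirect, which webhdfs uses when creating files) before closing the descriptor.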

    But in practice, I would actually use curl instead

    Something like this:

    curl --unix-socket "${socket_file}" -X PUT -T "${local_dump_file}" "${path}"
    

    Note that the webhdfs docs have numerous examples of using curl to drive the API. The only wrinkle here seems to be the use of a socket file, but curl can handle that via its --unix-socket option (requires version 7.40 or later, I believe).
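
    For reference, the webhdfs documentation describes file creation as a two-step exchange: a first PUT to the namenode is answered with a 307 redirect, and the file content goes in a second PUT to the datanode URL named in that redirect. A sketch of that pattern over plain HTTP, with <HOST>, <PORT> and <PATH> as placeholders to substitute for your cluster (and to adapt if you really must go through the socket file):

    # Step 1: ask the namenode where to write; it responds with a 307 redirect
    curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE&overwrite=true"

    # Step 2: send the file content to the datanode URL from the Location header
    curl -i -X PUT -T "${local_dump_file}" "<LOCATION_FROM_STEP_1>"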