I am using bash network redirection to upload a file to HDFS via the webhdfs API.
However, with content that is fairly large (a bit below 1 MB), the entire content of the file is not sent to HDFS unless I make the program sleep.
The code is the following:
content_length=$(wc -c $local_dump_file)
content_length=${content_length% *}
exec 3<> $socket_file
content="PUT $path HTTP/1.1\n"
content="${content}Host: $url:$port\nAccept: */*\nContent-Length: $content_length\nContent-Type: application/x-www-form-urlencoded; charset=utf-8\n\n$(cat $local_dump_file)"
echo -e "$content" >&3
sleep 120
exec 3<&-
Note: all the variables are correct, and this works for small amounts of data (from 1 byte to 1000 bytes, there is no truncation problem).
The snippet above sleeps for 2 minutes. Afterwards, the file on HDFS is complete. Without the sleep, however, the content of the file on HDFS is incomplete, and it is actually not the same at each run (the end of the file is truncated).
For example, one run uploaded 500 kB, the next 615 kB, and so on.
It seems the transfer is "truncated", as if the command echo -e "$content" >&3 were cut off unexpectedly, with no warning or error.
In Bash, "\n" inside ordinary double quotes expands to a backslash and a lower-case 'n' (two characters), whereas I guess you were expecting a newline (one character).
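A quick way to convince yourself in an interactive shell (the variable name s is just for illustration):

s="a\nb"       # four characters: 'a', '\', 'n', 'b'
echo "${#s}"   # prints 4
s=$'a\nb'      # three characters: 'a', newline, 'b'
echo "${#s}"   # prints 3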
Moreover, the HTTP protocol requires header lines to be terminated by a carriage return / newline pair, not by a bare newline.
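For instance, one way to emit a correctly terminated header line is printf, whose format-string escapes (including \r) are interpreted portably, unlike echo's:

printf 'Host: %s:%s\r\n' "${url}" "${port}" >&3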
The script does not wait for or consume an HTTP response. This is a plausible reason for the remote side to truncate large files, and for adding a sleep
to work around that.
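A minimal sketch of consuming and checking the status line, assuming the same fd 3 as in the script (untested):

# Read the status line, e.g. "HTTP/1.1 201 Created"
read -r http_version status_code reason <&3
case "${status_code}" in
    2??) ;;   # 2xx: success
    *) echo "upload failed: ${status_code} ${reason}" >&2 ;;
esac
# (a trailing carriage return may remain attached to ${reason})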
It's unnecessary and inefficient to interpolate the content into a Bash variable and then send that, when you could instead send it directly to the remote side.
The names of the url and path variables do not line up with the kinds of data I expect at those positions, based on the webhdfs documentation.
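For reference, the webhdfs documentation's CREATE operation addresses a URL of this general shape (angle-bracket placeholders per the Hadoop docs):

PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>]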
It would probably go something more like this (untested):
# Determine the content length
content_length=$(wc -c "${local_dump_file}")
content_length=${content_length%% *}
# Open the socket for reading and writing
exec 3<> "${socket_file}"
# Send the message header; the blank line before EOH becomes the
# required blank line that terminates the header block
cat <<EOH | sed $'s/$/\r/' >&3
PUT ${path} HTTP/1.1
Host: ${url}:${port}
Accept: */*
Content-Length: ${content_length}
Content-Type: application/x-www-form-urlencoded; charset=utf-8

EOH
# Send the message body
cat "${local_dump_file}" >&3
# Wait for a response (or the first line of one, anyway)
head -n1 <&3 >/dev/null
# Close the socket file
exec 3<&-
Notes:
To get C-style escapes in Bash, use the $'...' style of quoting. In particular, $'s/$/\r/' expands to s/$/<cr>/ (where <cr> is a convention by which this answer represents a single carriage-return character). Thus, sed $'s/$/\r/' is a substitute for unix2dos that can take its input from the standard input and write the converted result to its standard output. The above script uses it to convert the header lines from LF-terminated (as it assumes they will be) to CRLF-terminated.
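A quick way to see the effect (od output format may vary slightly by platform):

printf 'a\n' | sed $'s/$/\r/' | od -c
# 0000000   a  \r  \n
# 0000003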
The file content is transferred via a separate cat command to avoid it being subject to line-terminator translation. This is necessary regardless of the actual format of the data, because the computation of the content length does not take any kind of translation into account.
I'm transferring the same data that the original script does, whether correct or not. See my remarks above about the url and path variables.
The head command is used to consume one line from the socket, which is expected to be the first line of an HTTP response. Receiving the response is how you should satisfy yourself that the server has received and fully processed the request. However, you might find that you need or want to handle that in a more protocol-aware manner; see the curl suggestion after the sketch below.
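As an intermediate step, a hedged sketch (untested) of draining the whole response header block rather than just the first line:

# Read response lines until the blank line that ends the headers
while IFS= read -r line <&3; do
    line=${line%$'\r'}           # strip the trailing carriage return
    [ -z "${line}" ] && break    # blank line marks end of headers
    printf '%s\n' "${line}"      # or inspect/log as needed
done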
As for the more protocol-aware route: curl can drive the whole exchange for you. Something like this:
curl --unix-socket "${socket_file}" -T "${local_dump_file}" "http://${url}:${port}${path}"
Note that the webhdfs docs have numerous examples of using curl to drive the API. The only wrinkle here seems to be the use of a socket file, but curl can handle that via its --unix-socket option (requires version 7.40 or later, I believe). Two caveats on the command above: with -T, curl already defaults to PUT, so no explicit -X PUT is needed, and even with --unix-socket, curl expects a complete http:// URL, whose host part supplies the Host header.
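For instance, a hypothetical invocation modeled on the docs' CREATE example (the HDFS path and query string here are illustrative, not taken from the question):

curl --unix-socket "${socket_file}" -T "${local_dump_file}" \
    "http://${url}:${port}/webhdfs/v1/user/me/dump.bin?op=CREATE&overwrite=true"

Note too that, per the docs, the namenode normally answers CREATE with a 307 redirect to a datanode, which may need separate handling in a socket-file setup.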