I have an application that uses `FSDataOutputStream` to write data to HDFS. To push that data out I call `FSDataOutputStream`'s `hflush` function, and to obtain the number of bytes that have been written I call its `getPos` function.
For some reason, after `hflush` has been called, `getPos` returns the wrong file size most of the time (sometimes it is correct).
My understanding is that if I call `hflush` and then `getPos`, the file size in HDFS should equal (in bytes) what `getPos` returns, but whenever the two differ, `getPos` returns the greater value. It is as though part of the file is still stuck in some buffer and hasn't reached a physical disk.
I read about the `hsync` function of `FSDataOutputStream` and started using `hsync` instead of `hflush`, because it guarantees that the data will not be buffered and will be written to disk.
But the problem persists. It is much rarer now, but about 10% of the time, when I call `hsync` and then `getPos`, the file size in HDFS is less than what `getPos` returns.
Why is this happening, and how can I synchronize `getPos` with `hsync`?
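You have to call `hsync` with the `SyncFlag.UPDATE_LENGTH` argument. Without that flag, `hsync` syncs the data to the DataNodes but does not update the file length recorded by the NameNode, so the size that HDFS reports can lag behind `getPos`. See also: "Why is hsync() not flushing my hdfs file?"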
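
A minimal sketch of that call, assuming the stream comes from HDFS so that it can be cast to `HdfsDataOutputStream` (the class that provides the `hsync(EnumSet<SyncFlag>)` overload). The class name `HsyncUpdateLength`, the helper `syncAndUpdateLength`, and the path `/tmp/hsync-demo.txt` are made up for illustration, and the code needs the Hadoop client libraries on the classpath plus a reachable cluster:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

public class HsyncUpdateLength {

    /** Hypothetical helper: sync the stream and, on HDFS, also update the length on the NameNode. */
    static void syncAndUpdateLength(FSDataOutputStream out) throws IOException {
        if (out instanceof HdfsDataOutputStream) {
            // UPDATE_LENGTH asks the client to report the new block length to the NameNode.
            ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
        } else {
            // Other file systems have no such flag; fall back to a plain hsync.
            out.hsync();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: fs.defaultFS in the loaded configuration points at an HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hsync-demo.txt"); // hypothetical path

        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            syncAndUpdateLength(out);

            long pos = out.getPos();                    // bytes written through this stream
            long len = fs.getFileStatus(path).getLen(); // length recorded by the NameNode
            System.out.println("getPos=" + pos + ", length=" + len);
        }
    }
}
```

Each sync with `UPDATE_LENGTH` involves an extra round trip to the NameNode, so it is probably best reserved for the points where the reported file size actually has to match `getPos`.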