I have a basic question about Impala. Impala lets you query data that is stored in HDFS. Suppose a file is split into multiple blocks, and a line of text is spread across two of them. In Hive/MapReduce, the RecordReader takes care of this.
How does Impala read the record in such a scenario?
Referencing my answer on the Impala user list:
When Impala finds an incomplete record (which can happen when scanning certain file formats, such as text or RCFile), it continues to read incrementally from the next block(s) until it has read the entire record. Note that this may require a small 'remote read' (reading from a remote datanode), but this is usually a very small amount of data compared to the rest of the block, which should have been read locally (and ideally via a short-circuit read).
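To make the idea concrete, here is a minimal Python sketch of reading newline-delimited records from one byte range of a file. It is not Impala's scanner code. The names `split_ranges` and `read_split` are made up for illustration, and the rule that a range skips its leading partial line (because the previous range finishes it) is an assumption borrowed from the convention Hadoop's LineRecordReader uses for text input. The in-memory `bytes` object stands in for a file whose blocks may live on different datanodes.

```python
def split_ranges(length: int, block_size: int) -> list[tuple[int, int]]:
    """Cut [0, length) into consecutive block-sized ranges (hypothetical helper)."""
    return [(s, min(s + block_size, length)) for s in range(0, length, block_size)]


def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the records owned by the byte range [start, end).

    Assumed ownership rule: a range owns every line that starts at an
    offset in (start, end], plus the line at offset 0 for the first range.
    """
    pos = start
    if start != 0:
        # The bytes up to the first newline belong to a record that the
        # previous range finishes, so skip them.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    # A line that starts at or before `end` is read in full, even if that
    # means reading past `end` into the next block. That overrun is the
    # small read that may be remote.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            records.append(data[pos:])  # last line without trailing newline
            pos = len(data)
        else:
            records.append(data[pos:newline])
            pos = newline + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
for start, end in split_ranges(len(data), block_size=8):
    print((start, end), read_split(data, start, end))
```

With an 8-byte block size, "bravo" and "charlie" both straddle a block boundary, yet each record is returned exactly once, by the range in which it starts. Only the tail of a straddling record is read from the following block, which is why the extra read is small relative to the block itself.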