
Hive - Will columnar scan work on CSV External Table


Generally, if we create a Hive table stored as Parquet or ORC, it will scan only the particular columns mentioned in the select query.

But let's say I have 10 CSV files and I create an external table on top of them. Now if I do select col1, col2, col3 from external_table; will it scan all of my rows and will the reducer pick only the selected columns, or how will it work?


Solution

  • Short answer: the Mapper reads its file split row by row and transforms the rows into intermediate rows containing only the necessary columns; it also does initial filtering, aggregation and other transformations.

    The Reducer receives the intermediate rows from the mappers and does the final processing: aggregation, ordering, etc.

    Long answer:

    Hive on MR and Hive on Tez use MapReduce framework primitives to build the whole processing job.

    The CSV file is split, and the MapReduce framework spawns one map task for each InputSplit generated by the InputFormat.
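
    As a rough illustration (not Hive's own driver code), the fragment below shows the standard FileInputFormat knobs that control how text files are split, and therefore how many map tasks get spawned; the class name and input path are made up for the example.

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        // Hypothetical driver fragment: one map task is spawned per InputSplit
        // that TextInputFormat generates for the input files.
        public class SplitConfigExample {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance();
                job.setInputFormatClass(TextInputFormat.class);
                FileInputFormat.addInputPath(job, new Path("/data/csv"));  // example path

                // Optional: cap the split size (here 128 MB) to get more, smaller
                // splits and therefore more map tasks.
                FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            }
        }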

    Maps transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. Only the necessary columns are included in the emitted records. A given input pair may map to zero or many output pairs; the mapper does the initial filtering and aggregation.
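
    For the query in the question, a hand-written equivalent of the map-side projection could look like the sketch below. Hive actually generates its own operator tree, so this is only conceptual; the column positions 0, 1, 2 for col1, col2, col3 are assumed.

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Hypothetical mapper illustrating map-side projection over CSV rows.
        public class ProjectionMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // The whole row has to be read and parsed - in a row-oriented
                // text file there is no way to skip the unneeded columns.
                String[] cols = line.toString().split(",");

                // Only the selected columns (col1, col2, col3 assumed to sit at
                // positions 0, 1, 2) go into the intermediate record.
                Text projected = new Text(cols[0] + "\t" + cols[1] + "\t" + cols[2]);
                context.write(NullWritable.get(), projected);
            }
        }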

    The Mapper outputs are sorted and then partitioned per Reducer; the total number of partitions is the same as the number of reducers.
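
    In spirit, this is what the default HashPartitioner does; a minimal sketch of such a partitioner follows (the key and value types are just an example):

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        // Sketch of hash partitioning: each intermediate key is assigned to
        // one of numPartitions reducers, so all values for a key meet in the
        // same reduce task.
        public class ExamplePartitioner extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text key, Text value, int numPartitions) {
                // Mask off the sign bit so the result is never negative.
                return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }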

    The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed.
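
    For example, compression of the intermediate map outputs can be switched on with standard Hadoop properties; the Snappy codec here is only one possible choice, and the same settings can also go into mapred-site.xml or be set per job.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.SnappyCodec;

        // Sketch: enable compression of the intermediate map outputs.
        public class MapOutputCompressionExample {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                conf.setBoolean("mapreduce.map.output.compress", true);
                conf.setClass("mapreduce.map.output.compress.codec",
                              SnappyCodec.class, CompressionCodec.class);
            }
        }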

    The Reducer reduces a set of intermediate values which share a key to a smaller set of values. In other words, the Reducer does the final aggregation and, if necessary, ordering.

    The Reducer has 3 primary phases: shuffle, sort and reduce.

    Shuffle: input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP.

    Sort: in this stage the framework groups Reducer inputs by key (since different mappers may have output the same key).

    The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

    Reduce: in this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.

    The output of the reduce task is typically written to the FileSystem.
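
    A hand-written reducer for a hypothetical per-key count might look like the sketch below. It uses the newer org.apache.hadoop.mapreduce API rather than the old signature quoted above, and Hive's real reduce-side operators are generated, so this only illustrates the per-key reduce call over the grouped values.

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Hypothetical reducer: reduce() is invoked once per key with the
        // iterable of all values that were grouped under that key.
        public class CountReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {

            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();    // final aggregation for this key
                }
                // The result is written to the FileSystem by the output format.
                context.write(key, new LongWritable(sum));
            }
        }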

    It is legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to the FileSystem.
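
    For a plain projection like select col1, col2, col3 (no GROUP BY, ORDER BY or JOIN) a map-only job is enough. A hypothetical driver setting zero reduce tasks, reusing the ProjectionMapper sketch from above:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        // Hypothetical map-only job: with zero reducers the map outputs go
        // straight to the FileSystem, skipping shuffle/sort/reduce entirely.
        public class MapOnlyJobExample {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance();
                job.setJarByClass(MapOnlyJobExample.class);
                job.setMapperClass(ProjectionMapper.class);  // mapper sketch from above
                job.setNumReduceTasks(0);                    // no reduce phase
                job.setOutputKeyClass(NullWritable.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }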

    Hive on Tez builds a complex job (consisting of many map and reduce steps) as a single DAG (vertices are processing steps, i.e. mappers or reducers; edges are the data transfers between them); it can skip unnecessary steps and does not write intermediate results to persistent storage (HDFS).

    Read more details in the MapReduce documentation.