Search code examples
hadoophqlhive

Hive Data Retrieval Queries: Difference between CLUSTER BY, ORDER BY, and SORT BY


On Hive, for Data Retrieval Queries (e.g. SELECT ...), NOT Data Definition (e.g. CREATE TABLES ...), as far as I understand:

  • SORT BY only sorts with in the reducer

  • ORDER BY orders things globally but shoves everything into one reducers

  • CLUSTER BY intelligently distributes stuff into reducers by the key hash and make a sort by

So my question is:

  1. Does CLUSTER BY guarantees a global order?
  2. DISTRIBUTE BY puts the same keys into same reducers but what aboutthe adjacent keys?

The only document I can find on this is here and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.


Solution

  • In short, for your questions:

    1. Does CLUSTER BY guarantee a global order?
      • No.
    2. DISTRIBUTE BY puts the same keys into same reducers but what about the adjacent keys?

    The longer version:

    To understand Hive, first you must understand Map Reduce. The relevant property of Map Reduce for this question is that when data arrives at the reducing stage, they would have always gone through shuffling and sorting on mapper or reducer before the reducing step. Then:

    • ORDER BY x: Data guarantees global ordering by x by using one reducer for the last reducing step. This will be slow for large outputs. It is however acceptable even for processing large datasets as long as you know at the last step the data size is small. (e.g. after filtering and aggregation)
    • SORT BY x: guarantees local ordering specifically by x, at each of N reducers. Global ordering is not guaranteed.
    • DISTRIBUTE BY x: guarantees that the rows of the same distribute key x goes to the same reducer but does NOT guarantees your desired ordering. Data at reducing stage is still sorted, but NOT necessarily by x.
    • CLUSTER BY x: This is the same as doing (DISTRIBUTE BY x and SORT BY x).

    Please refer to the official wiki for detailed mechanism and syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy


    Edited: As mentioned in comments and other answers, previous answer is incorrect and the choice of words may add wrong nuances. The use of of the phrase overlapping-range is misleading since shuffling is done with a hash function, which is NOT necessarily by range. For instance depending on implementation hashing integers by even or odd number is valid, which is overlapping in terms of number range. DISTRIBUTE BY in this case only guarantees all even numbers goes to the same reducer, and all odd numbers goes to another.

    The other answers have minor mistakes, or unrelated to the question, or too verbose.