Search code examples
sqlhiveqlhsqldbcategorical-data

How to list first 10 rows in each category in Hive SQL


I have a table that contains a column with categorial data (let's call it 'category') and another column that contains arbitrary strings (let's call it 'text').

Now i would like to see few examples of text for every category, in other words: For each category show first 5 values of text. The text values don't need to be sorted by any criteria (i don't want the top 5 longest strings or anything like that), just the first encountered (random) ones.

The result should look something like this:

+----------+------------------+
| category | text             |
+----------+------------------+
| cat A    | random string 1  |
| cat A    | random string 2  |
| cat A    | random string 3  |
| cat A    | random string 4  |
| cat A    | random string 5  |
| cat B    | random string 6  |
| cat B    | random string 7  |
| cat B    | random string 8  |
| cat B    | random string 9  |
| cat B    | random string 10 |
| cat C    | random string 11 |
| ...      | ...              |

I cannot afford quadratic complexity, since the table is really huge, tens of milions of rows.

I also would like to avoid unioning subqueries for each category manually

select ... where category = catA
union 
select ... where category = catB
union
select ... where category = catC
union
...

because there is like 50 different categories.

And to the top of it, i cannot use any database-specific featues, because my database isn't any of the traditional relational databases, but an Apache Hive, that only supports basic SQL syntax.

A lot of strict requirements, i know. But if someone knew how to solve this in plain SQL, that would help me a lot. Thanks.


Solution

  • You can use row_number() in most databases include Hive. For 10 examples per category, for instance:

    select t.*
    from (select t.*,
                 row_number() over (partition by category order by category) as seqnum
          from t
         ) t
    where seqnum <= 10;