I have a table that contains a column with categorial data (let's call it 'category') and another column that contains arbitrary strings (let's call it 'text').
Now i would like to see few examples of text for every category, in other words: For each category show first 5 values of text. The text values don't need to be sorted by any criteria (i don't want the top 5 longest strings or anything like that), just the first encountered (random) ones.
The result should look something like this:
+----------+------------------+
| category | text |
+----------+------------------+
| cat A | random string 1 |
| cat A | random string 2 |
| cat A | random string 3 |
| cat A | random string 4 |
| cat A | random string 5 |
| cat B | random string 6 |
| cat B | random string 7 |
| cat B | random string 8 |
| cat B | random string 9 |
| cat B | random string 10 |
| cat C | random string 11 |
| ... | ... |
I cannot afford quadratic complexity, since the table is really huge, tens of milions of rows.
I also would like to avoid unioning subqueries for each category manually
select ... where category = catA
union
select ... where category = catB
union
select ... where category = catC
union
...
because there is like 50 different categories.
And to the top of it, i cannot use any database-specific featues, because my database isn't any of the traditional relational databases, but an Apache Hive, that only supports basic SQL syntax.
A lot of strict requirements, i know. But if someone knew how to solve this in plain SQL, that would help me a lot. Thanks.
You can use row_number()
in most databases include Hive. For 10 examples per category, for instance:
select t.*
from (select t.*,
row_number() over (partition by category order by category) as seqnum
from t
) t
where seqnum <= 10;