Search code examples
sqlhadoopmapreducehiveudf

Hive for bag of words (word count for each word in the dictionary)


I have a table with this structure:

user_id | message_id | content
   1    |      1     | "I like cats"
   1    |      1     | "I like dogs"

And a list of valid words in dictionary.txt (or an external hive table), for example:

I,like,dogs,cats,lemurs

And my goal is to generate an word-count table for each user

user_id  |  "I"  |  "like"  |  "dogs"  |  "cats"  |  "lemurs"
   1     |   2   |     2    |     1    |     1    |     0

This is what I tried so far:

SELECT user_id, word, COUNT(*) 
FROM messages LATERAL VIEW explode(split(content, ' ')) lTable as word 
GROUP BY user_id,word;

Solution

  • Check this :

    select ename, 
    length(ename)-length(replace(ename,'A', '')) A,
    length(ename)-length(replace(ename,'W', '')) W 
    FROM EMP;
    

    Else you can define a variable(your search string) and place it in the place of 'A', 'W' etc