Search code examples
hadoopamazon-s3hiveamazon-emrhue

Hive results not being saved into S3 bucket


I am having trouble getting my Hive output to save to S3. I have tried sshing into the master node and running my command in Hive but it does not save the output. I have also tried running the command in Hue from the EMR console in AWS and still it does not save to S3. I also added the script as a step and still it doesn't save. The only way I have been able to get the results is to run it in Hue and then click to see results and download that way, then push them to S3. I am clueless as to why this is happening. Here is the query that I am running.

with temp as (
select /*+ streamtable(l) */ a.id, a.name, a.page
from my_table a
join my_other_table l on (a.id = l.id)
group by a.page, a.id, a.name)
insert overwrite directory 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7/'
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;

As a note, I would prefer for the solution to work when adding a step since I plan to add x number of steps sequentially.


Solution

  • The normal way I've seen Amazon EMR output to Amazon S3 is to CREATE EXTERNAL TABLE with a LOCATION in Amazon S3.

    For example:

    CREATE EXTERNAL TABLE IF NOT EXISTS output_table
    (gram string, year int, ratio double, increase double)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/directory';
    

    Then, just INSERT data into that table:

    INSERT OVERWRITE TABLE output_table
    SELECT gram FROM table...