Search code examples
amazon-s3hivegzipqubole

How to query data from gz file of Amazon S3 using Qubole Hive query?


I need get specific data from gz. how to write the sql? can I just sql as table database?:

  Select * from gz_File_Name where key = 'keyname' limit 10.

but it always turn back with an error.


Solution

  • You need to create Hive external table over this file location(folder) to be able to query using Hive. Hive will recognize gzip format. Like this:

    create external table hive_schema.your_table (
    col_one string, 
    col_two string
    )
    stored as textfile  --specify your file type, or use serde
    LOCATION
      's3://your_s3_path_to_the_folder_where_the_file_is_located'
    ;
    

    See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable

    To be precise s3 under the hood does not store folders, filename containing /s in s3 represented by different tools such as Hive like a folder structure. See here: https://stackoverflow.com/a/42877381/2700344