Tags: hive, hdfs, external-tables

Introduce HDFS folder information into Hive external table


I have an HDFS directory structure like this:

/home/date_1/A/file.txt
/home/date_1/B/file.txt
/home/date_2/A/file.txt
/home/date_2/B/file.txt
...

I can create an external table

CREATE EXTERNAL TABLE table_name(col1 int, col2 string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home';

But I don't know how to introduce the folder information 'A' or 'B' into the table. What can I do? Thanks!


Solution

  • In Hive you have virtual columns which you can use to read the underlying file path. INPUT__FILE__NAME gives, for each row, the full path of the file that row was read from.

    So first create the external table (as you have done). Then, when you query it, you can use the virtual column and split the path, as below:

    select
      col1,
      col2,
      INPUT__FILE__NAME as full_filepath,
      -- reversing the path makes the filename and its parent folder the
      -- first two "/"-separated tokens; pick them out, reverse them back,
      -- and rejoin them as 'folder/file.txt'
      concat_ws("/",
                reverse(split(reverse(INPUT__FILE__NAME), "/")[1]),
                reverse(split(reverse(INPUT__FILE__NAME), "/")[0])) as splitted_filepath
    from
      table_name;
    
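    If you only want the folder name ('A' or 'B') as its own column, the same trick can be taken one step further. A minimal sketch, assuming the directory layout from the question:

    select
      reverse(split(reverse(INPUT__FILE__NAME), "/")[1]) as folder_name,  -- 'A' or 'B'
      reverse(split(reverse(INPUT__FILE__NAME), "/")[0]) as file_name,    -- 'file.txt'
      col1,
      col2
    from
      table_name;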

    More on virtual columns can be found in the Hive documentation (LanguageManual: Virtual Columns).
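
    INPUT__FILE__NAME is not the only one: Hive also exposes BLOCK__OFFSET__INSIDE__FILE, the row's byte offset within its file, which helps pinpoint exactly where a row came from. For example, against the same table:

    select
      col1,
      INPUT__FILE__NAME,
      BLOCK__OFFSET__INSIDE__FILE
    from
      table_name
    limit 10;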