Tags: apache-spark, hive, hiveql, create-table, hive-ddl

How to define a partitioned external table for a nested directory structure


For a set of data files stored in HDFS in a year/*.csv structure as follows:

$ hdfs dfs -ls air/


    Found 21 items
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2000
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2001
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2002
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2003
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2004
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2005
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2006
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2007
    drwxr-xr-x   - hadoop hadoop          0 2019-03-08 01:45 air/year=2008

There are 12 CSV files, one for each month. Since our queries do not care about month granularity, it is fine to place all months of a year into one directory. Here are the contents for one of the years (note these are .csv files):

[hadoop@ip-172-31-25-82 ~]$ hdfs dfs -ls air/year=2008


Found 10 items
-rw-r--r--   2 hadoop hadoop  193893785 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_1.csv
-rw-r--r--   2 hadoop hadoop  199126288 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_10.csv
-rw-r--r--   2 hadoop hadoop  182225240 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_2.csv
-rw-r--r--   2 hadoop hadoop  197399305 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_3.csv
-rw-r--r--   2 hadoop hadoop  191321415 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_4.csv
-rw-r--r--   2 hadoop hadoop  194141438 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_5.csv
-rw-r--r--   2 hadoop hadoop  195477306 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_6.csv
-rw-r--r--   2 hadoop hadoop  201148079 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_7.csv
-rw-r--r--   2 hadoop hadoop  219060870 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_8.csv
-rw-r--r--   2 hadoop hadoop  172127584 2019-03-07 23:49 air/year=2008/On_Time_On_Time_Performance_2008_9.csv

The header and one row look like this:

hdfs dfs -cat airlines/2008/On_Time_On_Time_Performance_2008_4.csv | head -n 2


  "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac","Dest","DestCityName","DestState","DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups","DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15","ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime","Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay",

2008,2,4,3,4,2008-04-03,"WN",19393,"WN","N601WN","3599","MAF","Midland/Odessa, TX","TX","48","Texas",74,"DAL","Dallas, TX","TX","48","Texas",74,"1115","1112",-3.00,0.00,0.00,-1,"1100-1159",10.00,"1122","1218",6.00,"1220","1224",4.00,4.00,0.00,0,"1200-1259",0.00,"",0.00,65.00,72.00,56.00,1.00,319.00,2,,,,,,

The question is: how to "convince" Hive / Spark to read these contents properly? The approach is:

  • The last column, year, will be read automatically by Hive due to the partitioning
  • The first column, YearIn, will be a placeholder: its value will be read in, but my application code will ignore it in favor of the year partitioning column
  • All other fields are handled without any special considerations

Here is my attempt.

create external table air (
  YearIn string, Quarter string, Month string,
  .. _long list of columns_ ..)
partitioned by (year int)
row format delimited fields terminated by ','
location '/user/hadoop/air/';

The results are:

  • The table is created and is accessible both by Hive and by Spark
  • But the table is empty, as reported by both Hive and Spark

What is incorrect in this process?


Solution

  • The table definition looks good, except for headers. If you do not skip the header rows, they will be returned in the dataset, and for any column that is not a string the header value will be read as NULL. To prevent headers from being selected, add this at the end of your table DDL: tblproperties("skip.header.line.count"="1"). This property is supported in Hive only; for Spark, see also this workaround: https://stackoverflow.com/a/54542483/2700344. A corrected DDL sketch is shown at the end of this answer.

  • In addition to creating the table, you need to create the partitions.

    Use the MSCK [REPAIR] TABLE Air; command.

    The equivalent command in Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE Air RECOVER PARTITIONS.

    This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
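
For illustration, here is a minimal sketch of the corrected DDL with the header-skipping property added. The column list is abbreviated, as in the question; the remaining columns from the CSV header would need to be filled in:

create external table air (
  YearIn string,
  Quarter string,
  Month string
  -- ... remaining columns from the CSV header ...
)
partitioned by (year int)
row format delimited fields terminated by ','
location '/user/hadoop/air/'
tblproperties ("skip.header.line.count"="1");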
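
And a short sketch of the partition-recovery step, followed by a couple of illustrative queries one might run to confirm that the partitions were picked up:

-- register the year=YYYY directories under the table location as partitions
msck repair table air;

-- on EMR's version of Hive, the equivalent is:
-- alter table air recover partitions;

-- confirm the partitions were added and the data is readable
show partitions air;
select year, count(*) from air group by year;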