apache-spark, amazon-s3, hive, databricks, hive-partitions

Add New Partition to Hive External Table via databricks


I have a folder that previously had subfolders based on ingestiontime, which is also the original PARTITION used in its Hive table.

So the folder looks like this -

s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........

Inside each ingestiontime folder, data is present in PARQUET format.

Now, in the same myStreamingData folder, I am adding another folder named businessname that holds similar data.

So my folder structure now looks like this -

s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
    ........

So I need to add the data under the businessname partition to my current Hive table too.

To achieve this, I was running the following ALTER query (on Databricks) -

%sql
alter table gp_hive_table add partition (businessname=007, ingestiontime=20200712230000)
location "s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000"

But I am getting this error -

Error in SQL statement: AnalysisException: businessname is not a valid partition column in table `default`.`gp_hive_table`.;

What part am I doing incorrectly here?

Thanks in advance.


Solution

  • So, building upon the suggestion from @leftjoin:

    Instead of having a Hive table without businessname as one of the partition columns, what I did is -

    Step 1 -> Create the Hive table with both partition columns - PARTITIONED BY (businessname long, ingestiontime long) - as sketched below.
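
    As a rough sketch, Step 1 could look like this in a Databricks notebook cell, assuming the table name gp_hive_table from the question and a placeholder data column (value) standing in for the real schema; BIGINT is the canonical name for the long type:

    # Sketch only: recreate the table with BOTH partition columns.
    # `value STRING` is a placeholder for the actual data columns.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS gp_hive_table (
            value STRING
        )
        PARTITIONED BY (businessname BIGINT, ingestiontime BIGINT)
        STORED AS PARQUET
        LOCATION 's3://MyDevBucket/dev/myStreamingData/'
    """)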

    Step 2 -> Executed the query MSCK REPAIR TABLE <Hive_Table_name> to auto-add the partitions (see the sketch below).
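
    Run from the same notebook, the repair step might look like this (table name again assumed to be gp_hive_table):

    # Sketch: discover and register every partition folder that already sits under
    # the table location in the businessname=.../ingestiontime=... layout.
    spark.sql("MSCK REPAIR TABLE gp_hive_table")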

    Step 3 -> Now, there are ingestiontime folders which are not inside a businessname folder, i.e. folders like -

    s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
    s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
    s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
    s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
    

    I wrote a small piece of code to fetch all such partitions and then, for each of them, ran the query ALTER TABLE <hive_table_name> ADD PARTITION (businessname=<some_value>, ingestiontime=<ingestion_time_partition_name>) LOCATION "<s3_location_of_all_partitions_not_belonging_to_a_specific_businesskey>" - a sketch of this is shown below.
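
    A rough sketch of that helper, assuming it runs in a Databricks notebook (so spark and dbutils are available); the base path and table name come from the question, while the default businessname value of 0 is purely an assumption for illustration:

    # Sketch: register the legacy ingestiontime-only folders under a default businessname.
    base_path = "s3://MyDevBucket/dev/myStreamingData/"
    default_businessname = 0  # assumption: whichever value should mark the legacy data

    for entry in dbutils.fs.ls(base_path):
        # keep only the folders that are NOT already under a businessname=... prefix
        if entry.name.startswith("ingestiontime="):
            ingestion_time = entry.name.rstrip("/").split("=")[1]
            spark.sql(f"""
                ALTER TABLE gp_hive_table
                ADD IF NOT EXISTS PARTITION (businessname={default_businessname}, ingestiontime={ingestion_time})
                LOCATION '{entry.path}'
            """)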

    This solved my issue.