Could someone explain which Hive file formats are efficient to read from a Pig script using HCatalog?
I would like to understand which Hive file formats are efficient. We currently have a Hive table partitioned by date, and the underlying files are sequential files. Reading 80 days of data creates around 70,000 mappers, which is far too many. I tried changing the map split size to 2 GB, but that did not reduce the count much.
So, instead of sequential files, I am looking for other options that will reduce the number of mappers. The size of the data per day is 9 GB.
Are there any suggestions or ideas?
Thank you.
As far as I know, ORC is the most suitable file format for Hive: it has a high compression ratio, works efficiently on large amounts of data, and is also faster to read. ORC data is stored by column and compressed, which leads to smaller disk reads. The columnar format is also ideal for vectorization optimizations in Hive.
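
It may also help to check the numbers from the question before changing formats. Below is a rough back-of-the-envelope sketch in Python. It assumes the 9 GB figure is per day and that the mapper count is roughly one per input split; the 256 MB target and the `mappers_for` helper are hypothetical and only there for illustration.

```python
# Figures quoted in the question ("9 GB per day" is an assumed reading).
DAYS = 80
GB_PER_DAY = 9
MAPPERS = 70_000

# Hypothetical target split size, chosen only for illustration.
TARGET_SPLIT_MB = 256

total_mb = DAYS * GB_PER_DAY * 1024      # total input size in MB
mb_per_mapper = total_mb / MAPPERS       # average input handled by one mapper


def mappers_for(total_mb, split_mb):
    """Rough mapper count if every split were filled up to split_mb."""
    return -(-total_mb // split_mb)      # ceiling division on integers


print(f"total input: {total_mb} MB")
print(f"average per mapper today: {mb_per_mapper:.1f} MB")
print(f"mappers at {TARGET_SPLIT_MB} MB per split: {mappers_for(total_mb, TARGET_SPLIT_MB)}")
```

Under those assumptions each mapper reads only about 10 MB, which suggests the partitions consist of many small files. By default a split does not span files, so a larger split size alone would not be expected to lower the mapper count much. If that is the case here, compacting each partition into fewer, larger files (for example while rewriting the table as ORC) is probably what reduces the number of mappers, more than the format change by itself.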