Hive - Randomly Distribute Records Across Mappers


I am looking for something like DISTRIBUTE BY but for mappers instead of reducers.

I am running a map-only transform job and using the following settings to control the number of mappers assigned:

SET mapred.min.split.size=2100000;
SET mapred.max.split.size=2100000;

The partition is about 800 MB in total, and the job is assigned about 400 mappers, which is consistent with the split size. The problem is that roughly 390 of the mappers finish in under a minute and report 0 records processed. The remaining 10 mappers do all the work, and the job takes days to complete.

Is there a way that I can force the mappers to take an (approximately) equal number of records so that this doesn't happen?


Solution

  • Fixed. Apparently the table being queried had only 10 files in HDFS, so only 10 mappers could actually be used.
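
The answer does not say how the table was fixed. One common way to spread records over more files is to rewrite the table once through a reduce stage. This is a minimal sketch, not necessarily what was done here. The table names source_table and source_table_rebalanced are hypothetical, and the reducer count of 400 is only chosen to match the mapper count from the question:

```sql
-- Hypothetical names: source_table is the existing 10-file table,
-- source_table_rebalanced is an empty table with the same schema.

-- Request 400 reducers so the rewrite produces roughly 400 output files.
SET mapred.reduce.tasks=400;

-- DISTRIBUTE BY rand() sends each row to a randomly chosen reducer,
-- so each output file receives a similar number of rows.
INSERT OVERWRITE TABLE source_table_rebalanced
SELECT *
FROM source_table
DISTRIBUTE BY rand();
```

The map-only transform can then read from the rewritten table. If Hive's output file merging (for example hive.merge.mapredfiles) is enabled, the final file count may be lower than the reducer count. Because rand() is not deterministic, comparing row counts between the two tables afterwards is a cheap sanity check.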