I need some clarification regarding the estimation of mappers for a particular job in a Hadoop cluster. As I understand it, the number of mappers depends on the input splits taken for processing, but that applies when the input data already resides in HDFS. Here I need clarification about the mappers and reducers triggered by a Sqoop job. Please find my questions below.
How is the mapper count estimated for a dedicated cluster: based on RAM or based on the input splits/blocks? (General)
How is the mapper count estimated for a Sqoop job that retrieves data from an RDBMS into HDFS, based on the input size? (Sqoop)
What is meant by CPU cores, and how do they affect the number of mappers that can run in parallel? (General)
Thanks.
Answer: No, it has nothing to do with the RAM size. It depends entirely on the number of input splits.
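As a rough sketch of how that plays out: the split size defaults to the HDFS block size, and you can nudge it with the standard split-size properties. In the command below, my-job.jar and MyDriver are hypothetical placeholders, and the -D options are only picked up if the driver uses ToolRunner.

    # Roughly: number of mappers ≈ total input size / split size,
    # where the split size defaults to the HDFS block size (dfs.blocksize, often 128 MB).
    # Example: ~1 GB of input with 128 MB splits gives about 8 mappers.
    hadoop jar my-job.jar MyDriver \
        -D mapreduce.input.fileinputformat.split.minsize=134217728 \
        -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
        /input/path /output/path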
Answer: By default a Sqoop job uses 4 mappers. You can change that with -m (1, 2, 3, 4, 5, ...) or the --num-mappers parameter, but you have to make sure that either the table has a primary key or you pass the --split-by parameter; otherwise Sqoop cannot split the work and you have to explicitly say -m 1 so only one mapper runs.
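For example, a minimal sketch of such an import (the connection string, table, and column names are made-up placeholders):

    sqoop import \
        --connect jdbc:mysql://dbhost:3306/salesdb \
        --username dbuser \
        --password-file /user/etl/.db.password \
        --table orders \
        --split-by order_id \
        --num-mappers 8 \
        --target-dir /data/orders
    # Sqoop queries MIN(order_id) and MAX(order_id) and divides that range into
    # 8 roughly equal slices, one per mapper. Without a primary key or --split-by
    # it cannot compute the slices, so you would have to add -m 1 instead.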
Answer: A core is the processing unit inside a CPU that can run a task; a 4-core processor can run 4 tasks at a time. The number of cores does not participate in how the MapReduce framework calculates the number of mappers. But if there are 4 cores and MapReduce calculates 12 mappers, then 4 mappers will run in parallel at a time and the rest will run afterwards as slots free up.
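On YARN, that per-node parallelism comes down to the vcore settings, roughly as sketched below; the values are assumptions, my-job.jar/MyDriver are the same hypothetical placeholders as above, and this bound only applies if the scheduler is configured to account for CPU as well as memory.

    # Cluster side (yarn-site.xml): vcores each NodeManager advertises, e.g.
    #     yarn.nodemanager.resource.cpu-vcores = 4
    # Job side: vcores each map task requests (default 1), settable per job:
    hadoop jar my-job.jar MyDriver \
        -D mapreduce.map.cpu.vcores=1 \
        /input/path /output/path
    # With 4 vcores per node and 1 vcore per map container, at most 4 map tasks
    # run on that node at the same time; the remaining mappers wait in the queue.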