When running a Spark job on an AWS cluster, I believe I've changed my code properly to distribute both the data and the work of the algorithm I am using. But the output looks like this:
[Stage 3:> (0 + 2) / 1000]
[Stage 3:> (1 + 2) / 1000]
[Stage 3:> (2 + 2) / 1000]
[Stage 3:> (3 + 2) / 1000]
[Stage 3:> (4 + 2) / 1000]
[Stage 3:> (5 + 2) / 1000]
[Stage 3:> (6 + 2) / 1000]
[Stage 3:> (7 + 2) / 1000]
[Stage 3:> (8 + 2) / 1000]
[Stage 3:> (9 + 2) / 1000]
[Stage 3:> (10 + 2) / 1000]
[Stage 3:> (11 + 2) / 1000]
[Stage 3:> (12 + 2) / 1000]
[Stage 3:> (13 + 2) / 1000]
[Stage 3:> (14 + 2) / 1000]
[Stage 3:> (15 + 2) / 1000]
[Stage 3:> (16 + 2) / 1000]
Am I correct to interpret the (0 + 2) / 1000 as only one two-core processor carrying out two of the 1000 tasks at a time? With 5 nodes (10 processors), why wouldn't I see (0 + 10) / 1000?
This looks more like the output I wanted:
[Stage 2:=======> (143 + 20) / 1000]
[Stage 2:=========> (188 + 20) / 1000]
[Stage 2:===========> (225 + 20) / 1000]
[Stage 2:==============> (277 + 20) / 1000]
[Stage 2:=================> (326 + 20) / 1000]
[Stage 2:==================> (354 + 20) / 1000]
[Stage 2:=====================> (405 + 20) / 1000]
[Stage 2:========================> (464 + 21) / 1000]
[Stage 2:===========================> (526 + 20) / 1000]
[Stage 2:===============================> (588 + 20) / 1000]
[Stage 2:=================================> (633 + 20) / 1000]
[Stage 2:====================================> (687 + 20) / 1000]
[Stage 2:=======================================> (752 + 20) / 1000]
[Stage 2:===========================================> (824 + 20) / 1000]
On AWS EMR, make sure that the --executor-cores option is set to the number of cores available on each of your nodes, like this:
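A minimal sketch of the spark-submit invocation, assuming a 5-node cluster with 4 cores per node and an application jar named my-app.jar (the jar name, memory setting, and core counts are illustrative, not from the original post):

spark-submit \
  --deploy-mode cluster \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 4g \
  my-app.jar

The console progress bar reads (completed + running) / total tasks for the stage, so (0 + 2) / 1000 means only 2 tasks were running concurrently. With 5 executors at 4 cores each, Spark can run 5 * 4 = 20 tasks at once, which matches the (… + 20) figures in the second output.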