Tags: hadoop, benchmarking, hadoop2

How to Calculate Throughput from TestDFSIO benchmark on hadoop cluster


I have a cluster with 11 nodes, 9 slaves and 2 masters, the same as in my previous question. I am running the TestDFSIO benchmark on this cluster, which uses CDH 5.8.0.

I get the output below from the TestDFSIO run. Is this the throughput, or do I need to calculate the throughput from it, e.g. the number of files multiplied by the reported throughput, or something else?

Please let me know how to get the throughput of the whole cluster.

----- TestDFSIO ----- : write
           Date & time: Mon Aug 29 07:28:01 MDT 2016
       Number of files: 10000
Total MBytes processed: 8000000.0
     Throughput mb/sec: 50.75090177850001
Average IO rate mb/sec: 85.83160400390625
 IO rate std deviation: 82.41435666074283
    Test exec time sec: 3149.755

Solution

  • In short (rough estimate):

    Total throughput [mb/sec] = total MBytes processed / test exec time
    

    so ~2.5 GB/s in your case.

    Or, for a more accurate result, find out the number of available map slots on your cluster (VCores Total from the YARN console will do) and try this one:

    Total throughput mb/sec = min(nrFiles, VCores total - 1) * Throughput mb/sec
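
    As a minimal sketch of both estimates, here are the numbers from the posted result plugged in (the VCores total of 144 is a made-up example value, not taken from the question; read the real one off your YARN console):

    # Sketch: the two throughput estimates applied to the posted TestDFSIO result.
    total_mbytes = 8_000_000.0       # "Total MBytes processed"
    exec_time_s = 3149.755           # "Test exec time sec"
    throughput_per_map = 50.75       # "Throughput mb/sec" (per map task)
    nr_files = 10000
    vcores_total = 144               # hypothetical; take this from the YARN console

    rough = total_mbytes / exec_time_s                               # ~2540 mb/sec, i.e. ~2.5 GB/s
    accurate = min(nr_files, vcores_total - 1) * throughput_per_map  # 143 * 50.75 with these inputs
    print(rough, accurate)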
    

    But I would recommend repeating the test with slightly different settings, because the IO rate std deviation is very high (82.41435666074283).

    You set the number of files to 10k. I’m assuming the described cluster doesn’t have 10k map slots available. Now, because TestDFSIO runs one map task per file, the test will take more than one MapReduce wave to finish, which is unnecessary. Moreover, the last wave usually runs with fewer maps than the previous waves. Fewer maps running at the same time generate better individual throughput, and that skews the accuracy. Example: TestDFSIO single task throughput
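
    As a rough illustration of the wave effect, here is a minimal sketch assuming a hypothetical 140 concurrent map slots (a made-up number, not taken from the question):

    # Sketch: how many MapReduce waves 10,000 one-map-per-file tasks need.
    import math

    nr_files = 10000      # TestDFSIO starts one map task per file
    map_slots = 140       # hypothetical number of concurrent map slots

    waves = math.ceil(nr_files / map_slots)                 # 72 waves with these inputs
    maps_in_last_wave = nr_files - (waves - 1) * map_slots
    print(waves, maps_in_last_wave)                         # the last wave runs only 60 maps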

    So it’s better to set the number of tasks to something lower. The total number of drives in your datanodes is a good starting point. Look at the following graph: TestDFSIO total throughput

    I’ve run TestDFSIO several times with different nrFiles parameter values. You can see that after crossing a certain point (drive saturation in this case) there is not much more to gain. The total throughput of this cluster topped out at 2.3 GB/s. So, to answer your question, you can get the total throughput of the cluster by running:

    yarn jar hadoop-mapreduce-client-jobclient.jar TestDFSIO -write -nrFiles N -size 10GB
    

    Where:

    • N = 3 / replication_factor * total_datanodes_drives (a sketch of this calculation follows below)
    • -size should be set to something that lets the test run for at least 10 minutes
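
    As a minimal sketch of that calculation, assuming a hypothetical cluster with 9 datanodes, 12 drives per datanode and a replication factor of 3 (example values only, not taken from the question):

    # Sketch: pick nrFiles (N) from the drive count, then assemble the command.
    # 9 datanodes, 12 drives each and replication factor 3 are hypothetical values.
    datanodes = 9
    drives_per_datanode = 12
    replication_factor = 3

    total_drives = datanodes * drives_per_datanode   # 108 drives
    n = int(3 / replication_factor * total_drives)   # N = 108 with these inputs

    cmd = (f"yarn jar hadoop-mapreduce-client-jobclient.jar TestDFSIO "
           f"-write -nrFiles {n} -size 10GB")
    print(cmd)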

    The total throughput can then be calculated from the result values, like this:

    Total throughput [mb/sec] = nrFiles * Throughput mb/sec
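
    For example, here is a small sketch that reads the local result file TestDFSIO appends to (TestDFSIO_results.log is assumed to be the default -resFile name; adjust the path if you passed a different one):

    # Sketch: pull "Number of files" and "Throughput mb/sec" out of the local
    # result file and multiply them. The file name is the assumed default.
    nr_files = throughput = None
    with open("TestDFSIO_results.log") as f:
        for line in f:
            if "Number of files" in line:
                nr_files = float(line.split(":")[-1])
            elif "Throughput mb/sec" in line:
                throughput = float(line.split(":")[-1])

    print(f"Total throughput: {nr_files * throughput:.2f} mb/sec")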
    

    Things to watch for:

    • HDFS free space ;) The test will generate replication * size * nrFiles of data. Don’t go over 60% of your cluster capacity (a quick check is sketched after this list).
    • nrFiles should be lower than the number of available map slots (nrFiles <= VCores total - 1 on YARN)
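
    A minimal sketch of that free-space check, with the capacity, replication factor and -size value all as made-up example inputs:

    # Sketch: verify the data the test will write stays under 60% of raw capacity.
    # All input values here are hypothetical examples, not from the question.
    replication = 3
    size_mb = 10 * 1024                        # -size 10GB, expressed in MB
    nr_files = 108                             # the N chosen above
    cluster_capacity_mb = 400 * 1024 * 1024    # e.g. ~400 TB of raw HDFS capacity

    data_written_mb = replication * size_mb * nr_files
    print(data_written_mb / cluster_capacity_mb <= 0.60)   # True -> safe to run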