apache-spark, hadoop, apache-spark-sql, google-cloud-dataproc, dataproc

Why is my HDFS capacity not remaining constant?


I am running a PySpark job on Dataproc, and my total HDFS capacity is not remaining constant.

[Dataproc monitoring charts: remaining HDFS capacity and used HDFS capacity over time]

As you can see in the first chart, the remaining HDFS capacity is falling even though the used HDFS capacity is minimal. Why is remaining + used not constant?


Solution

  • The "used" metric in the monitoring graph is actually "DFS Used"; it does not include "Non DFS Used". If you open the HDFS UI through the Component Gateway web interfaces, you should see something like the following (the same numbers can also be read from within PySpark, as sketched after the report):

    Configured Capacity  :   232.5 GB
    DFS Used     :   38.52 GB
    Non DFS Used     :   45.35 GB
    DFS Remaining    :   148.62 GB
    DFS Used%    :   16.57 %
    DFS Remaining%   :   63.92 %
    
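    The same three numbers (configured capacity, DFS used, DFS remaining) can also be read from within a PySpark session through Hadoop's FileSystem API. Below is a minimal sketch, assuming it runs on a cluster node with HDFS as the default filesystem; it goes through the non-public spark._jsc / spark._jvm gateway, so treat it as a convenience rather than a stable API:

        # Read HDFS capacity figures via org.apache.hadoop.fs.FileSystem.getStatus()
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        hadoop_conf = spark._jsc.hadoopConfiguration()
        fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
        status = fs.getStatus()  # org.apache.hadoop.fs.FsStatus

        gb = 1024 ** 3
        capacity = status.getCapacity() / gb    # Configured Capacity (total disk - reserved space)
        dfs_used = status.getUsed() / gb        # DFS Used
        remaining = status.getRemaining() / gb  # DFS Remaining
        # With the default reserved space of 0, whatever is left over is non-DFS usage
        non_dfs_used = capacity - dfs_used - remaining

        print(f"Capacity: {capacity:.2f} GB, DFS Used: {dfs_used:.2f} GB, "
              f"DFS Remaining: {remaining:.2f} GB, Non DFS Used: {non_dfs_used:.2f} GB")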

    The formulas are:

    DFS Remaining = Total Disk Space - max(Reserved Space, Non-DFS Used) - DFS Used

    Configured capacity = Total Disk Space - Reserved Space

    Reserved Space is controlled by the dfs.datanode.du.reserved property, which defaults to 0. So in your case it is the Non DFS Used that gets deducted. Here is a similar question.
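
    As a quick check, plugging the figures from the report above into these formulas reproduces what the UI shows (values in GB are taken from that sample report, and Reserved Space is assumed to be the 0 default):

        # Verify the HDFS UI figures against the formulas above (all values in GB)
        total_disk = 232.5   # equals Configured Capacity when reserved space is 0
        reserved = 0.0       # dfs.datanode.du.reserved default
        dfs_used = 38.52
        non_dfs_used = 45.35

        dfs_remaining = total_disk - max(reserved, non_dfs_used) - dfs_used
        print(f"DFS Remaining  ~ {dfs_remaining:.2f} GB")            # ~148.63 GB (UI shows 148.62)
        print(f"DFS Used%      ~ {dfs_used / total_disk:.2%}")       # ~16.57%
        print(f"DFS Remaining% ~ {dfs_remaining / total_disk:.2%}")  # ~63.93%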