I am having a problem where HDFS (HDP v3.1.0) is running out of storage space, which is also causing Spark jobs to hang in the ACCEPTED state. I assume there is some configuration that would let HDFS use more of the storage space already present on the node hosts, but exactly which one was not clear from quick googling. Can anyone with more experience help with this?
In the Ambari UI, I see...
(screenshot from Ambari UI)
(screenshot from NameNode UI)
Yet when looking at the overall hosts via the Ambari UI, there still appears to be a good amount of space left on the cluster hosts (the last 4 nodes in this list are the data nodes, and each has a total of 140GB of storage space).
I am not sure which settings are relevant, but here are the general settings in Ambari:
My interpretation of the "Reserved Space for HDFS" setting is that 13GB should be reserved for non-DFS (i.e. local FS) storage, so it does not seem to make sense that HDFS is already running out of space.
Am I interpreting this wrongly?
Any other HDFS configs that should be shown in this question?
Looking at the disk usage by HDFS, I see...
[hdfs@HW001 root]$ hdfs dfs -du -h /
1.3 G 4.0 G /app-logs
3.7 M 2.3 G /apps
0 0 /ats
899.1 M 2.6 G /atsv2
0 0 /datalake
39.9 G 119.6 G /etl
1.7 G 5.2 G /hdp
0 0 /mapred
92.8 M 278.5 M /mr-history
19.5 G 60.4 G /ranger
4.4 K 13.1 K /services
11.3 G 34.0 G /spark2-history
1.8 M 5.4 M /tmp
4.3 G 42.2 G /user
0 0 /warehouse
for a total of ~269GB consumed (perhaps setting a shorter interval for the spark-history cleanup would help as well?). Looking at the free space on HDFS, I see...
[hdfs@HW001 root]$ hdfs dfs -df -h /
Filesystem Size Used Available Use%
hdfs://hw001.ucera.local:8020 353.3 G 244.1 G 31.5 G 69%
Yet Ambari reports 91% capacity, which seems odd to me (unless I am misinterpreting something here; let me know if so). This also conflicts with what I see broadly when looking at the disk space on the local FS where the HDFS datanode dirs are located...
[root@HW001 ~]# clush -ab -x airflowet df -h /hadoop/hdfs/data
HW001: df: ‘/hadoop/hdfs/data’: No such file or directory
airflowetl: df: ‘/hadoop/hdfs/data’: No such file or directory
---------------
HW002
---------------
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos_mapr001-root 101G 93G 8.0G 93% /
---------------
HW003
---------------
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos_mapr001-root 101G 94G 7.6G 93% /
---------------
HW004
---------------
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos_mapr001-root 101G 92G 9.2G 91% /
---------------
HW005
---------------
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos_mapr001-root 101G 92G 9.8G 91% /
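One possible way to reconcile the 69% and 91% figures is sketched below. It assumes (this is not confirmed anywhere above) that Ambari's percentage counts everything that is not free, including non-DFS data on the same volumes, while `hdfs dfs -df` counts only DFS data. The inputs are copied from the `hdfs dfs -df -h /` output above, and the helper name `usage_breakdown` is made up for illustration.

```python
# Figures copied from the `hdfs dfs -df -h /` output above, in GB as printed.
capacity_gb = 353.3   # "Size"
dfs_used_gb = 244.1   # "Used"
remaining_gb = 31.5   # "Available"
data_nodes = 4        # the four data nodes HW002..HW005


def usage_breakdown(capacity, dfs_used, remaining):
    """Split the configured capacity into DFS used, non-DFS used and free.

    Assumption: whatever is neither DFS data nor free space is non-DFS
    data sitting on the same volume as the datanode directories.
    """
    non_dfs_used = capacity - dfs_used - remaining
    return {
        "non_dfs_used_gb": non_dfs_used,
        "dfs_used_pct": 100 * dfs_used / capacity,
        "total_used_pct": 100 * (capacity - remaining) / capacity,
        "capacity_per_node_gb": capacity / data_nodes,
    }


breakdown = usage_breakdown(capacity_gb, dfs_used_gb, remaining_gb)
for name, value in breakdown.items():
    print(f"{name}: {value:.1f}")
```

If that assumption holds, the DFS-only share rounds to the 69% from `hdfs dfs -df`, and the share that also includes non-DFS data rounds to the 91% shown by Ambari. The per-node capacity of roughly 88 GB is also close to the 101G root volume minus the 13GB reserved space, which would suggest that HDFS shares the root partition with everything else on the host.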
Looking at the block report for the hdfs root...
[hdfs@HW001 root]$ hdfs fsck / -files -blocks
.
.
.
Status: HEALTHY
Number of data-nodes: 4
Number of racks: 1
Total dirs: 8734
Total symlinks: 0
Replicated Blocks:
Total size: 84897192381 B (Total open files size: 10582 B)
Total files: 43820 (Files currently being written: 10)
Total blocks (validated): 42990 (avg. block size 1974812 B) (Total open file blocks (not validated): 8)
Minimally replicated blocks: 42990 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 1937 (4.505699 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.045057
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 11597 (8.138018 %)
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Tue May 26 12:10:43 HST 2020 in 1717 milliseconds
The filesystem under path '/' is HEALTHY
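As a rough cross-check of the numbers above, the fsck "Total size" (one copy of each block) multiplied by the average block replication should land near the "Used" value from `hdfs dfs -df`. This is only a sketch: it assumes the `G` in the `-h` output means 2^30 bytes, and the helper name `raw_usage_gib` is made up for illustration.

```python
# Values copied from the fsck report above.
total_size_bytes = 84_897_192_381   # "Total size": one copy of each block
avg_replication = 3.045057          # "Average block replication"

GIB = 1024 ** 3  # assumption: the -h output uses binary units


def raw_usage_gib(logical_bytes, replication):
    """Estimate on-disk usage across the cluster once replicas are counted."""
    return logical_bytes * replication / GIB


logical_gib = total_size_bytes / GIB
raw_gib = raw_usage_gib(total_size_bytes, avg_replication)
print(f"logical data: {logical_gib:.1f} GiB")
print(f"with replication: {raw_gib:.1f} GiB")
```

The replicated estimate comes out within a few GB of the 244.1 G reported as used, so the DFS usage itself looks consistent with a replication factor of 3 rather than with some hidden extra consumer inside HDFS.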
Could this also be due to other problems that I am not seeing?
You haven't mentioned whether there is junk data, in /tmp for example, that could be cleaned up.
Does each datanode have 88.33 GB of storage?
If so, you cannot just attach new HDDs to the cluster hosts and expect HDFS to suddenly gain space.
dfs.datanode.data.dir (dfs.data.dir is the older, deprecated name) in hdfs-site is a comma-separated list of directories on the mounted volumes of each datanode.
To get more storage, you need to format and mount more disks, then add them to that property.
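As a small illustration of what editing that property means, here is a sketch that appends a directory to the comma-separated value of the data-dir property described above (dfs.data.dir, named dfs.datanode.data.dir in current Hadoop releases). The existing value `/hadoop/hdfs/data` is taken from the `df` command in the question and is assumed to be the current setting; the new mount point `/mnt/disk2/hadoop/hdfs/data` and the helper `add_data_dir` are hypothetical.

```python
def add_data_dir(current_value, new_dir):
    """Return the comma-separated directory list with new_dir appended once."""
    # Split on commas and drop empty entries and surrounding whitespace.
    dirs = [d.strip() for d in current_value.split(",") if d.strip()]
    if new_dir not in dirs:
        dirs.append(new_dir)
    return ",".join(dirs)


# Assumed current setting, based on the path used in the question.
current = "/hadoop/hdfs/data"
# Hypothetical directory on a newly formatted and mounted disk.
new_mount = "/mnt/disk2/hadoop/hdfs/data"

print(add_data_dir(current, new_mount))
```

The resulting string is what would go into the property value, for example through the HDFS configs in Ambari. The new disk has to be formatted and mounted on the datanode before it is listed there.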