Tags: hadoop, google-cloud-platform, google-cloud-dataproc, dataproc

Disk utilization of Dataproc worker nodes is increasing day by day


We have a Dataproc cluster with 1 master node and 7 worker nodes. Each worker node has 1 boot disk and 1 local disk of 375 GB (sdb). The utilization of the sdb disk (mounted on /mnt/1) on worker nodes 0, 1, 2, and 3 has exceeded 85%, and on nodes 5, 6, and 7 it is also gradually approaching 85%.

We have found that files under the directory below take up the major part (304 GB):

/mnt/1/hadoop/dfs/data/current/BP-XXXXXXX-XX.XX.XX.X-XXXXXXXX/current/finalized

It contains folders like the following:

drwxrwxr-x  4 hdfs hdfs 4.0K Jul 24 16:11 ..
drwxrwxr-x 34 hdfs hdfs 4.0K Jul 26 15:20 subdir0
drwxrwxr-x 34 hdfs hdfs 4.0K Aug  8 13:19 subdir1
drwxrwxr-x 34 hdfs hdfs 4.0K Aug 10 08:16 subdir2
drwxrwxr-x 34 hdfs hdfs 4.0K Aug 17 22:16 subdir3
drwxrwxr-x 34 hdfs hdfs 4.0K Aug 23 02:49 subdir4
drwxrwxr-x 34 hdfs hdfs 4.0K Aug 27 20:30 subdir5
drwxrwxr-x 34 hdfs hdfs 4.0K Sep  2 08:30 subdir6
drwxrwxr-x 34 hdfs hdfs 4.0K Sep  7 02:21 subdir7
drwxrwxr-x 34 hdfs hdfs 4.0K Sep 12 18:00 subdir8
drwxrwxr-x 34 hdfs hdfs 4.0K Sep 16 22:46 subdir9
drwxrwxr-x 34 hdfs hdfs 4.0K Sep 23 02:45 subdir10
drwxrwxr-x 34 hdfs hdfs 4.0K Sep 28 22:31 subdir11
drwxrwxr-x 34 hdfs hdfs 4.0K Oct  3 19:15 subdir12
drwxrwxr-x 34 hdfs hdfs 4.0K Oct  8 13:30 subdir13
drwxrwxr-x 17 hdfs hdfs 4.0K Oct 12 15:35 .
drwxrwxr-x 34 hdfs hdfs 4.0K Oct 13 04:46 subdir14

cd subdir6
ls -larth
total 688K

drwxrwxr-x 34 hdfs hdfs 4.0K Sep  2 08:30 .
drwxrwxr-x  2 hdfs hdfs  20K Sep  5 22:35 subdir0
drwxrwxr-x  2 hdfs hdfs  20K Sep  5 22:51 subdir1
drwxrwxr-x  2 hdfs hdfs  20K Sep  5 23:34 subdir2
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 00:28 subdir4
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 00:50 subdir5
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 01:36 subdir6
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 01:50 subdir7
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 02:21 subdir8
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 02:50 subdir9
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 03:19 subdir10
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 03:38 subdir11
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 04:19 subdir12
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 04:38 subdir13
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 05:20 subdir14
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 05:49 subdir15
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 06:19 subdir16
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 07:20 subdir18
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 08:06 subdir20
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 08:24 subdir21
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 08:50 subdir22
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 09:23 subdir23
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 09:39 subdir24
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 10:05 subdir25
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 10:26 subdir26
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 11:02 subdir27
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 11:36 subdir28
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 12:04 subdir29
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 12:24 subdir30
drwxrwxr-x  2 hdfs hdfs  20K Sep  6 12:53 subdir31
drwxrwxr-x  2 hdfs hdfs  20K Sep 10 16:12 subdir17
drwxrwxr-x  2 hdfs hdfs  20K Sep 12 16:13 subdir3
drwxrwxr-x  2 hdfs hdfs  20K Sep 13 16:13 subdir19
drwxrwxr-x 17 hdfs hdfs 4.0K Oct 12 15:35 ..

XXXX/current/finalized/subdir6# cd subdir0
XXXXX/current/finalized/subdir6/subdir0# ls -larth
total 726M

-rw-rw-r--  1 hdfs hdfs  39K Sep  1 18:35 blk_1074135056_394248.meta
-rw-rw-r--  1 hdfs hdfs 4.8M Sep  1 18:35 blk_1074135056
-rw-rw-r--  1 hdfs hdfs  38K Sep  1 18:36 blk_1074135053_394245.meta
-rw-rw-r--  1 hdfs hdfs 4.8M Sep  1 18:36 blk_1074135053
-rw-rw-r--  1 hdfs hdfs  40K Sep  1 18:36 blk_1074135055_394247.meta
-rw-rw-r--  1 hdfs hdfs 5.0M Sep  1 18:36 blk_1074135055
-rw-rw-r--  1 hdfs hdfs  39K Sep  1 18:36 blk_1074135049_394241.meta
-rw-rw-r--  1 hdfs hdfs 4.9M Sep  1 18:36 blk_1074135049
-rw-rw-r--  1 hdfs hdfs  45K Sep  1 18:38 blk_1074135057_394249.meta
-rw-rw-r--  1 hdfs hdfs 5.6M Sep  1 18:38 blk_1074135057
-rw-rw-r--  1 hdfs hdfs  39K Sep  1 18:47 blk_1074135070_394262.meta
-rw-rw-r--  1 hdfs hdfs 4.8M Sep  1 18:47 blk_1074135070
-rw-rw-r--  1 hdfs hdfs  24K Sep  1 18:47 blk_1074135097_394289.meta
-rw-rw-r--  1 hdfs hdfs 2.9M Sep  1 18:47 blk_1074135097
-rw-rw-r--  1 hdfs hdfs  36K Sep  1 18:49 blk_1074135141_394333.meta
-rw-rw-r--  1 hdfs hdfs 4.5M Sep  1 18:49 blk_1074135141
-rw-rw-r--  1 hdfs hdfs  23K Sep  1 18:49 blk_1074135142_394334.meta
-rw-rw-r--  1 hdfs hdfs 2.9M Sep  1 18:49 blk_1074135142
-rw-rw-r--  1 hdfs hdfs  36K Sep  1 18:49 blk_1074135134_394326.meta
-rw-rw-r--  1 hdfs hdfs 4.5M Sep  1 18:49 blk_1074135134
-rw-rw-r--  1 hdfs hdfs  38K Sep  1 18:50 blk_1074135071_394263.meta

--------------------------- Many more files like these ---------------

-rw-rw-r--  1 hdfs hdfs  37K Sep  5 22:23 blk_1074192610_451802.meta
-rw-rw-r--  1 hdfs hdfs 4.6M Sep  5 22:23 blk_1074192610
-rw-rw-r--  1 hdfs hdfs  37K Sep  5 22:26 blk_1074192592_451784.meta
-rw-rw-r--  1 hdfs hdfs 4.6M Sep  5 22:26 blk_1074192592
-rw-rw-r--  1 hdfs hdfs  44K Sep  5 22:33 blk_1074192633_451825.meta
-rw-rw-r--  1 hdfs hdfs 5.5M Sep  5 22:33 blk_1074192633
  1. Can we delete those files?
  2. What is the purpose of those files?

What is the best way to delete them?

Thanks a lot for the info. I have run the commands.

~# hadoop fs -du -h /
136       /hadoop
0         /tmp
1017.5 G  /user


# hadoop fs -du -h /user/spark/
1012.7 G  /user/spark/eventlog

The event log alone seems to take nearly 1 TB.

~# hdfs dfsadmin -report
Configured Capacity: 2766820474880 (2.52 TB)
Present Capacity: 2539325628659 (2.31 TB)
DFS Remaining: 331457595322 (308.69 GB)
DFS Used: 2207868033337 (2.01 TB)
DFS Used%: 86.95%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (7):
Name: XXXXXX
Hostname: XXXXXX-w-0.c.XXXXXXXX
Decommission Status : Normal
Configured Capacity: 395260067840 (368.11 GB)
DFS Used: 328729737718 (306.15 GB)
Non DFS Used: 10792138250 (10.05 GB)
DFS Remaining: 34530736445 (32.16 GB)
DFS Used%: 83.17%
DFS Remaining%: 8.74%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 18
Last contact: Wed Oct 14 17:52:46 UTC 2020
Last Block Report: Wed Oct 14 14:32:42 UTC 2020

=============== The above is trimmed output; the remaining 6 nodes have nearly the same disk usage.

  1. Is deleting the event log safe? I mean, will it hamper any running job or the cluster?

  2. I ran the command below to find how many files there are, and the count is huge.

    ~# hadoop fs -du -h /user/spark/eventlog|wc -l

    236757

All the files are roughly 5-6 MB in size. Is there any command by which I can delete the matching files that are at least 7 days old?


Solution

  • The directories you listed are used by HDFS to store block data: the blk_* files hold the actual block replicas and the blk_*.meta files hold their checksums, so you should not delete them directly from the local disk. Instead, run the following commands on the master node to figure out which HDFS files are consuming the space:

    hdfs dfs -du <dir>
    hdfs dfsadmin -report
    

    You can delete unneeded files with

    hdfs dfs -rm -r -f -skipTrash <path>
    

    See more details in HDFS commands guide. There are also some nice scripts and tools that might be useful.

    Pay attention to /user/spark/eventlog and /tmp/hadoop-yarn/staging/history; they usually grow as you run more jobs.
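HDFS has no built-in age filter for deletion, but you can parse the date column of `hdfs dfs -ls` in a small script. Below is a sketch (not a tested production script, so dry-run it first) for removing files older than 7 days; it assumes GNU `date` on the master node and the standard 8-column `hdfs dfs -ls` output:

```shell
# Sketch: delete HDFS files older than 7 days by parsing `hdfs dfs -ls`
# output (columns: perms repl owner group size date time path).
# Assumes GNU date on the master node; comment out the rm line to dry-run.
cutoff=$(date -d "7 days ago" +%s)

hdfs dfs -ls /user/spark/eventlog | tail -n +2 | while read -r _ _ _ _ _ d t path; do
  if [ "$(date -d "$d $t" +%s)" -lt "$cutoff" ]; then
    echo "deleting $path"
    hdfs dfs -rm -skipTrash "$path"
  fi
done
```

For Spark event logs specifically, the history server cleaner properties are the safer long-term fix.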

    Increasing HDFS capacity

    While you identify and delete unneeded files, you can add more worker nodes to the cluster as a mitigation to prevent HDFS from running out of space:

    gcloud dataproc clusters update <cluster> --num-workers=<num>
    

    See more details in scaling Dataproc clusters.

    Spark event logs

    If it is caused by Spark event logs or history files, for live clusters, consider adding these properties in /etc/spark/conf/spark-defaults.conf:

    spark.history.fs.cleaner.enabled=true
    spark.history.fs.cleaner.interval=1d
    spark.history.fs.cleaner.maxAge=7d
    

    then restart the Spark history server with

    sudo systemctl restart spark-history-server.service
    

    It will then clean up the old files for you. You can change the interval to a smaller value, e.g. 10m, if you want the cleaner to run more frequently.

    For new clusters, add these properties at creation time:

    gcloud dataproc clusters create ... \
      --properties spark:spark.history.fs.cleaner.enabled=true,spark:spark.history.fs.cleaner.interval=1d,spark:spark.history.fs.cleaner.maxAge=7d
    

    See this doc on the related Spark configs. By the way, you can view Spark job history in the Dataproc web UI; after cleaning up some old history files, you should see fewer items there.
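Once the cleaner has been running for a while, a quick sanity check (a sketch; it assumes `hdfs dfs -ls` prints the modification date as YYYY-MM-DD in the sixth column) is to bucket the remaining event-log files by month and confirm that old months disappear:

```shell
# Count event-log files per month; months older than
# spark.history.fs.cleaner.maxAge should gradually disappear.
hdfs dfs -ls /user/spark/eventlog | tail -n +2 \
  | awk '{print substr($6, 1, 7)}' | sort | uniq -c
```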