I have a namenode that had to be brought down for an emergency and that has not had an FSImage taken in 9 months; it has about 5TB worth of edits files to process on its next restart. The secondary namenode has not been running (or performed any checkpoint operations) for those 9 months, hence the 9-month-old FSImage.
There are about 7.8 million inodes in the HDFS cluster. The machine has about 260GB of total memory.
We've tried a few different combinations of Java heap size, GC algorithms, etc., but have not been able to find a combination that allows the restart to complete without eventually slowing to a crawl due to full GCs.
I have 2 questions: 1. Has anyone found a namenode configuration that allows an edits-file backlog this large to complete successfully?
We were able to get through the 5TB backlog of edits files using a version of what I suggested in my question (2) on the original post. Here is the process we went through:
1. Back up all files in the location specified by the `dfs.namenode.name.dir` property of the namenode's `hdfs-site.xml` file.
2. Copy the next subset of edits files over to the `dfs.namenode.name.dir` location. If you are not familiar with the naming convention for the FSImage and edits files, take a look at the example below. It will hopefully clarify what is meant by "next subset of edits files."
3. Update `seen_txid` to contain the value of the last transaction represented by the last edits file from the subset you copied over in the previous step. So if the last edits file is `edits_0000000000000000011-0000000000000000020`, you would want to update the value of `seen_txid` to `20`. This essentially fools the namenode into thinking this subset is the entire set of edits files.
4. Start the namenode. On the `Startup Progress` tab of the HDFS Web UI, you will see that the namenode starts with the latest FSImage present, processes through the edits files present, creates a new FSImage file, and then goes into safemode while it waits for the datanodes to come online.
5. Stop the namenode. There will be an `edits_inprogress_########` file created as a placeholder by the namenode. Unless this is the final set of edits files to process, delete this file.
6. Repeat from step (2) with the next subset of edits files until the backlog is exhausted.

Let's say we have FSImage `fsimage_0000000000000000010`
and a bunch of edits files:

    edits_0000000000000000011-0000000000000000020
    edits_0000000000000000021-0000000000000000030
    edits_0000000000000000031-0000000000000000040
    edits_0000000000000000041-0000000000000000050
    edits_0000000000000000051-0000000000000000060
    ...
    edits_0000000000000000091-0000000000000000100
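The batching hinges on this naming convention: each finalized edits file encodes the first and last transaction IDs it contains, and `seen_txid` must match the last ID of the chosen batch. A small sketch of that bookkeeping (pure filename parsing, no HDFS involved):

```python
import re

# A finalized edits file is named edits_<first_txid>-<last_txid>,
# with both transaction IDs zero-padded.
def txid_range(name):
    m = re.fullmatch(r"edits_(\d+)-(\d+)", name)
    if m is None:
        raise ValueError(f"not a finalized edits file: {name}")
    return int(m.group(1)), int(m.group(2))

# seen_txid for a batch is the last transaction of its last file.
batch = [
    "edits_0000000000000000011-0000000000000000020",
    "edits_0000000000000000021-0000000000000000030",
]
seen_txid = max(txid_range(name)[1] for name in batch)
print(seen_txid)  # 30
```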
Following the steps outlined above:
1. We backed up the contents of `dfs.namenode.name.dir` to another location, ex: `/tmp/backup`.
2. We copied `edits_0000000000000000011-0000000000000000020` and `edits_0000000000000000021-0000000000000000030` over to the `dfs.namenode.name.dir` location.
3. We updated `seen_txid` to contain a value of `30`, since this is the last transaction we will be processing during this run.
4. We started the namenode and confirmed on the `Startup Progress` tab that it correctly used `fsimage_0000000000000000010` as a starting point and then processed `edits_0000000000000000011-0000000000000000020` and `edits_0000000000000000021-0000000000000000030`. It then created a new FSImage file `fsimage_0000000000000000030` and entered safemode, waiting for the datanodes to come up.
5. We stopped the namenode and deleted `edits_inprogress_########`, since this is not the final set of edits files to be processed.
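The per-run file shuffling can be scripted. Here is a minimal sketch, assuming the full backlog sits in a backup directory (e.g. the `/tmp/backup` copy from step 1), the namenode is stopped while files are staged, and the name-directory path and batch size passed in are illustrative, not prescribed:

```python
import shutil
from pathlib import Path

def stage_next_batch(backup_dir, name_dir, batch_size=2):
    """Copy the next subset of finalized edits files from the backup
    into dfs.namenode.name.dir and rewrite seen_txid to the last
    transaction of that subset (steps 2-3 of the example above)."""
    backup_dir, name_dir = Path(backup_dir), Path(name_dir)
    # edits_inprogress_######## has no '-', so this glob skips it.
    already_staged = {p.name for p in name_dir.glob("edits_*-*")}
    # Zero-padded names sort in transaction order.
    pending = sorted(p for p in backup_dir.glob("edits_*-*")
                     if p.name not in already_staged)
    batch = pending[:batch_size]
    for p in batch:
        shutil.copy2(p, name_dir / p.name)
    if batch:
        last_txid = int(batch[-1].name.split("-")[-1])
        # Fool the namenode into treating this subset as the whole log.
        (name_dir / "seen_txid").write_text(f"{last_txid}\n")
    return [p.name for p in batch]
```

Between calls you would start the namenode, wait for the new FSImage to be written, stop it, and delete any `edits_inprogress_########` placeholder, as described above.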