hadoop | hdfs | namenode

How to successfully complete a namenode restart with 5TB worth of edit files to process


I have a namenode that had to be brought down for an emergency and that has not had an FSImage taken for 9 months; it has about 5TB worth of edit files to process on its next restart. The secondary namenode has not been running (nor performed any checkpoint operations) in that time, which is why the FSImage is 9 months old.

There are about 7.8 million inodes in the HDFS cluster. The machine has about 260GB of total memory.

We've tried several combinations of Java heap size, GC algorithm, and so on, but have not found one that lets the restart complete without eventually slowing to a crawl due to full garbage collections (FGCs).

I have two questions:

  1. Has anyone found a namenode configuration that allows this large of an edit file backlog to be processed successfully?

  2. An alternate approach I've considered is restarting the namenode with only a manageable subset of the edit files present. Once the namenode comes up and creates a new FSImage, bring it down, copy the next subset of edit files over, and then restart it. Repeat until it has processed the entire set of edit files. Would this approach work? Is it safe, in terms of the overall stability of the system and the file system?

Solution

  • We were able to get through the 5TB backlog of edits files using a version of the approach I suggested in question 2 of the original post. Here is the process we went through:

    1. Make sure that the namenode is "isolated" from the datanodes, either by shutting down the datanodes or by removing them from the slaves list while the namenode is offline. This keeps the namenode from communicating with the datanodes before the entire backlog of edits files has been processed.
    2. Move the entire set of edits files to a location outside of what is configured in the dfs.namenode.name.dir property of the namenode's hdfs-site.xml file.
    3. Move (or copy if you would like to maintain a backup) the next subset of edits files to be processed to the dfs.namenode.name.dir location. If you are not familiar with the naming convention for the FSImage and edits files, take a look at the example below. It will hopefully clarify what is meant by next subset of edits files.
    4. Update the seen_txid file to contain the value of the last transaction covered by the last edits file from the subset you copied over in step (3). So if the last edits file is edits_0000000000000000011-0000000000000000020, you would update the value of seen_txid to 20. This essentially fools the namenode into thinking this subset is the entire set of edits files. (Steps 3 and 4 are sketched in code after this list.)
    5. Start up the namenode. If you look at the Startup Progress tab of the HDFS Web UI, you will see the namenode start from the latest FSImage present, process the edits files, create a new FSImage file, and then go into safemode while it waits for the datanodes to come online.
    6. Bring down the namenode.
    7. There will be an edits_inprogress_######## file created as a placeholder by the namenode. Unless this is the final set of edits files to process, delete this file.
    8. Repeat steps 3-7 until you've worked through the entire backlog of edits files.
    9. Bring up the datanodes. The namenode should leave safemode once the datanodes have reported enough block locations to satisfy the safemode threshold (dfs.namenode.safemode.threshold-pct).
    10. Set up a secondary namenode, or high availability for your cluster, so that FSImage checkpoints are created periodically from now on.
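
    As a concrete illustration of steps 3 and 4, here is a minimal Python sketch; it is not from the original post. The paths NAME_DIR and BACKUP_DIR and the helper stage_next_batch are hypothetical names to adapt to your own dfs.namenode.name.dir layout, and it assumes the standard edits_<startTxId>-<endTxId> naming convention for finalized edits files.

    ```python
    import os
    import re
    import shutil

    # Assumed paths -- adjust to your cluster. seen_txid and the edits files
    # live under the "current" subdirectory of dfs.namenode.name.dir.
    NAME_DIR = "/data/dfs/name/current"
    BACKUP_DIR = "/tmp/backup"

    # Finalized edits files are named edits_<startTxId>-<endTxId> (zero-padded).
    EDITS_RE = re.compile(r"^edits_(\d+)-(\d+)$")

    def list_edits(directory):
        """Return (start_txid, end_txid, filename) tuples sorted by start txid."""
        found = []
        for name in os.listdir(directory):
            m = EDITS_RE.match(name)
            if m:
                found.append((int(m.group(1)), int(m.group(2)), name))
        return sorted(found)

    def stage_next_batch(batch_size=2):
        """Copy the next batch of edits files into NAME_DIR and update seen_txid.

        Returns the transaction id written to seen_txid, or None once the
        backlog in BACKUP_DIR has been fully staged.
        """
        already_staged = {name for _, _, name in list_edits(NAME_DIR)}
        pending = [e for e in list_edits(BACKUP_DIR) if e[2] not in already_staged]
        batch = pending[:batch_size]
        if not batch:
            return None
        for _, _, name in batch:
            shutil.copy2(os.path.join(BACKUP_DIR, name),
                         os.path.join(NAME_DIR, name))
        # Step 4: fool the namenode into treating this batch as the whole set.
        last_txid = batch[-1][1]
        with open(os.path.join(NAME_DIR, "seen_txid"), "w") as f:
            f.write("%d\n" % last_txid)
        return last_txid
    ```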

    Example:

    Let's say we have FSImage fsimage_0000000000000000010 and a bunch of edits files:

        edits_0000000000000000011-0000000000000000020
        edits_0000000000000000021-0000000000000000030
        edits_0000000000000000031-0000000000000000040
        edits_0000000000000000041-0000000000000000050
        edits_0000000000000000051-0000000000000000060
        ...
        edits_0000000000000000091-0000000000000000100

    Following the steps outlined above:

    1. All datanodes brought offline.
    2. All edits files moved from the dfs.namenode.name.dir location to another location, e.g. /tmp/backup.
    3. Let's process 2 files at a time. So copy edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030 over to the dfs.namenode.name.dir location.
    4. Update seen_txid to contain a value of 30 since this is the last transaction we will be processing during this run.
    5. Start up the namenode, and confirm through the HDFS Web UI's Startup Progress tab that it correctly used fsimage_0000000000000000010 as a starting point and then processed edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030. It then created a new FSImage file fsimage_0000000000000000030 and entered safemode, waiting for the datanodes to come up.
    6. Bring down the namenode.
    7. Delete the placeholder file edits_inprogress_######## since this is not the final set of edits files to be processed.
    8. Proceed with the next run and repeat until all edits files have been processed.
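
    For reference, a hypothetical driver for this example, reusing NAME_DIR and stage_next_batch from the earlier sketch:

    ```python
    import glob
    import os

    # Run 1 (steps 3-4): stages edits 11-20 and 21-30, writes seen_txid = 30.
    assert stage_next_batch(batch_size=2) == 30
    # ... steps 5-6: start the namenode, confirm fsimage_0000000000000000030
    # was written, then bring the namenode back down ...

    # Step 7: remove the placeholder before the next run (skip on the final run).
    for f in glob.glob(os.path.join(NAME_DIR, "edits_inprogress_*")):
        os.remove(f)

    # Run 2: stages edits 31-40 and 41-50, writes seen_txid = 50. Repeat until
    # stage_next_batch() returns None, then bring the datanodes back online.
    assert stage_next_batch(batch_size=2) == 50
    ```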