What is Safe Mode in HDFS

How to Fix Hadoop HDFS Cluster with Missing Blocks After Reinstalling a Node?

I have a 5-slave Hadoop cluster (running CDH4); the slaves are where the DataNode and TaskTracker run. Each slave has 4 partitions for HDFS storage. One of the slaves had to be reinstalled, which resulted in the loss of one of its HDFS partitions. At that point, HDFS was complaining about 35K missing blocks.

A few days later, the reinstallation was complete and I brought the node back online in Hadoop. HDFS stays in safe mode, and the new server registers nowhere near as many blocks as the other nodes. For example, under DFS Admin the new node shows 6K blocks while the other nodes have around 400K blocks.

Currently, the new node's DataNode logs show that it is verifying (or copying?) a number of blocks, some of which fail because they already exist. I believe HDFS is only replicating existing data to the new node. Verification example:

Example of an error:

In DFS Admin I can also see that this new node is at 61% of capacity (roughly the same usage as the other nodes), even though its block count is about 2% of theirs. I suspect this is just the old data that HDFS doesn't recognize.

I suspect that one of the following happened: (a) HDFS abandoned this node's data because it was stale; (b) the reinstallation changed some system parameter, so HDFS treats it as a new node (i.e. not an existing one with data); or (c) the drive mapping got messed up somehow, so the partition mapping changed and HDFS can't find the old data (although the drives are labeled and I'm 95% sure we got this right).

Main question: How can I get HDFS to recognize the data on this drive?

  • Answer: Restart the NameNode, and the nodes will report again which blocks they have (see Update 1 below)

Sub-question 1: If my assumption about the new node's data usage is correct (that the 61% usage is ghost data), will HDFS ever clean this up, or do I have to remove it manually?

  • Less of a problem now, since a large part of the drive's data is recognized (see Update 1 below)

Sub-question 2: Currently I am unable to run fsck to find the missing blocks because "replication queues have not initialized". Any idea how to fix this? Do I have to wait for the new node to rebalance (i.e. for this verify/copy phase to end)?

  • Answer: Leaving Safe Mode let me do that (see Update 1 below)

Updates

Update 1: I thought I had fixed the problem by restarting my NameNode. This caused the new node's block count to jump to roughly the same level as the other nodes, and DFS changed its message to:

Safe mode is ON. The reported blocks 629047 needs additional 8172 blocks to reach the threshold 0.9990 of total blocks 637856. Safe mode will be turned off automatically.
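
The numbers in that message can be checked by hand. A minimal Python sketch, assuming the NameNode truncates threshold × total to an integer (an assumption about its internals, not something stated in this thread); `parse_safemode_message` is a hypothetical helper name:

```python
import re

MESSAGE = (
    "Safe mode is ON. The reported blocks 629047 needs additional 8172 blocks "
    "to reach the threshold 0.9990 of total blocks 637856."
)

def parse_safemode_message(message):
    """Extract the four numbers from a safe mode status message."""
    pattern = (
        r"reported blocks (\d+) needs additional (\d+) blocks "
        r"to reach the threshold ([\d.]+) of total blocks (\d+)"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("not a recognised safe mode message")
    reported, additional, threshold, total = match.groups()
    return int(reported), int(additional), float(threshold), int(total)

reported, additional, threshold, total = parse_safemode_message(MESSAGE)

# Blocks required before safe mode can end (assumed: truncated to an integer).
required = int(threshold * total)
# Blocks that no DataNode has reported at all.
unreported = total - reported

print(required)             # 637218
print(required - reported)  # 8171
print(unreported)           # 8809
```

The plain difference (8171) is one less than the 8172 in the message, so the NameNode presumably asks for one block beyond the difference. The 8809 unreported blocks are in line with the roughly 8800 missing blocks DFS reports once safe mode is off.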

I left it in this state for several hours, hoping it would finally leave safe mode, but nothing changed. I then turned off Safe Mode manually, and the DFS message changed to "8800 blocks missing". At that point, I was able to run fsck and see a large number of files that are missing blocks.

Current remaining issue: How can these missing blocks be recovered? (Should I spin that off into a new question?)

reply

We had a similar problem today. One of our nodes (out of 3, with replication = 3) just died on us, and after rebooting it we started to see this in the logs of the affected DataNode:

The NameNode web UI showed that DataNode with only 92 blocks (out of the 13400 that the other nodes had).

We fixed this by triggering a full block report on the DataNode, which updated the NameNode's data:
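
A minimal sketch of triggering such a report from Python, assuming a Hadoop release whose `hdfs dfsadmin` provides the `-triggerBlockReport` option (older releases may not have it). The host name and port are placeholders, and `trigger_block_report_cmd` and `run` are hypothetical helper names:

```python
import subprocess

def trigger_block_report_cmd(datanode_ipc_address):
    """Build the dfsadmin command that asks one DataNode for a full block report."""
    return ["hdfs", "dfsadmin", "-triggerBlockReport", datanode_ipc_address]

def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)

# Hypothetical DataNode host and IPC port; substitute the real values.
run(trigger_block_report_cmd("datanode1.example.com:50020"))
```

With `dry_run=True` the sketch only prints the command, so it can be reviewed before anything touches the cluster.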

The result: the DataNode had been missing only a few blocks, which it happily accepted, and the cluster was restored.

In the end, I had to delete the files with bad blocks. After further investigation, I found that their replication factor was very low (rep = 1, if I remember correctly).

This SO post contains more information on how to find files with bad blocks. To do this, use:
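
One common approach is to run `hdfs fsck /` (`hadoop fsck /` on older installs) and filter out the noise. A minimal Python sketch of that filtering, assuming the output has already been captured as text; `problem_lines` is a hypothetical helper and the sample is made up for illustration, not real fsck output:

```python
import re

def problem_lines(fsck_output):
    """Keep fsck lines that are neither progress dots nor replica notices."""
    kept = []
    for line in fsck_output.splitlines():
        if not line.strip():
            continue
        if re.fullmatch(r"\.+", line.strip()):
            continue  # progress dots printed for healthy files
        if "replica" in line.lower():
            continue  # under/over-replication notices, not missing data
        kept.append(line)
    return kept

# Made-up sample in the general shape of fsck output, for illustration only.
SAMPLE = """\
..........
/data/a.log:  Under replicated blk_1001. Target Replicas is 3 but found 1 replica(s).
/data/b.log: CORRUPT block blk_1002
/data/b.log: MISSING 1 blocks of total size 1024 B.
....
"""

for line in problem_lines(SAMPLE):
    print(line)
```

What remains are the lines naming files with corrupt or missing blocks. `hdfs fsck / -list-corruptfileblocks` is another way to get a similar list directly.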

To answer my own questions:

  1. Can these files be restored? Not unless the failed nodes/drives are brought back online with the missing data.
  2. How do I get out of Safe Mode? Remove the problematic files and leave Safe Mode.
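
A minimal sketch of that cleanup as a command sequence, assuming the `hdfs` CLI is on the PATH. Safe mode is left first here because the NameNode is read-only while in safe mode. `fsck -delete` permanently removes the affected files, so the sketch defaults to a dry run; `run_steps` is a hypothetical helper name:

```python
import subprocess

# Each entry is one CLI invocation; nothing runs unless dry_run is False.
RECOVERY_STEPS = [
    ["hdfs", "dfsadmin", "-safemode", "get"],    # report whether safe mode is on
    ["hdfs", "dfsadmin", "-safemode", "leave"],  # force the NameNode out of safe mode
    ["hdfs", "fsck", "/", "-delete"],            # delete files whose blocks are gone
]

def run_steps(steps, dry_run=True):
    """Print each command and, unless dry_run, execute it in order."""
    printed = []
    for cmd in steps:
        line = " ".join(cmd)
        print(line)
        printed.append(line)
        if not dry_run:
            # Exit codes are not checked in this sketch.
            subprocess.run(cmd)
    return printed

run_steps(RECOVERY_STEPS)
```

On older installs the same subcommands may be invoked as `hadoop dfsadmin` and `hadoop fsck` instead.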