Hardware fails; that is a fact. Hard drives are rather reliable nowadays, but every now and then a drive will fail or at least have hiccups. Monitoring disks with smartctl/smartd is a good idea. Below we discuss how some lesser issues can be handled without rebooting the system. It remains up to the sysadmin’s own discretion to judge the circumstances and decide whether the disk errors encountered are a one-time incident or a sign of an entirely failing disk.
Let’s have a look at typical smartctl -a DEVICE output:
# smartctl -a /dev/sda
...
ID# ATTRIBUTE_NAME          ....  RAW_VALUE
197 Current_Pending_Sector  ....  2
...
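For scripted checks, the raw value of attribute 197 can be pulled out of the attribute table with a little awk. The following is a minimal sketch, not part of the original procedure: the function name pending_sectors is hypothetical, the sample line is made up for illustration (it is not real drive output), and it assumes the raw value is the last field on the line, which is usually but not necessarily the case.

```shell
# Print the raw value of attribute 197 (Current_Pending_Sector).
# Reads smartctl attribute table output on stdin.
pending_sectors() {
    # Assumption: the raw value is the last whitespace-separated field.
    awk '$1 == 197 { print $NF }'
}

# Illustrative sample line (made up for this sketch, not real drive output).
sample='197 Current_Pending_Sector  0x0012  100  100  000  Old_age  Always  -  2'

pending=$(printf '%s\n' "$sample" | pending_sectors)
echo "pending sectors: $pending"
```

On a live system the function would be fed from the tool itself, for example: smartctl -A /dev/sda | pending_sectors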
OK, so we have an oops here. Time to find out what is going on:
# smartctl --test=short /dev/sda
This will take a very short time, a couple of minutes at most, e.g.:
Please wait 2 minutes for test to complete.
Test will complete after Sat Feb 2 16:25:10 2013
Now, with a current pending sector count > 0 we will most likely have an ouch after the test completes:
Num  ..  Status                   Remaining  ..  LBA_of_first_error
...
# 2  ..  Completed: read failure  90%        ..  1825221261
...
LBA counts sectors in units of 512 bytes and starts at 0, so we now need to find out where 1825221261 is actually located:
# fdisk -lu /dev/sda
will display some information about the device in question:
   Device Boot      Start         End      Blocks   Id  System
...
/dev/sda3        31641600  1953523711   960941056   83  Linux
...
So 1825221261 lies on /dev/sda3, because it falls between that partition's start and end sectors. Next we need to determine the file system block for our LBA, so we first have to get the block size:
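On a disk with many partitions, the comparison against the Start and End columns can be scripted. The sketch below is an illustration only: partition_for_lba is a hypothetical helper, it assumes the classic column layout shown above (device, optional boot flag, start, end), and it is fed the /dev/sda3 line from this example rather than live fdisk output.

```shell
# Print the partition (and its start sector) whose range contains the LBA.
# Usage: fdisk -lu DEVICE | partition_for_lba LBA
partition_for_lba() {
    awk -v lba="$1" '
        /^\/dev\// {
            # Skip the optional boot flag column ("*") if present.
            i = ($2 == "*") ? 3 : 2
            first = $i + 0
            last = $(i + 1) + 0
            if (lba + 0 >= first && lba + 0 <= last)
                print $1, $i
        }'
}

# The partition line from the example above.
fdisk_sample='/dev/sda3 31641600 1953523711 960941056 83 Linux'

result=$(printf '%s\n' "$fdisk_sample" | partition_for_lba 1825221261)
echo "$result"
```

The second field printed is the partition start sector, which is needed for the block calculation later on.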
# tune2fs -l /dev/sda3 | grep Block
Block count:              240235264
Block size:               4096
Blocks per group:         32768
OK, 4096 bytes. So, the actual block number will be:
(LBA - PARTITION_START_SECTOR) * (512 / BLOCKSIZE)
In our case, this is:
(1825221261 - 31641600) * (512 / 4096) = 224197457.625
We only need the integer part; the fraction (.625 = 5/8) just tells us that we are in the 6th of the eight sectors that make up this file system block.
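The same calculation can be done with shell integer arithmetic, which avoids the fraction entirely. This is a small sketch using the numbers from this example; the variable names are made up here, and the 512-byte sector size is the assumption stated above.

```shell
lba=1825221261        # LBA_of_first_error from the self-test log
part_start=31641600   # start sector of /dev/sda3 from fdisk -lu
block_size=4096       # block size from tune2fs -l
sector_size=512       # assumed sector size, as stated in the text

sectors_per_block=$((block_size / sector_size))
offset=$((lba - part_start))

# Integer division gives the file system block; the remainder is the
# zero-based index of the sector within that block.
block=$((offset / sectors_per_block))
sector_in_block=$((offset % sectors_per_block))

echo "file system block: $block"
echo "sector index within block (from 0): $sector_in_block"
```

A remainder of 5 counted from zero corresponds to the 6th sector of the block.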
It is good practice to find out which inode/file has been affected by using debugfs (operations can take a while with this tool):
debugfs: open /dev/sda3
debugfs: icheck BLOCK (224197457 in our case)
Block Inode number
224197457 56025154
debugfs: ncheck 56025154
Inode Pathname
56025154 /some/path/to/file
If this file isn’t anything crucial, we can start correcting things by overwriting the affected file system block with zeros. Note that this destroys whatever that block contained:
# dd if=/dev/zero of=/dev/sda3 bs=4096 count=1 seek=BLOCK
smartctl -a will now show an updated current pending sector count, and you can re-run a short smartctl test.
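Since the dd write is destructive and a wrong seek value would zero a healthy block, it can help to print the commands for review before running anything. The following is a cautious sketch, not part of the original procedure: print_repair_cmds is a hypothetical helper that only echoes commands, and the extra read test with iflag=direct (a GNU dd option) is an addition made here to confirm that the block really is unreadable.

```shell
# Print, but do not run, the commands for one file system block.
# Arguments: partition device, block number, block size in bytes.
print_repair_cmds() {
    dev=$1 block=$2 bs=$3
    # Read test (an addition in this sketch): should fail with an I/O
    # error if the block is bad. iflag=direct is specific to GNU dd.
    echo "dd if=$dev of=/dev/null bs=$bs count=1 skip=$block iflag=direct"
    # The destructive write from the text: zeroes the whole block.
    echo "dd if=/dev/zero of=$dev bs=$bs count=1 seek=$block"
}

cmds=$(print_repair_cmds /dev/sda3 224197457 4096)
echo "$cmds"
```

Only after checking the printed device and block number against the earlier steps would the second command be run by hand.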