Fscking Precautions: Snapshots and the undo file

If you have a badly corrupted filesystem, e.g. because you had bad blocks on the hard drive, you may want to take some precautions to make sure fsck doesn't destroy your files. This goes especially for large raids.

Something you can always do is test the fsck operation on a dmsetup snapshot. For this to work you must boot from a different partition, and that partition must hold a filesystem that supports sparse files.
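Whether a given filesystem really stores sparse files sparsely can be checked before relying on it. The following is only a minimal sketch, assuming GNU coreutils (truncate, stat, mktemp); the function name supports_sparse is made up for illustration. It creates a 1 GiB probe file without writing data and compares its apparent size with the space actually allocated.

```shell
# Hypothetical helper: succeeds if the filesystem holding directory $1
# stores a freshly truncated file sparsely.
supports_sparse() {
    dir="$1"
    probe="$(mktemp "$dir/sparse-probe.XXXXXX")" || return 2
    # Extend the empty file to 1 GiB without writing any data.
    truncate -s 1G "$probe"
    # Size in bytes as seen by applications.
    apparent=$(stat -c %s "$probe")
    # Bytes actually allocated: block count times the block unit stat reports.
    allocated=$(( $(stat -c %b "$probe") * $(stat -c %B "$probe") ))
    rm -f "$probe"
    # A sparse file allocates less than its apparent size.
    [ "$allocated" -lt "$apparent" ]
}
```

For example, supports_sparse /mnt/scratch && echo "sparse files work here", where /mnt/scratch stands for whatever directory will hold the COW file.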

# The path to your snapshot storage file goes in $COW, the device to check in $INPUT; set both first.
# The sparse file size: terabytes should be enough for your partition, otherwise increase this number.
truncate -s …000G $COW
# Set up a loop device backed by the COW file for the dmsetup snapshot.
loop="$(losetup -f)"
losetup "$loop" "$COW"
# Set up the snapshot device: persistent (p), with a chunk size of 8 sectors.
echo "0 $(blockdev --getsz "$INPUT") snapshot $INPUT $loop p 8" | dmsetup create top
# Show where the snapshot device is.
echo "loop: $loop top: /dev/mapper/top"
dmsetup status

After this you should be able to fsck /dev/mapper/top. You can see how much space the COW file actually occupies with du -h $COW. You may also want to get the newest fsck version (e.g. with a newer fsck-static package). If you end up with many multiply-claimed blocks, this e2fsck version may help: http://git.hpdd.intel.com/tools/e2fsprogs.git/ (check out a -wc branch).
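Once the test run is done, the snapshot can be thrown away again. Since fsck only wrote to the snapshot, its changes live in the COW file and the original device should be untouched. The following is a sketch under that assumption, not a tested recipe; the names run and teardown_snapshot and the DRY_RUN switch are made up for illustration, while the device name top matches the setup above.

```shell
# Hypothetical wrapper: with DRY_RUN=1 only print the command, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: undo the snapshot setup.
#   $1: loop device backing the COW file (the value printed as "loop:")
#   $2: path of the COW file
teardown_snapshot() {
    loop="$1"
    cow="$2"
    # Drop the snapshot device /dev/mapper/top.
    run dmsetup remove top || return 1
    # Detach the loop device from the COW file.
    run losetup -d "$loop" || return 1
    # Discard the changes fsck recorded in the COW file.
    run rm -f "$cow"
}
```

Running DRY_RUN=1 teardown_snapshot "$loop" "$COW" first shows the three commands without executing them.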

Good luck, you might need it!

Linux Raid: ignoring /dev/sdX as it reports /dev/sdY as failed

The fix for this problem may be extremely easy. What happened is that some disks of the raid failed and were ejected. This happens. But the raid will no longer assemble if the failed disks come first on the mdadm assemble command line, because for some reason mdadm does not go by what most disks say, but by what the first disks say. So if you have a raid with 10 disks and the first two on the command line are the failed ones, it will reject the remaining 8 because they are not compatible. All you need to do is list those two failed disks at the end and add --force to activate the raid again:
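If it is not obvious which disks were ejected, the event counter in each member's superblock can hint at it, since an ejected disk stops receiving updates and its counter lags behind. The following is a rough sketch, assuming mdadm --examine prints a line of the form "Events : <number>" for your metadata version; the helper names events_of and order_members are made up for illustration.

```shell
# Print the event counter that mdadm --examine reports for one member device.
# Assumption: the output contains a line of the form "Events : <number>".
events_of() {
    mdadm --examine "$1" | awk '$1 == "Events" { print $3; exit }'
}

# Print the given devices ordered by event count, highest first,
# so that members with a stale counter end up last.
order_members() {
    for dev in "$@"; do
        ev="$(events_of "$dev")"
        # Treat an unreadable superblock as event count 0.
        printf '%s %s\n' "${ev:-0}" "$dev"
    done | sort -s -k1,1nr | awk '{ print $2 }'
}
```

The output of order_members /dev/sd[a-f] /dev/sdX /dev/sdY can then be used as the device list for the assemble command, with the lagging disks at the end.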

Instead of
mdadm --assemble /dev/md1 /dev/sdX /dev/sdY /dev/sd[a-f]
use
mdadm --assemble --force /dev/md1 /dev/sd[a-f] /dev/sdX /dev/sdY

Note that there's probably still a good reason for those disks to have been marked as failed...