This article shows you how to recover a Linux software RAID array after a single disk failure. For the subsequent examples, we assume a system with two disks, sda and sdb, each split into two partitions that form two RAID1 arrays: md0 (sda1 + sdb1) and md1 (sda2 + sdb2).
Healthy Array

cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      511988 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      976247676 blocks super 1.1 [2/2] [UU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
Note that both arrays (md0, md1) are active with all members present: [2/2] [UU].
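The same health information is available per array from mdadm itself. A quick sketch of such a check (the exact field names may vary slightly between mdadm versions):

mdadm --detail /dev/md0 | grep -E 'State|Devices'

On a healthy array this should report a clean (or active) state, two active devices and zero failed devices.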
cat /proc/partitions
major minor  #blocks  name

   8        0  976762584 sda
   8        1     512000 sda1
   8        2  976248832 sda2
   8       16  976762584 sdb
   8       17     512000 sdb1
   8       18  976248832 sdb2
   9        1  976247676 md1
 253        0    2097152 dm-0
 253        1    2097152 dm-1
   9        0     511988 md0
Note that both physical disks (sda, sdb), all four RAID member partitions (sda1, sda2, sdb1, sdb2) and both md devices (md0, md1) are present.
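On systems with a reasonably recent util-linux, lsblk shows the same layout as a tree, which makes a missing partition easy to spot at a glance; a minimal check might be:

lsblk -o NAME,SIZE,TYPE /dev/sda /dev/sdb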
Degraded Array
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      511988 blocks super 1.0 [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[0]
      976247676 blocks super 1.1 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

unused devices: <none>
Note: One or more RAID arrays will show missing members: [2/1] [_U]
If the entire disk is failing, every array with a member on that disk will be degraded:

Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      511988 blocks super 1.0 [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[0](F)
      976247676 blocks super 1.1 [2/1] [_U]
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>

In case of severe failure, or if a disk fails to initialize on system restart, the disk itself can go missing, as shown in the examples below.
Personalities : [raid1]
md0 : active raid1 sdb1[1]
      511988 blocks super 1.0 [2/1] [_U]

md1 : active raid1 sdb2[1]
      976247676 blocks super 1.1 [2/1] [_U]
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
Disk missing in /proc/partitions:

major minor  #blocks  name

   8       16  976762584 sdb
   8       17     512000 sdb1
   8       18  976248832 sdb2
   9        1  976247676 md1
 253        0    2097152 dm-0
 253        1    2097152 dm-1
   9        0     511988 md0
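Rather than discovering a degraded array by accident, mdadm can watch all arrays and send mail when a member fails. A minimal sketch, assuming a working local mail setup and using admin@example.com as a placeholder address:

mdadm --monitor --scan --daemonise --mail admin@example.com

Many distributions already ship this as a service (often named mdmonitor), so check before starting a second instance.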
Removing a Failed Disk

Manually marking devices as 'failed'

You should manually mark a device as faulty only if one of its arrays is in a degraded state due to a timeout or read error in one of the partitions. The physical disk itself should still be working fine.
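Before failing anything by hand, it can help to confirm what state the physical disk is actually in. One way, assuming smartmontools is installed:

smartctl -H /dev/sda
smartctl -l error /dev/sda

A failing health status or a growing error log suggests the whole disk should be replaced rather than just re-synced.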
CAUTION: Be sure you are failing only the partitions that belong to the failed disk (i.e. sda1/sda2 for sda and sdb1/sdb2 for sdb).

Set sda1 and sda2 as faulty:

[root@freshinstall ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@freshinstall ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
The failed members now show up with an (F) flag in /proc/mdstat:

cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      511988 blocks super 1.0 [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[0](F)
      976247676 blocks super 1.1 [2/1] [_U]
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
Removing failed devices from array

Before a disk can be replaced, the failed members should be removed from their corresponding arrays:
[root@freshinstall ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
[root@freshinstall ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
[root@freshinstall ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1]
      511988 blocks super 1.0 [2/1] [_U]

md1 : active raid1 sdb2[1]
      976247676 blocks super 1.1 [2/1] [_U]
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
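As a side note, the fail and remove steps can usually be combined into a single mdadm call, and recent mdadm versions also accept the keyword failed to drop every faulty member at once; check your mdadm(8) man page before relying on either form:

mdadm --manage /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm --manage /dev/md1 --remove failed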
If the removed disk is still readable and will be re-used elsewhere, wipe its RAID superblocks so it is no longer recognized as an array member:

[root@freshinstall ~]# mdadm --misc --zero-superblock /dev/sda1
[root@freshinstall ~]# mdadm --misc --zero-superblock /dev/sda2
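If the goal is simply to make sure nothing on the old partitions is auto-detected again, wipefs from util-linux (where available) clears any remaining filesystem or RAID signatures in one go; a sketch:

wipefs -a /dev/sda1
wipefs -a /dev/sda2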
To wipe the entire disk, overwrite it with zeros:

dd if=/dev/zero of=/dev/sda bs=1M

CAUTION: This process erases all data on the device given as the "of=" argument (/dev/sda in the example above).

Note: This process may take some time to complete.
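With coreutils 8.24 or newer, dd can report progress as it runs; on older systems the same statistics can be requested by sending SIGUSR1 to the running dd process:

dd if=/dev/zero of=/dev/sda bs=1M status=progress
# or, from another terminal:
kill -USR1 $(pgrep -x dd)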
Adding a Replacement Disk

The replacement disk must be the same size as the failed disk or larger. Once the replacement disk is inserted into the system, the kernel will log something like this:

dmesg | tail
sd 0:0:0:0: [sdc] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 0:0:0:0: [sdc] 4096-byte physical blocks
sd 0:0:0:0: [sdc] Write Protect is off
sd 0:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
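Before doing anything destructive, it is worth making sure that sdc really is the newly inserted drive; comparing the serial number reported by the disk with the label on the physical drive avoids partitioning the wrong device. Assuming smartmontools and a udev-managed /dev/disk layout:

smartctl -i /dev/sdc                 # prints model and serial number
ls -l /dev/disk/by-id/ | grep -w sdc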
cat /proc/partitions
major minor  #blocks  name

   8       16  976762584 sdb
   8       17     512000 sdb1
   8       18  976248832 sdb2
   9        1  976247676 md1
 253        0    2097152 dm-0
 253        1    2097152 dm-1
   9        0     511988 md0
   8        0  976762584 sdc

Check the SMART status of the new disk:

smartctl -H /dev/sdc
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
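The overall health flag is a fairly weak check. A short SMART self-test exercises the drive a little more (it typically takes a couple of minutes), and its result can be read back afterwards:

smartctl -t short /dev/sdc
# wait for the estimated completion time printed above, then:
smartctl -l selftest /dev/sdc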
Partitioning

Before partitioning, confirm that the new disk has not been auto-mounted:

mount | grep sdc

There should be no output. If anything is displayed, the corresponding partitions must be unmounted manually:

umount <partition>

Since the initial system used two mirrored drives of exactly the same size, we can use the operational drive as a template for partitioning the new one:

sfdisk -d -uS /dev/sdb | sfdisk -L -uS /dev/sdc
Checking that no-one is using this disk right now ...
OK
Disk /dev/sdc: 121601 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = sectors of 512 bytes, counting from 0
   Device Boot      Start        End   #sectors  Id  System
/dev/sdc1   *        2048    1026047    1024000  fd  Linux raid autodetect
/dev/sdc2         1026048 1953523711 1952497664  fd  Linux raid autodetect
/dev/sdc3               0          -          0   0  Empty
/dev/sdc4               0          -          0   0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0
   Device Boot      Start        End   #sectors  Id  System
/dev/sdc1   *        2048    1026047    1024000  fd  Linux raid autodetect
/dev/sdc2         1026048 1953523711 1952497664  fd  Linux raid autodetect
/dev/sdc3               0          -          0   0  Empty
/dev/sdc4               0          -          0   0  Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
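The sfdisk approach above assumes MBR partition tables, as used in this example. If the mirrored disks had been partitioned with GPT instead, sgdisk (from the gdisk package) can clone the table in a similar way; note the argument order (target first) and the need to randomize the GUIDs afterwards:

sgdisk -R /dev/sdc /dev/sdb   # copy the partition table from sdb onto sdc
sgdisk -G /dev/sdc            # give the new disk its own unique GUIDs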
After this, you should be able to see your newly partitioned disk:

cat /proc/partitions
major minor  #blocks  name
   8       16  976762584 sdb
   8       17     512000 sdb1
   8       18  976248832 sdb2
   9        1  976247676 md1
 253        0    2097152 dm-0
 253        1    2097152 dm-1
   9        0     511988 md0
   8        0  976762584 sdc
   8        1     512000 sdc1
   8        2  976248832 sdc2
Adding new devices to the array

[root@freshinstall ~]# mdadm --manage /dev/md0 --add /dev/sdc1
mdadm: added /dev/sdc1
[root@freshinstall ~]# mdadm --manage /dev/md1 --add /dev/sdc2
mdadm: added /dev/sdc2
[root@freshinstall ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2] sdb1[1]
      511988 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdc2[2] sdb2[1]
      976247676 blocks super 1.1 [2/1] [_U]
      [>....................]  recovery =  0.1% (1618176/976247676) finish=160.6min speed=101136K/sec
      bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>

Note: Both arrays now have both members attached again, and md1 is actively reconstructing onto the new disk.
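The rebuild can be followed while it runs, and its speed is bounded by two kernel tunables (values in KB/s per device); raising the minimum can speed things up on an otherwise idle machine:

watch -n 5 cat /proc/mdstat
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
echo 50000 > /proc/sys/dev/raid/speed_limit_min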
Re-installing bootloader

Once the failed drive is replaced, re-install GRUB into the MBR of the new disk to make sure you can boot from it:

grub-install /dev/sdc
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not. If any of the lines is incorrect,
fix it and re-run the script `grub-install'.
# this device map was generated by anaconda
(hd0)     /dev/sdc
(hd1)     /dev/sdb
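Finally, if the system keeps a static mdadm configuration file (commonly /etc/mdadm.conf, or /etc/mdadm/mdadm.conf on Debian-style systems), it is worth confirming that its ARRAY lines still match the running arrays. Since arrays are usually identified by UUID rather than by member device, the file normally needs no change after a disk swap:

mdadm --detail --scan
grep ^ARRAY /etc/mdadm.conf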