MDADM - Replacing a failed disk
Before we begin, let’s be clear: RAID is not a backup! Always have a dedicated, working, and regularly tested backup solution in place in the event of a disaster or failure.
It is also strongly recommended to configure email alerting when using MDADM arrays so you are notified when there is an issue. To set this up, edit the /etc/mdadm.conf file to include the following parameter:
MAILADDR simon@magrin.one
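To confirm the alert path actually works, you can ask mdadm's monitor mode to send a one-off test message for each array it finds (run as root; this assumes a working local MTA such as sendmail or postfix):

```shell
# Send a single test alert for every array listed by --scan,
# using the MAILADDR configured in /etc/mdadm.conf
mdadm --monitor --scan --test --oneshot
```

If no email arrives, check the MTA logs before trusting the alerting in production.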
Other options to consider for active monitoring of your system are:
- logwatch – monitors /var/log/messages for anything out of the ordinary and sends a daily email report
- smartd – can be configured to run short self-tests daily and long self-tests weekly
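As a sketch of the smartd option, an /etc/smartd.conf entry along these lines schedules a short self-test every day at 02:00 and a long self-test every Saturday at 03:00 (the device name, email address, and times here are examples to adapt):

```
# Monitor /dev/sdb with all checks (-a), mail alerts (-m),
# short test daily at 02:00, long test Saturdays at 03:00
/dev/sdb -a -m simon@magrin.one -s (S/../.././02|L/../../6/03)
```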
You can also refer to the syslog, dmesg, and smartctl output for disk-related failures, and to /proc/mdstat for signs of MD array degradation. To investigate a suspect drive, interrogate its S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) status:
smartctl -i /dev/sdb
The above prints the drive’s identity information and confirms SMART support; the command below kicks off a thorough (extended) self-test:
smartctl -t long /dev/sdb
Once the test has completed, review the self-test log:
smartctl -l selftest /dev/sdb
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 23574 267040872
To replace the failing disk, start by marking it as failed and removing it from the MD array:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
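Before pulling the drive, it is worth confirming that the array no longer lists the failed member:

```shell
# The failed partition should no longer appear in the device list
mdadm --detail /dev/md0

# The array will show as degraded, e.g. [U_] instead of [UU]
cat /proc/mdstat
```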
If the disk isn’t hot-swappable, shut down the server and swap the disk out for a replacement. With the new disk in place, duplicate the partition setup from the existing disk. You can use sfdisk to duplicate the partition layout:
sfdisk -d /dev/sda | sfdisk /dev/sdb
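As a sanity check (assuming a bash shell), you can confirm the two disks now carry identical layouts before proceeding; the disk identifier line will differ, which is expected:

```shell
# Compare the partition dumps of both disks;
# only the "Disk identifier" line should differ
diff <(sfdisk -d /dev/sda) <(sfdisk -d /dev/sdb)
```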
Alternatively, use cfdisk to recreate the partitions manually; be sure to change the Units value to Sectors for precise replication of the partition table. After replicating the partitions and marking them as type 0xFD (Linux RAID autodetect), add the new disk back to rebuild the MD array(s):
mdadm --manage /dev/md0 --add /dev/sdb1
Watch the rebuild status:
watch -n 1 cat /proc/mdstat
Ensure you make the replacement disk bootable; this can be done from the live system or by chrooting into the system from the same install media (CentOS > Rescue Install System):
grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
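The grub shell above applies to legacy GRUB. On newer CentOS releases that ship GRUB2, the equivalent step would be a sketch along these lines (adjust the device name to your replacement disk):

```shell
# GRUB2: install the boot loader onto the replacement disk's MBR
grub2-install /dev/sdb
```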
You can test by shutting down the host, detaching the known-good disk, and placing the rebuilt disk in the primary slot. It should boot on its own.