MDADM - Replacing a failed disk

Before we begin, let’s be clear: RAID is not a backup! Always have a dedicated, working and regularly tested backup solution in place in the event of a disaster or failure.

It is also strongly recommended to configure email alerting when using MDADM arrays so that you are notified when there is an issue. To set this up, edit /etc/mdadm.conf to include the following parameter:


MAILADDR simon@magrin.one
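
To confirm that alerting works, mdadm can send a test alert for each array it finds; this is a quick sanity check, assuming local mail delivery is already configured on the host:


mdadm --monitor --scan --oneshot --test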

Other options to consider for active monitoring of your system are:

  • logwatch – monitors /var/log/messages for anything out of the ordinary and sends a daily email report
  • smartd – can be scheduled to run short self-tests daily and long self-tests weekly, as shown in the example below
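
As an illustration, a smartd schedule such as the following in /etc/smartd.conf runs a short self-test every day at 02:00 and a long self-test every Saturday at 03:00 (the device name and times here are only examples):


/dev/sdb -a -s (S/../.././02|L/../../6/03)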

You can also check syslog, dmesg and smartctl output for disk-related failures, and /proc/mdstat for signs of MD array degradation. To investigate a suspect drive, interrogate its S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) status:


smartctl -i /dev/sdb

The above prints the drive’s identification details and confirms whether SMART is available and enabled; the command below kicks off a thorough (extended) self-test:


smartctl -t long /dev/sdb

Once the test has completed, review the self-test log:


smartctl -l selftest /dev/sdb

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 23574 267040872

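A read failure during the extended test, as in the output above, is a strong indication the disk should be replaced. The SMART attribute table can provide supporting evidence such as reallocated or pending sectors (attribute names vary by vendor):


smartctl -A /dev/sdb
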
To replace the failing disk, start by marking it as failed and removing it from the MD array:


mdadm --manage /dev/md0 --fail /dev/sdb1

mdadm --manage /dev/md0 --remove /dev/sdb1
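
At this point it is worth confirming that the failed member is no longer listed in the array, for example:


mdadm --detail /dev/md0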

If the disk isn’t hot-swappable, shut down the server and swap the disk out for a replacement. With the new disk in place, duplicate the partition layout from the existing disk; sfdisk can copy it across:


sfdisk -d /dev/sda | sfdisk /dev/sdb
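
Note that on older systems sfdisk only handles MBR partition tables (as used here); if the disks use GPT, sgdisk from the gdisk package can replicate the layout instead, randomising the partition GUIDs on the new disk afterwards:


sgdisk -R /dev/sdb /dev/sda
sgdisk -G /dev/sdb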

An alternative is to recreate the partitions manually with cfdisk; be sure to change the units to sectors so the partition table is replicated precisely. After replicating the partitions and marking them with type 0xFD (Linux RAID autodetect), add the new disk back so the MD array(s) can rebuild:


mdadm --manage /dev/md0 --add /dev/sdb1

Watch the rebuild status:


watch -n 1 cat /proc/mdstat
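
While the array is resyncing, /proc/mdstat shows progress along these lines (illustrative output; device names, sizes and speeds will differ):

md0 : active raid1 sdb1[2] sda1[0]
      104320 blocks [2/1] [U_]
      [==>..................]  recovery = 12.5% (13056/104320) finish=0.5min speed=26112K/sec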

Ensure the replacement disk is made bootable; this can be done from the live system or by chrooting into it from the same install media (CentOS > Rescue Install System):


grub

grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
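
The commands above apply to legacy GRUB; on a system using GRUB2, install the boot loader directly onto the replacement disk instead (grub2-install on CentOS/RHEL, grub-install on Debian-based distributions):


grub2-install /dev/sdb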

You can test this by shutting down the host, detaching the known-good disk and placing the rebuilt disk in the primary slot; the system should boot from it as well.