RAID Failure – Incorrect Array Manipulation
Multi-disk systems usually employ a RAID configuration, of which there are several variants. Different variants offer different trade-offs between read/write speed, capacity and hardware redundancy.
The disk set appears to the operating system as a logical volume. This is achieved by a RAID controller, which connects physically to each disk and presents the set as a logical volume according to the RAID variant specified.
Certain RAID configurations can tolerate one disk falling offline. Once the failure is detected, the failed disk should be replaced so that the replacement disk can be ‘rebuilt’ automatically, returning the system to a state of redundancy.
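As an illustration of how such a rebuild can work, the sketch below assumes a single-parity layout (as used in RAID 5), where the parity block of each stripe is the XOR of its data blocks. The block contents and the helper name xor_blocks are hypothetical, and a real controller works on whole disks with rotating parity rather than on three short byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three hypothetical data blocks, one per data disk.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Disk 1 falls offline: its block is recomputed from the surviving
# data blocks plus the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])

# Disks 1 and 2 both offline: the survivors only yield the XOR of the
# two missing blocks, which is not enough to recover either one.
two_missing = xor_blocks([data[0], parity])
```

With one block missing, rebuilt equals the lost block. With two blocks missing, two_missing is only the XOR of the two lost blocks, which is why a single-parity array does not survive a second failure.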
One common problem occurs when the RAID controller, through a once-off write error, writes corrupt information to the disk array, making the RAID volume inaccessible. This type of failure has nothing to do with the physical condition of the disks.
Another common problem occurs when one disk falls offline and is not dealt with, and a second disk subsequently falls offline, making the RAID inaccessible. The cause of a disk failure could be electronic, service-area related, or mechanical.
In either case, incorrect and dangerous manipulation of the drives in the array can include:
- removing disks and putting them back in the wrong order
- replacement of drives followed by re-initialisation
- interruption of automatic array re-build
- rebuilding of OS over existing drives with faulty drives replaced
- running disk repair or defrag utilities
These actions can very seriously reduce the chances of a successful recovery, and will greatly increase the complexity and cost of recovering whatever data remains intact.
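The first item in the list above, read here as disks being put back in the wrong order, can be illustrated with a small sketch. It assumes a plain striped layout without parity, a hypothetical chunk size of 4 bytes, and two illustrative helpers, stripe and assemble, which are not part of any real RAID tool.

```python
def stripe(data: bytes, n_disks: int, chunk: int) -> list[bytes]:
    """Distribute data across n_disks in round-robin chunks."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks] += data[i:i + chunk]
    return [bytes(d) for d in disks]


def assemble(disks: list[bytes], chunk: int) -> bytes:
    """Read chunks back in the given disk order to form the logical volume."""
    out = bytearray()
    longest = max(len(d) for d in disks)
    for offset in range(0, longest, chunk):
        for d in disks:
            out += d[offset:offset + chunk]
    return bytes(out)


original = b"The quick brown fox jumps over the lazy dog."
disks = stripe(original, n_disks=3, chunk=4)

# Correct order: the logical volume reads back intact.
in_order = assemble(disks, chunk=4)

# Disks 0 and 1 swapped: every byte is still present on the disks,
# but the logical volume no longer reads back as the original data.
swapped = assemble([disks[1], disks[0], disks[2]], chunk=4)
```

Nothing is lost in the swapped case as long as nothing is written. The harm comes when the array is re-initialised, rebuilt or repaired while in that state, because those operations overwrite data on the basis of the wrong layout.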