My RAID 5 with 4 drives (500 GB each) had reached an age of about 1.5 years.
One drive failed, I RMA'd it and got a new one, the RAID rebuilt and everything was fine.
Two weeks later another drive failed, I RMA'd it, rebuilt, all fine.
Around 3 weeks after that the third drive failed. I RMA'd it, got a bit suspicious about all the failures and started creating a backup.
2 hours later the complete RAID failed, because drive 4 just removed itself from the system and refused to work again.
I rebooted into the Intel Matrix ROM manager and could recover the RAID with just 3 drives. I was quite euphoric: my data was still there 🙂
Windows booted up and checked the RAID for errors; once it reached about 38% it spammed the event log with read errors. One reboot later I realized that the RAID had failed again.
Now all data was gone, but I didn't give up!
I bought an additional new drive and used Ubuntu with ddrescue to clone the failing drive to the new one.
At that point things started to get complicated. I realized that the last replacement drive (stupid refurbished drives, I hate you for that, WD!) had unrecoverable read errors. The whole drive had 11 errors in 800 KB out of 465 GB. This was the reason for the RAID failure: I could recover the array and read some of the data until I hit those unrecoverable sectors.
So that didn't look that bad for my data 🙂
However, neither dmraid nor the Intel ROM recognized the cloned drive as part of the RAID array. I dug through the Linux dmraid code to find out what made the drives special and found the answer in the RAID metadata at the end of each drive: the serial number of every RAID member is stored there. I looked for a tool that allowed manipulating the metadata, but there was none.
Additionally I found out that newer Linux kernels don't supply the module to load the array under Linux at all, so I was thrown back to Windows.
After some hours of googling I came to the conclusion that no one had ever managed to force a (cloned) drive into a failed RAID, so I looked for recovery tools instead.
To be able to use the discs with Windows recovery tools I set the Intel controller from RAID to IDE mode (AHCI SATA wouldn't boot) and fired up RStudio.
However, because the cloned disc was not part of the RAID, the set wasn't recognized as a RAID. Additionally the order of the discs was completely off (due to a lot of removing and inserting drives into the case).
To restore my RAID data I needed 4 things:
- 3 readable discs out of the original 4
- correct RAID stripe size
- correct parity alignment
- correct disc order
While I now had the discs, it was a bit harder to get the rest.
Another program comes into play: WinHex and its disc editor.
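Why 3 discs out of 4 are enough: in a RAID 5 the parity block of each stripe is the XOR of the data blocks in that stripe, so any single missing block can be recomputed from the remaining ones. A minimal sketch of that idea (the sample bytes and the function name `xor_blocks` are made up for illustration, this is not what any of the tools above run internally):

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (illustrative content only).
d0 = b"NTFS    "
d1 = b"partitio"
d2 = b"n table!"

# The controller stores the XOR of the data blocks as parity.
parity = xor_blocks([d0, d1, d2])

# Pretend the drive holding d1 died: XOR of everything else gives it back.
recovered = xor_blocks([d0, d2, parity])
print(recovered == d1)
```

This also shows why a second unreadable block in the same stripe is fatal: with two unknowns the XOR can no longer be solved.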
I opened the 3 discs and had a look at the data. If you can read stuff (the partition table and NTFS volume header contain quite a few human-readable ASCII words) you are probably not looking at a parity stripe. So I browsed a bit through the data until I noticed a significant change in the layout of the data.
Then I looked at the sector count shown below: if the layout changes at sector 127/128 and the hard disc sector size is 512 bytes (the default back then), then the stripe size is 128 × 512 bytes = 64 KB.
Now for the reordering… I browsed through all hard drives and noted where I could see a parity stripe, which is the garbled stuff.
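The arithmetic behind that, as a tiny sketch (the function name is made up; the only inputs are the sector number where the layout changes and the sector size):

```python
SECTOR_SIZE = 512  # bytes per sector, the common default for drives of that era

def stripe_size_bytes(first_sector_of_next_chunk, sector_size=SECTOR_SIZE):
    """Stripe size = number of sectors before the layout changes * sector size."""
    return first_sector_of_next_chunk * sector_size

# The layout changed between sector 127 and 128, so one chunk spans sectors 0..127.
size = stripe_size_bytes(128)
print(size, "bytes =", size // 1024, "KB")
```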
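I did this by eye in WinHex, but the same idea can be sketched in code: for one stripe, compare the chunk from each drive and pick the one with the fewest readable characters. This is only a crude heuristic I am assuming here; it works in regions that contain text (like the start of the volume), not in arbitrary binary data, and the names and sample bytes are made up.

```python
def printable_ratio(chunk):
    """Fraction of bytes in the chunk that are printable ASCII (0x20..0x7E)."""
    if not chunk:
        return 0.0
    printable = sum(1 for b in chunk if 0x20 <= b <= 0x7E)
    return printable / len(chunk)

def least_readable_drive(chunks_by_drive):
    """Return the drive whose chunk looks most garbled, i.e. the parity candidate."""
    return min(chunks_by_drive, key=lambda drive: printable_ratio(chunks_by_drive[drive]))

# Illustrative chunks of the same stripe read from three drives.
stripe = {
    "drive1": b"Invalid partition table",
    "drive2": b"Error loading operating",
    "drive3": bytes([0x0C, 0x1B, 0x9F, 0x02, 0xE1, 0x07, 0x13, 0x88]),
}
print(least_readable_drive(stripe))
```

If no drive stands out for a given stripe, that matches the case where the parity sits on the missing drive.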
I created a small sheet with the gathered data like this (P marks the drive holding the parity for that stripe):
Stripe  | 1 | 2 | 3 | 4
Drive 1 |   |   | P |
Drive 2 |   | P |   |
Drive 3 | P |   |   |
Drive 4 |   |   |   | P
Of course drive 4 no longer exists, but generally: when you see no garbled stripe on any of your remaining drives, the parity for that stripe is on the lost one.
Now I could reconstruct the correct drive order by just looking at the virtual RAID 5 builder that RStudio provides.
In my case this was 4-1-2-3. I put in the drives in that order (right click and create a missing one for drive 3) and set the parity alignment to left asymmetric.
This is the standard for Intel Matrix RAIDs (as taken from the dmraid isw.c code).
I pressed apply and there I got my data back, all accessible with RStudio.
Now what did I learn from this?
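For the curious, here is how the parity table above maps to that order. The sketch assumes the left asymmetric layout as Linux md defines it: the parity starts on the last disc for the first stripe and moves one disc to the left with each stripe, while data chunks fill the other discs from left to right. It also assumes that stripe 1 in my table is the very first stripe of the array. The function names are made up.

```python
def disk_order_from_parity(parity_drive_per_stripe):
    """Derive the disc order from the drive holding parity in stripes 1..n.

    In a left asymmetric layout, stripe s (0-based) has its parity at
    position n-1-s, so the observed parity drive belongs at that position.
    """
    n = len(parity_drive_per_stripe)
    order = [None] * n
    for stripe, drive in enumerate(parity_drive_per_stripe):
        order[n - 1 - stripe] = drive
    return order

def locate_chunk(chunk_index, n_disks):
    """Map a logical chunk number to (stripe row, data position, parity position)."""
    data_disks = n_disks - 1
    row = chunk_index // data_disks
    parity_pos = data_disks - (row % n_disks)
    data_pos = chunk_index % data_disks
    if data_pos >= parity_pos:
        data_pos += 1  # skip over the parity chunk in this row
    return row, data_pos, parity_pos

# Parity drives read off the table: stripe 1 -> drive 3, 2 -> 2, 3 -> 1, 4 -> 4.
order = disk_order_from_parity([3, 2, 1, 4])
print(order)

# Where the first few logical 64 KB chunks live (positions index into `order`).
for chunk in range(6):
    print(chunk, locate_chunk(chunk, 4))
```

The positions returned by `locate_chunk` are indices into the derived order, so position 0 is drive 4 in my case.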
- Do not trust a RAID5 for data protection
- Always have a Backup
- Do not buy all drives from the same batch