They say that you learn the most when you make mistakes and things go wrong. Well, last night I certainly must have learned a lot. What started as a simple physical re-organisation of my hardware turned into a rebuild of my production VMware ESXi server – finishing at 1am. Here’s what happened.
I started by shutting down my production ESXi server, pulling it out of its standard rack and then putting it back. On power-up, the server failed to boot, claiming the boot disk was no longer present. A quick check inside showed that the SAS connector on the boot disk had come loose, so I plugged it back in and tried again (Oh, SAS specification guys – bad design, no retainers on the plugs). Unfortunately, the boot disk had somehow become corrupted and the server still wouldn’t come up. No problem, I thought, I’ll just repair it using the installation media. This is where things started to get complicated.
My ESXi server runs off a Seagate Savvio 2.5″ 15K 73GB drive, one of four that Seagate generously loaned me last year for long-term testing. More on that another day. The server has two disks installed, one of which has VMs on it. During the repair process I wasn’t sure which disk held the O/S and which held data. ESXi didn’t help much, indicating only that both disks contained partitions with data that would be lost if I reinstalled.
Lesson 1 – Make sure you know exactly how your hardware is configured, down to the SAS ports each drive is plugged into.
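One lightweight way to act on Lesson 1 is to keep a small machine-readable record of which drive sits on which SAS port and what it holds. The snippet below is only a sketch of that idea: the `DriveSlot` type, the port numbers and the serial numbers are all made-up placeholders, not the real configuration of this server.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DriveSlot:
    sas_port: int  # physical port the drive is plugged into
    serial: str    # serial number as printed on the drive label
    role: str      # "boot" or "datastore"


# Hypothetical inventory; every value here is a placeholder.
INVENTORY: List[DriveSlot] = [
    DriveSlot(sas_port=0, serial="EXAMPLE-SN-0001", role="boot"),
    DriveSlot(sas_port=1, serial="EXAMPLE-SN-0002", role="datastore"),
]


def role_of(serial: str, inventory: List[DriveSlot] = INVENTORY) -> Optional[str]:
    """Return the recorded role of a drive, or None if it is not documented."""
    for slot in inventory:
        if slot.serial == serial:
            return slot.role
    return None


def safe_to_reinstall_on(serial: str) -> bool:
    """Only a drive documented as the boot disk is safe to overwrite."""
    return role_of(serial) == "boot"
```

With a record like this to hand, two identical-looking drives can be told apart by the serial number on the label before anything is overwritten.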
Actually, having multiple drives of the same type is a pain. So rather than risk data loss, I removed both drives and re-installed the ESXi O/S onto a third Savvio drive. All good. Next I needed to locate and import all my VMs; however, some were on the removed Savvio disks. This meant installing each disk in turn and checking the contents to determine which contained VMs and which contained the broken O/S.
Lesson 2 – Wherever possible, place your VMs on disks separate from the server itself.
Yes, I do have most of my VMs on my Iomega ix4-200d, but, rather crucially, not my Windows 2008 AD Server, which needed to be moved from internal disk to the ix4 before I continued (schoolboy error there). The AD server was rather important for accessing my, ahem, ix4, which is configured to validate logins using AD. This creates a bit of a circular reference which could have been a disaster.
Lesson 3 – Place your Windows domain controller on a physical server, or have another independent backup elsewhere.
Having a physical server just for AD control isn’t part of my total virtualisation plan, so I’m looking at whether I can host a backup controller on Amazon AWS and use a VPN to connect it securely to my private network. This way, if I ever have an issue, I can still authenticate. The issue, of course, is cost, which may make a dedicated server the cheaper option.
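The circular reference between the AD server and the ix4 is the kind of thing a simple dependency map can catch in advance. Below is a minimal sketch, assuming each service is listed with the things it needs in order to work. The service names and the `find_cycle` helper are illustrative only, and the "after" map models the physical-server option from Lesson 3 rather than anything actually deployed.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    visiting: List[str] = []  # current depth-first path
    done = set()              # nodes already proven cycle-free

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in visiting:
            # We have walked back onto our own path: that loop is the cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in depends_on.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for service in depends_on:
        cycle = visit(service)
        if cycle:
            return cycle
    return None


# Hypothetical names. Before: the AD VM is stored on the ix4,
# and the ix4 validates logins against that same AD server.
BEFORE = {
    "ad_server_vm": ["ix4_storage"],
    "ix4_storage": ["ad_server_vm"],
}

# After (Lesson 3 option): AD runs on its own physical server.
AFTER = {
    "ad_server_physical": [],
    "ix4_storage": ["ad_server_physical"],
}

print(find_cycle(BEFORE))  # ['ad_server_vm', 'ix4_storage', 'ad_server_vm']
print(find_cycle(AFTER))   # None
```

A plain graph like this cannot express "either of two domain controllers will do", so it is a rough check rather than a full model of a backup-controller setup.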
So, by 1am everything was back up and running. Did I learn anything else? Well yes…
Lesson 4 – After 22 years in IT, I should remember that adequate documentation and a DR plan are crucial. In fact, in a virtualised environment they are essential, because placing all systems on a single server concentrates the risk.
So what next for my virtual infrastructure? I have a few changes planned. I’ll create a backup ESXi server that can import and run the VMs in the event of a future server failure. I will also be investigating AWS with Windows 2008 and a VPN to create a backup domain controller, to see whether I could continue to work if both servers’ hardware failed.
That leaves one Single Point of Failure… my ix4-200d. Anyone want to donate me a spare one?