Active Directory Log Disk Lost

miqrogroove
2010-08-24T00:53:23+00:00
Event Viewer excerpt showing atapi and disk errors.

A Disk Kissing Itself Goodbye

At 2:45 this morning, my home office / techie practice server suffered a catastrophic failure of its primary slave disk.  Among other things, that disk was responsible for storing the Active Directory log file for the server’s Windows 2003 domain controller.  The device itself was a Maxtor 20 gig model going on 12 years of age.  It was still in service after the server’s motherboard overhaul because of the Windows 2000 Active Directory Services recommendation: “For best performance, place the database and the log file on separate hard disks.”

Today I learned that the loss of the log crashes the Active Directory service, crashes all the services that depend on it, and prevents the server from booting up until the problem is resolved.  That’s a very serious predicament caused by a file whose only intended purpose is consistency during crash recovery.  I still had a perfectly good Active Directory database file on the primary master disk, but these circumstances prevented it from working.

I concluded that the only viable option was to abandon the current database and perform a full System State restoration.  Here is my explanation of how it worked, how to do it, and several pitfalls that can be avoided.

Things you will need to find before starting:

  • System State Backup
  • Phillips-head Screwdriver
  • Flat-head Screwdriver
  • Flashlight or Lamp
  • Computer Blower or Canned Air
  • Replacement Disk or 3rd Party Partitioning Tools
  • 8 Hours of Spare Time

Also take note at this point that the instructions will not be aimed at the casual reader.  I assume if you’re still reading this that you already know what I’m talking about.

Here is a quick list of things you can avoid until you get desperate to fix the problem.  These are the strategies I brainstormed, and pieces of advice from other people, that didn’t help me in the slightest:

  • Copying, moving, or deleting database files and log files.
  • Restoring the system state to an alternate location.
  • Running the Active Directory database “repair” command.
  • Using a backup that hasn’t reached the “tombstone” age.
  • Trying to restore files to mounted volumes, or to drive letters created by the “subst” command.
  • Editing directory values using ldp.exe.

Diagnostics

Step 1 – When I went to use my workstation this morning, it was obvious that the server had crashed.  The server console was still responsive, so I popped open the Event Viewer.  The bad news looked like this:

Event Type:	Error
Event Source:	atapi
Event ID:	9
Time:		2:45:28 AM
Description:
The device, \Device\Ide\IdePort0, did not respond within the timeout period.

In other words, the primary IDE channel is in an error state, which is bad.  The next error helps pinpoint the problem:

Event Type:	Error
Event Source:	Disk
Event ID:	11
Time:		2:46:50 AM
Description:
The driver detected a controller error on \Device\Harddisk1.

In other words, the primary slave device is brain dead.  🙁
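When the log is long, the same triage can be scripted. The following is a minimal Python sketch that was not part of the original repair. It assumes the events have been pasted into a text block in the layout shown above, and the names `parse_events` and `hardware_errors` are made up for illustration.

```python
import re

# Two events in the text layout shown above (whitespace simplified).
SAMPLE = r"""Event Type: Error
Event Source: atapi
Event ID: 9
Time: 2:45:28 AM
Description:
The device, \Device\Ide\IdePort0, did not respond within the timeout period.

Event Type: Error
Event Source: Disk
Event ID: 11
Time: 2:46:50 AM
Description:
The driver detected a controller error on \Device\Harddisk1.
"""

# Matches NT device paths such as \Device\Ide\IdePort0 or \Device\Harddisk1.
DEVICE_RE = re.compile(r"\\Device\\[A-Za-z0-9\\]+")


def parse_events(text):
    """Split pasted event text into one dict per event."""
    events = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        event = {}
        for index, line in enumerate(lines):
            key, sep, value = line.partition(":")
            if not sep:
                continue
            key = key.strip()
            if key == "Description":
                # Everything after the "Description:" line is the message body.
                event[key] = " ".join(l.strip() for l in lines[index + 1:])
                break
            event[key] = value.strip()
        events.append(event)
    return events


def hardware_errors(events, sources=("atapi", "Disk")):
    """Return (source, event id, device path) for storage-related errors."""
    found = []
    for event in events:
        if event.get("Event Type") != "Error":
            continue
        if event.get("Event Source") not in sources:
            continue
        match = DEVICE_RE.search(event.get("Description", ""))
        device = match.group(0) if match else None
        found.append((event["Event Source"], event.get("Event ID"), device))
    return found


if __name__ == "__main__":
    for source, event_id, device in hardware_errors(parse_events(SAMPLE)):
        print(source, event_id, device)
```

Seeing an IdePort error and a Harddisk error within a couple of minutes of each other is what points at one physical device rather than a software fault.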

Step 2 – Save what you can.  At this point I began a frantic hunt for existing backups.  I had multiple backups of the most critical user data, which was still intact on the other disk in any case.  The system state backup had me worried for a while because it wasn’t where it was supposed to be.  It turned out that particular backup had been left on my desk over a year earlier and then thoroughly buried under a small mountain of paperwork.

Step 3 – Shut down the server.

Step 4 – Grab your screwdriver and remove all of the server’s case screws.  Years of experience have taught me that I can save a lot of time during this type of repair by opening both sides of the computer case early, and then leaving it open until after the problem is completely fixed.

Step 5 – Grab the rest of your tools and make some room in the computer case.  Blow out the dust, remove the heat sinks, pull out the cards, unplug the drives, pop open the bay covers, do whatever it takes to get in there.  In this particular server, I couldn’t get to the hard drives until I removed the CPU fan assembly.

Step 6 – Remove the bad disk and run it through some physical diagnostics.  I tried switching the bad disk to secondary master, messing with the BIOS settings, and inspecting the device for obvious thermal damage.  After that I declared it dead and moved on to more important matters.

Repairs

Before you can effectively restore anything, there’s a big problem to solve.  Windows is expecting the log file to exist at a specific drive letter and path.  If the active partition is the only one remaining in the server, then there’s no easy way to make that happen.

Step 7 – Put a new disk in the server.

Or

Step 7 – Run a disk partitioning program that can shrink the active partition and create one or more extra logical drives.

Step 8 – Start the server in Directory Services Restore Mode.  Hit F8 repeatedly when the server turns on, then choose Directory Services Restore Mode from the boot menu.

Step 9 – In Disk Management, assign the new partition the same drive letter that the dead drive had.  This step was a bit more complicated for me because I had installed the Active Directory log file at G:\WINDOWS\NTDS\, and the shared system volume at F:\WINDOWS\SYSVOL\.
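Before starting the restore, it can be worth confirming that every drive letter the old installation used is actually present. This is a minimal Python sketch under the assumption that Python is available on the server; the paths are the two mentioned in this step, and the function names are hypothetical.

```python
import ntpath
import os

# Locations from this step; replace them with the ones your own server used.
EXPECTED_PATHS = [
    r"G:\WINDOWS\NTDS",    # Active Directory log files
    r"F:\WINDOWS\SYSVOL",  # shared system volume
]


def required_drives(paths):
    # Collect the distinct drive roots used by the given Windows paths.
    drives = set()
    for path in paths:
        letter = ntpath.splitdrive(path)[0].upper()
        drives.add(letter + "\\")
    return sorted(drives)


def missing_drives(paths, exists=os.path.exists):
    # 'exists' is injectable so the check can be exercised off the server.
    return [drive for drive in required_drives(paths) if not exists(drive)]


if __name__ == "__main__":
    for drive in missing_drives(EXPECTED_PATHS):
        print("No volume is mounted at " + drive)
```

An empty result only says the drive letters exist. It says nothing about whether the restored files will be valid.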

Step 10 – Stage your system state file.  I copied mine from a backup CD to C:\Temp\.  Don’t be afraid to take some extra time with this step.  After several failed attempts to restore and restart the system, I began to suspect that the resulting Active Directory errors were being caused by a corrupt system state file.  The CD drive in this server looked like it might have been as old as the hard drive, and I noticed it was making some odd noises during the file copy.  I managed to get the server back on the network despite many errors, and copied the system state file from a reliable CD drive over the network.  After I did that, the Active Directory database errors disappeared on the next attempt.  I suspect the CD drive was the biggest change I made at that point.  Who knows if the NT Backup utility even checks the file for corruption before restoring it?
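One way to take the guesswork out of a suspect copy is to hash the staged file and compare it with a copy read on a different drive. The sketch below is a hedged Python illustration, not part of the original repair; it assumes Python is available on some machine that can see both copies, and the file names in the comments are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so a large backup file is never loaded whole.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(first, second):
    # A size mismatch is a cheap early sign of a truncated copy.
    if os.path.getsize(first) != os.path.getsize(second):
        return False
    return sha256_of(first) == sha256_of(second)


# Hypothetical usage: compare the copy staged from the old CD drive with
# one read on a known-good drive and shared over the network.
# copies_match(r"C:\Temp\SystemState.bkf", r"\\workstation\share\SystemState.bkf")
```

A mismatch means at least one of the two reads went wrong, which is a good reason to stage the file again from the more trustworthy drive before another restore attempt.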

Restore

Step 11 – Open the NT Backup program and run the restore wizard.

Step 12 – During the restore dialogs, always click the Advanced button and triple check the junction points option.  The University of Waterloo has an explanation of why and how you have to do this during a system state restore.  Be sure to restore the system state to its original location, overwriting all of the existing files.

Step 13 – Make sure the SYSVOL share isn’t missing.  There are some great tips about restoring the SYSVOL shared folder at the AionSolution Blog.

Step 14 – Check the integrity of the Active Directory database file.  This can be done on the command line with “ntdsutil files integrity”.  If you see any errors at this point, your restore was not successful.  Proceed with extreme caution.  You will need to go back to Step 10 and persist with this process until you get it straightened out.  There is a database repair utility that might help a little, but eventually I was able to restore the system state without database errors.

Step 15 – Authoritative Restore.  If you have only one domain controller on your network, then it is probably a good idea to do this before attempting a reboot.  Go back to the ntdsutil prompt and type “authoritative restore”, then “restore database”.  See the link in Step 12 for more details about that.  If the authoritative restore completes with no errors, it is a very good sign that things are working.
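After a dozen attempts, retyping the ntdsutil menus gets tedious. ntdsutil also accepts its menu commands as command-line arguments, so Steps 14 and 15 can be written as one-liners. The Python wrapper below is only a sketch and was not part of the original repair: it assumes Python is installed on the server, it defaults to printing the command line rather than running it, and the helper names are hypothetical. ntdsutil may still stop to ask for confirmation, so keep an eye on the console.

```python
import subprocess

# Menu paths taken from Steps 14 and 15.
INTEGRITY_CHECK = ["files", "integrity"]
AUTHORITATIVE_RESTORE = ["authoritative restore", "restore database"]


def ntdsutil_command(menu_steps):
    # Each menu is left with "quit"; two levels are entered in both cases.
    return ["ntdsutil"] + list(menu_steps) + ["quit", "quit"]


def run_ntdsutil(menu_steps, dry_run=True):
    command = ntdsutil_command(menu_steps)
    if dry_run:
        # Show the equivalent line to type at a command prompt.
        print(subprocess.list2cmdline(command))
        return command
    # Output goes straight to the console; read it before moving on.
    subprocess.call(command)
    return command


if __name__ == "__main__":
    run_ntdsutil(INTEGRITY_CHECK)
    run_ntdsutil(AUTHORITATIVE_RESTORE)
```

Run the integrity check first and read its output yourself. The sketch does not try to interpret the results, and the authoritative restore should only follow a clean check.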

Step 16 – Reboot.  Do expect a lot of problems on the first reboot.  You shouldn’t see any error messages, but you will see a much longer than normal Active Directory load time.  Symptoms may also include DNS and GPO failures.

Step 17 – Reboot again.  One of two things is going to happen.  If everything is going your way, Active Directory will load right up and then the server will start processing GPOs.  If the server is still bogged down and the DNS zones are all missing then you need to go back to the drawing board.

Step 18 – ?????.  Don’t worry.  It took me about a dozen tries and over 8 hours to get this far.

Step 19 – Profit!!!  Despite my system state being over a year old, there have been no major changes to the server or the domain since then.  Everything is working as good as new!

Please let me know if you find this article helpful, or if you have any variations on this scenario that might be relevant to other readers.  🙂

Category:
Systems Engineering
