Windows Server 2008 Cluster: Troubleshooting or shooting yourself?
On a very quiet afternoon, I got a case titled “Problem with Cluster staying offline!”. I’m usually optimistic when I receive a ticket, but when it comes to Windows Cluster services I get a bit agitated, yet still excited to delve into it and resolve the issue. Thankfully, Windows Server 2008 clustering is more stable than in the old days, when we had to keep a keen eye on a cluster for weeks or months before even thinking of it as fit for production.
Anyway, the obvious first step was to reproduce the incident, so I went to the Failover Cluster Management snap-in and tried to bring the Cluster Name online. I got the error “An error occurred while attempting to bring the resource cluster name online”, with the detail “The resource failed to come online due to the failure of one or more provider resources.”
At first I suspected that the CNO (Cluster Name Object) had been deleted, but sadly it was still there. I tried changing the IP address bound to the cluster and still got the same error. Next I checked the NTFS security groups: it took a couple of minutes just to show the Security tab, and when I tried to add or remove an account I got this error: “The program cannot open the required dialog box because it cannot determine whether the computer named cluster is joined to a domain”.
That error made me feel that the cluster service could neither access its CNO nor control it. I did some quick research on how to increase the logging level of my cluster and found a great article over at MSDN Blogs about the Cluster Log (http://blogs.msdn.com/b/clustering/archive/2008/09/24/8962934.aspx).
Once I had increased the logging level and generated the logs, the only error I could find was “ERR IP Address <Print Mgmt – IP Address>: Unable to open node parameters key, status 2”. Sadly I couldn’t find much information about this error, but it ultimately made me switch my attention to the CNO in Active Directory.
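If you end up with a large generated log, a small filter can save some scrolling. The following is only a minimal sketch in Python, not part of the original troubleshooting: it assumes the severity appears as a standalone ERR token on the line, as in the error quoted above, and the function name error_lines is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Assumption: the severity shows up as a standalone "ERR" token on the line,
# as in the error quoted in the post. Words that merely contain those
# letters (for example ERROR_SUCCESS) should not count as a match.
ERR_TOKEN = re.compile(r"(?<![A-Za-z])ERR(?![A-Za-z])")


def error_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that carries a standalone ERR token."""
    for line in lines:
        if ERR_TOKEN.search(line):
            # Drop only the trailing newline so the rest of the line is intact.
            yield line.rstrip("\n")
```

To try it, open the generated log file and loop over error_lines(handle), printing each hit. Where the log lives and how it is encoded depend on your setup, so adjust the open() call accordingly.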
I found out that the CNO had been moved to another OU, and this OU had a GPO of its own which caused our CNO to lose all of its security attributes, including ownership. Here is a checklist of what I did:
1- I moved the CNO to an OU with no GPOs linked to it.
2- I changed the ownership of the CNO to the CNO itself (with inheritance).
3- I gave all my cluster nodes full control over the CNO object (full control might not be needed, but it won’t harm your configuration either).
4- I did a ‘klist purge’ on all of my cluster nodes.
5- I waited about 45 minutes for everything to replicate across my Active Directory forest, and thankfully I was then able to bring my Cluster Name online. I also tested it both ways (online to offline and offline to online).
6- Finally, you can run ‘cluster res’ from cmd and you’ll see that all the resources are online.
Also, here are a couple of good resources that helped me better understand my issue and what was going on:
http://technet.microsoft.com/en-us/magazine/hh289314.aspx
The Resource Hosting Subsystem (Rhs.exe) process stops unexpectedly when you start a cluster resource in Windows Server 2008 R2