VMware vCenter 6.5 and SSL Renewal (Secure Entrapment)

Hello! About two weeks ago I noticed that the SSL certificate on my management cluster's vCenter Server (Windows edition) was about to expire. Rather than renewing it right away, I decided to let it expire and see what would happen.

The good news is that everything kept working after the expiry. You just get a warning that the SSL certificate is no longer trusted, and I could still access the Web Client without any issues.

So I decided to renew the SSL certificate. Because my original Active Directory Certificate Services setup was based on SHA1, I thought it was a good time to get rid of it and move to SHA512. I made the change on my CA server and then turned to our trusty Certificate Manager utility, fully confident that this would not be much trouble and would only need a keen eye to get right the first time.

I generated the CSR, got the key, submitted the CSR, got the SSL certificate, got the root certificate, and then it was time to complete the process by importing the certificate with the Certificate Manager utility.

And here is where the joy ride hits a big sign reading [WELCOME TO THE MAZE]. When the process starts applying the certificates, it fails. It kept failing even after I cleaned up following every failed attempt, and the Certificate Manager log (C:\programdata\VMware\vCenterServer\logs\vmca\certificate-manager.log) showed nothing usable.

At this point everything seemed to be broken and the vSphere Web Client was no longer initializing properly, which led me to use the “Reset all Certificates” task. Luckily, all the services came up again and all components were using the self-signed certificate generated by the vCenter Server.

So now things should be fine if I apply my certificates, right? I launched the Certificate Manager utility again and tried to import the certificate. The process started, the services were stopped, and the certificates were bound to the services. Then, just as the services were being started, I got a meaningless error saying that the process had failed and that everything would be rolled back.

I struggled with this for two days, trying different CA servers and different certificate formats (I am now an expert at using OpenSSL for certificate conversion and extraction), and all my attempts ended with the same dismal result. Now, one would like to have faith in the logs! If the logs cannot be trusted then it's an error-eat-admin world %). So I started looking into the certificate-manager.log file, hoping to find something I could go and fix. Sadly, I couldn't find anything.

2017-03-20T22:14:45.933Z INFO certificate-manager MACHINE_SSL_CERT certificate replaced successfully. SerialNumber and Thumbprint changed.
2017-03-20T22:15:57.221Z INFO certificate-manager Running command :- "C:\Program Files\VMware\vCenter Server\bin\service-control.bat" --stop --ignore --all
2017-03-20T22:15:57.221Z INFO certificate-manager please see service-control.log for service status
2017-03-20T22:18:30.624Z INFO certificate-manager Command executed successfully
2017-03-20T22:18:30.624Z INFO certificate-manager all services stopped successfully.
2017-03-20T22:18:30.624Z INFO certificate-manager None
2017-03-20T22:18:40.624Z INFO certificate-manager Running command :- "C:\Program Files\VMware\vCenter Server\bin\service-control.bat" --start --all
2017-03-20T22:18:40.624Z INFO certificate-manager please see service-control.log for service status
2017-03-20T22:19:36.836Z ERROR certificate-manager None
2017-03-20T22:19:36.838Z ERROR certificate-manager Error while starting services, please see log for more details
2017-03-20T22:19:36.838Z ERROR certificate-manager Error while replacing Machine SSL Cert, please see C:\ProgramData\VMware\vCenterServer\logs\vmca\certificate-manager.log for more information.
2017-03-20T22:19:36.838Z ERROR certificate-manager {
    "resolution": null,
    "detail": [
        {
            "args": [
                "None"
            ],
            "id": "install.ciscommon.command.errinvoke",
            "localized": "An error occurred while invoking external command : 'None'",
            "translatable": "An error occurred while invoking external command : '%(0)s'"
        },
        "Error while starting services, please see log for more details"
    ],
    "componentKey": null,
    "problemId": null
}
2017-03-20T22:19:36.970Z INFO certificate-manager Performing rollback of Machine SSL Cert…
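
Since the log itself says so little, one thing that can help is pulling out just the timestamps and levels to see how long the utility waited between issuing the start command and declaring failure. The following is a minimal sketch, not part of any VMware tooling: the line pattern is an assumption inferred only from the excerpt above, and the function names (parse_log, seconds_until_first_error) are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
#   <ISO timestamp>Z <LEVEL> certificate-manager <message>
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z (INFO|ERROR) certificate-manager (.*)$"
)

def parse_log(text):
    """Return (timestamp, level, message) tuples.

    Lines that do not match the pattern (for example the body of the
    JSON error dump) are skipped.
    """
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            entries.append((ts, match.group(2), match.group(3)))
    return entries

def seconds_until_first_error(entries):
    """Seconds between the most recent 'Running command ... start' entry
    and the first ERROR entry that follows it, or None if there is none."""
    started = None
    for ts, level, message in entries:
        if level == "INFO" and message.startswith("Running command") and "start" in message:
            started = ts
        elif level == "ERROR" and started is not None:
            return (ts - started).total_seconds()
    return None

# Four lines copied from the excerpt above (quotes and dashes normalized).
SAMPLE = r"""2017-03-20T22:18:30.624Z INFO certificate-manager all services stopped successfully.
2017-03-20T22:18:40.624Z INFO certificate-manager Running command :- "C:\Program Files\VMware\vCenter Server\bin\service-control.bat" --start --all
2017-03-20T22:19:36.836Z ERROR certificate-manager None
2017-03-20T22:19:36.838Z ERROR certificate-manager Error while starting services, please see log for more details
"""

if __name__ == "__main__":
    entries = parse_log(SAMPLE)
    for ts, level, message in entries:
        if level == "ERROR":
            print(ts.time(), message)
    print("seconds from start command to first error:", seconds_until_first_error(entries))
```

On the lines above, this reports roughly 56 seconds between the start command and the first error. That number alone does not explain the failure, but it is the kind of detail worth comparing against service-control.log, which the utility keeps pointing to.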

Reading the log above carefully, you will notice that the SSL certificate is replaced successfully but the attempt to start the services fails (according to the log). Here is what I noticed:

  1. When the error about starting the services is thrown, the services actually do start. (I know this because I keep monitoring the system’s performance, and the CPU usually peaks at 100% until the vSphere Web Client initializes successfully.)
  2. While the services are starting, if I access the vSphere Web Client I can see that my signed SSL certificate is being used.
  3. Just after the services are up, the utility stops them again so that it can roll back to the old SSL certificate.
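
The second observation can also be checked without a browser by fetching whatever certificate the server presents and comparing its thumbprint with that of the signed certificate file. This is only a rough sketch using the Python standard library; the host name and file name in the usage note are placeholders, and the function names are made up for illustration.

```python
import hashlib
import ssl

def thumbprint_from_pem(pem_text, algorithm="sha1"):
    """Return the colon-separated hex digest of a PEM certificate's DER bytes."""
    der = ssl.PEM_cert_to_DER_cert(pem_text)
    digest = hashlib.new(algorithm, der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def served_thumbprint(host, port=443, algorithm="sha1"):
    """Thumbprint of the certificate currently presented on host:port.

    ssl.get_server_certificate() does not validate the chain when no CA
    file is given, so this also works while the server is still
    presenting a self-signed or expired certificate.
    """
    pem = ssl.get_server_certificate((host, port))
    return thumbprint_from_pem(pem, algorithm)
```

A possible use, with a placeholder host name and file name: read the signed certificate with open("machine_ssl.cer").read(), pass it to thumbprint_from_pem(), and compare the result with served_thumbprint("vcenter.example.local") while the services are coming up and again after the rollback.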

Again I spent hours trying to trace this and couldn’t find anything logical, so it was time to do something illogical!!! I ran the certificate replacement again, waited for the script to reach the phase where it stops the services, and then I killed the script. Yes, I confirm I killed the utility! Then, after validating that all the services were stopped, I rebooted the vCenter Server.

I was almost sure that this was going to break the vCenter Server, or at least that it would somehow force the certificate rollback. Once the server was back online, I waited for all the services to start successfully, then tried to access the vSphere Web Client, and you know what? It actually worked %) I restarted the vCenter Server a couple of times just to make sure that everything was stable, and every time I was able to access the vSphere Web Client and my signed SSL certificate was being used without any issues.

Again, I have absolutely no idea why it behaves this way. Before writing this post I waited two days to see if something bad would happen, and so far everything is okay. So if you’ve had a similar experience and managed to get to the root cause, I would appreciate it if you shared your thoughts. Otherwise, I hope this post helps people pull less hair out of their heads ^_^.

(Abdullah)^2



16 Responses

  1. E2 says:

    Came across this exact issue. Did as you suggested and it worked perfectly. Stopped the script and rebooted. Looks like maybe an oversight on the script end. Not sure why it would roll back the entire operation because a service failed to start. However, all the services start fine after reboot. It seems like something happened to my Update Manager when I did this process. A re-install of Update Manager fixed that issue.

    Thanks again :-)

    -E2

  2. daniel says:

    This was the last test I was performing in my lab before upgrading the dev/prod environments.
    I was about to reinstall the whole test environment before coming across this. It worked to the word.
    Thanks for the post.
    Now I will open a case with VMware so I don’t have to perform this unconventional fix in the prod env!!!

    Thanks again

    daniel

  3. moethelawn says:

    Another potential cause for this is the subject name being the same on all certificates between MACHINE_SSL_CERT, machine, vsphere-webclient, etc. If you make them different for each cert, the services startup at the end should complete successfully. I ran into this issue and that is what ultimately got me through besides closing out the script at the end.

    • Abdullah says:

      Hello Moe,

      Thank you for sharing this information. My understanding is that when everything is embedded you can use the same SSL certificate, but in a distributed setup you need an SSL certificate with a SAN for each server/appliance, and a wildcard will not be of use.

      Regards,
      (Abdullah)^2

  4. Getting frustrated by exactly this same issue. Thanks for the advice!

  5. Craig. M says:

    You saved my sanity! I have spent all day looking at this issue, screaming at the monitor! Arrgh! Thanks so much for posting this.

  6. tom miller says:

    Same issue. My external PSC had no issues, and I figured I would have it easy once I had worked out how to perform the process on the PSC. Surprised when I saw the rollback. Strange fix, I tried it, and it worked!

  7. John says:

    SO GLAD I just found this…..I have a wildcard and have been using a wildcard and our SSL just expired. Went flawless last time, now with whatever past update that requires the SAN and same dns name etc, basically it wants the same cert as prior (except a wildcard). Anyway, this helped. I also commented out these lines (a little different from this write up but) https://communities.vmware.com/thread/559089
    The lines for me on v 6.5.0.20100 were:
    # if var.strip() in ['1']:
    # iscomparerequired = compare_certificate_san_to_pnid(cert_file)
