SDDC Manager 3.x [3.11.x] Issues & Solutions

Hello All,

Recently I came across a few issues while preparing for an VCF Upgrade from 3.11.x version to 4.x in our environment.

Below are few of the issues and how to resolve them using commands on the SDDC Manager

Issue: SDDC Manager UI shows the message “Password Manager option failed in pre-validation stage” or when you go to the security tab for password management, it shows that one of the password tasks have failed.

Issue: Deployment locked by password manager when you try to rotate the passwords of PSC, VCENTER from the security -> password management tab

Solution/s:

Find the Deployment Lock ID in the sddc manager by logging into sddc manager using SSH, login as VCF and then root user and use the following command

psql --host=localhost -U postgres -d platform -c "select * from lock"

This will display the ID’s of the locked tasks in the SDDC Manager

Delete the locked task by using the following command

psql --host=localhost -U postgres -d platform -c "delete from lock where id=<ID displayed above>"

Once the task is deleted, the lock is released. You can now refresh the SDDC Manager UI and then continue with the password update or rotate options

Issue: Remove Failed Tasks in SDDC Manager

Solution:

Reference https://www.martingustafsson.com/removing-failed-tasks-in-sddc-manager/

Multiple Useful Commands for SDDC 3.x

The Unofficial VCF Troubleshooting Guide v2 – https://www.lab2prod.com.au/2021/03/the-unofficial-vcf-troubleshooting-guide.html

Reference

MANAGING VMWARE CLOUD FOUNDATION – FIRST LOOK

Advertisement

Password Operation Failed to Change the SSO password on an external PSC in VCF 3.11

Recently I came across an issue trying to change the SSO account (administrator@vsphere.local) password from the SDDC Manager using the Rotate password option under Security in VCF 3.11

I tried to Rotate the SSO password using the SDDC Manager, and got the following error:

However, Interesting thing is the sddc manager did change the SSO password in the backend

However, to check on this error, I dug a little deeper and saw the following error in the password rotate task:

I used the following command to check the operationsmanager.log to check the log in SDDC Manager

less /var/log/vmware/vcf/operationsmanager/operationsmanager.log

The log also shows that the sddc manager is trying to change the sso credential (administrator@vsphere.local) on VRA endpoints

I had to open a VMware Support ticket and here is the answer I received:

“As per the Engineering team this issue is due to a misconfiguration of vRA endpoints. SDDC Manager is trying to change the administrator@vsphere.local on the VRA endpoints but VRA endpoints are configured with a different user (vcf-secured-user@vsphere.local).  This issue is addressed in VCF 4.x”

What the VMware Engineering team is saying is that in VCF 3.10.x, 3.11 there is an issue with VRA as it is typically configured using a different tenant admin instead of using administrator@vsphere.local user to configure the endpoints in it. However, the SDDC manager is trying to change the administrator@vsphere.local credential on VRA endpoints. Hence this issue. Looks like this issue has been fixed in VCF 4.x

This resolves the issue at this time as we will be working to upgrade our VCF to 4.x soon.

Install & Configure Skyline Collector 3.1 in VCF 3.11

Here are steps to Install and Configure Skyline Collector 3.1 in VCF 3.11 Environment

NOTE: Some of the Content Might be Pixelated or the Data is Changed to Protect Existing Environments at my Organization

First you will have to download the skyline ova from VMware portal, the current filename as of this post is Skyline-Appliance-3.1.0.0-19303936_OVF10.ova

Deploy the OVA file in the vcenter server

This concludes the deployment of the skyline appliance

Next we go to the skyline appliance Web UI at https://skyline_hostname_fqdn

Login using the default user admin and you will need to configure the skyline collector as below:

You can check the option “Hostname Verification”to make sure that https connection is enabled to the collector

Make sure you provide the correct Collector Registration Token available at VMware Cloud Console

Provide a Friendly Name for the Collector, This will be displayed in the VMware Cloud Console

In this window you can opt in so that the appliance auto-upgrades itself at a set day or time.

Next, you can configure the vcenter, nsx-v/nsx-t, horizon view, vrops, vcf (sddc manager) and vrslcm components with their hostname and credentials.

This is the final step after configuring the required components and finally we click finish.

with this we complete the deployment and configuration of VMware Skyline Collector version 3.1

At the time of this writing the Skyline Collector Version is 3.2

Check for Passwords in SDDC Manager in VCF 3.x

Recently I had to check the existing passwords in sddc manager in our VCF 3.11 environment and found out there is a simple way. Here it is.

SSH into your SDDC Manager using vcf user and go to the root prompt using su command and use the below command:

root@sddcmgr01 [home/vcf]# lookup_passwords

Screenshot:

This will bring up all the products which sddc manager keeps track of

Select any product and then you will have to provide the sddc secured user credentials which you provided at the time of deploying SDDC manager in the VCF environment. This credential is also used for the backup of SDDC Manager and NSX Components.

in this case ESXI was selected to display the esxi hosts credentials

This way, you can get all the passwords for all the components controlled by SDDC Manager in VCF 3.x

NOTE/Disclaimer: I had to Blur/Pixelate certain components in my screenshots as they are in a live environment.

VRA Agent Status Down in VRA 7.6, LDAPS Certificate Issue

Recently came across an issue in our Production environment that VRA Agent status was showing as Down in one of our Sites.

The screenshot is shown as below:

This screenshot has 2 clusters

On investigating, we checked the vSphereAgent.log file which is present on the server where this VRA agent was installed and configured. (In our case it was one one of the IWS (IAAS Web Server) Node)

The location of this log file is at C:\Program Files (x86)\VMware\vCAC\Agents\<VRA_Agent_Name>\logs\

In this log, you can find multiple lines with an error:

This exception was caught:
System.Web.Services.Protocols.SoapException: vCenter Error: Cannot complete login due to an incorrect user name or password.

if this is the case, check the LDAPS Certificate to your Domain Controllers of the domain you have added on the vCenter server Web UI.

Even though it doesn’t show you the certificate expiry in this UI, you can check the certificate status by logging into vcenter SSH and executing the following command:

openssl s_client -connect adds01.corp.test.local:636 -showcerts

Replace the Domain Controller hostname with your domain controller hostname after the -connect in the above command to get the valid cert from the domain controller.

In our case, we found that the cert on the domain controller has been recently renewed and we had to input the new cert to the Identity Source in the vcenter web UI.

Once the new cert is installed, you can login into your VRA Default Tenant (VRA 7.6), go to Infrastructure -> Endpoints -> Endpoints and go to your vcenter and click on edit and then re-validate the service account password (Test the connection) and once it is successful, the VRA Agent will come back UP.

Testing the connection to the vcenter using the service account which is already added and the test is successful.

Hope this article helps you if you see your VRA agents as down and can’t find anything else missing or even restarting the vra agent service doesn’t change the status.

Great VCF Troubleshooting Guide by my Fellow vExpert

I wanted to ping back one of the great article by one of my fellow vExpert Shank Mohan on his website about an unofficial VCF Troubleshooting guide. I have learned from this article and would like to remember this article and hence posting it back on my blog.

Great VCF Troubleshooting guide by Shank Mohan

LCM Directory Permission Error When pre-checking for SDDC Manager Upgrade with VCF 3.11 Patch

I was getting ready to patch our environment from VCF 3.10.2.2 to VCF 3.11 as VMware has officially released a complete Patch for VCF 3.10.x this month, when I was performing the VCF Upgrade Pre-Check for the Management Domain, I came across this issue

The LCM Pre-Check Failed due to a directory permission issue for one of the lcm directory

Issue is that the pre-check says that the directory “/var/log/vmare/vcf/lcm/upgrades/<long code directory>/lcmAbout” owner is root but the owner needs to be user vcf_lcm

This is how I resolved the issue:

Login into SDDC Manager as user vcf, do su and provide the root password

then go to the following directory “/var/log/vmware/vcf/lcm/upgrades/<long code directory as displayed in the lcm error on sddc manager>

chown vcf_lcm lcmAbout
chmod 750 lcmAbout

The above two commands will change the owner from root to vcf_lcm and also provide the required permissions to the folder so the pre-check can complete.

The full screenshot of what I performed is below:

Commands to change owner to vcf_lcm and to provide the required permissions for the folder lcmAbout

Once you perform the commands above, you can run the pre-check and this time it will proceed successfully as shown below

Hope this article helps if you come across this issue with sddc manager upgrade from VCF 3.10.2.2 to 3.11

VCF 3.x patch 3.11 for Log4J Vulnerability and Other Security Patches included

VMware has finally realeased an patch version for VCF 3.x and the version is 3.11. You can only download this as a patch form from the SDDC Manager. You can Upgrade to version 3.11 from 30.10.2.2 or VCF 3.5 or later.

VMSA-2021-0028.13 (vmware.com)

This Release VCF 3.11 includes the following:

  • Security fixes for Apache Log4j Remote Code Execution Vulnerability: This release fixes CVE-2021-44228 and CVE-2021-45046. See VMSA-2021-0028.
  • Security fixes for Apache HTTP Server: This release fixes CVE-2021-40438. See CVE-2021-40438.
  • Improvements to upgrade prechecks: Upgrade prechecks have been expanded to verify filesystem capacity, file permissions, and passwords. These improved prechecks help identify issues that you need to resolve to ensure a smooth upgrade.
  • This also resolves the following Security Advisory VMSA-2022-0004 which deals with several vulnerabilities in esxi 6.7 hosts
  • This also resolves the vulnerability in VCF SDDC Manager 3.x according to the security advisory VMSA-2022-0003
  • This version also addresses the heap-overflow vulnerability in esxi hosts according to the security advisory VMSA-2022-0001.2

The Updated product versions according to the BOM for VCF 3.11 are

Hope this post helps for the teams who have VCF 3.10.x and waiting for the long awaited log4j patch instead of an workaround.

VRA Proxy Agent Down and Inventory Data Collection stuck ‘in progress’ – VRA 7.6

Recently we had an issue where in one of our Sites (We have multiple sites in VCF), the VRA Proxy Agent was showing as Down and restarting the services (VRA Agent) on the ims (Infrastructure Manager Service) did not bring the agent up.

Here is the process to check if the ims load balancer address is entered in the VRMAgent.exe.config file on the ims server.

Issue: In our case, the VRM Agent was installed on the active Infrastructure Manager Service server (ims01a), However, the vrm agent config only had this entry instead of the load balancer entry (imslb) in its configuration. So, when the ims01a became passive and node ims01b became active node, this broke the VRM agent and the agent status became down.

Solution: Edit the vrmagent.config file and update the lines 83 and 104 pointing this file to the ims load balancer hostname so that when the ims servers change active-passive state, the VRM agent will not go down.

Before we continue, stop the service “VMware vCloud Automation Center Agent – agent_name” (Here in my example the agent name is dc2)

Pictures of the issue are below:

VRM Agent status showing as Down
Data Collection status showing as in progress but not changing state to successful

Solution Screenshots are as below:

Location of the VRMAgent.exe.config file on the iws (Infrastructure Web Server) node
Line 83 where you will need to change the hostname to the ims loadbalancer. (In this screenshot, the load balancer hostname is https://dc1vraimslb.domain.local)
line 104 where you need to edit the endpoint address to be the load balancer hostname

Once these modifications are done in this config file, you save it and then start the service “VMware vCloud Automation Center Agent – dc2 (where dc2 is the agent name configured when the agent was installed on this server)

Disclaimer: As this Environment is Property of my Company, The Original names have either been modified or pixelated for Privacy.

Once the agent service is started, you can go back to VRA and check the Agent Status and it will be up and the in progress data collection will actually complete in few minutes (For my environment it took atleast 15-20 minutes for the inventory to complete).

Hope this article helps if you face the same issue in VRA 7.6!