VRA Agent Status Down in VRA 7.6, LDAPS Certificate Issue

Recently came across an issue in our Production environment that VRA Agent status was showing as Down in one of our Sites.

The screenshot is shown as below:

This screenshot has 2 clusters

On investigating, we checked the vSphereAgent.log file which is present on the server where this VRA agent was installed and configured. (In our case it was one one of the IWS (IAAS Web Server) Node)

The location of this log file is at C:\Program Files (x86)\VMware\vCAC\Agents\<VRA_Agent_Name>\logs\

In this log, you can find multiple lines with an error:

This exception was caught:
System.Web.Services.Protocols.SoapException: vCenter Error: Cannot complete login due to an incorrect user name or password.

if this is the case, check the LDAPS Certificate to your Domain Controllers of the domain you have added on the vCenter server Web UI.

Even though it doesn’t show you the certificate expiry in this UI, you can check the certificate status by logging into vcenter SSH and executing the following command:

openssl s_client -connect adds01.corp.test.local:636 -showcerts

Replace the Domain Controller hostname with your domain controller hostname after the -connect in the above command to get the valid cert from the domain controller.

In our case, we found that the cert on the domain controller has been recently renewed and we had to input the new cert to the Identity Source in the vcenter web UI.

Once the new cert is installed, you can login into your VRA Default Tenant (VRA 7.6), go to Infrastructure -> Endpoints -> Endpoints and go to your vcenter and click on edit and then re-validate the service account password (Test the connection) and once it is successful, the VRA Agent will come back UP.

Testing the connection to the vcenter using the service account which is already added and the test is successful.

Hope this article helps you if you see your VRA agents as down and can’t find anything else missing or even restarting the vra agent service doesn’t change the status.

Advertisement

Great VCF Troubleshooting Guide by my Fellow vExpert

I wanted to ping back one of the great article by one of my fellow vExpert Shank Mohan on his website about an unofficial VCF Troubleshooting guide. I have learned from this article and would like to remember this article and hence posting it back on my blog.

Great VCF Troubleshooting guide by Shank Mohan

LCM Directory Permission Error When pre-checking for SDDC Manager Upgrade with VCF 3.11 Patch

I was getting ready to patch our environment from VCF 3.10.2.2 to VCF 3.11 as VMware has officially released a complete Patch for VCF 3.10.x this month, when I was performing the VCF Upgrade Pre-Check for the Management Domain, I came across this issue

The LCM Pre-Check Failed due to a directory permission issue for one of the lcm directory

Issue is that the pre-check says that the directory “/var/log/vmare/vcf/lcm/upgrades/<long code directory>/lcmAbout” owner is root but the owner needs to be user vcf_lcm

This is how I resolved the issue:

Login into SDDC Manager as user vcf, do su and provide the root password

then go to the following directory “/var/log/vmware/vcf/lcm/upgrades/<long code directory as displayed in the lcm error on sddc manager>

chown vcf_lcm lcmAbout
chmod 750 lcmAbout

The above two commands will change the owner from root to vcf_lcm and also provide the required permissions to the folder so the pre-check can complete.

The full screenshot of what I performed is below:

Commands to change owner to vcf_lcm and to provide the required permissions for the folder lcmAbout

Once you perform the commands above, you can run the pre-check and this time it will proceed successfully as shown below

Hope this article helps if you come across this issue with sddc manager upgrade from VCF 3.10.2.2 to 3.11

VRA Proxy Agent Down and Inventory Data Collection stuck ‘in progress’ – VRA 7.6

Recently we had an issue where in one of our Sites (We have multiple sites in VCF), the VRA Proxy Agent was showing as Down and restarting the services (VRA Agent) on the ims (Infrastructure Manager Service) did not bring the agent up.

Here is the process to check if the ims load balancer address is entered in the VRMAgent.exe.config file on the ims server.

Issue: In our case, the VRM Agent was installed on the active Infrastructure Manager Service server (ims01a), However, the vrm agent config only had this entry instead of the load balancer entry (imslb) in its configuration. So, when the ims01a became passive and node ims01b became active node, this broke the VRM agent and the agent status became down.

Solution: Edit the vrmagent.config file and update the lines 83 and 104 pointing this file to the ims load balancer hostname so that when the ims servers change active-passive state, the VRM agent will not go down.

Before we continue, stop the service “VMware vCloud Automation Center Agent – agent_name” (Here in my example the agent name is dc2)

Pictures of the issue are below:

VRM Agent status showing as Down
Data Collection status showing as in progress but not changing state to successful

Solution Screenshots are as below:

Location of the VRMAgent.exe.config file on the iws (Infrastructure Web Server) node
Line 83 where you will need to change the hostname to the ims loadbalancer. (In this screenshot, the load balancer hostname is https://dc1vraimslb.domain.local)
line 104 where you need to edit the endpoint address to be the load balancer hostname

Once these modifications are done in this config file, you save it and then start the service “VMware vCloud Automation Center Agent – dc2 (where dc2 is the agent name configured when the agent was installed on this server)

Disclaimer: As this Environment is Property of my Company, The Original names have either been modified or pixelated for Privacy.

Once the agent service is started, you can go back to VRA and check the Agent Status and it will be up and the in progress data collection will actually complete in few minutes (For my environment it took atleast 15-20 minutes for the inventory to complete).

Hope this article helps if you face the same issue in VRA 7.6!

Workaround instructions to address CVE-2021-44228 in vCenter Server 6.7.x – For VCF 3.10.x

UPDATE: VMware has Updated the KB 87081 to Include the Script to remove log4j_class

I have taken these Workaround Instructions from the KB article 87081 and KB article 87095

For vCenter 6.7.x appliance in an VCF 3.10.x setup, some of the instructions in article 87081 don’t work and also in VCF 3.10.x since there are external PSC’s and the order to execute the instructions is as follows.

I am calling out VMware team to amend the steps for vCenter 6.7.x appliance in an non-HA configuration in the article 87081, especially for VCF 3.10.x installations.

For vCenter 6.7.x ; Steps to execute

vMON Service

  1. Backup the existing java-wrapper-vmon file

cp -rfp /usr/lib/vmware-vmon/java-wrapper-vmon /usr/lib/vmware-vmon/java-wrapper-vmon.bak

  1. Update the java-wrapper-vmon file with a text editor such as vi

vi /usr/lib/vmware-vmon/java-wrapper-vmon

  1. At the very bottom of the file, replace the very last line with 2 new lines
    • Originalexec $java_start_bin $jvm_dynargs “$@”Updated
      log4j_arg=”-Dlog4j2.formatMsgNoLookups=true”
      exec $java_start_bin $jvm_dynargs $log4j_arg “$@” 
  2. Restart vCenter Services

service-control –stop –all
service-control –start –all

Note: If the services do not start, ensure the file permissions are set correctly with these commands:

  • chown root:cis /usr/lib/vmware-vmon/java-wrapper-vmon
  • chmod 754 /usr/lib/vmware-vmon/java-wrapper-vmon

Analytics Service

NOTE:- The below workaround (Analytics service) is applicable for vCenter Server Appliance 6.7 Update 3o and Older versions only. vCenter Server Appliance 6.7 Update 3p is by default covered by vMON Service workaround. 

  1. Back up the log4j-core-2.8.2.jar file

cp -rfp /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar.bak

  1. Run the zip command to disable the class

zip -q -d /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar org/apache/logging/log4j/core/lookup/JndiLookup.class

  1. Restart the Analytics service

service-control –restart vmware-analytics 

CM Service

  1. Back up the log4j-core.jar file

cp -rfp /usr/lib/vmware-cm/lib/log4j-core.jar /usr/lib/vmware-cm/lib/log4j-core.jar.bak

  1. Run the zip command to disable the class

zip -q -d /usr/lib/vmware-cm/lib/log4j-core.jar org/apache/logging/log4j/core/lookup/JndiLookup.class

  1. Restart the CM service

service-control –restart vmware-cm

Run the remove_log4j_class.py script

1. Download the script attached to this KB (remove_log4j_class.py)

2. Login to the vCSA using an SSH Client (using Putty.exe or any similar SSH Client)

3. Transfer the file to /tmp folder on vCenter Server Appliance using WinSCP
Note: It’s necessary to enable the bash shell before WinSCP will work

4. Execute the script copied in step 1:

python remove_log4j_class.py

The script will stop all vCenter services, proceed with removing the JndiLookup.class from all jar files on the appliance and finally start all vCenter services. The files that the script modifies will be reported as “VULNERABLE FILE” as the script runs.

Verify the changes

Once all sections are complete, use the following steps to confirm if they were implemented successfully.

  1. Verify if the stsd, idmd, and vMon controlled services were started with the new -Dlog4j2.formatMsgNoLookups=true parameter:

ps auxww | grep formatMsgNoLookups

Check if the processes include -Dlog4j2.formatMsgNoLookups=true

  1. Verify the Analytics Service changes:

grep -i jndilookup /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar | wc -l
 This should return 0 lines

  1. Verify the CM Service changes:

grep -i jndilookup /usr/lib/vmware-cm/lib/log4j-core.jar | wc -l

This should return 0 lines

The remaining steps for Secure Token Service, Identity Management Service don’t work for vcenter 6.7.x in VCF 3.10.x (3.10.2.1) environment

——– So, after this Step, we will have to SSH into the External PSC and follow the below steps ———-

CM Service

  1. Back up the log4j-core.jar file

cp -rfp /usr/lib/vmware-cm/lib/log4j-core.jar /usr/lib/vmware-cm/lib/log4j-core.jar.bak

  1. Run the zip command to disable the class

zip -q -d /usr/lib/vmware-cm/lib/log4j-core.jar org/apache/logging/log4j/core/lookup/JndiLookup.class

  1. Restart the CM service

service-control –restart vmware-cm


Secure Token Service

  1. Back up and edit the the vmware-stsd file

cp /etc/rc.d/init.d/vmware-stsd /root/vmware-stsd.bakvi /etc/rc.d/init.d/vmware-stsd

  1. Find the section labeled start_service(). Insert a new line near line 266, just before “$DAEMON_CLASS start” with “-Dlog4j2.formatMsgNoLookups=true \” as seen in the example:

start_service()
{
  perform_pre_startup_actions

  local retval
  JAVA_MEM_ARGS=`/usr/sbin/cloudvm-ram-size -J vmware-stsd`
  $JSVC_BIN -procname $SERVICE_NAME \
            -home $JAVA_HOME \
            -server \
            <snip>
            -Dauditlog.dir=/var/log/audit/sso-events  \
            -Dlog4j2.formatMsgNoLookups=true \
            $DAEMON_CLASS start

  1. Restart the vmware-stsd service

service-control –stop vmware-stsd
service-control –start vmware-stsd

Identity Management Service

  1. Back up and edit the the vmware-sts-idmd file

cp /etc/rc.d/init.d/vmware-sts-idmd /root/vmware-sts-idmd.bakvi /etc/rc.d/init.d/vmware-sts-idmd

  1. Insert a new line near line 177 before “$DEBUG_OPTS \” with “-Dlog4j2.formatMsgNoLookups=true \” as seen in the example:

$JSVC_BIN -procname $SERVICE_NAME \
          -wait 120 \
          -server \
          <snip>
          -Dlog4j.configurationFile=file://$PREFIX/share/config/log4j2.xml \
          -Dlog4j2.formatMsgNoLookups=true \
          $DEBUG_OPTS \
          $DAEMON_CLASS

  1. Restart the vmware-sts-idmd service

service-control –stop vmware-sts-idmd
service-control –start vmware-sts-idmd

Verify the changes

Once all sections are complete, use the following steps to confirm if they were implemented successfully.

  1. Verify if the stsd, idmd, psc-client, and vMon controlled services were started with the new -Dlog4j2.formatMsgNoLookups=true parameter:

ps auxww | grep formatMsgNoLookups

Check if the processes include -Dlog4j2.formatMsgNoLookups=true

  1. Verify the CM Service changes:

grep -i jndilookup /usr/lib/vmware-cm/lib/log4j-core.jar | wc -l

This should return 0 lines

The steps in VMware KB Article 87081 is for vCenter with Embedded PSC and the above steps are for the vCenter server 6.7 with an External PSC

Hope this article helps the Engineers who are working on this log4j Vulnerability and if they have VCF 3.10.x you can follow the above steps with an external PSC Configuration.

VMware Cloud Foundation (VCF) API Reference Guides

Here is the direct link to the API Reference Guide for VMware Cloud Foundation (VCF)

https://vdc-download.vmware.com/vmwb-repository/dcr-public/2d4955d7-fb6f-4a61-be78-64d95b951ccd/c6e26ae1-9438-4da0-bfc7-2e21d9046820/index.html#_overview

This is the Generic API Reference Guide for VCF instead of being Version Centric.

Updated – 2/16/2022

From Version VCF 4.3.1 (Non-VxRail), VMware has moved their API Guides to a new location which is a better format than before and more user friendly.

VCF 4.3.1 New API Reference Guide

VCF 4.4.0 New API Reference Guide

Updated – 1/25/2022

For Version Centric API Guides

VCF 3.10 API Reference Guide

VCF 4.0 API Reference Guide

VCF 4.1 API Reference Guide

VCF 4.2 API Reference Guide

VCF 4.3 API Reference Guide

VCF 4.3.1 API Reference Guide

NOTE: These Reference Guides and their versions are for NON VXRAIL implementations. They are valid for Regular VCF Implementation with VSAN Ready nodes.