VRA Agent Status Down in VRA 7.6, LDAPS Certificate Issue

Recently, we came across an issue in our Production environment where the VRA Agent status was showing as Down in one of our Sites.

The screenshot is shown below:

This screenshot shows the two clusters.

On investigating, we checked the vSphereAgent.log file, which is present on the server where this VRA agent was installed and configured. (In our case it was one of the IWS (IaaS Web Server) nodes.)

The location of this log file is at C:\Program Files (x86)\VMware\vCAC\Agents\<VRA_Agent_Name>\logs\

In this log, you will find multiple occurrences of this error:

This exception was caught:
System.Web.Services.Protocols.SoapException: vCenter Error: Cannot complete login due to an incorrect user name or password.
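
To gauge how often the login failure is occurring, you can search the log directly on the IaaS node (a quick check of mine; the path and the agent name placeholder are the same as above):

findstr /i /c:"incorrect user name or password" "C:\Program Files (x86)\VMware\vCAC\Agents\<VRA_Agent_Name>\logs\vSphereAgent.log"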

If this is the case, check the LDAPS certificate for the Domain Controllers of the domain you have added as an Identity Source in the vCenter Web UI.

Even though the UI doesn’t show you the certificate expiry, you can check the certificate status by logging into vCenter over SSH and executing the following command:

openssl s_client -connect adds01.corp.test.local:636 -showcerts

Replace adds01.corp.test.local after -connect with your own Domain Controller hostname to retrieve the certificate that Domain Controller is currently serving.
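
If you want the validity dates without scrolling through the full chain, you can pipe the output through openssl x509 (a quick one-liner I find handy; same placeholder hostname as above):

openssl s_client -connect adds01.corp.test.local:636 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -subject -dates

The notBefore/notAfter lines tell you immediately whether the certificate has expired or was recently renewed.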

In our case, we found that the cert on the domain controller had recently been renewed, and we had to add the new cert to the Identity Source in the vCenter Web UI.

Once the new cert is installed, log in to your VRA Default Tenant (VRA 7.6), go to Infrastructure -> Endpoints -> Endpoints, open your vCenter endpoint, click Edit, and re-validate the service account password (Test the connection). Once the test is successful, the VRA Agent will come back UP.

Screenshot: testing the connection to vCenter using the service account that is already added; the test is successful.

Hope this article helps if you see your VRA agents as Down, can’t find anything else wrong, and even restarting the VRA agent service doesn’t change the status.

Great VCF Troubleshooting Guide by my Fellow vExpert

I wanted to ping back a great article by my fellow vExpert Shank Mohan on his website: an unofficial VCF troubleshooting guide. I have learned a lot from it and would like to keep it handy, hence re-posting it on my blog.

Great VCF Troubleshooting guide by Shank Mohan

LCM Directory Permission Error When pre-checking for SDDC Manager Upgrade with VCF 3.11 Patch

I was getting ready to patch our environment from VCF 3.10.2.2 to VCF 3.11, as VMware officially released a complete patch for VCF 3.10.x this month. While performing the VCF Upgrade Pre-Check for the Management Domain, I came across this issue.

The LCM Pre-Check failed due to a directory permission issue on one of the LCM directories.

The issue is that the pre-check reports the directory “/var/log/vmware/vcf/lcm/upgrades/<long code directory>/lcmAbout” as owned by root, while the owner needs to be the user vcf_lcm.

This is how I resolved the issue:

Log in to SDDC Manager as the user vcf, run su, and provide the root password.

Then go to the following directory: “/var/log/vmware/vcf/lcm/upgrades/<long code directory as displayed in the LCM error on SDDC Manager>”

chown vcf_lcm lcmAbout
chmod 750 lcmAbout

The two commands above change the owner from root to vcf_lcm and set the required permissions on the folder so the pre-check can complete.
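
You can confirm the change with a quick listing (my own sanity check, not part of the pre-check itself):

ls -ld lcmAbout

The owner column should now show vcf_lcm instead of root.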

The full screenshot of what I performed is below:

Commands to change owner to vcf_lcm and to provide the required permissions for the folder lcmAbout

Once you perform the commands above, you can run the pre-check again, and this time it will proceed successfully as shown below.

Hope this article helps if you come across this issue with the SDDC Manager upgrade from VCF 3.10.2.2 to 3.11.

VCF 3.x Patch 3.11 for Log4j Vulnerability and Other Security Patches Included

VMware has finally released a patch version for VCF 3.x, and the version is 3.11. You can only download this in patch form from the SDDC Manager. You can upgrade to version 3.11 from 3.10.2.2 or from VCF 3.5 or later.

VMSA-2021-0028.13 (vmware.com)

This release, VCF 3.11, includes the following:

  • Security fixes for Apache Log4j Remote Code Execution Vulnerability: This release fixes CVE-2021-44228 and CVE-2021-45046. See VMSA-2021-0028.
  • Security fixes for Apache HTTP Server: This release fixes CVE-2021-40438. See CVE-2021-40438.
  • Improvements to upgrade prechecks: Upgrade prechecks have been expanded to verify filesystem capacity, file permissions, and passwords. These improved prechecks help identify issues that you need to resolve to ensure a smooth upgrade.
  • This also resolves Security Advisory VMSA-2022-0004, which deals with several vulnerabilities in ESXi 6.7 hosts
  • This also resolves the vulnerability in VCF SDDC Manager 3.x covered by Security Advisory VMSA-2022-0003
  • This version also addresses the heap-overflow vulnerability in ESXi hosts covered by Security Advisory VMSA-2022-0001.2

The updated product versions according to the BOM for VCF 3.11 are shown below:

Hope this post helps the teams who are on VCF 3.10.x and have been waiting for the long-awaited Log4j patch instead of a workaround.

VRA Proxy Agent Down and Inventory Data Collection stuck ‘in progress’ – VRA 7.6

Recently we had an issue where, in one of our Sites (we have multiple Sites in VCF), the VRA Proxy Agent was showing as Down, and restarting the service (VRA Agent) on the IMS (Infrastructure Manager Service) node did not bring the agent up.

Here is the process to check whether the IMS load balancer address is entered in the VRMAgent.exe.config file on the IMS server.

Issue: In our case, the VRM Agent was installed on the active Infrastructure Manager Service server (ims01a). However, the VRM agent config pointed to ims01a itself instead of the load balancer (imslb). So when ims01a became passive and ims01b became the active node, this broke the VRM agent and the agent status went Down.

Solution: Edit the VRMAgent.exe.config file and update lines 83 and 104 to point to the IMS load balancer hostname, so that when the IMS servers switch active-passive state, the VRM agent will not go down.

Before we continue, stop the service “VMware vCloud Automation Center Agent – agent_name” (in my example, the agent name is dc2).

Pictures of the issue are below:

VRM Agent status showing as Down
Data Collection status showing as in progress but not changing state to successful

Solution Screenshots are as below:

Location of the VRMAgent.exe.config file on the IWS (Infrastructure Web Server) node
Line 83, where you will need to change the hostname to the IMS load balancer. (In this screenshot, the load balancer hostname is https://dc1vraimslb.domain.local)
Line 104, where you need to edit the endpoint address to be the load balancer hostname

Once these modifications are done in the config file, save it and then start the service “VMware vCloud Automation Center Agent – dc2” (where dc2 is the agent name configured when the agent was installed on this server).
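
For reference, here is a sketch of the whole stop/edit/start sequence from an elevated command prompt (the service display name is assumed to follow the pattern above with dc2 as the agent name, and the agent folder path is assumed from the log location shown earlier on this blog; verify both on your own node):

net stop "VMware vCloud Automation Center Agent - dc2"
REM Locate the lines that still point at the individual IMS node:
findstr /n /i "ims01a" "C:\Program Files (x86)\VMware\vCAC\Agents\dc2\VRMAgent.exe.config"
REM Edit lines 83 and 104, replacing ims01a with the imslb load balancer hostname, then:
net start "VMware vCloud Automation Center Agent - dc2"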

Disclaimer: As this environment is the property of my company, the original names have either been modified or pixelated for privacy.

Once the agent service is started, you can go back to VRA and check the Agent Status: it will be Up, and the in-progress data collection will actually complete in a few minutes (in my environment it took at least 15-20 minutes for the inventory to complete).

Hope this article helps if you face the same issue in VRA 7.6!

How to troubleshoot vSAN issues?

A great, comprehensive post on how to check for vSAN issues in your environment!

VirtuallyVTrue

Hope you are all doing great. For today’s post I wanted to put together some of the commands/troubleshooting I’ve used with VMware vSAN.

Identify a partitioned node in a vSAN cluster (hosts)

What a single partitioned node looks like:

~ # esxcli vsan cluster get
Cluster Information
 Enabled: true
 Current Local Time: 2020-10-25T10:35:19Z
 Local Node UUID: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
 Local Node State: MASTER
 Local Node Health State: HEALTHY
 Sub-Cluster Master UUID: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
 Sub-Cluster Backup UUID:
 Sub-Cluster UUID: 52e4fbe6-7fe4-9e44-f9eb-c2fc1da77631
 Sub-Cluster Membership Entry Revision: 7
 Sub-Cluster Member UUIDs: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
 Sub-Cluster Membership UUID: ba45d050-2e84-c490-845f-1cc1de253de4
~ #

What a full 4-node cluster looks like (no partition)

~ # esxcli vsan cluster get
Cluster Information
 Enabled: true
 Current Local Time: 2020-10-25T10:35:19Z
 Local Node UUID: 54188e3a-84fd-9a38-23ba-001b21168828
 Local Node State: MASTER
 Local Node Health State: HEALTHY
 Sub-Cluster Master UUID: 54188e3a-84fd-9a38-23ba-001b21168828
 Sub-Cluster Backup UUID: 545ca9af-ff4b-fc84-dcee-001f29595f9f
 Sub-Cluster UUID: 529ccbe4-81d2-89bc-7a70-a9c69bd23a19
 Sub-Cluster Membership Entry Revision: 3
 Sub-Cluster Member UUIDs: 54188e3a-84fd-9a38-23ba-001b21168828, 545ca9af-ff4b-fc84-dcee-001f29595f9f…
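
A quick way to spot a partition is to look at the member list alone: a healthy node lists every cluster member’s UUID, while a partitioned node lists only its own (same command as above, filtered):

esxcli vsan cluster get | grep "Sub-Cluster Member UUIDs"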


Obtain the placement of the physical disk by NAA id on the ESXi Hosts

This is a great post on how to find the physical location of the disks on an ESXi host.

VirtuallyVTrue

Here is a simple script to obtain the placement of the physical disk by naa on ESXi hosts

Copy the script below and save it on the ESXi host

# Script to obtain the placement of the physical disk by naa on ESXi hosts
# Do not change anything below this line
# --------------------------------------

echo "=============Physical disks placement=============="
echo ""
	
esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa" | while read in; do

echo "$in"
esxcli storage core device physical get -d "$in"
sleep 1

echo "===================================================="

done

Run the script:

[root@esxi1:~] sh disk.sh

You will get similar output as per your environment.
Output:

[root@esxi:~] sh disk.sh
=============Physical disks placement==============
naa.5002538a9823d020
Physical Location: enclosure 1, slot 6
====================================================
naa.5002538a9823d1c0
Physical Location: enclosure 1, slot 3
====================================================
naa.58ce38ee204ccd59
Physical Location: enclosure 1, slot 7
====================================================
naa.5002538a9823d070
Physical Location: enclosure 1, slot 1
====================================================
naa.5002538a9823d040
Physical…
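
If you only need the placement of a single disk, you can run the underlying command directly (the device ID here is taken from the output above; substitute your own NAA ID):

esxcli storage core device physical get -d naa.5002538a9823d020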


Workaround instructions to address CVE-2021-44228 in vCenter Server 6.7.x – For VCF 3.10.x

UPDATE: VMware has updated KB 87081 to include the remove_log4j_class.py script

I have taken these workaround instructions from KB articles 87081 and 87095.

For a vCenter 6.7.x appliance in a VCF 3.10.x setup, some of the instructions in KB 87081 don’t work as written, since VCF 3.10.x uses external PSCs; the order in which to execute the instructions is as follows.

I am calling on the VMware team to amend the steps in KB 87081 for a vCenter 6.7.x appliance in a non-HA configuration, especially for VCF 3.10.x installations.

Steps to execute for vCenter 6.7.x

vMon Service

  1. Back up the existing java-wrapper-vmon file

cp -rfp /usr/lib/vmware-vmon/java-wrapper-vmon /usr/lib/vmware-vmon/java-wrapper-vmon.bak

  2. Update the java-wrapper-vmon file with a text editor such as vi

vi /usr/lib/vmware-vmon/java-wrapper-vmon

  3. At the very bottom of the file, replace the very last line with 2 new lines:
    • Original:
      exec $java_start_bin $jvm_dynargs "$@"
    • Updated:
      log4j_arg="-Dlog4j2.formatMsgNoLookups=true"
      exec $java_start_bin $jvm_dynargs $log4j_arg "$@"
  4. Restart vCenter Services

service-control --stop --all
service-control --start --all

Note: If the services do not start, ensure the file permissions are set correctly with these commands:

  • chown root:cis /usr/lib/vmware-vmon/java-wrapper-vmon
  • chmod 754 /usr/lib/vmware-vmon/java-wrapper-vmon
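
You can also sanity-check that the wrapper file now ends with the two new lines (my own quick check, not part of the KB steps):

tail -n 2 /usr/lib/vmware-vmon/java-wrapper-vmon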

Analytics Service

NOTE: The workaround below (Analytics service) is applicable to vCenter Server Appliance 6.7 Update 3o and older versions only. vCenter Server Appliance 6.7 Update 3p is covered by default by the vMon Service workaround.

  1. Back up the log4j-core-2.8.2.jar file

cp -rfp /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar.bak

  2. Run the zip command to disable the class

zip -q -d /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar org/apache/logging/log4j/core/lookup/JndiLookup.class

  3. Restart the Analytics service

service-control --restart vmware-analytics

CM Service

  1. Back up the log4j-core.jar file

cp -rfp /usr/lib/vmware-cm/lib/log4j-core.jar /usr/lib/vmware-cm/lib/log4j-core.jar.bak

  2. Run the zip command to disable the class

zip -q -d /usr/lib/vmware-cm/lib/log4j-core.jar org/apache/logging/log4j/core/lookup/JndiLookup.class

  3. Restart the CM service

service-control --restart vmware-cm

Run the remove_log4j_class.py script

1. Download the script attached to this KB (remove_log4j_class.py)

2. Log in to the vCSA using an SSH client (PuTTY or any similar SSH client)

3. Transfer the file to the /tmp folder on the vCenter Server Appliance using WinSCP
Note: It’s necessary to enable the bash shell before WinSCP will work

4. Execute the script from the /tmp folder (transferred in step 3):

python remove_log4j_class.py

The script will stop all vCenter services, proceed with removing the JndiLookup.class from all jar files on the appliance and finally start all vCenter services. The files that the script modifies will be reported as “VULNERABLE FILE” as the script runs.

Verify the changes

Once all sections are complete, use the following steps to confirm if they were implemented successfully.

  1. Verify if the stsd, idmd, and vMon controlled services were started with the new -Dlog4j2.formatMsgNoLookups=true parameter:

ps auxww | grep formatMsgNoLookups

Check if the processes include -Dlog4j2.formatMsgNoLookups=true

  2. Verify the Analytics Service changes:

grep -i jndilookup /usr/lib/vmware/common-jars/log4j-core-2.8.2.jar | wc -l

This should return 0 lines

  3. Verify the CM Service changes:

grep -i jndilookup /usr/lib/vmware-cm/lib/log4j-core.jar | wc -l

This should return 0 lines

The remaining steps for the Secure Token Service and Identity Management Service don’t work for vCenter 6.7.x in a VCF 3.10.x (3.10.2.1) environment.

-------- So, after this step, we will have to SSH into the external PSC and follow the steps below --------

CM Service

  1. Back up the log4j-core.jar file

cp -rfp /usr/lib/vmware-cm/lib/log4j-core.jar /usr/lib/vmware-cm/lib/log4j-core.jar.bak

  2. Run the zip command to disable the class

zip -q -d /usr/lib/vmware-cm/lib/log4j-core.jar org/apache/logging/log4j/core/lookup/JndiLookup.class

  3. Restart the CM service

service-control --restart vmware-cm


Secure Token Service

  1. Back up and edit the vmware-stsd file

cp /etc/rc.d/init.d/vmware-stsd /root/vmware-stsd.bak
vi /etc/rc.d/init.d/vmware-stsd

  2. Find the section labeled start_service(). Insert a new line near line 266, just before "$DAEMON_CLASS start", with "-Dlog4j2.formatMsgNoLookups=true \" as seen in the example:

start_service()
{
  perform_pre_startup_actions

  local retval
  JAVA_MEM_ARGS=`/usr/sbin/cloudvm-ram-size -J vmware-stsd`
  $JSVC_BIN -procname $SERVICE_NAME \
            -home $JAVA_HOME \
            -server \
            <snip>
            -Dauditlog.dir=/var/log/audit/sso-events  \
            -Dlog4j2.formatMsgNoLookups=true \
            $DAEMON_CLASS start

  3. Restart the vmware-stsd service

service-control --stop vmware-stsd
service-control --start vmware-stsd

Identity Management Service

  1. Back up and edit the vmware-sts-idmd file

cp /etc/rc.d/init.d/vmware-sts-idmd /root/vmware-sts-idmd.bak
vi /etc/rc.d/init.d/vmware-sts-idmd

  2. Insert a new line near line 177, before "$DEBUG_OPTS \", with "-Dlog4j2.formatMsgNoLookups=true \" as seen in the example:

$JSVC_BIN -procname $SERVICE_NAME \
          -wait 120 \
          -server \
          <snip>
          -Dlog4j.configurationFile=file://$PREFIX/share/config/log4j2.xml \
          -Dlog4j2.formatMsgNoLookups=true \
          $DEBUG_OPTS \
          $DAEMON_CLASS

  3. Restart the vmware-sts-idmd service

service-control --stop vmware-sts-idmd
service-control --start vmware-sts-idmd
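
Optionally, confirm that both init scripts picked up the new flag (a quick check of my own, not part of the KB):

grep -n "formatMsgNoLookups" /etc/rc.d/init.d/vmware-stsd /etc/rc.d/init.d/vmware-sts-idmd

Each file should show the line you inserted.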

Verify the changes

Once all sections are complete, use the following steps to confirm if they were implemented successfully.

  1. Verify if the stsd, idmd, psc-client, and vMon controlled services were started with the new -Dlog4j2.formatMsgNoLookups=true parameter:

ps auxww | grep formatMsgNoLookups

Check if the processes include -Dlog4j2.formatMsgNoLookups=true

  2. Verify the CM Service changes:

grep -i jndilookup /usr/lib/vmware-cm/lib/log4j-core.jar | wc -l

This should return 0 lines

The steps in VMware KB article 87081 are for vCenter with an embedded PSC, while the steps above are for vCenter Server 6.7 with an external PSC.

Hope this article helps the engineers who are working on this Log4j vulnerability; if you have VCF 3.10.x, you can follow the steps above for an external PSC configuration.

NSX Plugin 1.2 in VRA 7.6 Not Generating NSX Security Groups in a Page (NOT SOLVED YET)

Recently, we have had an ongoing issue where the NSX plugin in vRO is not populating one page out of 4, and this is messing up our vRO code that creates and applies NSX-V Security Tags on our VMs.

Below is a screenshot of the issue

This shows that the other pages have security groups in them, but page-1 under one of the NSX Managers (NSX-V version 6.4.x) is not populated.

I have already deleted and re-installed the NSX-V plugin using the vRO Control Center, with no resolution.

The issue is not resolved yet and I will update this post with the resolution soon.

How to Find the NIC Driver Version on an ESXi Host and Get the Correct Driver from VMware

Recently, I had to search for a QLogic 2x25GbE QL41262HMCU CNA NIC driver to update it on multiple Dell R740XD hosts. It’s been a while since I used Update Manager (vSphere 6.7 environment), hence this post.

First, SSH into an ESXi host and execute the following command to check the firmware/driver version of the vmnic you want to update (in my case, all my vmnics are QLogic CNA NICs):

esxcli network nic get -n vmnic2

Output of the above esxcli command

Things to note are the Driver Name/Type, the Firmware Version (the first part of it is sufficient), and the Version (this is the actual driver version on the ESXi host).

In the above screenshot, the driver is ‘qedentv’, the firmware version is 8.53.3.0, and the driver version is 3.11.16.0.
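
If you want to collect this for every vmnic on the host in one go, a small loop works as well (a sketch, assuming the default esxcli output layout where the first two lines of the NIC list are the header):

for nic in $(esxcli network nic list | awk 'NR>2 {print $1}'); do
  echo "== $nic =="
  esxcli network nic get -n "$nic" | grep -E "Driver:|Firmware Version:|Version:"
done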

Now, we need to find the entries/numbers to search for the exact driver on the VMware Compatibility website.

Execute the following command in the ESXi SSH session:

vmkchdev -l | grep vmnic2

The highlighted portion is the one we require to search for the driver on VMware Compatibility website
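
For reference, a vmkchdev line generally looks like the following (reconstructed from the values listed below; the PCI address here is a placeholder):

0000:3b:00.0 1077:8070 1077:000b vmkernel vmnic2

The second column is VID:DID and the third is SVID:SSID.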

Let us go to the VMware Compatibility website, IO Devices section.

We need to fill in the following values to get the exact driver for your NIC: VID, DID, SVID, and Max SSID.

Let us fill in the values from our vmkchdev output

  • VID 1077
  • DID 8070
  • SVID 1077
  • Max SSID 000b
Input the values in the VMware IO Compatibility List website
Qlogic Adapter and its versions by vSphere version

Select the vSphere version and click on it to display the different driver versions we can download.

I have selected vSphere version 6.7 U3 in this case, and the screenshot is below.

The ESXi NIC driver version and the physical adapter firmware version are different on my Dell server

As you can see, the ESXi NIC driver version and the physical NIC adapter firmware version are different on this host. (Typically, as a best practice, you should update the ESXi NIC driver once you upgrade the physical NIC firmware.)

In this case, my ESXi NIC driver version is 3.11.16.0 and the QLogic NIC physical firmware version is 8.53.x.x.

To download the correct driver, you need to make sure that the ESXi NIC driver version matches the physical NIC firmware for best compatibility. We will need to download the ‘qedentv’ driver.

We download the driver that matches the physical NIC firmware version, with the ESXi NIC driver name being qedentv in this case

Download the driver zip file using your My VMware credentials; you can then use this zip file as an offline patch in Update Manager to create a baseline for your ESXi hosts so the driver can be updated.

NOTE: Put the host in Maintenance Mode before you update the NIC driver, as the update will reboot the ESXi host.
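
If you are patching a one-off host from the shell instead of Update Manager, the same offline bundle can be applied with esxcli (a sketch; the datastore path and bundle filename are placeholders, and the host still needs a reboot afterwards):

esxcli system maintenanceMode set --enable true
esxcli software vib update -d /vmfs/volumes/datastore1/qedentv-driver-offline-bundle.zip
reboot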