During a recent storage upgrade and virtual environment refresh on a customer site, I ran into an issue that kept me busy, and really tested my patience. My patience for vendor support, customers, and IT in general. Please forgive the rambling. Skip ahead to the Resolution section for the answer.
The documents I found along the way are represented as links throughout the article.
This was a pretty routine EMC VNX 5500 block and VMware ESXi 5.x implementation.
1. The array and the FC fabric (Brocade 5100B) were configured, the current vCenter instance was upgraded from 4.1 update 1 to 5.0 Update 1.
- Due to a previously identified bug between the VNX running Block OE 5.31 and VMware ESXi 5.x we made sure all of the hosts were configured for VMware NMP FIXED and Failover Mode 4 (ALUA) on the array.
2. I installed ESXi 5.0 Update 1 on the new Cisco C200 and validated it could see the legacy CX4 datastores, as well as the new VNX datastores.
3. After moving one of the production host’s virtual machines to the new C200, I upgraded it’s version of ESXi 4.1 Update 1 to version 5.0 Update 1.
At this point the implementation was smooth sailing, so I upgraded the remaining hosts.
The next day I was informed that the virtual environment was extremely unresponsive. Virtual machines were still running but latencies were high and sporadic (10,000ms – 45,000ms). What I found most interesting, was that this was NOT consistent among all of the hosts. The newly installed C200 server was operating as expected, as were some of the older servers.
After reviewing the ESXi vmkerel.log logs for each of the ESXi hosts, I noticed very specific failures resident in the hosts exhibiting the storage performance issues.
Here is an example of some of the messages seen in /var/log/vmkernel.log:
cpu3:2051)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device “mpx.vmhba33:C0:T0:L0″ state in doubt; requested fast p ath state update…
cpu3:2051)ScsiDeviceIO: 2322: Cmd(0x412400391f40) 0x1a, CmdSN 0×3517 from world 0 to dev “mpx.vmhba33:C0:T0:L0″ failed H:0×7 D:0×0 P:0×0 Possible sense data: 0×5 0×24 0×0.
…. output suppressed
cpu7:2714)<6>qla2xxx 0000:09:00.0: Performing ISP error recovery – ha= 0x41001e88c440.
cpu7:2714)<6>qla2xxx 0000:09:00.0: ZIO mode 6 enabled; timer delay (100 us).
cpu7:2714)<6>qla2xxx 0000:09:00.0: FW: Loading via request-firmware…
cpu7:2714)<4>qla2xxx 0000:09:00.0: Firmware dump previously allocated.
cpu7:2714)<6>qla2xxx 0000:09:00.0: LIP reset occured (f700).
cpu7:2714)<6>qla2xxx 0000:09:00.0: LOOP UP detected (4 Gbps).
…. output suppressed
cpu11:3926)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device “mpx.vmhba33:C0:T0:L0″ state in doubt; requested fast path state update…
cpu11:3926)ScsiDeviceIO: 2322: Cmd(0×412440385700) 0x1a, CmdSN 0x35c6 from world 0 to dev “mpx.vmhba33:C0:T0:L0″ failed H:0×7 D:0×0 P:0×0 Possible sense data: 0×5 0×24 0×0.
Now the main entries to note are the references to “qla2xxx”. I only found sparse references to similar errors, but nothing promising. Finally I threw in the towel and escalated the incident to VMWare.
VMware (Round 1)
My experience with VMware support has been hit and miss, but in this case they were on the ball. The case was quickly picked up by an eager engineer and we had some environment details to verify. The first item we identified was outdated firmware for each of the qLogic HBAs. The default response from all vendor support is usually to have you make sure “everything” is at the latest revision before proceeding. (This is a good practice as it rules out previously identified/resolved bugs, but can be frustrating when firmware/drivers are out of the problems scope.)
That said, I was apprehensive about spending time going down the “upgrade bios/firmware/drivers” route but the Engineer found a few internal cases between VMware/qLogic (not releasable) that cited the newest HBA firmware resolved this issue. The article can be found here: Qlogic ISP Error
|Current Code||Latest Code|
BIOS version 2.10
BIOS version 3.00
To my disappointment, the issue was still resident after all of the HBA drivers had been upgraded. Enter Cisco TAC.
Cisco (Round 2)
A colleague of mine mentioned a BIOS (or vmKernel) setting that had to be made on Cisco C200 hosts and provided me with this article here: KB1030265. Even though this change is simple enough to implement, this particular environment couldn’t withstand a host reboot because virtual machines could not be evacuated due to the performance issues. Furthermore, the article specifically states that this fix is only applicable if “ALERT: APIC: 1823: APICID 0×00000000 – ESR = 0×40” is present in the /var/log/vmKernel.log log (Which it wasn’t)
Needing further assistance, I escalated the case to Cisco TAC. After reviewing my findings, as well as some ESXi host logs, support agreed that the APIC interrupt issue referenced above did NOT apply, but that the server firmware was not supported with ESXi 5.x. After reviewing the release notes (Found here), I confirmed this was the case.
The Holy Grail
To my surprise, the answer was in the release notes. Although the VMware KB referenced above states that the APIC Interrupt mapping issue is only occurring if you see the “ALERT: APIC: 1823: APICID 0×00000000 – ESR = 0×40” entry, the release notes also mention another symptom. Here is a little gem I pulled from the release notes:
BIOS > Advanced > PCI configuration > Onboard NIC# > enabled(CSCtu40828)
Symptom As per VMware KB 1030265, UCS C-series servers running VMware ESX/ESXi 4.1 or ESXi 5.0 may experience the following symptoms:
•HBAs stop responding.
•Other PCIs devices may also stop responding.
•You see an an illegal vector shortly before an HBA stops responding to the driver.
Example 1:vmkernel: 6:01:34:46.970 cpu0:4120)ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40
The HBA stops responding to commands.
Example 2:vmkernel: 6:01:42:36.189 cpu15:4274)<6>qla2xxx 0000:1a:00.0: qla2x00_abort_isp: **** FAILED ****vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: Failed mailbox send register test
The HBA card is marked as offline.
Example 3:vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: ISP error recovery failed - board disabled
Do examples 2 & 3 look familiar?
Once I disabled the vmKernel interrupt remapping using the following command on each host (SSH/CLI) followed by a reboot, the issue was resolved!
esxcli system settings kernel list -o iovDisableIR
I hope this has been helpful, and saves you some time. I definitely felt the pain while troubleshooting.