blindpete.com Just my rambles

VMware ESX4 guests losing network connectivity briefly

Posted by admin on February 17th, 2010 | 0 comments

I came across a very odd issue lately where guests on one of our ESX4 hosts were periodically losing network connectivity very briefly: maybe 10 ICMP packets dropped every half hour or hour.

I spent a long time debugging the network side, thinking that perhaps a NIC was misconfigured with the wrong VLAN, but the problem kept happening.

So, after ssh'ing onto the host, I started to trawl through the log files and came across the following in /var/log/vmkwarning:

Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6090a028004f243d08ab44c26687e3dd" - issuing command 0x410002074040
Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6090a028004f243d08ab44c26687e3dd" - failed to issue command due to Not found (APD), try again...
Feb 17 13:44:19 vminfrabox vmkernel: 18:00:00:11.865 cpu4:4222)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.6090a028004f243d08ab44c26687e3dd": awaiting fast path state update...

This was occurring every half hour: entries like those above filled the log solidly for about 2 minutes each time.
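If the log is too noisy to eyeball, a few lines of Python can show which devices the warnings refer to. This is only a rough sketch: the function name `count_failover_warnings` is made up for illustration, and the pattern assumes the warnings look like the ones quoted above (it accepts straight or typographic quotes around the device ID, since copies of the log may differ).

```python
import re
from collections import Counter

# Assumed pattern, based only on the warning lines quoted in this post.
# \u201c and \u201d are typographic quotes that may appear in pasted logs.
NMP_WARNING = re.compile(
    r'WARNING: NMP: nmp_DeviceAttemptFailover:.*?device ["\u201c](naa\.[0-9a-f]+)["\u201d]'
)

def count_failover_warnings(lines):
    """Return a Counter mapping device ID -> number of failover warnings seen."""
    counts = Counter()
    for line in lines:
        match = NMP_WARNING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To try it on a copy of the log, pass an open file handle, for example `count_failover_warnings(open("vmkwarning"))`, and look at `.most_common()` on the result. A device ID that dominates the counts is a good candidate for a stale mapping.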

After some digging on Google, I found out that ESX4 has a bug: if the host still holds a duff or old connection to an iSCSI LUN (perhaps one that no longer exists, but you never rescanned to remove it), then when the host checks its paths every 30 minutes it finds this duff connection and goes through the motions of trying to find failover paths. The bug is that this causes very brief network loss to your guests.
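To check whether your own warnings follow that 30-minute rhythm, you can group them into bursts and measure the gaps between burst starts. Again a sketch under assumptions: the function names are hypothetical, the timestamp is assumed to be the first 15 characters of each line (standard syslog style, English month names), and the year has to be supplied because syslog does not record it. A regular gap does not prove this bug is the cause; it is just a quick sanity check.

```python
from datetime import datetime, timedelta

def burst_starts(lines, year=2010, quiet_gap=timedelta(minutes=5)):
    """Return the start time of each burst of nmp_DeviceAttemptFailover warnings.

    A new burst begins when a warning arrives more than `quiet_gap` after the
    previous warning. `year` is an assumption, since syslog lines omit it.
    """
    starts = []
    previous = None
    for line in lines:
        if "nmp_DeviceAttemptFailover" not in line:
            continue
        # First 15 characters hold the timestamp, e.g. "Feb 17 13:44:19".
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        if previous is None or stamp - previous > quiet_gap:
            starts.append(stamp)
        previous = stamp
    return starts

def burst_intervals(starts):
    """Return the gaps between consecutive burst start times."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]
```

If `burst_intervals(burst_starts(open("vmkwarning")))` comes back as a run of values close to 30 minutes, that matches the periodic path check described here.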

The fix for me was simply to rescan my adapter, which removed the old mapping to one of our removed LUNs, and the problem went away.

Comments (2)
  1. Thanks for the fix. Even though I am using FC I had the exact same problem and had to rescan adaptors in order to remove reference to a datastore that had been taken offline.

  2. Same here. FC instead, but found about a dozen devices still hanging on. You made me look good man. Thanks for posting this.

