[Faccus] NETWORK ALERT - 2013-11-01 - status update re load balancer

bruce at uwaterloo.ca
Fri Nov 1 10:38:26 EDT 2013

Description:     status update re load balancer

Date: (YYYY-MM-DD)     2013-11-01

Start Time:              

End Time:		

Impact:           information update only


Submitted By: bruce at uwaterloo.ca

Summary: A bug and/or interoperability issue observed during a load balancer firmware upgrade resulted in loss of load balancing services, which persisted after the firmware was reverted. The load balancers have been in service for approximately 5 years. We believe the underlying issue has been present for 2 years, since the machine room was upgraded to a VRF (Virtual Routing and Forwarding) environment, which included moving the routers to VRRP (Virtual Router Redundancy Protocol) for redundancy. The issue appears to occur randomly after a reboot of the load balancers. The load balancers were rebooted approximately 3-4 times in the last 2 years; the issue was first seen on Oct 30.

•	A firmware upgrade on the Cisco ACE 4710 load balancers was required by the vendor in preparation for installation of an additional bandwidth license, and for other bug fixes and improvements recommended by the vendor. Previous version: A3(2.5). Target version: A4(2.3).
•	Oct 15 announcement that the development load balancer would have firmware upgraded on Oct 16 https://istns.uwaterloo.ca/uwna/index.php?n=1381841732
•	Oct 16 upgrade of development load balancer to A4(2.3) successful
•	Oct 21 announcement that the production load balancers would be rebooted (but not yet upgraded) Oct 23 https://istns.uwaterloo.ca/uwna/index.php?n=1382367787
•	Oct 23 reboots of production load balancers successful
•	Oct 28 announcement that the production load balancers would have firmware upgraded on Oct 30 https://istns.uwaterloo.ca/uwna/index.php?n=1382961374 The maintenance window of 6:30am to 7:30am on a weekday was selected based on an assessment of risk, the nature and duration of the service impact, and experience with past upgrades/reboots of the devices, which were installed in 2008.
•	Oct 30 the vendor procedure for upgrading firmware on a redundant pair of load balancers was followed: upgrade/reboot the passive node, perform a manual failover from active to passive, then upgrade/reboot the former active node.
•	After the upgrade, the Exchange e-mail service was not working.  Some debugging was performed on the load balancers and Exchange service.  A support case with Cisco was opened.
•	Since service could not be restored, the load balancers were reverted to A3(2.5).
•	Multiple load balanced services, including Exchange, wcms (homepage), authentication (including for Wi-Fi and learn), jobmine, hr and quest were not working.  Some debugging was performed on the load balancers and the affected services.
•	Technical symptoms included inability to ping the gateway from server farm nodes, and lack of ARP entry for the gateway on some (but not all) server farm nodes.  In some cases manual addition of the ARP entry could restore a given server farm node.
•	A manual failover to the passive load balancer was performed, and the former active node was powered off; neither action restored service.
•	The load balancer was operating on the same version as the previous day, with the same active node; however, services were not restored.
•	Since services could not be restored, load balanced server farms were migrated to single-instance, non-load-balanced services. Services were restored gradually from the morning until early afternoon, with most major services online by 1:30pm, and some earlier.
•	Updates to the campus were sent using the network alerts tool (reverse chronological order shown)
o	Wed, 30 Oct 2013 15:58:16 -0400 : 2013-10-30 - Major services restored
o	Wed, 30 Oct 2013 12:41:45 -0400 : 2013-10-30 - SERVICE DEGRADATION remains
o	Wed, 30 Oct 2013 11:58:48 -0400 : 2013-10-30 - SERVICE DEGRADATION
o	Wed, 30 Oct 2013 09:51:27 -0400 : 2013-10-30 - EXCHANGE service still unavailable
o	Wed, 30 Oct 2013 08:48:08 -0400 : 2013-10-30 - EMERGENCY RELOAD server load balancers
o	and also on the IST notice board http://strobe.uwaterloo.ca/blogs/ist_notice_board/2013/10/31/issue-with-exchange/
o	In addition, staff in the CHIP phoned multiple departments to provide updates (while the e-mail service was unavailable), and responded to phone inquiries
•	Considerable debugging was performed on the load balancer throughout the day, including phone discussions with the vendor, and supplying packet captures and other information to the vendor.
•	Oct 31 approx. 7:15am - the active load balancer was rebooted again, with no improvement.
•	Oct 31 - the passive node was reset to factory defaults, upgraded to A4(2.3) and configured as a standalone load balancer. One server farm was configured on it, which worked. This load balancer was then rebooted, and the server farm stopped working. Several more reboots were performed: in some cases the farm worked immediately, in some cases it recovered up to 7 minutes later, and in some cases it never recovered.
•	Oct 31 - a workaround on the separate load balancer was identified through debugging: pinging, from the load balancer, the real IP of the router in the VRRP (Virtual Router Redundancy Protocol) environment. This information, and additional packet captures, were sent to the vendor.
•	Oct 31 4:05pm - jobmine was successfully migrated to the separate load balancer, in preparation for the evening's job application process.
•	Nov 1 - multiple services (exchange, hr, quest, homepage) remain non-load-balanced, pending a solution or advice from Cisco, and/or our ability to implement a permanent workaround.
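
The ARP symptom described above (no ARP entry for the gateway on some server farm nodes) can be checked with a short script. The following is a minimal sketch only, not the procedure used during the incident. It assumes a Linux server farm node where the `ip neigh show` command is available; the function names, the sample addresses (taken from the 192.0.2.0/24 documentation range) and the sample MAC address are hypothetical.

```python
import subprocess

# Neighbour states that mean the address has not been resolved.
UNRESOLVED_STATES = ("FAILED", "INCOMPLETE")


def gateway_mac(neigh_output, gateway_ip):
    """Return the MAC address recorded for gateway_ip, or None if unresolved.

    neigh_output is the text printed by `ip neigh show` (assumed format:
    "<ip> dev <ifname> lladdr <mac> <STATE>").
    """
    for line in neigh_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != gateway_ip:
            continue
        if fields[-1] in UNRESOLVED_STATES or "lladdr" not in fields:
            return None
        idx = fields.index("lladdr")
        if idx + 1 < len(fields):
            return fields[idx + 1]
    return None


def read_neighbour_table():
    """Run `ip neigh show` on the local node and return its output."""
    result = subprocess.run(
        ["ip", "neigh", "show"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample output: one resolved gateway, one failed lookup.
SAMPLE = """\
192.0.2.1 dev eth0 lladdr 00:00:5e:00:01:0a REACHABLE
192.0.2.7 dev eth0 FAILED
"""

if __name__ == "__main__":
    mac = gateway_mac(SAMPLE, "192.0.2.1")
    print("gateway ARP entry:", mac if mac else "missing")
```

On a live node, `gateway_mac(read_neighbour_table(), "<gateway ip>")` would perform the same check against the real neighbour table. A result of None corresponds to the missing-ARP-entry symptom; it does not by itself identify the cause.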

Notice Submitted:    Fri Nov 1 10:38:26 EDT 2013

Follow us on twitter https://twitter.com/UWNetworkAlert

Note:   If you have any questions or concerns please contact the IST Help Desk at ext: 84357 or helpdesk at uwaterloo.ca
