Working on the Engines While the Plane is Flying
Topic: Data Center Operations
Operators of large-scale networks will, from time to time, be required to perform major upgrades to the network while keeping the network available with no downtime. This type of work has been compared to working on the engines of an airliner while it is flying. At eBay, our Site Network Engineering team recently completed a migration of our data center aggregation layer from one platform to another under these conditions. By sharing our experience, we hope to help our peers in the industry plan for and successfully execute their own network transformations.
eBay’s production network consists of large data center sites plus smaller point of presence sites, all connected by our global backbone. At the top of each data center network, there is a layer of four aggregation routers that we call AR routers. ARs are typically shared across two or more data center fabrics. The ARs are responsible for implementing routing policy and filtering between data centers and fabrics so that our network functions as designed. At a high level, one set of AR routers that serve a data center looks like this:
Over the course of time through organic growth, we arrived at a point where we had various vendors and models across the data centers. We also had two different major versions of our Border Gateway Protocol (BGP) policies in use at the same time. We made the decision to refresh the older sites and bring all of the AR routers up to our newest standards for hardware, software, and configuration across the board. This initiative would make performance and policy behavior more consistent and predictable for this critical layer of our network as we continue to drive toward our goal of a completely automated network.
We agreed on our definition of done early to ensure that all further actions would be in support of these goals. These goals included:
- No outage time for our business
- All data center ARs will be one vendor and code version
- All data center ARs will operate with our newest routing policies
- Physical design will be standard
  - Number of links
  - Connectivity pattern
  - Speed of links
After establishing your definition of done, it is a good practice to spend some time surfacing challenges that you expect to face. Early identification of challenges gives you the most leverage on them. We discussed these items early to afford ourselves the greatest number of options to handle them:
- Some applications cannot be easily moved for a maintenance window
- Data center traffic would never be at zero during the work
- Some older data center environments still ran Open Shortest Path First (OSPF), so we needed to find a way to integrate them with our current BGP-only design
- Due to the number of configuration items, scripting would be required to keep things consistent and standard
- Due to the length of time the maintenances would require, we could only do one per night
- Our engineers could not do maintenances on back-to-back nights due to fatigue issues
- We would need to be able to run at peak load with any combination of four ARs from the set of eight total ARs (four old ones plus the four new ones)
- Some old and new configuration pieces could interact in unexpected ways when used together
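The peak-load requirement in the list above lends itself to an exhaustive check, because there are only 70 ways to choose four ARs from eight. The following is a minimal sketch of such a check, not a description of our tooling. The router names, per-router capacities, and peak load figure are all hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical capacities in Gbps; real values would come from a link inventory.
CAPACITY_GBPS = {
    "old-ar1": 400, "old-ar2": 400, "old-ar3": 400, "old-ar4": 400,
    "new-ar1": 800, "new-ar2": 800, "new-ar3": 800, "new-ar4": 800,
}
PEAK_LOAD_GBPS = 1500  # hypothetical peak demand through the AR layer


def failing_combinations(capacity, peak, size=4):
    """Return every set of `size` routers whose combined capacity is below `peak`."""
    return [
        combo
        for combo in combinations(sorted(capacity), size)
        if sum(capacity[name] for name in combo) < peak
    ]


all_combos = list(combinations(sorted(CAPACITY_GBPS), 4))
shortfalls = failing_combinations(CAPACITY_GBPS, PEAK_LOAD_GBPS)
print(f"{len(all_combos)} combinations checked, {len(shortfalls)} below peak")
```

Any combination returned by `failing_combinations` would mark a mix of old and new devices that should not be left in service at peak.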
Working on the engines while the plane is flying takes some planning
Armed with our definition of done and our list of expected challenges, our team spent considerable time to work out our execution plan. Each maintenance window would remove one old AR and turn up a new one in its place. Visually, it looks fairly easy, but the devil is in the details, as they say.
Fortunately, we had some extra rack space, so we racked and powered up the new devices ahead of time. Our Site Services team also pre-ran the hundreds of connections that would be required to complete this migration.
In the best case, all traffic can be removed from a data center so that maintenance activities can be performed with less risk. At eBay, we perform regular maintenances, so we know what can be easily moved and what cannot. Executing our standard data center exit plan, we would be able to drain around 70% of the traffic moving through the ARs. In addition to the data center exit, we would start work in the evening at a time when site traffic is lower for us. Our plan was to perform the data center exit followed by BGP and OSPF metric changes for the AR that we wanted to migrate. This costing out step would reduce the remaining production traffic for one AR to zero to avoid causing any business impacts during the maintenance.
We estimated that the work to move each AR router would take about six to eight hours, so doing all four ARs that serve a single data center at one time would simply not be possible. This constraint meant that each group of four AR routers would require four maintenance windows spread over two weeks. Because of the timeline, we had to be able to interoperate with a mixture of old and new ARs in parallel under full production load conditions indefinitely if needed.
In order to make this work with both old and new BGP policies, smaller pieces of non-standard shim policy would be put in place to allow old and new to coexist until all four devices for that data center were moved. After all four were migrated, these shims could be removed later.
BGP communities and the associated policy matches provided some challenges for us as well. We discovered that two different communities from the old and new designs would have unintended matches in policies once we had old and new devices running in parallel. As an example, think about 65500:200 and 65500:2000. A match that is not specific enough in a policy somewhere could match both of these and take an action that was not expected. What was working in each version of our BGP design separately could not be counted on to work correctly when both policies were run at the same time. We resolved this by going through all policies with a fine-toothed comb, looking for these types of issues and correcting them or working around them. Having a lab environment in which to model your network and test the interaction of components is essential.
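The 65500:200 and 65500:2000 overlap can be reproduced with ordinary regular expressions. This sketch uses Python's `re` module purely as an illustration. Community-match syntax differs between router vendors, and the two patterns below are assumptions rather than excerpts from our policies.

```python
import re

communities = ["65500:200", "65500:2000"]

loose = re.compile(r"65500:200")      # unanchored: also matches as a substring
strict = re.compile(r"^65500:200$")   # anchored: matches only the whole community

loose_hits = [c for c in communities if loose.search(c)]
strict_hits = [c for c in communities if strict.search(c)]

print("loose pattern matches: ", loose_hits)
print("strict pattern matches:", strict_hits)
```

The unanchored pattern matches both communities, while the anchored one matches only 65500:200. A policy audit of this kind can be run against every community used by both designs before the devices ever run in parallel.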
OSPF is found in some of the oldest parts of our infrastructure, and the easiest way for us to solve this issue was to simply add a small OSPF configuration section to our newer design to support this feature until we can decommission that older environment or move the OSPF area border further “down” in the network.
The best-laid plans of mice and men often go awry.
- Robert Burns
As already mentioned, some sites used one particular vendor and other sites used a different vendor. Over the multi-week course of the overall plan for each data center AR set, we knew that we would have to have these vendors operating in parallel for at least a few weeks without any unexpected issues with load balancing, policy actions, or best path selection. In addition, the configuration can look quite different between vendors, and you need to make absolutely sure that your intent is accurately implemented in the devices, especially when they are going to try to operate in parallel. Any subtle preference in selection of a route, a difference in tie breakers, or default behaviors could result in disastrous consequences.
We built as much of our target configuration as possible using our normal build tools. This helped us generate boilerplate items such as the IP addresses (IPv4 and IPv6) needed and our standard set of BGP policies. The shim pieces were handcrafted and incorporated into the maintenance plan. Many portions of the maintenance plan itself were created using Jinja templates. With hundreds of links to move, this approach was worth the extra time to set it up.
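To give a sense of what templating a maintenance plan looks like, here is a minimal sketch that renders one cable-move step per link. It uses `string.Template` from the Python standard library as a stand-in for Jinja so that it runs without extra packages. The step wording, device names, and port names are hypothetical and are not taken from our actual plans.

```python
from string import Template

# Hypothetical step wording; real plan steps would carry more detail.
STEP = Template("Move cable from $old_device $old_port to $new_device $new_port")

# Hypothetical link inventory; in practice this would come from build tooling.
links = [
    {"old_device": "old-ar1", "old_port": "port-1",
     "new_device": "new-ar1", "new_port": "port-1"},
    {"old_device": "old-ar1", "old_port": "port-2",
     "new_device": "new-ar1", "new_port": "port-2"},
]


def render_plan(link_rows):
    """Render one numbered plan step per link, in inventory order."""
    return [
        f"{number}. {STEP.substitute(row)}"
        for number, row in enumerate(link_rows, start=1)
    ]


plan = render_plan(links)
print("\n".join(plan))
```

Generating the steps from the same inventory that drives the configuration keeps the plan and the devices from drifting apart, which matters more as the number of links grows.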
After a number of planning meetings spread out over more than a month, the plan had been documented to the last detail. Wiki pages, Git repos with generated configurations, and peer-reviewed maintenance plans were all set.
At 7:00 p.m., our Site Engineering Center (SEC) initiated a data center exit for the site that we were about to work on. As expected, this took about 70% of the traffic off all of the links that we were working on. We allowed a few minutes to make sure everything was stable, and then we applied our cost-out metrics to the AR device that was going under the knife that night. We use this procedure on a regular basis, so it worked as expected and traffic was soon at nearly zero on the device. Again, we waited a few minutes to make sure that everything was stable before proceeding.
Pausing for just a few minutes between major steps is an important best practice, because there is often some delay between an event on the network and issues being displayed in monitoring tools. If the plan is rushed, it becomes difficult to tell which step caused an impact. In addition, if there are steps that are slow to undo, such as powering down a router, you don’t want to execute those steps until you are sure that you are ready and the previous step didn’t have an unexpected effect.
The next major step in our plan called for dozens of cables to be moved from the old AR to the new AR. Our standard practice for link commissioning consists of three layers: the physical layer is connected and tested, then routing protocols are brought up in a costed-out state, and finally, when we are ready, the data plane is set in motion with the final metrics and policies.
After making the new connections, we used an internally developed tool to quickly check every link for correct physical operation. All links checked out except for one, which had been cabled to the wrong port because we had set up our auto build tools incorrectly. Our Site Services team also checked some links during the connection process with a light meter and got some strange results with very high power readings that did not make sense. After some investigation, it was determined that the light meter was incorrectly reading the four lanes of the PSM4 optics that we were using. Sorting through these issues cost us about thirty minutes of extra time, bringing the total for this step to a little over two hours.
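Our internal link-checking tool is not shown here, but one check of this kind can be sketched as a comparison between the planned cabling and the neighbors actually seen on each port. Everything in this example is hypothetical: the port and device names, the data layout, and the assumption that observed neighbors have already been collected (for example, from a neighbor discovery protocol).

```python
def find_miscabled(planned, observed):
    """Return {local_port: (expected, actual)} for ports whose neighbor differs from plan."""
    return {
        port: (expected, observed.get(port))
        for port, expected in planned.items()
        if observed.get(port) != expected
    }


# Hypothetical plan: local port -> (remote device, remote port).
planned = {
    "port-1": ("bb1", "port-7"),
    "port-2": ("bb2", "port-7"),
    "port-3": ("fabric-a", "port-1"),
}
# Hypothetical observation with one cable landed on the wrong remote port.
observed = {
    "port-1": ("bb1", "port-7"),
    "port-2": ("bb2", "port-8"),
    "port-3": ("fabric-a", "port-1"),
}

mismatches = find_miscabled(planned, observed)
print(mismatches)
```

A port missing from `observed` is reported with an actual value of `None`, so a link that never came up is flagged alongside one that was cabled incorrectly.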
Next, we set about loading all of the required routing policies and pieces of shim configuration with everything still in a costed-out state. This would allow us to verify all policies and routing exchanges without risking any impact to the business.
Our patient was effectively still under anesthesia while we patched him up and checked for issues - or so we thought.
We turned up one out of four links to our backbone from the new AR router to test all of the routing in a costed-out state. Internal BGP (iBGP) did not establish with one of our older OSPF fabric environments downstream from the AR, because the loopback of the new AR was not reachable in OSPF from that device. This was unexpected.
We had an export policy configured in the OSPF protocol on the new AR router with a specific match for the loopback address as well as a metric of 20 set in the policy that was designed to prevent the device from attracting any traffic at this stage. However, when we actually arrived at this step in the maintenance, the loopback was not in OSPF at all for reasons unknown to us at that time.
Our team decided that adding an OSPF "passive" statement to the loopback interface on the new AR would fix this. The "passive" statement for the loopback interface was added, and iBGP came up. At this point, a large flow of traffic that was not expected saturated the upstream link from the AR to the backbone, and we were now impacting the business! Our SEC quickly notified us that something was not right, and we rolled back a step to stop the impact.
Looking at the scrollback on the engineers’ terminals, it was discovered that the OSPF metric from the underlying fabric environment to the new AR was 1. We had expected a higher metric due to the overload feature being active as well as the metric of 20 being set in the OSPF export policy. The diagram below shows the network state when we had the impact.
At the moment of impact, we had only one out of four links from the new AR to the backbone ready to take traffic. The numbers on the AR routers show the OSPF cost that was being sent downstream. The cost of zero being sent by the new AR1 was the best Interior Gateway Protocol (IGP) cost within the OSPF area, because it was the lowest. This affected the BGP best path selection algorithm and attracted all of the traffic in that part of the network to the new AR1. The red links show where the unexpected traffic flow occurred.
In hindsight, we learned that on this particular vendor, when the OSPF overload feature is set, the export policy is not applied. That is the reason that the loopback was originally not visible—it wasn’t being exported. When we forced the loopback into OSPF with the "passive" statement, it did not receive our custom metric because it wasn’t getting into OSPF via the export policy. The default metric was used instead, resulting in a metric value of 1 from the other device.
Pressing forward, we reversed the order of two steps so that all four of the AR to backbone links would be up before trying again with the step that caused the impact. We also hardcoded the OSPF metric for the loopback interface to a value of 1000 to ensure that this device would not become a preferred path again.
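The effect of the metric on path selection can be shown with a small sketch. It models only the IGP-metric tie-breaker and assumes that every earlier BGP attribute is equal across the candidate paths. The metric of 1 toward the new AR and the hardcoded value of 1000 come from the events described above. The metric of 10 toward the old ARs is a hypothetical placeholder, and the additional link cost on top of 1000 is ignored for simplicity.

```python
def preferred_next_hops(igp_metric):
    """Return the next hops with the lowest IGP metric (all other attributes assumed equal)."""
    best = min(igp_metric.values())
    return sorted(hop for hop, metric in igp_metric.items() if metric == best)


# Metric 1 toward new-ar1 is from the incident; 10 for the old ARs is hypothetical.
during_impact = {"new-ar1": 1, "old-ar2": 10, "old-ar3": 10, "old-ar4": 10}
# After hardcoding the loopback metric, the new AR is the least attractive next hop.
after_fix = {**during_impact, "new-ar1": 1000}

print("during impact:", preferred_next_hops(during_impact))
print("after fix:    ", preferred_next_hops(after_fix))
```

With the low metric, the new AR is the only preferred next hop, which is consistent with all of the traffic in that part of the network shifting to it. With the metric at 1000, the remaining ARs share the load again.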
The remainder of the work went according to plan from this point forward and allowed the maintenance to be completed, albeit very late into the night.
Improving and iterating
Regrouping after the first device was migrated, our team evaluated what could have gone better. We identified several areas for improvement, including the order and technical content of the steps, testing procedures, and separation of roles during the work. The team made these changes after the first migration:
- We explicitly set an OSPF metric for the loopback outside of any export policies.
- We would keep the status of all four links to our backbone layer synchronized, so that we would have sufficient bandwidth even if the traffic started flowing unexpectedly.
- Cabling would not be tested with the light meter, and we would rely on our software tools to check cables once they were plugged into the devices instead.
- Crew Resource Management: Borrowing techniques originally developed to handle airliner incidents, one engineer focused on performing the actual maintenance procedure, one engineer worked with Site Services, and one engineer handled communication with other teams such as our SEC.
With these improvements in place, we successfully executed several more of these migrations. The total time elapsed for each maintenance window dropped from over ten hours for the first one to about four hours as we polished our procedure. Best of all, there was no impact to the business in any of the subsequent maintenance windows.
Experience is the best teacher
With the site-wide upgrade of our AR layer complete, the overall performance and reliability of our site network has never been better. We completed many hours of high-risk work with nearly flawless execution. All team members pitched in to help in some way, and as a result, this trial by ordeal was overcome. We continue to refine our craft in the pursuit of quality. Most importantly, we learned from this experience and adapted so that we could ultimately be successful.