Why Testing and Debugging Networks is so Difficult
by Nikhil Handigol
The network lies at the heart of a modern enterprise’s ability to perform its daily business and operations. When a network outage occurs, whether from a policy misconfiguration or a device failure, business grinds to a halt. Almost every week, it seems, we read a new headline about a Fortune 500 company suffering the catastrophic consequences of a network outage. These incidents are costly, causing revenue loss and damaging corporate reputation and customer loyalty. In the most extreme cases, outages have triggered both a company bankruptcy and a CEO’s dismissal.
Yet, despite the costs and frequency of outages, operating a network remains a manual, error-prone process. In an oft-cited study by Gartner analysts Ronni Colville and George Spafford, the authors note that “80 percent of network outages are caused by people and process issues, with more than 50 percent of those outages caused by change configuration issues.” The reality is that a company’s network team may be one or two changes away from causing a severe outage, even when the network has hardware redundancy. Hiring more people won’t solve the problem either.
A primary reason for this fragility is that networks are inherently hard to test. In the world of software development, an abundance of testing frameworks and continuous integration servers help to ensure that code is correct, while an abundance of troubleshooting and debugging tools help to resolve problems when they appear. In networking, there simply isn’t a modern and comprehensive toolset for testing the correctness of multi-vendor device configurations and policies.
To start, the scale and complexity of today’s networks are simply daunting. Not counting servers, the network of a Fortune 500 company typically comprises thousands, if not tens of thousands, of hardware devices (switches, routers, load balancers, and firewalls). Add virtual switches inside virtualized servers or for containers, and this number can grow radically larger. Each device can have thousands of rules determining how to forward and process packets. The emergent interactions of this enormous amount of distributed state define network behavior, yielding a degree of complexity that no human can grasp, let alone test and troubleshoot. Furthermore, this complexity has historically exceeded what silicon-based systems can handle.
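To make the idea of emergent behavior concrete, here is a minimal Python sketch, not a description of any vendor's product. It assumes a toy network in which each device holds a handful of prefix-based forwarding rules and picks the longest matching prefix. Every device name, prefix, and function name (`TABLES`, `next_hop`, `trace`) is invented for illustration. Each rule looks reasonable on its own device, yet the default routes on `core1` and `core2` combine into a forwarding loop for any destination that neither core has a more specific rule for.

```python
import ipaddress

# Hypothetical forwarding tables: device name -> list of (prefix, next hop).
# A next hop of "deliver" means the device hands the packet to an attached host.
# All names and prefixes are made up for this example.
TABLES = {
    "edge1": [("10.0.0.0/8", "core1")],
    "core1": [("10.0.0.0/8", "core2"), ("10.1.0.0/16", "edge2")],
    "core2": [("10.0.0.0/8", "core1"), ("10.2.0.0/16", "edge3")],
    "edge2": [("10.1.0.0/16", "deliver")],
    "edge3": [("10.2.0.0/16", "deliver")],
}


def next_hop(device, dst):
    """Return the next hop chosen by longest-prefix match, or None if no rule matches."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, hop in TABLES.get(device, []):
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


def trace(start, dst):
    """Follow a packet hop by hop and classify the outcome.

    Returns (outcome, path) where outcome is "delivered", "dropped", or "loop".
    """
    path, seen = [start], {start}
    device = start
    while True:
        hop = next_hop(device, dst)
        if hop is None:
            return "dropped", path
        if hop == "deliver":
            return "delivered", path
        path.append(hop)
        if hop in seen:
            # The packet has returned to a device it already visited.
            return "loop", path
        seen.add(hop)
        device = hop


if __name__ == "__main__":
    for dst in ("10.1.5.5", "10.3.0.1", "192.168.1.1"):
        print(dst, trace("edge1", dst))
```

In this toy model, traffic from `edge1` to 10.1.5.5 is delivered through `core1` and `edge2`, traffic to 10.3.0.1 bounces between `core1` and `core2`, and traffic to 192.168.1.1 is dropped at `edge1`. No single table reveals the loop; it only appears when the tables are evaluated together, which is the difficulty at the scale of thousands of devices with thousands of rules each.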
So today, when an outage occurs, network teams turn to simple tools like ping, traceroute, or NetFlow in an attempt to map the symptoms back to the actual root cause. The most common approach is to log into devices box by box, inspect state via the CLI, attempt to infer behavior, and then mentally join it all together to divine the root cause of the problem. Such a manual approach is not only time-consuming but also infeasible as networks grow in size and complexity. Most importantly, such a method of troubleshooting is inherently reactive. The operator learns of a problem only after the symptoms appear, by which time the customer is already experiencing the damage.
What about SDN? Does it eliminate this problem?
It is true that SDN in its purest form can bring some order to the chaos by providing a single logically centralized source of policy and configuration. It can also provide a clear abstraction and standardized representation of network configuration and state. However, instead of humans making changes at human timescales, SDN and network automation enable changes to occur at software speeds.
With potentially thousands of changes every hour, what happens when the network goes down? How do operators troubleshoot problems or outages in a constantly evolving network where they’re not triggering most changes? In a modern stack with multiple new vendors, which specific component is at fault? Did the network even make the mistake, or did it receive the wrong input from the operator?
With SDN and automation, the need to make sense of complexity has not gone away. The ability to make more frequent changes simply amplifies the need for new tools and approaches to provide visibility and help with troubleshooting.
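One way to picture the kind of visibility this calls for is a behavioral diff: compare what the network does before and after a change, rather than comparing configuration text. The Python sketch below is an illustration under simplifying assumptions, not an actual tool. Snapshots are plain dictionaries mapping each device to a next hop per named destination, and the names `outcome`, `behavior_diff`, `BEFORE`, `AFTER`, and `PROBES` are all hypothetical.

```python
# Hypothetical snapshots: device -> {destination name -> next hop}.
# A next hop of "local" means the destination is attached to that device.


def outcome(snapshot, src, dst, max_hops=16):
    """Describe what happens to traffic from src to dst in one snapshot."""
    device = src
    for _ in range(max_hops):
        hop = snapshot.get(device, {}).get(dst)
        if hop is None:
            return "dropped at " + device
        if hop == "local":
            return "delivered by " + device
        device = hop
    return "loop or path longer than max_hops"


def behavior_diff(before, after, probes):
    """Return the probes whose outcome differs between two snapshots."""
    changes = {}
    for src, dst in probes:
        old, new = outcome(before, src, dst), outcome(after, src, dst)
        if old != new:
            changes[(src, dst)] = (old, new)
    return changes


BEFORE = {
    "leaf1": {"web": "spine1", "db": "spine1"},
    "spine1": {"web": "leaf2", "db": "leaf3"},
    "leaf2": {"web": "local"},
    "leaf3": {"db": "local"},
}

# Simulate an automated change that removes spine1's rule for "db".
AFTER = {**BEFORE, "spine1": {"web": "leaf2"}}

# Flows that the operator expects to keep working.
PROBES = [("leaf1", "web"), ("leaf1", "db")]

if __name__ == "__main__":
    for flow, (old, new) in behavior_diff(BEFORE, AFTER, PROBES).items():
        print(flow, ":", old, "->", new)
```

Here the diff reports that traffic from `leaf1` to `db` went from delivered by `leaf3` to dropped at `spine1`, while the `web` flow is unaffected. Run after every automated change, a check of this shape would tie a behavioral regression to the specific change that introduced it, although a real network would need a far richer model than a dictionary of next hops.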
The problem of network assurance is business-critical for legacy as well as SDN environments. Forward Networks has taken on the monumentally ambitious challenge of making networks as testable as software. Our engineering team is staffed with eight PhDs and some of the brightest networking talent from Google, Facebook, Microsoft, Apple, and Cisco. We’ve built a game-changing solution to transform the way operators and engineers build and maintain their networks, be they legacy, SDN, or hybrid.
So how do we do this? Stop by our booth at ONUG Fall 2016 and find out.
About the author:
Nikhil Handigol is a Stanford PhD and co-founder of Forward Networks. He has a rich background in enterprise networking and SDN. Under luminary professor and Nicira co-founder Nick McKeown, Nikhil and his Forward Networks co-founders from Stanford built some of the foundational components of SDN. His research focused on using SDN principles for systematic network troubleshooting (NetSight), flexible network emulation (Mininet), and smart load-balancing (Aster*x). He also traveled the world as Director of Course Content and Instruction at SDN Academy teaching the fundamentals of SDN to the engineering teams of some of the largest enterprises.
Ronni J. Colville and George Spafford, “Top Seven Considerations for Configuration Management for Virtual and Cloud Infrastructure,” Gartner RAS Core Research Note, 2010.