Unlocking Hyperscale with Multiple Tables

Even after four years of work from the Open Networking Foundation (ONF), three substantial updates to the specification, and countless training sessions, there still exists significant confusion about the OpenFlow protocol and SDN -- particularly about their performance and scale. Fortunately, as more OpenFlow-based commercial products ship and customer experience grows, strong signals are emerging from the noise about the actual scale and resiliency. This blog series is my attempt to document these signals, describe our hardened, productized “modern” OpenFlow, and explain how it differs from the initial academic attempts that I and others were involved with at Stanford University so many years ago.

As with many of my blogs, this information gets very technical very quickly. The good news is that if you’re looking for a networking solution that “just works”, you don’t have to concern yourself with any of these technical details.  Practically speaking, OpenFlow is “assembly language for networking” and that’s definitely not for everyone.  But if you are like me and want to know every last detail of how networks are implemented, then please read on.

In this three-part series, I plan to address three big changes in what I’m calling “modern” OpenFlow:

  • Part 1: Discuss how OpenFlow/SDN scales to even the largest networks
  • Part 2: Describe the difference between proactive and reactive management and why the hybrid approach is most flexible and resilient
  • Part 3: Explain how modern OpenFlow/SDN is implemented on existing merchant silicon and bare metal switches running a thin OS, with scale and resiliency

Despite the Name, OpenFlow Is Not Flow-Based

The single biggest misunderstanding about OpenFlow is that it is “flow-based”.  That is, similar to ATM’s dynamic virtual circuits of years past, people mistakenly believe that each pair of end-points in the network requires unique forwarding state.  This misunderstanding snowballs into wild concerns about scale: flow-based networks require O(N²) state in the underlying forwarding memory (that is, double the number of hosts N, and you need to quadruple the required memory), which would be a non-starter for most modern 1M+ node data center networks.
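To make that arithmetic concrete, here is a quick back-of-the-envelope sketch (purely illustrative Python, not anything a controller actually runs) comparing the per-host-pair state people worry about against ordinary destination-only state for contrast:

    # Back-of-the-envelope comparison of forwarding-state growth:
    # per-host-pair ("flow-based") rules vs. destination-only rules.
    # Illustrative only; real tables also hold multicast, VLAN, etc. entries.

    def flow_based_rules(num_hosts: int) -> int:
        # One rule per ordered (source, destination) pair -> O(N^2)
        return num_hosts * (num_hosts - 1)

    def destination_only_rules(num_hosts: int) -> int:
        # One rule per destination, regardless of sender -> O(N)
        return num_hosts

    for n in (1_000, 2_000, 1_000_000):
        print(f"{n:>9,} hosts: {flow_based_rules(n):>17,} pair rules vs "
              f"{destination_only_rules(n):>9,} destination rules")

At 1,000 hosts that is already roughly a million pair rules versus a thousand destination rules, and doubling the host count quadruples only the former.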

Fortunately, this has never been true: OpenFlow forwarding rules are closer to regular expressions than flows.  In other words, while a controller can create a specific flow-like forwarding rule, e.g., “the path from VM X to database server Y on TCP port 3306 ingresses on port 1 and egresses on port 3”, it is not a hard requirement.  In practice, modern OpenFlow-based products use wildcarded rules--“the path from ALL to Y ingresses on ANY port and egresses port 3”--and have clever ways (e.g., via tagging, CIDR prefixes, or multiple tables--below) of aggregating many rules into one: “the path from ALL to 128.8.0.0/16 egresses port 5”, or “the path from X to any end-point in the set of { Y1, Y2, Y3,…} egresses port 4”.  The result is that modern OpenFlow solutions can use standard L2-like destination-only rules (“ALL → Y”), L3-like aggregation rules (“ALL → IP/prefix”), as well as flow-based rules (matching five-tuple fields and more).

In other words, OpenFlow can scale just as well as traditional networking protocols (L2 learning, OSPF, BGP, etc.) because it creates exactly the same forwarding rules.  At the same time, OpenFlow controllers can add value over traditional control plane protocols by mixing different types of rules in the same network, e.g., by mixing L2 and L3 rules to prevent hair-pin routing or mixing a small number of flow-based rules on top of standard L3 CIDR-like rules to forward long-lived elephant flows around ECMP hot spots.
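For readers who want something more concrete than prose, here is a small illustrative sketch (plain Python with made-up field names, not any real controller API or the OpenFlow wire format) that expresses the three kinds of rules as match/action pairs, treating any field left out of a match as a wildcard:

    import ipaddress

    # Illustrative rule set: any field omitted from "match" is a wildcard.
    rules = [
        # Flow-like rule: a specific 5-tuple path pinned to an egress port.
        {"match": {"in_port": 1, "ip_src": "10.0.0.5",
                   "ip_dst": "10.0.1.9", "tcp_dst": 3306},
         "action": {"output": 3}},

        # L2-like destination-only rule: ALL -> Y, any ingress port.
        {"match": {"eth_dst": "aa:bb:cc:dd:ee:01"},
         "action": {"output": 3}},

        # L3-like aggregated rule: ALL -> CIDR prefix.
        {"match": {"ip_dst": "128.8.0.0/16"},
         "action": {"output": 5}},
    ]

    def field_matches(pkt_val, rule_val):
        # CIDR values match by prefix; everything else matches exactly.
        if isinstance(rule_val, str) and "/" in rule_val:
            return (pkt_val is not None and
                    ipaddress.ip_address(pkt_val) in ipaddress.ip_network(rule_val))
        return pkt_val == rule_val

    def lookup(packet, rules):
        # Return the action of the first rule whose match fields all agree.
        for rule in rules:
            if all(field_matches(packet.get(k), v)
                   for k, v in rule["match"].items()):
                return rule["action"]
        return None

    # A packet to any host in 128.8.0.0/16 hits the single aggregated rule.
    print(lookup({"eth_dst": "aa:bb:cc:dd:ee:99", "ip_dst": "128.8.42.7"}, rules))

Real rules also carry priorities and counters, which this sketch ignores; the point is simply that one wildcarded entry can cover what would otherwise be thousands of per-pair entries.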

Aggregate More Rules Over Bigger Tables

Like any modern integrated circuit, the chips used in switches and routers have complex trade-offs between performance, scale, and flexibility.  In practice, these chips have many forwarding tables that vary dramatically in terms of the number of rules they support versus the types and flexibility of those rules (see the figure, replicated from the OF-DPA datasheet [1]).  On one side of the scale/flexibility spectrum, a chip typically has a highly specialized and large 100K+ entry table (the “Bridging Flow Table” in the figure) for L2 that only matches on destination MAC address.  On the other side of the spectrum, chips typically support a very small ~2K-entry ACL table that matches on practically any field in the packet headers.  Additional tables exist along this continuum for matching L3 packets, multicast, and tunneled packets, as well as a litany of VLAN operations.  Further, tables are visited in a certain order according to the chip’s processing pipeline, and the results of one table’s match can affect a subsequent table’s lookup, allowing smart controllers to aggregate rules efficiently.  For example, a single packet might successfully match a rule in an L2 table, yielding an egress forwarding port action, only to have that action overridden by the ACL policy table, e.g., dropped because the packet is disallowed by the local security policy.
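As a rough mental model (a minimal Python sketch of the idea, not the actual OF-DPA pipeline or its table layout), the example below shows how a later, more flexible table can override the decision made by an earlier, larger one:

    # Minimal two-table pipeline sketch: a large, destination-MAC bridging
    # table picks a tentative egress port, then a small, flexible ACL table
    # gets a chance to override it.  Entry counts and fields are illustrative.

    bridging_table = {
        # dst MAC            -> egress port (100K+ entries in real hardware)
        "aa:bb:cc:dd:ee:01": 3,
        "aa:bb:cc:dd:ee:02": 7,
    }

    acl_table = [
        # (match fields, action) -- only ~2K flexible entries in real hardware
        ({"eth_dst": "aa:bb:cc:dd:ee:02", "tcp_dst": 23}, {"drop": True}),
    ]

    def pipeline(packet):
        # Table 1: L2 bridging -- destination-MAC lookup sets a tentative output.
        action = {"output": bridging_table.get(packet.get("eth_dst"))}

        # Table 2: ACL -- a matching policy entry overrides the earlier result.
        for match, acl_action in acl_table:
            if all(packet.get(k) == v for k, v in match.items()):
                action = acl_action
                break
        return action

    print(pipeline({"eth_dst": "aa:bb:cc:dd:ee:01", "tcp_dst": 80}))  # {'output': 3}
    print(pipeline({"eth_dst": "aa:bb:cc:dd:ee:02", "tcp_dst": 23}))  # {'drop': True}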

So while the hardware has a number of rich, programmable tables, initial OpenFlow implementations used none of this functionality and instead exposed only a single ~2K-entry ACL table.  This design decision was made to more easily support “hybrid” OpenFlow, that is, OpenFlow mixed with legacy software, because the legacy software required exclusive control of the other, more useful tables.  The single, small table was never a limitation of the OpenFlow protocol itself or of the underlying hardware.  Modern OpenFlow implementations completely replace the legacy software stack and thus leverage all of the hardware’s underlying power, scaling up to the size of even the largest data centers.

Conclusion

The way that modern OpenFlow uses multi-table on-chip forwarding memory has improved dramatically over the initial academic implementations.  By carefully leveraging all of the forwarding tables on a chip, we are able to mix and match standard L2, L3, and L4 paradigms to produce solutions at scale that are otherwise not possible (e.g., as mentioned above, removing hair-pin routes or dodging hot spots).  In short, I hope I have put the unfounded concerns about scaling to rest and have moved the conversation to a more constructive level.  In my next article, I’ll discuss how controllers mix proactively and reactively inserted rules and how we leverage those algorithms to provide enhanced resiliency by keeping networks functional even when all the controllers in the system have died.

[1] http://www.broadcom.com/collateral/pb/OF-DPA-PB100-R.pdf

--Rob Sherwood, Big Switch CTO