Many data center networks are configured as a single Layer 2 network, with the switches configured into a spanning tree to avoid routing loops (see the SEATTLE paper for some more background on how this works). However, in practice, simply finding a Layer 2 path to the appropriate destination host is not enough: data centers also include various “middleboxes” (e.g. firewalls, load balancers, and SSL offloaders) that intercept and sometimes modify Ethernet frames in transit. To ensure that traffic is routed to middleboxes as appropriate, network administrators typically modify Layer 2 path selection. This has a number of problems:
- Correctness: How can the administrator be assured that frames always traverse the appropriate sequence of middleboxes, even in the face of network churn and switch failures?
- Flexibility: Changing the middlebox routing policy should be easy to do.
- Efficiency: Frames should not be routed to middleboxes that aren’t interested in those frames.
To address these problems, “A Policy-Aware Switching Layer for Data Centers” proposes that Layer 2 routing policies be specified in a declarative language at a centralized administrative location. These policies are compiled into then compiled into a set of rules that are installed onto “policy switches” (pswitches). Pswitches make Layer 2 routing decisions by consulting a policy table, rather than the “destination MAC => switch port” mapping table used in traditional Layer 2 switches. New policies can be deployed by simply modifying the centralized configuration, which then disseminates them to the appropriate pswitches.
Policies take the form “At location X, send frames like Y to Z”, where Z is a sequence of middlebox types. Writing policies in terms of middlebox types, rather than talking about individual middlebox instances, makes it easier to reason about the correctness and how to handle failures without violating the policy. Policies are compiled into sets of rules, which take the form “For frames that arrive from hop X like Y, send to hop Z” — this should admit an efficient implementation.
To avoid sending frames to middleboxes that are not interested in them, middleboxes are removed from the physical network data path—instead, pswitches explicitly forward frames to middleboxes as required. The authors argue that because data center interconnects are relatively low latency, this doesn’t have a major performance impact.
Making this idea work in practical networks requires addressing a lot of details. For example:
- To support load balancers, the right-hand side of a policy can be a list of middlebox types. The pswitch uses a hash partitioning scheme that is careful to send all the packets belonging to either direction of a flow to the same destination.
- To allow this work to be deployed without modifying the entire network, policy-based routing is implemented by encapsulating frames: to implement policy-based routing, a frame is encapsulated in another frame with a destination MAC of the next hop (middlebox or server) required by the policy. Frames are decapsulated before being delivered to middleboxes.
I like this paper. There are only two obvious concerns I can see: the deployment difficulties in any attempt to redesign how switches operate, and the performance concerns of (1) rule table lookups rather than normal Layer 2 routing (2) the additional latency of traversing middleboxes that are off the network data path. Compared with a traditional software-based router, their software-based pswitch implementation achieves 82% throughput and adds an additional 16% latency. Middleboxes deployed with this architecture suffer a lot more overhead: 40% throughput and twice the latency. Its hard to say how the performance of a hardware-based pswitch implementation would compare with regular hardware-based routers.
The authors argue that the additional hop required to traverse a middlebox is acceptable, because data center networks are typically low latency. This seems backward to me: data center applications might actually want to take advantage of low latency links (e.g. see RAMCloud), and won’t like the network infrastructure adding additional latency overhead without a compelling reason.
The idea of taking middleboxes off the data path seems orthogonal to the paper’s main contribution (which is the idea of a high-level specification of Layer 2 routing invariants). It might be nice to separate these two ideas more distinctly.