Tag Archives: ethernet


One of the challenges faced in networking research and education is the difficulty of accurately recreating a network environment. Production routers are implemented as hardware devices, but designing and fabricating a new hardware design for research or teaching is very expensive. Modular software routers like Click enable research and education on routing in software, but don’t allow experimentation with hardware and physical-layer issues.

NetFPGA aims to rectify this by providing an easily programmable hardware platform for network devices (NICs, switches, routers, etc.). Users interact with NetFPGA remotely, uploading programs that configure the FPGAs on the device appropriately, and then observing the results. The initial version of NetFPGA didn’t have a CPU: instead, the software controlling the device ran on another computer, and accessed hardware registers remotely (using special Ethernet packets decoded by the NetFPGA board). The second version of NetFPGA has an optional hardware CPU.


This seems like a pretty incontrovertible “good idea.” One potential question is how the performance of an FPGA-based router compares to that of realistic production hardware — I’d suspect that NetFPGA routers occupy a middle ground between software-only routers (easy to develop, poor performance) and production routers (difficult to program and expensive to fabricate, but good performance).


Filed under Paper Summaries

“A Policy-Aware Switching Layer for Data Centers”

Many data center networks are configured as a single Layer 2 network, with the switches configured into a spanning tree to avoid routing loops (see the SEATTLE paper for some more background on how this works). However, in practice, simply finding a Layer 2 path to the appropriate destination host is not enough: data centers also include various “middleboxes” (e.g. firewalls, load balancers, and SSL offloaders) that intercept and sometimes modify Ethernet frames in transit. To ensure that traffic is routed to middleboxes as appropriate, network administrators typically modify Layer 2 path selection. This has a number of problems:

  • Correctness: How can the administrator be assured that frames always traverse the appropriate sequence of middleboxes, even in the face of network churn and switch failures?
  • Flexibility: Changing the middlebox routing policy should be easy to do.
  • Efficiency: Frames should not be routed to middleboxes that aren’t interested in those frames.

To address these problems, “A Policy-Aware Switching Layer for Data Centers” proposes that Layer 2 routing policies be specified in a declarative language at a centralized administrative location. These policies are then compiled into a set of rules that are installed onto “policy switches” (pswitches). Pswitches make Layer 2 routing decisions by consulting a policy table, rather than the “destination MAC => switch port” mapping table used in traditional Layer 2 switches. New policies can be deployed by simply modifying the centralized configuration, which then disseminates them to the appropriate pswitches.

Policies take the form “At location X, send frames like Y to Z”, where Z is a sequence of middlebox types. Writing policies in terms of middlebox types, rather than individual middlebox instances, makes it easier to reason about correctness and about how to handle failures without violating the policy. Policies are compiled into sets of rules, which take the form “For frames that arrive from hop X like Y, send to hop Z” — this should admit an efficient implementation.
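As a purely illustrative sketch of what such a rule table might look like, the following Python models rules of the form “for frames that arrive from hop X and match Y, send to hop Z” as exact-match table entries; the class and field names are my invention, not the paper’s:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    prev_hop: str       # where the frame arrived from
    frame_pattern: str  # e.g. a 5-tuple predicate, simplified to a string here
    next_hop: str       # middlebox type or final destination

class PSwitch:
    def __init__(self, rules):
        # index rules by (prev_hop, frame_pattern) for constant-time lookup
        self.table = {(r.prev_hop, r.frame_pattern): r.next_hop for r in rules}

    def forward(self, prev_hop, frame_pattern):
        # fall back to ordinary Layer 2 forwarding if no policy rule applies
        return self.table.get((prev_hop, frame_pattern), "default-l2")

# a policy "send web traffic through firewall, then load balancer"
# compiled into per-hop rules:
rules = [
    Rule("core", "tcp:80", "firewall"),
    Rule("firewall", "tcp:80", "load_balancer"),
    Rule("load_balancer", "tcp:80", "web-server"),
]
sw = PSwitch(rules)
assert sw.forward("core", "tcp:80") == "firewall"
assert sw.forward("firewall", "tcp:80") == "load_balancer"
```

The point of compiling down to per-hop rules is that each pswitch only needs a local table lookup per frame, not the full policy.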

To avoid sending frames to middleboxes that are not interested in them, middleboxes are removed from the physical network data path—instead, pswitches explicitly forward frames to middleboxes as required. The authors argue that because data center interconnects are relatively low latency, this doesn’t have a major performance impact.


Making this idea work in practical networks requires addressing a lot of details. For example:

  • To support load balancers, the right-hand side of a policy can be a list of middlebox types. The pswitch uses a hash partitioning scheme that is careful to send all packets belonging to both directions of a flow to the same middlebox instance.
  • To allow this work to be deployed without modifying the entire network, policy-based routing is implemented via encapsulation: a frame is wrapped in another frame whose destination MAC is the next hop (middlebox or server) required by the policy. Frames are decapsulated before being delivered to middleboxes.
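The bidirectional-flow requirement in the first bullet can be met by hashing the flow’s endpoints in a direction-independent way. This is a hypothetical sketch, not the paper’s exact scheme:

```python
import hashlib

def flow_key(src_ip, src_port, dst_ip, dst_port):
    # sort the two endpoints so (A -> B) and (B -> A) yield the same key
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{a}|{b}"

def pick_instance(src_ip, src_port, dst_ip, dst_port, instances):
    # hash the direction-independent key onto one of the middlebox instances
    digest = hashlib.sha256(
        flow_key(src_ip, src_port, dst_ip, dst_port).encode()).hexdigest()
    return instances[int(digest, 16) % len(instances)]

lbs = ["lb1", "lb2", "lb3"]
fwd = pick_instance("10.0.0.1", 12345, "10.0.0.2", 80, lbs)
rev = pick_instance("10.0.0.2", 80, "10.0.0.1", 12345, lbs)
assert fwd == rev  # both directions of the flow hit the same instance
```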


I like this paper. There are only two obvious concerns I can see: the deployment difficulties in any attempt to redesign how switches operate, and the performance costs of (1) rule table lookups rather than normal Layer 2 forwarding, and (2) the additional latency of traversing middleboxes that are off the network data path. Compared with a traditional software-based router, their software-based pswitch implementation achieves 82% throughput and adds an additional 16% latency. Middleboxes deployed with this architecture suffer a lot more overhead: 40% throughput and twice the latency. It’s hard to say how the performance of a hardware-based pswitch implementation would compare with regular hardware-based routers.

The authors argue that the additional hop required to traverse a middlebox is acceptable, because data center networks are typically low latency. This seems backward to me: data center applications might actually want to take advantage of low latency links (e.g. see RAMCloud), and won’t like the network infrastructure adding additional latency overhead without a compelling reason.

The idea of taking middleboxes off the data path seems orthogonal to the paper’s main contribution (which is the idea of a high-level specification of Layer 2 routing invariants). It might be nice to separate these two ideas more distinctly.



“PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric”

“PortLand” begins with essentially the same motivation as SEATTLE: Ethernet is easy to manage but hard to scale to the size required by modern data centers, and IP is scalable but hard to manage. Given the choice between making Ethernet more scalable or making IP more manageable, both papers choose to tackle the former problem.

To make Ethernet more scalable, PortLand takes a classic approach: simplify the problem by making assumptions. Observing that modern data center networks are typically organized into a “fat tree” or multi-rooted hierarchy, the authors use this structure to make PortLand simpler and more efficient than a Layer 2 network that must handle an arbitrary topology. In PortLand, there are core, aggregation, and edge switches; the last of these are directly connected to end hosts. A “Location Discovery Protocol” (LDP) is employed by each switch to automatically determine its position in this hierarchy and communicate that information to the other switches. Because LDP assumes that the network is a multi-rooted tree, it is quite a simple protocol.

Rather than using MAC addresses, end hosts are identified with Pseudo MAC addresses (PMACs). While a MAC merely identifies a host, a PMAC is both an identifier and a locator: that is, it encodes the location of the end host in the multi-rooted tree. This is the key to efficient routing: rather than needing to maintain forwarding tables for each MAC/PMAC in the network, switches can instead forward packets based on the PMAC prefix, as in IP. This significantly reduces the size of router forwarding tables. PMACs are translated back to MACs by switches before delivering packets to end hosts or sending them outside the PortLand fabric.
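To illustrate, here is a toy PMAC encoding in Python. The pod.position.port.vmid layout follows the paper’s scheme, but the specific bit widths (16.8.8.16) and helper names below should be treated as illustrative:

```python
def make_pmac(pod, position, port, vmid):
    # pack the host's location into a single 48-bit pseudo-MAC:
    # 16 bits of pod, 8 of position, 8 of port, 16 of VM id
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def pod_of(pmac):
    # a core switch can forward on the pod prefix alone, so its
    # forwarding table needs one entry per pod, not one per host
    return pmac >> 32

pmac = make_pmac(pod=5, position=2, port=1, vmid=7)
assert pod_of(pmac) == 5
```

This is exactly the trick that keeps forwarding tables small: switches match on a prefix of the PMAC, as in IP, instead of storing a flat entry per host.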

PortLand employs a centralized fabric manager to resolve ARP queries, and to simplify multicast and fault tolerance. ARP is handled by having edge switches forward the ARP request to the fabric manager. If the fabric manager does not have the PMAC for the requested IP, it broadcasts the ARP request down the fat tree, and caches the result. IP-to-PMAC mappings are eagerly sent to the fabric manager as PMACs are assigned, so broadcasts should be relatively rare. When a VM is migrated to a new machine, a gratuitous ARP is sent to the fabric manager with the new IP-to-PMAC mapping. The fabric manager also sends an invalidation message for the old IP-to-PMAC mapping to the VM’s old switch. The fabric manager is made highly-available using asynchronous replication.
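The fabric manager’s ARP path amounts to a cache with an eager-registration fast path and a broadcast fallback. A minimal sketch, with hypothetical method names:

```python
class FabricManager:
    def __init__(self, fallback_broadcast):
        self.arp_cache = {}               # IP -> PMAC soft state
        self.broadcast = fallback_broadcast

    def register(self, ip, pmac):
        # switches push mappings eagerly as PMACs are assigned
        self.arp_cache[ip] = pmac

    def migrate(self, ip, new_pmac):
        # a gratuitous ARP after VM migration updates the mapping
        self.arp_cache[ip] = new_pmac

    def resolve(self, ip):
        if ip in self.arp_cache:
            return self.arp_cache[ip]     # common case: no broadcast needed
        pmac = self.broadcast(ip)         # rare: flood ARP down the fat tree
        if pmac is not None:
            self.arp_cache[ip] = pmac     # cache the broadcast's result
        return pmac

fm = FabricManager(fallback_broadcast=lambda ip: None)
fm.register("10.0.1.5", 0x00010200000007)
assert fm.resolve("10.0.1.5") == 0x00010200000007
fm.migrate("10.0.1.5", 0x00020100000007)
assert fm.resolve("10.0.1.5") == 0x00020100000007
```

Because the cache is only soft state, losing the fabric manager degrades performance (more broadcasts) rather than correctness.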

Fault tolerance is simplified by the fabric manager. Each switch sends a keepalive (LDP) message to its neighbors every 10ms; if a keepalive is not heard from a switch for 50ms, the switch is assumed to be failed, and the fabric manager is contacted. The FM updates its record of switch liveness, and then informs any switches affected by the failure. These switches recompute their forwarding tables based on the new network topology.
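The failure-detection logic amounts to tracking the last keepalive heard from each neighbor and reporting anyone silent past the timeout. A toy version of the 10ms/50ms scheme described above, with illustrative names:

```python
KEEPALIVE_INTERVAL_MS = 10  # each switch sends an LDP keepalive this often
FAILURE_TIMEOUT_MS = 50     # silence for this long implies failure

class NeighborMonitor:
    def __init__(self):
        self.last_heard = {}  # neighbor -> timestamp of last keepalive (ms)

    def on_keepalive(self, neighbor, now_ms):
        self.last_heard[neighbor] = now_ms

    def failed_neighbors(self, now_ms):
        # any neighbor silent for >= 50ms gets reported to the fabric manager
        return [n for n, t in self.last_heard.items()
                if now_ms - t >= FAILURE_TIMEOUT_MS]

m = NeighborMonitor()
m.on_keepalive("agg-switch-3", now_ms=0)
m.on_keepalive("agg-switch-4", now_ms=40)
assert m.failed_neighbors(now_ms=55) == ["agg-switch-3"]
```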


I thought the evaluation section was a little disappointing. The paper makes some disparaging comments about other approaches to solving this problem, such as SEATTLE and TRILL. However, the evaluation only examines the performance of PortLand, and doesn’t compare it to these alternatives. It is hard to say how important the concerns about SEATTLE and TRILL raised by the paper are, given the lack of empirical data.

The paper doesn’t detail its strategy for handling failures of the fabric manager. Since the FM doesn’t contain hard state, this is presumably not too complicated, but one wonders if simultaneous failures of the FM and one or more switches would significantly increase convergence time. Similarly, FM failure and failover is not addressed in the evaluation.

The requirement for a separate control network simply for communicating with the fabric manager is also unfortunate — it introduces a significant administrative headache, and seems hard to justify economically. If the fabric manager and the switches communicate over the data network, the paper doesn’t address how link or switch failures that impact connectivity to the fabric manager will be handled.



“Floodless in SEATTLE”

This paper describes SEATTLE, a network architecture that tries to combine the simplicity and manageability of Ethernet with the scalability of IP. Their basic approach is to locate all the situations in which Ethernet, ARP, and DHCP use flooding or broadcast, and replace them with more scalable protocols based on DHTs with consistent hashing and unicast messaging.

Ethernet is simple to manage because MAC addresses are simply identifiers: they do not encode a location (unlike IP addresses, which encode location through their hierarchical structure). This makes Ethernet “plug-and-play” and simplifies network reconfiguration. The disadvantage is that Ethernet wasn’t designed for large networks (“broadcast domains”), and hence relies on flooding and broadcasting to learn information about the network:

  • Each Ethernet bridge holds a forwarding table mapping MAC addresses to physical addresses. If a destination MAC address is seen that is not in the table, the bridge floods the network (sending the packet on all outgoing ports). Furthermore, the size of the forwarding table is linear in the size of the broadcast domain.
  • If an Ethernet broadcast domain is composed of multiple bridges, the bridges are arranged into a spanning tree. This means that packets “routed” through the tree don’t necessarily follow the shortest path; they also can’t choose alternate paths to improve scalability or reliability.
  • ARP is used to resolve IP addresses into MAC addresses. It does this by broadcasting, which scales poorly as the size of the broadcast domain increases.
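To make the first bullet concrete, here is a toy learning bridge: an unknown destination MAC forces the frame out every other port (flooding), and the forwarding table grows with the number of hosts in the broadcast domain (names are illustrative):

```python
class LearningBridge:
    def __init__(self, ports):
        self.ports = ports
        self.table = {}  # MAC -> port; grows linearly with the broadcast domain

    def receive(self, src_mac, dst_mac, in_port):
        self.table[src_mac] = in_port  # learn the sender's location
        if dst_mac in self.table:
            return [self.table[dst_mac]]                 # known: unicast
        return [p for p in self.ports if p != in_port]   # unknown: flood

b = LearningBridge(ports=[1, 2, 3, 4])
assert b.receive("mac-A", "mac-B", in_port=1) == [2, 3, 4]  # flood
b.receive("mac-B", "mac-A", in_port=3)                      # B learned on port 3
assert b.receive("mac-A", "mac-B", in_port=1) == [3]        # now unicast
```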

The poor scalability of Ethernet becomes increasingly important as data center networks grow to hundreds of thousands of hosts. IP solves many of these problems: for example, it uses shortest-path routing, and allows smaller routing tables (based on IP prefix and subnetting, not a flat namespace). However, IP is much harder to administer; for example, hosts must be arranged into hierarchical networks, and support for mobile hosts is limited (especially if continuity of service for mobile nodes is desired).

Therefore, SEATTLE tries to eliminate the scalability problems of Ethernet while retaining its administrative advantages. In SEATTLE, information about switch topology is replicated to every switch using a link-state protocol, which also enables shortest-path routing. Replicating switch topology is sensible, because it changes much less frequently than the locations of individual end hosts. SEATTLE also defines a network-level DHT (that is, each switch in the network is also a DHT node). This DHT is used to resolve the MAC address associated with an IP address, and the physical address/location associated with a MAC address. SEATTLE modifies ARP so that an ARP request can be satisfied without broadcasting (by doing a lookup in the DHT); it also extends ARP to return both the MAC address associated with the IP and the physical address associated with that MAC address, which avoids the need for a second DHT query.
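A minimal consistent-hashing sketch in the spirit of SEATTLE’s directory: the resolver for a key is the switch whose hash is closest to the key’s hash, so switch churn only remaps nearby keys. (SEATTLE uses the closest successor in a circular hash space; the absolute-distance metric below is a simplification, and all names are illustrative.)

```python
import hashlib

def h(s):
    # deterministic hash into a large integer space
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def resolver(key, switches):
    # the switch whose hash is numerically closest to H(key) stores
    # the (key -> location) mapping, so lookups need no broadcast
    return min(switches, key=lambda sw: abs(h(sw) - h(key)))

switches = ["sw-a", "sw-b", "sw-c", "sw-d"]
key = "mac:00:1a:2b:3c:4d:5e"
owner = resolver(key, switches)
assert owner in switches

# removing some *other* switch leaves this key's resolver unchanged —
# the property that makes consistent hashing churn-friendly
victim = next(s for s in switches if s != owner)
remaining = [s for s in switches if s != victim]
assert resolver(key, remaining) == owner
```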

Switches aggressively cache DHT lookups; the paper argues that because this caching scheme is “reactive” and traffic-driven, and each host typically communicates with only a small set of other hosts (perhaps a debatable assumption), this caching approach requires that nodes maintain much less state than a “proactive” distribution scheme. Of course, the difficulty with caching is invalidation, which SEATTLE must do in order to support host mobility.

Rather than supporting network-wide broadcasts, SEATTLE allows administrators to define “groups”, which are essentially virtual broadcast domains. Because this scheme is layered on top of the DHT architecture, group membership is flexible. SEATTLE is evaluated with both simulations and a prototype implementation.


I really liked this paper: it simultaneously explains the existing architecture, critiques it, and sketches a clean-slate redesign while remaining coherent. I like the proposed clean-slate design, and I buy the motivation.

One potential problem is locality: if the resolver for an IP address or MAC address is located far from the clients querying that data, the queries would be relatively expensive, especially in a WAN environment. The paper’s proposed solution for this problem seems like a bandaid: they suggest using a multi-level DHT, but they don’t describe how the levels would be configured. Presumably this would be left to network administrators, which would give up much of SEATTLE’s ease-of-management advantage in the first place. Perhaps there is DHT technology that automatically clusters data items close to the sources of queries for that data?

