Like PortLand and SEATTLE, VL2 attempts to redesign traditional Layer 2/3 network architectures for modern data center networks. Its goal is to provide the illusion that each service is connected to a single Layer 2 switch: the communication bandwidth between any two end hosts should be uniform, regardless of network topology or colocation of hosts. This enables agility: “the capacity to assign any server to any service.” VL2 also aims to provide performance isolation between services — traffic to one service should not affect the performance of other services hosted on the same network. Finally, VL2 retains compatibility with legacy applications (e.g. Layer-2 broadcast), and is built largely out of existing network technology like ECMP forwarding, link-state routing, and IP multicast: new functionality is mostly implemented by modifying end hosts, rather than network infrastructure.
VL2 uses a Clos network topology: each top-of-rack switch connects to two different aggregation switches, and each aggregation switch is connected to every core switch (called an “intermediate” switch in the paper).
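To make the path structure concrete, here is a minimal sketch of such a Clos topology. The switch counts and wiring pattern are my own illustrative assumptions, not figures from the paper; the point is only that every ToR reaches every intermediate switch via its two aggregation uplinks.

```python
# Sketch (illustrative, not from the paper): a VL2-style Clos topology.
# Switch counts and the uplink wiring pattern are assumptions.
N_TOR = 4          # top-of-rack switches
N_AGG = 4          # aggregation switches
N_INT = 2          # intermediate (core) switches

# Each ToR connects to two distinct aggregation switches.
tor_uplinks = {t: [(2 * t) % N_AGG, (2 * t + 1) % N_AGG] for t in range(N_TOR)}
# Each aggregation switch connects to every intermediate switch.
agg_uplinks = {a: list(range(N_INT)) for a in range(N_AGG)}

def paths_up(tor):
    """All 3-hop upward paths: ToR -> aggregation -> intermediate."""
    return [(tor, agg, core)
            for agg in tor_uplinks[tor]
            for core in agg_uplinks[agg]]

# Each ToR has 2 aggregation choices x N_INT intermediates = 4 upward paths.
print(paths_up(0))   # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```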
Services use “application-specific IP addresses” (AAs), and each service has its own (virtual) IP subnet. VL2 automatically translates AAs into “location-specific IP addresses” (LAs), which describe the current location of the AA in the network topology. AAs remain unchanged as servers change network locations (e.g. due to VM migration). The translation is done by a directory service that runs on a separate set of end hosts; ARP requests made by end hosts are intercepted (by modifying the end hosts’ network stacks) to instead query the directory service. This scheme has several similarities to the PMACs and fabric manager employed by PortLand:
- ARP requests are trapped and converted into queries against a centralized lookup service, rather than broadcast.
- Both systems separate names from locators: AAs vs. LAs in VL2, and MACs vs. PMACs in PortLand. Routing is done using locators, because this is more efficient: the correct next hop can be chosen by longest-prefix match, rather than remembering the appropriate next hop for every possible MAC/AA address. Applications use identifiers, because a flat identifier namespace simplifies management and means that application-level identifiers don’t encode any information about location, simplifying VM migration.
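The name/locator split can be sketched in a few lines. All addresses and the dictionary-based directory below are hypothetical stand-ins for the real directory service; the sketch only shows that the AA survives migration while the LA changes under it.

```python
# Sketch of the AA -> LA split, with invented addresses. The VL2 agent
# intercepts ARP-style resolution and queries a directory service instead
# of broadcasting; only the directory entry changes on VM migration.
directory = {
    "10.0.0.5": "172.16.3.12",   # AA -> LA (both addresses hypothetical)
}

def resolve(aa: str) -> str:
    """End-host shim: a directory lookup replaces the broadcast ARP."""
    return directory[aa]

# VM migration: the AA is unchanged; only its locator (LA) moves.
directory["10.0.0.5"] = "172.16.7.2"
print(resolve("10.0.0.5"))  # -> "172.16.7.2"
```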
To ensure uniform performance, VL2 chooses a random path through the network for each flow. The shortest path between an end host and an intermediate switch is always three hops (top-of-rack switch, aggregate switch, intermediate switch). ECMP is used to choose among these equal-cost paths, which also makes switch failure tolerance easier.
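My reading of the VLB/ECMP scheme is that each flow hashes its 5-tuple to select one of the equal-cost paths, so packets within a flow stay in order while distinct flows spread across the fabric. The switch names and hash choice below are assumptions for illustration, not the paper's actual mechanism.

```python
# Sketch (my reading of VLB + ECMP): hash a flow's 5-tuple to pick one
# intermediate switch deterministically, so all packets of a flow follow
# the same random path while different flows spread across the Clos fabric.
import hashlib

INTERMEDIATES = ["int-0", "int-1", "int-2", "int-3"]  # hypothetical names

def pick_intermediate(src, dst, sport, dport, proto="tcp"):
    """Deterministic per-flow choice, ECMP-style, over the 5-tuple."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return INTERMEDIATES[digest[0] % len(INTERMEDIATES)]

# Same flow -> same path (no packet reordering); other flows may differ.
a = pick_intermediate("10.0.0.5", "10.0.1.9", 4321, 80)
b = pick_intermediate("10.0.0.5", "10.0.1.9", 4321, 80)
assert a == b
```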
I thought the empirical study of data center traffic was really interesting. However, their methodology has an obvious flaw: in Sections 3.1, 3.2, and 3.3, they only looked at a single “highly utilized 1,500 node cluster in a data center that supports data mining on petabytes of data.” Many of their results are likely to be specific to that particular application: for example, the traffic bias toward “large” flows is largely due to the block size of their distributed file system, and the lack of traffic predictability is (they admit) caused in part by the random distribution of DFS chunks among servers. How many of their results would remain true if the cluster were devoted to transaction processing or web servers, for example? It seems unreasonable to draw conclusions from a single application and use those results to justify the design of a networking technology that must support a broad range of applications. I would have preferred to see a more rigorous empirical study that examined a more realistic range of data center applications. That study could easily be presented as a separate paper, which would leave room in this paper for a more detailed description of VL2 itself: I thought the paper’s description was too terse.
One thing I found confusing is their claim that “VL2 picks a different path for each flow”, and that paths are chosen randomly using VLB/ECMP. Presumably, this isn’t quite true: if two end hosts connected to the same switch want to communicate, why traverse any aggregation or intermediate switches? In fact, some applications go to considerable lengths to ensure that pairs of tasks that communicate frequently are colocated in the same rack. Hopefully VL2 doesn’t prevent such optimizations. But if it doesn’t inhibit this basic optimization, then the paths among end hosts are not completely uniform, which is inconsistent with one of the primary goals of the paper. This tension might not have arisen in their evaluation, since their experiments seem deliberately chosen to exhibit little potential for intra-rack locality (e.g. all-to-all shuffle).
An obvious downside to implementing network layer functionality (e.g. the directory service and “layer 2.5 shim” in VL2) at end hosts is the need to modify end host network stacks, which are numerous and heterogeneous; this would be an impediment to widespread deployment. Perhaps an interesting middle path would be to implement this functionality in the VM hypervisor, rather than the guest operating system. This would allow the guest OS to remain unmodified and might be easier to implement, as VM hypervisors are more homogeneous. It is interesting that the PortLand work chose to modify the switch hardware, whereas VL2 pushes more functionality onto end hosts.
Their reactive cache update proposal (traffic to a stale LA is forwarded by the receiving ToR to the directory server, which sends a gratuitous ARP back to the originating node) seems to rely on modifying switch software, although they don’t say this explicitly. I also wonder what might happen if the ToR pointed to by a stale LA crashes before the VL2 agents’ cache has expired: wouldn’t this result in considerable downtime?
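As I understand the proposal, the flow is: a ToR receiving traffic for an AA it no longer hosts punts it to the directory, which corrects the sender's cached mapping. The sketch below encodes that reading; the names, dictionary-based caches, and the correction step are all my assumptions, since the paper doesn't spell out the mechanism.

```python
# Sketch of the reactive cache update as I read it (the mechanism is my
# assumption, not detailed in the paper): a ToR that sees traffic for a
# stale LA forwards it to the directory, which fixes the sender's cache
# (the "gratuitous ARP" step). All names and values are hypothetical.
directory = {"aa-1": "la-new"}          # authoritative AA -> LA mapping
sender_cache = {"aa-1": "la-stale"}     # sender still holds the old LA

def tor_receive(la: str, aa: str) -> None:
    """ToR-side check: on a stale locator, trigger a cache correction."""
    if la != directory[aa]:             # stale LA: punt to the directory
        sender_cache[aa] = directory[aa]  # gratuitous-ARP-style update

tor_receive(sender_cache["aa-1"], "aa-1")
print(sender_cache["aa-1"])  # -> "la-new"
```

Even with this path, my downtime worry stands: if the ToR holding the stale LA crashes, nothing forwards the traffic to the directory until the sender's cache entry expires.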
- Amin Vahdat, one of the authors of the PortLand paper, has a blog post comparing the PortLand work with VL2: “David vs. Goliath, UCSD vs. Microsoft?”
- James Hamilton has a blog post discussing the VL2 paper
- Mike Freedman from Princeton reviews several recent papers on data center networks, including PortLand and VL2
- I hadn’t heard the term “IP anycast” before; the Wikipedia article on anycast provides a reasonable introduction