“VL2: A Scalable and Flexible Data Center Network”

Like PortLand and SEATTLE, VL2 is another paper that attempts to redesign traditional Layer 2/3 network architectures for modern data center networks. Their goal is to provide the illusion that each service is connected to a single Layer 2 switch: the communication bandwidth between any two end hosts should be uniform, regardless of network topology or colocation of hosts. This enables agility: “the capacity to assign any server to any service.” VL2 also aims to provide performance isolation between services — traffic to one service should not affect the performance of other services hosted on the same network. Finally, VL2 retains compatibility with legacy applications (e.g. Layer-2 broadcast), and is built largely out of existing network technology like ECMP forwarding, link-state routing, and IP multicast: new functionality is mostly implemented by modifying end hosts, rather than network infrastructure.

Design

VL2 uses a Clos network topology: each top-of-rack switch connects to two different aggregation switches, and each aggregation switch is connected to every core switch (called an “intermediate” switch in the paper).
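
To make the wiring concrete, here is a minimal sketch (in Python; the switch counts and naming scheme are mine, not from the paper) that enumerates the links of such a Clos topology: every ToR uplinks to two aggregation switches, and every aggregation switch connects to every intermediate switch.

```python
from itertools import product

def build_clos(num_tor, num_agg, num_int):
    """Enumerate the links of a VL2-style Clos topology.

    Illustrative assumptions: each ToR uplinks to exactly two distinct
    aggregation switches (so num_agg >= 2), and every aggregation switch
    connects to every intermediate switch.
    """
    links = []

    # Each ToR connects to two different aggregation switches.
    for t in range(num_tor):
        links.append((f"tor{t}", f"agg{t % num_agg}"))
        links.append((f"tor{t}", f"agg{(t + 1) % num_agg}"))

    # Each aggregation switch connects to every intermediate switch.
    for a, i in product(range(num_agg), range(num_int)):
        links.append((f"agg{a}", f"int{i}"))

    return links

# Toy instance: 8 ToRs, 4 aggregation switches, 2 intermediate switches.
for u, v in build_clos(8, 4, 2):
    print(u, "<->", v)
```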

Services use “application-specific IP addresses” (AAs), and each service has its own (virtual) IP subnet. VL2 automatically translates AAs into “location-specific IP addresses” (LAs), which describe the current location of the AA in the network topology. AAs remain unchanged as servers change network locations (e.g. due to VM migration). The translation is done by a directory service that runs on a separate set of end hosts; ARP requests made by end hosts are intercepted (by modifying the end hosts’ network stacks) to instead query the directory service. This scheme has several similarities to the PMACs and fabric manager employed by PortLand:

  • ARP requests are trapped and converted into queries against a centralized lookup service, rather than broadcast.
  • Both systems separate names from locators: AAs vs. LAs in VL2, and MACs vs. PMACs in PortLand. Routing is done using locators, because this is more efficient: the correct next hop can be chosen by longest-prefix match, rather than remembering the appropriate next hop for every possible MAC/AA address. Applications use identifiers, because a flat identifier namespace simplifies management and means that application-level identifiers don’t encode any information about location, simplifying VM migration.
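
As a rough illustration of the lookup path described above, here is a small Python sketch of an end-host agent that resolves an AA to an LA via the directory service and caches the result; the class name, TTL, and toy in-memory directory are invented for this example, not taken from the paper.

```python
import time

class Vl2AgentCache:
    """Illustrative stand-in for the shim in an end host's network stack:
    an ARP request for an application address (AA) is trapped and resolved
    to a location address (LA) via the directory service, and the mapping
    is cached locally. The class name and TTL are invented for this sketch.
    """

    def __init__(self, directory_lookup, cache_ttl=30.0):
        self.directory_lookup = directory_lookup  # callable: AA -> LA
        self.cache_ttl = cache_ttl
        self.cache = {}                           # AA -> (LA, expiry time)

    def resolve(self, aa):
        entry = self.cache.get(aa)
        if entry and entry[1] > time.monotonic():
            return entry[0]                       # fresh cached mapping
        la = self.directory_lookup(aa)            # query the directory service
        self.cache[aa] = (la, time.monotonic() + self.cache_ttl)
        return la

# Toy "directory service": a static AA -> LA table standing in for the
# replicated directory servers.
directory = {"10.1.0.7": "20.0.3.2"}
agent = Vl2AgentCache(directory.__getitem__)
print(agent.resolve("10.1.0.7"))   # resolved via the directory, then cached
```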

To ensure uniform performance, VL2 chooses a random path through the network for each flow (valiant load balancing): traffic is bounced off a randomly chosen intermediate switch. The shortest path between an end host and an intermediate switch is always three hops (top-of-rack switch, aggregation switch, intermediate switch), and ECMP is used to choose among these equal-cost paths, which also makes tolerating switch failures easier.
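
The paper leaves the actual hashing to the switches' ECMP implementation, but the flavor of per-flow path selection can be sketched as follows; this host-side version, which hashes a flow's 5-tuple onto one of the intermediate switches, is only an illustration, and the function names are mine.

```python
import hashlib

def pick_intermediate(flow, intermediates):
    """Map a flow deterministically to one intermediate switch, roughly the
    way ECMP hashes a flow's 5-tuple to pick among equal-cost paths. In VL2
    the hashing happens inside the switches; this host-side version is only
    an illustration."""
    key = "|".join(str(field) for field in flow).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return intermediates[digest % len(intermediates)]

intermediates = ["int0", "int1", "int2", "int3"]
flow = ("10.1.0.7", "10.2.0.9", 6, 51122, 80)    # src, dst, proto, sport, dport
print(pick_intermediate(flow, intermediates))     # same flow -> same switch
```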

Comments

I thought the empirical study of data center traffic was really interesting. However, their methodology has an obvious flaw: in Sections 3.1, 3.2, and 3.3, they only looked at a single “highly utilized 1,500 node cluster in a data center that supports data mining on petabytes of data.” Many of their results are likely to be specific to that particular application: for example, the traffic bias toward “large” flows is largely due to the block size of their distributed file system, and the lack of traffic predictability (they admit) is caused in part by the random distribution of DFS chunks among servers. How many of their results would remain true if the cluster was devoted to transaction processing or web servers, for example? It seems unreasonable to draw conclusions from a single application and use those results to justify the design of a networking technology that must support a broad range of applications. I would have preferred to see a more rigorous empirical study that examined a more realistic range of data center applications. That study could easily be presented as a separate paper, which would leave room in this paper for a more detailed description of VL2 itself: I thought the paper’s description was too terse.

One thing I found confusing is their claim that “VL2 picks a different path for each flow”, and that paths are chosen randomly using VLB/ECMP. Presumably, this isn’t quite true: if two end hosts connected to the same switch want to communicate, why traverse any aggregation or intermediate switches? In fact, some applications go to considerable lengths to ensure that pairs of tasks that communicate frequently are colocated in the same rack. Hopefully VL2 doesn’t prevent such optimizations. However, assuming VL2 does not inhibit such a basic optimization, the paths between end hosts are not completely uniform, which is inconsistent with one of the primary goals of the paper. This might not have arisen in their evaluation, since their experiments seem deliberately chosen to exhibit little potential for intra-rack locality (e.g. all-to-all shuffle).

An obvious downside to implementing network-layer functionality (e.g. the directory service and “layer 2.5 shim” in VL2) at end hosts is the need to modify end host network stacks, which are numerous and heterogeneous; this would be an impediment to widespread deployment. Perhaps an interesting middle path would be to implement this functionality in the VM hypervisor, rather than in the guest operating system. This would allow the guest OS to remain unmodified and might be easier to implement, as VM hypervisors are more homogeneous. It is interesting that the PortLand work chose to modify the switch hardware, whereas VL2 pushes more functionality onto end hosts.

Their reactive cache update proposal (traffic to a stale LA is forwarded by the receiving ToR to the directory server, which sends a gratuitous ARP back to the originating node) seems to rely on modifying switch software, although they don’t say this explicitly. I also wonder what might happen if the ToR pointed to by a stale LA crashes before the VL2 agents’ cache has expired: wouldn’t this result in considerable downtime?
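
If I understand the proposal correctly, the directory-server side of this reactive path might look roughly like the following sketch; the function and field names are invented, and the gratuitous ARP is reduced to a callback for brevity.

```python
def handle_misdirected_packet(packet, directory, notify_sender):
    """Directory-server side of the reactive update described above: a ToR
    that receives traffic for an AA it no longer hosts forwards the packet
    here; the directory looks up the AA's current LA and pushes a correction
    (the paper's gratuitous ARP) back to the sender's VL2 agent so it can
    replace its stale cache entry. Names and fields are invented.
    """
    aa = packet["dst_aa"]
    current_la = directory.get(aa)
    if current_la is None:
        return                          # unknown AA: drop (or raise an alarm)
    notify_sender(packet["src_aa"], aa, current_la)

# Toy usage: the AA has moved, and the directory knows its new LA.
directory = {"10.1.0.7": "20.0.9.1"}
handle_misdirected_packet(
    {"src_aa": "10.3.0.4", "dst_aa": "10.1.0.7"},
    directory,
    lambda sender, aa, la: print(f"tell {sender}: {aa} is now at {la}"),
)
```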

Recommended Reading
