“Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication” describes the so-called “TCP Incast” problem, and proposes a solution based on lowering the TCP retransmission timeout (RTO).
“TCP Incast” describes a catastrophic drop in TCP goodput (useful throughput) that occurs in the following situation:
- A client issues a barrier-synchronized request in parallel to a collection of servers. That is, the client waits for a response from all the servers before submitting its next request. This occurs, for example, when a client reads a block that is striped over multiple servers in a distributed file system.
- Each server takes approximately the same amount of time to return a response. Because all servers respond at about the same time, this is likely to overflow the switch buffers on the bottleneck link, causing packet loss.
- The default lower bound on the TCP retransmission timeout (RTO) is typically 200 msec. That is several orders of magnitude greater than the round-trip time (RTT) on a data center network, which is typically < 1 msec, so servers that suffer packet loss must wait a relatively long time before retransmitting dropped packets. Because of barrier synchronization, the client cannot submit another useful request in the meantime, which leads to low network utilization.
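The scale of this mismatch is worth working through with back-of-the-envelope arithmetic. The numbers below are illustrative (the RTT and RTO floor come from the discussion above; the loss-free block-transfer time is an assumption I've made up for the sketch):

```python
# Back-of-the-envelope cost of a single RTO-driven stall in a
# barrier-synchronized read. All numbers are illustrative assumptions.

RTT = 100e-6        # 100 usec round-trip time, typical of a data center
RTO_MIN = 200e-3    # 200 msec default RTO lower bound
BLOCK_TIME = 1e-3   # assumed time to transfer one striped block with no loss

# If even one server in the barrier loses a packet, the whole request
# stalls until that server's RTO fires.
stall_ratio = RTO_MIN / RTT
goodput_fraction = BLOCK_TIME / (BLOCK_TIME + RTO_MIN)

print(f"RTO_min is {stall_ratio:.0f}x the RTT")
print(f"goodput drops to {goodput_fraction:.1%} of the loss-free rate")
```

Under these assumptions a single timeout leaves the link idle for 2000 RTTs, and goodput collapses to well under one percent of the loss-free rate, which is the shape of the collapse the paper reports.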
Several related factors contribute to this problem:
- Commodity switches often have small buffers, which lowers the threshold for the number of servers involved before TCP Incast occurs.
- The problem can be exacerbated by synchronization between the RTO timers on multiple servers: if many servers drop packets simultaneously, they will all time out and retry at about the same time, which can cause further packet loss. Because each timeout doubles the RTO until an ACK is received, continued synchronization among senders leads to exponential growth in the RTO.
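The compounding effect of synchronized, exponentially backed-off timers can be sketched with a toy model (this is my own illustration, not the paper's simulator):

```python
# Toy model of synchronized exponential backoff: if every sender times out
# together and the synchronized retries keep colliding, the shared RTO
# doubles on every round. Parameters are illustrative.

def synchronized_rto(rto_min, rounds):
    """Return the RTO after each of `rounds` consecutive synchronized losses."""
    rto = rto_min
    history = [rto]
    for _ in range(rounds):
        rto *= 2          # standard TCP backoff: double the RTO on each timeout
        history.append(rto)
    return history

# With a 200 msec floor, five synchronized loss rounds already mean a
# 6.4-second RTO; with a 1 msec floor the same backoff reaches only 32 msec.
print(synchronized_rto(0.2, 5))
print(synchronized_rto(0.001, 5))
```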
The Proposed Solution
The authors propose a straightforward fix: rather than forcing TCP stacks to wait at least 200 msec before retransmitting a packet, reduce the RTO lower bound to 1 msec or less. Their empirical and simulation-based studies show that, for their workload (parallel reads from a striped distributed file system), RTO lower bounds of 200 usec to 1 msec avoid the TCP Incast phenomenon. They describe how such fine-grained timers can be implemented in practice using the “high-resolution timers” feature of the Linux kernel.
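Note that only the lower bound changes; the RTO estimate itself is still computed the standard way (RFC 6298: smoothed RTT plus four times the RTT variance). Here is a user-space sketch of that computation with a 1 msec floor; the paper does the equivalent inside the Linux kernel, and this class is purely illustrative:

```python
# RFC 6298-style RTO estimation with a configurable lower bound.
# A user-space sketch: the paper implements the equivalent in-kernel,
# backed by high-resolution timers.

class RtoEstimator:
    def __init__(self, rto_min=0.001, alpha=1 / 8, beta=1 / 4):
        self.rto_min = rto_min   # 1 msec floor instead of the 200 msec default
        self.alpha, self.beta = alpha, beta
        self.srtt = None         # smoothed RTT estimate
        self.rttvar = None       # RTT variance estimate

    def sample(self, rtt):
        """Feed one RTT measurement; return the updated RTO."""
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar \
                + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt
        return max(self.rto_min, self.srtt + 4 * self.rttvar)

est = RtoEstimator(rto_min=0.001)
rto = est.sample(100e-6)  # one 100 usec RTT sample
print(f"RTO = {rto * 1e3:.1f} msec")
```

With a 100 usec RTT the raw estimate (srtt + 4 * rttvar = 300 usec) is below even the 1 msec floor, so the floor still dominates; the point is that 1 msec is a 200x improvement over the 200 msec default.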
While a fine-grained RTO lower bound is sufficient for current networks, the authors also simulated a next-generation network (10 Gbps with a 20 usec baseline RTT) and found that even very low RTO values did not prevent the problem at scale (> 1024 servers): synchronization between RTO timers caused repeated packet loss and exponential growth of the RTO. The authors found that adding randomization to the RTO avoids this phenomenon, although they note that in a non-simulation setting, scheduling variance may be large enough that explicit randomization is unnecessary.
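The desynchronizing effect of a randomized RTO can be seen in a toy model. The jitter formula below (RTO scaled by a random factor in [1, 1.5]) is my own illustrative choice, not necessarily the exact form the authors use:

```python
import random

# Toy model: N senders back off after a shared loss event. Without jitter
# they all retransmit at the same instant and collide again; with a
# randomized RTO their retransmissions are spread out in time, giving the
# bottleneck queue a chance to drain between them.

def retransmit_times(n_senders, rto_min, jitter, seed=0):
    """Times at which each sender retransmits after a simultaneous loss."""
    rng = random.Random(seed)
    return sorted(rto_min * (1 + rng.uniform(0, jitter))
                  for _ in range(n_senders))

no_jitter = retransmit_times(8, 0.001, 0.0)
jittered = retransmit_times(8, 0.001, 0.5)

print("spread without jitter:", max(no_jitter) - min(no_jitter))
print("spread with jitter:   ", max(jittered) - min(jittered))
```

Without jitter the spread is exactly zero (every retry collides); with jitter the retries are spread over up to half an RTO.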
Two practical concerns are addressed: is a fine-grained RTO safe to use in a WAN environment, and how does a fine-grained RTO interact with unmodified TCP stacks that use delayed ACKs? To answer the first question, the authors found that a fine-grained RTO had little to no impact on WAN bulk data transfers.
The second question is more interesting. “Delayed ACKs” are a TCP mechanism for reducing ACK traffic: after receiving a packet, a receiver waits up to 40 msec (by default on Linux) before sending an ACK. The idea is to wait for another packet to arrive from the sender, so that a single ACK can cover both packets. The authors found that this interacts poorly with a fine-grained RTO: the 40 msec delayed ACK threshold means that a sender using a 1 msec RTO will likely retransmit the packet before the delayed ACK timer fires. (When the receiver sees the retransmission, it immediately sends an ACK, which at least avoids the full 40 msec delay.) However, the authors found that the impact of this effect is smaller than the goodput collapse caused by TCP Incast, so fine-grained RTOs are still a net win. They recommend disabling delayed ACKs in a data center environment, because latency is typically more important than reducing ACK traffic.
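The race between the two timers can be laid out as a small timeline calculation (the timer values come from the discussion above; the sequencing is a simplified model of my own):

```python
# Simplified timeline of a fine-grained RTO meeting a delayed-ACK receiver.
# The receiver withholds the ACK for up to 40 msec; a 1 msec RTO fires long
# before that, so the sender retransmits spuriously. On seeing the duplicate
# segment, the receiver ACKs immediately.

RTO = 0.001          # 1 msec fine-grained RTO
DELAYED_ACK = 0.040  # 40 msec Linux delayed-ACK timer
RTT = 100e-6         # illustrative round-trip time

t_retransmit = RTO              # sender's retransmission timer fires first
t_ack = t_retransmit + RTT      # the duplicate triggers an immediate ACK

assert t_retransmit < DELAYED_ACK   # the spurious retransmit wins the race
# Cost: ~1.1 msec and one wasted segment, versus waiting the full 40 msec
# for the delayed ACK -- and far cheaper than an Incast goodput collapse.
print(f"ACK arrives at {t_ack * 1e3:.1f} msec, not {DELAYED_ACK * 1e3:.0f} msec")
```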
I think the mismatch between typical data center RTTs and the lower bound on the RTO is a more general problem than just TCP Incast. In any network where the RTT is measured in microseconds and the RTO is measured in hundreds of milliseconds, any packet loss will lead to relatively huge variances in response time, which is obviously undesirable. I don’t think this is surprising; “TCP Incast” merely describes a particular workload in which achieving good utilization of the network requires avoiding these latency spikes, because of barrier synchronization. As such, it seems that a fine-grained RTO is appropriate for any network with a sub-millisecond RTT, whether or not a particular application happens to fit the criteria for TCP Incast.
Conversely, I wonder how realistic the criteria for TCP Incast are: how likely is it that applications can’t find any useful work to do while waiting for the RTO timer to fire? For example, in a parallel file system, a client can very easily overlap I/O for another block while waiting for the first I/O to complete. Real-world applications might also experience significantly more jitter than simulation or an artificial microbenchmark, which would act to offset the impact of TCP Incast.
I wonder if ECN might be another candidate approach to avoiding this problem: if congestion is signaled to end hosts explicitly (by marking packets rather than dropping them), there is no need to wait for the RTO timer to fire. I remain puzzled by the relatively slow deployment of ECN in practice: it seems like it ought to be a pure win.