“Understanding TCP Incast Throughput Collapse in Datacenter Networks” is another paper that discusses the “TCP incast” phenomenon. It differs from the earlier CMU study on TCP incast in both its experimental configuration and its results:
- In the CMU paper, as the number of senders increases, the per-sender fragment size decreases: this simulates a fixed-size data block striped over a variable number of servers. In the Berkeley paper, the fragment size is fixed, which means that the simulated block size grows with the number of servers (see the sketch after this list). It is debatable which of these scenarios is more realistic.
- The authors found that disabling delayed ACKs was actually harmful to performance, for both the fixed-fragment and fixed-block workloads. They argue that this is because it “overdrives” the TCP congestion window and causes unnecessary congestion.
- Using a 200 usec RTO, as suggested by the CMU paper, was found to lead to poor performance for both workloads. The authors’ explanation is similar to the delayed ACK phenomenon: the RTT (as estimated by TCP) for their network was approximately 2 msec, so a 200 usec RTO leads to spurious retransmissions, much like the retransmissions caused by congestion when delayed ACKs are disabled. The difference between these results and those of the CMU paper appears to be largely due to the difference in RTT: CMU’s baseline RTT was only 100 usec.
- The authors found more complex behavior for their fixed-fragment workload than CMU found for the variable-fragment workload: as the number of servers increases, goodput is initially high, then collapses catastrophically; it then recovers to a second peak below the initial one, and finally declines gradually.
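To make the distinction between the two workload models concrete, here is a minimal sketch; the block and fragment sizes are illustrative assumptions, not values taken from either paper.

```python
def cmu_fixed_block(block_size=256 * 1024):
    """CMU-style workload: a fixed-size block striped over n senders,
    so the per-sender fragment shrinks as n grows."""
    for n in (1, 2, 4, 8, 16, 32):
        fragment = block_size // n
        print(f"{n:3d} senders: fragment = {fragment:6d} B, block = {block_size} B")

def berkeley_fixed_fragment(fragment_size=32 * 1024):
    """Berkeley-style workload: each sender transmits a fixed fragment,
    so the effective block grows with n."""
    for n in (1, 2, 4, 8, 16, 32):
        block = fragment_size * n
        print(f"{n:3d} senders: fragment = {fragment_size:6d} B, block = {block} B")

cmu_fixed_block()
berkeley_fixed_fragment()
```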
The critical graphs in this paper (Figures 6 and 11) are very difficult to read.
I was confused as to why the authors even tried an RTO of 200 usec on a network with a 2 msec RTT. The CMU paper proposes reducing the lower bound on the RTO to 200 usec; it does not suggest a fixed RTO of 200 usec. That is, the Jacobson RTO estimation method should still be used: the RTO is the maximum of 200 usec and the smoothed RTT estimate plus four times the linear deviation. Hence, using an RTO of 200 usec on a network with a TCP-estimated RTT of 2 msec seems like an inaccurate representation of the CMU proposal.
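To make the objection concrete, here is a sketch of the standard RTO computation (Jacobson’s algorithm, as specified in RFC 6298) with the lower bound as a parameter; the RTT variance values are illustrative assumptions.

```python
def rto(srtt, rttvar, rto_min):
    """RTO = max(rto_min, SRTT + 4 * RTTVAR); all values in seconds."""
    return max(rto_min, srtt + 4 * rttvar)

# CMU-like network: SRTT ~100 usec, so the 200 usec lower bound binds.
print(rto(srtt=100e-6, rttvar=20e-6, rto_min=200e-6))  # 0.0002
# Berkeley-like network: SRTT ~2 msec, so the estimate dominates and
# lowering the bound from 200 msec to 200 usec changes nothing.
print(rto(srtt=2e-3, rttvar=0.5e-3, rto_min=200e-6))   # 0.004
```

Under the CMU proposal, the Berkeley network would still use an RTO around 4 msec; a hard-coded 200 usec RTO sits an order of magnitude below the measured RTT, so spurious retransmissions are unsurprising.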
The authors’ argument that disabling delayed ACKs results in overdriving the congestion window is convincing, but they don’t give an intuition for why this behavior occurs. Is the connection between a lack of delayed ACKs and an overdriven congestion window inherent to TCP, or specific to their experimental configuration?
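My guess at the intuition is standard ACK clocking: during slow start, TCP grows the congestion window by one segment per ACK received, so ACKing every segment (instead of every other segment) doubles the window’s growth rate. A minimal sketch of that mechanism, which is my own illustration rather than the authors’ model:

```python
def slow_start_cwnd(rtts, segments_per_ack, initial_cwnd=2):
    """Congestion window (in segments) after `rtts` rounds of slow start,
    where the receiver sends one ACK per `segments_per_ack` segments and
    each ACK grows the window by one segment."""
    cwnd = initial_cwnd
    for _ in range(rtts):
        acks = cwnd // segments_per_ack  # ACKs arriving this round
        cwnd += acks                     # one-segment growth per ACK
    return cwnd

# Without delayed ACKs, every segment is ACKed and cwnd doubles each RTT;
# with delayed ACKs (one ACK per two segments), it grows ~1.5x per RTT.
print(slow_start_cwnd(rtts=5, segments_per_ack=1))  # 64
print(slow_start_cwnd(rtts=5, segments_per_ack=2))  # 13
```

On this reading the effect is inherent to TCP’s per-ACK window growth, but whether faster growth is severe enough to cause collapse presumably depends on the buffer sizes and number of concurrent senders in the particular setup.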
- Related: an illustration of TCP incast using Riak