Thursday, March 17, 2011

Random Early Mea Culpa

Long, long ago I was an ASIC designer. I worked mostly on devices for ATM networks. Try not to judge too harshly, I was young and back then people said ATM was a good idea.

In the early 1990s there was a new concept for how to manage congestion in an IP network: Random Early Discard. Its basic premise is that TCP detects congestion via packet loss. If you wait until the switch buffers are completely full, you end up dropping a bunch of packets before TCP can respond. With RED the switch begins deliberately dropping packets before the queues are completely full, providing an early indication of a problem and triggering TCP to slow down more gracefully.

As described in the paper proposing it, the hardware should choose a random packet already within its queue to drop. As ASIC designers, that seemed ludicrous.

  1. We'd already stored that packet, and found resources to hold it. Why spend all those resources and then just throw it away?
  2. Hardware at that time often used FIFOs. We couldn't drop a packet and immediately reclaim its buffering. We could only drop it when it finally exited the FIFO, some time in the future. Madness!

Figure: graph of drop probability from 0.0 to 1.0 as a function of queue depth from 0% to 100%.

So instead I came up with a drop probability at ingress. As the queue depth increased, the ASIC would begin dropping packets with increasing probability. The external behavior would match the requirements by dropping packets as the queue filled, thought I. It would also better align with the properties of a FIFO, thought I.
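A minimal sketch of that ingress scheme, in Python rather than hardware. The linear ramp, the 50% starting point, and the function names are assumptions made for illustration; the original design's exact curve is not given here.

```python
import random

def ingress_drop_probability(depth, capacity, start_fraction=0.5):
    """Probability of dropping an arriving packet at the given queue depth.

    Assumed shape: no drops until the queue is start_fraction full, then a
    linear ramp up to 1.0 when the queue is completely full.
    """
    fill = depth / capacity
    if fill <= start_fraction:
        return 0.0
    if fill >= 1.0:
        return 1.0
    return (fill - start_fraction) / (1.0 - start_fraction)

def admit(depth, capacity, rng=random):
    """Decide at ingress whether an arriving packet enters the FIFO."""
    return rng.random() >= ingress_drop_probability(depth, capacity)
```

Note that the decision looks only at the queue depth and the arriving packet. Nothing already stored in the FIFO is ever a candidate for dropping, which is the root of the problems described next.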

Unfortunately this only superficially matched the desired behavior, in that it did drop packets before the queues became completely full. It took several years to fully understand how badly I'd misunderstood the idea.


 
Propagation Delay

The first issue is with the amount of time for the indication of a problem to reach the entity which could do something about it. The sending TCP will realize there is a problem when it times out on receiving an ACK. Dropping a packet at ingress to the FIFO delays the indication to the sender. Had I dropped a packet somewhere in the queue, its timer would be further along and the indication of a problem would come sooner.


 
Burstiness

However, this wasn't the biggest mistake. A more serious problem was which packet would be dropped. TCP flows tend to be bursty: a host gets a chunk of data to send, and it sends as much as its current transmission window allows. When congestion occurs in a switch, it is usually not because the overall level of traffic on the network has increased; it's most often because a small number of flows are sending large bursts at the same time. To ameliorate it, you need to slow down those particular flows.

ASIC buffers are designed with bursty behavior in mind. Estimating the burst size is straightforward: you can guess at the round trip time based on whether it's a LAN or WAN, and you know the bit rate. The ASIC queues are sized to ensure they can absorb one or more bursts, with some extra padding for safety.
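As a rough sketch of that sizing arithmetic: a burst is about one round trip's worth of data at line rate. The round trip time, link speed, number of bursts, and padding factor below are illustrative assumptions, not figures from any real device.

```python
def burst_bytes(rtt_seconds, bit_rate_bps):
    """Approximate burst size: one round trip's worth of data at line rate."""
    return rtt_seconds * bit_rate_bps / 8

def queue_bytes(rtt_seconds, bit_rate_bps, bursts=2, padding=0.25):
    """Queue size that absorbs `bursts` bursts plus a safety margin.

    Both `bursts` and `padding` are hypothetical defaults for illustration.
    """
    return burst_bytes(rtt_seconds, bit_rate_bps) * bursts * (1 + padding)

# Assumed LAN case: 1 ms round trip on a 1 Gbit/s link.
lan_burst = burst_bytes(0.001, 1_000_000_000)   # about 125,000 bytes
lan_queue = queue_bytes(0.001, 1_000_000_000)   # about 312,500 bytes
```

Plugging in a WAN round trip of tens of milliseconds makes the same queue orders of magnitude larger.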

Figure: illustration of a queue occupied mostly by one flow in the first 80%.

Unfortunately this means that as the buffer fills, it is all but guaranteed to have absorbed the burst(s) which caused the congestion. The packets which arrive later are innocent, and are not occupying the majority of queue space. In the illustration above, the red flow clearly occupies most of the queue but has finished its burst. Had packets been dropped from within the queue, the offending flow would have suffered proportionally. By dropping packets only at ingress, the flows which suffer are those which haven't yet finished their bursts. It will almost certainly punish the wrong flow. It blames the victims of congestion, not the perpetrators.
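The difference can be made concrete with a small sketch. The queue contents and arrival mix below are invented for illustration: the red flow fills 80% of the queue, as in the illustration, and the blue and green flows are hypothetical late arrivals. A uniformly random drop hits each flow in proportion to its share of whatever population the drop is chosen from.

```python
from collections import Counter

def drop_share(packets):
    """Fraction of uniformly random drops that would land on each flow."""
    counts = Counter(packets)
    total = len(packets)
    return {flow: n / total for flow, n in counts.items()}

# Hypothetical snapshot: red burst already absorbed, others still arriving.
queue = ["red"] * 80 + ["blue"] * 12 + ["green"] * 8
arrivals = ["blue"] * 6 + ["green"] * 4   # red has finished its burst

in_queue = drop_share(queue)       # drops chosen from within the queue
at_ingress = drop_share(arrivals)  # drops chosen from arriving packets

red_in_queue = in_queue["red"]               # 0.8
red_at_ingress = at_ingress.get("red", 0.0)  # 0.0
```

Under these assumptions the red flow takes 80% of the drops when packets are chosen from within the queue, and none at all when packets are dropped only at ingress.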


 
Bufferbloat

Yet even this wasn't the biggest mistake. At this point I have to include the entire networking industry, not just me personally.

Our biggest mistake was in making queue management optional, and making it scary.

Instead of describing RED as a feature to control congestion in the network, we described it as a feature which would deliberately drop your packets. I attribute this to the same attitude which made ASIC designers want to hold onto the packets which had already been stored in the buffers. We made RED sound like a dangerous thing, which you should only use if you know exactly what you're doing and also have some very special network with obscure requirements.

The result is that it is widespread practice to leave all forms of active queue management turned off, considering it risky and unnecessary. There have been some efforts to rectify this portrayal. We now define RED as Random Early Detection, to avoid using the word "discard." The industry also now offers Explicit Congestion Notification, which marks packets rather than dropping them. Nonetheless even ECN isn't widely used.

Instead of pushing queue management, the networking industry has relied on Moore's law to vastly increase the amount of buffering in switches. There is equipment with so much buffering that it is no longer described in terms of packets or bytes, but in how many seconds of traffic it can absorb. There are reports of packets on subscriber networks being delayed a full 8 seconds before being successfully delivered. We have avoided the need for queue management by never allowing the queues to fill.

This is congestion control via infinite buffering. Unfortunately there are two related problems with it:

  1. It isn't really infinite.
  2. It is addictive.

There is now so much buffering in the network that TCP's own attempts at congestion control are undermined. By the time TCP realizes there is a problem, there is a vast amount of data sitting in queues. Even if TCP reacts immediately and forcefully, it won't have an impact until the mass of packets already in the network sort themselves out. We've created a feedback loop where the control delay is enormous. Most of the time it works, but when it doesn't work the results are astonishingly bad.
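The control delay a full buffer adds is simple arithmetic: the bytes queued ahead of a packet divided by the rate at which the link drains them. A minimal sketch follows; the buffer size and link rate are assumed values chosen for illustration, not measurements from any particular device.

```python
def queue_delay_seconds(buffer_bytes, link_bps):
    """Time for a packet at the tail of a full buffer to reach the wire."""
    return buffer_bytes * 8 / link_bps

# Assumed example: 1 MB of buffering in front of a 1 Mbit/s uplink.
delay = queue_delay_seconds(1_000_000, 1_000_000)   # 8.0 seconds
```

Any congestion signal TCP sends or receives has to wait behind that same backlog, which is why the feedback loop responds so slowly.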

It is also addictive, and the patient develops a tolerance. The solution is always more buffering, to kick the can even further down the road. As traffic grows the need for doses of buffering becomes ever larger.

As an industry, we have some work to do.