Wednesday, April 27, 2011

Discrimination on the Basis of Bytes

There is no length field in the Ethernet header (*). The MAC infers how long the frame is by noticing when the carrier goes away, and relies on the CRC to catch truncated or overlength frames. After removing the Ethernet header and CRC, what is left is payload. (*) It isn't quite that simple, as the optional LLC header adds a length field. As is customary when talking about Ethernet, I'll ignore the existence of LLC.

Ethernet packet with source address, destination address, ethertype, payload, and CRC
Protocol: Whatever. When I send N bytes, the other station will receive N bytes. That's all I care about.
Ethernet: Not quite. The sending device will add padding to 64 bytes.
Protocol: Whatever. The receiver removes the padding, right?
Ethernet: No.
Protocol: &%#@*!

The minimum Ethernet frame size has been a source of confusion for decades. From protocol implementors miffed at getting garbage after their data to driver writers who forget to zero out the padding and end up leaking kernel data onto the wire, it's been a hoot. So why have a minimum size?
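A small sketch of why the garbage shows up, in Python. The 14 byte header and 4 byte CRC leave 46 bytes as the smallest payload. The two-byte length prefix is a made-up stand-in for whatever length field a real protocol carries in its own header; it is an assumption for illustration, not something Ethernet provides.

```python
ETH_HEADER_LEN = 14   # destination (6) + source (6) + ethertype (2)
ETH_CRC_LEN = 4
MIN_FRAME_LEN = 64
MIN_PAYLOAD_LEN = MIN_FRAME_LEN - ETH_HEADER_LEN - ETH_CRC_LEN  # 46 bytes


def pad_payload(payload: bytes) -> bytes:
    """What the sender does: zero-fill short payloads up to the minimum."""
    if len(payload) >= MIN_PAYLOAD_LEN:
        return payload
    return payload + bytes(MIN_PAYLOAD_LEN - len(payload))


def encode(message: bytes) -> bytes:
    """Hypothetical protocol: prefix the message with its own length."""
    return len(message).to_bytes(2, "big") + message


def decode(received: bytes) -> bytes:
    """Use the protocol's length field to discard whatever follows."""
    length = int.from_bytes(received[:2], "big")
    return received[2:2 + length]


on_the_wire = pad_payload(encode(b"hi"))
print(len(on_the_wire))     # the receiver sees the padded payload
print(decode(on_the_wire))  # the protocol recovers only its own bytes
```

Zero-filling in pad_payload is the step that driver writers forget; padding with whatever happened to be in the buffer is how kernel data ends up on the wire.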


 
Wayback Machine to 1976

Ethernet network with 5 segments, 4 repeaters, and a bunch of hosts

Ethernet was designed for half duplex operation. If two stations check for a carrier at the same time they might both start transmitting, resulting in a collision. Reliably detecting collisions is important: though Ethernet has never guaranteed delivery, collisions are so common that relying on protocols to retransmit would have resulted in a miserable network.

In the IEEE standard, 10 Megabit Ethernet allows up to 5 segments separated by 4 repeaters between any two stations.

  • The minimum frame size is 512 bits.
  • A segment of 10BASE-5 can be 500 meters long, and signals propagate along the cable at 0.77c. A signal takes 2.17 microseconds to propagate 500m, which is 22 bit-times at 10 Mbps.
  • It takes some time for bits to propagate through a repeater. I have no confirmed numbers, but an estimate to synchronize between clock domains plus a little buffering to avoid underruns is 24 bit-times.
  • 5 segments x 22 bit-times + 4 repeaters x 24 bit-times = 206 bit-times.

Station A can start transmitting and its bits almost make it to station B before B checks the carrier and begins transmitting. The collision then has to travel all the way back to A while A is still sending, so we need to allow for two crossings of the network, or 412 bit-times. Adding some margin for safety and rounding up to the next power of 2 gives us the 512 bit minimum frame size.
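The arithmetic above can be checked with a few lines of Python. The figures are the ones used in this post; the 24 bit-time repeater delay in particular is an unconfirmed estimate, so treat the result as a sketch rather than a derivation from the standard.

```python
import math

BIT_RATE = 10_000_000        # 10 Mbps
SPEED_OF_LIGHT = 299_792_458 # meters per second
VELOCITY_FACTOR = 0.77       # propagation speed as a fraction of c
SEGMENT_LENGTH_M = 500       # one 10BASE-5 segment
SEGMENTS = 5
REPEATERS = 4
REPEATER_DELAY_BITS = 24     # the post's estimate, not a confirmed number


def segment_delay_bits() -> int:
    """Bit-times for a signal to traverse one segment, rounded up."""
    seconds = SEGMENT_LENGTH_M / (VELOCITY_FACTOR * SPEED_OF_LIGHT)
    return math.ceil(seconds * BIT_RATE)


def one_way_bits() -> int:
    """Bit-times to cross the largest allowed network once."""
    return SEGMENTS * segment_delay_bits() + REPEATERS * REPEATER_DELAY_BITS


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()


round_trip = 2 * one_way_bits()
min_frame_bits = next_power_of_two(round_trip)
print(segment_delay_bits(), one_way_bits(), round_trip, min_frame_bits)
print(min_frame_bits // 8, "bytes")
```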

So that's how we ended up with 64 byte minimum packets, by defining requirements for distance and working out propagation delays, right? ... Well, no.


 
The Plot Thins

Ethernet products were available prior to IEEE standardization. As originally specified it allowed for two repeaters and a maximum of three segments between any two hosts, yet still had a minimum frame size of 64 bytes. It could have gotten by with less.

As with so many things in technology, I believe the 64 byte size was chosen mainly for expediency. They knew they needed to listen for collisions, and made some calculations on propagation delay. The earliest Ethernet equipment was constructed out of discrete SSI parts and memories, and I suspect 64 bytes of buffer was available. So there we are.

IEEE defined the repeater limits to match the pre-existing minimum frame size, not the other way round.


 
Consequences

Though padding sounds wasteful of bandwidth, in practice it doesn't matter. A truism in networking is that most packets are small, but most data is carried in the large packets. Real networks are not made up of minimum sized frames at the link rate, but that is how they are tested. Read any review and you'll find the packet forwarding rate, measured using test equipment sending 64 byte frames at wire speed.

Switch fabric designers are grateful for the minimum frame size: it puts a cap on packets per second. Without the 64 byte minimum, the forwarding logic would have to be designed for 3x as many packets per second. Real networks don't operate that way, but if you can't handle it your gear gets tossed out of the lab. Really, it would be nice if the minimum frame were even a bit bigger.
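To put a number on that cap, here is a rough Python calculation of the worst-case frame rate. It assumes the usual per-frame overhead on the wire, which this post does not discuss: 8 bytes of preamble and start-of-frame delimiter, plus a 12 byte interframe gap.

```python
PREAMBLE_BYTES = 8         # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # idle time required between frames


def max_frames_per_second(link_bps: int, frame_bytes: int = 64) -> int:
    """Frames per second when frames of one size are sent back to back."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps // wire_bits


for name, bps in (("1G", 1_000_000_000), ("10G", 10_000_000_000)):
    print(name, "64 byte frames:", max_frames_per_second(bps))
    print(name, "1518 byte frames:", max_frames_per_second(bps, 1518))
```

The 64 byte figure is the one the test equipment measures against; the 1518 byte figure is closer to what a link busy with bulk data looks like.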

Wednesday, April 20, 2011

A Tale of Two MACs

If you've looked at the spec sheets for 10 Gig server NICs, you may have noticed something interesting: the feature set supported when operating at 10 Gig is often not the same as the feature set for 10/100/1000 Mbps. Usually, the 10G features are a subset of the lower speed options.

Modern NIC designs essentially always contain a CPU helping out with datapath operation, and sometimes this feature disparity is due to an inability to keep up with processing at the higher link rate. However, that isn't the entire story.


 
The Care and Feeding of Half Duplex

Let's discuss what goes into an Ethernet NIC. The block diagram shown here isn't comprehensive; it's intended to highlight only those aspects to be discussed further. We start with a DMA engine, plus buffering for sent and received packets. The MAC design is typically split into TX and RX modules for chip layout reasons. Control signals run between RX and TX to support flow control, where a received pause frame will make the transmitter cease sending packets. As Ethernet pause is frame by frame, the timing for this control signal is fairly relaxed. NIC ASICs also generally integrate the PHY to reduce cost, but 10G copper PHYs are new enough that this is not yet always done.

Ethernet NIC showing MAC, packet buffering, and DMA

You'll note that the TX and RX MACs are further subdivided, with a red line running from the middle of RX to the middle of TX. This is used for half duplex operation. While transmitting half duplex, the MAC compares what it sees on the wire to what it is transmitting. When the received bits don't match the sent, it means another station is transmitting at the same time and they have collided. Both MACs cease transmitting and back off.

Further, there are two switches in the middle of the red line. While the station is transmitting with the received signal fed to the TX MAC, it is important that the RX MAC not process the data. It isn't a packet: the rx counters should not be incremented and the payload should not be handed to the software as a received frame. The RX MAC is disconnected until the transmission finishes, then resumes listening for packets.
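The collision handling described above can be sketched in a few lines of Python. The post says only that the MACs back off; the sketch assumes the standard CSMA/CD scheme, truncated binary exponential backoff, with the delay counted in slot times of 512 bit-times. The function and constant names are made up for illustration.

```python
import random

SLOT_TIME_BITS = 512  # one slot time, in bit-times
BACKOFF_LIMIT = 10    # the exponent stops growing after 10 collisions
ATTEMPT_LIMIT = 16    # the frame is dropped after 16 failed attempts


def collided(sent_bits: bytes, seen_bits: bytes) -> bool:
    """The TX MAC compares what it sent with what it sees on the wire."""
    return sent_bits != seen_bits


def backoff_slots(collision_count: int, rng=random) -> int:
    """Pick how many slot times to wait after the Nth collision in a row."""
    if collision_count >= ATTEMPT_LIMIT:
        raise RuntimeError("excessive collisions, frame dropped")
    exponent = min(collision_count, BACKOFF_LIMIT)
    return rng.randrange(2 ** exponent)  # uniform in [0, 2**exponent - 1]


if collided(b"\xaa\x55", b"\xaa\x5f"):
    wait_bits = backoff_slots(1) * SLOT_TIME_BITS
    print("collision, waiting", wait_bits, "bit-times")
```

Because each station picks its delay at random, two stations which collided once are unlikely to keep colliding on every retry.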

Support for Gigabit half duplex comes with additional complexity. For reasons which would take too long to describe here, half duplex at gigabit speed requires the MAC to implement frame bursting. The MAC transmits multiple frames without dropping the carrier, to ensure that collisions can be detected. Though this isn't terribly difficult, it is yet another bit of complexity which has to be tolerated to support a feature which hardly anyone actually uses.


 
A Tale of Two MACs

Half duplex was the only option for Ethernet networks until just slightly before 100 Mbps Ethernet debuted. For the most part the transition to switched networks running full duplex happened during the 100 Mbps era. By the time Gigabit Ethernet debuted, full duplex operation was the norm with half duplex used by an ever diminishing sliver of the market. Though Gigabit Ethernet defines a half duplex mode, it is rarely used and a number of early gigabit products didn't work properly in half duplex mode.

Ethernet NIC showing two MACs, one for 10G and one for 10/100/1000

10 Gig Ethernet does not have a half duplex mode. It always operates full duplex.

It is difficult to implement a MAC hardware design which handles the full range of link speeds from 10 Mbps all the way up to 10 Gbps, three orders of magnitude faster. Add in a requirement to run a time-critical signal all the way across the chip and between TX/RX clock domains, plus gigabit frame bursting, and it becomes even harder.

Therefore some hardware designs punt, and include what are recognizably two MACs. One is used for 10/100/1000 operation, supports half duplex operation, and is most likely derived from an existing design from older products. The 10G MAC is new, only supports full duplex, and has wider datapaths needed for higher speed operation. It only supports features useful for server deployments, because at this point 10G is too expensive for desktop or other uses. The chip chooses between the two MACs based on the link speed, the result of autonegotiation or explicit configuration.
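As a rough model of that selection step, here is a Python sketch. The class and names are hypothetical and do not describe any particular chip; the only facts taken from the text are which speeds each MAC covers and which one supports half duplex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mac:
    name: str
    speeds_mbps: frozenset
    half_duplex: bool


# Hypothetical models of the two MAC blocks described above.
LEGACY_MAC = Mac("10/100/1000", frozenset({10, 100, 1000}), half_duplex=True)
TEN_GIG_MAC = Mac("10G", frozenset({10000}), half_duplex=False)


def select_mac(speed_mbps: int, full_duplex: bool = True) -> Mac:
    """Pick the MAC for a link speed from autonegotiation or configuration."""
    for mac in (LEGACY_MAC, TEN_GIG_MAC):
        if speed_mbps in mac.speeds_mbps:
            if not full_duplex and not mac.half_duplex:
                raise ValueError(f"{mac.name} MAC has no half duplex mode")
            return mac
    raise ValueError(f"unsupported link speed: {speed_mbps} Mbps")


print(select_mac(1000, full_duplex=False).name)
print(select_mac(10000).name)
```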

The feature set is different for 10G operation because internally it really is different. Nonetheless, it operates as just one interface. The two MACs might be visible to the driver software, but not above that. To the rest of the software stack, it's just one NIC.

Monday, April 18, 2011

Early Market Development

Baby Caterpillar toy with bird logo that looks like Twitter

Twitter client developers have come up with some awfully creative niches to target.


Wednesday, April 13, 2011

Don Knuth Q&A

Don Knuth visited Google in March. He had no prepared notes, instead answering a series of questions from the audience. The video of the talk was posted to YouTube the next day.

My favorite answer was about digital typography, despite being personally unable to distinguish much beyond monospace and proportional fonts. I think the answer is very revealing of character.

Q: You are famously known for your interest in (and contributions to) digital typography. Over 30 years after the release of TeX, what are your thoughts on the current state of typography as it exists on the web and other digital media?
A: I’m upbeat about [it]. I got a Nexus S and it has beautiful fonts on it. I love the typography that I’m seeing. I think that people are starting to understand fonts. I’m famously bad at predicting. The fact that I can’t predict how hard something is is the only reason I started working on typography in the first place, and Art of Computer Programming and a bunch of other stuff, but I did predict that font designers would become heroes, and that turned out to be fairly close to the mark.

Monday, April 11, 2011

IPv6 Addresses for Fun and Profit

The IPv6 address is 128 bits, divided into a 64 bit network portion and a 64 bit host portion. The network portion is so large that organizations are generally issued a block of 256 or more, in order to let each geographic site have a unique prefix. Therefore the admins get to control the lower 8 or more bits of the network portion of the address advertised to the world via DNS... and many people are using tired old deca:fbad and c0ed:babe addresses. This is our chance! It's a brave new world, with huge swaths of IP address space available, and we should make the most of it.

Presented here for your edification and bemusement are suggested choices for the lower bits of IPv6 network addresses.

a110:c8ed     I allocated an address, just for you.
defa:ced      I hate my web designer.
bad:fac:ade   Our CSS needs work.
bad:deed      Thank you for visiting my site. Really.
be:fa11       As in "what has befallen yon dead server?"
abba:ca:daba  Our network is powered by pure magic.
d00:bee       Network debugging probably qualifies as "medicinal purposes."
b0:cce:ba11   You know, I only discovered Bocce Ball in my 30s.
5ca1:ab1e     Ignore what you see elsewhere, the secret to scalability is in using clever IP addresses.
ca:b0b        yummy
fa1:afe1      even more yummy!
b1ab:bed      We might need to tighten up our HTML a bit.
bab:b1e       We might need to recompress our images a bit.
ba:b00        My sweet baboo!
10ad:ed       I bet it has an itchy trigger finger, too.
ba:11ad       The entire site is set in iambic pentameter.
a:100f        My site doesn't like me.
acc0:1ade     Network admins rarely, if ever, hear praise of their work.
aff:ab1e      An address for a social networking site if ever I heard one.
ba:ff1e       Don't blame me for the contents of this site. The web team reports to a whole different department from the network admins.
ba1:b0a       It's the Eye of the Tiger, baby!
ed1:f1ce      Look upon my network, ye Mighty, and despair.
5caf:f01d     This load balancing tier was intended to be temporary. That was four years ago. Such is the way of things.

Finally, here are some 16 bit numbers which are more interesting than "dead," "f00d" and "beef":

caca  A statement about the website quality, I suppose.
deaf  This is a transmit-only network.
ba1d  If tires can go bald, why not networks?
a1fa  I couldn't find a reasonable approximation of beta.
f01d  Origami networking!
fa11  If a site falls in the forest, does it make a sound?
c01d  Networking is a dish best served cold.
ab1e  Or ! 0xab1e, as the case may be.
b0de  Beware the Ides of March.
cede  I give up, I'm done.
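For anyone tempted to deploy one of these, here is a small Python sketch that checks a suggestion is legal hex and drops it into the low bits of the 64 bit network portion. The 2001:db8 prefix is the range reserved for documentation, standing in for a real allocation, and the function names are made up for illustration.

```python
import ipaddress
import re

HEX_GROUP = re.compile(r"^[0-9a-f]{1,4}$")


def is_valid_groups(text: str) -> bool:
    """True if text is colon-separated groups of 1 to 4 lowercase hex digits."""
    return all(HEX_GROUP.match(group) for group in text.split(":"))


def vanity_address(prefix: str, vanity: str, host: str = "1"):
    """Place the vanity groups at the bottom of the 64 bit network portion."""
    if not (is_valid_groups(prefix) and is_valid_groups(vanity)):
        raise ValueError("groups must be 1 to 4 hex digits each")
    prefix_groups = prefix.split(":")
    vanity_groups = vanity.split(":")
    spare = 4 - len(prefix_groups) - len(vanity_groups)
    if spare < 0:
        raise ValueError("network portion is only 64 bits (4 groups)")
    network = prefix_groups + ["0"] * spare + vanity_groups
    return ipaddress.IPv6Address(":".join(network) + "::" + host)


print(vanity_address("2001:db8", "a110:c8ed"))
print(vanity_address("2001:db8", "f01d"))
```

The longer three-group suggestions need a shorter prefix than this example has room for, which is one more reason to ask for a bigger allocation.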

This post was inspired by several tweets by Dan Morrill. BTW, this post is 100% recycled content.

Friday, April 1, 2011

New blogger.com Distribution Option

Yesterday Blogger announced Dynamic Views, with several snappy new layout options. However, a new option of far more import to Blogger users was rolled out silently, with no announcement and no fanfare. A new publication option, feeding into a network with global reach and massive installed base, is now available.

Blogger UUCP publish help pages

I'm planning to syndicate to alt.google.blogger.codingrelic. I've also reserved alt.google.blogger.codingrelic.die.die.die, so don't even think about it.