Coding Relic: October 2011

Friday, October 28, 2011

Tweetflection Point

Last week at the Web 2.0 Summit in San Francisco, Twitter CEO Dick Costolo talked about recent growth in the service and how iOS5 had caused a sudden 3x jump in signups. He also said daily Tweet volume had reached 250 million. There are many, many estimates of the volume of Tweets sent, but I know of only three which are verifiable as directly from Twitter:

50M tweets/day in March, 2010 according to a Twitter blog post.
140M tweets/day in March, 2011 according to that same Twitter blog post.
250M tweets/day in late October, 2011 according to Dick Costolo.

Graphing these on a log scale shows the rate of growth in Tweet volume, ~~roughly tripling in two years~~ almost tripling in one year.

This graph is misleading though, as we have so few data points. It is very likely that, like signups for the service, the rate of growth in tweet volume suddenly increased after iOS5 shipped. Let's assume the rate of growth also tripled for the few days after the iOS5 launch, and zoom in on the tail end of the graph. It is quite similar up until a sharp uptick at the end.

Speculative graph of average daily Tweet volume, knee of curve at iOS5 launch.

The reality is somewhere between those two graphs, but likely still steep enough to be terrifying to the engineers involved. iOS5 will absolutely have an impact on the daily volume of Tweets; it would be ludicrous to think otherwise. It probably isn't so abrupt a knee in the curve as shown here, but it has to be substantial. Tweet growth is on a new and steeper slope now. It used to triple in a bit over a year; now it will triple in way less than one year.
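As a rough check on "triple in a bit over a year," the two figures from the Twitter blog post can be turned into an annual growth factor and a tripling time. This is a minimal sketch that assumes smooth exponential growth between the two data points, which so few samples cannot confirm; the function names are only illustrative.

```python
import math

def annual_growth_factor(start_volume, end_volume, years):
    """Growth factor per year, assuming smooth exponential growth."""
    return (end_volume / start_volume) ** (1.0 / years)

def months_to_triple(factor_per_year):
    """Months needed to triple at the given annual growth factor."""
    return 12.0 * math.log(3) / math.log(factor_per_year)

# 50M tweets/day in March 2010, 140M tweets/day in March 2011.
factor = annual_growth_factor(50e6, 140e6, years=1.0)
print("annual growth factor: %.2f" % factor)
print("months to triple: %.1f" % months_to_triple(factor))
```

Any steeper slope after the iOS5 launch would shorten the tripling time, but there is not yet a public data point to plug in.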

Why this matters

Even five months ago, the traffic to carry the Twitter Firehose was becoming a challenge to handle. At that time the average throughput was 35 Mbps, with spikes up to about 138 Mbps. Scaling those numbers to today would be 56 Mbps sustained with spikes to 223 Mbps, and about one year until the spikes exceed a gigabit.
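How soon the spikes reach a gigabit depends heavily on the growth rate one assumes. The sketch below projects forward from the 223 Mbps estimate under a few annual growth factors. Only 2.8x corresponds to a measured period; 3.0x and 4.5x are hypothetical values chosen for illustration.

```python
import math

def years_until(current_mbps, limit_mbps, factor_per_year):
    """Years for traffic to grow from current_mbps to limit_mbps."""
    return math.log(limit_mbps / current_mbps) / math.log(factor_per_year)

SPIKE_MBPS = 223.0       # estimated peak today, from the text
GIGABIT_MBPS = 1000.0

# Hypothetical annual growth factors; only 2.8x matches a measured year.
for factor in (2.8, 3.0, 4.5):
    years = years_until(SPIKE_MBPS, GIGABIT_MBPS, factor)
    print("growth %.1fx/year: gigabit spikes in %.2f years" % (factor, years))
```

A horizon of about one year corresponds to growth of roughly 4.5x per year, which is in line with the assumption that the slope steepened after iOS5.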

The indications I've seen are that the feed from Twitter is still sent uncompressed. Compressing using gzip (or Snappy) would gain some breathing room, but not solve the underlying problem. The underlying problem is that the volume of data is increasing way, way faster than the capacity of the network and computing elements tasked with handling it. Compression can reduce the absolute number of bits being sent (at the cost of even more CPU), but not reduce the rate of growth.

Fundamentally, there is a limit to how fast a single HTTP stream can go. As described in the post earlier this year, we've scaled network and CPU capacity by going horizontal and spreading load across more elements. Use of a single very fast TCP flow restricts the handling to a single network link and single CPU in a number of places. The network capacity has some headroom still, particularly by throwing money at it in the form of 10G Ethernet links. The capacity of a single CPU core to process the TCP stream is the more serious bottleneck. At some point relatively soon it will be more cost effective to split the Twitter firehose across multiple TCP streams, for easier scaling. The Tweet ID (or a new sequence number) could put tweets back into an absolute order when needed.

Unbalanced link aggregation with a single high speed HTTP firehose.
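Putting tweets back in order after splitting the firehose could look something like the following. This is a sketch under the assumption that each TCP stream delivers its own tweets in ascending ID order; the shard layout and the dictionary format are invented for the example and are not how Twitter delivers data today.

```python
import heapq

def merge_streams(*streams):
    """Merge per-connection tweet streams back into ascending ID order.

    Each stream must already be ordered by ID. That is an assumption about
    how a sharded firehose would behave, not documented Twitter behavior.
    """
    return heapq.merge(*streams, key=lambda tweet: tweet["id"])

# Hypothetical shards: tweets spread round-robin across three connections.
shard_a = [{"id": 1, "text": "a"}, {"id": 4, "text": "d"}]
shard_b = [{"id": 2, "text": "b"}, {"id": 5, "text": "e"}]
shard_c = [{"id": 3, "text": "c"}, {"id": 6, "text": "f"}]

ordered_ids = [t["id"] for t in merge_streams(shard_a, shard_b, shard_c)]
print(ordered_ids)
```

Because heapq.merge is lazy, it only needs to hold one pending tweet per stream, so the reordering cost stays small even when each stream is handled by a different core or machine upstream.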

Update: My math was off. Even before the iOS5 announcement, the rate of growth was nearly tripling in one year. Corrected post.

Monday, October 24, 2011

Well Trodden Technology Paths

Modern CPUs are astonishingly complex, with huge numbers of caches, lookaside buffers, and other features to optimize performance. Hardware reset is generally insufficient to initialize these features for use: reset leaves them in a known state, but not a useful one. Extensive software initialization is required to get all of the subsystems working.

It's quite difficult to design such a complex CPU to handle its own initialization out of reset. The hardware verification wants to focus on the "normal" mode of operation, with all hardware functions active, but handling the boot case requires that a vast number of partially initialized CPU configurations also be verified. Any glitch in these partially initialized states results in a CPU which cannot boot, and is therefore useless.

Large CPU with many cores, and a small 68k CPU in the corner.

Many, and I'd hazard to guess most, complex CPU designs reduce their verification cost and design risk by relying on a far simpler CPU buried within the system to handle the earliest stages of initialization. For example, the Montalvo x86 CPU design contained a small 68000 core to handle many tasks of getting it running. The 68k was an existing, well proven logic design requiring minimal verification by the ASIC team. That small CPU went through the steps to initialize all the various hardware units of the larger CPUs around it before releasing them from reset. It ran an image fetched from flash which could be updated with new code as needed.

Warning: Sudden Segue Ahead

Networking today is at the cusp of a transition, one which other parts of the computing market have already undergone. We're quickly shifting from fixed-function switch fabrics to software defined networks. This shift bears remarkable similarities to the graphics industry shifting from fixed 3D pipelines to GPUs, and of CPUs shedding their coprocessors to focus on delivering more general purpose computing power.

Networking will also face some of the same issues as modern CPUs, where the optimal design for performance in normal operation is not suitable for handling its own control and maintenance. Last week's ruminations about L2 learning are one example: though we can make a case for software provisioning of MAC addresses, the result is a network which doesn't handle topology changes without software stepping in to reprovision.

The control network cannot, itself, seize up whenever there is a problem. The control network has to be robust in handling link failures and topology changes, to allow software to reconfigure the rest of the data-carrying network. This could mean an out of band control network, hooked to the ubiquitous Management ports on enterprise and datacenter switches. It might also be that a VLAN used for control operates in a very different fashion than one used for data, learning L2 addresses dynamically and using MSTP to handle link failures.

All in all, it's an exciting time to be in networking.

Sunday, October 23, 2011

Tornado HTTPClient Chunked Downloads

Tornado is an open source web server in Python. It was originally developed to power friendfeed.com, and excels at non-blocking operations for real-time web services.

Tornado includes an HTTP client as well, to fetch files from other servers. I found a number of examples of how to use it, but all of them would fetch the entire item and return it in a callback. I plan to fetch some rather large multi-megabyte files, and don't see a reason to hold them entirely in memory. Here is an example of how to get partial updates as the download progresses: pass in a streaming_callback to the HTTPRequest().

The streaming_callback will be called for each chunk of data from the server. 4 KBytes is a common chunk size. The async_callback will be called when the file has been fully fetched; the response.body will be empty, because the data has already been handed to the streaming_callback.

#!/usr/bin/python

import os
import tempfile
import tornado.httpclient
import tornado.ioloop

class HttpDownload(object):
  def __init__(self, url, ioloop):
    self.ioloop = ioloop
    self.tempfile = tempfile.NamedTemporaryFile(delete=False)
    req = tornado.httpclient.HTTPRequest(
        url = url,
        streaming_callback = self.streaming_callback)
    http_client = tornado.httpclient.AsyncHTTPClient()
    http_client.fetch(req, self.async_callback)

  def streaming_callback(self, data):
    self.tempfile.write(data)

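  def async_callback(self, response):
    # Called once, after the last chunk has been passed to streaming_callback.
    self.tempfile.flush()
    self.tempfile.close()
    if response.error:
      print("Failed")
      os.unlink(self.tempfile.name)
    else:
      print("Success: %s" % self.tempfile.name)
    # Stop the loop on failure as well as success, so the program can exit.
    self.ioloop.stop()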

def main():
  ioloop = tornado.ioloop.IOLoop.instance()
  dl = HttpDownload("http://codingrelic.geekhold.com/", ioloop)
  ioloop.start()

if __name__ == '__main__':
  main()
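
Newer Tornado releases (6.0 and later, as far as I know) dropped the callback argument to fetch() in favor of coroutines. The same streaming idea could be sketched as below. ChunkWriter and download are names invented for this example, and the Tornado import is deferred so the file-writing part can be tried without Tornado installed.

```python
import os
import tempfile

class ChunkWriter(object):
    """Callable that appends each received chunk to a temporary file."""

    def __init__(self):
        self.tmp = tempfile.NamedTemporaryFile(delete=False)
        self.bytes_written = 0

    def __call__(self, chunk):
        # Tornado calls this once per chunk of the response body.
        self.tmp.write(chunk)
        self.bytes_written += len(chunk)

    def close(self):
        self.tmp.close()
        return self.tmp.name

async def download(url):
    # Imported here so ChunkWriter can be exercised without Tornado installed.
    from tornado.httpclient import AsyncHTTPClient
    writer = ChunkWriter()
    try:
        # streaming_callback is passed through to the HTTPRequest.
        await AsyncHTTPClient().fetch(url, streaming_callback=writer)
    except Exception:
        os.unlink(writer.close())   # remove the partial file on any failure
        raise
    return writer.close()
```

Running it would be something like asyncio.run(download("http://codingrelic.geekhold.com/")), which returns the name of the temporary file holding the download.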

I'm mostly blogging this for my own future use, to be able to find how to do something I remember doing before. There you go, future me.

Wednesday, October 19, 2011

Layer 2 History

Why use L2 networks in datacenters?
Virtual machines need to move from one physical server to another, to balance load. To avoid disrupting service, their IP address cannot change as a result of this move. That means the servers need to be in the same L3 subnet, leading to enormous L2 networks.

Why are enormous L2 networks a problem?
A switch looks up the destination MAC address of the packet it is forwarding. If the switch knows what port that MAC address is on, it sends the packet to that port. If the switch does not know where the MAC address is, it floods the packet to all ports. The amount of flooding traffic tends to rise as the number of stations attached to the L2 network increases.

Transition from half duplexed Ethernet to L2 switching.

Why do L2 switches flood unknown address packets?
So they can learn where that address is. Flooding the packet to all ports means that if that destination exists, it should see the packet and respond. The source address in the response packet lets the switches learn where that address is.
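
The learn-and-flood behavior is small enough to sketch in a few lines. This is a toy model, not any vendor's implementation: it ignores VLANs, table aging, and broadcast addresses, and it floods to every port except the one the frame arrived on.

```python
class LearningSwitch(object):
    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}   # MAC address -> port where it was last seen

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port       # learn from the source address
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []                           # destination is on the same segment
        return [out_port]

switch = LearningSwitch(ports=[1, 2, 3, 4])
first = switch.receive(1, "aa", "bb")   # "bb" unknown, so the frame floods
reply = switch.receive(3, "bb", "aa")   # "aa" was learned on port 1
again = switch.receive(1, "aa", "bb")   # "bb" is now known on port 3
print(first, reply, again)
```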

Why do L2 switches need to learn addresses dynamically?
Because they replaced simpler repeaters (often called hubs). Repeaters required no configuration; they just repeated the packet they saw on one segment to all other segments. Requiring extensive configuration of MAC addresses for switches would have been an enormous drawback.

Why did repeaters send packets to all segments?
Repeaters were developed to scale up Ethernet networks. Ethernet at that time mostly used coaxial cable. Once attached to the cable, the station could see all packets from all other stations. Repeaters kept that same property.

How could all stations see all packets?
There were limits placed on the maximum cable length, propagation delay through a repeater, and the number of repeaters in an Ethernet network. The speed of light in the coaxial cable used for the original Ethernet networks is 0.77c, or 77% of the speed of light in a vacuum. Ethernet has a minimum packet size to allow sufficient time for the first bit of the packet to propagate all the way across the topology and back before the packet ends transmission.
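
To put numbers on that, here is a back-of-the-envelope calculation. The 10 Mbps bit rate and 64 byte minimum frame are the classic Ethernet values, not figures from this post, and the result is only an upper bound: it ignores repeater and transceiver delays, which is why the real topology limits were considerably shorter.

```python
C_VACUUM_M_PER_S = 299792458.0
VELOCITY_FACTOR = 0.77        # signal speed in the coax, from the text
BIT_RATE = 10e6               # 10 Mbps, original Ethernet (not from the post)
MIN_FRAME_BITS = 64 * 8       # 64 byte minimum frame (not from the post)

# Time the sender is still transmitting a minimum-size frame.
frame_time_s = MIN_FRAME_BITS / BIT_RATE

# The first bit must cross the network and a collision must get back
# before transmission ends, so the usable span is half the distance covered.
round_trip_m = VELOCITY_FACTOR * C_VACUUM_M_PER_S * frame_time_s
max_span_m = round_trip_m / 2

print("frame time: %.1f microseconds" % (frame_time_s * 1e6))
print("upper bound on span: %.0f meters" % max_span_m)
```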

So there you go. We build datacenter networks this way because of the speed of light in coaxial cable.

Monday, October 17, 2011

Complexity All the Way Down

Jean-Baptiste Queru recently wrote a brilliant essay titled Dizzying but invisible depth, a description of the sheer, unimaginable complexity at each layer of modern computing infrastructure. It is worth a read.

Wednesday, October 12, 2011

Dennis Ritchie, 1941-2011

Kernighan and Ritchie _The C Programming Language_

K&R C is the finest programming language book ever published. Its terseness is a hallmark of the work of Dennis Ritchie; it says exactly what needs to be said, and nothing more.

Rest in Peace, Dennis Ritchie.

The first generation of computer pioneers are already gone. We're beginning to lose the second generation.

Monday, October 10, 2011

ld-http.so

In the last decade we have enjoyed a renaissance of programming language development. Clojure, Scala, Python, C#/F#/et al, Ruby (and Rails), Javascript, node.js, Haskell, Go, and the list goes on. Development of many of those languages started in the 1990s, but adoption accelerated in the 2000s.

Why now? There are probably a lot of reasons, but I want to opine on one.

HTTP is our program linker.

We no longer have to worry about linking to a gazillion libraries written in different languages, with all of the compatibility issues that entails. We no longer build large software systems by linking it all into ginormous binaries, and that loosens a straitjacket which made it difficult to stray too far from C. We dabbled with DCE/CORBA/SunRPC as a way to decouple systems, but RPC marshaling semantics still dragged in a bunch of assumptions about data types.

It took the web and the model of software as a service running on server farms to really decompose large systems into cooperating subsystems which could be implemented any way they like. Facebook can implement chat in Erlang, Akamai can use Clojure, Google can mix C++ with Java/Python/Go/etc. It is all connected together via HTTP, sometimes carrying SOAP or other RPCs, and sometimes with RESTful interfaces even inside the system.

Friday, October 7, 2011

Finding Ada, 2011

Ada Lovelace Day aims to raise the profile of women in science, technology, engineering and maths by encouraging people around the world to talk about the women whose work they admire. This international day of celebration helps people learn about the achievements of women in STEM, inspiring others and creating new role models for young and old alike.

findingada.com

For Ada Lovelace Day 2010 I analyzed a patent for a frequency hopping control system for guided torpedoes, granted to Hedy Lamarr and George Antheil. For Ada Lovelace Day this year I want to share a story from early in my career.

After graduation I worked on ASICs for a few years, mostly on Asynchronous Transfer Mode NICs for Sun workstations. In the 1990s Sun made large investments in ATM: it designed its own Segmentation and Reassembly ASICs, wrote a Q.2931 signaling stack, adapted NetSNMP as an ILMI stack, wrote LAN Emulation and MPOA implementations, etc.

Yet ATM wasn't a great fit for carrying data traffic. Its overhead for cell headers was very high, it had an unnatural fondness for SONET as its physical layer, and it required a signaling protocol far more complex than the simple ARP protocol of Ethernet.

Cell loss == packet loss. Its most pernicious problem for data networking was in dealing with congestion. There was no mechanism for flow control, because ATM evolved out of a circuit switched world with predictable traffic patterns. Congestive problems come when you try to switch packets and deal with bursty traffic. In an ATM network the loss of a single cell would render the entire packet unusable, but the network would be further congested carrying the remaining cells of that packet's corpse.

Allyn Romanow at Sun Microsystems and Sally Floyd from the Lawrence Berkeley Labs conducted a series of simulations, ultimately resulting in a paper on how to deal with congestion. If a cell had to be dropped, drop the rest of the cells in that packet. Furthermore, deliberately dropping packets early as buffering approached capacity was even better, and brought ATM links up to the same efficiency for TCP transport as native packet links. Allyn was very generous with her time in explaining the issues and how to solve them, both in ATM congestion control and in a number of other aspects of making a network stable.
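
The two discard policies can be sketched as a small admission check on a cell queue. This is my own simplified model for illustration, not the simulation code from the paper: the class name, the threshold parameter, and the first_cell and last_cell flags are all assumptions.

```python
class EarlyPacketDiscard(object):
    def __init__(self, capacity, threshold):
        self.capacity = capacity      # total cell buffers in the queue
        self.threshold = threshold    # occupancy at which new packets are refused
        self.queued = 0
        self.discarding = set()       # packets whose remaining cells are dropped

    def offer(self, packet_id, first_cell=False, last_cell=False):
        """Return True if the cell is queued, False if it is dropped."""
        if packet_id in self.discarding:
            accepted = False          # rest of a packet that already lost a cell
        elif first_cell and self.queued >= self.threshold:
            self.discarding.add(packet_id)   # early discard of a whole new packet
            accepted = False
        elif self.queued >= self.capacity:
            self.discarding.add(packet_id)   # overflow in the middle of a packet
            accepted = False
        else:
            self.queued += 1
            accepted = True
        if last_cell:
            self.discarding.discard(packet_id)   # packet finished, forget it
        return accepted

    def drain(self, cells=1):
        self.queued = max(0, self.queued - cells)

epd = EarlyPacketDiscard(capacity=4, threshold=2)
results = [
    epd.offer("A", first_cell=True),
    epd.offer("A"),
    epd.offer("B", first_cell=True),   # queue at threshold: refuse all of B
    epd.offer("B", last_cell=True),
    epd.offer("A"),
    epd.offer("A"),
    epd.offer("A"),                    # queue full: A loses a cell
    epd.offer("A", last_cell=True),    # so the rest of A is dropped too
]
print(results)
```

Packet A was admitted before the queue reached the threshold, so it keeps being accepted until the buffer is actually full; packet B never occupies a single buffer.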

ATM also had a very complex signaling stack for setting up connections, so complex that many ATM deployments simply gave up and permanently configured circuits everywhere they needed to go. PVCs only work up to a point, as the network size is constrained by the number of available circuits. Renee Danson Sommerfeld took on the task of writing a Q.2931 signaling stack for Solaris, requiring painstaking care with specifications and interoperability testing. Sun's ATM products were never reliant on PVCs to operate; they could set up switched circuits on demand and close them when no longer needed.

In this industry we tend to celebrate engineers who spend massive effort putting out fires. What I learned from Allyn, Sally, and Renee is that the truly great engineers see the fire coming, and keep it from spreading in the first place.

Update: Dan McDonald worked at Sun in the same timeframe, and posted his own recollections of working with Allyn, Sally, and Renee. As he put it on Google+, "Good choices for people, poor choice for technology." (i.e. ATM Considered Harmful).

Wednesday, October 5, 2011

Non Uniform Network Access

Four CPUs in a ring, with RAM attached to each.

Non Uniform Memory Access is common in modern x86 servers. RAM is connected to each CPU, which connect to each other. Any CPU can access any location in RAM, but will incur additional latency if there are multiple hops along the way. This is the non-uniform part: some portions of memory take longer to access than others.

Yet the NUMA we use today is NUMA in the small. In the 1990s NUMA aimed to make very, very large systems commonplace. There were many levels of bridging, each adding yet more latency. RAM attached to the local CPU was fast, RAM for other CPUs on the same board was somewhat slower. RAM on boards in the same local grouping took longer still, while RAM on the other side of the chassis took forever. Nonetheless this was considered to be a huge advancement in system design because it allowed the software to access vast amounts of memory in the system with a uniform programming interface... except for performance.

Operating system schedulers which had previously run any task on any available CPU would randomly exhibit extremely bad behavior: a process running on distant combinations of CPU and RAM would run an order of magnitude slower. NUMA meant that all RAM was equal, but some was more equal than others. Operating systems desperately added notions of RAM affinity to go along with CPU and cache affinity, but reliably good performance was difficult to achieve.

As an industry we concluded that NUMA in moderation is good, but too much NUMA is bad. Those enormous NUMA systems have mostly lost out to smaller servers clustered together, where each server uses a bit of NUMA to improve its own scalability. The big jump in latency to get to another server is accompanied by a change in API, to use the network instead of memory pointers.

A Segue to Web Applications

Modern web applications can make tradeoffs between CPU utilization, memory footprint, and network bandwidth. Increase the amount of memory available for caching, and reduce the CPU required to recalculate results. Shard the data across more nodes to reduce the memory footprint on each at the cost of increasing network bandwidth. In many cases these tradeoffs don't need to be baked deep in the application, they can be tweaked via relatively simple changes. They can be adjusted to tune the application for RAM size, or for the availability of network bandwidth.
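
As one small illustration of trading memory for CPU, the size of a result cache can be a single tunable number. The workload and function names below are hypothetical; the point is only that the same request pattern costs more recomputation when the cache is given less memory.

```python
from functools import lru_cache

def expensive(n):
    """Stand-in for a result that costs CPU to recompute."""
    return sum(i * i for i in range(n))

def misses_for(cache_entries, requests):
    """Count how many requests had to be recomputed for a given cache size."""
    cached = lru_cache(maxsize=cache_entries)(expensive)
    for n in requests:
        cached(n)
    return cached.cache_info().misses

requests = [1000, 2000, 3000] * 2      # hypothetical request pattern
small = misses_for(1, requests)        # little memory, everything recomputed
large = misses_for(3, requests)        # more memory, repeats served from cache
print(small, large)
```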

Further Segue To Overlay Networks

There is a lot of effort being put into overlay networks for virtualized datacenters, to create an L2 network atop an L3 infrastructure. This allows the infrastructure to run as an L3 network, which we are pretty good at scaling and managing, while the service provided to the VMs behaves as an L2 network.

Yet once the packets are carried in IP tunnels they can, through the magic of routing, be carried across a WAN to another facility. The datacenter network can be transparently extended to include resources in several locations. Transparently, except for performance. The round trip time across a WAN will inevitably be longer than the LAN, the speed of light demands it. Even for geographically close facilities the bandwidth available over a WAN will be far less than the bandwidth available within a datacenter, perhaps orders of magnitude less. Application tuning parameters set based on the performance within a single datacenter will be horribly wrong across the WAN.

I've no doubt that people will do it anyway. We will see L2 overlay networks being carried across VPNs to link datacenters together transparently (except for performance). Like the OS schedulers suddenly finding themselves in a NUMA world, software infrastructure within the datacenter will find itself in a network where some links are more equal than others. As an industry, we'll spend a couple years figuring out whether that was a good idea or not.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Sunday, October 2, 2011

NVGRE Musings

It is an interesting time to be involved in datacenter networking. There have been announcements recently of two competing proposals for running virtual L2 networks as an overlay atop an underlying IP network, VXLAN and NVGRE. Supporting an L2 service is important for virtualized servers, which need to be able to move from one physical server to another without changing their IP address or interrupting the services they provide. Having written about VXLAN in a series of three posts, now it is time for NVGRE. Ivan Pepelnjak has already posted about it on IOShints, which I recommend reading.

NVGRE encapsulates L2 frames inside tunnels to carry them across an L3 network. As its name implies, it uses GRE tunneling. GRE has been around for a very long time, and is well supported by networking gear and analysis tools. An NVGRE Endpoint uses the Key field in the GRE header to hold the Tenant Network Identifier (TNI), a 24 bit space of virtual LANs.
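
A GRE header with the Key field present is only eight bytes, so packing a TNI into it is short. This is a sketch, not a reading of the draft: the K flag (0x2000) and the Transparent Ethernet Bridging protocol type (0x6558) are standard GRE values, but placing the TNI in the upper 24 bits of the Key with the low 8 bits zero is my assumption and should be checked against the draft's header diagram.

```python
import struct

GRE_FLAG_KEY_PRESENT = 0x2000          # K bit in the GRE flags field
PROTO_TRANSPARENT_ETHERNET = 0x6558    # payload is an Ethernet frame

def pack_nvgre_header(tni):
    """Build an 8 byte GRE header carrying a 24 bit TNI in the Key field."""
    if not 0 <= tni < (1 << 24):
        raise ValueError("TNI must fit in 24 bits")
    key = tni << 8    # ASSUMPTION: TNI in the upper 24 bits, low 8 bits zero
    return struct.pack("!HHI", GRE_FLAG_KEY_PRESENT,
                       PROTO_TRANSPARENT_ETHERNET, key)

def unpack_tni(header):
    """Recover the TNI from a header built by pack_nvgre_header."""
    flags, _proto, key = struct.unpack("!HHI", header[:8])
    if not flags & GRE_FLAG_KEY_PRESENT:
        raise ValueError("GRE header carries no Key field")
    return key >> 8

header = pack_nvgre_header(0x00ABCD)
print(header.hex(), hex(unpack_tni(header)))
```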

The encapsulated packet has no Inner CRC. When VMs send packets to other VMs within a server they do not calculate a CRC, one is added by a physical NIC when the packet leaves the server. As the NVGRE Endpoint is likely to be a software component within the server, prior to hitting any NIC, the frames have no CRC. This is another case where even on L2 networks, the Ethernet CRC does not work the way our intuition would suggest.

The NVGRE draft refers to IP addresses in the outer header as Provider Addresses, and the inner header as Customer Addresses. NVGRE can optionally also use an IP multicast group for each TNI to distribute L2 broadcast and multicast packets.

Not Quite Done

As befits its "draft" designation, a number of details in the NVGRE proposal are left to be determined in future iterations. One largish bit left unspecified is mapping of Customer Addresses to Provider. When an NVGRE Endpoint needs to send a packet to a remote VM, it must know the address of the remote NVGRE Endpoint. The mechanism to maintain this mapping is not yet defined, though it will be provisioned by a control function communicating with the Hypervisors and switches.

Optional Multicast?

The NVGRE draft calls out broadcast and multicast support as being optional, only if the network operator chooses to support it. To operate as a virtual Ethernet network a few broadcast protocols are essential, like ARP and IPv6 ND. Presumably if broadcast is not available, the NVGRE Endpoint would respond to these requests to its local VMs.

Yet I don't see how that can work in all cases. The NVGRE control plane can certainly know the Provider Address of all NVGRE Endpoints. It can know the MAC address of all guest VMs within the tenant network, because the Hypervisor provides the MAC address as part of the virtual hardware platform. There are notable exceptions where guest VMs use VRRP, or make up locally administered MAC addresses, but I'll ignore those for now.

I don't see how an NVGRE Endpoint can know all Customer IP Addresses. One of two things would have to happen:

Require all customer VMs to obtain their IP from the provider. Even backend systems using private, internal addresses would have to get them from the datacenter operator so that NVGRE can know where they are.
Implement a distributed learning function where NVGRE Endpoints watch for new IP addresses sent by their VMs and report them to all other Endpoints.

The current draft of NVGRE makes no mention of either such function, so we'll have to watch for future developments.

The earlier VL2 network also did not require multicast and handled ARP via a network-wide directory service. Many VL2 concepts made their way into NVGRE. So far as I understand it, VL2 assigned all IP addresses to VMs and could know where they were in the network.

Multipathing

Load balancing across four links between switches.

An important topic for tunneling protocols is multipathing. When multiple paths are available to a destination, either LACP at L2 or ECMP at L3, the switches have to choose which link to use. It is important that packets on the same flow stay in order, as protocols like TCP use excessive reordering as an indication of congestion. Switches hash packet headers to select a link, so packets with the same headers will always choose the same link.

Tunneling protocols have issues with this type of hashing: all packets in the tunnel have the same header. This limits them to a single link, and congests that one link for other traffic. Some switch chips implement extra support for common tunnels like GRE, to include the Inner header in the hash computation. NVGRE would benefit greatly from this support. Unfortunately, it is not universal amongst modern switches.

Choosing Provider Address by hashing the Inner headers.

The NVGRE draft proposes that each NVGRE Endpoint have multiple Provider Addresses. The Endpoints can choose one of several source and destination IP addresses in the encapsulating IP header, to provide variance to spread load across LACP and ECMP links. The draft says that when the Endpoint has multiple PAs, each Customer Address will be provisioned to use one of them. In practice I suspect it would be better were the NVGRE Endpoint to hash the Inner headers to choose addresses, and distribute the load for each Customer Address across all links.

Using multiple IP addresses for load balancing is clever, but I can't easily predict how well it will work. The number of different flows the switches see will be relatively small. For example if each endpoint has four addresses, the total number of different header combinations between any two endpoints is sixteen. This is sixteen times better than having a single address each, but it is still not a lot. Unbalanced link utilization seems quite possible.
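
To get a feel for how sixteen header combinations spread over a handful of links, one can run them through a stand-in hash. CRC32 is used here purely for illustration, since real switch chips use their own hash functions and usually more header fields; the addresses are made up.

```python
import zlib
from collections import Counter
from itertools import product

def pick_link(src_ip, dst_ip, num_links):
    """Stand-in for a switch's hash; real switch ASICs use their own functions."""
    digest = zlib.crc32(("%s>%s" % (src_ip, dst_ip)).encode("ascii"))
    return digest % num_links

# Hypothetical Provider Addresses, four per endpoint.
src_pas = ["10.0.0.%d" % i for i in range(1, 5)]
dst_pas = ["10.0.1.%d" % i for i in range(1, 5)]

pairs = list(product(src_pas, dst_pas))
usage = Counter(pick_link(s, d, num_links=4) for s, d in pairs)
print(len(pairs), dict(usage))
```

With so few distinct inputs, nothing guarantees four flows per link; whether the split comes out even depends entirely on the hash and the addresses chosen.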

Aside: Deliberate Multipathing

One LACP group feeding in to the next.

The relatively limited variance in headers leads to an obvious next step: ensure the traffic will be balanced by predicting what the switch will do, and choose Provider IP addresses to optimize and ensure it is well balanced. In networking today we tend to solve problems by making the edges smarter.

The NVGRE draft says that selection of a Provider Address is provisioned to the Endpoint. Each Customer Address will be associated with exactly one Provider Address to use. I suspect that selection of Provider Addresses is expected to be done via an optimization mechanism like this, but I'm definitely speculating.

I'd caution that this is harder than it sounds. Switches use the ingress port as part of the hash calculation. That is, the same packet arriving on a different ingress port will choose a different egress link within the LACP/ECMP group. To predict behavior one needs a complete wiring diagram of the network. In the rather common case where several LACP/ECMP groups are traversed along the way to a destination, the link selected by each previous switch influences the hash computation of the next.

Misc Notes

The NVGRE draft mentions keeping an MTU state per Endpoint, to avoid fragmentation. Details will be described in future drafts. NVGRE certainly benefits from a datacenter network with a larger MTU, but will not require it.
VXLAN describes its overlay network as existing within a datacenter. NVGRE explicitly calls for spanning across wide area networks via VPNs, for example to connect a corporate datacenter to additional resources in a cloud provider. I'll have to cover this aspect in another post, this post is too long already.

Conclusion

It's quite difficult to draw a conclusion about NVGRE, as so much is still unspecified. There are two relatively crucial mapping functions which have yet to be described:

When a VM wants to contact a remote Customer IP and sends an ARP Request, in the absence of multicast, how can the matching MAC address be known?
When the NVGRE Endpoint is handed a frame destined to a remote Customer MAC, how does it find the Provider Address of the remote Endpoint?

So we'll wait and see.