Tuesday, September 8, 2009

Infinite Arrays of Tweeples

Twitter is somewhat out of the range of topics I normally cover here, but I promise we'll come around to a software development angle by the end of this post.

When you follow someone on Twitter, you appear in their followers list and they appear in your following list. New entries appear at the top, so the newest follower will be the first entry in your list. Recently I noticed an exception to this sort ordering, when someone who had followed me a long time ago and later unfollowed decided to follow again. I received the notification email from Twitter, yet he wasn't at the top of the list of followers. Instead he appeared much further down in the listing, off the first page. That is where he was when he followed me the first time, many months ago. This entry disappeared when he unfollowed, and when he re-followed he ended up back in that same place. Why would that be?

Possibility #1: Timestamp

The first possibility, and more likely the correct one, is that Twitter tracks the timestamp of every new follow and chooses not to update it on a subsequent refollow. No matter how many times you have followed/unfollowed, you retain the timestamp of the very first time and will show up in the followers list at that position.
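To make that concrete, here is a minimal sketch of the idea. It is purely my own illustration, not Twitter's code: the class name `FollowerTimestamps`, its methods, and the use of a simple counter in place of a real timestamp are all assumptions made for the example.

```python
import itertools


class FollowerTimestamps:
    """Hypothetical store that remembers only the FIRST time each account followed."""

    def __init__(self):
        self._clock = itertools.count(1)  # stand-in for a real timestamp source
        self._first_followed = {}         # name -> time of the very first follow
        self._active = set()              # names currently following

    def follow(self, name):
        # Record a timestamp only on the first follow; a refollow keeps the old one.
        if name not in self._first_followed:
            self._first_followed[name] = next(self._clock)
        self._active.add(name)

    def unfollow(self, name):
        # The timestamp is deliberately retained after an unfollow.
        self._active.discard(name)

    def followers_newest_first(self):
        # Sort the current followers by their original follow time, newest on top.
        return sorted(self._active,
                      key=lambda n: self._first_followed[n],
                      reverse=True)


store = FollowerTimestamps()
for name in ["alice", "bob", "carol"]:
    store.follow(name)
store.unfollow("alice")
store.follow("alice")  # refollow: alice keeps her original, oldest timestamp
print(store.followers_newest_first())
```

In this sketch the refollowing account lands at the bottom of the list rather than the top, which matches what I saw.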

If an unfollow+refollow were sufficient to move you back up to the top of the list of followers, the bots would do it all the time to make it more likely you'd follow them back. Yet this is a boring root cause. So let's consider a second possibility which is more illustrative for software development.

Possibility #2: Array Deletion

Twitter operates at a scale where performance optimization is essential. If they are not cognizant of performance, the wheels fly off and users start to see the Fail Whale. An area of particular importance is the list of followers, as the backend infrastructure has to traverse it for every tweet. It is possible that Twitter implemented the followers list as an array in memory instead of a linked list, presumably to get better locality. The classic drawback of an array is deletion: you cannot delete an element from the middle without moving all subsequent elements into the hole thus created. To avoid this compaction a "deleted" or "active" bit is commonly kept for each element, allowing deleted entries to be left in place but skipped without processing.
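A rough sketch of the "active bit" technique follows. The names here (`FollowerArray`, `recipients`) are hypothetical and chosen only for illustration. A Python list is not a true contiguous array of records, and the linear `index` lookup is a simplification, so treat this as a model of the idea rather than a performance demonstration.

```python
class FollowerArray:
    """Hypothetical follower array with a parallel list of active bits."""

    def __init__(self, names):
        self.names = list(names)                # slots, oldest follower first
        self.active = [True] * len(self.names)  # one "active" bit per slot

    def unfollow(self, name):
        # Mark the slot as deleted instead of shifting every later element down.
        i = self.names.index(name)
        self.active[i] = False

    def recipients(self):
        # Traversal done for each tweet: deleted slots are skipped, not removed.
        return [n for n, is_active in zip(self.names, self.active) if is_active]


followers = FollowerArray(["alice", "bob", "carol"])
followers.unfollow("bob")
print(followers.recipients())  # bob is skipped
print(len(followers.names))    # the array itself has not shrunk
```

The hole left by the unfollow costs one wasted check per traversal, which is cheap compared to moving every subsequent element.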

When Scobleizer unfollowed everyone, it would have resulted in holes in the followers lists of 106,000 different accounts: entries with the deleted bit set.

Array with deleted bits

I suspect that Twitter does not immediately compact these arrays, so long as the ratio of holes to filled entries is tolerable. When Scobleizer decided to re-follow me, the Twitter backend located the earlier, deleted entry and flipped the bit back to active.

Array after refollowing

Thus the newly restored entry will re-appear in the followers list, but not as the topmost entry. It will re-appear at its existing position within the array. This, or a similar implementation choice of retaining deleted entries in some way, could be why re-follows do not appear at the top of the list.
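Here is a small sketch of how that behavior could fall out of such a design. Everything in it is an assumption on my part: the `FollowerList` class, the `index` lookup table, and the `max_hole_ratio` threshold (measured here as holes over total slots) are invented for illustration and say nothing about what Twitter actually runs.

```python
class FollowerList:
    """Hypothetical follower array that reuses deleted slots on a refollow."""

    def __init__(self, max_hole_ratio=0.5):
        self.slots = []   # [name, active] pairs, oldest follower first
        self.index = {}   # name -> position in slots
        self.max_hole_ratio = max_hole_ratio  # assumed compaction threshold

    def follow(self, name):
        pos = self.index.get(name)
        if pos is not None:
            # An earlier entry exists: flip its bit back to active, in place.
            self.slots[pos][1] = True
        else:
            self.index[name] = len(self.slots)
            self.slots.append([name, True])

    def unfollow(self, name):
        pos = self.index.get(name)
        if pos is None:
            return
        self.slots[pos][1] = False
        holes = sum(1 for _, active in self.slots if not active)
        # Compact only once the holes become too large a share of the array.
        if holes > self.max_hole_ratio * len(self.slots):
            self._compact()

    def _compact(self):
        # Drop deleted slots and rebuild the position index.
        self.slots = [slot for slot in self.slots if slot[1]]
        self.index = {name: i for i, (name, _) in enumerate(self.slots)}

    def newest_first(self):
        # The followers page: walk the array backwards, skipping holes.
        return [name for name, active in reversed(self.slots) if active]


followers = FollowerList()
for name in ["old_friend", "bob", "carol", "dave"]:
    followers.follow(name)
followers.unfollow("old_friend")   # leaves a hole in slot 0
followers.follow("erin")           # a genuinely new follower
followers.follow("old_friend")     # refollow: slot 0 is reactivated
print(followers.newest_first())
```

In this model `erin` shows up at the top of the page while `old_friend` returns to the very bottom, exactly where the original entry sat. Had the array been compacted in between, the refollow would have been appended as a new entry and appeared at the top.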

The Moral of the Story

Optimization is fine, and absolutely crucial to function at Twitter scale, but one must be careful when an optimization changes user-visible behavior. This is particularly true for social media, where we're explicitly conversing with other humans and ascribe human motivations to their actions. Twitter's handling of deleted and re-added follows can cause considerable consternation, because to the casual observer it appears the person followed but then immediately unfollowed. It can seem judgmental.

Of course, I am most likely completely wrong about Twitter's implementation using arrays. It wouldn't be the first time I've made a complete fool out of myself in a blog post. It's cathartic, in a way. Perhaps I'll do it more often.