Wednesday, May 28, 2008

Thoughts on the twitter scalability problem

After I read the original twitter post on their architecture woes, I searched the blogosphere a bit more to get some insight into what twitter's architecture might be and wherein the problem lies. Most of the material I read agrees that the problem is due to the large number of people that a person follows or is followed by: during special events a large number of tweets are posted, which results in heavy load on the databases. Where opinions differ is on whether this heavy database load is a read load or a write load. To explain, let me quote from here:

"Nothing is as easy as it looks. When Robert Scoble writes a simple “I’m hanging out with…” message, Twitter has about two choices of how they can dispatch that message:

  1. PUSH the message to the queue’s of each of his 6,864 followers, or
  2. Wait for the 6,864 followers to log in, then PULL the message."

Doing 1 means that writes will be expensive, but when a user logs in he can be shown the tweets of all the users he is following just by reading his local queue. So reads would be pretty fast. This is like email.
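
To make the push option concrete, here is a minimal sketch in Python of what fan-out on write could look like. The dicts and names (followers, inboxes) are just my own stand-ins for whatever databases twitter actually uses.

    followers = {}   # user -> set of users who follow them
    inboxes = {}     # user -> list of tweets pushed to that user's queue

    def post_tweet_push(author, text):
        # Write is expensive: one insert per follower (6,864 for Scoble).
        for follower in followers.get(author, set()):
            inboxes.setdefault(follower, []).append((author, text))

    def read_timeline_push(user):
        # Read is cheap: just return the user's pre-built local queue.
        return inboxes.get(user, [])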
Doing 2 would mean that writes are fast, since a tweet is written to just one place, but on a read, if a user is following a large number of users, twitter first has to look up all the users this guy is following, then look up the messages of all those users, then combine them and generate the page for the logged-in guy. All these users would typically reside on different databases, so a read hits many databases and the data has to be combined across all those machines. Reads here would be quite expensive.
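
And here is the corresponding sketch of the pull option, again with made-up in-memory structures in place of the real sharded databases.

    following = {}   # user -> set of users that user follows
    tweets = {}      # author -> list of (timestamp, text) tuples, written once

    def post_tweet_pull(author, timestamp, text):
        # Write is cheap: a single insert into the author's own list.
        tweets.setdefault(author, []).append((timestamp, text))

    def read_timeline_pull(user):
        # Read is expensive: gather tweets from everyone the user follows
        # (potentially spread over many databases) and merge them by time.
        merged = []
        for author in following.get(user, set()):
            merged.extend((ts, author, text) for ts, text in tweets.get(author, []))
        merged.sort(reverse=True)  # newest first
        return merged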

So what does twitter do? This guy says that twitter is optimising writes, while here it mentions that twitter is optimising reads. My personal hunch is that twitter stores each tweet just once and pays on the read, because you can use caching to make reads much faster, but with writes you are stuck: they have to go to the database. Also, isn't this similar to what, say, google does when returning results for your search? It gets matching pages from various machines (because the index is split across multiple machines), then combines them, ranks them and generates the final results page. Of course, with google one advantage is that a large number of queries are probably repeated in a day, so their results can be served from the cache instantly. But each user follows a different set of people on twitter, so it does not make sense to cache an individual user's page. Rather, the tweets need to be cached and the combination done on the fly, as is pointed out here.
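
Roughly, I imagine the cache-the-tweets-and-combine-on-the-fly idea like this. The cache here is just a dict standing in for something like memcached, and load_tweets_from_db is a hypothetical database call, not anything twitter has described.

    tweet_cache = {}   # author -> recent tweets kept hot in the cache

    def load_tweets_from_db(author):
        # Stand-in for the real database read; returns (timestamp, text) tuples.
        return []

    def get_recent_tweets(author):
        # Hit the cache first; fall back to the database only on a miss.
        if author not in tweet_cache:
            tweet_cache[author] = load_tweets_from_db(author)
        return tweet_cache[author]

    def build_timeline(user, following):
        # Per-user timelines can't usefully be cached (every follow set is
        # different), so merge the cached per-author tweets at request time.
        merged = []
        for author in following:
            merged.extend(get_recent_tweets(author))
        merged.sort(key=lambda t: t[0], reverse=True)  # newest first
        return merged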
