Thursday, May 29, 2008

An algorithm problem inspired by twitter woes

Thinking a bit more about the Twitter scalability problem I posted about earlier, I wondered whether it is possible to come up with a solution that is somehow a mix between optimising for reads and optimising for writes. What follows are my thoughts on that. Firstly, a couple of caveats:
Caveat 1: Of course this is done purely as an intellectual exercise, and I am not sure whether something like this would actually be manageable in a real-world setting.
Caveat 2: I have oversimplified the problem a lot.

To motivate with an example, let's say we notice that a large number of people follow both user A and user B. Then, apart from the individual datastores for users A and B where we store their tweets, we can also create a datastore for the set {A,B} where the tweets of both users go. Now, for any user who follows both A and B, we can get the recent tweets directly from this new datastore rather than fetching the individual tweets of A and B and then merging them. This is a mixed strategy: we don't create an inbox for every user, and neither do we store tweets only in the posting user's datastore. Instead, we figure out which clusters of people are followed by a large number of users (similar to frequent itemset mining in data mining) and create separate datastores for those clusters as well.
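Purely to make the idea concrete, here is a minimal sketch in Python, assuming in-memory dicts stand in for the real datastores and that the combined clusters have already been chosen somehow; all the names here are made up for illustration.

```python
from collections import defaultdict

# Datastores, keyed by a frozenset of user ids. Singleton keys like {A}
# hold one user's tweets; combined keys like {A, B} hold the merged
# stream for a frequently co-followed cluster.
stores = defaultdict(list)

# Combined datastores we have decided to maintain (found, say, by
# frequent-itemset mining over the follow lists).
combined_keys = [frozenset({"A", "B"})]

def post_tweet(user, tweet):
    """Write path: the tweet goes to the user's own store, plus every
    combined store whose key contains the user."""
    stores[frozenset({user})].append(tweet)
    for key in combined_keys:
        if user in key:
            stores[key].append(tweet)

def read_timeline(followed):
    """Read path: greedily use the largest combined stores that fit
    inside the follow set, then fall back to the singleton stores."""
    remaining = set(followed)
    tweets = []
    for key in sorted(combined_keys, key=len, reverse=True):
        if key <= remaining:
            tweets.extend(stores[key])
            remaining -= key
    for user in remaining:
        tweets.extend(stores[frozenset({user})])
    return tweets
```

So a tweet by A costs two writes here (its own store plus {A,B}), but anyone following both A and B gets that part of their timeline with a single read.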

Now let me abstract this out and try to come up with a nice algorithmish problem. We have a set of items S. We are also given a set U of subsets of S. For example, S = {a,b,c,d} and U = {{a,b}, {a,b,c}, {a,c}, {b,c,d}}. We want to split S into a set T of subsets such that every subset in U can be covered by some subsets in T. Here I am unclear about how I would like to define "cover". I have thought of three ways in which it can be defined. To explain them, take a set U1 = {a,b,c} in U which we want covered, and say sets T1, T2 cover U1. The candidate definitions are:
1) U1 = T1 union T2, and T1 and T2 do not have any element in common. E.g. T1 = {a}, T2 = {b,c}.
2) U1 = T1 union T2, but now T1 and T2 can have common elements. E.g. T1 = {a,b}, T2 = {b,c}.
3) U1 is a subset of T1 union T2. E.g. T1 = {a,d}, T2 = {b,c}.
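To pin these down, here is a small sketch that transcribes each definition as a predicate; T_sel stands for the list of subsets picked from T to cover U1.

```python
def covers_v1(U1, T_sel):
    # Definition 1: exact union, and the chosen sets are pairwise disjoint.
    union = set().union(*T_sel)
    disjoint = sum(len(t) for t in T_sel) == len(union)
    return union == set(U1) and disjoint

def covers_v2(U1, T_sel):
    # Definition 2: exact union, overlaps between the chosen sets allowed.
    return set().union(*T_sel) == set(U1)

def covers_v3(U1, T_sel):
    # Definition 3: the union may be a strict superset of U1.
    return set(U1) <= set().union(*T_sel)

# The examples from the text:
assert covers_v1({"a", "b", "c"}, [{"a"}, {"b", "c"}])
assert covers_v2({"a", "b", "c"}, [{"a", "b"}, {"b", "c"}])
assert covers_v3({"a", "b", "c"}, [{"a", "d"}, {"b", "c"}])
```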


So which one makes more sense, and how would each change the solution? I have not thought about this a lot, but let's assume we have settled on some definition. The aim, then, is to find T.
If we set T = U, this would correspond to the read-optimised solution, where we have a datastore for every single user's inbox. The write-optimised solution would be the one where T contains only singleton sets, one for each item in S. Now, to represent the costs associated with reads and writes, let us say that we have to pay a cost "r" every time we pick a subset from T to assemble a set in U; in our example we would pay cost 2r to cover U1. Similarly, we have to pay a cost "w" for putting an item from S into a subset of T; so if item "a" belongs to 4 subsets in T, we pay cost 4w. The aim now is to create T so as to cover all the sets in U while paying the minimum total cost, where the total cost is the cost required to create T plus the cost required to cover all the sets in U. Of course, there might be multiple ways to cover a set in U from sets in T, but to minimise the cost we would like to cover using as few sets as possible.
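As a rough sketch of this cost function, assuming a fixed assignment `cover` that maps each set in U to the subsets of T used to assemble it:

```python
def total_cost(T, U, cover, r, w):
    # Write cost: pay w each time an item of S is stored in a subset of T,
    # i.e. w times the sum of the subset sizes.
    write_cost = w * sum(len(t) for t in T)
    # Read cost: pay r for every subset of T used to assemble a set in U.
    read_cost = r * sum(len(cover[frozenset(u)]) for u in U)
    return write_cost + read_cost

# The read-optimised extreme T = U: each set covers itself.
U = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c", "d"}]
T = [set(u) for u in U]
cover = {frozenset(u): [u] for u in U}
print(total_cost(T, U, cover, r=1, w=1))  # 10*w + 4*r = 14
```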

So how would you solve this? Is this problem NP-hard? I suspect it might be; even the problem of covering a single set in U with the least number of sets from T seems difficult. So then, can you think of some nice heuristic? And if we look at this not as an algorithms problem but as a systems problem, where we want to use this kind of architecture for storing Twitter's tweets, how would you design such a system? You do not have to come up with an exact solution, just some good heuristic. How would you shard the data to optimise for reads/writes? You would like to keep all the subsets of T which share a lot of elements on the same shard to optimise for writes; on the other hand, you would like to keep all the subsets of T needed to assemble lots of sets in U on the same machine to optimise for reads. How would you decide which sets from T are needed to cover a set in U, and so on.
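For what it's worth, here is one possible greedy heuristic along the frequent-itemset lines mentioned above: add the most frequently co-occurring pairs of items to T, then cover each set in U greedily with the largest applicable subsets. It is only a sketch, and I make no claim that it is anywhere near optimal.

```python
from collections import Counter
from itertools import combinations

def build_T(S, U, num_pairs=2):
    """Keep all singletons as a fallback, plus the num_pairs pairs of
    items that co-occur in the most sets of U."""
    T = [{x} for x in S]
    pair_counts = Counter(
        frozenset(p) for u in U for p in combinations(sorted(u), 2)
    )
    T.extend(set(pair) for pair, _ in pair_counts.most_common(num_pairs))
    return T

def greedy_cover(u, T):
    """Cover u (under definition 1) using the largest fitting sets first.
    Always succeeds because all the singletons are in T."""
    remaining, chosen = set(u), []
    for t in sorted(T, key=len, reverse=True):
        if t <= remaining:
            chosen.append(t)
            remaining -= t
    return chosen

U = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c", "d"}]
T = build_T({"a", "b", "c", "d"}, U)
print([greedy_cover(u, T) for u in U])
```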
So any thoughts???

Wednesday, May 28, 2008

Thoughts on twitter scalability problem

After I read the original Twitter post on their architecture woes, I searched a bit more around the blogosphere to get more insight into what Twitter's architecture might be and wherein the problem lies. Most of the stuff I read agrees that the problem is due to the large number of people that a person follows or is followed by. Because of this, during special events a large number of tweets are posted, which results in heavy load on the databases. Where opinions differ is on whether this heavy load on the DB is a read load or a write load. To explain, let me quote from here:

"Nothing is as easy as it looks. When Robert Scoble writes a simple “I’m hanging out with…” message, Twitter has about two choices of how they can dispatch that message:

  1. PUSH the message to the queue’s of each of his 6,864 followers, or
  2. Wait for the 6,864 followers to log in, then PULL the message."

Doing 1 means that writes will be expensive, but when a user logs in, he can be shown the tweets of all the users he is following just by reading his local queue. Reads would be pretty fast. This is like email.
Doing 2 means that writes would be fast, since a tweet just needs to be written to one place. But while reading, if a user is following a large number of users, Twitter first has to look up all the users this guy is following, then look up the messages of all those users, then combine them and generate the page for the logged-in guy. All these users would typically reside on different databases, so a read would hit many databases and the data would need to be combined across all the different machines. Reads here would be quite expensive.
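To make the two strategies concrete, here is a toy sketch with plain dicts standing in for the queues and databases; all names are illustrative.

```python
from collections import defaultdict

followers = defaultdict(set)   # author -> set of follower ids
inboxes = defaultdict(list)    # strategy 1: a materialised queue per user
outboxes = defaultdict(list)   # strategy 2: one tweet store per author

def post_push(author, tweet):
    """Strategy 1 (PUSH): one write per follower, trivial reads."""
    for f in followers[author]:
        inboxes[f].append((author, tweet))

def read_push(user):
    return inboxes[user]

def post_pull(author, tweet):
    """Strategy 2 (PULL): a single write, expensive reads."""
    outboxes[author].append(tweet)

def read_pull(following):
    # Gather every followed author's tweets at read time.
    return [(a, t) for a in following for t in outboxes[a]]

followers["scoble"] = {f"user{i}" for i in range(6864)}
post_push("scoble", "I'm hanging out with...")   # 6864 queue writes
post_pull("scoble", "I'm hanging out with...")   # 1 write
```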

So what does Twitter do? This guy says that Twitter is optimising writes, while here it is mentioned that Twitter is optimising reads. My personal hunch is that Twitter stores the tweets just once and pays on the read, because you can use caching to make reads much faster, whereas with writes you are stuck: they have to go to the database. Also, isn't this similar to what, say, Google does when returning results for your search? It gets matching pages from various machines (because the index is split across multiple machines) and then combines them, ranks them and generates the final results page. Of course, Google has the advantage that a large number of queries are probably repeated within a day, so their results can be served from the cache instantly. But each Twitter user follows a different set of people, so it would not make sense to cache the page of an individual user. Rather, the tweets need to be cached and the combination done on the fly, as is pointed out here.
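If each user's cached tweet list is kept sorted by timestamp, that on-the-fly combination is just a k-way merge; a minimal sketch with made-up data:

```python
import heapq

# Hypothetical per-user caches, each sorted newest-first by timestamp.
cached = {
    "A": [(105, "A: later tweet"), (100, "A: earlier tweet")],
    "B": [(103, "B: hello")],
}

def timeline(following, limit=20):
    streams = (cached.get(u, []) for u in following)
    merged = heapq.merge(*streams, key=lambda t: t[0], reverse=True)
    return list(merged)[:limit]

print(timeline(["A", "B"]))  # [(105, ...), (103, ...), (100, ...)]
```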

Monday, May 19, 2008

Two Puzzles

1. Given a sequence of r*s+1 distinct integers (r>=1, s>=1), prove that there exists either an increasing subsequence of length r+1 or a decreasing subsequence of length s+1.
It's a nice puzzle with a clean, elegant solution. Though it is not really an aha! puzzle, the solution does not involve much mathematical trickery either. The way I got to the solution was to first fix r at 2, so that we are looking at sequences of length 2s+1. Once you find the solution for this case, it is easy to generalize to arbitrary r. But I think this is a tough puzzle, and I felt really good when I could solve it :)

2. There are 100 seats in an airplane and 100 passengers, each of whom has been assigned a boarding pass containing their seat number. They board the plane one by one. Unfortunately the first passenger has lost his boarding pass, so he sits in a random seat. Every subsequent passenger takes his own seat if he finds it vacant; otherwise he sits in a randomly chosen empty seat. What is the probability that the last passenger finds his seat vacant?
I had heard this puzzle a couple of years back and am not sure whether I was able to solve it then. When I reread it yesterday, I could not recall the solution, but I did remember that there is a simple, elegant one. After some thought and much frustration I finally got the answer, and felt relieved more than anything else. But the solution is nice. You can also generalize it to find the probability of the kth passenger finding his seat vacant.

Friday, May 16, 2008

Google Game

One can figure out how correlated a term B is to another term A by using Google:
cor_B(A) = (number of search results for the query "A B") / (number of search results for the query "B")

So, for example, the number of results returned for the query 'cricket bat' is approximately 746000,
while the number returned for 'bat' alone is 105000000, which means that
cor_bat(cricket) ≈ 1/140, which can be interpreted as saying that out of every 140 pages containing the word bat, only 1 refers to a cricket bat. Now for the term lbw:
cor_lbw(cricket) = 1730000/2470000 ≈ 1/1.4. This is understandable, because lbw hardly has any meaning outside of cricket jargon.
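As a sketch, assuming a hypothetical result_count(query) helper that returns the estimated hit count for a query (how you obtain that count from Google is left open, so treat it purely as a stand-in):

```python
def correlation(a, b, result_count):
    """cor_B(A) = #results("A B") / #results("B")."""
    return result_count(f"{a} {b}") / result_count(b)

# Plugging in the rough counts quoted above:
counts = {"cricket bat": 746_000, "bat": 105_000_000}
print(correlation("cricket", "bat", counts.get))  # ~ 1/140
```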
Now that the definitions are clear, the game is to find the terms with the highest correlation to each of the following terms:

1. cricket
2. java
3. radiohead
4. einstein

So let's see how well you can do!! And of course B = A would give you the maximum correlation, but that is not allowed.