Thursday, May 29, 2008

An algorithm problem inspired by twitter woes

Thinking a bit more about the twitter scalability problem I posted earlier, I thought is it possible to come up with a solution that is somehow a mix between optimising for reads and optimising for writes. So what follows are my thoughts on that. Firstly a couple of caveats
Caveat 1: Ofcourse it is done purely as an intellectual exercise and I am not sure whether something this would actually be manageable in a real world setting.
Caveat 2: I have oversimplified the problem a lot.

So to motivate with an example, lets say we notice that a large number of people who follow both User A and B. So then apart from the individual datastore for users A and B where we store there tweets, we can also create a datastore for the set {A,B} where tweets of both the user go. So now for any user who follows both A and B we can get the recent tweets directly from this new datastore rather than getting individual tweets from A and B and then merging them. Therefore this is a mixed strategy where we dont create an inbox for every user, neither do we store tweets just in the posting users datastore but we sort of figure out which cluster of people are followed by a large number of people (similar to frequent itemset mining in data mining) and create separate datastores for them also.

Now I would abstract out this problem, and try to come up with a nice algorithmish problem. We have a set of items S. We are also given a set of subsets of S called U. So for example S = {a,b,c,d} and U={{a,b}, {a,b,c}, {a,c}, {b,c,d}}. Now we want to split S into a set T of subsets such at that every subset in U can be covered by some subsets in T. Here I am unclear about how I would like to define cover. I have thought of three ways in which this can be defined and to explain those lets take a set U1= {a,b,c} in U which we want to be covered and lets say sets T1,T2 cover U1. Various definitions for covering are:
1) U1= T1 union T2 and T1 and T2 do not have any element in common. Eg: T1={a} T2={b,c}
2) U1=T1 union T2 but now T1,T2 can have common elements. Eg T1={a,b} T2={b,c}
3) U1 is a subset of T1 union T2. Eg T1={a,d} T2={b,c}.


So which one makes more sense and how would they change the solution? Have not thought about this a lot but lets assume we have settled on some definition. So the aim is to find T.
If we set T=U, this would correspond to read optimised solution where we have a datastore for every single users inbox. Write optimised solution would be where T contains only singleton sets one for each item in S. Now to represent costs associated with reads and writes and take those into consideration, let us say that we have to pay a cost "r" everytime we pick a subset from T to assemble a set in U. So in our example we would pay cost 2r to cover U1. Similarly we have to pay a cost "w" for putting an item from S into a subset in T. So lets say item "a" belongs to 4 subsets in T we will pay cost 4w. So now the aim is to create T to cover all sets in U while paying minimum total cost. Total cost is the sum of cost required to create T and sum of cost required to cover all the sets in U. Ofcourse there might be multiple ways to cover a set in U from sets in T. But to minimise the cost we would like to cover using as few sets as possible.

So how would you solve this? Is this problem NP Hard? I suspect it might be. Even the problem of covering a single set in U with least number of sets in T seems difficult. So then can you think of some nice heuristic. If we do not look at this as an algorithms problem but rather as a systems problem where we want to use this kind of architecture for storing twitters tweets, how would you design such a system. You do not have to come up with an exact solution now, but some good heuristic. How would you shard the data to optimise for read/writes. You would like to keep all subsets of T which share a lot of elements on the same shard to optimise for writes on the other hand you would like to keep all subsets of T needed to assemble lots of sets in U on same machine to optimise for reads. How would you decide which sets from T are needed to cover a set in U and so on.
So any thoughts???

No comments: