Ever since Diego Maradona hung up his magical boots, Argentina has been waiting for an heir. And because players like Maradona don't just drop from trees or grow in football fields, no apparent heir has emerged. So it was decided: if we cannot get the next Maradona, we will thrust the crown upon someone anyway. Every other season some promising young starlet is crowned the "Next Maradona". Sometimes it is Maradona himself doing the declaring, which lends even more credibility to such an outrageous claim. Aimar, Saviola, D'Alessandro, Messi and recently Kun Aguero: the list keeps growing. Of course, what keeps this game interesting is that most players so named have failed to live up to the promise and fallen by the wayside, though it does seem that Messi is here to stay (and while these Maradonas continue to fail, a man completely different from Maradona has made the Argentine national team his own). So the hunt continues.
And now in another country this game has caught on and become a favorite pastime. French football reached its pinnacle in the reign of Zidane the First. When it became apparent that the 2006 World Cup would be the final chapter of his career, the French public immediately anointed Franck Ribery as the next midfield general. Since then Ribery has evolved nicely and has truly become the fulcrum of this team. But to keep the game interesting and competitive, two more names have suddenly been thrust into the ring: Karim Benzema and Samir Nasri. What advances their case is the fact that they are both of the same ethnicity as Zidane, Algerian Kabyle. And now that the French national team is playing in its first major tournament without Zidane, great things are expected of these three. Maybe by the time the tournament ends, one of them will have taken the lead in the race to become the next Zidane. But I am sure that in a couple of years the fickle public, having gotten bored of these names, will have elevated some other youngster, because the game of course is bigger than any individual.
And what about India? When the maestro puts his bat to rest, I am sure we will not be left behind, and many a new Tendulkar will be launched in passionate discussions at coffee tables, barber shops, panwalas and so on. Anyone betting on Rohit Sharma!!
Tuesday, June 10, 2008
Thursday, June 5, 2008
Useful tips for booking tickets on IRCTC
I have been frustrated endlessly by the Indian Railways online ticket booking site, especially at 8 o'clock in the morning (not exactly the best time to waste an hour in front of your laptop) when I am trying to book Tatkal tickets. And given that I am a very frequent traveller, this happens almost every other week. If I am lucky I end up with a confirmed ticket and a headache; otherwise I end up with only the headache. But since it is a monopoly, one has to learn to live with it. These are some of the tips I have learned the hard way over time (so-called wisdom gained from experience). Of course my head still hurts every time I undergo this exercise (like today), but at least I now have a blog post to show for it all.
1. Even though a Tatkal ticket is always booked from the train's starting station to its ending station, there are, surprisingly, different quotas for different source-destination pairs. For example, today when I looked for a Tatkal ticket from New Delhi to Nagpur on the Gondwana Express, it showed waiting. But for the same train, New Delhi to Bilaspur (the end station) showed around 86 tickets available. So I happily booked the ticket from New Delhi to Bilaspur, of course after a few retries. So remember: always check availability and book the ticket from the starting station to the ending station of the train. In Tatkal you are going to pay for the whole ticket anyway, no matter what your source and destination are.
2. Tatkal tickets can be booked 5 days in advance. But many times, when booking a ticket from Nagpur to Delhi exactly 5 days in advance and at 8 am sharp (which is when Tatkal bookings open), I would find no availability. This puzzled me a lot, and then I had one of those life-changing moments, an epiphany, an aha insight: the trains didn't actually start at Nagpur, and in many cases their starting day was the day before they reached Nagpur. For example, the Tamil Nadu Express reaches Nagpur on the 10th of June, but it starts from Chennai on the 9th of June. So its Tatkal booking would have opened on the 4th of June (not the 5th) and would have sold out that very day. So look up the starting day of your train and book 5 days in advance of that. Also, if I had given my source station as Nagpur, the site would not have allowed me to book on the 4th, saying that "tatkal tickets can only be booked 5 days in advance". So again, give the source station as the starting station of the train, and then put your actual station as the boarding point.
3. There is a "Quick Book" option in the menu just below Plan Your Travel. I discovered this only a few days back and it is very useful. You need to know the number of the train you will be travelling on, and then on a single page you can fill in all the details and make the booking, instead of first looking at a list of trains and then selecting one. The protocol I follow: first, at www.indianrail.gov.in, I check availability and pick the train I would like to travel on; then I go directly to Quick Book on IRCTC and fill in all the details. This really saves a lot of time, especially since indianrail.gov.in is not down as often as IRCTC. Also, if a large number of people start using indianrail.gov.in for availability lookups, then hopefully it should reduce the load on IRCTC and make it more responsive.
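Tip 2's date arithmetic can be sketched in a few lines of Python. This is just an illustration of the rule described above (the 5-day window is as it stood when I wrote this); days_from_origin is how many days the train has already been running when it reaches your boarding station:

```python
from datetime import date, timedelta

def tatkal_opens(boarding_date, days_from_origin):
    """Day on which Tatkal booking opens: 5 days before the train leaves
    its ORIGIN station, which can be a day or more before it reaches you."""
    origin_departure = boarding_date - timedelta(days=days_from_origin)
    return origin_departure - timedelta(days=5)

# Tamil Nadu Express example: reaches Nagpur on June 10 but left Chennai
# on June 9, so booking opened on June 4, not June 5.
print(tatkal_opens(date(2008, 6, 10), days_from_origin=1))  # 2008-06-04
```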
So if anyone has any other useful tips, please do share.
Wednesday, June 4, 2008
Peter Winkler : The Puzzle Master
A couple of years back I came to know about Peter Winkler, a mathematics professor, most likely when I came across his piece "Seven Puzzles You Think You Must Not Have Heard Correctly".
It contains seven really nice puzzles with solutions. I thought I had heard most of them before, but one was especially interesting and turned out to be one of the toughest puzzles I have ever solved.
After that I came to know of a couple of books he has written on puzzles:
1. Mathematical Puzzles: A Connoisseur's Collection
2. Mathematical Mind-Benders
Unfortunately they are not available in Indian editions, but luckily a large number of pages of the first book can be viewed at both Amazon and books.google.com. It is certainly a treasure trove for puzzle seekers like me, containing some of the most amazing puzzles you will ever come across. He also gives a personal perspective on most puzzles by explaining in the solutions how he came across them; I really enjoy reading those parts too. For the sheer quality of its puzzles this book is unbeatable, and the puzzles are neatly divided into sections like Algorithms, Geometry, Probability and so on. I would highly recommend it.
Some days back a friend pointed me to another paper by Winkler, "Five algorithmic puzzles".
As the name says, this contains five puzzles, all with a common theme: you are given a process, such as iteratively breaking a chocolate bar along its boundaries, and you have to prove that the process ends in finitely many steps and in a specific configuration. To solve these you have to find some quantity that only increases (or only decreases) during the process and is bounded, so the process cannot go on forever. These are again really tough puzzles, accompanied by solutions and Winkler's interesting notes.
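To make that idea concrete, here is a small Python sketch (my own illustration, not from the paper) of the chocolate-bar process: each break turns one piece into two, so the piece count rises by exactly one per break, and an m-by-n bar therefore always takes exactly m*n - 1 breaks no matter which breaks you choose.

```python
import random

def breaks_needed(m, n):
    """Break an m x n chocolate bar along random grid lines until only
    1 x 1 pieces remain, counting breaks. Each break replaces one piece
    with two, so the piece count is the increasing bounded quantity."""
    pieces = [(m, n)]
    breaks = 0
    while any(w * h > 1 for w, h in pieces):
        # pick any piece that can still be broken
        i = random.choice([k for k, (w, h) in enumerate(pieces) if w * h > 1])
        w, h = pieces.pop(i)
        if w > 1 and (h == 1 or random.random() < 0.5):
            cut = random.randint(1, w - 1)          # vertical break
            pieces += [(cut, h), (w - cut, h)]
        else:
            cut = random.randint(1, h - 1)          # horizontal break
            pieces += [(w, cut), (w, h - cut)]
        breaks += 1
    return breaks

# The count never depends on the random choices made: always m*n - 1.
print(breaks_needed(4, 6))  # 23
```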
Anyone interested in other recreational mathematics / puzzle writing would do well to read these people:
1. Martin Gardner (Highly Recommended): Perhaps the first person to write so extensively about puzzles and recreational mathematics. Immensely interesting and readable.
2. Raymond Smullyan: Mostly writes about Logical puzzles and interesting stuff in Mathematical Logic.
3. Ian Stewart: I don't think he is into puzzles, but he writes about complex yet interesting mathematical topics and makes them easy to understand and exciting. After reading his books you will want to be a mathematician!!
Again, if you go to Google Books, you will find that some of the works by these authors (though not Smullyan's) are available for limited preview.
Thursday, May 29, 2008
An algorithm problem inspired by twitter woes
Thinking a bit more about the Twitter scalability problem I posted about earlier, I wondered whether it is possible to come up with a solution that is somehow a mix between optimising for reads and optimising for writes. What follows are my thoughts on that. First, a couple of caveats.
Caveat 1: Of course this is done purely as an intellectual exercise, and I am not sure whether something like this would actually be manageable in a real-world setting.
Caveat 2: I have oversimplified the problem a lot.
So, to motivate with an example, let's say we notice that a large number of people follow both user A and user B. Then apart from the individual datastores for A and B where we store their tweets, we can also create a datastore for the set {A,B} where the tweets of both users go. Now for any user who follows both A and B, we can get recent tweets directly from this new datastore rather than fetching individual tweets from A and B and then merging them. This is a mixed strategy: we don't create an inbox for every user, nor do we store tweets only in the posting user's datastore. Instead we figure out which clusters of people are followed together by many users (similar to frequent itemset mining in data mining) and create separate datastores for those clusters as well.
Now let me abstract this out and try to come up with a nice algorithmish problem. We have a set of items S. We are also given a set U of subsets of S. For example, S = {a,b,c,d} and U = {{a,b}, {a,b,c}, {a,c}, {b,c,d}}. We want to come up with a set T of subsets of S such that every subset in U can be covered by some subsets in T. Here I am unclear about how to define "cover". I have thought of three candidate definitions; to explain them, take a set U1 = {a,b,c} in U that we want covered, and say sets T1, T2 cover U1. The candidate definitions are:
1) U1= T1 union T2 and T1 and T2 do not have any element in common. Eg: T1={a} T2={b,c}
2) U1=T1 union T2 but now T1,T2 can have common elements. Eg T1={a,b} T2={b,c}
3) U1 is a subset of T1 union T2. Eg T1={a,d} T2={b,c}.
So which definition makes more sense, and how would each change the solution? I have not thought about this a lot, but let's assume we have settled on some definition. The aim, then, is to find T.
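The three candidate definitions can be pinned down in code. A quick sketch (names are mine):

```python
def covers(U1, Ts, definition):
    """Do the subsets in Ts cover U1? Three candidate readings of 'cover'."""
    union = set().union(*Ts)
    if definition == 1:       # exact disjoint split of U1
        return union == set(U1) and sum(len(t) for t in Ts) == len(union)
    if definition == 2:       # exact union, overlaps allowed
        return union == set(U1)
    if definition == 3:       # U1 merely contained in the union
        return set(U1) <= union
    raise ValueError("definition must be 1, 2 or 3")

U1 = {'a', 'b', 'c'}
print(covers(U1, [{'a'}, {'b', 'c'}], 1))        # True
print(covers(U1, [{'a', 'b'}, {'b', 'c'}], 1))   # False: 'b' repeats
print(covers(U1, [{'a', 'b'}, {'b', 'c'}], 2))   # True
print(covers(U1, [{'a', 'd'}, {'b', 'c'}], 3))   # True
```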
If we set T = U, this corresponds to the read-optimised solution where we have a datastore for every single user's inbox. The write-optimised solution is the one where T contains only singleton sets, one for each item in S. To account for the costs of reads and writes, let us say we pay a cost "r" every time we pick a subset from T to assemble a set in U; in our example we would pay 2r to cover U1. Similarly, we pay a cost "w" for every item of S placed into a subset of T; if item "a" belongs to 4 subsets in T, we pay 4w. The aim now is to create T so as to cover all sets in U at minimum total cost, where the total cost is the cost of creating T plus the cost of covering all the sets in U. Of course there might be multiple ways to cover a set in U from sets in T, but to minimise cost we would cover using as few sets as possible.
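This cost model is easy to write down as a sketch. Here the covering choice for each set in U is taken as an input, since choosing it well is itself part of the problem:

```python
def total_cost(T, covers_of_U, r, w):
    """The cost model above: pay w for every item placed into a subset
    of T (write cost), and r for every subset of T picked when
    assembling a set in U (read cost). covers_of_U maps each set in U
    to its chosen list of covering subsets from T."""
    write_cost = w * sum(len(t) for t in T)
    read_cost = r * sum(len(cover) for cover in covers_of_U.values())
    return write_cost + read_cost

U = [frozenset('ab'), frozenset('abc'), frozenset('ac'), frozenset('bcd')]

# Read-optimised extreme T = U: every set covers itself with one read.
print(total_cost(list(U), {u: [u] for u in U}, r=1, w=1))  # writes 10 + reads 4 = 14

# Write-optimised extreme: T is all singletons, covers are element-wise.
T = [frozenset(x) for x in 'abcd']
print(total_cost(T, {u: [frozenset(x) for x in u] for u in U}, r=1, w=1))  # writes 4 + reads 10 = 14
```

With r = w = 1 the two extremes happen to tie on this toy U; skewing r against w is what makes one strategy, or a mix, win.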
So how would you solve this? Is this problem NP-hard? I suspect it might be; even covering a single set in U with the fewest sets from T seems difficult. Can you think of a nice heuristic? And if we look at this not as an algorithms problem but as a systems problem, where we want to use this kind of architecture for storing Twitter's tweets, how would you design such a system? You do not have to come up with an exact solution, just a good heuristic. How would you shard the data to optimise for reads and writes? You would like to keep subsets of T that share many elements on the same shard to optimise writes; on the other hand, you would like to keep all the subsets of T needed to assemble many sets in U on the same machine to optimise reads. And how would you decide which sets from T are needed to cover a given set in U, and so on?
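For the sub-problem of covering a single set in U with few sets from T (under definition 3), the natural starting point is greedy set cover, the classical heuristic with a ln(n)-factor approximation guarantee. A sketch:

```python
def greedy_cover(U1, T):
    """Greedy set cover: repeatedly take the subset of T covering the
    most still-uncovered items of U1. A heuristic, not an exact solver."""
    uncovered = set(U1)
    chosen = []
    while uncovered:
        best = max(T, key=lambda t: len(uncovered & t))
        if not uncovered & best:
            return None          # U1 cannot be covered from T at all
        chosen.append(best)
        uncovered -= best
    return chosen

T = [frozenset(x) for x in 'abcd'] + [frozenset('ab'), frozenset('cd')]
print(greedy_cover({'a', 'b', 'c'}, T))
```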
So any thoughts???
Wednesday, May 28, 2008
Thoughts on twitter scalability problem
After I read the original Twitter post on their architecture woes, I searched the blogosphere a bit more for insight into what Twitter's architecture might be and wherein lies the problem. Most of what I read agrees that the problem is due to the large number of people a person follows or is followed by. Because of this, during special events a large number of tweets are posted, which results in heavy load on the databases. Where the opinions differ is on whether this heavy database load is read load or write load. To explain, let me quote from here:
"Nothing is as easy as it looks. When Robert Scoble writes a simple “I’m hanging out with…” message, Twitter has about two choices of how they can dispatch that message:
- PUSH the message to the queue’s of each of his 6,864 followers, or
- Wait for the 6,864 followers to log in, then PULL the message."
Doing 1 means that writes will be expensive, but when a user logs in, the tweets of all the users he is following can be shown just by reading his local queue. Reads would be pretty fast. This is like email.
Doing 2 means writes would be fast, since a tweet is written to just one place. But on a read, if a user follows a large number of users, Twitter first has to look up all the users this person follows, then look up the messages of all those users, combine them, and generate the page for the logged-in user. All these users would typically reside on different databases, so a read hits many databases and the data has to be combined across machines. Reads here would be quite expensive.
So what does Twitter do? This guy says that Twitter is optimising writes, while here it is mentioned that Twitter is optimising reads. My personal hunch is that Twitter stores each tweet just once and pays on the read, because you can use caching to make reads much faster, but with writes you are stuck: they have to go to the database. Also, isn't this similar to what, say, Google does when returning results for your search? It gets matching pages from various machines (because the index is split across multiple machines), then combines them, ranks them and generates the final results page. Of course Google has one advantage: a large number of queries are probably repeated within a day, so their results can be served from the cache instantly. But each Twitter user follows a different set of people, so it makes no sense to cache an individual user's page. Rather, the tweets need to be cached and the combination done on the fly, as is pointed out here.
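The two options in the quote can be sketched as a toy in-memory model. To be clear, all names and structures here are my own illustration of the trade-off, not Twitter's actual design:

```python
from collections import defaultdict

tweets = defaultdict(list)    # author -> [(time, text)], written once (option 2)
inboxes = defaultdict(list)   # user -> [(time, text)], one copy each (option 1)
followers = defaultdict(set)  # author -> who follows them
following = defaultdict(set)  # user -> whom they follow

def post_push(author, t, text):
    """Option 1: PUSH. Expensive write: fan out a copy to every follower."""
    for f in followers[author]:
        inboxes[f].append((t, text))

def post_pull(author, t, text):
    """Option 2: store once. Cheap write: one append to the author's store."""
    tweets[author].append((t, text))

def timeline_pull(user, k=20):
    """Option 2's read: gather the streams of everyone `user` follows and
    merge them. Each author's list may live on a different database,
    which is exactly where the read expense comes from."""
    merged = []
    for a in following[user]:
        merged.extend(tweets[a])
    merged.sort(reverse=True)  # newest first
    return [text for _, text in merged[:k]]

followers['scoble'] = {'u%d' % i for i in range(6864)}
following['u1'] = {'scoble'}
post_pull('scoble', 1, "I'm hanging out with...")
print(timeline_pull('u1'))   # option 2 read: ["I'm hanging out with..."]
post_push('scoble', 2, "another update")
print(len(inboxes['u1']))    # 1 copy here, but the write touched 6,864 queues
```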
Monday, May 19, 2008
Two Puzzles
1. Given a sequence of r*s+1 distinct integers (r>=1, s>=1) prove that there exists either an increasing subsequence of length r+1 or a decreasing subsequence of length s+1.
It's a nice puzzle with a clean, elegant solution. Though it is not really an aha! puzzle, the solution does not involve much mathematical trickery. The way I got the solution was to first fix r at 2, so that we are looking at sequences of length 2s+1. Once you find the solution for this case, it is easy to generalize to arbitrary r. But I think this is a tough puzzle, and I felt really good when I solved it :)
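(Incidentally, this claim is the classical Erdős–Szekeres theorem.) If you want to sanity-check the statement for small r and s before attempting a proof, a brute-force sketch using a simple O(n^2) longest-subsequence DP:

```python
from itertools import permutations

def longest(seq, cmp):
    """Length of the longest subsequence of seq ordered by cmp (O(n^2) DP)."""
    best = [1] * len(seq)
    for i in range(len(seq)):
        for j in range(i):
            if cmp(seq[j], seq[i]):
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)

def check(r, s):
    """Exhaustively verify the claim on every sequence of r*s + 1 distinct
    values (permutations suffice, since only the relative order matters)."""
    n = r * s + 1
    return all(
        longest(seq, lambda a, b: a < b) >= r + 1
        or longest(seq, lambda a, b: a > b) >= s + 1
        for seq in permutations(range(n))
    )

print(check(2, 2))  # True; and r*s + 1 is tight: (2,1,4,3) has neither
```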
2. There are 100 seats in an airplane and 100 passengers, each of whom has been assigned a boarding pass with their seat number. They board the plane one by one. Unfortunately the first passenger has lost his boarding pass, so he sits in a random seat. Every subsequent passenger takes his own seat if he finds it vacant; otherwise he sits in a random empty seat. What is the probability that the last passenger finds his seat vacant?
I had heard this puzzle a couple of years back and am not sure whether I was able to solve it then. When I reread it yesterday, I could not recall the solution, but I did remember that there is a simple, elegant one. After some thought and much frustration I finally got the answer, and felt relieved more than anything else. But the solution is nice. You can also generalize it to find the probability of the kth passenger finding his seat vacant.
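If you want to check your answer empirically (skip this if you would rather solve it cold), a quick Monte Carlo sketch of the boarding process:

```python
import random

def last_seat_free(n=100):
    """One boarding: passenger i owns seat i. Returns True if the last
    passenger (n-1) finds his own seat vacant."""
    free = set(range(n))
    free.remove(random.randrange(n))                  # passenger 0 sits anywhere
    for p in range(1, n - 1):
        if p in free:
            free.remove(p)                            # own seat is vacant
        else:
            free.remove(random.choice(sorted(free)))  # random empty seat
    return (n - 1) in free

trials = 20_000
estimate = sum(last_seat_free() for _ in range(trials)) / trials
print(round(estimate, 2))
```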
Friday, May 16, 2008
Google Game
One can figure out how correlated a term B is to another term A by using Google.
cor_B(A) = Number of search results for query A B / Number of search results for query B
So for example, a search for the query 'cricket bat' returns approximately 746,000 results,
while 'bat' alone returns about 105,000,000, which means that
cor_bat(cricket) = 1/141 approximately, which can be interpreted as: out of about 141 pages containing the word bat, only 1 refers to a cricket bat. Now for the term lbw:
cor_lbw(cricket) = 1730000/2470000 = 1/1.4 (approx). This is understandable, because lbw hardly has any meaning outside of cricket jargon.
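The definition boils down to a one-line ratio. Plugging in the 'cricket bat' counts from above:

```python
def correlation(count_ab, count_b):
    """cor_B(A): results for the query 'A B' divided by results for 'B'."""
    return count_ab / count_b

ratio = correlation(746_000, 105_000_000)   # 'cricket bat' vs 'bat'
print(f"1/{1 / ratio:.0f}")                 # 1/141
```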
Now that the definitions are clear, the game is to find the term with the highest correlation to each of the following terms:
1. cricket
2. java
3. radiohead
4. einstein
So let's see how well you can do!! And of course B = A would give you the maximum correlation, but that is not allowed.