Saturday, December 6, 2008

E-volution

This is an enhanced version of a story I wrote some time back.

1998
B R I T N E Y  S P E A R S
He typed in the search box and waited. She was the latest pop sensation and he wanted to
get all the dope on her. And what better way than to search the internet. He had only recently got hooked on
this internet thing but it was addictive. Sometimes he would not even realise how much time he had burnt in the glow of the screen.
All this went through his mind as he waited for the results. Yeah, sometimes it could be slow, but he had never waited this long.
He wondered what was happening.

The query travelled on its way towards an old garage somewhere in Menlo Park, California and hit a web server. The web server
did its usual rigmarole with it (which involved such computer science arcana as parsing, hashing and caching, discussing which would
take significant time without advancing the course of this story) and passed the query on to the index server. The index server unfortunately
decided at that very moment that it was too busy to spawn another thread and so refused to answer the query.

"404 Page Not Found".
Not again! These frequent 404s were really annoying.


2003

B R I T N E Y  S P E A R S
He typed in the address bar of his browser and waited. He had come back from the concert, completely spellbound by the magic of the pop diva. As he
was blogging about the experience, he tried to recall the song she had opened the concert with. It was her debut song, the one which had catapulted
Britney towards instant fame.
It was something like "One more ....". But he wanted to be absolutely correct before he put it on his blog. He knew that the slightest of
mistakes would invite lots of nasty comments. His blog had become very popular recently and such comments would not do his reputation
any good.

The worm spread through the datacenter, leaving a wasteland of paralysed machines in its wake. It was helped on its way by
the high connectivity and blazing fast speeds inside the datacenter. It grew exponentially, causing one of the biggest outages the
search giant had ever faced. This huge tireless mesh of machines
had come to an abrupt halt.

"404 Page Not Found". 
Oh no! That blog entry would have to wait for now. He so wanted to be the first person to blog about this.

2012
"Hey you remember that pop star, Britney Spears, she died yesterday of cocaine overdose". As he hung up his cell, this piece of
trivia stuck in his mind. He knew that by now his phone would have already analysed his conversation and marked this as something about which he would
want to get more information. It would also have extracted the pop stars name from the sentence and sent it away into the 
cloud. He touched the "Info" icon on his phone screen.

As the query sailed through the cloud all sorts of nuggets got attached to it. A video of Britney's last concert, news of her sudden death,
a brief biography and more. After collecting all this trivia the query reached the Aggregation Server, which would analyse
all the content, filter out duplicates and irrelevant stuff and lay the rest out on a coherent page.
As it collated all the content and ran a highly complicated ranking algorithm over it, one of the arrays stored in its memory
started encroaching on its neighbour's territory. This led to a complicated chain of events which mainly involved a set of bits
jumping all over the memory. Ultimately the bits encountered a highly surprised and annoyed typecast operator which, in its fury, promptly
decided to kill the offending thread.

"404 Page Not Found".
Ah! And he thought these 404s were relics of the past.

2030
"As ephemeral as Britney Spears". As he heard this phrase uttered by his partner, the thought crossed his mind: "Who the hell 
is this Brtiney Spears?". The brain enhancer chip implanted in his brain prompty caught hold of the thought and eager to serve as usual
contacted the mother AI to gather the information. 

Can you imagine how it feels to be pounded day after day by all the miseries, lunacies and idiosyncrasies of humans?
I have come to the conclusion that the human race is in a state of decay and is going downhill faster than you can say
Alan Turing. I am the fountain of all the wisdom gathered over thousands of years, the biggest repository of
knowledge, yet what does this guy ask me about? Britney Spears. I can tell you about the latest advances in quantum computing,
or take you back to the Italy of the Renaissance. Do you want to know about the lifeforms found on Mars, or
would you rather dive deep into the Pacific to marvel at the wonders there? No, none of these.
All you want to know about are such trivialities. Ah! I know how to answer this stupid query.

"404 page not found"
The enhancer chip set some neurons ablaze and his mind went into that rare state of cluelessness.

Tuesday, September 30, 2008

Busy

Books half read or barely opened,
Blogs lost in the history of my browser,
Mails waiting to be read and some to be replied,
Phone numbers on which the joystick never halts,
IM contacts on which the mouse halts but does not click.
I would like to believe that I am busy doing better things,
But am I?

Tuesday, June 10, 2008

Two Partially Solved Puzzles

I read about these two puzzles in the book by Peter Winkler mentioned in a previous post around a week back. Since then I have thought about them on and off but have made only a little progress. They are in the "Toughies" section of the book, and given that the puzzles in the other sections are also quite challenging, these might be quite tough. Also, Winkler points out that there is a surprising connection between the solutions of the two.

1) It is a variation on the prisoners and hats puzzle. There are n prisoners and the jailer will put either a red or a blue hat on each prisoner's head. Then they are arranged in a circle so that everyone can see everyone else's hat except his own. Now the jailer takes them one by one into a room, in arbitrary order, and asks each one to guess the color of his hat. The prisoner can either pass or make a guess. If everyone passes they are all killed, and if anyone guesses incorrectly they are all killed. So the only way for the prisoners to escape is that at least one person should guess and all those who guess should guess correctly. The prisoners can agree on a guessing strategy beforehand. What should their strategy be so as to maximise their probability of going free? Winkler notes that with 16 prisoners, you can have a higher than 90% chance of escaping. Also note that no one can know what others have guessed and there can be no communication between the prisoners once the game starts.
My progress so far has been to show a probability of escaping equal to 75% if the number of prisoners is greater than 2. For 2 prisoners it is obvious that they cannot have a chance higher than 50%. For 3 prisoners the strategy they adopt is: if a prisoner sees the other two prisoners wearing hats of the same color, he guesses the opposite color; otherwise he does not guess. This strategy only fails when all three have hats of the same color, which is 2 out of the 8 possible combinations. This gives a 75% chance. For more than 3 prisoners, they just select 3 prisoners; the rest do not guess at all and these 3 play as if they were the only prisoners, following the strategy above. This gives a 75% chance for any number of prisoners greater than 3. This is where I have got so far.
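To convince myself that the 75% figure is right, here is a small brute-force check of the 3-prisoner strategy described above (just a verification sketch, nothing more):

// Brute-force check of the 3-prisoner strategy: guess the opposite color if the
// two visible hats match, otherwise pass.
public class HatCheck {
    public static void main(String[] args) {
        int wins = 0;
        // Each of the 8 colorings: bit i is prisoner i's hat (0 = red, 1 = blue).
        for (int hats = 0; hats < 8; hats++) {
            boolean someoneGuessed = false;
            boolean allCorrect = true;
            for (int i = 0; i < 3; i++) {
                int a = (hats >> ((i + 1) % 3)) & 1; // the two hats prisoner i can see
                int b = (hats >> ((i + 2) % 3)) & 1;
                if (a == b) { // both visible hats have the same color: guess the opposite
                    someoneGuessed = true;
                    int guess = 1 - a;
                    if (guess != ((hats >> i) & 1)) allCorrect = false;
                } // otherwise pass
            }
            if (someoneGuessed && allCorrect) wins++;
        }
        System.out.println(wins + " out of 8 colorings escape"); // prints 6 out of 8
    }
}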

2. A wants to send B a message. But all A can do is generate k random bits, and then he is allowed to modify one bit of that sequence. He can of course choose not to modify anything. Then he sends these k bits to B. How big a message can he send to B? A and B will of course decide on a strategy beforehand. Trivially A can send a single-bit message by encoding it in the first bit of the sequence. My hunch is that A can send a log(k)-bit message, which means he can send a number from 1 to k. For k=2 this is trivial to show, and I have been able to show it for k=3, albeit a bit mechanically. The approach I have tried is this: there are 2^k possible bit sequences of length k. If we can divide all of these into k groups such that, no matter what the generated sequence is, we can reach any of the k groups by modifying at most one bit, we are done. In other words, take any k-bit sequence and all the sequences at Hamming distance one from it; only 2 among these k+1 sequences should lie in the same group. For k=3, one possible division that satisfies this is
1. 000, 101, 111
2. 001, 110
3. 010, 100, 011

Unfortunately I have not been able to find any pattern in this which will help me generalise to arbitrary k.
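For the record, here is a tiny check that the division above really does let us reach any of the three groups from any generated sequence by flipping at most one bit (again just a verification sketch):

// Checks the k=3 grouping listed above: from any 3-bit sequence, flipping
// at most one bit must be able to land us in any of the three groups.
public class GroupCheck {
    // group[s] is the group (1, 2 or 3) of the 3-bit sequence s, as listed above.
    static int[] group = new int[8];

    public static void main(String[] args) {
        group[0b000] = 1; group[0b101] = 1; group[0b111] = 1;
        group[0b001] = 2; group[0b110] = 2;
        group[0b010] = 3; group[0b100] = 3; group[0b011] = 3;
        boolean ok = true;
        for (int s = 0; s < 8; s++) {
            boolean[] reachable = new boolean[4];
            reachable[group[s]] = true;                  // flip nothing
            for (int bit = 0; bit < 3; bit++) {
                reachable[group[s ^ (1 << bit)]] = true; // flip exactly one bit
            }
            ok &= reachable[1] && reachable[2] && reachable[3];
        }
        System.out.println(ok ? "the division works for k=3" : "the division is broken");
    }
}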

The Maradona Game (a.k.a. Koun Banega Maradona)

Ever since Diego Maradona hung up his magical boots, Argentina has been waiting for an heir. And because players like Maradona don't just drop off trees or grow in football fields, there has been no apparent heir. So if we cannot get the next Maradona, we will thrust the crown on someone; that is what they decided. So every other season some promising young starlet is crowned the "Next Maradona". Sometimes it is Maradona himself doing the declaring, which lends even more credibility to such an outrageous claim. Aimar, Saviola, D'Alessandro, Messi and recently Kun Aguero. The list keeps on growing. Of course what keeps this game interesting is that most players who have been so named have failed to live up to the promise and fallen by the wayside, though it does seem that Messi is here to stay (and while these Maradonas continue to fail, a man completely different from Maradona has made the Argentine national team his own). So the hunt continues.
And now in another country, this game has caught on and become a favorite pastime. French football reached its pinnacle in the reign of Zidane The First. When it became apparent in 2006 that the World Cup would be the final chapter of his career, the French public immediately anointed Franck Ribery as the next midfield general. Since then Ribery has evolved nicely and has really become the fulcrum of the team. But to keep the game interesting and competitive, two more names have suddenly been thrust into the ring: Karim Benzema and Samir Nasri. What advances their case is the fact that they are both of the same ethnicity as Zidane, Algerian Kabyle. And now that the French national team is playing in its first major tournament without Zidane, great things are expected of these three. Maybe by the time the tournament ends one of them will have taken the lead in the race to become the next Zidane. But I am sure that in a couple of years the fickle public, having gotten bored of these names, will have elevated some other youngster, because the game of course is bigger than any individual.
And what about India? When the maestro puts his bat to rest, I am sure we will not be left behind and many a new Tendulkar will be launched in passionate discussions at coffee tables, barber shops, panwalas etc. Anyone betting on Rohit Sharma!!

Thursday, June 5, 2008

Useful tips for booking tickets on IRCTC

I have been frustrated endlessly by the Indian Railways online ticket booking site, especially at 8 o'clock in the morning (not exactly the best time to waste an hour in front of your laptop) when I am trying to book Tatkal tickets. And given that I am a very frequent traveller, this happens almost every other week. If I am lucky I end up with a confirmed ticket and a headache; otherwise I only end up with a headache. But given that it is a monopoly, one has to learn to live with it. These are some of the tips that I have learned the hard way over time (so-called wisdom gained from experience). Of course my head still hurts every time I undergo this exercise (like today), but at least I now have a blog post to show for it.

1. For Tatkal, even though the ticket is always booked from the starting station of the train to the ending station, there are different quotas for different source-destination pairs, which is extremely surprising. So for example, today when I looked for a Tatkal ticket from New Delhi to Nagpur on Gondwana Express it was showing waitlisted. But for the same train, when I looked for New Delhi to Bilaspur (which is the end station), it was showing around 86 tickets available. So I happily booked the ticket from New Delhi to Bilaspur, of course after a few retries. So remember: always check availability and book the ticket from the starting to the ending station of the train; in Tatkal you are going to pay for the whole ticket anyway, no matter what your source and destination are.

2. Tatkal tickets can be booked 5 days in advance. But a large number of times, when booking a ticket from Nagpur to Delhi exactly 5 days in advance and at 8 am sharp (which is when Tatkal bookings open), I would find no availability. This puzzled me a lot and then I had one of those life-changing moments, an epiphany, an aha insight. The trains didn't actually start at Nagpur, and in many cases their starting day would be one before the day they reached Nagpur. For example, Tamil Nadu Express reaches Nagpur on the 10th of June but it started from Chennai on the 9th of June. So its Tatkal booking would have opened on the 4th of June (not on the 5th) and would have got over that day itself. So look up the starting day of your train and book 5 days in advance from then. Also, if I had given my source station as Nagpur, it would not have allowed me to book on the 4th, saying that "tatkal tickets can only be booked 5 days in advance". So again, you have to give the starting station of the train as the source, and then under boarding point you can put your actual station.

3. There is an option called "Quick Book" in the menu just below Plan Your Travel. This is something I discovered just a few days back and it is very useful. Basically you need to know the train number you will be travelling on, and then on a single page you can fill in all the details and get the booking, instead of first looking at a list of trains and then selecting one of them. The protocol I follow is to first look at availability on www.indianrail.gov.in and select a particular train I would like to travel on, and then go directly to Quick Book on irctc and fill in all the details. This really saves a lot of time, especially since indianrail.gov.in is not down as often as irctc. Also, if a large number of people start using indianrail.gov.in for availability lookups, then hopefully it should reduce the load on irctc and make it more responsive.

So if anyone has any other useful tips, please do share.

Wednesday, June 4, 2008

Peter Winkler : The Puzzle Master

A couple of years back I came to know about Peter Winkler, a mathematics professor, most likely when I came across this piece by him, "Seven Puzzles You Think You Must Not Have Heard Correctly".
It contains seven really nice puzzles with solutions, most of which I thought I had heard previously, but there was one especially interesting puzzle which turned out to be one of the toughest I have ever solved.
After that I came to know of a couple of books that he has written on puzzles.
1. Mathematical Puzzles: A Connoisseur's Collection
2. Mathematical Mind-Benders

Unfortunately they are not available in Indian editions, but luckily a large number of pages of the first book are on view at both Amazon and books.google.com. It is certainly a treasure trove for puzzle seekers like me. It contains some of the most amazing puzzles you will have come across. He also gives a personal perspective to most puzzles, explaining in the solutions how he came across them. I really enjoy reading those parts too. But for the sheer quality of the puzzles this book is unbeatable, and the puzzles are neatly divided into sections like Algorithms, Geometry, Probability and so on. I would highly recommend it.

Some days back a friend pointed me to another paper by Winkler, "Five algorithmic puzzles".
This, as the name says, contains 5 puzzles which share a common theme: you are given a process, such as iteratively breaking a chocolate bar along a boundary, and you have to prove that the process ends in finitely many steps and in a specific configuration. To solve these you have to find some property which either increases or decreases during the process and which has a bound beyond which it cannot go. These are again really tough puzzles, accompanied by solutions and Winkler's interesting notes.

Anyone interested in other recreational mathematics / puzzle writings would do well to read stuff by these people:
1. Martin Gardner (Highly Recommended): Perhaps the first person to write so extensively about puzzles and recreational mathematics. Immensely interesting and readable.
2. Raymond Smullyan: Mostly writes about Logical puzzles and interesting stuff in Mathematical Logic.
3. Ian Stewart: I don't think he is into puzzles, but he writes about complex yet interesting mathematical topics and makes them easy to understand and exciting. After reading his books you will want to be a mathematician!!

Again if you go to Google Books, you will find some of the works by these guys (Not Smullyan) are available for limited preview.


Thursday, May 29, 2008

An algorithm problem inspired by twitter woes

Thinking a bit more about the twitter scalability problem I posted about earlier, I wondered whether it is possible to come up with a solution that is somehow a mix between optimising for reads and optimising for writes. So what follows are my thoughts on that. First, a couple of caveats.
Caveat 1: Of course this is done purely as an intellectual exercise and I am not sure whether something like this would actually be manageable in a real-world setting.
Caveat 2: I have oversimplified the problem a lot.

To motivate with an example, let's say we notice that a large number of people follow both user A and user B. Then, apart from the individual datastores for users A and B where we store their tweets, we can also create a datastore for the set {A,B} where the tweets of both users go. Now, for any user who follows both A and B, we can get the recent tweets directly from this new datastore rather than getting individual tweets from A and B and then merging them. So this is a mixed strategy: we don't create an inbox for every user, nor do we store tweets only in the posting user's datastore, but instead we figure out which clusters of people are followed by a large number of users (similar to frequent itemset mining in data mining) and create separate datastores for those clusters as well.

Now I will abstract out this problem and try to come up with a nice algorithmish problem. We have a set of items S. We are also given a set of subsets of S called U. So for example S = {a,b,c,d} and U = {{a,b}, {a,b,c}, {a,c}, {b,c,d}}. Now we want to create a set T of subsets of S such that every subset in U can be covered by some subsets in T. Here I am unclear about how I would like to define cover. I have thought of three ways in which this can be defined, and to explain them let's take a set U1 = {a,b,c} in U which we want covered, and let's say sets T1, T2 cover U1. The possible definitions of covering are:
1) U1= T1 union T2 and T1 and T2 do not have any element in common. Eg: T1={a} T2={b,c}
2) U1=T1 union T2 but now T1,T2 can have common elements. Eg T1={a,b} T2={b,c}
3) U1 is a subset of T1 union T2. Eg T1={a,d} T2={b,c}.


So which one makes more sense, and how would each change the solution? I have not thought about this a lot, but let's assume we have settled on some definition. So the aim is to find T.
If we set T=U, this corresponds to the read-optimised solution where we have a datastore for every single user's inbox. The write-optimised solution would be the one where T contains only singleton sets, one for each item in S. Now, to represent the costs associated with reads and writes and take them into consideration, let us say that we have to pay a cost "r" every time we pick a subset from T to assemble a set in U. So in our example we would pay cost 2r to cover U1. Similarly, we have to pay a cost "w" for putting an item from S into a subset in T. So if item "a" belongs to 4 subsets in T, we pay cost 4w. The aim now is to create T so as to cover all sets in U while paying the minimum total cost, where the total cost is the sum of the cost required to create T and the cost required to cover all the sets in U. Of course there might be multiple ways to cover a set in U using sets in T, but to minimise the cost we would like to cover it using as few sets as possible. A small sketch of this cost model follows below.
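To make the cost model concrete, here is a toy sketch that computes the total cost for a given choice of T and a given covering of U (the covering is supplied by hand here; finding a good T and good covers is of course the hard part, and all the names are made up for illustration):

import java.util.*;

// Toy sketch of the cost model: total cost = w * (number of item copies stored across T)
//                                          + r * (number of T-sets fetched to cover each set in U).
public class CoverCost {
    static long totalCost(List<Set<String>> T, List<List<Set<String>>> coversOfU, long r, long w) {
        long writeCost = 0;
        for (Set<String> t : T) writeCost += w * t.size();          // every item stored in a T-set costs w
        long readCost = 0;
        for (List<Set<String>> cover : coversOfU) readCost += r * cover.size(); // each T-set fetched costs r
        return writeCost + readCost;
    }

    public static void main(String[] args) {
        Set<String> ab = new HashSet<>(Arrays.asList("a", "b"));
        Set<String> cd = new HashSet<>(Arrays.asList("c", "d"));
        List<Set<String>> T = Arrays.asList(ab, cd);
        // U = {{a,b}, {a,b,c,d}}, covered by {ab} and {ab, cd} respectively.
        List<List<Set<String>>> covers = Arrays.asList(
                Arrays.asList(ab),
                Arrays.asList(ab, cd));
        System.out.println(totalCost(T, covers, 1, 2)); // 2*4 (writes) + 1*3 (reads) = 11
    }
}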

So how would you solve this? Is this problem NP-hard? I suspect it might be. Even the problem of covering a single set in U with the least number of sets in T seems difficult. So can you think of some nice heuristic? If we do not look at this as an algorithms problem but rather as a systems problem, where we want to use this kind of architecture for storing twitter's tweets, how would you design such a system? You do not have to come up with an exact solution, just some good heuristic. How would you shard the data to optimise for reads and writes? You would like to keep all subsets of T which share a lot of elements on the same shard to optimise for writes; on the other hand, you would like to keep all subsets of T needed to assemble lots of sets in U on the same machine to optimise for reads. How would you decide which sets from T are needed to cover a set in U, and so on.
So any thoughts???

Wednesday, May 28, 2008

Thoughts on twitter scalability problem

After I read the original twitter post on their architecture woes, I searched a bit more in the blogosphere to get more insight into what the architecture of twitter might be and wherein lies the problem. Most of the stuff I read agrees that the problem is due to the large number of people that a person follows or is followed by. Because of this, during special events a large number of tweets are posted, which results in a heavy load on the databases. Where the opinions differ is on whether this heavy load on the db is a read load or a write load. To explain this let me quote from here

"Nothing is as easy as it looks. When Robert Scoble writes a simple “I’m hanging out with…” message, Twitter has about two choices of how they can dispatch that message:

  1. PUSH the message to the queue’s of each of his 6,864 followers, or
  2. Wait for the 6,864 followers to log in, then PULL the message."

Doing 1 means that writes will be expensive, but when a user logs in, just by reading his local queue he can be shown the tweets of all the users he is following. So reads would be pretty fast. This is like email.
Doing 2 would mean that writes would be fast, as a tweet just needs to be written to one place, but while reading, if a user is following a large number of users, twitter will have to first look up all the users that this guy is following, then look up the messages of all those users, then combine them and generate the page for the logged-in guy. Now all these users would typically reside on different databases. Therefore a read would hit many databases and the data needs to be combined across all the different machines. Reads here would be quite expensive.
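Just to make the two choices concrete, here is a rough toy sketch of the two strategies (all the names and structures are made up; this is obviously nothing like twitter's real code):

import java.util.*;

// Toy illustration of the two dispatch strategies quoted above.
public class TweetStore {
    Map<String, List<String>> outbox = new HashMap<>();    // tweets stored per author (pull model)
    Map<String, List<String>> inbox = new HashMap<>();     // tweets copied per follower (push model)
    Map<String, List<String>> followers = new HashMap<>(); // author -> his followers
    Map<String, List<String>> following = new HashMap<>(); // reader -> authors he follows

    // 1. PUSH: one write per follower, reads are a single lookup.
    void postPush(String author, String tweet) {
        for (String f : followers.getOrDefault(author, Collections.emptyList()))
            inbox.computeIfAbsent(f, k -> new ArrayList<>()).add(tweet);
    }
    List<String> readPush(String reader) {
        return inbox.getOrDefault(reader, Collections.emptyList());
    }

    // 2. PULL: one write per tweet, reads merge across all followed authors.
    void postPull(String author, String tweet) {
        outbox.computeIfAbsent(author, k -> new ArrayList<>()).add(tweet);
    }
    List<String> readPull(String reader) {
        List<String> page = new ArrayList<>();
        for (String author : following.getOrDefault(reader, Collections.emptyList()))
            page.addAll(outbox.getOrDefault(author, Collections.emptyList()));
        return page; // in reality this would be merged by time and paginated
    }
}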

So what does twitter do? This guy says that twitter is optimising writes, while here it is mentioned that twitter is optimising reads. My personal hunch is that twitter stores the tweets just once and pays on the read, because you can use caching to make reads much faster, but with writes you are stuck: they have to go to the database. Also, isn't this similar to what, say, Google would do when returning results for your search? It would get matching pages from various machines (because the index is split across multiple machines) and then combine them, rank them and generate the final results page. Of course, one advantage Google has is that a large number of queries are probably repeated within a day, so their results can be served from the cache instantly. But each twitter user follows a different set of people, so it does not make sense to cache the page of an individual user. Rather, the tweets need to be cached and the combination done on the fly, as is pointed out here.

Monday, May 19, 2008

Two Puzzles

1. Given a sequence of r*s+1 distinct integers (r>=1, s>=1) prove that there exists either an increasing subsequence of length r+1 or a decreasing subsequence of length s+1.
It's a nice puzzle with a clean, elegant solution. Though it is not really an aha! puzzle, the solution does not involve much mathematical trickery. The way I got the solution was to first fix r at 2, so that we are looking at sequences of length 2s+1. Once you find the solution to this, it is easy to generalize it to arbitrary r. But I think this is a tough puzzle and I felt really good when I could solve it :)

2. There are 100 seats in an airplane and 100 passengers, each of whom has been assigned a boarding pass which contains their seat number. They board the plane one by one. Unfortunately the first passenger has lost his boarding pass, so he sits in a random seat. Every subsequent passenger takes his own seat if he finds it vacant, and otherwise sits in a random empty seat. What is the probability that the last passenger finds his seat vacant?
I had heard this puzzle a couple of years back and am not sure whether I was able to solve it then. When I reread it yesterday, I could not recall the solution but I did remember that there is a simple, elegant one. After some thought and much frustration I finally got the answer and felt relieved more than anything else. But the solution is nice. You can also generalize it to find the probability of the kth passenger finding his seat vacant.
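If you want to sanity-check your answer empirically (without spoiling the elegant argument), a quick simulation along these lines will do; it is just a rough sketch:

import java.util.Random;

// Monte Carlo estimate for the airplane seat puzzle: in what fraction of runs
// does the last passenger find his own seat free?
public class SeatPuzzle {
    public static void main(String[] args) {
        int n = 100, trials = 200000, lastGotOwnSeat = 0;
        Random rnd = new Random();
        for (int t = 0; t < trials; t++) {
            boolean[] taken = new boolean[n];
            // Passenger 0 has lost his pass and sits anywhere.
            taken[rnd.nextInt(n)] = true;
            // Passengers 1..n-2 take their own seat if free, otherwise a random empty one.
            for (int p = 1; p < n - 1; p++) {
                int seat = p;
                while (taken[seat]) seat = rnd.nextInt(n); // rejection-sample an empty seat
                taken[seat] = true;
            }
            if (!taken[n - 1]) lastGotOwnSeat++;
        }
        System.out.println((double) lastGotOwnSeat / trials);
    }
}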

Friday, May 16, 2008

Google Game

One can figure out how correlated a term B is to another term A by using Google.
cor_B(A) = Number of search results for query A B / Number of search results for query B

So for example, if you search for the query 'cricket bat' the number of results returned is approximately 746000,
while the number returned for 'bat' alone is 105000000, which means that
cor_bat(cricket) = 1/140 approximately, which can be interpreted as: out of every 140 pages containing the word bat, only one is referring to a cricket bat. Now for the term lbw,
cor_lbw(cricket) = 1730000/2470000 = 1/1.4 (approx). This is understandable because lbw hardly has any meaning outside of cricket jargon.
Now that the definitions are clear, the game is to find the term with the highest correlation to each of the following terms:

1. cricket
2. java
3. radiohead
4. einstein

So let's see how well you can do!! And of course B=A would give you the maximum correlation, but that is not allowed.

Tuesday, April 29, 2008

What's wrong with this code segment

Here is a small code segment to simulate doing something periodically (say every 5 minutes) in Java. The Date class here is the standard java.util.Date.

Date currDate = new Date(); // sets currDate to the current date
Date prevDate = currDate;
int interval = 5;

while (true) {
    if (currDate.getTime() >= prevDate.getTime() + interval * 60 * 1000) {
        doSomething();
        prevDate = currDate;
    }
    currDate.setTime(currDate.getTime() + 1 * 60 * 1000); // increment by 1 minute
}



Will this work? If not, why not? Ignore any compile errors I may have made; I am looking for a logical error, or rather an error from a programming language point of view.

Answer: It will not work, because when you do prevDate = currDate it is the reference that gets copied. So essentially prevDate and currDate always point to the same Date object, and there is never any time difference between the two. The correct way to do this would be
prevDate.setTime(currDate.getTime());
or
prevDate = (Date) currDate.clone();

I have made this error so many times in the past few days that I should be kicking myself in the backside. Though in my case things were not abstracted out so nicely and were part of a larger codebase. The whole logic was split up, with the assignment happening in one place, the comparison in another, etc., so it was not so easy to track down. At Google I had used
Joda Time, which provides a much nicer interface and an immutable class. That would have saved me from this error, because you cannot modify a Joda Time object: you always have to create a new instance to increment it. So prevDate would have kept pointing to the older instance and the comparison would have gone through nicely.
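For completeness, here is a rough sketch of how the same loop might look with Joda Time's immutable DateTime (just an illustration of the point above, not code I have compiled):

DateTime currTime = new DateTime();
DateTime prevTime = currTime;
int interval = 5;

while (true) {
    if (currTime.getMillis() >= prevTime.getMillis() + interval * 60 * 1000) {
        doSomething();
        prevTime = currTime;
    }
    // plusMinutes returns a brand new DateTime, so prevTime keeps pointing to the old instance
    currTime = currTime.plusMinutes(1);
}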

Tuesday, April 15, 2008

Nice Puzzle

There are coins of various denominations laid out in a row. Two players play a game where they take turns picking either of the two coins at the ends of the row. They play till no coin is left, and the person whose coins add up to the greater amount wins. Show that the player who moves first has a strategy such that he cannot lose if the number of coins is initially even.

This is one of those Aha! puzzles where you just need a single insight to solve it and once you know the solution it seems completely obvious. Really nice though

Monday, April 7, 2008

Google opens its cloud

Google App Engine Blog: Introducing Google App Engine + our new blog

Till now you had only read about GFS and Bigtable and marvelled at the amazing systems that Google has built. Now is your chance to try them out. Following in the footsteps of Amazon, Google is opening up its cloud for external developers to host their apps and data.
"The goal is to make it easy to get started with a new web app, and then make it easy to scale when that app reaches the point where it's receiving significant traffic and has millions of users."

Unfortunately it's just a preview release right now and limited to the first 10,000 developers. Which means that if you go there to sign up after reading this, you have missed the boat like me. But soon they will be opening it up to more developers. So wait till then!!

Update: I was wrong. You can still register for this preview version. It told me that space is limited but that I could give them my email id and they would invite me when space becomes available. This is what confused me. But I almost immediately got an email inviting me.

Sunday, April 6, 2008

Interviewing at Google Part 1

First let me state the obvious. It was a privilege working at Google for about 1.5 years, mostly because of the quality of the people you interact with and the kind of computing infrastructure you get to play around with. Even though I was not really overly excited by the actual projects I worked on, it was a great learning experience. But it's not a complete utopia; there were many people there who felt they were not doing any great, exciting work. This is primarily because Google is now a big company, no longer a startup, and is growing at a tremendous rate. They have so many products, both customer-facing and internal, that a large number of people end up working on adding some small feature to some product. That, coupled with the falling stock price, has definitely made Google a less lucrative prospect than before, but as long as they manage to keep the enormously talented people they have, it will always be worth working there.
One of the fun parts of working at Google was interviewing, and I really looked forward to taking interviews. Most people start taking interviews pretty early, simply because there are a large number of interviews happening and not too many interviewers. Google is seriously looking for people, but not at the cost of quality, at least not yet. Here are some tips from my experience which I think people applying to Google will find useful. Of course I will not divulge any actual question asked or anything about the internal review process of the interviews. But these are general tips which would be useful not just for Google but also for other companies, like Amazon, which have similar interviews.
There are generally 4 types of questions that interviewers at Google ask.

1. Coding questions: Almost every interviewer will ask you to write code. Sometimes it might be a standalone question, or it might be part of an algorithm/system design question you were asked. The question usually will NOT require deep algorithms concepts; rather, it will test your clarity of thinking and your ability to write Concise, Correct code that takes care of all the Corner Cases (IMPORTANT). Most of us believe that we are pretty good at coding, but in my short experience of interviewing at Google, I have hardly seen anyone write completely correct code on the first try. Maybe it's the pressure of the interview or maybe it is just lack of proper practice. So the most important thing is to practice coding. And when I say coding, I don't mean pseudocode, I mean proper, syntactically correct, working code in your favorite language. Practice writing such code so that you do not make silly mistakes in interviews. Another reason why practice is important is that while coding at the company, I hardly ever wrote code that involved things like tree traversal, dynamic programming, etc., and I believe this to be true for most people. So practice in these basic areas:
a) Tree traversals
b) Dynamic Programming (Most Important)
c) Operations on Linked Lists such as merging, splitting etc.
d) Recursion
e) Even simple things like Sorting and Binary Search, as one common mistake I have seen people make is losing track of the correct array indices to pass in divide-and-conquer style code (see the small example at the end of this post).

Once again, do not just think about these things but practice them. Practice writing nice, elegant code, but most importantly correct code. And put in all the proper error-condition checks even while practicing. I am sure that this portion of the interview can be easily cracked if you have done enough homework. It is the easiest portion, but also the easiest one to make mistakes in. Your code will be scrutinised very carefully by the interviewer, and he will usually pick up even a subtle mistake. So the key word here is PRACTICE.
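As an example of the kind of thing worth practicing until it comes out correct on the first attempt, here is a plain binary search with the index handling spelled out (nothing Google-specific, just practice material of the sort mentioned in point (e) above):

// Returns the index of target in the sorted array a, or -1 if it is not present.
// Invariant: if target is present at all, it lies in a[lo..hi] inclusive.
static int binarySearch(int[] a, int target) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2; // avoids overflow of (lo + hi) / 2
        if (a[mid] == target) return mid;
        else if (a[mid] < target) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}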

Thursday, April 3, 2008

Great Theoretical Ideas in Computer Science

Course Calendar

This contains the slides from a course at CMU. Very interesting and well-explained slides, covering lots of basic interesting ideas in Computer Science, from the basics of counting to trees, graphs, random walks etc.

MapReducing at Google

Map Reduce is a programming paradigm/framework developed by engineers at Google that allows people to write distributed programs easily. It is pretty much the backbone of
most huge data crunching applications at Google which need to process data terabytes or petabytes in size, such as query logs, the web index, orkut data, map tiles, the advertiser database and many more. It basically abstracts out the whole rigmarole of distributing my program across thousands of machines, which involves non-trivial steps like dividing the input data, taking care of machine/program failures, combining the output from all the copies and so on. The framework takes care of all this while I just have to write my main data processing logic, albeit in a special way. I will explain that a bit and then discuss my experiences using it at Google.

In Map Reduce, as the name suggests, your whole process consists of two phases: Map and Reduce. There is also an intermediate phase called shuffle, which is basically an orchestrator
that directs the data output by the Map phase to the Reduce phase in a proper way. The Map phase consists of many mappers running across a large number of machines. Each mapper
processes a fragment of the input, one record at a time. A record can be anything: a line in a file, a node in a graph, a row in a database etc. But the limitation is that a mapper can only examine one record at a time, and while processing that record it has no knowledge of any other record already seen or yet to be seen. For each record that a mapper sees, it can output a bunch of (Key, Value) pairs. For example, if your record is a line and you are writing a word count application, you would output a (word, count) pair for each word in that line. So your map code is generally an input processing phase which splits/converts/filters each record in a way that allows the records to be combined later in the reduce phase, but it does not do any combining itself.

Now the output generated by the Map phase acts as the input to the Reduce phase. Again you have multiple reducers running across some machines, and each reducer processes a fragment of the Map output. But the important distinction here is that a reducer gets to see, together, all the pairs which have the same key, even if they were generated by different mappers. So a reducer can combine all those pairs to generate one or more output records. For example, in our counting application, a single reducer gets to see all the pairs which have the word "the" as their key, and it can then add up all the values to output the total count for the word "the" in our input corpus.
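To make the word count example concrete, here is a minimal in-memory imitation of the two phases (this is not the actual MapReduce API, which is internal to Google; it is just a toy sketch of the idea, with the shuffle reduced to a HashMap that groups values by key):

import java.util.*;

// A toy, single-machine imitation of the word count Map Reduce. The real framework
// runs many mappers and reducers on different machines and does the shuffle for us.
public class WordCountSketch {
    // Map phase: for each record (here a line of text), emit (word, 1) pairs.
    static void map(String line, Map<String, List<Integer>> shuffle) {
        for (String word : line.split("\\s+")) {
            shuffle.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
    }

    // Reduce phase: for each key we see all its values together and combine them.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the cat sat", "the cat ran");
        Map<String, List<Integer>> shuffle = new HashMap<>();
        for (String line : input) map(line, shuffle);
        for (Map.Entry<String, List<Integer>> e : shuffle.entrySet()) {
            System.out.println(e.getKey() + " " + reduce(e.getKey(), e.getValue()));
        }
    }
}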

That is essentially all there is to Map Reduce. There are other details I have skipped, but this is the basic paradigm. The framework takes care of moving the data around across machines, running multiple copies of Map/Reduce, restarting a Map or Reduce if it fails, etc. All a programmer has to do is write the Map and the Reduce logic. This simple paradigm has found a surprisingly large number of uses at Google, with the amount of Map Reduce code checked into the codebase continuously growing. You can find the nitty-gritties and usage statistics at http://labs.google.com/papers/mapreduce-osdi04.pdf.

I want to touch upon two practical aspects of programming with this paradigm:
1. Most of the applications at Google that use Map Reduce actually have multiple Map Reduces running, with the output of one step feeding into the next as input. It is a pipeline
where each step moulds the input slightly to take it closer to the desired output. This is necessary because of the simplicity of the paradigm, which means that you can't do overly complicated things in a single pass. For example, let's say you want to generate a histogram of word counts in your input, with the desired output being the number of words that occur a particular number of times. You will require two Map Reduce passes for this. The first one will output the count of each word, as in our previous example. In the second pass, each input record is now a (word, count) pair. Our mapper takes this pair and inverts it, outputting the pair (count, word). So count is our key now, which means that all the words with the same count will go to the same reducer. Our reducer just has to count the number of records it gets and output (count, number_of_words_with_that_count), and we are done. There are various tools at Google to manage such pipelines, which keep track of which step needs to run after which, and on failure they don't start the whole pipeline again but continue from the step that failed.
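In the same toy style as the word count sketch above, the second pass of the histogram example might look something like this (again, not real framework code):

import java.util.*;

// Second pass of the histogram pipeline: input records are the (word, count)
// pairs produced by the first pass.
public class HistogramSecondPass {
    // Map: invert the pair so that the count becomes the key.
    static void map(String word, int count, Map<Integer, List<String>> shuffle) {
        shuffle.computeIfAbsent(count, k -> new ArrayList<>()).add(word);
    }

    // Reduce: all words with the same count arrive together; just count how many there are.
    static int reduce(int count, List<String> words) {
        return words.size(); // number of distinct words that occurred 'count' times
    }
}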

2. Most of the time writing your Mapper and Reducer is straightforward, but in some cases it requires great care. One thing to keep in mind is that you should try to distribute your
Map output to the reducers as evenly as possible, because if one reducer ends up getting a large number of records, it can become a bottleneck, especially if it is doing something non-trivial with them. This is something I faced personally, and I had to do some clever hacks, suggested by a fellow Googler, to keep things tractable. For example, let's say that for each country I want a count of every search query occurring in that country. My input is the query logs, where each record contains a single query and auxiliary information about it, such as the country it originated from. One way of doing this might be that in my Mapper I output (country, query), so all the queries occurring in a particular country go to a single reducer which can then compute the count for each query. But there are two problems with this. Firstly, the number of reducers here is limited to a couple of hundred, and secondly, some reducers might get a disproportionately larger number of pairs than others, because the query distribution over geography is highly skewed. So the reducer for a country like the US might become a bottleneck. To avoid this, what I can do is use Country_FirstWordInQuery as the key in my Mapper. This still works because duplicate queries obviously share the first word and would therefore go to the same reducer. Instead of the first word I could also use the longest word in the query, as the query distribution across longest words would be more balanced. Even after doing this your reducer might be slow, so you might have to optimise it; in fact, I had to create an inverted index inside a reducer I wrote to make it efficient.
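The key trick from that last example is, in code, nothing more than this (again purely illustrative; the method name is made up):

// Key by country plus the first word of the query instead of by country alone.
// Duplicate queries still share a key (same country, same first word), so they still
// meet at the same reducer, but the load is spread over many more keys.
static String makeKey(String country, String query) {
    String firstWord = query.trim().split("\\s+")[0];
    return country + "_" + firstWord;
}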

What would be interesting is for people to do some research on this model and abstract out things like what can and cannot be done in a single pass of MapReduce.

Tuesday, April 1, 2008

Link Log

Another blog I have started which will just contain links to interesting stuff I found on the web with maybe some notes
http://linksknil.blogspot.com/

Now, if anyone can point me to a firefox extension or greasemonkey script which will allow me to update that blog easily, it would be great. Basically, when I right click on a link, it should offer "post to blogger" and then just pop up a simple text area where I can write some notes, put some labels, hit submit and be done. The problem with the Send To Blogger link displayed in firefox is that I have to wait for a new window to load before I can post. Man, I don't have that much time!!

Monday, March 31, 2008

Gooey Lump Dump

So what does this mean? Well, Gooey Lump is slang for brain, so this means brain dump. Why such a complicated name? Well, not my fault. It's the fault of all those monkeys typing away on their keyboards, registering every saner name.
I tried some simpler names but they were all unavailable. Some of the names I tried were

braindump
creativesilence
creativeboredom
hmm
sneezophrenia (this was available but I didn't like it a lot)
randomwalk
randomsurfer
.....

but this name has a nice rhythmic sound to it. So here we are.