Tuesday, April 29, 2008

What's wrong with this code segment

Here is a small code segment to simulate doing something periodically (say, every 5 minutes) in Java. The Date class here is the standard java.util.Date.

Date currDate = new Date(); // sets currDate to the current date
Date prevDate = currDate;
int interval = 5; // minutes

while (true) {
    if (currDate.getTime() >= prevDate.getTime() + interval * 60 * 1000) {
        doSomething();
        prevDate = currDate;
    }
    currDate.setTime(currDate.getTime() + 1 * 60 * 1000); // increment by 1 minute
}



Will this work? If not, why not? Ignore any compile errors I may have made; I am looking for a logical error, or rather an error from a programming-language point of view.

Answer: It will not work, because when you do prevDate = currDate it is the reference that gets copied. So essentially prevDate and currDate always point to the same Date object, and there is never any time difference between the two. The correct way to do this would be

prevDate.setTime(currDate.getTime());

or

prevDate = (Date) currDate.clone(); // clone() returns Object, so a cast is needed

I have made this error so many times in the past few days that I should be kicking myself in the backside. Though in my case things were not abstracted out so nicely: the logic was part of a larger code base, split up with the assignment happening in one place and the comparison in another, so it was not easy to track down. At Google I had used Joda-Time, which provides a much nicer interface and immutable classes. That would have saved me from this error, because you cannot modify a Joda-Time object; you always have to create a new instance to increment it. So prevDate would have kept pointing to the older instance, and the comparison would have gone through nicely.
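
For illustration, here is a minimal sketch of the same loop written against Joda-Time's immutable DateTime (the PeriodicCheck class name and the doSomething() stub are just placeholders of mine):

import org.joda.time.DateTime;

public class PeriodicCheck {
    public static void main(String[] args) {
        DateTime currDate = new DateTime(); // current time
        DateTime prevDate = currDate;
        int interval = 5; // minutes

        while (true) {
            // !isBefore(x) means currDate >= x, same as the original comparison
            if (!currDate.isBefore(prevDate.plusMinutes(interval))) {
                doSomething();
                prevDate = currDate;
            }
            // plusMinutes returns a NEW DateTime, so prevDate keeps pointing
            // at the older instant instead of aliasing currDate
            currDate = currDate.plusMinutes(1); // increment by 1 minute
        }
    }

    static void doSomething() { /* periodic work */ }
}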

Tuesday, April 15, 2008

Nice Puzzle

There are coins of various denominations laid out in a row. Two players play a game where they take turns picking either of the two coins at the ends of the row. They play till no coin is left, and the person whose coins add up to the greater amount wins. Show that the player who moves first has a strategy such that he cannot lose if the number of coins is initially even.

This is one of those Aha! puzzles where you need just a single insight to solve it, and once you know the solution it seems completely obvious. Really nice though.

Monday, April 7, 2008

Google opens its cloud

Google App Engine Blog: Introducing Google App Engine + our new blog

Till now you had only read about GFS and Bigtable and marvelled at the amazing systems that Google has built. Now is your chance to try them out. Following in the footsteps of Amazon, Google is opening up its cloud for external developers to host their apps and data.
"The goal is to make it easy to get started with a new web app, and then make it easy to scale when that app reaches the point where it's receiving significant traffic and has millions of users."

Unfortunately it's just a preview release right now, limited to the first 10,000 developers. Which means that if you go there to sign up after reading this, you have missed the boat like me. But soon they will be opening it up for more developers, so wait till then!!

Update: I was wrong. You can still register for the preview version. It told me that space is limited, but that I could give them my email id and they would invite me when space became available. This is what confused me. But I almost immediately got an email inviting me.

Sunday, April 6, 2008

Interviewing at Google Part 1

First let me state the obvious. It was a privilege working at Google for about 1.5 years, mostly because of the quality of people you interact with and the kind of computing infrastructure you get to play around with. Even though I was not really overly excited by the actual projects I worked on, it was a great learning experience. But it's not a complete utopia; there were many people there who felt they were not doing any great, exciting work. This is primarily because Google is now a big company, no longer a startup, and is growing at a tremendous rate. They have so many products, both customer-facing and internal, that a large number of people end up working on adding some small feature to some product. That, coupled with falling stock prices, has definitely made Google a less lucrative prospect than before, but as long as they manage to keep the enormously talented people they have, it will always be worth working there.
One of the fun parts of working at Google was interviewing, and I really looked forward to conducting interviews. Most people start interviewing pretty early, simply because there are a large number of interviews happening and not too many interviewers. Google is seriously looking for people, but not at the cost of quality, at least not yet. Here are some tips from my experience which I think people applying to Google will find useful. Of course I will not divulge any actual question asked or anything about the internal review process of the interviews. These are general tips which should be useful not just for Google but also for other companies, like Amazon, that conduct similar interviews.
There are generally 4 types of questions that interviewers at Google ask.

1. Coding questions: Almost every interviewer will ask you to write code. Sometimes it might be a standalone question, or it might be part of an algorithm/system-design question you were asked. The question usually will NOT require deep algorithms concepts; rather, it will test your clarity of thinking and your ability to write Concise, Correct code that takes care of all the Corner Cases (IMPORTANT). Most of us believe that we are pretty good at coding, but in my short experience of interviewing at Google, I have hardly seen anyone write completely correct code on the first try. Maybe it's the pressure of the interview, or maybe it is just lack of proper practice. So the most important thing is to practice coding. And when I say coding, I don't mean pseudocode; I mean proper, syntactically correct, working code in your favorite language. Practice writing such code so that you do not make silly mistakes in interviews. Another reason why practice is important is that while coding at the company, I hardly ever wrote code that involved things like tree traversal or dynamic programming, and I believe this to be true for most people. So practice in these basic areas:
a) Tree traversals
b) Dynamic Programming (Most Important)
c) Operations on Linked Lists such as merging, splitting etc.
d) Recursion
e) Even simple things like Sorting and Binary Search; one common mistake I have seen people make is losing track of the correct array indices to pass in Divide and Conquer style code (see the sketch after this list).
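
To make that last point concrete, here is a small recursive binary search in Java. This is just an illustrative sketch of careful index handling, not an actual interview question; lo and hi are both inclusive, and every recursive call strictly shrinks the range:

public class SearchPractice {
    // Returns the index of target in the sorted array a, or -1 if absent.
    // lo and hi are both inclusive bounds of the range being searched.
    static int binarySearch(int[] a, int target, int lo, int hi) {
        if (lo > hi) return -1;       // empty range: not found
        int mid = lo + (hi - lo) / 2; // avoids the int overflow of (lo + hi) / 2
        if (a[mid] == target) return mid;
        if (a[mid] < target) return binarySearch(a, target, mid + 1, hi); // right half
        return binarySearch(a, target, lo, mid - 1);                      // left half
    }

    public static void main(String[] args) {
        int[] a = {2, 3, 5, 7, 11, 13};
        System.out.println(binarySearch(a, 11, 0, a.length - 1)); // prints 4
        System.out.println(binarySearch(a, 4, 0, a.length - 1));  // prints -1
    }
}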

Once again, do not just think about these things but practice. Practice writing nice, elegant code, but most importantly correct code, and put in all the proper error-condition checks even while practicing. I am sure this portion of the interview can be cracked easily if you have done enough homework. This is the easiest portion, but also the easiest one to make mistakes in. Your code will be scrutinised very carefully by the interviewer, who will usually pick up on even a subtle mistake. So the key word here is PRACTICE.

Thursday, April 3, 2008

Great Theoretical Ideas in Computer Science

Course Calendar

This contains the slides from a course at CMU. Very interesting and well-explained slides, covering lots of basic, interesting ideas in Computer Science, from the basics of counting to trees, graphs, random walks, etc.

MapReducing at Google

MapReduce is a programming paradigm/framework developed by engineers at Google that allows people to write distributed programs easily. It is pretty much the backbone of most of the huge data-crunching applications at Google which need to process data terabytes or petabytes in size, such as query logs, the web index, Orkut data, map tiles, the advertiser database and many more. It basically abstracts out the whole rigmarole of distributing my program across thousands of machines, which involves non-trivial steps like dividing the input data, taking care of machine/program failures, combining the output from all the copies and so on. The framework takes care of all this while I just have to write my main data-processing logic, albeit in a special way. I will explain that a bit and then discuss my experiences using it at Google.

So in MapReduce, as the name suggests, your whole process consists of two phases: Map and Reduce. There is also an intermediate phase called Shuffle, which is basically an orchestrator that directs the data output by the Map phase to the Reduce phase in a proper way. The Map phase consists of many mappers running across a large number of machines. Each mapper processes a fragment of the input, one record at a time. A record can be anything: a line in a file, a node in a graph, a row in a database, etc. But the limitation is that a mapper can only examine one record at a time, and while processing that record it has no knowledge of any other record already seen or yet to be seen. For each record that a mapper sees, it can output a bunch of (Key, Value) pairs. For example, if your record is a line and you are writing a word-count application, you would output a (word, 1) pair for each word in that line. So your map code is generally an input-processing phase which splits/converts/filters each record in a way that allows us to later combine the records in the Reduce phase, but it does not do any combination itself.
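
As an illustration, here is a toy word-count mapper in Java. Google's real MapReduce API is internal, so the class shape and the emit callback below are only stand-ins of mine for the framework's actual interfaces:

import java.util.function.BiConsumer;

public class WordCountMapper {
    // Called once per record (here, one line of text). The mapper cannot
    // look at any other record; it can only emit (key, value) pairs.
    void map(String line, BiConsumer<String, Integer> emit) {
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                emit.accept(word, 1); // one (word, 1) pair per occurrence
            }
        }
    }

    public static void main(String[] args) {
        // "the" is emitted twice; the reducer will later add the 1s together
        new WordCountMapper().map("the quick fox jumps over the dog",
                (word, count) -> System.out.println(word + " -> " + count));
    }
}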

Now the output generated by the Map phase acts as the input to the Reduce phase. Again you have multiple reducers running across some machines, and each reducer processes a fragment of the Map output. But the important distinction here is that a reducer gets to see, at one time, all the pairs which have the same key, even if they were generated by different mappers. So the reducer can then combine all those pairs to generate one or more output records. For example, in our counting application, a single reducer gets to see all the pairs which have the word "the" as their key, and it can then add up all the values to output the total count for the word "the" in our input corpus.
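
And the matching toy reducer. Remember that the grouping of values by key is done by the framework's Shuffle phase, not by this code:

import java.util.Arrays;
import java.util.Iterator;
import java.util.function.BiConsumer;

public class WordCountReducer {
    // Called once per distinct key, with ALL the values emitted for that
    // key across every mapper.
    void reduce(String word, Iterator<Integer> counts, BiConsumer<String, Integer> emit) {
        int total = 0;
        while (counts.hasNext()) {
            total += counts.next(); // sum the 1s emitted for this word
        }
        emit.accept(word, total); // (word, total count in the corpus)
    }

    public static void main(String[] args) {
        new WordCountReducer().reduce("the", Arrays.asList(1, 1, 1).iterator(),
                (word, total) -> System.out.println(word + " -> " + total)); // the -> 3
    }
}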

That is all there is to MapReduce, essentially. There are other details I have skipped, but this is the basic paradigm. The framework takes care of moving the data around across machines, running multiple copies of Map/Reduce, restarting a Map or Reduce if it fails, etc. All a programmer has to do is write the Map and the Reduce logic. This simple paradigm has found a surprisingly large number of uses at Google, with the number of MapReduce programs checked into the codebase continuously growing. You can find the nitty-gritty details and usage statistics at http://labs.google.com/papers/mapreduce-osdi04.pdf.

I want to touch upon two practical aspects of programming with this paradigm:
1. Most of the applications at Google using MapReduce actually have multiple MapReduces running, with the output of one step feeding into the next phase as input. It is a pipeline where each step moulds the input slightly to take it closer to the desired output. This is necessary because of the simplicity of the paradigm, which means you can't do overly complicated things in a single pass. For example, let's say you want to generate a histogram of word counts in your input, with the desired output being the number of words that occur a particular number of times. You will require two MapReduce passes for this. The first one will output the count of each word, as in our previous example. In the second pass, the input record is now a (word, count) pair. Our mapper takes this pair and inverts it, outputting the pair (count, word). So count is our key now, which means that all the words with the same count will go to the same reducer. Our reducer just has to count the number of records it gets and output (count, number_of_words_with_that_count), and we are done (see the sketch below). There are various tools at Google to manage such pipelines, which keep track of which step needs to run after which, and on failure they don't start the whole pipeline again but continue from the step that failed.
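
Here is a sketch of that second pass, with the same caveat that the emit callback is only a stand-in for the real framework API:

import java.util.Iterator;
import java.util.function.BiConsumer;

public class HistogramPass2 {
    // Invert (word, count) so that count becomes the key: all words with
    // the same count now meet at the same reducer.
    void map(String word, int count, BiConsumer<Integer, String> emit) {
        emit.accept(count, word);
    }

    // Count how many distinct words share this count.
    void reduce(int count, Iterator<String> words, BiConsumer<Integer, Integer> emit) {
        int numWords = 0;
        while (words.hasNext()) {
            words.next();
            numWords++;
        }
        emit.accept(count, numWords); // (count, number_of_words_with_that_count)
    }
}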

2. Most of the time, writing your mapper and reducer is straightforward, but in some cases it requires great care. One thing to keep in mind is that you should try to distribute your Map output to the reducers as evenly as possible, because if one reducer ends up getting a large number of records, it can become a bottleneck, especially if it is doing something non-trivial with them. This is something I faced personally, and I had to do some clever hacks, suggested by a fellow Googler, to keep things tractable. For example, let's say that for each country, I want a count for each search query occurring in that country. My input is the query logs, where each record contains a single query and auxiliary information about it, such as the country it originated from. One way of doing this might be that in my mapper I output (country, query). So all the queries occurring in a particular country go to a single reducer, which can then compute the count for each query. But there are two problems with this. Firstly, my number of reducers here is limited to a couple of hundred, and secondly, some reducers might get a disproportionately larger number of pairs than others, because query distribution over geography is highly skewed. So the reducer for a country like the US might become a bottleneck. To avoid this, what I can do in my mapper is use Country_FirstWordInQuery as the key. This works because duplicate queries obviously share their first word and would therefore still go to the same reducer (see the sketch below). Instead of the first word I can also use the longest word in the query, as the query distribution across longest words would be more balanced. Even after doing this your reducer might be slow, so you might have to optimise it. In fact, I had to create an inverted index inside a reducer I wrote to make it efficient.
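
And a sketch of that skew-avoiding key construction (the Country_FirstWordInQuery format is just how I am illustrating the idea):

import java.util.function.BiConsumer;

public class QueryCountMapper {
    // Key on country PLUS the query's first word, so a single hot country's
    // queries spread across many reducers, while duplicate queries (which
    // share their first word) still land on the same reducer.
    void map(String country, String query, BiConsumer<String, String> emit) {
        String firstWord = query.trim().split("\\s+")[0];
        emit.accept(country + "_" + firstWord, query);
    }
}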

What would be interesting is people doing some research on this model, abstracting out things like what can and cannot be done in a single MapReduce pass.

Tuesday, April 1, 2008

Link Log

Another blog I have started, which will just contain links to interesting stuff I find on the web, maybe with some notes:
http://linksknil.blogspot.com/

Now if anyone can point me to a Firefox extension or Greasemonkey script which will allow me to update that blog easily, that would be great. Basically, when I right-click on a link, it should say "Post to Blogger" and then just pop up a simple text area where I can write some notes, put some labels, hit submit and be done. The problem with the "Send to Blogger" link that is displayed in Firefox is that I have to wait for a new window to load before I can post. Man, I don't have that much time!!