31.1.08

Map Reduce

Most of my work here at IIIT is on the Apache Hadoop project. Hadoop is an implementation of the Map-Reduce framework, which Google originally put forth for parallel programming. It's based on the map and reduce routines commonly found in functional languages.
Recently there has been a lot of interest in MR. One important reason is that it makes it very easy to write parallel applications, especially those that have to deal with huge amounts of data distributed over unreliable computers.
Yahoo! is currently running a Hadoop cluster of over 1000 nodes, and they are doing pretty interesting stuff with it.
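
To make the map and reduce idea concrete, here is a minimal single-machine sketch of the classic word count in Python. It is not Hadoop's API; the names map_phase, shuffle, reduce_phase and word_count are my own, made up for illustration. A real framework would run the map calls on different nodes and do the grouping step over the network.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    # Each map_phase call is independent, so on a cluster these could run in parallel.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(docs))
```

The point of the split is that the programmer only writes the two small functions; distributing the work and handling failed machines is left to the framework.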

Although Hadoop has emerged as the leading open source implementation, there are others worth mentioning:
  1. The Qt Concurrent package supports MR. I am not sure if it has support for distributed operation, though.
  2. There is also an MR implementation for Ruby called Skynet (I don't know why they chose this name). Like everything else in Ruby, it makes writing MR code ridiculously easy.

2 comments:

Kapil Barve said...

JD: Can you give one example where a problem has been solved using MR?

Jaideep said...

"MapReduce is useful in a wide range of applications, including: 'distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation...' Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses. [2]

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reduces."
Source: Wikipedia