Hadoop – An Introduction
Hadoop is a framework of tools; it is not a single piece of software that you can download onto your computer and say you have downloaded Hadoop. So what is this framework used for? The objective of these tools is to support the running of applications on Big Data. Hadoop is an open-source set of tools distributed under the Apache License, which guarantees that no single company controls the direction of Hadoop; it is maintained by Apache. So, we understand that Hadoop is a set of tools that supports running applications on Big Data. The keyword behind Hadoop is Big Data: Big Data creates the challenges that Hadoop addresses. These challenges arise at three levels: a lot of data is coming in at very high speed (velocity), a big volume of data has been gathered and keeps growing (volume), and the data comes in all sorts of variety; it is not organized data, as it includes audio, video and so on.
In the traditional approach, an enterprise gets a very powerful computer and feeds it whatever data is available so that this computer can crunch the numbers. This computer does a good job, but only up to a certain point. A point comes when the computer cannot do the processing anymore, because it is not scalable, while the Big Data keeps growing. So the enterprise approach does have its limitations when it comes to Big Data.
Hadoop takes a very different approach from the enterprise approach: it breaks the data into smaller pieces, and that is why it is able to deal with Big Data. Breaking the data into smaller pieces is a good idea, but then how are you going to perform the computation? Well, Hadoop breaks the computation down into smaller pieces as well, and it sends each piece of computation to each piece of data. The data is broken into pieces of roughly equal size so that the child computations can finish in roughly equal amounts of time. Once all these computations are finished, the results are combined together, and this combined or overall result is what is sent back to the application.
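The split/compute/combine idea above can be sketched in a few lines of plain Python. This is a toy, single-machine simulation of the model, not Hadoop's actual API; the function names are made up for illustration:

```python
def child_computation(chunk):
    # Each child computation works on one piece of the data;
    # here the "computation" is just summing the numbers.
    return sum(chunk)

def run_job(data, num_pieces=4):
    # Break the data into roughly equal pieces so that the child
    # computations can finish in roughly equal amounts of time.
    size = -(-len(data) // num_pieces)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # Hadoop would ship each piece of computation to the machine
    # holding each piece of data; here we just loop over the pieces.
    partial_results = [child_computation(chunk) for chunk in chunks]

    # Combine the partial results into the overall result that is
    # sent back to the application.
    return sum(partial_results)

print(run_job(list(range(1, 101))))  # prints 5050, same as summing in one go
```

The point of the sketch is that the final answer is identical to what one big machine would compute; only the work is divided.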
So, how is it that Hadoop is able to break the data into pieces and then the computation into pieces? At a very high level it has a simple architecture; you can say Hadoop has two main components: MapReduce and the file system. The file system is called HDFS, or the Hadoop Distributed File System. As already mentioned, Hadoop is a set of tools; let us call this set of tools projects. There are numerous projects that have been started and are being managed by Apache under the umbrella of Hadoop, and the objective of these projects is to provide assistance in tasks related to Hadoop. One has to keep in mind that besides MapReduce and HDFS there is another component to Hadoop, and that is these projects.
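To make the MapReduce component concrete: a job is expressed as a map function, which emits (key, value) pairs, and a reduce function, which combines all values emitted for one key. The classic word-count example can be simulated in Python; this is a sketch of the programming model only, since real Hadoop jobs are typically written against the Java MapReduce API:

```python
from collections import defaultdict

def map_phase(line):
    # The map function emits a (key, value) pair for each word.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # The reduce function combines all the values emitted for one key.
    return key, sum(values)

def word_count(lines):
    # Shuffle step: group the mapped pairs by key,
    # as the framework does between the map and reduce phases.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    # Reduce step: one call per distinct key.
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(word_count(["big data big tools", "big data"]))
# {'big': 3, 'data': 2, 'tools': 1}
```

In real Hadoop, each map call runs on the machine that holds that piece of data in HDFS, and the framework handles the grouping and the distribution of the reduce calls.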