Hadoop MapReduce Fundamentals – Different ways of optimization
For today’s MapReduce Fundamentals, we are going to look into the subject of optimization. There are many different ways to optimize MapReduce, and it’s critically important because we’re dealing with huge volumes of data. We may have resource constraints, particularly when we are being charged for running MapReduce jobs in the cloud. We might also have time constraints, where we want to optimize so that the results come back more quickly. You have to see optimization as a cycle that applies to one or more MapReduce jobs.
In addition to the cycle of optimizing jobs, there is also optimization that can be done to the physical environment, by which I mean the Hadoop cluster. There are many configuration settings that you can adjust, and the various distributions offer administration classes where you can study this further. Turning to a specific job, we can optimize before the job runs, when we’re loading the data, and during the map phase, which often includes breaking complex mapping tasks into multiple jobs. We can also optimize the shuffle phase and the reduce phase, and we can do post-processing or optimization after the job or jobs complete.
The first thing to remember is that HDFS stores files in blocks, typically 64 or 128 MB each. There will be situations where we want to adjust or pre-process the incoming files and filter out junk data or anything that has no value. In addition, we may choose to compress the incoming files; when the MapReduce job runs, the compressed files must be decompressed before they can be processed. Sometimes additional compression is applied throughout the lifetime of the MapReduce job, for example to intermediate map output.
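To make the block arithmetic and the pre-processing step concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the function names, the comma delimiter, and the "at least three fields" rule are all hypothetical choices made for illustration, and the 128 MB default simply mirrors one of the block sizes mentioned above.

```python
MB = 1024 * 1024


def num_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return how many blocks a file of this size would occupy.

    Integer ceiling division: a partially filled last block still counts.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def filter_records(lines, min_fields=3, delimiter=","):
    """Yield only the lines worth loading.

    Assumption for illustration: a record is junk if it is blank or has
    fewer than `min_fields` delimited fields.
    """
    for line in lines:
        stripped = line.strip()
        if stripped and len(stripped.split(delimiter)) >= min_fields:
            yield stripped
```

Filtering before the load matters because every junk record that reaches HDFS still takes up block space and still has to be read by a mapper. A 1 GB file occupies 8 blocks at 128 MB and 16 blocks at 64 MB, so trimming the input can directly reduce the number of blocks involved.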
Although you may not think of it as an optimization, encryption is another consideration: there may be security or privacy requirements around the data, and if the data is encrypted or otherwise obfuscated, that adds overhead to running a particular job. We’ve discussed pre-processing and some examples, and we have also discussed the concept of writing your MapReduce jobs in a very functional way, where each job performs the smallest possible unit of work with a single type of process rather than multiple processes.
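The idea of small, single-purpose jobs can be sketched with a toy in-memory model. This is plain Python rather than the Hadoop API; `run_job` and the mapper and reducer names are hypothetical, and the shuffle is simulated with a dictionary. Instead of one complex job, two simple jobs are chained: the first counts words, the second groups words by their count.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy map -> shuffle -> reduce pipeline, held entirely in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase, with keys visited in sorted order
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: one small unit of work, counting words.
def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, ones):
    return word, sum(ones)


# Job 2: a separate small unit of work, grouping words by their count.
def count_mapper(pair):
    word, count = pair
    yield count, word


def collect_reducer(count, words):
    return count, sorted(words)


def words_by_frequency(lines):
    """Chain the two jobs: the output of job 1 is the input of job 2."""
    counts = run_job(lines, word_mapper, sum_reducer)
    return run_job(counts, count_mapper, collect_reducer)
```

Each mapper and reducer here does exactly one thing, which makes the jobs easier to test and tune individually. The trade-off is that a real chained job writes intermediate output between stages.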
Let’s discuss compression as optimization in more detail. To understand the Hadoop compression options, you have to look at the particular vendor distribution and at the version of the Hadoop binaries that you’re working with. As an example, we are considering data from the Cloudera website: uncompressed files alongside files that have been compressed using two different compression algorithms, GZip and LZO. Notably, LZO files can be split (once they are indexed), whereas GZip files cannot.
What that means is that a suitable, splittable compression algorithm can optimize the load as well, because a large compressed file can still be processed by several mappers in parallel. In the Cloudera data you can see the size of each file after it’s compressed, along with the time to compress and decompress. These are very common considerations in optimizing MapReduce, and if this is an ongoing batch-type situation you will often run proof-of-concept tests on small subsets to figure out which compression algorithm is most appropriate and at which phases of the MapReduce job.
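A proof test of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the `gzip` and `bz2` modules from the Python standard library, because LZO is not part of the standard library, and the `benchmark` function and its result keys are hypothetical names. The numbers it produces depend entirely on your data and machine, so none are shown here.

```python
import bz2
import gzip
import time


def benchmark(data):
    """Compress and decompress `data` (bytes) with each codec.

    Returns a dict: codec name -> compressed size and elapsed seconds.
    """
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        decompress_s = time.perf_counter() - start

        # Sanity check: the round trip must give back the original bytes.
        if unpacked != data:
            raise ValueError(name + " round trip did not restore the data")

        results[name] = {
            "compressed_bytes": len(packed),
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        }
    return results
```

To use it, read a small representative subset of the real input into a bytes object, call `benchmark` on it, and compare the three figures for each codec. Size and speed usually pull in opposite directions, which is why the choice can differ between the load, the intermediate map output, and the final output.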