Hadoop and the tools around it
Here is a kind of tour of the open-source community, and the Apache software in particular. Hadoop is actually a collection of open-source projects; there is a vibrant community of thousands of people working around the clock to create the set of tools that make up Hadoop. There are two layers to think about: first, the infrastructure layer, whose core is the distributed file system.
Alongside it sits MapReduce, which some IT teams need, as well as some of the new NoSQL databases. Then there is the analytics layer that data scientists are interested in: here one can apply SQL capabilities through open-source projects like Hive, build processing pipelines through languages like Pig, and explore predictive analytics through projects like Mahout. There are more projects than these, but these are the ones that deserve attention.
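To make the MapReduce idea in the infrastructure layer concrete, here is a minimal sketch of the programming model, simulated in plain Python rather than on a real Hadoop cluster. The classic word-count example is used; the function names and sample lines are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    # Map step: emit a (word, 1) pair for every word in one line of input.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce step: sum the counts per key, as a Hadoop reducer would
    # after the shuffle has grouped pairs by word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data needs big tools", "hadoop stores big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(pairs)
print(counts["big"])   # 3
print(counts["data"])  # 2
```

On a real cluster the same map and reduce logic runs in parallel across many machines, with HDFS supplying the input splits; tools like Hive and Pig generate jobs of exactly this shape from higher-level SQL-like or dataflow scripts.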
Now, when you think about companies in this space, many people do not know how to think about the Hadoop ecosystem. The truth is that large companies, such as EMC with their Greenplum Hadoop announcement, along with many small companies and many investors, are driving huge amounts of money into this ecosystem. They are doing so because they see the problem that organizations have with their Big Data, and they see the opportunity that Hadoop represents. A whole ecosystem is developing, which is good news for the data scientist, because choice, competition, and innovation are coming not just from the open-source community but also from commercial sources.
At the bottom level is the data management layer: companies that take the Hadoop distribution and add value with incremental capabilities and services, so there is a lot of choice there. There are also companies providing mechanisms to get data into and out of Hadoop, along with data integration and ETL. On top of that, as a data scientist you have to think about how to plug your own skills into this stack, how to plug your developers in, and how data scientists can get access with familiar skill sets.
Then, at a higher level, there are visualization companies who think about presenting all of this to high-level business users; they take existing BI products and put new user interfaces and compelling visualizations on top of them. Finally, the service organizations are really important: we all need to look at this ecosystem in terms of hardware, software, and service vendors. As you move to Hadoop, consider everybody; then you will think not only about products but also about services.