Hadoop – Dealing with structured and unstructured data
When dealing with data in Hadoop, clients typically encounter a wide range of data types, and nine times out of ten they only start thinking about this after the data has already been loaded. The data can include trial data, network intrusion events, weblogs, graph data, sensor readings, data from social networking sites, and so on. Many people believe there is considerable value to be gleaned from this data, and they would like to run at least basic queries on it. But because the volume is so large they are overwhelmed, and they end up simply storing it, with no way to draw connections between datasets or get the network effect that comes from combining them.
With Hadoop, NoSQL, and schema-less data, you can make all of that possible: you can put all the data in one place and draw the necessary conclusions from it. Another important principle is not to mess with the data; make as few modifications as possible to get it into the database. With NoSQL this is feasible, since little transformation is required on the way in.
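To illustrate the schema-less idea, here is a minimal sketch in Python that treats a store as newline-delimited JSON. The record shapes and field names (weblog, sensor, network event) are illustrative assumptions, not a real Hadoop or NoSQL API; the point is that differently shaped records land in the same place with no schema migration.

```python
import json

def append_records(store_lines, records):
    """Append records as-is: each becomes one JSON line, with no
    renaming or truncation of fields required."""
    for rec in records:
        store_lines.append(json.dumps(rec))
    return store_lines

def scan(store_lines, record_type):
    """A basic query: filter the raw store by record type."""
    recs = (json.loads(line) for line in store_lines)
    return [r for r in recs if r.get("type") == record_type]

# Three differently shaped records go into the same store.
store = append_records([], [
    {"type": "weblog", "ip": "10.0.0.1", "path": "/login", "status": 200},
    {"type": "sensor", "sensor_id": "t-17", "reading": 21.4},
    {"type": "network_event", "src": "10.0.0.9", "dst": "10.0.0.1"},
])
```

A new dataset with entirely new fields can be appended later with the same two functions, which mirrors the point that a change in the data does not force a change in the store.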
Traditional database people will want to optimize by tuning things like column and row names, keeping them short, and so on, but none of that needs to be considered here: the data is already being compressed, so those issues don't arise. Hence, scalability is possible both in the size of the data and in its complexity. The bottom line is that a change in the data at the storage level doesn't require a change in the NoSQL database. That is fundamental, because in many databases, as data keeps flowing in, the data is changing underneath you as things evolve, even when you think you aren't modifying it. With this approach you can keep adding new datasets, modify existing ones, and keep all the data in the same place in a relatively easy manner.
To make this more concrete, consider exploring new ways of using network defense data to detect intrusions; this is the poster child of Big Data problems. There is an enormous amount of data flowing through the network, and it is always changing: there are always new patterns, new summaries, and new methods to look for, and the threat is constantly evolving. So how do we use Big Data technologies to keep up with this? The first step is ad-hoc analysis: we run MapReduce jobs, and an experiment might take a day, or sometimes less.
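The kind of ad-hoc MapReduce job described above can be sketched in the Hadoop Streaming style: a mapper and a reducer that together count events per source IP. The log format (space-separated, source IP first) is an assumption for illustration, and the local driver below only simulates Hadoop's shuffle-and-sort phase so the sketch is runnable on its own.

```python
from collections import defaultdict

def mapper(line):
    """Emit (source_ip, 1) for each log line."""
    fields = line.split()
    if fields:
        yield fields[0], 1

def reducer(key, values):
    """Sum the event counts for one source IP."""
    yield key, sum(values)

def run_local(lines):
    """Tiny local driver standing in for Hadoop's shuffle/sort:
    group mapper output by key, then feed each group to the reducer."""
    groups = defaultdict(list)
    for line in lines:
        for k, v in mapper(line):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items())
                for kv in reducer(k, vs))

counts = run_local([
    "10.0.0.9 GET /admin",
    "10.0.0.9 GET /login",
    "10.0.0.1 GET /index",
])
```

Because the mapper and reducer are just small functions, a new hypothesis (a different key, a different summary) is a few minutes of editing rather than a schema redesign, which is what makes the day-long experimentation cycle possible.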
Through this analysis we can discover new patterns and threat vectors quickly, and then quickly port them over to some sort of NoSQL database linked to a user interface. Frankly, this is still part of the experiment, since you don't know whether the analyst sitting there in the middle of the night will find it useful. So you put it in their interface; it is all low cost, essentially a mash-up, and you can quickly evolve it into something genuinely meaningful. Doing this with a traditional database would generally have taken months, but with Hadoop it can be done in a matter of days. Hadoop is game-changing in a variety of ways, the primary one being the ability to look at data at the speed at which it is evolving.
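The hand-off from batch analysis to the analyst's interface can be sketched as loading job results into a key-value table for fast lookup. A plain dict stands in for the NoSQL store here, and the threshold and field names are illustrative assumptions, not part of any real system described in the text.

```python
# Events per hour above which a source is flagged (assumed value).
THRESHOLD = 100

def build_threat_table(event_counts):
    """Turn MapReduce output (ip -> count) into small documents,
    keyed by source IP, that a mash-up UI could render directly."""
    return {
        ip: {"count": n, "flag": "suspicious" if n >= THRESHOLD else "normal"}
        for ip, n in event_counts.items()
    }

def lookup(table, ip):
    """The single-key read the analyst's interface would issue."""
    return table.get(ip, {"count": 0, "flag": "unknown"})

table = build_threat_table({"10.0.0.9": 250, "10.0.0.1": 3})
```

Swapping in a new detection rule only means rebuilding this table from the latest job output, so the interface itself can evolve as quickly as the analysis does.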