In a nutshell, Hadoop is…
An open-source Java framework that brings compute as physically close to the data as possible. Hadoop basically consists of two parts: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.
Plain & Simple… What is HDFS?
HDFS takes a file and splits it into many small ‘chunks’, then distributes and stores these chunks across many servers. These chunks are also replicated (the number of copies required can be specified by the application) for fault tolerance. Storing data across multiple nodes in this way boosts performance, because many machines can read and write in parallel, and it can potentially save money, because those machines can be cheap commodity servers rather than one expensive storage appliance.
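To make the idea concrete, here is a toy sketch of chunking and replica placement in plain Java. The class, method names, chunk size, and round-robin placement are all invented for this illustration; real HDFS uses much larger blocks (128 MB by default in recent versions) and a rack-aware placement policy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of HDFS-style chunking and replication.
// Not real HDFS code: names and the placement policy are made up here.
public class Chunker {

    // Split a file's bytes into fixed-size chunks (HDFS calls these "blocks").
    static List<byte[]> split(byte[] file, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < file.length; off += chunkSize) {
            int len = Math.min(chunkSize, file.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(file, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    // Assign each chunk to `replication` distinct nodes, round-robin,
    // so losing any one node still leaves copies elsewhere.
    static List<List<Integer>> place(int numChunks, int numNodes, int replication) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int c = 0; c < numChunks; c++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((c + r) % numNodes);
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        byte[] file = new byte[10];            // a 10-byte "file"
        List<byte[]> chunks = split(file, 4);  // chunks of 4, 4 and 2 bytes
        System.out.println(chunks.size());     // 3 chunks
        // Each chunk lands on 3 of the 5 nodes:
        System.out.println(place(chunks.size(), 5, 3));
    }
}
```

Notice that once each chunk has a list of nodes, any single node can fail and every chunk is still readable from its remaining replicas.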
HDFS uses an intelligent placement model for reliability and performance: replicas are spread across different servers (and racks) so that a single failure doesn’t take data offline. This optimized placement is what sets HDFS apart from most other distributed file systems.
MapReduce works on top of HDFS – after HDFS has distributed the data across many different servers, MapReduce sends a fragment of a program (“a piece of work”) to each server, to run against the data stored locally on that server.
So, in a nutshell, MapReduce is a framework that enables you to write an application that processes vast amounts of data in parallel by “sharing out” the work to a large number of servers (collectively referred to as a cluster).
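The classic teaching example is word count, and the two phases can be sketched in plain Java without a cluster. The class and method names below are invented for this sketch; in real Hadoop you would subclass its `Mapper` and `Reducer` classes instead, and the framework would run the map calls on many servers in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy word count showing the two MapReduce phases in plain Java.
// No Hadoop involved: names here are made up for illustration.
public class ToyMapReduce {

    // Map phase: turn one input line into (word, 1) pairs. Each line is
    // processed independently, which is why lines can be mapped on
    // different servers at the same time.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "big data big cluster", "big data" };
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) all.addAll(map(line)); // map each "split"
        System.out.println(reduce(all)); // {big=3, cluster=1, data=2}
    }
}
```

The key design point: because `map` only ever sees one line, the framework is free to run thousands of map calls at once, each on the server that already holds that piece of the data.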
Hadoop is meant for cheap commodity hardware, where scaling out happens by simply adding more cheap machines.
So you’re probably wondering what all the other buzzwords are. These are other projects related to Hadoop; they are generally built on top of HDFS or MapReduce, and I plan to write more about them in upcoming posts.
What problems does it address?
Well, I guess the first thing to ask yourself is “What is a big data problem?” My one-line answer is ‘datasets that exceed the boundaries of normal processing capabilities, forcing you to take a non-traditional approach’. The way I see it, “big data” gives us an opportunity to use the data streaming in from all around us to change the way that companies engage with their customers.