Question: Why do we need Hadoop at all?
Suppose you have 1 GB of data that you need to process. The data is stored in a traditional relational database on your desktop computer, which has no problem handling this load. Then your company expands and the data grows to 10 GB, and then to 100 GB, and you reach the limits of your desktop computer. So you scale up your database by investing in a larger machine. Then your data grows to 10 TB, and then to 100 TB, and you are fast approaching the limits of that machine too.
Moreover, you are now asked to feed your application with unstructured data coming from sources such as social media (e.g. Facebook, Twitter), RFID readers, sensors and so on. Your management wants to derive information from both the RDBMS data and the unstructured data, and wants it as soon as possible. What should you do? Hadoop may give you the answer!
Question: What is Hadoop?
Hadoop is an open source project of the Apache Software Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son’s toy elephant. Hadoop uses Google’s MapReduce and Google File System (GFS) technologies as its foundation. It is optimized to handle large quantities of data, which may be structured, unstructured or semi-structured, using commodity (ordinary, inexpensive) hardware. The processing is massively parallel and delivers high throughput. However, Hadoop works in batch mode: because it handles very large quantities of data, the response is not immediate.
More about Hadoop:
Hadoop replicates its data across different computers, so that if one fails, the data can still be processed from one of the other replicas. Hadoop is not suited to OnLine Transaction Processing (OLTP) workloads, where data is accessed randomly in a structured store such as a relational database (RDBMS). Nor is it suited to OnLine Analytical Processing (OLAP) or Decision Support System workloads, where data is accessed sequentially in a structured store such as a relational database to create reports that provide business intelligence. Hadoop is used for Big Data. It complements OLTP and OLAP systems; it is not a replacement for a relational DBMS.
Question: If Hadoop is just a framework, then what does it consist of?
Components of Hadoop
The Hadoop project consists of two core pieces: the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce model.
1. HDFS is a distributed (clustered), scalable and portable file system written in Java for the Hadoop framework. HDFS is the primary input source for the Hadoop framework.
2. Hadoop uses the MapReduce programming model to process large datasets with distributed computing on clusters. MapReduce takes advantage of data locality: it processes data on or near the storage nodes that hold it, to reduce the distance the data must travel. Processing is carried out in two steps, Map and Reduce. In the Map step, the master node takes the input from HDFS, divides it into smaller sub-inputs and distributes them to worker nodes. A worker node may divide its input again, process the smaller sub-problems and pass the answers back to its master node. Mapping is performed by a Mapper function, which works on key-value pairs. In the Reduce step, the master node collects the solutions to all the sub-problems and combines them in some way to form the result for the original problem. The reduction is performed by Reducers.
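The Map and Reduce steps described above can be illustrated with a small word-count sketch. This is plain single-process Python, not the Hadoop API; the function names `map_words`, `shuffle`, `reduce_counts` and `word_count` are made up for this illustration, and the grouping step in the middle is an assumption about how mapper output reaches the reducers.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) key-value pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the values emitted by all mappers under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line stands in for one sub-input handed to a worker node.
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In a real cluster the mappers and reducers would run on different machines; here they are ordinary function calls so that the data flow is easy to follow.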
Question: How does Hadoop work?
Hadoop is aware of the network topology. This allows it to optimize where it sends computations to be applied to the data. Placing the work close to the data it operates on maximizes the bandwidth available for reading the data. When deciding which TaskTracker should receive a map task that reads data, the best option is the TaskTracker that runs on the same node as the data. If the computation cannot be placed on the same node, the next best option is a node in the same rack as the data. The worst case that Hadoop currently supports is when the computation must run on a node in a different rack from the data. When rack awareness is configured for your cluster, Hadoop will always try to run the task on the TaskTracker with the highest-bandwidth access to the data.
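The preference order above (same node, then same rack, then a different rack) can be sketched as a simple ranking. This is only an illustration of the idea, not Hadoop's scheduler code; `locality`, `pick_tracker` and the `rack_of` mapping are hypothetical names, and each TaskTracker is identified here by the name of the node it runs on.

```python
# Lower value means closer to the data, so it is preferred.
NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2


def locality(tracker, replica_nodes, rack_of):
    """Rank how close a TaskTracker's node is to any replica of the data."""
    if tracker in replica_nodes:
        return NODE_LOCAL
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[tracker] in replica_racks:
        return RACK_LOCAL
    return OFF_RACK


def pick_tracker(free_trackers, replica_nodes, rack_of):
    """Choose the free TaskTracker with the best locality for the data."""
    return min(free_trackers, key=lambda t: locality(t, replica_nodes, rack_of))


if __name__ == "__main__":
    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_tracker(["n3", "n2"], {"n1"}, rack_of))
```

When several trackers are equally close, this sketch simply takes the first one in the list; a real scheduler would weigh other factors such as load.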
Let us see an example of how a file gets written to HDFS.
First, the client submits a “create” request to the NameNode. The NameNode then checks that the file does not already exist and that the client has permission to write it.
If that succeeds, the NameNode determines the DataNode to write the first block to.
If the client is running on a DataNode, the NameNode will try to place the block there; otherwise it chooses a DataNode at random.
By default, the data is replicated to two other places in the cluster. A pipeline is built between the three DataNodes that hold the replicas. The second DataNode is a randomly chosen node on a different rack from that of the first replica of the block. This increases redundancy.
The final replica is placed on a random node within the same rack as the second replica. The data is piped from the second DataNode to the third.
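The placement rules for the three replicas can be sketched as follows. This is a simplified model, not the NameNode's actual implementation: `place_replicas` and the `rack_of` mapping are hypothetical, and the sketch assumes the cluster has at least two racks and that the second replica's rack holds at least two DataNodes.

```python
import random


def place_replicas(client, rack_of, rng=random):
    """Return the three DataNodes chosen for one block, in pipeline order."""
    nodes = sorted(rack_of)
    # First replica: the client's own node if it is a DataNode, else random.
    first = client if client in rack_of else rng.choice(nodes)
    # Second replica: a random node on a different rack from the first.
    off_rack = [n for n in nodes if rack_of[n] != rack_of[first]]
    second = rng.choice(off_rack)
    # Third replica: a different random node on the same rack as the second.
    same_rack = [n for n in nodes
                 if rack_of[n] == rack_of[second] and n != second]
    third = rng.choice(same_rack)
    return [first, second, third]


if __name__ == "__main__":
    rack_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
    print(place_replicas("a1", rack_of))
```

Passing a seeded `random.Random` instance as `rng` makes the choice repeatable, which is handy when experimenting with the rules.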
To ensure the write was successful before continuing, acknowledgment packets are sent back from the third DataNode to the second, from the second DataNode to the first, and from the first DataNode to the client.
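The order in which data and acknowledgments travel through the pipeline can be traced with a toy simulation. It only records who sends what to whom; `write_packet` and the node names are invented for this sketch and do not correspond to Hadoop classes.

```python
def write_packet(packet, pipeline):
    """Send one packet down the pipeline and return the ordered event log."""
    events = []
    hops = ["client"] + list(pipeline)
    # Data flows forward: client -> 1st -> 2nd -> 3rd DataNode.
    for sender, receiver in zip(hops, hops[1:]):
        events.append(f"data {packet}: {sender} -> {receiver}")
    # Acknowledgments flow backward: 3rd -> 2nd -> 1st DataNode -> client.
    back = hops[::-1]
    for sender, receiver in zip(back, back[1:]):
        events.append(f"ack {packet}: {sender} -> {receiver}")
    return events


if __name__ == "__main__":
    for event in write_packet("p0", ["dn1", "dn2", "dn3"]):
        print(event)
```

The client treats the packet as written only once the final acknowledgment from the first DataNode arrives, which is the last event in the log.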