Hadoop was developed in 2005 by Doug Cutting and Mike Cafarella as an ancillary project at Yahoo. Doug Cutting named it after his son’s toy elephant, “Hadoop”. Today Hadoop is a top shelf open source project of the Apache Software Foundation that is being further developed by the global community. It was programmed in Java. Hadoop enables distributed processing of large data sets across clusters of commodity servers, and is geared towards the extension of a single server to thousands of machines with a low margin of error. Instead of relying on luxury hardware, the reliability of the cluster comes from the capability of the software to detect and rectify failures in the application layer. Hadoop is a software framework that supports data-intensive, decentralized applications under a free license.
The core of Hadoop consists of:
1. The Hadoop Distributed File System (HDFS) – permits storage and access of hundreds of gigabytes to terabytes of data across all the nodes in a Hadoop cluster. HDFS separates large volumes of data into data blocks that are managed from various interfaces in the cluster. Each data block is copied to multiple hardware locations so that a single machine failure does not result in a loss of data availability.
2. MapReduce as a medium – runs work (diagram and reduction tasks) efficiently over these machines. MapReduce – two tasks that Hadoop carries out: a diagram task sorts and filters a data set. The reduce job takes the output from a map as input and summarizes it. The significant benefit is that MapReduce parallelizes massively across all nodes of the cluster which allows for the sorting of Petabytes of data in a few hours.
3. YARN – The basic concept behind YARN, the most important addition to Hadoop 2, is the separation of resource management from application management. Instead of placing planning and processing tasks in MapReduce, these tasks are now executed as separate components. These changes have a positive effect on the capability of the Hadoop framework to support real-time analyses and adhoc queries.
What characterizes Hadoop:
• Java-based open source framework
• Hadoop Distributed File System (HD- FS) as a distributed data system
• MapReduce algorithms for parallel data processing
• Hive as a data warehouse on Hadoop
• Hbase as a NoSQL databank on Hadoop
Technological benefits of Hadoop:
• Fast and easy scalability of the clusters
• High speed processing and analysis through parallelization
• Simultaneous processing of multiple data types (structured, halfstructured, unstructured)
• Capability to process text
• Low costs through open source and standard hardware