Hadoop

When we talk about distributed systems, Hadoop maybe the most popular architecture nowadays, although it is not the first distributed system.

What Is Apache Hadoop?

cited from: http://hadoop.apache.org

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

 

Hadoop includes four modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Following is the introduction and architecture for each part.

Hadoop Common:

These are Java libraries and utilities required by other Hadoop modules. These libraries provides filesystem and OS level abstractions and contains the necessary Java files and scripts required to start Hadoop.

Hadoop Distributed File System (HDFS):

hdfsarchitecture

 

HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is a software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks:

  • Manages the file system namespace.
  • Regulates client’s access to files.
  • It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode

The datanode is a commodity hardware having the GNU/Linux operating system and datanode software. For every node (Commodity hardware/System) in a cluster, there will be a datanode. These nodes manage the data storage of their system.

  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block

Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called as blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB, but it can be increased as per the need to change in HDFS configuration.

Hadoop YARN

cited from: http://ercoppa.github.io/HadoopInternals/HadoopArchitectureOverview.html

yarn_architecture

the YARN Infrastructure (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources (e.g., CPUs, memory, etc.) needed for application executions. Two important elements are:

  • the Resource Manager (one per cluster) is the master. It knows where the slaves are located (Rack Awareness) and how many resources they have. It runs several services, the most important is the Resource Scheduler which decides how to assign the resources.Services provided by Resource Manager is showed below:

resource-manager_534be06c-eb4c-4516-a178-5ff00a005d90

 

  • the Node Manager (many per cluster) is the slave of the infrastructure. When it starts, it announces himself to the Resource Manager. Periodically, it sends an heartbeat to the Resource Manager. Each Node Manager offers some resources to the cluster. Its resource capacity is the amount of memory and the number of vcores. At run-time, the Resource Scheduler will decide how to use this capacity: a Container is a fraction of the NM capacity and it is used by the client for running a program.

node-manager-overview_534beb08-0c0c-4d84-bf75-3a670a00c014

In YARN, there are at least three actors:

  • the Job Submitter (the client)
  • the Resource Manager (the master)
  • the Node Manager (the slave)

The application startup process is the following:

  1. a client submits an application to the Resource Manager
  2. the Resource Manager allocates a container
  3. the Resource Manager contacts the related Node Manager
  4. the Node Manager launches the container
  5. the Container executes the Application Master

The Application Master is responsible for the execution of a single application. It asks for containers to the Resource Scheduler (Resource Manager) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The Application Master knows the application logic and thus it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.

The Resource Manager is a single point of failure in YARN. Using Application Masters, YARN is spreading over the cluster the metadata related to running applications. This reduces the load of the Resource Manager and makes it fast recoverable.

yarn-architecture_5356ab97-2bd8-4f19-b30e-1ef60a00dcc0yarn-application-startup_534bf195-890c-4c7a-95eb-13cb0a008d03

 

Hadoop MapReduce

cited from: http://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm; https://en.wikipedia.org/wiki/MapReduce

What is MapReduce?

MapReduce is a processing technique and a program model for distributed computing based on java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

Why MapReduce?

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

What the steps of MapReduce?

  • “Map” step: Each worker node applies the “map()” function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of redundant input data is processed.
  • “Shuffle” step: Worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node.
  • “Reduce” step: Worker nodes now process each group of output data, per key, in parallel.

Another way to look at MapReduce is as a 5-step parallel and distributed computation:

  1. Prepare the Map() input – the “MapReduce system” designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
  2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
  3. “Shuffle” the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
  4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
  5. Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.

These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved as long as the final result is not affected.

One thought on “Hadoop

  1. Pingback: Distributed System Study | Keep Learning

Leave a comment