Yet Another Resource Negotiator (YARN)
Components of YARN in Cloudera Data Platform
YARN (Yet Another Resource Negotiator) is the resource management framework in Hadoop. It schedules applications and manages the resources they need to run. In CDP (Cloudera Data Platform), YARN manages resources for data processing applications such as Hadoop MapReduce, Apache Spark, and Apache Hive. The sections below cover the main YARN components in CDP and their key settings.
The fundamental idea of YARN is to split resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
ResourceManager (RM)
The ResourceManager is responsible for managing resources for the cluster and scheduling applications. It also handles node failures and tracks the status of applications. The settings for the ResourceManager can be configured in yarn-site.xml, which is located in the /etc/hadoop/conf directory.
Some important settings for the ResourceManager include:
yarn.resourcemanager.hostname: The hostname of the ResourceManager.
yarn.resourcemanager.resource-tracker.address: The address for the ResourceTracker service.
yarn.resourcemanager.scheduler.address: The address for the scheduler service.
yarn.resourcemanager.webapp.address: The address for the web application.
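As a rough illustration of how these properties appear in yarn-site.xml, the Python sketch below parses a small fragment and looks up the ResourceManager addresses. The hostname rm-host.example.com and the port numbers are placeholder values chosen for the example, and the helper read_properties is hypothetical, not part of any Hadoop or Cloudera tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml fragment; the host name and ports are placeholders.
YARN_SITE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>rm-host.example.com:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>rm-host.example.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>rm-host.example.com:8088</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping each <name> to its <value>."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


rm_settings = read_properties(YARN_SITE)
for key in sorted(rm_settings):
    print(f"{key} = {rm_settings[key]}")
```

The same helper could be pointed at the contents of a real yarn-site.xml, though on a managed CDP cluster the file is typically generated for you rather than edited by hand.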
NodeManager (NM)
The NodeManager runs on each node in the cluster and is responsible for managing resources on that node. It communicates with the ResourceManager to get resource allocation requests and executes tasks on the node. The settings for the NodeManager can be configured in yarn-site.xml.
Some important settings for the NodeManager include:
yarn.nodemanager.hostname: The hostname of the NodeManager.
yarn.nodemanager.local-dirs: A comma-separated list of directories on the local file system where the NodeManager stores localized files and intermediate data for containers.
yarn.nodemanager.log-dirs: A comma-separated list of local directories where the NodeManager stores container logs.
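The directory settings are easy to get subtly wrong, for example with a relative path or the same disk listed twice. The sketch below shows one way to sanity-check such values before applying them, assuming both properties take comma-separated lists. The paths are placeholders, and split_dirs and check_dirs are hypothetical helpers written for this example.

```python
import posixpath

# Placeholder values; real paths depend on how the cluster's disks are mounted.
nm_settings = {
    "yarn.nodemanager.local-dirs": "/data/1/yarn/nm,/data/2/yarn/nm",
    "yarn.nodemanager.log-dirs": "/data/1/yarn/logs, /data/2/yarn/logs",
}


def split_dirs(value):
    """Split a comma-separated directory list, dropping blanks and whitespace."""
    return [part.strip() for part in value.split(",") if part.strip()]


def check_dirs(dirs):
    """Return a list of problems found: relative paths or duplicate entries."""
    problems = []
    seen = set()
    for d in dirs:
        if not posixpath.isabs(d):
            problems.append(f"not an absolute path: {d}")
        normalized = posixpath.normpath(d)
        if normalized in seen:
            problems.append(f"duplicate entry: {d}")
        seen.add(normalized)
    return problems


for key, value in nm_settings.items():
    dirs = split_dirs(value)
    print(key, "->", dirs, "problems:", check_dirs(dirs))
```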
The ResourceManager and the NodeManager form the data-computation framework.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
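To make the division of labor concrete, here is a deliberately simplified toy model in Python. It is not the YARN API and does not reflect how the real schedulers choose nodes: the Node and ToyResourceManager classes are invented for illustration, and the first-fit placement rule is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy stand-in for the free resources a NodeManager reports."""
    name: str
    free_mb: int
    free_vcores: int


class ToyResourceManager:
    """Toy arbiter: places each request on the first node with enough room."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id, mb, vcores):
        for node in self.nodes:
            if node.free_mb >= mb and node.free_vcores >= vcores:
                # Reserve the resources so later requests see the reduced capacity.
                node.free_mb -= mb
                node.free_vcores -= vcores
                return (app_id, node.name)
        return None  # no node can satisfy the request right now


rm = ToyResourceManager([Node("node1", 4096, 2), Node("node2", 8192, 4)])
print(rm.allocate("app-1", 3072, 2))   # fits on node1
print(rm.allocate("app-2", 2048, 1))   # node1 is now too full, so node2 is used
print(rm.allocate("app-3", 16384, 1))  # larger than any node, so not placed
```

The point of the sketch is only that a single component holds the cluster-wide view and decides placement, while each node tracks and reports its own capacity.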
ApplicationMaster (AM)
The ApplicationMaster is responsible for managing a single application's execution. It communicates with the ResourceManager to get resource allocations and manages the lifecycle of the application. The settings for the ApplicationMaster can be set at the application level.
For MapReduce applications, some important settings for the ApplicationMaster include:
yarn.app.mapreduce.am.resource.mb: The amount of memory, in MB, to allocate to the MapReduce ApplicationMaster.
yarn.app.mapreduce.am.command-opts: The Java command-line options to pass to the MapReduce ApplicationMaster process, such as its heap size.
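These two settings are usually tuned together, because the JVM heap passed in the command options has to fit inside the memory granted to the ApplicationMaster's container. The sketch below derives a heap flag from the container size. The 0.8 ratio is a commonly used rule of thumb rather than a value from this document, and am_heap_opts is a hypothetical helper.

```python
def am_heap_opts(container_mb, heap_fraction=0.8):
    """Build a -Xmx flag that leaves headroom inside the AM container.

    heap_fraction is an assumed rule of thumb; the remainder covers
    non-heap JVM memory such as thread stacks and native buffers.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mb = int(container_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"


am_container_mb = 2048  # example value, not a recommendation
am_settings = {
    "yarn.app.mapreduce.am.resource.mb": str(am_container_mb),
    "yarn.app.mapreduce.am.command-opts": am_heap_opts(am_container_mb),
}
for key, value in am_settings.items():
    print(f"{key} = {value}")
```

If the heap is set larger than the container, the ApplicationMaster risks being killed for exceeding its memory allocation, which is why the heap is derived from the container size and not set independently.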
Containers
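Containers are the basic unit of resource allocation in YARN. Each container is a bundle of resources (memory and CPU) on a single node, assigned to run a specific task or process. Applications request the container sizes they need, while the total resources a node offers to containers are set on the NodeManager.
Some important settings for containers include:
yarn.nodemanager.resource.memory-mb: The total amount of physical memory, in MB, that the NodeManager can allocate to containers on its node.
yarn.nodemanager.resource.cpu-vcores: The total number of virtual CPU cores (vcores) that the NodeManager can allocate to containers on its node.
The size of an individual container is bounded separately by scheduler settings such as yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores.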
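A quick way to reason about these values is to estimate how many containers of a given size fit on one node. The sketch below treats yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores as the node-wide totals available to containers, and assumes the scheduler accounts for both memory and vcores (some scheduler configurations consider memory only). The numbers and the containers_per_node helper are illustrative.

```python
def containers_per_node(node_mb, node_vcores, container_mb, container_vcores):
    """Estimate how many equally sized containers fit on one node.

    The tighter of the two limits (memory or vcores) decides the result.
    """
    if container_mb <= 0 or container_vcores <= 0:
        raise ValueError("container size must be positive")
    by_memory = node_mb // container_mb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu)


# Example node: 64 GB and 16 vcores offered to containers (illustrative values).
node_mb = 65536      # yarn.nodemanager.resource.memory-mb
node_vcores = 16     # yarn.nodemanager.resource.cpu-vcores

# Containers of 4 GB and 2 vcores each: memory allows 16, CPU allows only 8.
print(containers_per_node(node_mb, node_vcores, 4096, 2))
```

In this example CPU is the limiting resource, so half of the node's memory would sit idle. That kind of imbalance is a hint to adjust either the container size or the node totals.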
Overall, configuring these YARN components and their settings is essential for ensuring efficient resource allocation and management in a Hadoop cluster running on CDP.