The key differences between primary, core, and task nodes in an Amazon EMR cluster are:
Primary Node (also known as Master Node):
- The primary node is responsible for coordinating the cluster and managing the execution of jobs.
- It runs the main cluster services, such as the HDFS NameNode and the YARN ResourceManager (the JobTracker filled this role only in legacy Hadoop 1 releases).
- By default, an EMR cluster has a single primary node; a cluster launched with high availability runs three primary nodes.
- In a single-primary cluster, the primary node cannot be terminated during the lifetime of the cluster, because losing it terminates the cluster.
Core Nodes:
- Core nodes host the Hadoop Distributed File System (HDFS) and run the HDFS DataNode and YARN NodeManager services (TaskTracker in legacy Hadoop 1 releases).
- They are responsible for storing and processing data in the cluster.
- Core nodes cannot be removed from the cluster without risking data loss, as they contain the persistent data in HDFS.
- You should reserve core nodes for the capacity that is required until your cluster completes.
Task Nodes:
- Task nodes are used for running tasks and do not host HDFS. They can be added or removed from the cluster as needed, without the risk of data loss.
- Task nodes are ideal for handling temporary or burst workloads, as you can launch task instance fleets on Spot Instances to increase capacity while minimizing costs.
- When managed scaling removes task nodes, the cluster never scales below the minimum constraints set in the managed scaling policy.
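
The node roles and scaling limits described above can be sketched as plain Python dictionaries in the shape that the EMR `RunJobFlow` API expects. This is a minimal sketch, not a definitive setup: the helper names, the group names, and the `m5.xlarge` instance type are assumptions for illustration, and nothing here calls AWS.

```python
from typing import Any, Dict, List


def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge") -> List[Dict[str, Any]]:
    """Describe one primary group, one core group, and an optional task group.

    The instance type default is an illustrative assumption.
    """
    if core_count < 1:
        # Assumption for this sketch: HDFS needs at least one core node.
        raise ValueError("core_count must be at least 1")
    groups = [
        # The API still uses the role name MASTER for the primary node.
        {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS data, so this sketch keeps them On-Demand.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so Spot interruptions lose no data.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


def build_managed_scaling_policy(min_units: int, max_units: int,
                                 max_core_units: int) -> Dict[str, Any]:
    """Describe managed scaling limits; the cluster stays within min and max."""
    if not 1 <= min_units <= max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Capacity above this limit is added as task nodes only.
            "MaximumCoreCapacityUnits": max_core_units,
        }
    }


groups = build_instance_groups(core_count=2, task_count=3)
policy = build_managed_scaling_policy(min_units=3, max_units=10, max_core_units=4)
```

With boto3, these values could be passed as `Instances={"InstanceGroups": groups}` and `ManagedScalingPolicy=policy` to `run_job_flow`, together with the other required arguments such as the cluster name, release label, and IAM roles.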
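Here's a table summarizing the key differences:

| Aspect | Primary node | Core nodes | Task nodes |
| --- | --- | --- | --- |
| Role | Coordinates the cluster and manages job execution | Store and process data | Run tasks only |
| Hosts HDFS data | No (runs the NameNode) | Yes | No |
| Removal while the cluster runs | Not possible; the cluster depends on it | Risks HDFS data loss | Safe, no risk of data loss |
| Typical use | Cluster coordination | Capacity required until the cluster completes | Temporary or burst capacity, for example on Spot Instances |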
More details regarding,