Hadoop
Resources: https://github.com/wavestone-cdt/hadoop-attack-library
Terminology
Cluster - Refers to all the systems that together make the datalake.
Node - A single host or computer in the Hadoop cluster.
NameNode - A node that is responsible for keeping the directory tree of the Hadoop file system.
DataNode - A slave node that stores files according to the instructions of a NameNode.
Primary NameNode - The current active node responsible for keeping the directory structure.
Secondary NameNode - The backup node which will perform a seamless takeover of the directory structure should the Primary NameNode become unresponsive. There can be more than one Secondary NameNode in a cluster, but only one Primary active at any given time.
Master Node - Any node that is executing a Hadoop "management" application such as HDFS Manager or YARN Resource Manager.
Slave Node - Any node that runs a Hadoop "worker" application such as HDFS or MapReduce. It should be noted that a single node can be both a Master and Slave node at the same time.
Edge Node - Any node that is hosting a Hadoop "user" application such as Zeppelin or Hue. These are applications that users can use to perform processing on the data stored in the datalake.
Kerberised - The term given for a datalake that has security enabled through Kerberos.
Apache Hadoop Services
HDFS - Hadoop Distributed File System is the primary storage application for unstructured data such as files
Hive - Hive is the primary storage application for structured data. Think of it as a massive database.
YARN - Main resource manager application of Hadoop, used to schedule jobs in the cluster
MapReduce - Application executor of Hadoop to process vast amounts of data. It consists of a Map procedure, which performs filtering and sorting, and a reduce method, which performs a summary operation.
HUE - A user application that provides a GUI for HDFS and Hive.
Zookeeper - Provides operational services for the cluster to set the configuration of the cluster in question.
Spark - Engine for large-scale data processing.
Kafka - A message broker to build pipelines for real-time data processing.
Ranger - Used for the configuration of privilege access control over the resources in the datalake.
Zeppelin - A web-based notebook application for interactive data analytics.
Last updated