A cluster is a collection of stand-alone computers connected using some interconnection network. Each node in a cluster could be a workstation, personal computer, or even a multiprocessor system.
A node is an autonomous computer that may be engaged in its own private activities while at the same time cooperating with other nodes on some computational task. Each node has its own input/output system and its own operating system.
When all nodes in a cluster have the same architecture and run the same operating system, the cluster is called homogeneous; otherwise, it is heterogeneous. The interconnection network could be a fast LAN or a switch.
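The homogeneous/heterogeneous distinction can be expressed as a simple check. The following is a minimal sketch, not a standard API: the Node record and the classify_cluster function are hypothetical names introduced here for illustration, and the architecture and operating-system strings are assumed examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical description of one cluster node.
    name: str
    architecture: str       # assumed example value: "x86_64"
    operating_system: str   # assumed example value: "Linux"


def classify_cluster(nodes):
    """Return "homogeneous" if all nodes share one architecture and one OS."""
    if not nodes:
        raise ValueError("a cluster needs at least one node")
    # Collect the distinct (architecture, OS) pairs present in the cluster.
    platforms = {(n.architecture, n.operating_system) for n in nodes}
    return "homogeneous" if len(platforms) == 1 else "heterogeneous"


if __name__ == "__main__":
    lab = [Node("n1", "x86_64", "Linux"), Node("n2", "x86_64", "Linux")]
    mixed = lab + [Node("n3", "arm64", "Linux")]
    print(classify_cluster(lab))
    print(classify_cluster(mixed))
```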
To achieve high-performance computing, the interconnection network must provide high-bandwidth, low-latency communication. The nodes of a cluster may be dedicated to the cluster all the time; hence computation can be performed on the entire cluster. Dedicated clusters are normally packaged compactly in a single room.
Dedicated clusters usually use high-speed networks such as Fast Ethernet and Myrinet. Alternatively, nodes owned by different individuals on the Internet could participate in a cluster only part of the time. In this case, the cluster can use the idle CPU cycles of each participating node, provided the owner grants permission.
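The part-time participation described above can be illustrated with a small filter that selects nodes whose owners have granted permission and that are currently idle. This is only a sketch under stated assumptions: the VolunteerNode record, the harvestable_nodes function, and the 20% load threshold are all hypothetical choices made for the example, not part of any particular cluster system.

```python
from dataclasses import dataclass


@dataclass
class VolunteerNode:
    # Hypothetical snapshot of a part-time node's status.
    name: str
    owner_permits_use: bool  # the owner has opted in to sharing idle cycles
    cpu_load: float          # fraction of CPU currently busy, 0.0 to 1.0


def harvestable_nodes(nodes, idle_threshold=0.2):
    """Return names of nodes that may be used: permitted and mostly idle."""
    return [
        n.name
        for n in nodes
        if n.owner_permits_use and n.cpu_load <= idle_threshold
    ]


if __name__ == "__main__":
    pool = [
        VolunteerNode("alice-pc", True, 0.05),   # permitted and idle
        VolunteerNode("bob-pc", False, 0.01),    # idle, but no permission
        VolunteerNode("carol-pc", True, 0.90),   # permitted, but busy
    ]
    print(harvestable_nodes(pool))
```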
The middleware layer in the architecture makes the cluster appear to the user as a single parallel machine, which is referred to as the single system image (SSI). The SSI infrastructure offers unified access to system resources by supporting several features, including −
Single entry point − A user can connect to the cluster instead of to a particular node.
Single file system − A user sees a single hierarchy of directories and files.
Single image for administration − The whole cluster is administered from a single window.
Coordinated resource management − A job can transparently compete for the resources in the entire cluster.
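Two of the features above, the single entry point and coordinated resource management, can be sketched together: the user submits a job to the cluster as a whole, and the middleware decides which node runs it. The example below is a minimal illustration that assumes a simple least-loaded placement policy. The ClusterEntryPoint class, its submit method, and the cost parameter are hypothetical names, and real SSI middleware may use very different policies.

```python
import heapq


class ClusterEntryPoint:
    """Hypothetical single entry point that places jobs on the least-loaded node."""

    def __init__(self, node_names):
        # Min-heap of (current_load, node_name); every node starts with load 0.
        self._heap = [(0, name) for name in sorted(node_names)]
        heapq.heapify(self._heap)

    def submit(self, job_name, cost=1):
        """Place a job and return the chosen node; the user never names a node."""
        load, node = heapq.heappop(self._heap)          # least-loaded node
        heapq.heappush(self._heap, (load + cost, node))  # record its new load
        return node


if __name__ == "__main__":
    cluster = ClusterEntryPoint(["n1", "n2"])
    for job, cost in [("render", 3), ("compile", 1), ("test", 1)]:
        print(job, "->", cluster.submit(job, cost))
```

Ties are broken by node name because the heap compares the (load, name) tuples in order.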
When the main objective of the cluster is high availability, the middleware also supports features that let cluster services recover from failures and provide fault tolerance across all nodes of the cluster.
For example, the middleware should offer the necessary infrastructure for check-pointing. A check-pointing scheme makes sure that the process state is saved periodically. In the case of a node failure, processes on the failed node can be restarted on another working node from their most recently saved state.
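The check-pointing idea can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not how any specific middleware implements it: the process state is a small dictionary, the checkpoint is a file that a surviving node is assumed to be able to read (for instance, on the shared file system), and the failure is simulated with an exception. The function names save_checkpoint, load_checkpoint, and run_job, and the fail_at parameter, are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save state atomically so a crash mid-write never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists yet."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..steps-1, saving the state every few steps."""
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    for step in range(state["next_step"], steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run_job(ckpt, 100, fail_at=37)   # the "node" fails partway through
    except RuntimeError as err:
        print("failure:", err)
    print("resumed result:", run_job(ckpt, 100))  # restart "on another node"
```

Work done after the last checkpoint (steps 30 to 36 in this run) is lost and repeated on restart, which is the usual trade-off: more frequent checkpoints mean less repeated work but more overhead for saving state.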