Various Failures in Distributed Systems


Failure recovery can be a complicated issue, especially in distributed systems, where there may be multiple participating machines and multiple points of failure. It is instructive to identify the different components of the system and ask of each one, "What would happen if this part of the system failed?" Designing a reliable system that can recover from failures requires identifying the types of failure the system must deal with.

In a distributed system, we mainly need to deal with four types of failure −

  • Transaction failure (abort),
  • Site (system) failure,
  • Media (disk) failure, and
  • Communication line failure.

Some of these are due to hardware, and some are due to software.

Transaction Failure

Transactions can fail for various reasons in any system. A failure may result from an error in the transaction caused by incorrect input data, or from the detection of a present or potential deadlock. In addition, some concurrency control algorithms do not allow a transaction to proceed, or even to wait, if the data it is trying to access is currently being accessed by another transaction. This can also be considered a failure.

The most common approach in case of a transaction failure is to abort the transaction, thus resetting the database to the state it was in before the transaction started.
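
Resetting the database to its earlier state can be sketched with an undo log that records the old value of each item before it is overwritten. This is only a minimal illustration under simplifying assumptions, not the mechanism of any particular DBMS; the `Transaction` class and the dictionary standing in for the database are hypothetical.

```python
class Transaction:
    """Hypothetical transaction over a dict that records before-images for undo."""

    _MISSING = object()  # marks keys that did not exist before the transaction

    def __init__(self, db):
        self.db = db
        self.undo_log = []  # (key, before-image) pairs in write order

    def write(self, key, value):
        # Save the old value before overwriting so an abort can restore it.
        self.undo_log.append((key, self.db.get(key, self._MISSING)))
        self.db[key] = value

    def abort(self):
        # Undo the writes in reverse order to restore the pre-transaction state.
        for key, old in reversed(self.undo_log):
            if old is self._MISSING:
                self.db.pop(key, None)
            else:
                self.db[key] = old
        self.undo_log.clear()


db = {"balance": 100}
txn = Transaction(db)
txn.write("balance", 40)
txn.write("bonus", 5)
txn.abort()
print(db)  # {'balance': 100}
```

Undoing in reverse order matters when the same item is written more than once, since the earliest before-image must be the last one restored.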

Site (System) Failure

The cause of a system failure can be traced back to either a hardware or a software failure. A system failure is usually assumed to result in the loss of main memory contents. Therefore, any part of the database that was held in main memory buffers is lost as a result of the failure. However, the database stored on secondary storage is assumed to be safe and correct.
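
The difference between what is lost and what survives can be shown with a small simulation. This is a sketch under the assumption stated above (main memory is lost, secondary storage survives); the `Site` class and its method names are hypothetical.

```python
class Site:
    """Hypothetical site with volatile buffers and stable secondary storage."""

    def __init__(self):
        self.stable = {}  # secondary storage, assumed to survive a system failure
        self.buffer = {}  # main memory buffers, lost on a system failure

    def write(self, key, value):
        # Updates land in the buffer first.
        self.buffer[key] = value

    def flush(self):
        # Copy buffered updates to secondary storage.
        self.stable.update(self.buffer)
        self.buffer.clear()

    def read(self, key):
        if key in self.buffer:
            return self.buffer[key]
        return self.stable.get(key)

    def crash(self):
        # A system failure wipes main memory only.
        self.buffer.clear()


site = Site()
site.write("x", 1)
site.flush()
site.write("y", 2)  # never flushed
site.crash()
print(site.read("x"), site.read("y"))  # 1 None
```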

In distributed database terminology, system failures are referred to as site failures, since they result in the failed site being unreachable from the other sites in the distributed system.

We often distinguish between partial failures and total failures in a distributed system. Total failure means the simultaneous failure of all sites in the distributed system; partial failure means that only some sites have failed while the others remain operational. A failed site is usually recovered by restarting it as soon as possible and repairing or replacing the failed component once it has been identified.
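
The distinction can be expressed as a simple classification over the status of each site. The function below is an illustrative sketch; the name `classify_failure` and the status dictionary are hypothetical.

```python
def classify_failure(site_up):
    """Classify the system state from a mapping of site name -> is operational."""
    failed = [name for name, up in site_up.items() if not up]
    if not failed:
        return "none"
    if len(failed) == len(site_up):
        return "total"    # every site is down at the same time
    return "partial"      # some sites are down, the rest keep running


print(classify_failure({"A": True, "B": False, "C": True}))    # partial
print(classify_failure({"A": False, "B": False, "C": False}))  # total
```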

Media (Secondary Storage) Failure

Media failure refers to the failure of the secondary storage devices that hold the database. Such failures may be due to operating system errors or to hardware faults such as head crashes or controller failures. The critical point is that all or part of the database on secondary storage is considered to be destroyed and inaccessible. Duplexing (mirroring) disk storage and maintaining archival copies of the database are common strategies for dealing with this type of catastrophic problem.
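
A disk-level redundancy strategy such as mirroring can be sketched as follows: every write goes to two disks, and a read is served from whichever copy is still intact. This is a simplified illustration, not a real storage driver; the `MirroredStore` class is hypothetical and dictionaries stand in for disks.

```python
class MirroredStore:
    """Hypothetical store that keeps every item on two simulated disks."""

    def __init__(self):
        self.disks = [{}, {}]
        self.failed = [False, False]

    def write(self, key, value):
        # Write to every disk that is still working.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                disk[key] = value

    def fail(self, index):
        # Simulate a media failure: the contents of this disk are destroyed.
        self.failed[index] = True
        self.disks[index].clear()

    def read(self, key):
        # Serve the read from the first surviving copy.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                return disk[key]
        raise IOError("all copies lost; restore from an archival copy")


store = MirroredStore()
store.write("page1", "data")
store.fail(0)
print(store.read("page1"))  # data
```

If both disks fail, the sketch raises an error, which corresponds to the case where only an archival copy can bring the data back.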

Media failures are often treated as problems local to one site and, as a result, are not specifically addressed in the reliability mechanisms of distributed DBMSs.

Communication Failure

There are many types of communication failures. The most common are errors in messages, incorrectly ordered messages, lost messages, and communication line failures. Handling message errors and ordering is the responsibility of the computer network, so we will not look at them further. Therefore, in our discussion of distributed DBMS reliability, we expect the underlying network hardware and software to ensure that two messages sent from a process at one site to a process at a destination site are delivered without error and in the order in which they were sent.

Lost or undeliverable messages are usually the result of a communication line failure or a (destination) site failure. If a communication line fails, in addition to losing the messages in transit, it may split the network into two or more disconnected groups. This is called network partitioning. If the network is partitioned, the sites in each partition may continue to operate. In this case, executing transactions that access data stored in multiple partitions becomes a significant problem.
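
To illustrate how a failed line can split the network into disconnected groups, the sketch below computes which sites can still reach each other over the remaining working links. It is a plain breadth-first search over a graph; the function name `partitions` and the example topology are hypothetical.

```python
from collections import deque


def partitions(sites, links):
    """Return the groups of sites that can still reach each other.

    links is an iterable of (a, b) pairs for working bidirectional lines.
    """
    neighbours = {site: set() for site in sites}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in sites:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            site = queue.popleft()
            group.add(site)
            for nxt in neighbours[site]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups


sites = ["A", "B", "C", "D"]
links = [("A", "B"), ("C", "D")]  # the line between B and C has failed
print(len(partitions(sites, links)))  # 2
```

A transaction that needs data at both A and C cannot complete in this state, because no working path connects the two groups.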

Updated on: 28-Oct-2021
