Chapter 7     Fault Tolerance

A characteristic feature of distributed systems that distinguishes them from single-machine systems in the noition of partial failure.
bulletConstruct a system so that it can automatically recover from partial failures
bulletA distributed system should continue to operate in an acceptable way while repairs are being made.
bulletShould be fault tolerant.
bulletReliable multicasting (to keep processes synchronized
bulletShould be able to recover

Fault tolerant = Dependable systems

Availability:  Working at a given instant in time

Reliability:  Run continuously without failure

Safety:  When a system fails like for a powerplant, nothing catastrophic happens.

Maintainability:  How easy a failed system can be repaired.


Cause of an error = fault

1; Transient faults (birds flying by) 2; Intermittent faults 3; Permanent fault

Failure models:

1; Crash failure (works then does not work)  2; Omission failure (a server fails to respond to a request 3; Timing failures

4; Response failure 5; state transition failure 6; arbitrary failure (Byzantine failure)  7; Fail-stop failure or Arbitrary failure (a server anounces its crash before it happens 8; Arbitrary, but fail-safe faults.  The junk output can not be recognized (Byzantine).


Byzantine failure:  A server is producing output it should never have produced and the output can not be detected as being incorrect.  A server can work with other servers to deliver wrong answers together.


Masking faults:

The key technique for masking faults is to use redundancy.



Information redundancy - Hamming code


Time redundancy       - Used with transient or intermittent faults.  - During transactions


Physical redundancy   - Extra equipment or processes are added  - Replicating processes  Triple Modular Redunancy (three inputs gives one output)




Replicating processes into groups:

Can be dynamic, usually follows the social organization, having processes in a group allow the group of processes to be dealt with as a single abstraction.  A process can send a message to a group of servers without knowing or neeing to know how many processes there are in a group.



Flat groups have no single point of failure.  Decisionmaking is more difficult.  Hierarchical groups work the opposite.



A group server can be used to administer creating and deleting groups and allowing processes to join or leave the groups. This set-up can though cause a single point of failure

Can manage groups distributed if reliable multicasting is available.


Reach an agreement within a process group:



Primary backup protocol.  If the primary crashes, the backups execute some election algorithm to choose a new primary.  Starts an election algorithm.


A system is said to be K fault tolerant if it can survive faults in k components and still meet its specifications.

If processes exhibit Byzantine failures, a minimum of 2k+1 processors are needed to achieve k fault tolerance. 




Not only processes can be faulte, also the communication need to be reliable.  Duplicated messages can be a problem. 


To mask a Point to Point Communication, we set up the system to automatically be communicating again over TCP if the TCP channel crashed.

Remote Procedure Calls and Remote Method Invocations are hiding communication by making the remote communication look like a local one.


Problems with RPC systems: 

  1. The client is unable to locate the server (write in exceptions)

  2. The request message from the client to the server is lost (start a timer to fix it)

  3. The server crashes after receiving a request                                                                                                Least once semantics and at most once semantics or guarantee nothing for the server.  For the client there are 4 strategies; NEVER REISSUE A REQUEST, ALWAYS ISSUE A REQUEST, ONLY IF IT DID NOT RECEIVE AN ACKNOWLEDGMENT THAT THE PRINT REQUEST HAS NOT BEEN DELIVERED AND THE LAST ONE WHEN THERE IS NO ACKNOWLEDGEMENT ON THE

  4. The reply message from the server to the client is lost (rely on a timer and resend, transferring money is not IDEMPOTENT, for money transfer it will be safer to put on a sequence number for each transaction)

  5. The client crashes after sending a request (orphans, 1: Extermination 2: Reincarnation 3: Gentle reincarnation 4: Expiration



Process resilience =  One definition of resilience is the rate at which a system returns to a single steady or cyclic state following a perturbation. This definition of resilience assumes that behavior of a system remains within the stable domain that contains this steady state.


TCP offer reliable point-to-point communication, but not to a group of processes.

What is reliable multicasting? 

Message sent to a message group should be delivered to each member of that group.

add sequence number to the multicast message (each message is stored in a history buffer at the sender until each receiver has resturned an acknowledgement).


FEEDBACK IMPLOSION There are too many processes that receives a multicast and are sending acknowledgements back.


-  Nonhierarchical feedback control (feedback supression, SRM, report only when a message is missing, the process multicasts the message to the other processes)

Hierarchical Feedback Control

Achieving scalability for very large groups of receivers requires that hierarchical approaches are adopted.

Each subgroup appoints a local coordinator (responsible for handling, retransmission)

If the coordinator itself has missed a message m, it asks the coordinator of the parent subroup to retransmit m.

Negative with this set-up: A tree needs to be constructed dynamically.



Atomicity:      An operation being performed by each member of a process goup, or none at all.  USED IN TRANSACTIONS.


Fault-tolerant applications can also use distributed commit protocols:


One-phase commit protocol:    A process participant has no way of telling the coordinator that it can not do the task.

Two-phase commit protocol:    Each two phases consist of two phases:  VOTE_REQUEST (vote_commit or vote_abort)  and GLOBAL_COMMIT(global_abort).

A problem with the two-phase commit protocol is that both the participants and the coordinator have states where they block waiting for the incoming messages.  The protocol can fail if a process fail and other processes are independantly waiting for it.


The main purpose of the two-phase protocol is to make sure that either the participants in a process group is doing the task or that NONE of the participant are doing it.


Coordinator(process):  Phase 1 of the two-phase:  Sends VOTE_REQUEST

Participant(process): Phase 1 of the two-phase:  Returns either VOTE_COMMIT or VOTE_ABORT


Coordinator(process):  Phase 2 of the two-phase:  Collects all votes.  If ONE voted to abort, GLOBAL_ABORT message is multicasted else GLOBAL_COMMIT is sent.

Participant(process): Phase 2 of the two-phase:  Those that voted for are waiting for the reply.  If a participant receives a message it either does GLOBAL_COMMIT or GLOBAL_ABORT.


Observation:  The real problem lies in the fact that the coordinator`s final decision may not be available for some time.

Alternative:  Let a participant P in the ready state timeout when it hasn`t received the coordinator`s decision; P tries to find out what other participants know.

Question:  Can P not succeed in getting the required information?

Observation:  Essence of the problem is that a recovering participant cannot make a local decision:  it is dependent on other (possibly failed) processes.



Three-phase commit protocol:  Do not need to read.