Chapter 7 Fault Tolerance
A characteristic feature of distributed
systems that distinguishes them from single-machine systems in the noition
of partial failure.
Fault tolerant = Dependable systems Availability: Working at a given instant in time Reliability: Run continuously without failure Safety: When a system fails like for a powerplant, nothing catastrophic happens. Maintainability: How easy a failed system can be repaired. A SYSTEM IS SAID TO FAIL WHEN IT CANNOT MEET ITS PROMISES Cause of an error = fault 1; Transient faults (birds flying by) 2; Intermittent faults 3; Permanent fault Failure models: 1; Crash failure (works then does not work) 2; Omission failure (a server fails to respond to a request 3; Timing failures 4; Response failure 5; state transition failure 6; arbitrary failure (Byzantine failure) 7; Fail-stop failure or Arbitrary failure (a server anounces its crash before it happens 8; Arbitrary, but fail-safe faults. The junk output can not be recognized (Byzantine).
Byzantine failure: A server is producing output it should never have produced and the output can not be detected as being incorrect. A server can work with other servers to deliver wrong answers together.
Masking faults: The key technique for masking faults is to use redundancy.
PROCESS RESILIENCE:
Replicating processes into groups: Can be dynamic, usually follows the social organization, having processes in a group allow the group of processes to be dealt with as a single abstraction. A process can send a message to a group of servers without knowing or neeing to know how many processes there are in a group.
FLAT GROUPS VERSUS HIERARCHICAL GROUPS: Flat groups have no single point of failure. Decisionmaking is more difficult. Hierarchical groups work the opposite.
GROUP MEMBERSHIP: A group server can be used to administer creating and deleting groups and allowing processes to join or leave the groups. This set-up can though cause a single point of failure. Can manage groups distributed if reliable multicasting is available.
Reach an agreement within a process group:
PRIMARY BASED REPLICATION: Primary backup protocol. If the primary crashes, the backups execute some election algorithm to choose a new primary. Starts an election algorithm.
RELIABLE CLIENT-SERVER COMMUNICATION
Not only processes can be faulte, also the communication need to be reliable. Duplicated messages can be a problem.
To mask a Point to Point Communication, we set up the system to automatically be communicating again over TCP if the TCP channel crashed. Remote Procedure Calls and Remote Method Invocations are hiding communication by making the remote communication look like a local one.
Problems with RPC systems:
RELIABLE GROUP COMMUNICATION Process resilience = One definition of resilience is the rate at which a system returns to a single steady or cyclic state following a perturbation. This definition of resilience assumes that behavior of a system remains within the stable domain that contains this steady state.
TCP offer reliable point-to-point communication, but not to a group of processes. What is reliable multicasting? Message sent to a message group should be delivered to each member of that group.
SCALABILITY IN RELIABLE COMMUNICATION - Nonhierarchical feedback control (feedback supression, SRM, report only when a message is missing, the process multicasts the message to the other processes) Hierarchical Feedback Control Achieving scalability for very large groups of receivers requires that hierarchical approaches are adopted.
If the coordinator itself has missed a message m, it asks the coordinator of the parent subroup to retransmit m. Negative with this set-up: A tree needs to be constructed dynamically.
DISTRIBUTED COMMIT Atomicity: An operation being performed by each member of a process goup, or none at all. USED IN TRANSACTIONS.
Fault-tolerant applications can also use distributed commit protocols:
One-phase commit protocol: A process participant has no way of telling the coordinator that it can not do the task. Two-phase commit protocol: Each two phases consist of two phases: VOTE_REQUEST (vote_commit or vote_abort) and GLOBAL_COMMIT(global_abort). A problem with the two-phase commit protocol is that both the participants and the coordinator have states where they block waiting for the incoming messages. The protocol can fail if a process fail and other processes are independantly waiting for it.
Coordinator(process): Phase 1 of the two-phase: Sends VOTE_REQUEST Participant(process): Phase 1 of the two-phase: Returns either VOTE_COMMIT or VOTE_ABORT
Coordinator(process): Phase 2 of the two-phase: Collects all votes. If ONE voted to abort, GLOBAL_ABORT message is multicasted else GLOBAL_COMMIT is sent. Participant(process): Phase 2 of the two-phase: Those that voted for are waiting for the reply. If a participant receives a message it either does GLOBAL_COMMIT or GLOBAL_ABORT.
Observation: The real problem lies in the fact that the coordinator`s final decision may not be available for some time. Alternative: Let a participant P in the ready state timeout when it hasn`t received the coordinator`s decision; P tries to find out what other participants know. Question: Can P not succeed in getting the required information? Observation: Essence of the problem is that a recovering participant cannot make a local decision: it is dependent on other (possibly failed) processes.
Three-phase commit protocol: Do not need to read.
|