Availability

  • The property of software that it is there and ready to carry out its task when you need it to be

    • Encompasses reliability as well as downtime for maintenance

    • Builds on reliability by adding recoverability

      • Recoverability - when the system breaks, it repairs itself

    • availability is about minimizing service outage time by mitigating faults.

  • AKA dependability: the ability to avoid failures that are more frequent or more severe than is acceptable

    • Fault tolerant

  • Availability refers to the ability of a system to mask or repair faults/failures such that the cumulative service outage period does not exceed a required value over a specified time interval.

      • Whether a failure has occurred (and hence whether availability is affected) is subject to the judgment of an external agent, possibly a human

      • Hence availability encompasses reliability, confidentiality, integrity, and anything else whose compromise causes unacceptable failures

      • Failure implies visibility to a system or human observer in the environment

      • a failure is the deviation of the system from its specification, where the deviation is externally visible

        • A failure occurs when the system no longer delivers a service that is consistent with its specification; this failure is observable by the system's actor

      • A Fault (or combination of faults) is the cause of the failure

        • A fault can be either internal or external to the system

        • Errors - Intermediate states between the occurrence of a fault and the occurrence of a failure

        • Faults can be prevented, tolerated, removed, or forecast

        • faults are detected and correlated prior to being reported and repaired

        • Fault correlation logic will categorize a fault according to its severity (critical, major, or minor) and service impact (service-affecting or non-service-affecting) in order to provide the system operator with timely and accurate system status and allow for the appropriate repair strategy to be employed.

        • The repair strategy may be automated or may require manual intervention.

  • Availability is closely related to security

    • DoS/DDoS attacks are designed to make a system fail, that is, to make it unavailable

  • Availability is also closely related to performance

    • it may be difficult to tell when a system has failed and when it is simply being outrageously slow to respond

    • A slow system can be just as bad as, and indistinguishable from, a down service

  • availability is closely allied with safety

    • Safety - keeping the system from entering a hazardous state and recovering or limiting the damage when it does

  • To build high-availability (HA) systems, you need to understand the nature of the failures that can arise during operation and provide mitigation strategies for them

    • We care about:

      • how system faults are detected

      • how frequently system faults may occur

      • what happens when a fault occurs

      • how long a system is allowed to be out of operation, and when faults or failures may occur safely

      • how faults or failures can be prevented

      • what kinds of notifications are required when a failure occurs

      • the level of capability that remains when a failure has occurred

    • Because a system failure is observable by users, the time to repair is the time until the failure is no longer observable

    • Observability (making a system's processes/state observable) is another quality attribute linked to availability

  • automatic repair strategies

    • if code containing a fault is executed but the system is able to recover from the fault without any deviation from specified behavior being observable, there is no failure

  • Availability of a system = MTBF / (MTBF + MTTR) (see the sketch after this list)

    • MTBF = the mean time between failures

    • MTTR = the mean time to repair

    • As a percentage

      • like 5 9's = 99.999% available, thus only 0.001% downtime

        • downtime = 1m18s per 90 days, 5m15s per year

      • Although scheduled downtime (repairs) is sometimes not considered part of this calculation

      • This % is used in SLAs (service-level agreement)

        • specifies the availability level that is guaranteed and, usually, the penalties that the computer system or hosting service will suffer if the SLA is violated

    • This formula helps you think about:

      • what will make your system fail

      • how likely that is to occur

      • that there will be some time required to repair it
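
    • A minimal sketch of this calculation (the MTBF/MTTR figures are hypothetical, not from the text):

      ```python
      # Availability = MTBF / (MTBF + MTTR); all figures below are illustrative.

      def availability(mtbf_hours: float, mttr_hours: float) -> float:
          """Steady-state availability as a fraction in [0, 1]."""
          return mtbf_hours / (mtbf_hours + mttr_hours)

      def yearly_downtime_minutes(avail: float) -> float:
          """Expected downtime per year implied by an availability fraction."""
          return (1.0 - avail) * 365.25 * 24 * 60

      a = availability(mtbf_hours=10_000, mttr_hours=0.1)  # hypothetical component
      print(f"availability: {a:.5%}")                      # ~99.999% ("five nines")
      print(f"downtime/year: {yearly_downtime_minutes(a):.1f} min")  # ~5.3 minutes
      ```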

Planning for failure

  • Planning for failure is not optional: preventing every failure is impossible, and failures are inevitable

  • Better to plan for the occurrence of a failure or (more likely) failures, and for handling them

  • Need to understand the possible failures and what their consequences are

    • Methods:

      • Hazard Analysis

        • attempts to catalog the hazards that can occur during the operation of a system

        • categorizes each hazard according to its severity

        • Different fields have established standard categorizations

        • assesses the probability of each hazard occurring

        • Hazards for which the product of cost and probability exceed some threshold are then made the subject of mitigation activities.

      • Fault tree analysis

        • an analytical technique that specifies a state of the system that negatively impacts safety or reliability, and then analyzes the system's context and operation to find all the ways that the undesired state could occur

        • uses a graphic construct (the fault tree) that helps identify all sequential and parallel sequences of contributing faults that will result in the occurrence of the undesired state, which is listed at the top of the tree (the "top event")

        • A "minimal cut set" is the smallest combination of events along the bottom of the tree that together can cause the top event.

          • The set of minimal cut sets shows all the ways the bottom events can combine to cause the overarching failure.

          • Any singleton minimal cut set reveals a single point of failure, which should be carefully scrutinized.

        • the probabilities of various contributing failures can be combined to come up with a probability of the top event occurring (see the sketch below)

        • Dynamic analysis occurs when the order of contributing failures matters

          • Markov analysis can be used to calculate probability of failure over different failure sequences

        • can also be used to diagnose failures at runtime

          • If the top event has occurred, then (assuming the fault tree model is complete) one or more of the contributing failures has occurred, and the fault tree can be used to track it down and initiate repairs
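
        • A minimal sketch of combining failure probabilities up a fault tree, assuming independent basic events (the tree and probabilities are invented for illustration): AND gates multiply probabilities; OR gates combine as 1 - Π(1 - p)

          ```python
          from math import prod

          # Hypothetical basic-event probabilities, assumed independent.
          p = {"pump_fails": 0.01, "valve_sticks": 0.02, "sensor_drifts": 0.05}

          def p_and(*probs: float) -> float:
              return prod(probs)            # all contributing events must occur

          def p_or(*probs: float) -> float:
              return 1.0 - prod(1.0 - q for q in probs)  # at least one occurs

          # Top event: (pump fails AND valve sticks) OR sensor drifts.
          top = p_or(p_and(p["pump_fails"], p["valve_sticks"]), p["sensor_drifts"])
          print(f"P(top event) = {top:.5f}")  # 1 - (1 - 0.0002)(1 - 0.05) = 0.05019
          ```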

      • Failure Mode, Effects, and Criticality Analysis (FMECA)

        • catalogs the kinds of failures that systems of a given type are prone to, along with how severe the effects of each one can be

        • relies on the history of failure of similar systems in the past

    • These methods are only as good as the knowledge and experience of the people who populate their respective data structures.

      • don't let safety engineering become a matter of just filling out the tables.

      • keep pressing to find out what else can go wrong, and then plan for it

General Scenario

  • Source of stimulus

    • We differentiate between internal and external origins of faults or failure because the desired system response may be different

    • e.g.

      • people

      • hardware

      • software

      • physical infrastructure

      • physical environment

  • Stimulus

    • A fault of one of the following classes occurs:

      • Omission = A component fails to respond to an input

      • Crash = The component repeatedly suffers omission faults

      • (Incorrect) Timing = A component responds but the response is early or late

      • (Incorrect) Response = A component responds with an incorrect value

  • Artifact

    • specifies the resource that is required to be highly available, such as:

      • processor

      • communication channel

      • process

      • storage

  • Environment

    • The state of the system when the fault or failure occurs may also affect the desired system response.

      • Normal operation

      • startup

      • shutdown

      • repair mode

      • degraded operation

      • overloaded operation

    • For example, if the system has already seen some faults and is operating in other than normal mode, it may be desirable to shut it down totally. However, if this is the first fault observed, some degradation of response time or function may be preferred

  • Response

    • There are a number of possible reactions to a system fault

    • First, the fault must be detected and isolated (correlated) before any other response is possible

      • One exception to this is when the fault is prevented before it occurs.

      • Actions associated with these possibilities include

        • logging the failure

        • notifying selected users or other systems

    • After the fault is detected, the system must recover from it

      • Actions associated with these possibilities include

        • taking actions to limit the damage caused by the fault

        • Disable source of events causing the fault

        • switching to a degraded mode with either less capacity or less function

        • shutting down external systems

        • becoming temporarily unavailable during repair

        • Fix or mask the fault/failure or contain the damage it causes

  • Response measure

    • The response measure can specify

      • an availability percentage

      • it can specify

        • a time to detect the fault

        • time to repair the fault

        • times or time intervals during which the system must be available

        • the duration for which the system must be available.

Tactics

  • Availability tactics are designed to enable a system to endure system faults so that a service being delivered by the system remains compliant with its specification

  • aim

    • keep faults from becoming failures

    • or at least bound the effects of the fault and make repair possible.

  • It is often the case that these tactics will be provided for you by a software infrastructure, such as a middleware package

  • Detect Faults

    • the presence of the fault must be detected or anticipated before taking action on the fault

    • Tactic: Ping/echo

      • an asynchronous request/response message pair exchanged between nodes, used to determine reachability and the round-trip delay through the associated network path.

      • the echo also determines that the pinged component is alive and responding correctly

      • ping is often sent by a system monitor

      • requires a time threshold to be set; this threshold tells the pinging component how long to wait for the echo before considering the pinged component to have failed ("timed out").

      • There are standard implementations for this, e.g., ICMP echo (ping) for IP networks
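
      • A minimal sketch of ping/echo over TCP (the endpoint and threshold are hypothetical); the monitor treats a missing echo within the timeout as a failed component:

        ```python
        import socket
        import time

        def ping(host: str, port: int, timeout_s: float = 1.0) -> float | None:
            """Return the round-trip delay in seconds, or None on timeout/failure."""
            start = time.monotonic()
            try:
                # A TCP connect stands in for the ping; the completed handshake
                # is the echo proving the component is alive and reachable.
                with socket.create_connection((host, port), timeout=timeout_s):
                    return time.monotonic() - start
            except OSError:
                return None  # timed out or unreachable: consider it failed

        rtt = ping("service.internal", 8080)  # hypothetical monitored endpoint
        print("alive, rtt:", rtt) if rtt is not None else print("failed (timed out)")
        ```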

    • Tactic: Monitor

      • is a component that is used to monitor the state of health of various other parts of the system: processors, processes, I/O, memory, and so on

        • monitor key components of a system and detect malfunctions

          • e.g., third-party services, database connections

          • incomplete processes, processes in failed state

      • A system monitor can detect failure or congestion in the network or other shared resources

    • Tactic: Heartbeat

      • a fault detection mechanism that employs a periodic message exchange between a system monitor and a process being monitored

      • For systems where scalability is a concern, transport and processing overhead can be reduced by piggybacking heartbeat messages on to other control messages being exchanged between the process being monitored and the distributed system controller.

      • The big difference between heartbeat and ping/echo is who holds the responsibility for initiating the health check: the monitor or the component itself
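
      • A minimal sketch of a heartbeat (intervals and thresholds are invented); note that, unlike ping/echo, the monitored component initiates each message:

        ```python
        import threading
        import time

        last_beat = time.monotonic()
        lock = threading.Lock()

        def worker(stop: threading.Event) -> None:
            """The monitored component: emits a heartbeat after each unit of work."""
            global last_beat
            while not stop.is_set():
                time.sleep(0.2)                   # stand-in for real work
                with lock:
                    last_beat = time.monotonic()  # the heartbeat

        stop = threading.Event()
        threading.Thread(target=worker, args=(stop,), daemon=True).start()

        # The system monitor: declares a failure if the worker stays silent too long.
        for _ in range(3):
            time.sleep(1.0)
            with lock:
                silence = time.monotonic() - last_beat
            if silence > 1.0:
                print(f"worker failed: no heartbeat for {silence:.1f}s")
                break
            print(f"worker healthy (last beat {silence:.2f}s ago)")
        stop.set()
        ```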

    • Tactic: Time stamp

      • used to detect incorrect sequences of events, primarily in distributed message-passing systems

      • A time stamp of an event can be established by assigning the state of a local clock to the event immediately after the event occurs.

      • Simple sequence numbers can also be used for this purpose, if time information is not important.

    • Tactic: Sanity checking

      • checks the validity or reasonableness of specific operations or outputs of a component.

      • based on

        • a knowledge of the internal design

        • the state of the system

        • the nature of the information under scrutiny

      • most often employed at interfaces, to examine a specific information flow
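
      • A minimal sketch of a sanity check at an interface (the field and its bounds are hypothetical, standing in for knowledge of the design):

        ```python
        def sane_temperature(reading_c: float) -> bool:
            """Is this sensor reading physically plausible? The bounds are assumed
            to come from knowledge of the device's internal design."""
            return -40.0 <= reading_c <= 125.0

        def handle_reading(reading_c: float) -> None:
            if not sane_temperature(reading_c):
                raise ValueError(f"implausible reading {reading_c} degC")
            print("accepted:", reading_c)

        handle_reading(21.5)            # passes the sanity check
        try:
            handle_reading(900.0)       # flags a fault before it propagates
        except ValueError as e:
            print("fault detected:", e)
        ```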

    • Tactic: Condition monitoring

      • checking conditions in a process or device, or validating assumptions made during the design

      • prevents a system from producing faulty behavior

      • e.g., computation of checksums

      • the monitor must itself be simple (and, ideally, provable) to ensure that it does not introduce new software errors.
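
      • A minimal sketch of checksum-based condition monitoring (the record contents are invented), validating the assumption that stored data is still intact:

        ```python
        import zlib

        # Record the checksum while the data is known to be good...
        record = b"altitude=10500;speed=480"
        expected = zlib.crc32(record)

        # ...then periodically validate the condition that it is unchanged.
        def condition_ok(data: bytes) -> bool:
            return zlib.crc32(data) == expected

        print(condition_ok(record))                       # True: condition holds
        print(condition_ok(b"altitude=99999;speed=480"))  # False: corruption detected
        ```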

    • Tactic: Voting

      • classically implemented as triple modular redundancy (TMR)

        • which employs three components that do the same thing; each receives identical inputs and forwards its output to voting logic, which detects any inconsistency among the three output states

      • Faced with an inconsistency, the voter reports a fault.

      • It can let the majority rule, or choose some computed average of the disparate outputs

      • The voting logic itself matters

        • it is usually realized as a simple, rigorously reviewed and tested singleton so that the probability of error is low (see the sketch after this list)

      • Types of voting

        • Replication

          • is the simplest form of voting;

          • the components are exact clones of each other.

          • Having multiple copies of identical components can be effective in protecting against random failures of hardware

            • but this cannot protect against design or implementation errors, in hardware or software, because there is no form of diversity embedded in this tactic

        • Functional redundancy

          • is a form of voting intended to address the issue of common-mode failures (design or implementation faults) in hardware or software components.

          • the components must always give the same output given the same input, but they are diversely designed and diversely implemented.

        • Analytic redundancy

          • permits not only diversity among components' private sides, but also diversity among the components' inputs and outputs

          • intended to tolerate specification errors by using separate requirement specifications

          • The voter mechanism used needs to be more sophisticated than just letting majority rule or computing a simple average

          • it may have to understand which sensors are currently reliable or not, and it may be asked to produce a higher-fidelity value than any individual component can, by blending and smoothing individual values over time
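
      • A minimal sketch of TMR with a majority-rule voter (the replicas are identical functions standing in for replicated components, with one deliberately faulty):

        ```python
        from collections import Counter

        def vote(outputs: list) -> tuple:
            """Majority-rule voter: returns (voted value, fault detected?)."""
            value, count = Counter(outputs).most_common(1)[0]
            # Any disagreement is reported as a fault, even when the majority
            # still yields a usable output.
            return value, count < len(outputs)

        replicas = [lambda x: x * x,
                    lambda x: x * x,
                    lambda x: x * x + 1]   # faulty replica
        outputs = [replica(7) for replica in replicas]

        result, fault = vote(outputs)
        print(f"voted output: {result}, fault detected: {fault}")  # 49, True
        ```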

    • Tactic: Exception detection

      • the detection of a system condition that alters the normal flow of execution

      • Tactic: System exceptions

        • will vary according to the processor hardware architecture employed and include faults such as divide by zero, bus and address faults, illegal program instructions, and so forth.

      • Tactic: Parameter fence

        • incorporates an a priori data pattern (such as 0xDEADBEEF) placed immediately after any variable-length parameters of an object

        • allows for runtime detection of overwriting the memory allocated for the object's variable-length parameters.

      • Tactic: Parameter typing

        • employs a base class that defines functions that add, find, and iterate over type-length-value (TLV) formatted message parameters

        • Derived classes use the base class functions to implement functions that provide parameter typing according to each parameter's structure.

        • Use of strong typing to build and parse messages results in higher availability than implementations that simply treat messages as byte buckets

        • When you employ strong typing, you typically trade higher availability against ease of evolution.

      • Tactic: Timeout

        • raises an exception when a component detects that it or another component has failed to meet its timing constraints

        • e.g., a component awaiting a response from another component can raise an exception if the wait time exceeds a certain value
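
        • A minimal sketch of the timeout tactic using Python's concurrent.futures (the component and the 1-second constraint are hypothetical):

          ```python
          import time
          from concurrent.futures import ThreadPoolExecutor, TimeoutError

          def slow_component() -> str:
              time.sleep(5)    # stand-in for a component missing its deadline
              return "response"

          with ThreadPoolExecutor(max_workers=1) as pool:
              future = pool.submit(slow_component)
              try:
                  # Raises if the wait exceeds the timing constraint.
                  print(future.result(timeout=1.0))
              except TimeoutError:
                  print("timeout: component failed to meet its timing constraint")
          ```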

    • Tactic: Self-test

      • Components (or whole subsystems) can run procedures to test themselves for correct operation.

      • Self-test procedures can be initiated by the component itself, or invoked from time to time by a system monitor.

      • These may involve employing some of the techniques found in condition monitoring, such as checksums

  • Recover from Faults

    • Preparation-and-repair tactics

      • based on a variety of combinations of retrying a computation or introducing redundancy

      • Tactic: Active redundancy (hot spare)

        • a configuration where all of the nodes (active or redundant spare) in a protection group receive and process identical inputs in parallel, allowing the redundant spare(s) to maintain synchronous state with the active node(s).

          • A protection group is a group of processing nodes where one or more nodes are "active," with the remaining nodes in the protection group serving as redundant spares.

        • Because the redundant spare possesses state identical to the active processor's, it can take over from a failed component in a matter of milliseconds

        • aka "one plus one" redundancy

      • Tactic: Passive redundancy (warm spare)

        • refers to a configuration where only the active members of the protection group process input traffic

        • one of their duties is to provide the redundant spare(s) with periodic state updates

        • the state maintained by the redundant spares is only loosely coupled with that of the active node(s) in the protection group (with the looseness of the coupling being a function of the checkpointing mechanism employed between active and redundant nodes)

          • the redundant nodes are referred to as warm spares

        • passive redundancy provides a solution that achieves a balance between the more highly available but more compute-intensive (and expensive) active redundancy tactic and the less available but significantly less complex cold spare tactic (which is also significantly cheaper)

      • Tactic: Spare (cold spare)

        • Cold sparing refers to a configuration where the redundant spares of a protection group remain out of service until a fail-over occurs, at which point a power-on-reset procedure is initiated on the redundant spare prior to its being placed in service.

        • Due to its poor recovery performance, cold sparing is better suited for systems having only high-reliability (MTBF) requirements as opposed to those also having high-availability requirements.

      • Tactic: Exception handling

        • Once an exception has been detected, the system must handle it in some fashion

        • The easiest thing it can do is simply to crash, but that's a terrible idea from the point of view of availability

        • Solutions

          • simple function return codes (error codes)

          • use of exception classes that contain information helpful in fault correlation

            • such as the name of the exception thrown, the origin of the exception, and the cause of the exception thrown

            • logging

        • Software/people can then use this information to mask the fault, usually by correcting the cause of the exception and retrying the operation
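
        • A minimal sketch of an exception class carrying fault-correlation information plus logging (the service names are invented):

          ```python
          import logging

          logging.basicConfig(level=logging.INFO)

          class ServiceFault(Exception):
              """Carries the information fault-correlation logic needs."""
              def __init__(self, origin: str, cause: str, severity: str = "minor"):
                  super().__init__(f"{origin}: {cause}")
                  self.origin, self.cause, self.severity = origin, cause, severity

          def call_billing_service() -> None:
              raise ServiceFault("billing-service", "connection refused", "major")

          try:
              call_billing_service()
          except ServiceFault as fault:
              # Log the failure; correlation logic can now decide how to mask/repair.
              logging.error("fault origin=%s cause=%s severity=%s",
                            fault.origin, fault.cause, fault.severity)
          ```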

      • Tactic: Rollback

        • permits the system to revert to a previous known good state (rollback line) upon the detection of a failure

        • Once the good state is reached, then execution can continue.

          • or retry

        • combined with active or passive redundancy tactics so that after a rollback has occurred, a standby version of the failed component is promoted to active status

        • depends on a copy of a previous good state (a checkpoint) being available to the components that are rolling back

          • Checkpoints can be stored in a fixed location and updated

            • at regular intervals,

            • at convenient or significant times in the processing

              • such as at the completion of a complex operation
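
        • A minimal sketch of checkpoint-and-rollback (the state and the simulated fault are invented):

          ```python
          import copy

          state = {"balance": 100, "pending": []}
          checkpoint = None

          def take_checkpoint() -> None:
              global checkpoint
              checkpoint = copy.deepcopy(state)  # known good state: the rollback line

          def rollback() -> None:
              global state
              state = copy.deepcopy(checkpoint)  # revert to the last known good state

          take_checkpoint()
          try:
              state["balance"] -= 30
              state["pending"].append("tx-42")
              raise RuntimeError("fault mid-operation")  # simulated failure
          except RuntimeError:
              rollback()     # after reverting, execution (or a retry) can continue
          print(state)       # {'balance': 100, 'pending': []}
          ```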

      • Tactic: Software upgrade

        • to achieve in-service upgrades to executable code images in a non-service-affecting manner

        • realized as

          • a function patch

            • used in procedural programming and employs an incremental linker/loader to store an updated software function into a pre-allocated segment of target memory

            • The new version of the software function will employ the entry and exit points of the deprecated function.

            • upon loading the new software function, the symbol table must be updated and the instruction cache invalidated.

            • deliver bug fixes

          • a class patch

            • for targets executing object-oriented code, where the class definitions include a back-door mechanism that enables the runtime addition of member data and functions.

            • deliver bug fixes

          • a hitless in-service software upgrade (ISSU)

            • leverages the active redundancy or passive redundancy tactics to achieve non-service-affecting upgrades to software and associated schema

            • deliver new features and capabilities

      • Tactic: Retry

        • assumes that the fault that caused a failure is transient and retrying the operation may lead to success.

        • used in places where failures are common and expected

        • There should be a limit on the number of retries that are attempted before a permanent failure is declared
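
        • A minimal sketch of retry with a bounded attempt count (the flaky operation and failure rate are invented):

          ```python
          import random
          import time

          def flaky_call() -> str:
              # Stand-in for an operation whose faults are transient.
              if random.random() < 0.7:
                  raise ConnectionError("transient network fault")
              return "ok"

          def with_retry(max_attempts: int = 3, backoff_s: float = 0.2) -> str:
              for attempt in range(1, max_attempts + 1):
                  try:
                      return flaky_call()
                  except ConnectionError:
                      if attempt == max_attempts:
                          raise                        # limit reached: permanent failure
                      time.sleep(backoff_s * attempt)  # brief pause before retrying

          try:
              print(with_retry())
          except ConnectionError:
              print("permanent failure declared after retries")
          ```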

      • Tactic: Ignore faulty behavior

        • ignoring messages sent from a particular source when we determine that those messages are spurious

        • e.g., we would like to ignore the messages of an external component launching a denial-of-service attack by establishing Access Control List filters

      • Tactic: Degradation

        • maintains the most critical system functions in the presence of component failures, dropping less critical functions

        • done in circumstances where individual component failures gracefully reduce system functionality rather than causing a complete system failure.

      • Tactic: Reconfiguration

        • attempts to recover from component failures by reassigning responsibilities to the (potentially restricted) resources left functioning, while maintaining as much functionality as possible

        • e.g., via load balancing

    • Reintroduction tactics

      • concerned with reintroducing a failed (but rehabilitated) component back into normal operation.

      • Tactic: Shadow

        • refers to operating a previously failed or in-service upgraded component in a "shadow mode" for a predefined duration of time prior to reverting the component back to an active role

        • During this period, its behavior can be monitored for correctness and it can repopulate its state incrementally

      • Tactic: State resynchronization

        • When used alongside the active redundancy tactic,

          • the state resynchronization occurs organically, because the active and standby components each receive and process identical inputs in parallel

          • the states of the active and standby components are periodically compared to ensure synchronization

          • Comparison based on

            • a cyclic redundancy check calculation (checksum)

            • for systems providing safety critical services, a message digest calculation (a one-way hash function)

        • When used alongside the passive redundancy (warm spare) tactic

          • state resynchronization is based solely on periodic state information transmitted from the active component(s) to the standby component(s), typically via checkpointing

        • A special case of this tactic is found in stateless services, whereby any resource can handle a request from another (failed) resource.

          • e.g., for services that are replicas

          • issues arise with shared state, e.g., in a datastore

      • Tactic: Escalating restart

        • allows the system to recover from faults by varying the granularity of the component(s) restarted and minimizing the level of service affected

        • useful for graceful degradation,

          • where a system is able to degrade the services it provides while maintaining support for mission-critical or safety-critical applications.

      • Tactic: Nonstop forwarding (NSF)

        • is a concept that originated in router design.

          • In this design functionality is split into two parts:

            • supervisory, or control plane (which manages connectivity and routing information)

            • data plane (which does the actual work of routing packets from sender to receiver)

          • If a router experiences the failure of an active supervisor, it can continue forwarding packets along known routes with neighboring routers while the routing protocol information is recovered and validated.

          • When the control plane is restarted, it implements what is sometimes called "graceful restart," incrementally rebuilding its routing protocol database even as the data plane continues to operate

  • Prevent Faults

      • deal with runtime means to prevent faults from occurring

      • The best way to do this is before the system is running, by having high-quality code

        • code reviews, pair programming, solid requirements reviews, testing coverage

      • Tactic: Removal from service / software rejuvenation

        • temporarily placing a system component in an out-of-service state for the purpose of mitigating potential system failures

        • ie taking a component of a system out of service and resetting the component in order to scrub latent faults (such as memory leaks, fragmentation, or soft errors in an unprotected cache) before the accumulation of faults affects service (resulting in system failure)

      • Tactic: Transaction

        • ensure that asynchronous messages exchanged between distributed components are atomic, consistent, isolated, and durable.

        • e.g., via the "two-phase commit" protocol (see the sketch below)

          • prevents race conditions caused by two processes attempting to update the same data item.
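
        • A toy sketch of two-phase commit (participant names invented; real 2PC adds a durable coordinator log and crash recovery, omitted here):

          ```python
          class Participant:
              """A toy resource manager taking part in two-phase commit."""
              def __init__(self, name: str, healthy: bool = True):
                  self.name, self.healthy = name, healthy

              def prepare(self) -> bool:
                  return self.healthy   # phase 1: vote on whether it can commit

              def commit(self) -> None:
                  print(f"{self.name}: committed")

              def abort(self) -> None:
                  print(f"{self.name}: aborted")

          def two_phase_commit(participants: list) -> bool:
              if all(p.prepare() for p in participants):  # phase 1: unanimous yes?
                  for p in participants:                  # phase 2: commit everywhere
                      p.commit()
                  return True
              for p in participants:                      # phase 2: abort everywhere
                  p.abort()
              return False

          ok = two_phase_commit([Participant("orders-db"), Participant("inventory-db")])
          print("transaction committed:", ok)
          ```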

      • Tactic: Predictive model

        • monitor the state of health of a system process to ensure that the system is operating within its nominal operating parameters, and to take corrective action when conditions are detected that are predictive of likely future faults

      • Tactic: Exception prevention

        • techniques employed for the purpose of preventing system exceptions from occurring

      • Tactic: Increase competence set

        • A program's competence set is the set of states in which it is "competent" to operate

        • When a component raises an exception, it is signaling that it has discovered itself to be outside its competence set;

          • it doesn't know what to do and is throwing in the towel

        • Increasing a component's competence set means designing it to handle more cases (faults) as part of its normal operation

Checklist for Availability

  • Allocation of Responsibilities

    • Determine the system responsibilities that need to be highly available.

    • Within those responsibilities, ensure that additional responsibilities have been allocated to detect any of the following:

      • omission

      • crash

      • incorrect timing

      • incorrect response.

    • ensure that there are responsibilities to do the following:

      • Log the fault

      • Notify appropriate entities (people or systems)

      • Disable the source of events causing the fault

      • Be temporarily unavailable

      • Fix or mask the fault/failure

      • Operate in a degraded mode

  • Coordination Model

    • Determine the system responsibilities that need to be highly available.

    • With respect to those responsibilities, do the following:

      • Ensure that coordination mechanisms can detect an omission, crash, incorrect timing, or incorrect response.

        • Consider, for example, whether guaranteed delivery is necessary.

        • Will the coordination work under conditions of degraded communication?

      • Ensure that coordination mechanisms enable

        • the logging of the fault

        • notification of appropriate entities

        • disabling of the source of the events causing the fault

        • fixing or masking the fault

        • operating in a degraded mode.

      • Ensure that the coordination model supports the replacement of the artifacts used (processors, communications channels, persistent storage, and processes).

        • For example, does replacement of a server allow the system to continue to operate?

      • Determine if the coordination will work under the following conditions:

        • degraded communication

        • startup/shutdown

        • repair mode

        • overloaded operation

          • For example, how much lost information can the coordination model withstand and with what consequences?

  • Data Model

    • Determine which portions of the system need to be highly available.

      • Within those portions, determine which data abstractions, along with their operations or their properties, could cause

        • a fault of omission

        • a crash

        • incorrect timing behavior

        • an incorrect response

    • For those data abstractions, operations, and properties, ensure that they can

      • be disabled

      • be temporarily unavailable

      • be fixed or masked in the event of a fault

    • For example, ensure that write requests are cached if a server is temporarily unavailable and performed when the server is returned to service.

  • Mapping among Architectural Elements

    • Determine which artifacts (processors, communication channels, persistent storage, or processes) may produce a fault: omission, crash, incorrect timing, or incorrect response.

    • Ensure that the mapping (or remapping) of architectural elements is flexible enough to permit the recovery from the fault.

      • This may involve a consideration of the following:

        • Which processes on failed processors need to be reassigned at runtime

        • Which processors, data stores, or communication channels can be activated or reassigned at runtime

        • How data on failed processors or storage can be served by replacement units

        • How quickly the system can be reinstalled based on the units of delivery provided

        • How to (re)assign runtime elements to processors, communication channels, and data stores

        • When employing tactics that depend on redundancy of functionality, the mapping from modules to redundant components is important.

          • For example, it is possible to write one module that contains code appropriate for both the active component and backup components in a protection group.

  • Resource Management

    • Determine what critical resources are necessary to continue operating in the presence of a fault (omission, crash, incorrect timing, or incorrect response).

      • Ensure there are sufficient remaining resources in the event of a fault to log the fault; notify appropriate entities (people or systems); disable the source of events causing the fault; be temporarily unavailable; fix or mask the fault/failure; and operate normally as well as in startup, shutdown, repair mode, degraded operation, and overloaded operation

    • Determine the availability time for critical resources, what critical resources must be available during specified time intervals, time intervals during which the critical resources may be in a degraded mode, and repair time for critical resources.

      • Ensure that the critical resources are available during these time intervals

    • For example, ensure that input queues are large enough to buffer anticipated messages if a server fails so that the messages are not permanently lost

  • Binding Time

    • Determine how and when architectural elements are bound.

      • If late binding is used to alternate between components that can themselves be sources of faults (e.g., processes, processors, communication channels), ensure the chosen availability strategy is sufficient to cover faults introduced by all sources. For example:

        • If late binding is used to switch between artifacts such as processors that will receive or be the subject of faults, will the chosen fault detection and recovery mechanisms work for all possible bindings?

        • If late binding is used to change the definition or tolerance of what constitutes a fault (e.g., how long a process can go without responding before a fault is assumed), is the recovery strategy chosen sufficient to handle all cases?

          • For example, if a fault is flagged after 0.1 milliseconds, but the recovery mechanism takes 1.5 seconds to work, that might be an unacceptable mismatch.

        • What are the availability characteristics of the late binding mechanism itself? Can it fail?

  • Choice of Technology

    • Determine the available technologies that can (help) detect faults, recover from faults, or reintroduce failed components.

    • Determine what technologies are available that help the response to a fault (e.g., event loggers).

    • Determine the availability characteristics of chosen technologies themselves:

      • What faults can they recover from?

      • What faults might they introduce into the system?
