Infrastructure Availability
Everyone expects their infrastructure to be available all the time.
A 100% guaranteed availability of an infrastructure, however, is impossible
there is always a chance of downtime
Calculating availability
In general, availability can neither be calculated, nor guaranteed upfront.
It can only be reported on afterwards, when a system has run for some years.
We use past experience and design patterns to design highly available systems
e.g. failover, redundancy, structured programming, avoiding Single Points of Failure (SPOFs), and implementing
sound systems management
Availability percentages and intervals
The availability of a system is usually expressed as a percentage of uptime in a given time period
e.g. per month or per year
e.g. for 99.8% uptime, we expect at most 17.5 hours of downtime per year, 86.2 min per month and 20.2 min per week
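These downtime budgets follow directly from the percentage and the length of the period; a minimal Python sketch (assuming a 365-day year, a 30-day month and a 7-day week, which is why the rounded figures differ slightly from the values quoted above):

```python
# Allowed downtime per period for a given availability percentage
# (assuming a 365-day year, a 30-day month, and a 7-day week).
MIN_PER_YEAR = 365 * 24 * 60      # 525,600 minutes
MIN_PER_MONTH = 30 * 24 * 60      # 43,200 minutes
MIN_PER_WEEK = 7 * 24 * 60        # 10,080 minutes

def allowed_downtime(availability_pct: float, period_minutes: int) -> float:
    """Downtime budget in minutes for one period at the given availability."""
    return (1 - availability_pct / 100) * period_minutes

for pct in (99.8, 99.9, 99.99, 99.999):
    print(f"{pct}%: "
          f"{allowed_downtime(pct, MIN_PER_YEAR) / 60:6.2f} h/year  "
          f"{allowed_downtime(pct, MIN_PER_MONTH):6.1f} min/month  "
          f"{allowed_downtime(pct, MIN_PER_WEEK):5.1f} min/week")
```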
Typical requirements used in service level agreements today are 99.8% or 99.9%
availability per month for a full IT system.
To meet this requirement, the availability of the underlying infrastructure must be much higher, typically in the range of 99.99% or higher
99.999% uptime is also known as carrier grade availability;
the term originates from telecommunication system components, not full systems
Higher availability levels for a complete system are very uncommon, as they are almost impossible to reach.
e.g. the average electricity supply downtime in the UK is 75 minutes per year, which corresponds to 99.9857% availability
Downtime does not have to occur as one single event
it can be spread over multiple outages of different durations, e.g. 0 to x1 minutes, x1 to x2 minutes, ... xn-1 to xn minutes
it is good practice to also agree on the maximum frequency of unavailability per duration range
e.g. outages of 0-5 minutes: at most 30 events per year; outages longer than 30 minutes: at most 1 event per year
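A sketch of how such an agreement could be checked against an outage log; the duration bands, limits and outage durations below are made-up example values, not from any standard:

```python
# Check a year's worth of outage durations (in minutes) against agreed
# maximum frequencies per duration band (all values are hypothetical).
outages_min = [2, 4, 1, 12, 3, 45, 2, 7]          # hypothetical outage log

bands = [
    ("0-5 min",  lambda d: d <= 5,       30),     # at most 30 events/year
    ("5-30 min", lambda d: 5 < d <= 30,   5),     # hypothetical limit
    (">30 min",  lambda d: d > 30,        1),     # at most 1 event/year
]

for name, in_band, max_events in bands:
    count = sum(1 for d in outages_min if in_band(d))
    status = "OK" if count <= max_events else "VIOLATED"
    print(f"{name:>9}: {count} events (limit {max_events}/year) -> {status}")
```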
MTBF and MTTR
Two factors are involved in calculating availability
Mean Time Between Failures (MTBF)
which is the average time that passes between failures
expressed in hours
how many hours will the component or service work without failure
Testing a single component long enough to find its MTBF is impossible; the numbers are simply too large
instead, manufacturers run tests on large batches of components
e.g. for hard disks, 1000 disks could be tested for 3 months; if five disks fail in that period
then the MTBF can be calculated from the total number of device-hours (see the sketch below)
MTBF only says something about the chance of failure in the first months of use.
it is an extrapolated value for the probable lifetime of a disk, not a measured one
it is often better to specify the annual failure rate (AFR) instead
e.g. 2% of all disks will fail in the first year
a table of failure rates for each year of use gives a more realistic picture
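The MTBF and the annual failure rate can both be estimated from the batch test mentioned above; a minimal sketch using the standard total-device-hours approach (the 3-month and 5-failure figures come from the example above, 730 hours per month is an assumption):

```python
# MTBF and annual failure rate (AFR) estimated from a batch test:
# 1000 disks tested for 3 months, 5 failures observed.
disks = 1000
test_hours = 3 * 730          # ~3 months of continuous operation (2190 h)
failures = 5

total_device_hours = disks * test_hours           # 2,190,000 device-hours
mtbf_hours = total_device_hours / failures        # 438,000 hours MTBF
print(f"MTBF ~ {mtbf_hours:,.0f} hours")

# The same data expressed as an annual failure rate is easier to interpret:
device_years = total_device_hours / 8760          # 250 device-years
afr = failures / device_years                     # 0.02 -> 2% per year
print(f"AFR  ~ {afr:.1%} of disks failing per year")
```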
Mean Time To Repair (MTTR)
which is the time it takes to recover from a failure
When a component breaks, it needs to be repaired.
Usually the MTTR is kept low by having a service contract with the supplier of the component
Sometimes spare parts are kept onsite to lower the MTTR
Typically, a faulty component is not repaired immediately
a sequence of steps typically takes place before the actual repair, e.g.:
Notification of the fault (time before seeing an alarm message)
Processing the alarm
Finding the root cause of the error
Looking up repair information
Getting spare components from storage
Having a technician come to the datacenter with the spare component
Physically repairing the fault
Restarting and testing the component
The best way to keep the MTTR low is to introduce automated redundancy and failover
MTBF and MTTR are statistically calculated values
Decreasing MTTR and increasing MTBF both increase availability.
Dividing MTBF by the sum of MTBF and MTTR gives the availability: Availability = MTBF / (MTBF + MTTR)
To reach five nines of availability, the repair time must be as low as 90 minutes for a component with an MTBF of 150,000 hours
measured over a full year, five nines allows only about 5.3 minutes of total downtime, so repairs must be completed within that time
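These numbers can be verified with the formula above; a minimal Python sketch:

```python
# Availability = MTBF / (MTBF + MTTR), both expressed in the same unit (hours).
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example from the text: MTBF of 150,000 hours and a repair time of 90 minutes.
a = availability(150_000, 1.5)
print(f"availability = {a:.5%}")                  # ~99.999% (five nines)

# Five nines over a year corresponds to roughly 5.3 minutes of downtime.
minutes_per_year = 365 * 24 * 60
print(f"allowed downtime/year = {(1 - 0.99999) * minutes_per_year:.1f} min")
```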
As system complexity increases, usually availability decreases.
Serial availability
A system is serial when a failure of any one part causes a failure of the system as a whole
to calculate the availability of such a complex system or device, multiply the availabilities of all its parts
(converting the percentages to decimal fractions first)
This is lower than the availability of any single component in the system
the combined availability can never be higher than that of the least available component, and it only equals that component's availability if every other component is 100% available
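A minimal sketch of the serial calculation; the three component availabilities are made-up values:

```python
# Serial (chained) availability: the system works only if every component works,
# so the availabilities (as fractions) are multiplied.
from math import prod

component_availability = [0.999, 0.9995, 0.9999]   # hypothetical components

serial = prod(component_availability)
print(f"serial availability = {serial:.4%}")        # ~99.84%, below every component
```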
To increase the availability, systems (each composed of various components) can be deployed in parallel.
The combined system no longer contains a Single Point Of Failure
If one component goes down in one system, the other system can take over until the first system's component is
fixed and brought back up
In this situation, it is important to have no single point of failure that is shared by the set of systems
for instance, all systems running on the same power supply
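The improvement can be quantified by noting that the parallel set is only down when all systems are down at the same time; a minimal sketch, assuming independent failures and no shared SPOF (the 99% figures are made up):

```python
# Parallel availability: the combined setup fails only when all systems fail,
# so the unavailabilities are multiplied (assuming independent failures and
# no shared SPOF such as a common power supply).
def parallel_availability(availabilities):
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1 - a)
    return 1 - unavailability

print(f"{parallel_availability([0.99, 0.99]):.4%}")   # two 99% systems -> 99.99%
```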
Human Errors and Availability
Usually only 20% of the failures leading to unavailability are technology failures
The rest are people and process issues
about 50% of these are caused by change, configuration, release integration, and hand-off issues
Need to have highly qualified and trained staff, with a healthy sense of responsibility.
Errors are human, they will always happen
Examples
End users can introduce downtime by misuse of the system
When a user for instance starts the generation of ten very large reports at the same time, the performance of
the system could suffer to such a degree that the system becomes unavailable to other users
When a user forgets a password, she is locked out and the system is unavailable for that user; being locked out
could mean that a business process is unavailable to other users as well
Most unavailability issues, however, are the result of actions from systems managers.
examples
Performing a test in the production environment
Switching off the wrong component - not the defective server that needs repair, but the one still operating
Swapping out the wrong component instead of the faulty one
Restoring the wrong backup tape to production
Accidentally removing files (mail folders, configuration files) or database entries
Making incorrect changes to configurations
Incorrect labeling of cables, later leading to errors when changes are made to the cabling.
Performing maintenance on an incorrect virtual machine
Making a typo in a system command environment
Insufficient testing, for instance, the fallback procedure to move operations from the primary datacenter to
the secondary was never tested, and failed when it was really needed
Many of these mistakes can be avoided by using proper systems management procedures, for instance:
having a standard template for creating new servers
using formal deployment strategies with the appropriate tools
using administrative accounts only when absolutely needed
Giving warning messages to root users, to keep them aware of the risk of their actions
Hackers can also create downtime, for instance by executing a Denial of Service (DoS) attack
Bugs
software bugs are the number two reason for unavailability
because of the complexity of most software, it is nearly impossible (and very costly) to create bug-free software
Bugs in systems or drivers can
stop an entire system
create downtime
operating systems contain bugs that can lead to corrupted file systems, network failures, or other sources of
unavailability
Bugs can be:
accidental - something breaks in production and is later fixed in the software so it does not happen again
accepted - a known bug is dealt with manually because that is cheaper than fixing it
deliberate - introduced on purpose by a dissatisfied worker, a spy, or a hacker; such a bug is only spotted and fixed once it surfaces
Fixing bugs after the fact can be costly, so prevention is better, e.g. code reviews and restricting access to the code base to specific members
Planned Maintenance
Planned maintenance is sometimes needed to perform
systems management tasks like upgrading hardware or software,
implementing software changes
migrating data
creating backups
planned maintenance should only be performed on parts of the infrastructure while other parts keep serving clients.
to maintain high availability
downtime of a single component does not lead to downtime of the entire system
as long as the affected component is not a single point of failure
this allows, for example, upgrading the operating system while the system as a whole stays up
During planned maintenance, however, the system is more vulnerable to downtime than under normal circumstances
When the systems manager makes a mistake during planned maintenance, the risk of downtime is higher than normal
and can lead to the creation of a SPOF
Example
the upgrade of systems in a highly available cluster: when one component is upgraded and the other is not yet, the cluster may temporarily not be highly available at all; during that period the system is vulnerable to downtime
Physical defects
everything breaks down eventually, but mechanical parts are most likely to break first.
Apart from mechanical failures because of normal usage, parts also break because of external factors like ambient temperature, moisture, vibrations, and aging.
In most cases the availability of a component follows a so-called bathtub curve.
A component failure is most likely when the component is new. In the first month of use the chance of a component's failure is relatively high. Sometimes a component doesn't even work at all when unpacked for the first time.
This is called a DOA component – Dead On Arrival
When a component still works after the first month, it is likely that it will continue working without failure until the end of its technical life cycle
the chance of failure rises again sharply at the end of the life cycle of a component
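For illustration only, a bathtub-shaped failure rate can be modeled as early infant mortality plus a constant random-failure floor plus wear-out near the end of life; the constants below are arbitrary and purely illustrative:

```python
# Illustrative (made-up) bathtub curve: failure rate over a component's life,
# modeled as early "infant mortality" + a constant random-failure floor +
# wear-out near the end of the technical life cycle.
import math

def failure_rate(age_months: float, life_months: float = 60) -> float:
    infant  = 0.05 * math.exp(-age_months)                 # high when new, fades fast
    random  = 0.002                                        # flat middle of the curve
    wearout = 0.05 * math.exp(age_months - life_months)    # rises near end of life
    return infant + random + wearout

for m in (0, 1, 6, 24, 54, 60):
    print(f"month {m:2d}: relative failure rate {failure_rate(m):.4f}")
```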
Environmental issues
Issues with power and cooling, and external factors like fire, earthquakes and flooding can cause entire datacenters to fail.
Effects of power: power being cut (for long or short periods), voltage drops, or voltage spikes
failure of air conditioning causes the temperature to rise, which can cause parts to break or slow down
Complexity of the infrastructure
Complex systems inherently have more potential points of failure and are more difficult to implement correctly.
a complex system is harder to manage; more knowledge is needed to maintain the system and errors are made more easily.
Sometimes it is better to just have an extra spare system in the closet than to use complex redundant systems
Availability patterns
A single point of failure (SPOF) is a component in the infrastructure that, if it fails, causes downtime to the entire system.
SPOFs should be avoided; where possible they should be eliminated
The trick is to find SPOFs that are not that obvious
eliminating every SPOF, however, is not always feasible or cost effective
it is good to realize that there is always something shared in an infrastructure
we just need to know what is shared and whether the risk of sharing it is acceptable
To eliminate SPOFs, a combination of redundancy, failover, and fallback can be used
Redundancy
Redundancy is the duplication of critical components in a single system, to avoid a SPOF.
usually implemented in power supplies, network interfaces, and SAN HBAs (Host Bus Adapters) for connecting storage.
Failovers
Failover is the (semi)automatic switch-over to a standby system (component), either in the same or in another datacenter, upon the failure or abnormal termination of the previously active system (component)
Fallback
Fallback is the manual switchover to an identical standby computer system in a different location, typically used for disaster recovery.
three basic forms of fallback solutions
Hot site
is a fully configured fallback datacenter, fully equipped with power and cooling. The applications are installed on the servers, and data is kept up to date to fully mirror the production system.
Staff and operators should be able to walk in and begin full operations in a very short time (typically one or two hours).
requires constant maintenance of the hardware, software, data, and applications to be sure the site accurately mirrors the state of the production site at all times.
Warm site
A warm site could best be described as a mix between a hot site and cold site.
is a computer facility readily available with power, cooling, and computers, but the applications may not be installed or configured
external communication links and other data elements, that commonly take a long time to order and install, will be present
To start working in a warm site, applications and all their data will need to be restored from backup media and tested. This typically takes a day
needs less attention when not in use and is much cheaper than a hot site
Cold site
it is ready for equipment to be brought in during an emergency, but no computer hardware is available at the site.
is a room with power and cooling facilities, but computers must be brought on-site if needed, and communications links may not be ready. Applications will need to be installed and current data fully restored from backups.
if an organization has very little budget for a fallback site, a cold site may be better than nothing
Business Continuity
the availability of the IT infrastructure can never be guaranteed in all situations
Business continuity is about identifying threats an organization faces and providing an effective response
Business Continuity Management (BCM) and Disaster Recovery Planning (DRP) are processes to handle the effect of disasters
Business Continuity Management
BCM is about managing business processes and the availability of people and workplaces in disaster situations
it covers disaster recovery, business recovery, crisis management, incident management, emergency management, product recall, and contingency planning
Business Continuity Plan (BCP) describes the measures to be taken when a critical incident occurs in order to continue running critical operations, and to halt non-critical processes
guidelines like BS 25999 can be used
Disaster Recovery Planning
a set of measures to take in case of a disaster, when (parts of) the IT infrastructure must be accommodated in an alternative location.
DRP assesses the risk of failing IT systems and provides solutions
IT disaster is defined as an irreparable problem in a datacenter, making the datacenter unusable
Disasters fall into two categories. The first category is natural disasters such as floods, hurricanes, tornadoes or earthquakes.
The second category is manmade disasters, including hazardous material spills, infrastructure failure, or bio-terrorism
the standard BS 25777 can be used to implement DRP
A typical DRP solution is the use of fallback facilities and having a Computer Emergency Response Team (CERT) in place.
A CERT is usually a team of systems managers and senior management that decides how to handle a certain crisis once it happens
One of the first concerns during a disaster is the safety of people.
But after that, procedures must be followed to restore IT operations as soon as possible
RTO - Recovery Time Objective
The maximum duration of time within which a business process must be restored after a disaster, in order to avoid unacceptable consequences (like bankruptcy).
the RTO is only valid in case of a disaster; it is not the acceptable downtime under normal circumstances
failover and fallback facilities are used to meet the RTO
RPO - Recovery Point Objective
The point in time to which data must be recovered considering some "acceptable loss" in a disaster situation.
the amount of data loss a business is willing to accept in case of a disaster, measured in time.
Different backup regimes directly affect the achievable RPO; e.g. with nightly backups, up to 24 hours of data can be lost
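A sketch of how RPO and RTO relate to actual timestamps during a recovery; the objectives and timestamps below are hypothetical:

```python
# RPO: how much data (measured in time) may be lost; RTO: how long the
# business process may be down. All objectives and timestamps are hypothetical.
from datetime import datetime, timedelta

rpo_objective = timedelta(hours=24)    # e.g. nightly backups
rto_objective = timedelta(hours=4)

last_backup  = datetime(2024, 1, 10, 2, 0)    # last usable copy of the data
disaster     = datetime(2024, 1, 10, 14, 30)  # moment the datacenter fails
service_back = datetime(2024, 1, 10, 17, 45)  # fallback site fully operational

data_loss = disaster - last_backup             # 12h30 of data lost
downtime  = service_back - disaster            # 3h15 of downtime

print(f"data loss {data_loss} (RPO {rpo_objective}): "
      f"{'met' if data_loss <= rpo_objective else 'missed'}")
print(f"downtime  {downtime} (RTO {rto_objective}): "
      f"{'met' if downtime <= rto_objective else 'missed'}")
```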