Fault Tolerance
An airplane has 4 engines, if one fails, can still use the other 3.
A fault tolerant system
need to anticipate the faults
manage the failures
prevent the failures
monitor the system for the faults and failures, for warning signs and when they occur
How to fix it
How to fail gracefully
https://www.future-processing.pl/blog/how-to-build-a-fault-tolerant-system/
https://dzone.com/articles/fault-tolerance-is-not-high-availability
https://dzone.com/articles/make-services-fault-tolerant
Fault vs failure
Fault is a cause of failure
Failure is the effect
https://youtu.be/7vIzGmxdUvI
Fault prevention
Preventing faults from occuring
Need some form of monitoring of what could go wrong
Alerting when when going into danger zone, and in danger zone
Thus can be fixed (manual or automated)
Backups, to replace problem issue
removal and replacement of problem issue
Fault handling
Includes fault tolerance
How to handle the failure, in a better way then exiting or breaking down
Maybe give some between result, but not the expected result
Out of date, default
time it takes to produce result is longer (use of queue)
replicas to share load
Fault fixing
Use of metrics, logs
Find issues
Short term fix, manual intervention - data fix or hit an endpoint
Long term fix
Code fix
Triage and investigation, so this can be prevented, fix can be automated (or done faster when spotted), fixed permemently
Types of faults
-Categories
Transient
occurs for a very small duration
hard to locate
permanent
continues until fixed
easier to identify
Human/code fault
Fix:
before release - testing, requirments met, requirements are correct from business
Not all faults can be caught
AFter release - nice failure message to user, no crashing, good logs with more details about that failure for techops or devs to help resolve issue, if known issue can deal with appropriate response, if unknonw need to deal with through investigation and quick fix but need to triage to decide on automated or manual process for this issue
Hardware fault
Fix: Replication or backups
Network fault
Types of failures
Effects could be
requests are return 500 responses
System acts in erratic state
Server shuts down
Processing of request is slow, even stops, as throughput increases
Wrong response
Use of wrong data
One component stops working, cascading to whole system failure
Last updated