

Infrastructure Availability

  • Everyone expects their infrastructure to be available all the time.

    • A 100% guaranteed availability of an infrastructure, however, is impossible

    • there is always a chance of downtime

Calculating availability

  • In general, availability can neither be calculated, nor guaranteed upfront.

    • It can only be reported on afterwards, when a system has run for some years.

    • We use past experience and design patterns to design highly available systems

      • e.g. failover, redundancy, structured programming, avoiding Single Points of Failure (SPOFs), and implementing sound systems management

Availability percentages and intervals

  • The availability of a system is usually expressed as a percentage of uptime in a given time period

    • e.g. per month or per year

    • e.g. for 99.8% uptime, we expect downtime of 17.5 hours per year, 86.2 min per month and 20.2 min per week

  • Typical requirements used in service level agreements today are 99.8% or 99.9% availability per month for a full IT system.

    • To meet this requirement, the availability of the underlying infrastructure must be much higher, typically in the range of 99.99% or higher

  • 99.999% uptime is also known as carrier grade availability

    • the term originates from telecommunication system components, not full systems

  • Higher availability levels for a complete system are very uncommon, as they are almost impossible to reach.

    • e.g. the average electricity supply downtime in the UK is 75 min per year, which makes it 99.9857% available

  • The total downtime does not have to occur in a single event

    • it can be spread over several outages, which can be grouped by duration into ranges, e.g. 0 to x1, x1 to x2, ..., xn-1 to xn

    • it is good practice to agree on the maximum frequency of unavailability for each range

      • e.g. at most 30 outages of 0-5 min per year, and at most 1 outage of more than 30 min per year
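
The relation between an availability percentage and the downtime it allows is simple arithmetic. The sketch below shows it; the function name `downtime_hours` and the 365-day year are assumptions made for illustration, not part of the original notes.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime (in hours) allowed by an availability percentage."""
    return period_hours * (1 - availability_pct / 100)


# Print the yearly and weekly downtime budget for some common targets.
for pct in (99.8, 99.9, 99.99, 99.999):
    per_year_h = downtime_hours(pct)
    per_week_min = downtime_hours(pct, period_hours=7 * 24) * 60
    print(f"{pct}% -> {per_year_h:.2f} h/year, {per_week_min:.1f} min/week")
```

The result is only a total budget; it says nothing about how many separate outages that total is split over.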

MTBF and MTTR

  • Two factors are involved in calculating availability

    • Mean Time Between Failures (MTBF)

      • which is the average time that passes between failures

      • expressed in hours

        • how many hours will the component or service work without failure

      • The value cannot be found by testing a single component until it fails, because the numbers are too large

        • manufacturers instead run tests on large batches of components

          • e.g. for hard disks, 1000 disks could be tested for 3 months; if five disks fail in that period, the MTBF can be calculated from the total number of disk-hours per failure

      • MTBF only says something about the chance of failure in the first months of use.

        • an extrapolated value for the probable downtime of a disk

      • It is better to specify the annual failure rate instead

        • e.g. 2% of all disks will fail in the first year

        • a table can then list the failure rate for each following year

    • Mean Time To Repair (MTTR)

      • which is the time it takes to recover from a failure

      • When a component breaks, it needs to be repaired.

        • The repair time (MTTR) is usually kept low by having a service contract with the supplier of the component.

        • Sometimes spare parts are kept onsite to lower the MTTR

        • Typically, a faulty component is not repaired immediately

          • Several steps take place first, e.g.

            • Notification of the fault (time before seeing an alarm message)

            • Processing the alarm

            • Finding the root cause of the error

            • Looking up repair information

            • Getting spare components from storage

            • Having a technician come to the datacenter with the spare component

            • Physically repairing the fault

            • Restarting and testing the component

      • The best way to keep the MTTR low is to introduce automated redundancy and failover

    • These are statistically calculated values

  • Decreasing MTTR and increasing MTBF both increase availability.

  • Dividing MTBF by the sum of MTBF and MTTR gives the availability: availability = MTBF / (MTBF + MTTR)

  • To reach five nines of availability, the repair time of a component must be as low as 90 minutes if its MTBF is 150,000 hours

    • with an MTBF of only one year (8,760 hours), five nines leaves roughly 5 minutes for the repair

  • As system complexity increases, usually availability decreases.

  • Serial availability

    • applies when a failure of any one part in a system causes a failure of the system as a whole

    • To calculate the availability of such a system or device, multiply the availabilities of all its parts (convert percentages to decimals first)

    • The result is never higher than the availability of the least available component

      • it equals that lowest value only if every other part is 100% available; otherwise it is strictly lower, even when all parts have the same availability

  • To increase the availability, systems (each composed of various components) can be deployed in parallel.

    • The combined system no longer contains a Single Point Of Failure

    • If one component goes down in one system, the other system can take over until the first system's component is fixed and brought back up

    • In this situation, it is important that nothing shared by the set of systems forms a single point of failure

      • for instance, all systems running on the same power supply would be such a SPOF
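
The calculations in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a definitive model: all function names are hypothetical, "3 months" is taken as 90 days, the annual failure rate uses the common approximation 8,760 / MTBF (reasonable only when the MTBF is much longer than a year), and the parallel formula assumes the systems fail independently, with no shared single point of failure.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760


def mtbf_from_batch_test(units: int, test_hours: float, failures: int) -> float:
    """Estimate MTBF (hours) from a batch test: total unit-hours / failures."""
    return units * test_hours / failures


def annual_failure_rate(mtbf_hours: float) -> float:
    """Approximate fraction of units failing per year (valid for MTBF >> 1 year)."""
    return HOURS_PER_YEAR / mtbf_hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours: float, target: float) -> float:
    """Longest repair time that still meets the target availability."""
    return mtbf_hours * (1 - target) / target


def serial(availabilities) -> float:
    """Every part must work: multiply the individual availabilities."""
    return math.prod(availabilities)


def parallel(availabilities) -> float:
    """Any one system suffices: 1 minus the chance that all are down at once."""
    return 1 - math.prod(1 - a for a in availabilities)


# 1000 disks tested for 3 months (90 days), 5 failures
mtbf = mtbf_from_batch_test(1000, 90 * 24, 5)
print(mtbf)                                          # 432000.0 hours
print(round(annual_failure_rate(mtbf) * 100, 1))     # 2.0 (% per year)

# Five nines with an MTBF of 150,000 hours
print(round(max_mttr_hours(150_000, 0.99999) * 60))  # 90 (minutes)

# Three 99% parts in series versus two 99% systems in parallel
print(round(serial([0.99, 0.99, 0.99]), 4))          # 0.9703
print(round(parallel([0.99, 0.99]), 4))              # 0.9999
```

Note how three 99% components in series drop to about 97%, while two 99% systems in parallel reach about 99.99%, provided nothing they share can take both down together.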

Human Errors and Availability

  • Usually only 20% of the failures leading to unavailability are technology failures

    • The rest are people and process issues

      • about 50% of these are caused by change, configuration, release integration, and hand-off issues

  • Need to have highly qualified and trained staff, with a healthy sense of responsibility.

    • Errors are human, they will always happen

  • Examples

    • End users can introduce downtime by misuse of the system

      • When a user, for instance, starts generating ten very large reports at the same time, the performance of the system could suffer to such a degree that the system becomes unavailable to other users

      • When a user forgets a password and is locked out, the system is unavailable for that user; this could mean that a business process is unavailable to other users as well

  • Most unavailability issues, however, are the result of actions from systems managers.

    • examples

      • Performing a test in the production environment

      • Switching off the wrong component - not the defective server that needs repair, but the one still operating

      • Swapping out the wrong component instead of the faulty one

      • Restoring the wrong backup tape to production

      • Accidentally removing files (mail folders, configuration files) or database entries

      • Making incorrect changes to configurations

      • Incorrect labeling of cables, later leading to errors when changes are made to the cabling.

      • Performing maintenance on an incorrect virtual machine

      • Making a typo in a system command environment

      • Insufficient testing: for instance, the fallback procedure to move operations from the primary datacenter to the secondary was never tested, and failed when it was really needed

    • Many of these mistakes can be avoided by using proper systems management procedures

      • having a standard template for creating new servers

      • using formal deployment strategies with the appropriate tools

      • using administrative accounts only when absolutely needed

      • showing warnings to root users, to keep them aware of the risk

  • Hackers can create downtime by, for instance, executing a Denial of Service attack

Bugs

  • Software bugs are the number two reason for unavailability

  • Because of the complexity of most software, it is nearly impossible (and very costly) to create bug-free software

  • Bugs in systems or drivers can

    • stop an entire system

    • create downtime

  • Operating systems also contain bugs that can lead to corrupted file systems, network failures, or other sources of unavailability

  • Bugs can be

    • accidental

      • something breaks in production and is fixed later in the software, so it does not happen again

    • accepted, with a manual workaround, because that is cheaper than fixing them

    • introduced on purpose by a dissatisfied worker, spy, or hacker; these are spotted and fixed once the bug surfaces

      • this can be costly, so prevention is better, e.g. code reviews and restricting access to the code base to specific members

Planned Maintenance

  • Planned maintenance is sometimes needed to perform

    • systems management tasks like upgrading hardware or software

    • implementing software changes

    • migrating data

    • the creation of backups

  • planned maintenance should only be performed on parts of the infrastructure while other parts keep serving clients.

    • to maintain high availability

    • downtime of a single component then does not lead to downtime of the entire system

      • provided the component is not a single point of failure

      • this allows, for example, an OS upgrade while the system stays up

  • During planned maintenance, however, the system is more vulnerable to downtime than under normal circumstances

    • When the systems manager makes a mistake during planned maintenance, the risk of downtime is higher than normal

    • taking a redundant component out of service can temporarily create a SPOF

  • Example

    • The upgrade of systems in a high-availability cluster: when one component has been upgraded and the other has not yet, the cluster may not function as a high-availability cluster. During that period the system is vulnerable to downtime

Physical defects

  • everything breaks down eventually, but mechanical parts are most likely to break first.

  • Apart from mechanical failures because of normal usage, parts also break because of external factors like ambient temperature, moisture, vibrations, and aging.

  • In most cases the failure rate of a component follows a so-called bathtub curve.

    • A component failure is most likely when the component is new. In the first month of use the chance of a component failure is relatively high. Sometimes a component doesn't even work at all when unpacked for the first time.

      • This is called a DOA component – Dead On Arrival

    • When a component still works after the first month, it is likely that it will continue working without failure until the end of its technical life cycle

    • the chance of failure rises suddenly at the end of the life cycle of a component.

Environmental issues

  • Issues with power and cooling, and external factors like fire, earthquakes and flooding can cause entire datacenters to fail.

    • Power can be cut (for long or short periods) or suffer voltage drops or spikes

    • Failure of air conditioning causes the temperature to rise, which causes parts to break or slow down

Complexity of the infrastructure

  • Complex systems inherently have more potential points of failure and are more difficult to implement correctly.

  • a complex system is harder to manage; more knowledge is needed to maintain the system and errors are made more easily.

  • Sometimes it is better to just have an extra spare system in the closet than to use complex redundant systems

Availability patterns

  • A single point of failure (SPOF) is a component in the infrastructure that, if it fails, causes downtime to the entire system.

    • SPOFs should be avoided and, where possible, eliminated

  • The trick is to find SPOFs that are not that obvious

  • Eliminating every SPOF is not always feasible or cost effective.

  • It is good to realize that there is always something shared in an infrastructure

    • We just need to know what is shared and if the risk of sharing is acceptable

  • To eliminate SPOFs, a combination of redundancy, failover, and fallback can be used

Redundancy

  • Redundancy is the duplication of critical components in a single system, to avoid a SPOF.

  • usually implemented in power supplies, network interfaces, and SAN HBAs (Host Bus Adapters) for connecting storage.

Failovers

  • Failover is the (semi)automatic switch-over to a standby system (component), either in the same or in another datacenter, upon the failure or abnormal termination of the previously active system (component)

Fallback

  • Fallback is the manual switchover to an identical standby computer system in a different location, typically used for disaster recovery.

  • three basic forms of fallback solutions

    • Hot site

      • is a fully configured fallback datacenter, fully equipped with power and cooling. The applications are installed on the servers, and data is kept up to date to fully mirror the production system.

      • Staff and operators should be able to walk in and begin full operations in a very short time (typically one or two hours).

      • requires constant maintenance of the hardware, software, data, and applications to be sure the site accurately mirrors the state of the production site at all times.

    • Warm site

      • A warm site could best be described as a mix between a hot site and cold site.

      • is a computer facility readily available with power, cooling, and computers, but the applications may not be installed or configured

      • external communication links and other data elements, that commonly take a long time to order and install, will be present

      • To start working in a warm site, applications and all their data will need to be restored from backup media and tested. This typically takes a day

      • needs less attention when not in use and is much cheaper than a hot site

    • Cold site

      • it is ready for equipment to be brought in during an emergency, but no computer hardware is available at the site.

      • is a room with power and cooling facilities, but computers must be brought on-site if needed, and communications links may not be ready. Applications will need to be installed and current data fully restored from backups.

      • if an organization has very little budget for a fallback site, a cold site may be better than nothing

Business Continuity

  • the availability of the IT infrastructure can never be guaranteed in all situations

  • Business continuity is about identifying threats an organization faces and providing an effective response

    • Business Continuity Management (BCM) and Disaster Recovery Planning (DRP) are processes to handle the effect of disasters

    • Business Continuity Management

      • managing business processes, and the availability of people and work places in disaster situations.

      • covers disaster recovery, business recovery, crisis management, incident management, emergency management, product recall, and contingency planning

      • A Business Continuity Plan (BCP) describes the measures to be taken when a critical incident occurs in order to continue running critical operations, and to halt non-critical processes

      • guidelines such as BS 25999 can be used

    • Disaster Recovery Planning

      • a set of measures to take in case of a disaster, when (parts of) the IT infrastructure must be accommodated in an alternative location.

      • DRP assesses the risk of failing IT systems and provides solutions

      • An IT disaster is defined as an irreparable problem in a datacenter, making the datacenter unusable; such disasters fall into two categories

        • The first category is natural disasters such as floods, hurricanes, tornadoes or earthquakes.

        • The second category is manmade disasters, including hazardous material spills, infrastructure failure, or bio-terrorism

      • The disaster recovery standard BS 25777 can be used to implement DRP

      • A typical DRP solution is the use of fallback facilities and having a Computer Emergency Response Team (CERT) in place.

        • A CERT is usually a team of systems managers and senior management that decides how to handle a certain crisis once it happens

      • One of the first worries is to save people during a disaster.

        • But after that, procedures must be followed to restore IT operations as soon as possible

  • RTO - Recovery Time Objective

    • The maximum duration of time within which a business process must be restored after a disaster, in order to avoid unacceptable consequences (like bankruptcy).

    • It is only valid in case of a disaster; it is not the acceptable downtime under normal circumstances.

    • Failover and fallback are used to meet it

  • RPO - Recovery Point Objective

    • The point in time to which data must be recovered considering some "acceptable loss" in a disaster situation.

    • the amount of data loss a business is willing to accept in case of a disaster, measured in time.

    • Different backup regimes will affect this
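
To make the two objectives concrete, here is a minimal sketch that checks a recovery setup against an RTO and an RPO. The class, field names, and all numbers are hypothetical and chosen for illustration; it assumes the worst-case data loss equals the interval between backups (a disaster striking just before the next backup would run).

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical description of a backup and fallback setup."""
    backup_interval_hours: float  # time between successive backups or replications
    restore_hours: float          # time needed to restore service at the fallback site


def worst_case_data_loss_hours(plan: RecoveryPlan) -> float:
    """Everything written since the last backup is lost in a disaster."""
    return plan.backup_interval_hours


def meets_objectives(plan: RecoveryPlan, rto_hours: float, rpo_hours: float) -> bool:
    """True when restore time fits the RTO and data loss fits the RPO."""
    return (plan.restore_hours <= rto_hours
            and worst_case_data_loss_hours(plan) <= rpo_hours)


# Nightly backups restored at a warm site (about a day to restore)
warm = RecoveryPlan(backup_interval_hours=24, restore_hours=24)
# Replication every 15 minutes to a hot site (about two hours to start up)
hot = RecoveryPlan(backup_interval_hours=0.25, restore_hours=2)

print(meets_objectives(warm, rto_hours=4, rpo_hours=1))  # False
print(meets_objectives(hot, rto_hours=4, rpo_hours=1))   # True
```

A tighter RPO pushes towards more frequent backups or replication, while a tighter RTO pushes towards a hotter fallback site.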


Last updated 4 years ago
