Retry

  • Problem

    • A distributed environment is prone to transient errors due to slow networks, timeouts etc.

    • But these issues typically self-correct and if the action is re-triggered, it’s likely to succeed.

    • In such situations, applications need to handle these transient failures without impacting the end-user experience.

  • Solutions

    • Try and try again

      • There are 3 ways to handle transient failures

        • Stop and Report exception:

          • If a fault isn’t transient or cannot succeed when repeated, the application could raise an alert and log the exception

        • Retry immediately:

          • If a fault is rare, the application could retry the failing request immediately and the request may be successful

        • Retry with a delay:

          • If a fault is caused by connectivity issues or issues that may need a short period, the application could retry the failing request after a reasonable amount of time has passed

      • The time delay and number of retries can be configured to suit the application needs

      • If the request still fails even after the desired retry count, the application could report it as a fault and raise an alert

    • Example

      • It’s your dear friend’s birthday! You want to be the first one to wish them so you call them exactly as the clock strikes 12

      • The phone’s busy… You figure someone beat you to it. You hang up (kind of disappointed)

      • But you also know the phone won’t be busy for too long. So you redial, and this time you get through and wish them.

Last updated