Java Microservices - Retry Pattern



Introduction

In distributed systems and microservices, network failures, timeouts, and other transient faults are common. Such failures often clear up on their own, so a request that fails once may succeed if it is simply tried again. The Retry Pattern is a resilience technique in which a failed request is automatically retried, typically after a brief delay, up to a limited number of attempts before the operation is finally reported as failed.

This pattern significantly increases the fault tolerance of microservices by allowing them to recover from temporary issues without immediate failure.

Motivation and Problem Statement

Let's consider a real-world example:

A payment microservice calls a third-party payment gateway API. Occasionally, the request fails due to:

  • Temporary network issues

  • DNS lookup failure

  • Gateway throttling

If the service fails outright, it may disrupt customer experience. Instead, if it retries the request a few times, the operation could succeed on the second or third attempt, improving reliability.

Key Challenges

  • Unpredictable failures in remote services

  • Overreaction to minor or short-lived glitches

  • Impact on user experience and system stability

When and Where to Apply

Use the Retry Pattern when −

  • Failures are transient and recoverable (e.g., timeouts, 5xx errors, temporary unavailability)

  • The operation is idempotent (i.e., calling it multiple times won't corrupt data or cause unwanted side effects)

  • The remote system is well-known and typically stable

Avoid retries when −

  • The failure is permanent (e.g., 404 Not Found, 401 Unauthorized)

  • The call is non-idempotent (e.g., money transfer or email sending)

  • Retry may flood an already overloaded system
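One way to apply these rules in code is a small classifier that treats 5xx responses and 429 (throttling) as transient, and 4xx client errors such as 404 or 401 as permanent. The class below is a hypothetical sketch, not part of any library:

```java
// Hypothetical classifier separating transient failures (worth retrying)
// from permanent ones (where retrying will not help), per the lists above.
class FailureClassifier {

    static boolean isTransient(int httpStatus) {
        // 5xx server errors and 429 (throttling) are usually temporary;
        // 4xx client errors such as 404 or 401 are permanent.
        return httpStatus == 429 || (httpStatus >= 500 && httpStatus <= 599);
    }
}
```

A real service would usually also classify low-level exceptions (e.g., connection timeouts) as transient, but the status-code rule above covers the common HTTP cases.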

Core Concepts and Principles

Retry Policy

A retry policy defines how retry attempts are made. Key parameters −

  • Max retries − How many times to retry (e.g., 3 attempts)

  • Delay − Time between retries (e.g., 200ms)

  • Backoff strategy − Fixed, exponential, or randomized

  • Retry on − Specific exceptions or HTTP statuses
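As a sketch, these parameters can be modeled as a plain value object; the class and method names below are illustrative, not taken from any particular library:

```java
import java.time.Duration;
import java.util.List;

// Hypothetical value object capturing the retry-policy parameters above.
class RetryPolicy {
    final int maxRetries;                            // e.g., 3 attempts
    final Duration delay;                            // e.g., 200ms between attempts
    final List<Class<? extends Exception>> retryOn;  // exception types worth retrying

    RetryPolicy(int maxRetries, Duration delay,
                List<Class<? extends Exception>> retryOn) {
        this.maxRetries = maxRetries;
        this.delay = delay;
        this.retryOn = retryOn;
    }

    // Retry only while attempts remain AND the failure is a retryable type.
    boolean shouldRetry(int attemptsSoFar, Exception e) {
        return attemptsSoFar < maxRetries
                && retryOn.stream().anyMatch(c -> c.isInstance(e));
    }
}
```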

Backoff Strategy

  • Fixed Delay − Wait a constant time between retries

  • Exponential Backoff − Delay increases exponentially

  • Exponential Backoff with Jitter − Adds randomness to avoid retry storms
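The three strategies above boil down to simple delay calculations. The sketch below illustrates them in plain Java (class and method names are hypothetical):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper computing the delay (in ms) before a given retry attempt.
class BackoffCalculator {

    // Fixed delay: the same wait every time.
    static long fixed(long baseMs) {
        return baseMs;
    }

    // Exponential backoff: base * 2^attempt, capped to avoid huge waits.
    static long exponential(long baseMs, int attempt, long capMs) {
        return Math.min(capMs, baseMs * (1L << attempt));
    }

    // "Full jitter": pick a random delay in [0, exponential) so many clients
    // retrying at once do not all hit the server at the same moment.
    static long exponentialWithJitter(long baseMs, int attempt, long capMs) {
        return ThreadLocalRandom.current().nextLong(exponential(baseMs, attempt, capMs));
    }
}
```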

Design Considerations

When designing a retry mechanism −

  • Ensure idempotency

  • Set timeouts on retries to avoid hanging requests

  • Log each retry attempt

  • Use circuit breaker in conjunction to avoid retrying during complete outages

  • Implement fallbacks for graceful degradation
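In particular, the per-attempt timeout can be sketched with a Future, using only the JDK; the class name below is hypothetical, and resilience libraries provide more polished versions of the same idea:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: bound each attempt with a timeout so a hung call
// cannot block the retry loop (and its thread) indefinitely.
class TimedCall {

    static <T> T callWithTimeout(Supplier<T> call, long timeoutMs, T fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = pool.submit(call::get);
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {   // timeout, interruption, or call failure
            return fallback;
        } finally {
            pool.shutdownNow();   // cancel the hung task, if any
        }
    }
}
```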

Retry Diagram (described in text)

A retry loop can be illustrated as:

Request → Failure → Retry → Failure → Retry → Give up → Fallback/Error

Implementation Strategies

Strategy 1 − Manual Retry Logic

A developer can wrap method calls in a loop with sleep/delay and exception handling.

int maxAttempts = 3;
int attempt = 0;
while (attempt < maxAttempts) {
   try {
      callExternalService();
      break; // success, stop retrying
   } catch (Exception e) {
      attempt++;
      if (attempt == maxAttempts) {
         throw e; // all attempts exhausted, propagate the failure
      }
      try {
         Thread.sleep(200); // delay before the next attempt
      } catch (InterruptedException ie) {
         Thread.currentThread().interrupt(); // restore interrupt status
         throw ie;
      }
   }
}
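A more reusable variant of this idea wraps the loop in a generic helper that gives up after a fixed number of attempts and returns a fallback value, matching the "Give up → Fallback" step in the diagram above. The names below are illustrative:

```java
import java.util.function.Supplier;

// Hypothetical generic helper: retries a supplier up to maxAttempts with a
// fixed delay between attempts, then falls back to a default value.
class RetryExecutor {

    static <T> T withRetry(Supplier<T> call, int maxAttempts, long delayMs, T fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();                 // success on this attempt
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    return fallback;               // all attempts exhausted
                }
                try {
                    Thread.sleep(delayMs);         // wait before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return fallback;               // stop retrying if interrupted
                }
            }
        }
        return fallback; // unreachable, satisfies the compiler
    }
}
```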

Strategy 2 − Framework-Based Retry

Use libraries like −

  • Spring Retry

  • Resilience4j Retry

These offer declarative retry behavior with advanced configuration.

Example Implementation: Spring Boot + Resilience4j

Dependency

<dependency>
   <groupId>io.github.resilience4j</groupId>
   <artifactId>resilience4j-spring-boot3</artifactId>
   <version>2.0.2</version>
</dependency>

Configuration (application.yml)

resilience4j.retry:
  instances:
    myServiceRetry:
      max-attempts: 3
      wait-duration: 500ms
      retry-exceptions:
        - java.io.IOException

Annotated Method

@Retry(name = "myServiceRetry", fallbackMethod = "fallbackMethod")
public String callExternalService() {
   // Call to the external API; a thrown IOException (configured above)
   // triggers a retry before the fallback is invoked
   return externalApiClient.call(); // hypothetical client call
}

Fallback Method

public String fallbackMethod(Exception e) {
   return "Service temporarily unavailable";
}

The fallback must be defined in the same class, with the same return type as the original method and the triggering exception as its last parameter.

Challenges and Pitfalls

Common Mistakes

  • Retrying non-idempotent operations

  • Not limiting max attempts

  • Retrying instantly without backoff

  • Not using timeouts − can lead to thread exhaustion

  • Cascading retries across services causing overload

Best Practices

  • Always limit the number of retries

  • Retry only on transient and known recoverable failures

  • Log retry attempts and metrics for observability

  • Prefer framework-level retries over custom code when possible

Tools and Libraries

Sr.No.   Tool                  Purpose
1        Spring Retry          Declarative retry support in Spring Boot
2        Resilience4j Retry    Lightweight, modern retry + resilience
3        Polly (.NET)          Retry handling in .NET applications
4        Retry4j               Fluent, configurable retry logic in Java
5        Backoff (Python)      Retry utilities with exponential backoff