A hands-on workshop where you'll learn to protect your microservices from cascading failures using retry, circuit breaker, and distributed retry patterns.
```
┌─────────────────────┐         ┌─────────────────────┐
│  Service A          │  HTTP   │  Service B          │
│  Product Catalog    │────────>│  Pricing Service    │
│  (port 8080)        │         │  (port 8081)        │
│                     │         │                     │
│  GET /products      │         │  GET /pricing/{id}  │
│  GET /products/{id} │         │  POST /admin/...    │
└─────────────────────┘         └─────────────────────┘
```
Service A serves a product catalog. For each product, it calls Service B to get the current price. Service B has an admin API that lets you simulate different failure scenarios.
- Java 21
- Maven 3.9+
- Docker & Docker Compose
- A REST client (curl, Postman, or IntelliJ HTTP client)
```bash
# Build both services
mvn clean package -DskipTests

# Terminal 1: Start Service B (Pricing)
cd pricing-service
mvn spring-boot:run

# Terminal 2: Start Service A (Product Catalog)
cd product-catalog-service
mvn spring-boot:run
```

Or run everything with Docker:

```bash
# Build the jars first
mvn clean package -DskipTests

# Start both services
docker-compose up --build
```

Smoke-test the setup:

```bash
# Get all products with prices
curl http://localhost:8080/products

# Get a single product
curl http://localhost:8080/products/PROD-001

# Check Service B status
curl http://localhost:8081/admin/status
```

Use these endpoints to simulate different failure scenarios:
| Command | Effect |
|---|---|
| `curl -X POST http://localhost:8081/admin/healthy` | Normal operation |
| `curl -X POST http://localhost:8081/admin/slow` | 10-second delay on every request |
| `curl -X POST http://localhost:8081/admin/fail` | 500 error on every request |
| `curl -X POST "http://localhost:8081/admin/random?rate=40"` | 40% of requests fail randomly |
| `curl http://localhost:8081/admin/status` | Check current failure mode |
Goal: Understand what happens when a downstream service fails and there's no resilience in place.
You're on-call. The pricing service is having issues. Document what happens.
- Start both services and verify `GET http://localhost:8080/products` works
- Toggle Service B to slow mode:
  ```bash
  curl -X POST http://localhost:8081/admin/slow
  ```
- Hit `/products` from multiple browser tabs simultaneously. What happens? How long does each request take?
- Toggle Service B to error mode:
  ```bash
  curl -X POST http://localhost:8081/admin/fail
  ```
- Hit `/products` again. What do you see?
- Bonus: If using Docker, stop Service B entirely:
  ```bash
  docker stop pricing-service
  ```
  What happens now? Is this different from the slow failure?
Think about:
- How many concurrent requests does it take to make Service A unresponsive?
- Which is worse: a slow service or a dead service? Why?
- What happens to Service A's thread pool when Service B is slow?
Don't forget to reset Service B when you're done:
```bash
curl -X POST http://localhost:8081/admin/healthy
```

Goal: Handle transient failures by retrying failed requests. Then learn why naive retries can make things worse.
Mission: Service B fails randomly 40% of the time. Add retry logic so most user requests succeed.
```bash
# Set Service B to random failure mode (40%)
curl -X POST "http://localhost:8081/admin/random?rate=40"
```

Now hit `GET /products` several times. You'll see failures. Fix it with a retry.
Where to add it: `product-catalog-service/src/main/java/com/workshop/catalog/client/PricingClient.java`
What to do:
- Add the `@Retry` annotation to the `getPrice` method
- Configure the retry in `application.yml`
Hint 1: The annotation

```java
@Retry(name = "pricingService")
public Map<String, Object> getPrice(String productId) {
```

Hint 2: The YAML config
Add this to `product-catalog-service/src/main/resources/application.yml`:
```yaml
resilience4j:
  retry:
    instances:
      pricingService:
        maxAttempts: 3
        waitDuration: 500ms
```

Hint 3: Don't forget the import!

```java
import io.github.resilience4j.retry.annotation.Retry;
```

Test it: With a 40% failure rate and 3 attempts, what's the probability of all 3 attempts failing? (Answer: 0.4^3 = 6.4%, so ~94% of requests should succeed now!)
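The arithmetic can be sanity-checked in plain Java. This assumes each attempt fails independently with the same probability, which is exactly what the random failure mode simulates:

```java
public class RetryMath {
    // Probability that every one of n independent attempts fails,
    // given a per-attempt failure probability p.
    static double allAttemptsFail(double p, int n) {
        return Math.pow(p, n);
    }

    public static void main(String[] args) {
        double pFailAll = allAttemptsFail(0.4, 3);
        System.out.printf("P(all 3 fail) = %.4f%n", pFailAll);      // 0.0640
        System.out.printf("P(success)    = %.4f%n", 1 - pFailAll);  // 0.9360
    }
}
```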
Check your metrics:
```bash
curl http://localhost:8080/actuator/retries
curl http://localhost:8080/actuator/retryevents
```

Stuck? `git checkout step-1-retry`
Mission: Your retry works, but you're making things worse. Fix it.
- Toggle Service B to slow mode:
  ```bash
  curl -X POST http://localhost:8081/admin/slow
  ```
- Send 10 concurrent requests:
  ```bash
  for i in {1..10}; do curl -s http://localhost:8080/products/PROD-001 & done; wait
  ```
- Check Service B's logs. How many requests did it receive? (Hint: with `maxAttempts: 3`, each of the 10 requests can turn into 3 calls, so up to 30!)
Problem: Your retries are hammering an already struggling service. This is called a retry storm.
Fix it: Add exponential backoff and jitter.
Hint 1: What is exponential backoff?
Instead of retrying every 500ms, each retry waits longer:
- 1st retry: 500ms
- 2nd retry: 1000ms (500ms * 2)
- 3rd retry: 2000ms (500ms * 2 * 2)
This gives the failing service time to recover.
Hint 2: What is jitter?
If 100 clients all retry at exactly the same intervals, they'll all hit the server at the same time (thundering herd). Jitter adds randomness to the wait time so retries are spread out.
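To make the two ideas concrete, here is a small framework-free sketch of computing a backoff delay with jitter. It mirrors the concept rather than Resilience4j's internal formula; the 500ms base, multiplier 2, and jitter factor 0.5 match the workshop config:

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    // Wait time before retry number `attempt` (1-based):
    // base * multiplier^(attempt-1), randomized by +/- jitterFactor.
    static long waitMillis(long baseMillis, double multiplier,
                           double jitterFactor, int attempt) {
        double exponential = baseMillis * Math.pow(multiplier, attempt - 1);
        // Jitter: pick uniformly in [exponential*(1-f), exponential*(1+f)]
        double delta = exponential * jitterFactor;
        double jittered = exponential - delta
                + ThreadLocalRandom.current().nextDouble() * 2 * delta;
        return (long) jittered;
    }

    public static void main(String[] args) {
        // With base 500ms, x2, jitter 0.5: roughly 250-750, 500-1500, 1000-3000 ms
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.printf("retry %d: ~%d ms%n",
                    attempt, waitMillis(500, 2.0, 0.5, attempt));
        }
    }
}
```

Because every client draws its own random delay, 100 clients that failed at the same instant no longer retry at the same instant.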
Hint 3: The YAML config
```yaml
resilience4j:
  retry:
    instances:
      pricingService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true
        randomizedWaitFactor: 0.5
```

Bonus Challenge: Only retry on server errors (5xx), NOT on client errors (4xx). A 400 Bad Request will never succeed no matter how many times you retry.
Bonus Hint: Exception filtering
```yaml
resilience4j:
  retry:
    instances:
      pricingService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true
        randomizedWaitFactor: 0.5
        retryExceptions:
          - org.springframework.web.reactive.function.client.WebClientResponseException.InternalServerError
          - org.springframework.web.reactive.function.client.WebClientResponseException.ServiceUnavailable
          - org.springframework.web.reactive.function.client.WebClientResponseException.BadGateway
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - org.springframework.web.reactive.function.client.WebClientResponseException.BadRequest
          - org.springframework.web.reactive.function.client.WebClientResponseException.NotFound
```

Stuck? `git checkout step-1-retry`
Goal: When a service is consistently failing, stop calling it entirely. Fail fast and provide a fallback.
Mission: Service B is down. Instead of waiting and retrying (wasting time and resources), detect the failure pattern and stop calling it.
```bash
# Set Service B to fail mode
curl -X POST http://localhost:8081/admin/fail
```

Where to add it: Same file — `PricingClient.java`
What to do:
- Add the `@CircuitBreaker` annotation to the `getPrice` method
- Configure it in `application.yml`
- Send requests repeatedly and watch the circuit breaker open
Hint 1: The annotation
```java
@CircuitBreaker(name = "pricingService")
public Map<String, Object> getPrice(String productId) {
```

Hint 2: The YAML config
```yaml
resilience4j:
  circuitbreaker:
    instances:
      pricingService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 3
        slidingWindowType: COUNT_BASED
```

What this means:

- Look at the last 10 calls (`slidingWindowSize`)
- If 50% or more failed (`failureRateThreshold`), open the circuit
- Stay open for 10 seconds (`waitDurationInOpenState`)
- Then allow 3 test calls (`permittedNumberOfCallsInHalfOpenState`)
- If those succeed, close the circuit again
Hint 3: Import

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
```

Watch the state transitions:

```bash
# Check circuit breaker state
curl http://localhost:8080/actuator/circuitbreakers

# Check circuit breaker events
curl http://localhost:8080/actuator/circuitbreakerevents
```

WARNING: The #1 Workshop Pitfall!
If your `@CircuitBreaker` annotation doesn't seem to work, check this: Spring AOP proxies do NOT intercept method calls within the same class. If `ProductService` calls a `@CircuitBreaker` method that's also in `ProductService`, the annotation is ignored!
The annotation must be on a method in a different Spring bean, called from outside that bean. That's why we put it on `PricingClient` (called by `ProductService`).
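You can see the underlying mechanism without Spring, using a plain JDK dynamic proxy as a deliberately simplified stand-in for what Spring AOP builds around your bean. The interceptor here plays the role of the `@CircuitBreaker` aspect:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class SelfInvocationDemo {
    interface Pricing {
        String getPrice(String id);
        String getPriceViaSelfCall(String id);
    }

    static class PricingImpl implements Pricing {
        public String getPrice(String id) { return "price:" + id; }
        // Internal call: goes through `this`, NOT through the proxy,
        // so the interceptor (our stand-in for @CircuitBreaker) is skipped.
        public String getPriceViaSelfCall(String id) { return getPrice(id); }
    }

    static int intercepted = 0; // counts getPrice calls the "aspect" actually saw

    public static void main(String[] args) {
        Pricing target = new PricingImpl();
        InvocationHandler aspect = (proxy, method, methodArgs) -> {
            if (method.getName().equals("getPrice")) intercepted++;
            return method.invoke(target, methodArgs);
        };
        Pricing proxied = (Pricing) Proxy.newProxyInstance(
                Pricing.class.getClassLoader(),
                new Class<?>[]{Pricing.class}, aspect);

        proxied.getPrice("PROD-001");            // external call: intercepted
        proxied.getPriceViaSelfCall("PROD-001"); // inner getPrice: NOT intercepted
        System.out.println("getPrice interceptions: " + intercepted); // 1, not 2
    }
}
```

Spring injects the proxied object into *other* beans, which is why calls from `ProductService` into `PricingClient` are intercepted while `PricingClient`'s own internal calls are not.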
Stuck? `git checkout step-2-circuitbreaker`
Mission: When the circuit is open, users see an ugly error. Give them something useful instead.
What to do:
- Add a `fallbackMethod` to your `@CircuitBreaker` annotation
- The fallback should return a default price with a `"priceStale": true` flag
- Test it: toggle Service B to fail, wait for the circuit to open, then hit `/products`
Hint 1: The annotation with fallback
```java
@CircuitBreaker(name = "pricingService", fallbackMethod = "getPriceFallback")
public Map<String, Object> getPrice(String productId) {
    // ... existing code
}
```

Hint 2: The fallback method signature
The fallback method must:
- Be in the same class
- Have the same parameters as the original method, plus a `Throwable` parameter
- Have the same return type
```java
private Map<String, Object> getPriceFallback(String productId, Throwable t) {
    log.warn("Fallback triggered for product {}: {}", productId, t.getMessage());
    return Map.of(
            "productId", productId,
            "price", 0,
            "currency", "EUR",
            "discount", 0,
            "finalPrice", 0,
            "priceStale", true,
            "error", "Price temporarily unavailable"
    );
}
```

Hint 3: Bonus — cache the last known good price
Add a simple in-memory cache to PricingClient:
```java
private final Map<String, Map<String, Object>> priceCache = new ConcurrentHashMap<>();

public Map<String, Object> getPrice(String productId) {
    // ... existing WebClient call
    Map<String, Object> response = // ... call Service B
    priceCache.put(productId, response); // Cache successful responses
    return response;
}

private Map<String, Object> getPriceFallback(String productId, Throwable t) {
    Map<String, Object> cached = priceCache.get(productId);
    if (cached != null) {
        Map<String, Object> result = new HashMap<>(cached);
        result.put("priceStale", true);
        return result;
    }
    // No cached price available
    return Map.of(
            "productId", productId,
            "price", 0,
            "currency", "EUR",
            "discount", 0,
            "finalPrice", 0,
            "priceStale", true,
            "error", "Price temporarily unavailable"
    );
}
```

Stuck? `git checkout step-2-circuitbreaker`
Mission: Use both patterns together. But the order matters!
The question: If you have both `@Retry` and `@CircuitBreaker` on the same method, which one executes first?
The wrong order: Retry wraps Circuit Breaker
- Circuit opens → Retry still tries 3 times → Each attempt is instantly rejected → Wasted effort
The right order: Circuit Breaker wraps Retry
- Circuit open? → Don't even bother retrying, go straight to fallback
- Circuit closed? → Try the call, retry on failure, count the final result
Hint 1: How to control the order
Resilience4j uses aspect ordering. Lower number = higher priority (executes first, wraps the others).
```yaml
resilience4j:
  circuitbreaker:
    circuitBreakerAspectOrder: 1
  retry:
    retryAspectOrder: 2
```

With this config: CircuitBreaker (order 1) wraps Retry (order 2), which wraps the actual call.
Hint 2: Both annotations together
```java
@CircuitBreaker(name = "pricingService", fallbackMethod = "getPriceFallback")
@Retry(name = "pricingService")
public Map<String, Object> getPrice(String productId) {
    // ... existing code
}
```

Hint 3: Full YAML config
```yaml
resilience4j:
  circuitbreaker:
    circuitBreakerAspectOrder: 1
    instances:
      pricingService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 3
        slidingWindowType: COUNT_BASED
  retry:
    retryAspectOrder: 2
    instances:
      pricingService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true
        randomizedWaitFactor: 0.5
        retryExceptions:
          - org.springframework.web.reactive.function.client.WebClientResponseException.InternalServerError
          - org.springframework.web.reactive.function.client.WebClientResponseException.ServiceUnavailable
          - java.io.IOException
          - java.util.concurrent.TimeoutException
```

Test the full flow:
- Service B healthy → all good
- `curl -X POST "http://localhost:8081/admin/random?rate=40"` → retry saves you
- `curl -X POST http://localhost:8081/admin/fail` → circuit opens → fallback kicks in
- `curl -X POST http://localhost:8081/admin/healthy` → circuit goes half-open → closes again
Stuck? `git checkout step-2-circuitbreaker`
The Problem: Everything we've built so far is in-memory. If Service A crashes, all pending retries are lost. In production with multiple instances, you need persistent retry.
The Pattern:
- When the circuit breaker fallback fires, save the failed request to a database
- A `@Scheduled` job polls the database and retries periodically
- Use ShedLock to ensure only one instance runs the scheduler
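Stripped of frameworks, the core loop of that pattern reduces to roughly the sketch below. The in-memory queue stands in for the database table, `call` stands in for the real HTTP call to Service B, and the `RetryJob` name and `MAX_ATTEMPTS` limit are illustrative only:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

public class RetryJob {
    record PendingRequest(String productId, int attempts) {}

    // Stand-in for the persistent retry-queue table.
    final Queue<PendingRequest> queue = new ConcurrentLinkedQueue<>();
    static final int MAX_ATTEMPTS = 5;

    // What a @Scheduled method would do each tick. `call` returns true on success.
    void pollAndRetry(Predicate<String> call) {
        int n = queue.size();                    // only drain what's pending now
        for (int i = 0; i < n; i++) {
            PendingRequest req = queue.poll();
            if (req == null) break;
            if (call.test(req.productId())) continue;   // success: done
            if (req.attempts() + 1 < MAX_ATTEMPTS)      // failure: requeue
                queue.add(new PendingRequest(req.productId(), req.attempts() + 1));
            // else: give up (a real system would dead-letter the request)
        }
    }

    public static void main(String[] args) {
        RetryJob job = new RetryJob();
        job.queue.add(new PendingRequest("PROD-001", 0));
        job.pollAndRetry(id -> false); // Service B still down: stays queued
        System.out.println("pending after failed tick: " + job.queue.size()); // 1
        job.pollAndRetry(id -> true);  // Service B recovered: drained
        System.out.println("pending after success: " + job.queue.size());     // 0
    }
}
```

In the real version the queue survives a crash because it lives in the database, and ShedLock's lock around the scheduled method ensures only one Service A instance runs the tick.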
This is implemented in the solution branch. Check it out to see the full implementation:
```bash
git checkout solution
```

Key files:

- `RetryRequest.java` — JPA entity for the retry queue
- `RetryRequestRepository.java` — Spring Data repository
- `DistributedRetryService.java` — Scheduled job with ShedLock
- `application.yml` — H2 + ShedLock configuration
| Endpoint | Description |
|---|---|
| `GET http://localhost:8080/products` | All products with prices |
| `GET http://localhost:8080/products/PROD-001` | Single product |
| `GET http://localhost:8080/actuator/health` | Health check (includes CB state) |
| `GET http://localhost:8080/actuator/retries` | Retry instances and config |
| `GET http://localhost:8080/actuator/retryevents` | Retry event log |
| `GET http://localhost:8080/actuator/circuitbreakers` | Circuit breaker states |
| `GET http://localhost:8080/actuator/circuitbreakerevents` | Circuit breaker event log |
| `GET http://localhost:8081/admin/status` | Service B failure mode |
If you get stuck, check out the solution branch for that phase:
| Branch | Content |
|---|---|
| `main` | Starter code (no resilience) |
| `step-1-retry` | Retry with backoff, jitter, and exception filtering |
| `step-2-circuitbreaker` | Circuit breaker with fallback and in-memory price cache |
| `solution` | Everything + distributed retry with ShedLock |