The Incident

After a performance optimization commit that refactored our distributed lock template to use Mono.usingWhen and removed a publishOn(Schedulers.boundedElastic()) call, our Spring WebFlux application started experiencing cascading Redis timeouts under moderate load. The stack traces all pointed to the same place:

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 2 second(s)
  at org.springframework.data.redis.connection.lettuce.LettuceConnection.await()
  at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.set()
  at org.springframework.data.redis.cache.DefaultRedisCacheWriter.execute()

The confusing part: we hadn’t changed any Redis configuration, cache TTLs, or business logic. We had only “cleaned up” the threading model.

Background: The Architecture

Our application is a reactive Spring Boot microservice using:

  • Spring WebFlux with Netty as the HTTP server
  • Reactive Redis (Lettuce) for distributed locking and caching
  • Spring @Cacheable with RedisCacheManager for caching domain entities
  • A DistributedLockTemplate that acquires a Redis lock before executing wallet transactions

The write operation pipeline looks like this:

HTTP Request (Netty event loop)
  → Signature Validation
  → Idempotency Check (Reactive Redis)
  → Acquire Distributed Lock (Reactive Redis)
  → Business Logic (includes @Cacheable service calls)
  → DB Transaction (R2DBC)
  → Release Lock
  → Response

The Optimization That Broke Everything

The original DistributedLockTemplate code had a publishOn(Schedulers.boundedElastic()) after acquiring the lock:

// BEFORE: worked correctly
return tryLock(lockKey, lockValue, timeout, maxRetries, retryDelay)
    .publishOn(Schedulers.boundedElastic())  // <-- this line was removed
    .flatMap(acquired -> action
        .flatMap(result -> releaseLock(lockKey, lockValue).thenReturn(result))
        .onErrorResume(err -> releaseLock(lockKey, lockValue).then(Mono.error(err))));

The optimization removed publishOn to avoid unnecessary thread-hopping, since the entire chain was “reactive”:

// AFTER: caused cascading timeouts
return tryLock(lockKey, lockValue, timeout, maxRetries, retryDelay)
    .flatMap(acquired -> action  // action now runs on Netty event loop!
        .flatMap(result -> releaseLock(lockKey, lockValue).thenReturn(result))
        .onErrorResume(err -> releaseLock(lockKey, lockValue).then(Mono.error(err))));

This looked perfectly reasonable. The lock acquisition uses ReactiveRedisTemplate (non-blocking), and the action should also be reactive. Why would we need publishOn?

The Hidden Blocker: @Cacheable + RedisCacheManager

The problem is that Spring’s @Cacheable with RedisCacheManager is not truly reactive. Even in Spring Framework 6.2 (the latest as of this writing), RedisCacheManager does not support async cache mode.

When a @Cacheable method with a Mono<T> return type is invoked:

  1. Spring intercepts the call
  2. It checks the cache via DefaultRedisCacheWriter.execute()
  3. This calls LettuceStringCommands.set() or LettuceStringCommands.get()
  4. Which calls LettuceConnection.await(), a synchronous, blocking call

// Inside Spring's DefaultRedisCacheWriter — this is BLOCKING
private <T> T execute(String name, Function<RedisConnection, T> callback) {
    try (RedisConnection connection = connectionFactory.getConnection()) {
        return callback.apply(connection);  // blocks until Redis responds
    }
}

Our service layer has many @Cacheable methods:

@Cacheable(value = "players", key = "#id")
public Mono<Player> findById(Long id) {
    return playerRepository.findById(id);
}

@Cacheable(value = "tokens", key = "#code")
public Mono<Token> findByCode(String code) {
    return tokenRepository.findByCode(code);
}

These are called within the action Mono passed to executeWithLock(). With the old code, publishOn(boundedElastic) ensured these blocking cache operations ran on the elastic thread pool. After the removal, they ran directly on the Netty event loop thread.

Why It Cascades

Netty uses a small, fixed number of event loop threads (typically equal to the number of CPU cores). When one event loop thread blocks on a synchronous Redis call:

Thread: reactor-http-nio-1
  └─ Blocked on: LettuceConnection.await()  (waiting for Redis response)
     └─ But Redis response arrives as an I/O event
        └─ Which needs THIS SAME event loop thread to process!

This creates a deadlock-like situation:

  1. Event loop thread A blocks waiting for Redis response
  2. The Redis response arrives as a Netty I/O event
  3. The I/O event needs an available event loop thread to be processed
  4. But thread A is blocked, and other threads may also be blocked by their own @Cacheable calls
  5. Eventually all event loop threads are blocked
  6. No Redis responses can be processed → all commands time out
  7. The timeout releases the blocked threads, but incoming requests immediately block them again

Under low load, you might never notice — the blocking call completes quickly and frees the thread. Under moderate-to-high load, the probability of all event loop threads being simultaneously blocked increases dramatically.
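The feedback loop above can be reproduced without Netty or Spring at all. In this plain-JDK sketch, a single-threaded executor stands in for the event loop, and the "Redis reply" is a task that can only run on that same thread — so blocking for it, the way LettuceConnection.await() does, guarantees a timeout:

```java
import java.util.concurrent.*;

public class EventLoopStarvation {

    // Simulates blocking on the event loop: the reply can only be produced
    // by the same single thread that is blocked waiting for it.
    static String callRedisOnEventLoop() throws Exception {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            Future<String> outcome = eventLoop.submit(() -> {
                // The "Redis response" arrives as another task on the SAME loop...
                Future<String> redisReply = eventLoop.submit(() -> "PONG");
                try {
                    // ...so blocking here waits on a task that can never start.
                    return redisReply.get(500, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    return "TIMEOUT"; // the analogue of RedisCommandTimeoutException
                }
            });
            return outcome.get();
        } finally {
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callRedisOnEventLoop()); // prints TIMEOUT
    }
}
```

The 500 ms timeout plays the role of Lettuce's 2-second command timeout; the structure of the stall is identical.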

Why BlockHound Didn’t Catch It

We had BlockHound installed in our test suite to detect blocking calls on non-blocking threads. So why didn’t it flag this?

Because we had explicitly allowlisted it:

public class BlockHoundTestIntegration implements BlockHoundIntegration {
  @Override
  public void applyTo(BlockHound.Builder builder) {
    // Spring's @Cacheable uses DefaultRedisCacheWriter.execute() →
    // LettuceStringCommands.set() → LettuceConnection.await() (synchronous/blocking).
    // This is a known issue...
    builder.allowBlockingCallsInside(
        "org.springframework.data.redis.cache.DefaultRedisCacheWriter", "execute");
  }
}

When we first integrated BlockHound, this blocking call was detected immediately. At the time, publishOn(boundedElastic) was in place, so the blocking happened on elastic threads (which is fine). We added the allowlist entry because BlockHound doesn’t distinguish which thread the blocking occurs on — it flags the call site regardless. The comment even noted it was a “known issue” that needed a proper fix.

The allowlist stayed. The publishOn was removed. And BlockHound stayed silent.

Lesson: BlockHound allowlists are a liability. Every allowlist entry is an assumption that “this blocking call is happening in a safe context.” When the surrounding code changes, that assumption can silently become invalid.

The Fix

Short-Term: Restore publishOn

The immediate fix is straightforward — add back publishOn(Schedulers.boundedElastic()) after lock acquisition:

return tryLock(lockKey, lockValue, timeout, maxRetries, retryDelay)
    .publishOn(Schedulers.boundedElastic())
    .flatMap(acquired -> action
        .flatMap(result -> releaseLock(lockKey, lockValue).thenReturn(result))
        .onErrorResume(err -> releaseLock(lockKey, lockValue).then(Mono.error(err))));

This ensures any downstream @Cacheable blocking calls happen on the elastic thread pool, not the Netty event loop.
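The same JDK-level sketch used above shows why this works: hand the blocking wait to a separate worker pool (the analogue of Schedulers.boundedElastic()), and the event loop thread stays free to deliver the reply. Pool names here are illustrative:

```java
import java.util.concurrent.*;

public class OffloadBlockingCall {

    // Same setup as the starvation sketch, but the blocking wait runs on a
    // worker pool, the way publishOn(boundedElastic) moves downstream work.
    static String callRedisOffLoop() throws Exception {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        ExecutorService boundedElastic = Executors.newCachedThreadPool();
        try {
            Future<String> outcome = boundedElastic.submit(() -> {
                // The "Redis reply" still runs on the event loop thread,
                // which is now free to execute it.
                Future<String> redisReply = eventLoop.submit(() -> "PONG");
                return redisReply.get(500, TimeUnit.MILLISECONDS);
            });
            return outcome.get();
        } finally {
            eventLoop.shutdownNow();
            boundedElastic.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callRedisOffLoop()); // prints PONG
    }
}
```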

Long-Term: Replace @Cacheable with Reactive Caching

The proper fix is to stop using @Cacheable with RedisCacheManager for reactive methods entirely.

Spring Framework 6.1+ added native reactive support for @Cacheable, but it requires the cache provider to support async mode. As of Spring 6.2, only CaffeineCacheManager supports setAsyncCacheMode(true). RedisCacheManager does not.
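For reference, the Caffeine opt-in looks roughly like this — a sketch assuming Spring Framework 6.1+ and Caffeine on the classpath; the cache names and TTL are illustrative, not taken from our service:

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableCaching
public class LocalCacheConfig {

    @Bean
    public CacheManager cacheManager() {
        CaffeineCacheManager manager = new CaffeineCacheManager("players", "tokens");
        // Without this flag, @Cacheable on Mono-returning methods still
        // goes through the synchronous Cache API.
        manager.setAsyncCacheMode(true);
        manager.setCaffeine(Caffeine.newBuilder().expireAfterWrite(Duration.ofMinutes(5)));
        return manager;
    }
}
```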

Options:

  1. Manual ReactiveRedisTemplate caching — Replace @Cacheable with explicit cache-aside logic using reactive Redis operations
  2. Two-level cache — Use CaffeineCacheManager with setAsyncCacheMode(true) as L1 (local, truly non-blocking), backed by reactive Redis as L2
  3. Wait for Spring Data Redis to implement async cache mode support (no timeline as of early 2026)
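Option 1 is just the cache-aside pattern written by hand. The shape can be sketched framework-free, with CompletableFuture standing in for Mono and a ConcurrentHashMap standing in for Redis; with ReactiveRedisTemplate the structure is the same (get, switchIfEmpty on the loader, write back with a TTL), only the operators differ. Everything below is illustrative, not our production code:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class CacheAside {

    // Stand-in for Redis; a real implementation would use
    // ReactiveRedisTemplate and set a TTL on the write-back.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Cache-aside read: return the cached value, else invoke the loader
    // and populate the cache. Nothing here blocks a caller thread.
    CompletableFuture<String> findCached(String key,
                                         Supplier<CompletableFuture<String>> loader) {
        String hit = cache.get(key);
        if (hit != null) {
            return CompletableFuture.completedFuture(hit);
        }
        return loader.get().thenApply(value -> {
            cache.put(key, value); // cache-aside write-back
            return value;
        });
    }

    public static void main(String[] args) {
        CacheAside players = new CacheAside();
        Supplier<CompletableFuture<String>> dbLoad =
                () -> CompletableFuture.supplyAsync(() -> "player-42");
        System.out.println(players.findCached("42", dbLoad).join()); // loads from "DB"
        System.out.println(players.findCached("42", dbLoad).join()); // served from cache
    }
}
```

Unlike @Cacheable, the loader is only subscribed to on a miss, and the cache read itself never parks a thread — which is the whole point.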

Key Takeaways

  1. @Cacheable + RedisCacheManager is blocking, even when your method returns Mono<T>. Spring Framework 6.1+ added reactive @Cacheable support, but it requires the cache provider to opt in via setAsyncCacheMode(true), which RedisCacheManager does not support.

  2. publishOn(Schedulers.boundedElastic()) is not just a performance hint — in reactive applications with hidden blocking calls, it’s a safety net that prevents Netty event loop starvation.

  3. BlockHound allowlists are assumptions, not permanent solutions. Document why each allowlist entry is safe, and re-evaluate them when the surrounding threading model changes. Consider adding comments that reference the specific publishOn or subscribeOn that makes the blocking call safe.

  4. Blocking on Netty event loops doesn’t just slow down one request — it creates cascading failures because Redis I/O events need those same threads to be processed, creating a deadlock-like feedback loop.

  5. “Reactive” doesn’t mean “non-blocking everywhere.” Framework abstractions like @Cacheable can hide synchronous I/O behind reactive-looking APIs. Always verify the actual execution path, especially for Spring’s cache and transaction infrastructure.