Implement `TryLock` for use with optional tasks like ARI updates to reduce lock contention #357

polyscone · 2025-10-22T03:37:10Z

This is a small change to the (*Config).updateARI function to reduce lock contention and help prevent goroutines wasting time/resources waiting on locks that they could just skip.

I wrote about the problems on the Caddy community forums here: https://caddy.community/t/many-unable-to-obtain-ari-lock-errors-and-connection-timeouts/33035

We have a simple Caddy server setup which serves roughly 150 domains that all use on-demand TLS. It's running on a Windows Server 2022 machine (file system is NTFS) and the majority of work it's doing is reverse proxying to vairous virtual machines.

Certmagic uses the file system to aquire locks for inter-process coordination and in some cases this would lead to a build-up of many thousands of goroutines just blocking trying to aquire the lock for ARI updates.
The build-up of goroutines blocking and constantly trying to aquire the lock meant that as lock contention rose more and more of them never actually succeeded to take the lock, which then snowballed into more and more build-up and lock contention. Eventually they would all fail at their 8 minute context deadlines.

In the worst cases, as far as I could see from the logs, we would see thousands of goroutines all attempting to aquire a single lock for a single domain. When this happened for many domains all at the same time we would see connection timeouts when trying to make requests to the server.

To solve the problem I've added a TryLock function to the Storage interface.
The (*FileStorage).TryLock implementation uses the same code as the (*FileStorage).Lock function but it's been cut down and simplified a little to just move on when it can't aquire the lock.
If it can't immediately aquire the lock it still does the same stale lock file check that the (*FileStorage).Lock does before concluding that it can't take the lock at all.

Since making this change and running it on our server for the past 10 days we've seen lock contention in the absolute worst cases reduce to roughly 10% from what was 99.99%+ in the worst cases before the change, and since it's a TryLock it means that in the case where the lock can't be taken we don't waste resources waiting for it. We've also seen zero downtime related to connection timeouts, where before we saw downtime once or twice every day or two, even if it was short-lived.

I know that TryLock usage is typically discouraged as a solution in most cases, and Go's own sync.Mutex.TryLock documentation heavily discourages its usage, so I understand if there is some hesitance to accept this as a solution. I think it fits well here, though, given that ARI updates are optional, and all of the updateARI callsites allow things to continue as normal in the case of error, only emitting a log in those cases anyway. Since that's the case I don't see any reason why other goroutines should be forced to block on ARI updates when they could just move on and get some useful work done instead.

In this initial implementation I just added TryLock to the existing Locker interface. If this is too big of a change then we could change it to do a type assertion for a narrower TryLocker interface and only use the new tryAquireLock function if it returns true, in the same way Go's own http package does type assertions for things like http.Pusher.
I also emit an info log in the case where a lock was already taken, but this could be changed to debug or something else if needed.

This seems to have solved all of the connection timeout problems and lock contention problems we were seeing, but we have a relatively simple usecase for Caddy, so let me know what you think.

francislavoie · 2025-10-22T03:53:23Z

We cannot change the Locker interface at this point, it would break every storage plugin in the wild (lots of them).

An new interface could be introduced though (something like TryLocker idk), and assert that the storage implementation has that interface, and if so use it, otherwise retain the existing behaviour.

I'm not making a judgement on whether this makes sense to do in general, just flagging it early that this would be a massively breaking change as-is.

mholt

Dangit @francislavoie -- beat me to it AGAIN while I was writing a reply 😆

Yeah, we can't change that interface, but we can make it an optional interface.

The one thing I'm wary of is the fact that TryLockers have no guarantee of succeeding, and:

ARI updates are optional

is a debatable assumption. 😅

It is true that ARI is entirely optional, but failing to abide its recommendations could, in theory result in downtime (which, makes it not optional?) -- I dunno, I'm still divided on ARI tbh.

Debate aside, the patch looks excellent for the most part. Left a few comments.

mholt · 2025-10-22T04:14:06Z

maintain.go

 		return cert, false, fmt.Errorf("unable to obtain ARI lock: %v", err)
 	}
+	if !ok {
+		logger.Info("attempted to obtain ARI lock but it was already taken")


This could be a Debug() probably

mholt · 2025-10-22T04:14:25Z

filestorage.go

 	}
 }

+func (s *FileStorage) TryLock(ctx context.Context, name string) (bool, error) {


Think we can reduce duplication in here?

storage.go

…educe lock contention

polyscone · 2025-10-22T05:15:32Z

@francislavoie @mholt Thank you both for the quick responses.

Yes, I was worried that changing the Locker interface would be too much.
I've updated the code to define a new TryLocker interface which is then tested through type assertion in the updateARI function implementation.

In addition to the type assertion in updateARI there's now a type assertion in tryAcquireLock too, which will just return an error if you attempt to call it on Storage that doesn't implement TryLocker.

I've tried to remove duplication between the Lock and TryLock implementations here by moving the logic out into a separate private function which takes a number of attempts that it can check in the loop; if it's negative then it should preserve the existing Lock behaviour.

Finally I've updated the log to use Debug instead.

And yes, perhaps ARI being "optional" was a poor choice of words on my part. I wanted to get at the fact that all callsites for updateARI just let the code continue no matter what happens, so maybe "retryable" is better (?).

Anyway, let me know what you think; I'm happy to implement any other changes if needed.

mholt

Looks good now. Thank you. I still question the correctness of optional locking but we can see how it fares!

maintain.go

mholt requested changes Oct 22, 2025

View reviewed changes

Implement TryLock for use with optional tasks like ARI updates to r…

c51b3b6

…educe lock contention

polyscone force-pushed the ari-try-lock branch from 128b123 to c51b3b6 Compare October 22, 2025 05:10

polyscone requested a review from mholt October 22, 2025 09:52

mholt previously approved these changes Oct 23, 2025

View reviewed changes

maintain.go Outdated Show resolved Hide resolved

Update maintain.go

cd1f51f

mholt dismissed their stale review via cd1f51f October 23, 2025 19:24

mholt approved these changes Oct 23, 2025

View reviewed changes

mholt merged commit d66689d into caddyserver:master Oct 23, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Implement `TryLock` for use with optional tasks like ARI updates to reduce lock contention #357

Implement `TryLock` for use with optional tasks like ARI updates to reduce lock contention #357

Uh oh!

polyscone commented Oct 22, 2025

Uh oh!

francislavoie commented Oct 22, 2025 •

edited

Loading

Uh oh!

mholt left a comment

Uh oh!

mholt Oct 22, 2025

Uh oh!

mholt Oct 22, 2025

Uh oh!

Uh oh!

polyscone commented Oct 22, 2025

Uh oh!

mholt left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Implement TryLock for use with optional tasks like ARI updates to reduce lock contention #357

Implement TryLock for use with optional tasks like ARI updates to reduce lock contention #357

Uh oh!

Conversation

polyscone commented Oct 22, 2025

Uh oh!

francislavoie commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mholt left a comment

Choose a reason for hiding this comment

Uh oh!

mholt Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

mholt Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

polyscone commented Oct 22, 2025

Uh oh!

mholt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Implement `TryLock` for use with optional tasks like ARI updates to reduce lock contention #357

Implement `TryLock` for use with optional tasks like ARI updates to reduce lock contention #357

francislavoie commented Oct 22, 2025 •

edited

Loading