Skip to content

Conversation

@polyscone
Copy link
Contributor

This is a small change to the (*Config).updateARI function to reduce lock contention and help prevent goroutines wasting time/resources waiting on locks that they could just skip.

I wrote about the problems on the Caddy community forums here: https://caddy.community/t/many-unable-to-obtain-ari-lock-errors-and-connection-timeouts/33035

We have a simple Caddy server setup which serves roughly 150 domains that all use on-demand TLS. It's running on a Windows Server 2022 machine (file system is NTFS) and the majority of work it's doing is reverse proxying to vairous virtual machines.

Certmagic uses the file system to aquire locks for inter-process coordination and in some cases this would lead to a build-up of many thousands of goroutines just blocking trying to aquire the lock for ARI updates.
The build-up of goroutines blocking and constantly trying to aquire the lock meant that as lock contention rose more and more of them never actually succeeded to take the lock, which then snowballed into more and more build-up and lock contention. Eventually they would all fail at their 8 minute context deadlines.

In the worst cases, as far as I could see from the logs, we would see thousands of goroutines all attempting to aquire a single lock for a single domain. When this happened for many domains all at the same time we would see connection timeouts when trying to make requests to the server.

To solve the problem I've added a TryLock function to the Storage interface.
The (*FileStorage).TryLock implementation uses the same code as the (*FileStorage).Lock function but it's been cut down and simplified a little to just move on when it can't aquire the lock.
If it can't immediately aquire the lock it still does the same stale lock file check that the (*FileStorage).Lock does before concluding that it can't take the lock at all.

Since making this change and running it on our server for the past 10 days we've seen lock contention in the absolute worst cases reduce to roughly 10% from what was 99.99%+ in the worst cases before the change, and since it's a TryLock it means that in the case where the lock can't be taken we don't waste resources waiting for it. We've also seen zero downtime related to connection timeouts, where before we saw downtime once or twice every day or two, even if it was short-lived.

I know that TryLock usage is typically discouraged as a solution in most cases, and Go's own sync.Mutex.TryLock documentation heavily discourages its usage, so I understand if there is some hesitance to accept this as a solution. I think it fits well here, though, given that ARI updates are optional, and all of the updateARI callsites allow things to continue as normal in the case of error, only emitting a log in those cases anyway. Since that's the case I don't see any reason why other goroutines should be forced to block on ARI updates when they could just move on and get some useful work done instead.

In this initial implementation I just added TryLock to the existing Locker interface. If this is too big of a change then we could change it to do a type assertion for a narrower TryLocker interface and only use the new tryAquireLock function if it returns true, in the same way Go's own http package does type assertions for things like http.Pusher.
I also emit an info log in the case where a lock was already taken, but this could be changed to debug or something else if needed.

This seems to have solved all of the connection timeout problems and lock contention problems we were seeing, but we have a relatively simple usecase for Caddy, so let me know what you think.

@francislavoie
Copy link
Member

francislavoie commented Oct 22, 2025

We cannot change the Locker interface at this point, it would break every storage plugin in the wild (lots of them).

An new interface could be introduced though (something like TryLocker idk), and assert that the storage implementation has that interface, and if so use it, otherwise retain the existing behaviour.

I'm not making a judgement on whether this makes sense to do in general, just flagging it early that this would be a massively breaking change as-is.

Copy link
Member

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dangit @francislavoie -- beat me to it AGAIN while I was writing a reply 😆

Yeah, we can't change that interface, but we can make it an optional interface.

The one thing I'm wary of is the fact that TryLockers have no guarantee of succeeding, and:

ARI updates are optional

is a debatable assumption. 😅

It is true that ARI is entirely optional, but failing to abide its recommendations could, in theory result in downtime (which, makes it not optional?) -- I dunno, I'm still divided on ARI tbh.

Debate aside, the patch looks excellent for the most part. Left a few comments.

maintain.go Outdated
return cert, false, fmt.Errorf("unable to obtain ARI lock: %v", err)
}
if !ok {
logger.Info("attempted to obtain ARI lock but it was already taken")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a Debug() probably

filestorage.go Outdated
}
}

func (s *FileStorage) TryLock(ctx context.Context, name string) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we can reduce duplication in here?

@polyscone
Copy link
Contributor Author

@francislavoie @mholt Thank you both for the quick responses.

Yes, I was worried that changing the Locker interface would be too much.
I've updated the code to define a new TryLocker interface which is then tested through type assertion in the updateARI function implementation.

In addition to the type assertion in updateARI there's now a type assertion in tryAcquireLock too, which will just return an error if you attempt to call it on Storage that doesn't implement TryLocker.

I've tried to remove duplication between the Lock and TryLock implementations here by moving the logic out into a separate private function which takes a number of attempts that it can check in the loop; if it's negative then it should preserve the existing Lock behaviour.

Finally I've updated the log to use Debug instead.

And yes, perhaps ARI being "optional" was a poor choice of words on my part. I wanted to get at the fact that all callsites for updateARI just let the code continue no matter what happens, so maybe "retryable" is better (?).

Anyway, let me know what you think; I'm happy to implement any other changes if needed.

@polyscone polyscone requested a review from mholt October 22, 2025 09:52
mholt
mholt previously approved these changes Oct 23, 2025
Copy link
Member

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now. Thank you. I still question the correctness of optional locking but we can see how it fares!

@mholt mholt merged commit d66689d into caddyserver:master Oct 23, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants