-
-
Notifications
You must be signed in to change notification settings - Fork 311
Implement TryLock for use with optional tasks like ARI updates to reduce lock contention
#357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
We cannot change the An new interface could be introduced though (something like I'm not making a judgement on whether this makes sense to do in general, just flagging it early that this would be a massively breaking change as-is. |
mholt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dangit @francislavoie -- beat me to it AGAIN while I was writing a reply 😆
Yeah, we can't change that interface, but we can make it an optional interface.
The one thing I'm wary of is the fact that TryLockers have no guarantee of succeeding, and:
ARI updates are optional
is a debatable assumption. 😅
It is true that ARI is entirely optional, but failing to abide its recommendations could, in theory result in downtime (which, makes it not optional?) -- I dunno, I'm still divided on ARI tbh.
Debate aside, the patch looks excellent for the most part. Left a few comments.
maintain.go
Outdated
| return cert, false, fmt.Errorf("unable to obtain ARI lock: %v", err) | ||
| } | ||
| if !ok { | ||
| logger.Info("attempted to obtain ARI lock but it was already taken") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a Debug() probably
filestorage.go
Outdated
| } | ||
| } | ||
|
|
||
| func (s *FileStorage) TryLock(ctx context.Context, name string) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we can reduce duplication in here?
…educe lock contention
128b123 to
c51b3b6
Compare
|
@francislavoie @mholt Thank you both for the quick responses. Yes, I was worried that changing the In addition to the type assertion in I've tried to remove duplication between the Finally I've updated the log to use And yes, perhaps ARI being "optional" was a poor choice of words on my part. I wanted to get at the fact that all callsites for Anyway, let me know what you think; I'm happy to implement any other changes if needed. |
mholt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now. Thank you. I still question the correctness of optional locking but we can see how it fares!
This is a small change to the
(*Config).updateARIfunction to reduce lock contention and help prevent goroutines wasting time/resources waiting on locks that they could just skip.I wrote about the problems on the Caddy community forums here: https://caddy.community/t/many-unable-to-obtain-ari-lock-errors-and-connection-timeouts/33035
We have a simple Caddy server setup which serves roughly 150 domains that all use on-demand TLS. It's running on a Windows Server 2022 machine (file system is NTFS) and the majority of work it's doing is reverse proxying to vairous virtual machines.
Certmagic uses the file system to aquire locks for inter-process coordination and in some cases this would lead to a build-up of many thousands of goroutines just blocking trying to aquire the lock for ARI updates.
The build-up of goroutines blocking and constantly trying to aquire the lock meant that as lock contention rose more and more of them never actually succeeded to take the lock, which then snowballed into more and more build-up and lock contention. Eventually they would all fail at their 8 minute context deadlines.
In the worst cases, as far as I could see from the logs, we would see thousands of goroutines all attempting to aquire a single lock for a single domain. When this happened for many domains all at the same time we would see connection timeouts when trying to make requests to the server.
To solve the problem I've added a
TryLockfunction to theStorageinterface.The
(*FileStorage).TryLockimplementation uses the same code as the(*FileStorage).Lockfunction but it's been cut down and simplified a little to just move on when it can't aquire the lock.If it can't immediately aquire the lock it still does the same stale lock file check that the
(*FileStorage).Lockdoes before concluding that it can't take the lock at all.Since making this change and running it on our server for the past 10 days we've seen lock contention in the absolute worst cases reduce to roughly 10% from what was 99.99%+ in the worst cases before the change, and since it's a
TryLockit means that in the case where the lock can't be taken we don't waste resources waiting for it. We've also seen zero downtime related to connection timeouts, where before we saw downtime once or twice every day or two, even if it was short-lived.I know that
TryLockusage is typically discouraged as a solution in most cases, and Go's ownsync.Mutex.TryLockdocumentation heavily discourages its usage, so I understand if there is some hesitance to accept this as a solution. I think it fits well here, though, given that ARI updates are optional, and all of theupdateARIcallsites allow things to continue as normal in the case of error, only emitting a log in those cases anyway. Since that's the case I don't see any reason why other goroutines should be forced to block on ARI updates when they could just move on and get some useful work done instead.In this initial implementation I just added
TryLockto the existingLockerinterface. If this is too big of a change then we could change it to do a type assertion for a narrowerTryLockerinterface and only use the newtryAquireLockfunction if it returns true, in the same way Go's ownhttppackage does type assertions for things likehttp.Pusher.I also emit an info log in the case where a lock was already taken, but this could be changed to debug or something else if needed.
This seems to have solved all of the connection timeout problems and lock contention problems we were seeing, but we have a relatively simple usecase for Caddy, so let me know what you think.