Optional cross-worker lock for shared adaptor repo#1416
Conversation
Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs across multiple workers sharing one repo directory (e.g. an NFS mount or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache, so the cache-hit path stays lock-free. Off by default; requires --repo-dir. Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead holder's stale window expires before waiters give up, making cold- start of N pods against an empty repo recoverable rather than fatal.
Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST), parallelise fileExists pairs in handleIsInstalled and the post-lock double-check, collapse trivial if/return, drop unused default export, and fix the worker test harness' mode list comment plus an any-typed logger.
| 'Repo lock enabled: coordinating adaptor installs via filesystem lock at', | ||
| args.repoDir | ||
| ); | ||
| engineOptions.autoinstall = createLockedHandlers(args.repoDir); |
There was a problem hiding this comment.
I'm surprised to see all this stuff in the worker - isn't it more a concern for the engine, which actually handles the auto install? Or even the runtime, which provides the actuall npm install hook.
I've also just checked the runtime install code and it's doing quite a lot of stuff which is going to be duplicated or removed by the implementation here
There was a problem hiding this comment.
Oh interesting, I actually intentionally stayed away from engine - after some consternation; thought that the idea of parallel installs happen only in the worker and thought to keep it out . Happy to redo it, or move the repoLock stuff into engine? Thoughts on where you see it fitting? (I totally get the knowledge leak in the lock wrapper, since it needs to know a fair bit about the package and path.
| logger: Logger | ||
| ) => Promise<unknown>; | ||
|
|
||
| const defaultInstall: InstallFn = (specifier, repoDir, logger) => |
There was a problem hiding this comment.
Ah so THIS guy calls out to the runtime's install
So this file isn't really a repo lock at all - it's an "install adaptor WITH repo lock"
I still think this logic should go down into the runtime, but I also think the lock logic and the actual install should be better separated. Very misleading to have repo-lock.ts actually take responsibility for the install!
There was a problem hiding this comment.
You reckon I move the lock wrapper to runtime or engine? My gut says runtime, but now it feels like installing things should be in runtime? 😅
Short Description
Adds an optional filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same
npm install.Fixes #1414
Implementation Details
When the repo directory is shared between worker pods, two workers can otherwise hit
handleInstallfor the same adaptor at the same time and corrupt each other'snode_modules. This PR adds a new opt-in flag —--repo-lock/WORKER_REPO_LOCK— that wraps the engine's autoinstall handlers with cross-process coordination.How it works:
<repoDir>/.locks/<alias>.lockis acquired viaproper-lockfilebefore any install runs. Different adaptors don't block each other.<repoDir>/.sentinels/<alias>.done.handleIsInstalledrequires BOTH the sentinel ANDnode_modules/<alias>/package.jsonto be present, so a half-finished install left by a crashed worker is correctly re-attempted.handleIsInstalledreturns true) stays lock-free — no syscall fan-out for adaptors that are already installed.installFnthrows, the sentinel isn't written, and the next worker re-runs the install.--repo-lockrequires--repo-dir; if it's set without one we warn and continue without the lock.New dep:
proper-lockfile@4.1.2(withgraceful-fs,retry,signal-exittransitives). All checked for CVEs — clean.QA Notes
The feature is gated behind
WORKER_REPO_LOCK=true. With the flag off, behaviour is identical to before this PR — please confirm that path is unchanged.With the flag on, the interesting cases to exercise:
--repo-dirstarting a run that needs the same adaptor at the same time — only onenpm installshould actually run; the other should see the sentinel and skip.--repo-lock, multi-worker behaviour should be unchanged.There's a multi-process test suite using
child_process.forkthat exercises all of the above deterministically —pnpm test test/util/repo-lock-multiprocess.test.tsfrompackages/ws-worker.K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.
AI Usage
You can read more details in our
Responsible AI Policy