Skip to content

Optional cross-worker lock for shared adaptor repo#1416

Draft
stuartc wants to merge 2 commits into
mainfrom
1414-distributed-adaptor-install-lock
Draft

Optional cross-worker lock for shared adaptor repo#1416
stuartc wants to merge 2 commits into
mainfrom
1414-distributed-adaptor-install-lock

Conversation

@stuartc
Copy link
Copy Markdown
Member

@stuartc stuartc commented May 19, 2026

Short Description

Adds an optional filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same npm install.

Fixes #1414

Implementation Details

When the repo directory is shared between worker pods, two workers can otherwise hit handleInstall for the same adaptor at the same time and corrupt each other's node_modules. This PR adds a new opt-in flag — --repo-lock / WORKER_REPO_LOCK — that wraps the engine's autoinstall handlers with cross-process coordination.

How it works:

  • A per-adaptor lockfile under <repoDir>/.locks/<alias>.lock is acquired via proper-lockfile before any install runs. Different adaptors don't block each other.
  • After a successful install, a sentinel file is written under <repoDir>/.sentinels/<alias>.done. handleIsInstalled requires BOTH the sentinel AND node_modules/<alias>/package.json to be present, so a half-finished install left by a crashed worker is correctly re-attempted.
  • The cache-hit path (handleIsInstalled returns true) stays lock-free — no syscall fan-out for adaptors that are already installed.
  • If installFn throws, the sentinel isn't written, and the next worker re-runs the install.
  • The retry ceiling (6 min) sits above the stale-lock window (5 min), so when a pod dies mid-install its lock expires before waiting workers give up. Cold-start of N pods against an empty repo is recoverable, not fatal.
  • Off by default. --repo-lock requires --repo-dir; if it's set without one we warn and continue without the lock.

New dep: proper-lockfile@4.1.2 (with graceful-fs, retry, signal-exit transitives). All checked for CVEs — clean.

QA Notes

The feature is gated behind WORKER_REPO_LOCK=true. With the flag off, behaviour is identical to before this PR — please confirm that path is unchanged.

With the flag on, the interesting cases to exercise:

  • Two workers pointing at the same --repo-dir starting a run that needs the same adaptor at the same time — only one npm install should actually run; the other should see the sentinel and skip.
  • A worker killed mid-install (SIGKILL) — the next worker should recover after the stale window (5 min) rather than getting stuck forever.
  • Different adaptors should install in parallel, not serialise.
  • Without --repo-lock, multi-worker behaviour should be unchanged.

There's a multi-process test suite using child_process.fork that exercises all of the above deterministically — pnpm test test/util/repo-lock-multiprocess.test.ts from packages/ws-worker.

K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our
Responsible AI Policy

stuartc added 2 commits May 19, 2026 15:20
Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs
across multiple workers sharing one repo directory (e.g. an NFS mount
or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache,
so the cache-hit path stays lock-free. Off by default; requires
--repo-dir.

Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead
holder's stale window expires before waiters give up, making cold-
start of N pods against an empty repo recoverable rather than fatal.
Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST),
parallelise fileExists pairs in handleIsInstalled and the post-lock
double-check, collapse trivial if/return, drop unused default export,
and fix the worker test harness' mode list comment plus an any-typed
logger.
@github-project-automation github-project-automation Bot moved this to New Issues in Core May 19, 2026
'Repo lock enabled: coordinating adaptor installs via filesystem lock at',
args.repoDir
);
engineOptions.autoinstall = createLockedHandlers(args.repoDir);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised to see all this stuff in the worker - isn't it more a concern for the engine, which actually handles the auto install? Or even the runtime, which provides the actuall npm install hook.

I've also just checked the runtime install code and it's doing quite a lot of stuff which is going to be duplicated or removed by the implementation here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting, I actually intentionally stayed away from engine - after some consternation; thought that the idea of parallel installs happen only in the worker and thought to keep it out . Happy to redo it, or move the repoLock stuff into engine? Thoughts on where you see it fitting? (I totally get the knowledge leak in the lock wrapper, since it needs to know a fair bit about the package and path.

logger: Logger
) => Promise<unknown>;

const defaultInstall: InstallFn = (specifier, repoDir, logger) =>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so THIS guy calls out to the runtime's install

So this file isn't really a repo lock at all - it's an "install adaptor WITH repo lock"

I still think this logic should go down into the runtime, but I also think the lock logic and the actual install should be better separated. Very misleading to have repo-lock.ts actually take responsibility for the install!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You reckon I move the lock wrapper to runtime or engine? My gut says runtime, but now it feels like installing things should be in runtime? 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

Worker: allow multiple workers to safely share a repo directory

3 participants