Optional cross-worker lock for shared adaptor repo by stuartc · Pull Request #1416 · OpenFn/kit

stuartc · 2026-05-19T13:38:26Z

Short Description

Adds an optional filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same npm install.

Fixes #1414

Implementation Details

When the repo directory is shared between worker pods, two workers can otherwise hit handleInstall for the same adaptor at the same time and corrupt each other's node_modules. This PR adds a new opt-in flag — --repo-lock / WORKER_REPO_LOCK — that wraps the engine's autoinstall handlers with cross-process coordination.

How it works:

A per-adaptor lockfile under <repoDir>/.locks/<alias>.lock is acquired via proper-lockfile before any install runs. Different adaptors don't block each other.
After a successful install, a sentinel file is written under <repoDir>/.sentinels/<alias>.done. handleIsInstalled requires BOTH the sentinel AND node_modules/<alias>/package.json to be present, so a half-finished install left by a crashed worker is correctly re-attempted.
The cache-hit path (handleIsInstalled returns true) stays lock-free — no syscall fan-out for adaptors that are already installed.
If installFn throws, the sentinel isn't written, and the next worker re-runs the install.
The retry ceiling (6 min) sits above the stale-lock window (5 min), so when a pod dies mid-install its lock expires before waiting workers give up. Cold-start of N pods against an empty repo is recoverable, not fatal.
Off by default. --repo-lock requires --repo-dir; if it's set without one we warn and continue without the lock.

New dep: proper-lockfile@4.1.2 (with graceful-fs, retry, signal-exit transitives). All checked for CVEs — clean.

QA Notes

The feature is gated behind WORKER_REPO_LOCK=true. With the flag off, behaviour is identical to before this PR — please confirm that path is unchanged.

With the flag on, the interesting cases to exercise:

Two workers pointing at the same --repo-dir starting a run that needs the same adaptor at the same time — only one npm install should actually run; the other should see the sentinel and skip.
A worker killed mid-install (SIGKILL) — the next worker should recover after the stale window (5 min) rather than getting stuck forever.
Different adaptors should install in parallel, not serialise.
Without --repo-lock, multi-worker behaviour should be unchanged.

There's a multi-process test suite using child_process.fork that exercises all of the above deterministically — pnpm test test/util/repo-lock-multiprocess.test.ts from packages/ws-worker.

K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.

AI Usage

I have used Claude Code
I have used another model
I have not used AI

You can read more details in our
Responsible AI Policy

Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs across multiple workers sharing one repo directory (e.g. an NFS mount or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache, so the cache-hit path stays lock-free. Off by default; requires --repo-dir. Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead holder's stale window expires before waiters give up, making cold- start of N pods against an empty repo recoverable rather than fatal.

Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST), parallelise fileExists pairs in handleIsInstalled and the post-lock double-check, collapse trivial if/return, drop unused default export, and fix the worker test harness' mode list comment plus an any-typed logger.

josephjclark · 2026-05-19T13:54:46Z

+        'Repo lock enabled: coordinating adaptor installs via filesystem lock at',
+        args.repoDir
+      );
+      engineOptions.autoinstall = createLockedHandlers(args.repoDir);


I'm surprised to see all this stuff in the worker - isn't it more a concern for the engine, which actually handles the auto install? Or even the runtime, which provides the actuall npm install hook.

I've also just checked the runtime install code and it's doing quite a lot of stuff which is going to be duplicated or removed by the implementation here

Oh interesting, I actually intentionally stayed away from engine - after some consternation; thought that the idea of parallel installs happen only in the worker and thought to keep it out . Happy to redo it, or move the repoLock stuff into engine? Thoughts on where you see it fitting? (I totally get the knowledge leak in the lock wrapper, since it needs to know a fair bit about the package and path.

josephjclark · 2026-05-19T13:56:43Z

+  logger: Logger
+) => Promise<unknown>;
+
+const defaultInstall: InstallFn = (specifier, repoDir, logger) =>


Ah so THIS guy calls out to the runtime's install

So this file isn't really a repo lock at all - it's an "install adaptor WITH repo lock"

I still think this logic should go down into the runtime, but I also think the lock logic and the actual install should be better separated. Very misleading to have repo-lock.ts actually take responsibility for the install!

You reckon I move the lock wrapper to runtime or engine? My gut says runtime, but now it feels like installing things should be in runtime? 😅

stuartc added 2 commits May 19, 2026 15:20

taylordowns2000 added this to Core May 19, 2026

github-project-automation Bot moved this to New Issues in Core May 19, 2026

josephjclark reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optional cross-worker lock for shared adaptor repo#1416

Optional cross-worker lock for shared adaptor repo#1416
stuartc wants to merge 2 commits into
mainfrom
1414-distributed-adaptor-install-lock

stuartc commented May 19, 2026

Uh oh!

josephjclark May 19, 2026

Uh oh!

stuartc May 19, 2026

Uh oh!

josephjclark May 19, 2026

Uh oh!

stuartc May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

stuartc commented May 19, 2026

Short Description

Implementation Details

QA Notes

AI Usage

Uh oh!

josephjclark May 19, 2026

Choose a reason for hiding this comment

Uh oh!

stuartc May 19, 2026

Choose a reason for hiding this comment

Uh oh!

josephjclark May 19, 2026

Choose a reason for hiding this comment

Uh oh!

stuartc May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants