@dimitri-yatsenko
Member

Summary

Adds decision guidance for choosing among the three populate() modes, closing a documentation gap for users deploying distributed pipelines.

Problem

From the cohesion review (COHESION-REVIEW.md Issue #9, Medium Priority):

  • run-computations.md doesn't explain when reserve_jobs=True is necessary
  • distributed-computing.md jumps to advanced patterns without decision criteria
  • Users ask: "Should I use distributed mode?" or "When do I need reserve_jobs?"
  • No guidance on the performance trade-offs of job reservation

Solution

New "When to Use Distributed Mode" section in how-to/run-computations.md

Content Structure

Three Populate Modes with Clear Criteria:

1. populate() (Default - Simple Mode)

Use when:
✅ Single worker
✅ Fast computations (< 1 minute each)
✅ Small job count (< 100 entries)
✅ Development/testing

Advantages:

  • Simplest approach
  • No job management overhead
  • Immediate execution
  • Easy debugging

2. populate(reserve_jobs=True) (Distributed Mode)

Use when:
✅ Multiple workers (different machines/processes)
✅ Long computations (> 1 minute each)
✅ Production pipelines
✅ Worker crashes expected

Advantages:

  • Prevents duplicate work
  • Fault tolerance
  • Job status tracking
  • Error isolation

Performance note:

  • Overhead: ~100ms per job
  • Worth it when: computations > 10 seconds
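
To put the two figures above in proportion, here is a back-of-the-envelope sketch. It assumes the ~100 ms reservation cost is a fixed per-job cost, which this description does not state explicitly. The function name `reservation_overhead_fraction` is hypothetical and is not part of DataJoint.

```python
RESERVATION_OVERHEAD_S = 0.1  # ~100 ms per job, the figure quoted in the note above


def reservation_overhead_fraction(compute_seconds: float,
                                  overhead_seconds: float = RESERVATION_OVERHEAD_S) -> float:
    """Return reservation overhead as a fraction of one job's compute time."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return overhead_seconds / compute_seconds


# Compare a very fast job, a job at the 10-second threshold, and a 1-minute job.
for seconds in (0.5, 10, 60):
    share = reservation_overhead_fraction(seconds)
    print(f"{seconds:>5} s job: overhead is {share:.1%} of compute time")
```

Under that assumption, the overhead is about 1% of compute time at the 10-second threshold and about 20% for a half-second job, which is why reservation tends to pay off only for longer computations.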

3. populate(reserve_jobs=True, processes=N) (Parallel Mode)

Use when:
✅ Multi-core machine
✅ CPU-bound tasks
✅ Independent computations

Advantages:

  • Parallel execution on single machine
  • No network coordination needed
  • Combines safety with parallelism

Caution: Don't set processes higher than the machine's CPU core count
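
One way to honor that caution is to clamp the requested process count before passing it to populate(). The sketch below uses only the standard library. The helper `capped_process_count` and the table name `MyTable` are hypothetical, and the populate() call is shown as a comment because it needs a configured database connection.

```python
import os


def capped_process_count(requested: int) -> int:
    """Clamp a requested worker-process count to the range 1..CPU count."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


n_processes = capped_process_count(8)
# Hypothetical usage with a table class named MyTable:
# MyTable.populate(reserve_jobs=True, processes=n_processes)
```

Note that `os.cpu_count()` reports logical cores, so on machines shared with other workloads a smaller value may still be the better choice.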

Decision Tree

How many workers?
├─ One → populate()
└─ Multiple → Continue...

How long per computation?
├─ < 1 minute → populate() (overhead not worth it)
└─ > 1 minute → Continue...

Need fault tolerance?
├─ Yes → populate(reserve_jobs=True)
└─ No → populate() (simpler)

Multiple cores on one machine?
└─ Yes → populate(reserve_jobs=True, processes=N)
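
The tree above can also be read as a small function. This is one possible reading, not code from the documentation: the function name and parameter names are hypothetical, a job of exactly 60 seconds is treated as "long" (the tree leaves that boundary open), and the final multi-core question is applied only once job reservation has already been chosen.

```python
def choose_populate_call(workers: int,
                         seconds_per_job: float,
                         need_fault_tolerance: bool,
                         multicore_single_machine: bool = False) -> str:
    """Return the populate() call suggested by one reading of the decision tree."""
    if workers <= 1:
        return "populate()"  # one worker: simple mode
    if seconds_per_job < 60:
        return "populate()"  # short jobs: overhead not worth it
    if not need_fault_tolerance:
        return "populate()"  # simpler when crashes are not a concern
    if multicore_single_machine:
        return "populate(reserve_jobs=True, processes=N)"
    return "populate(reserve_jobs=True)"


# Example: four workers on separate machines, 5-minute jobs, crashes expected.
print(choose_populate_call(workers=4, seconds_per_job=300, need_fault_tolerance=True))
```

Writing the tree this way makes the order of the questions explicit: worker count and job duration are checked before fault tolerance, so a short job falls back to plain populate() regardless of the other answers.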

User Impact

Before (Confusion)

  • "Do I need reserve_jobs=True?"
  • "Why is my single-worker pipeline using job reservation?"
  • "Should I use distributed mode for 100 quick jobs?"
  • Reading distributed-computing.md to understand basics

After (Clarity)

  • Clear decision criteria for each mode
  • Understand performance trade-offs (100ms overhead vs fault tolerance)
  • Choose optimal mode for workload
  • Know when distributed mode helps vs hurts

Performance Guidance

Key insights added:

  • Job reservation overhead: ~100ms per job
  • Worth it threshold: computations > 10 seconds
  • Multi-core: Don't exceed CPU count
  • Fast jobs (< 1 min): Default mode often better

Placement

Inserted before "Distributed Computing" section in run-computations.md:

  1. Basic Usage (existing)
  2. Restrict What to Compute (existing)
  3. Error Handling (existing)
  4. When to Use Distributed Mode ← NEW (decision guide)
  5. Distributed Computing (existing, now with context)
  6. make() Method (existing)

Rationale: Users need decision guidance before learning distributed computing patterns.

Related

Completes Medium-Priority Cohesion Review

This PR completes all medium-priority issues from the cohesion review:

- Fixed llms.txt manual reference from migrate-from-0x to migrate-to-v20
- Regenerated llms-full.txt to pick up all corrected migration guide links from PR #107
- Verified no remaining broken internal links in LLM documentation files

Addresses the gap identified in the cohesion review (COHESION-REVIEW.md Issue #9):
- Users unclear when to use reserve_jobs=True
- No decision guide: "Should I use distributed mode?"
- run-computations.md doesn't explain job reservation overhead

New "When to Use Distributed Mode" section provides:

1. Three clear use cases with criteria:
   - populate() (default) - single worker, fast, simple
   - populate(reserve_jobs=True) - multiple workers, long computations, fault tolerance
   - populate(reserve_jobs=True, processes=N) - multi-core, CPU-bound, parallel

2. Each use case includes:
   - ✅ When to use (4 clear criteria)
   - Advantages (why choose this mode)
   - Code example
   - Performance notes (overhead, when worth it)

3. Decision tree flowchart:
   - How many workers? → One vs Multiple
   - How long per computation? → < 1 min vs > 1 min
   - Need fault tolerance? → Yes vs No
   - Multiple cores? → Use processes=N

4. Key insights:
   - Job reservation overhead: ~100ms per job
   - Worth it when: computations > 10 seconds
   - Caution: Don't exceed CPU core count

Impact:
- Users can quickly decide which populate mode to use
- Clear performance trade-offs documented
- Prevents common mistakes (such as using distributed mode for fast jobs)
- Reduces confusion about reserve_jobs parameter

Fixes: COHESION-REVIEW.md Issue #9 (Medium Priority)
@dimitri-yatsenko
Member Author

Consolidated into #119 - Documentation Cohesion Review: Comprehensive Improvements for DataJoint 2.0

@dimitri-yatsenko deleted the docs/jobs-decision-guidance branch on January 14, 2026 23:14