@grondo (Contributor) commented Nov 13, 2025

This PR improves error messages from the limit-job-size and limit-duration plugins
to include the requested resource count/duration and the target queue name
when rejecting jobs.

Example improvement:

  • Before: requested nnodes exceeds policy limit of 16
  • After: requested nnodes (20) exceeds policy limit of 16 for queue debug

Fixes: #7201

@garlick (Member) left a comment

LGTM!

@grondo (Contributor, Author) commented Nov 17, 2025

After rebasing, the ENOSPC tests are failing on every builder:

Warning: Found 1 errors from 445 tests in testsuite
Error: not ok 2 - flux still operates with content-sqlite running out of space
Error: not ok 3 - flux still operates with content-files running out of space
Error: not ok 4 - content flush returns error on ENOSPC
Error: not ok 5 - kvs sync fails due to ENOSPC
Error: ERROR: t0090-content-enospc.t - exited with status 1

Looking into it.

Problem: The limit-job-size plugin has repetitive checks for over
and under policy limits, which will make updating the similar error
messages tedious.

Add a few helper functions so the similar error message is created
in a single place.

Problem: The error message from the limit-job-size plugin does not
include the requested resource count or the target queue. This could
leave users confused about the source of the policy.

Add the requested resource count as computed by the plugin, as well
as any queue name when generating error messages in the limit-job-size
plugin.
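The two limit-job-size commit messages above describe gathering the over/under checks into helpers and adding the requested count and queue name to the error. A minimal sketch of what such a helper might look like follows. `format_limit_error` and its parameters are hypothetical names chosen for illustration, not the functions added in this PR, and the "is under" wording for minimum limits is an assumption. Only the "exceeds" message format is taken from the PR description.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the PR's actual code): build a policy-limit
 * rejection message in one place.
 *   name   resource name, e.g. "nnodes"
 *   count  requested amount as computed by the caller
 *   limit  configured policy limit
 *   over   nonzero if count is above a maximum, zero if below a minimum
 *   queue  queue name, or NULL/"" when the limit is not queue-specific
 * Returns 0 on success, -1 if the message did not fit in buf.
 */
int format_limit_error (char *buf, size_t len,
                        const char *name, int count, int limit,
                        int over, const char *queue)
{
    int have_queue = (queue != NULL && strlen (queue) > 0);
    int n;

    assert (buf != NULL && name != NULL);
    n = snprintf (buf, len,
                  "requested %s (%d) %s policy limit of %d%s%s",
                  name, count,
                  over ? "exceeds" : "is under", /* "is under" is assumed */
                  limit,
                  have_queue ? " for queue " : "",
                  have_queue ? queue : "");
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling `format_limit_error (buf, sizeof (buf), "nnodes", 20, 16, 1, "debug")` produces the "After" message shown in the PR description. Passing `NULL` for the queue drops the trailing `for queue ...` clause.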

Problem: The tests for the limit-job-size plugin in the testsuite
do not ensure queue policy limit errors include the queue name.

Add a test that ensures the queue name is present in a queue-specific
error.

Problem: The job update tests expect specific errors when moving
a job to a new queue or updating duration exceeds policy limits,
but these error messages may be expanded in the future.

Update the error patterns to allow for future changes.
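The idea of loosening a test pattern can be illustrated outside the testsuite. The sketch below is an assumption-laden example, not the testsuite's shell code: `message_matches` is a hypothetical helper built on POSIX `regcomp`/`regexec`, and the pattern strings are illustrative. It shows that a pattern anchored on the stable fragments of a message keeps matching after extra detail is added, while a fully anchored pattern does not.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if msg matches the POSIX extended regex
 * pattern, 0 if it does not, and -1 if the pattern fails to compile.
 */
int message_matches (const char *pattern, const char *msg)
{
    regex_t re;
    int rc;

    assert (pattern != NULL && msg != NULL);
    if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec (&re, msg, 0, NULL, 0);
    regfree (&re);
    return rc == 0 ? 1 : 0;
}
```

With the pattern `requested nnodes.*exceeds policy limit of 16`, both the "Before" and "After" messages from the PR description match. A strict pattern such as `^requested nnodes exceeds policy limit of 16$` matches only the "Before" form, which is the kind of brittleness the commit above avoids.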

Problem: The limit-duration plugin does not include the requested
duration or target queue in the error message when rejecting a job.
This can lead to user confusion about the source of the policy limit.

Add the requested duration (formatted as fsd) and the queue (if any)
to the error message sent back to the user.
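As a rough illustration of the limit-duration change described above, the sketch below formats a duration in an FSD-like form (a number followed by a unit suffix such as `s`, `m`, `h`, or `d`) and embeds it in a rejection message. Everything here is an assumption for illustration: `format_duration` is a simplified stand-in rather than Flux's own FSD formatter, `format_duration_error` is a hypothetical name, and the message wording is modeled on the nnodes example in the PR description rather than copied from the plugin.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified FSD-style formatter (NOT Flux's implementation): choose the
 * largest of d, h, m, s for which the value is at least 1, print with %g.
 * Returns 0 on success, -1 if the result did not fit in buf.
 */
int format_duration (char *buf, size_t len, double seconds)
{
    const char *unit = "s";
    double value = seconds;
    int n;

    assert (buf != NULL);
    if (seconds >= 86400.) {
        value = seconds / 86400.;
        unit = "d";
    }
    else if (seconds >= 3600.) {
        value = seconds / 3600.;
        unit = "h";
    }
    else if (seconds >= 60.) {
        value = seconds / 60.;
        unit = "m";
    }
    n = snprintf (buf, len, "%g%s", value, unit);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: build a duration rejection message that includes
 * the requested duration and, when present, the queue name.
 */
int format_duration_error (char *buf, size_t len,
                           double requested, double limit,
                           const char *queue)
{
    char req[32];
    char lim[32];
    int have_queue = (queue != NULL && strlen (queue) > 0);
    int n;

    if (format_duration (req, sizeof (req), requested) < 0
        || format_duration (lim, sizeof (lim), limit) < 0)
        return -1;
    n = snprintf (buf, len,
                  "requested duration (%s) exceeds policy limit of %s%s%s",
                  req, lim,
                  have_queue ? " for queue " : "",
                  have_queue ? queue : "");
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, a 7200 second request against a 3600 second limit on a queue named `debug` would be reported with `(2h)` as the requested duration and `1h` as the limit, followed by `for queue debug`.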

Problem: The tests for the limit-duration job-manager plugin do
not ensure error messages contain expected details.

Add a couple new tests to t2221-job-manager-limit-duration.t.

@mergify mergify bot added the queued label Nov 17, 2025
@mergify mergify bot merged commit db47b85 into flux-framework:master Nov 17, 2025
34 of 35 checks passed
@mergify mergify bot removed the queued label Nov 17, 2025
@codecov bot commented Nov 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.71%. Comparing base (ecd1e54) to head (6d54ba2).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7204      +/-   ##
==========================================
- Coverage   83.71%   83.71%   -0.01%     
==========================================
  Files         553      553              
  Lines       92370    92379       +9     
==========================================
+ Hits        77329    77336       +7     
- Misses      15041    15043       +2     
| Files with missing lines | Coverage Δ |
|---|---|
| src/modules/job-manager/plugins/limit-duration.c | 78.35% <100.00%> (+0.49%) ⬆️ |
| src/modules/job-manager/plugins/limit-job-size.c | 79.08% <100.00%> (+0.66%) ⬆️ |

... and 11 files with indirect coverage changes

@grondo grondo deleted the issue#7201 branch November 17, 2025 19:32

Linked issue (may be closed by merging this pull request): give more detail when jobs are rejected by limit-* plugins
