chu11 commented Nov 10, 2025

Problem: A single corrupted entry in a job directory effectively makes all data in that directory unusable (i.e. if one piece of data is corrupted, other uncorrupted data may not be usable). The --repair option moves only the corrupted data to lost+found, leaving the uncorrupted data in place. This can affect several job-related modules that expect specific job data to always be available.

Support a new --job-aware option to flux-fsck. In concert with the --repair option, if any data in a job directory is corrupted, move all contents of the job directory to lost+found.

Fixes #7121
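
To illustrate the difference in behavior described above, here is a minimal Python sketch of the selection logic only. It is not the flux-fsck implementation: the function `plan_repair`, the flat dotted key names, and the explicit `job_dirs` list are all hypothetical, chosen just to show which entries would be moved to lost+found with and without --job-aware.

```python
def plan_repair(keys, corrupted, job_dirs, job_aware=False):
    """Return the sorted keys that a repair pass would move to lost+found.

    keys      -- all keys found while walking the KVS (hypothetical flat names)
    corrupted -- the subset of keys found to be corrupted
    job_dirs  -- key prefixes assumed to be job directories (hypothetical)
    job_aware -- if True, one corrupted entry condemns its whole job directory
    """
    # Plain --repair behavior: only the corrupted entries are moved.
    to_move = set(corrupted)
    if job_aware:
        for d in job_dirs:
            prefix = d + "."
            members = [k for k in keys if k.startswith(prefix)]
            # If anything under this job directory is corrupted,
            # move every entry under it, corrupted or not.
            if any(k in corrupted for k in members):
                to_move.update(members)
    return sorted(to_move)


# Hypothetical example data: two job directories and one unrelated key.
keys = ["job.A.eventlog", "job.A.R", "job.B.eventlog", "job.B.R", "resource.R"]
job_dirs = ["job.A", "job.B"]
corrupted = {"job.A.R"}

print(plan_repair(keys, corrupted, job_dirs))
print(plan_repair(keys, corrupted, job_dirs, job_aware=True))
```

In this sketch, the first call selects only the corrupted entry, while the second also selects the uncorrupted `job.A.eventlog`, so no module is left with a partially populated job directory. Entries under `job.B` and the non-job key are untouched in both cases.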

Problem: Some code unnecessarily breaks function parameters onto
multiple lines, which does not conform to current coding patterns.

If a line of code is clearly < 80 chars, do not break up function parameters
onto multiple lines.

Problem: The function put_valref_lost_and_found() could be used
far more generally, but is currently limited to repaired valref
treeobjs.

Generalize the function and rename it to put_lost_and_found().

Problem: A single corrupted entry in a job directory will
effectively make all data in the job directory unusable (i.e. if
one piece of data is corrupted, other uncorrupted data may not be
usable).  The --repair option only moves the corrupted data to
the lost+found, leaving the uncorrupted data in place.  This can affect
several job-related modules that expect specific job data to always
be available.

Support a new --job-aware option to flux-fsck.  In concert with the
--repair option, if any data in a job directory is corrupted, move
all contents of the job directory to the lost+found.

Fixes flux-framework#7121

Problem: The new flux-fsck --job-aware option is not documented.

Add documentation to flux-fsck(1).

Problem: There is no test coverage for the new --job-aware option
in flux-fsck.

Add coverage in t2816-fsck-cmd.t.

garlick commented Nov 12, 2025

Any thoughts about abstracting some of these functions out into a private library (or private portion of libkvs) that could be used by multiple offline KVS tools or the KVS itself?

It'd be nice to shrink the volume of code in fsck.c and have unit tests for some of the functions in here.

(I'm just asking - maybe that's not practical)


chu11 commented Nov 12, 2025

> Any thoughts about abstracting some of these functions out into a private library (or private portion of libkvs) that could be used by multiple offline KVS tools or the KVS itself?

In an earlier iteration I did ponder this. I can't remember the specific reasons why, but the flux-fsck needs were (unsurprisingly) simpler than the KVS module needs, and things didn't seem to line up. But as flux-fsck grows and advances, it certainly is something worth re-visiting. I'll put it in a TODO item.

Edit: and as you mention, splicing things out into a lib just for unit testing does seem like a good idea

Development

Successfully merging this pull request may close these issues.

flux-fsck: support a --job-aware option