Skip to content

Conversation

@mmccrackan
Copy link
Contributor

@mmccrackan mmccrackan commented Aug 22, 2025

Sort of an update to #1156. Does the following:

  1. Switches to per-group parallelization for preprocess_tod and multilayer_preprocess_tod.
  2. Reworks preprocess_tod and multilayer_preprocess_tod to rely more on preproc_or_load_group since it does many of the same things.
  3. Progress bar for preprocess_tod and multilayer_preprocess_tod.
  4. Start adding jobdb support for both preprocess_tod and multilayer_preprocess_tod following similar implementation to record_qa.
  5. Adds some standardized error messages in a class which can be input into a preprocessing error db.
  6. Some more cleanup of functions and docstrings in preprocess_util.

Might split into multiple PRs if possible. Will need to adjust some things to make sure it doesn't break things downstream.

Copy link
Contributor

@msilvafe msilvafe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is looking. Comments for site_pipeline.multilayer_preprocess_tod also apply to site_pipeline.preprocess_tod

return error[0], [error[1], error[2]], [error[1], error[2]], None
if errors[0] is not None:
logger.error(f"Get Groups Error for {obs_id}: {group}\n{errors[1]}\n{errors[2]}")
return None, None, None, (errors[0], errors[1], errors[2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this one get the error class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error[0] is set to PreprocessErrors.GetGroupsError in the get_groups function itself since that gets called outside of preproc_or_load_group sometimes.

I guess a side question here is whether or not calling get_groups to verify if the group exists is worth it since preproc_or_load_group is called by preprocess_tod/multilayer_preprocess_tod which calls get_groups already and we could just let it fail later on if the group doesn't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright yes let's delete the get_groups call here but then in the get_obs/load_and_preprocess try/except blocks before returning the error call get_groups to check for the GetGroupsError to distinguish it from other metadata errors that way you only suffer the time penalty to call get_groups twice if it fails on data load.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already discussed offline, but we don't really need the get_groups call in the excepts since the errors should be fairly explanative (IncompleteMetadata was the primary reason why we added the try except in get_groups before), will fail early on before processing (in get_obs/get_meta), and we have the full traceback. We're also doing per-group runs now so we don't need to worry about a single group messing an entire obs up like before.

@msilvafe msilvafe mentioned this pull request Dec 5, 2025
@mmccrackan mmccrackan marked this pull request as ready for review December 5, 2025 18:29
@mmccrackan
Copy link
Contributor Author

Okay, after addressing comments and running some tests, I think this is probably in a state where it can be reviewed more officially. I'll do some more testing of the various logic branches to look for failure cases but I've run the first 25 obs_ids from ISO v3 for SATp3 here:

/global/cfs/cdirs/sobs/users/mmccrack/so/preprocess/common/20250821_preproc_jobdb/archives

I've made some edits to the mapmaker to accomodate these changes, but these are untested.

We need to check and update the sims versions.

For future to-dos I'd like to update the error class to actually have its own exceptions but will do that in a different PR. I was also thinking of removing error logging from the standard output to clean that up a bit and only putting it in the error log which should be much more parsable now. I'll also do the merging of load_and_preprocess and multilayer_load_and_preprocess in a different PR after this.

Copy link
Contributor

@adrien-laposta adrien-laposta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no particular comments on the code of this PR.
I mainly focused on the functions in preprocess_util.py that are used in the sim filtering routines.
I tested it on v3 preprocessing configuration files loading a data AxisManager from multilayer_load_and_preprocess and using multilayer_and_preprocess_sim to filter a simulated CAR map. I did not get any error and did a quick comparison with processed TOD from the master branch: there is perfect agreement between timestreams.

This PR looks good to me, let me know if you'd like me to test specific features beyond the standard functions we are using to filter simulations.

with jdb.locked(job) as j:
j.mark_visited()
if errors[0] is not None:
j.jstate = "failed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throughout this whole space I think you should import jobdb.JState and use that class instead of these strings.

@kmharrington
Copy link
Member

I've noticed what might be an edge case to this setup. (but also could be related to how I decided to try and run a bunch of things straight off a jobdb). The save_group_and_cleanup function does not update the jobdb. So if I run has made a bunch of open jobs, they've been saved in temp but not added into the manifestdb, and then things were cancelled. The current setup doesn't have an obvious way to set all those completed jobs to done when things are restarted.

Michael McCrackan added 2 commits January 2, 2026 06:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants