CPU and Max RSS Analysis tools #6663

ChrisPaulBennett · 2025-03-12T09:14:38Z

This is part of 3 pull requests adding CPU time and Max RSS analysis to the Cylc UI.

This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.

This adds a Python profiler script. The profiler will be run by Cylc in the same cgroup as the Cylc task. It will periodically poll cgroups and save the data to a file. Cylc will then store these values in the SQL db file.
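
As a rough illustration of the polling described above, here is a minimal sketch. It is not the PR's implementation. It assumes a cgroup v2 directory that exposes `cpu.stat` (with a `usage_usec` line) and `memory.peak`; the function names, the output file names and the `delay` / `max_polls` parameters are hypothetical.

```python
import time
from pathlib import Path


def read_cpu_usec(cgroup_dir: Path) -> int:
    """Return cumulative CPU time in microseconds from cgroup v2 cpu.stat."""
    for line in (cgroup_dir / "cpu.stat").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError("usage_usec not found in cpu.stat")


def read_peak_memory(cgroup_dir: Path) -> int:
    """Return peak memory usage in bytes from cgroup v2 memory.peak."""
    return int((cgroup_dir / "memory.peak").read_text().strip())


def poll(cgroup_dir: Path, out_dir: Path, delay: float, max_polls: int) -> None:
    """Poll the cgroup every `delay` seconds, overwriting the saved values."""
    for i in range(max_polls):
        (out_dir / "cpu_time").write_text(str(read_cpu_usec(cgroup_dir)))
        (out_dir / "max_rss").write_text(str(read_peak_memory(cgroup_dir)))
        if i < max_polls - 1:
            time.sleep(delay)
```

Overwriting the files on every poll means the most recent sample is always the one left on disk when the job ends.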

Linked to:
cylc/cylc-ui#2100
cylc/cylc-uiserver#675

Check List

I have read CONTRIBUTING.md and added my name as a Code Contributor.
Contains logically grouped changes (else tidy your branch by rebase).
Does not contain off-topic changes (use other PRs for other changes).
Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
Tests are included (or explain why tests are not needed).
Changelog entry included if this is a change that can affect users
Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

oliver-sanders

🎉

cylc/flow/etc/job.sh

cylc/flow/job_file.py

cylc/flow/etc/job.sh

cylc/flow/scripts/profile.py

tests/functional/jobscript/02-profiler.t

oliver-sanders

👍

cylc/flow/cfgspec/globalcfg.py

cylc/flow/etc/job.sh

cylc/flow/scripts/profiler.py

oliver-sanders · 2025-04-03T10:59:35Z

tests/functional/jobscript/02-profiler.t

+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#-------------------------------------------------------------------------------
+# cylc profile test


This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.

I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.

We should test the job's stderr for the line(s) written by the profiler script complaining of the fault.

@ChrisPaulBennett

The profiler actually fails in this test, but the test passes anyway because it doesn't check whether the profiler did anything useful.

I've had a crack at a test here: ChrisPaulBennett#1

A couple of the sub-tests don't pass at the moment because the cpu/memory are not returned if the job fails.

tests/functional/jobscript/02-profiler/flow.cylc

cylc/flow/scripts/profiler.py

oliver-sanders · 2025-04-03T11:03:48Z

(please ignore the manylinux test failures, we'll be removing this test on master shortly)

cylc/flow/cfgspec/globalcfg.py

wxtim · 2025-04-16T13:50:42Z

I'm getting lots of failures with this (admittedly nasty) workflow on localhost:

[task parameters]
    time = 1..10
    reps = 1..5
[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task<time><reps>
[runtime]
    [[task<time><reps>]]
        script = sleep $CYLC_TASK_PARAM_time

About 2/3 of tasks have FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time' - It looks to me like the profiler fails if the task exits too fast?

Full Traceback

Traceback (most recent call last):
  File "/home/users/tim.pillinger/conda-envs/cylc39/bin/cylc", line 8, in <module>
    sys.exit(main())
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 702, in main
    execute_cmd(command, *cmd_args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/cylc.py", line 333, in execute_cmd
    entry_point.load()(*args)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/terminal.py", line 298, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 62, in main
    get_config(options)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 180, in get_config
    profile(process, cgroup_version, args.delay)
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 159, in profile
    write_data(str(cpu_time), "cpu_time")
  File "/home/users/tim.pillinger/repos/cylc-flow/cylc/flow/scripts/profiler.py", line 103, in write_data
    with open(filename, 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'cpu_time'

oliver-sanders · 2025-04-16T14:17:45Z

Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup, but jobs that exit faster than the profiler's poll interval are an edge case that we should handle.
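
One way to tolerate this edge case is sketched below. It assumes the `FileNotFoundError` arises because the output location no longer exists by the time the profiler writes; the thread does not confirm the cause. `write_data_safely` is a hypothetical helper, not the PR's `write_data`.

```python
import sys
from pathlib import Path


def write_data_safely(data: str, path: Path) -> bool:
    """Write `data` to `path`.

    Return False, rather than raising, if the target directory is missing.
    """
    try:
        path.write_text(data)
    except FileNotFoundError as exc:
        # Report the problem on stderr but let the profiler exit cleanly.
        print(f"profiler: could not write {path.name}: {exc}", file=sys.stderr)
        return False
    return True
```

Returning a flag instead of raising keeps a traceback out of the job's stderr while still leaving a line there that a test could check for.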

wxtim · 2025-04-17T12:59:20Z

Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup

Probably need some user safety rails/warnings about that

oliver-sanders · 2025-04-17T13:01:33Z

Probably need some user safety rails/warnings about that

It's difficult for us to say which job runners do or do not support cgroup profiling. The best we can do is to document it.

cylc/flow/etc/job.sh

cylc/flow/cfgspec/globalcfg.py

cylc/flow/etc/job.sh

ChrisPaulBennett · 2025-04-28T10:23:28Z

I'm not sure how to deal with the linting failure. My Perl is rusty, at best.
If I add "export", as the error code recommends, the test fails. If I remove it, the test also fails.
Dave Matthews' recommendations have been implemented.

oliver-sanders · 2025-05-07T11:57:17Z

Works fine for me:

$ ctb -v tests/functional/jobscript/02-profiler.t -p '*'
ok 1 - 02-profiler-validate
ok 2 - 02-profiler-run
ok    20179 ms ( 0.01 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.28 CPU)
[12:56:44]
All tests successful.
Files=1, Tests=2, 24 wallclock secs ( 0.02 usr  0.01 sys +  3.16 cusr  1.10 csys =  4.29 CPU)
Result: PASS

$ git diff
diff --git a/tests/functional/jobscript/02-profiler.t b/tests/functional/jobscript/02-profiler.t
index 1d8dbc548..601d12971 100644
--- a/tests/functional/jobscript/02-profiler.t
+++ b/tests/functional/jobscript/02-profiler.t
@@ -16,7 +16,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #-------------------------------------------------------------------------------
 # cylc profile test
-REQUIRE_PLATFORM='runner:?(pbs|slurm)'
+export REQUIRE_PLATFORM='runner:?(pbs|slurm)'
 . "$(dirname "$0")/test_header"
 #-------------------------------------------------------------------------------
 set_test_number 2

$ etc/bin/shellchecker 
$ echo $?
0

ChrisPaulBennett · 2025-05-19T10:18:12Z

Works fine for me:

So it does. Weird. Anyway, done.

cylc/flow/etc/job.sh

cylc/flow/job_file.py

tests/functional/jobscript/02-profiler.t

cylc/flow/etc/job.sh

cylc/flow/scripts/profiler.py

oliver-sanders · 2025-10-21T11:00:54Z

We seem to have agreement on the validity of the cgroups field(s) we're polling, so we should be able to push forward with this quickly now.

Last couple of outstanding comments:

Remove global variables.
- Information is being duplicated into both the Process object and the global variables.
- We can remove the global variables in favour of the Process object.
- POC refactor: https://github.com/oliver-sanders/cylc-flow/pull/new/cylc_profiler.remove_global_variables
  - Add a cgroup_version field to the Process object.
  - Pass the Process object into the stop_profiler method.
  - Remove usage of global variables.
Back up the profiler output to the job.status file.
- This PR is using the cylc.flow.send_messages interface to send the message back to the scheduler.
- However, we should probably be using the record_messages interface (MB I think, sry).
- This interface will additionally write the message to the job.status file.
- This way, the message will not be lost if the scheduler is stopped at the time the message is sent, or if a network issue prevents the transmission of the message.
- Note, this interface does not presently support a comms_timeout option. This isn't a biggie, we can drop this.
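
The global-variable removal outlined above can be sketched as follows. This is an illustration only, not the code in the linked POC branch: the `Process` fields other than `cgroup_version`, the `stopped` flag, and the `make_handler` / `install_handlers` helpers are hypothetical. It binds the `Process` object into the handler with `functools.partial`, so the handler needs no module-level state. The sketch registers `SIGINT` and `SIGTERM` because `SIGKILL` cannot be caught.

```python
import signal
from dataclasses import dataclass
from functools import partial


@dataclass
class Process:
    """Hypothetical stand-in for the profiler's Process object."""
    cgroup_path: str
    cgroup_version: int
    stopped: bool = False


def stop_profiler(process: Process, signum, frame) -> None:
    """Signal handler: all state arrives via `process`, not globals."""
    process.stopped = True


def make_handler(process: Process):
    """Return a handler with the (signum, frame) signature signal expects."""
    return partial(stop_profiler, process)


def install_handlers(process: Process) -> None:
    """Register the bound handler for the catchable termination signals."""
    handler = make_handler(process)
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handler)
```

A closure or a bound method on `Process` would work equally well; `partial` is used here only because it keeps `stop_profiler` a plain, testable function.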

Then we can get Dave to ok the cgroup stuff and we're away...

cylc/flow/scripts/profiler.py

ChrisPaulBennett · 2025-10-21T14:26:16Z

* Pass the `Process` object into the `stop_profiler` method.

How do I do that? I couldn't see a way to do it, since I'm not calling the function myself; it's the registered handler for sigkill.
The use of globals is the only way I could see to get around this. I'd love to get rid of them.

*EDIT. Sorry, I didn't see the pull request, I'll go through it now

oliver-sanders · 2025-11-13T14:50:06Z

cylc/flow/scripts/profiler.py

+    try:
+        # Get the cgroup information for the current process
+        with open('/proc/' + str(pid) + '/cgroup', 'r') as f:
+            result = f.read()
+        result = PID_REGEX.search(result).group()
+        return result
+    except FileNotFoundError as err:
+        raise FileNotFoundError(
+            '/proc/' + str(pid) + '/cgroup not found') from err


This is catching a FileNotFoundError and raising a near-identical FileNotFoundError in its place.

I think the intention was to replace a scary looking traceback with a more informative error. If so, try this out (note you'll likely need to import CylcError from cylc.flow.exceptions first):

     try:
         # Get the cgroup information for the current process
         with open('/proc/' + str(pid) + '/cgroup', 'r') as f:
             result = f.read()
         result = PID_REGEX.search(result).group()
         return result
     except FileNotFoundError as err:
-        raise FileNotFoundError(
-            '/proc/' + str(pid) + '/cgroup not found') from err
+        raise CylcError(f'CGroup file not found: {err}') from None

CylcError is the class that (almost) all Cylc exceptions inherit from.

tldr; If you want a short clean error message, use CylcError or a subclass of it. If you want a scary traceback, use a plain exception.

CylcErrors get special treatment, the str(exc) gets written to stderr in red text. The traceback is not displayed unless running with --debug.

Note the from None hides the parent exception, preventing it from appearing in the traceback.
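
The effect of `from None` can be seen in isolation with the sketch below. `ProfilerError` is a hypothetical stand-in for `CylcError` (or a subclass of it) so that the example runs without Cylc installed, and `read_cgroup_file` is an illustrative helper rather than the PR's function.

```python
class ProfilerError(Exception):
    """Hypothetical stand-in for CylcError or a subclass of it."""


def read_cgroup_file(pid: int) -> str:
    """Read /proc/<pid>/cgroup, raising a short, clean error on failure."""
    path = f"/proc/{pid}/cgroup"
    try:
        with open(path, "r") as f:
            return f.read()
    except FileNotFoundError as err:
        # `from None` suppresses the original exception in the traceback.
        raise ProfilerError(f"CGroup file not found: {err}") from None
```

With `from None`, the raised exception has no `__cause__` and its `__suppress_context__` attribute is set, so Python omits the original `FileNotFoundError` when printing the traceback.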

cylc/flow/scripts/profiler.py

dpmatthews

cgroups usage looks sensible

cylc/flow/scripts/profiler.py

tests/unit/scripts/test_profiler.py

cylc/flow/scripts/profiler.py

Initial profiler implementation (non working)
- Changed the name of the profiler module.
- Linting
- Profiler sends KB instead of bytes
- Time Series now working
- CPU/Memory Logging working
- Adding profiler unit tests
- updating tests
- Fail gracefully if cgroups cannot be found
- Revert "Fail gracefully if cgroups cannot be found"
  - This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.
- Linting
- Modifying unit tests
- Linting
- Changed the name of the profiler module.
- Profiler sends KB instead of bytes
- Time Series now working

ChrisPaulBennett marked this pull request as draft March 12, 2025 09:19

oliver-sanders reviewed Mar 12, 2025

oliver-sanders added this to the 8.x milestone Mar 12, 2025

oliver-sanders assigned ChrisPaulBennett Mar 12, 2025

This was referenced Mar 13, 2025

CPU and Max RSS Analysis tools cylc/cylc-ui#2100

Open

CPU and Max RSS Analysis tools cylc/cylc-uiserver#675

Open

ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from fb1b12b to c5d30b3 Compare March 21, 2025 11:37

ChrisPaulBennett force-pushed the cylc_profiler branch 3 times, most recently from 30a7bb0 to 7091711 Compare April 2, 2025 08:35

ChrisPaulBennett marked this pull request as ready for review April 2, 2025 14:20

oliver-sanders reviewed Apr 3, 2025

oliver-sanders reviewed Apr 10, 2025

cylc/flow/cfgspec/globalcfg.py (resolved)

ChrisPaulBennett force-pushed the cylc_profiler branch from 4f3d03a to 49fcbc8 Compare April 15, 2025 07:51

ChrisPaulBennett requested a review from oliver-sanders April 15, 2025 09:39

ChrisPaulBennett force-pushed the cylc_profiler branch from 68b0687 to 66acd1f Compare April 17, 2025 09:41

oliver-sanders reviewed Apr 23, 2025

cylc/flow/etc/job.sh (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/cfgspec/globalcfg.py (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/etc/job.sh (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/etc/job.sh (outdated, resolved)

oliver-sanders reviewed Apr 24, 2025

cylc/flow/etc/job.sh (resolved)

ChrisPaulBennett requested a review from oliver-sanders April 28, 2025 14:02

oliver-sanders reviewed Jun 5, 2025

cylc/flow/etc/job.sh (resolved)

oliver-sanders self-requested a review June 19, 2025 12:01

oliver-sanders reviewed Jun 20, 2025

cylc/flow/job_file.py (outdated, resolved)

oliver-sanders reviewed Jun 20, 2025

tests/functional/jobscript/02-profiler.t (resolved)

ChrisPaulBennett requested a review from oliver-sanders July 10, 2025 12:59

oliver-sanders reviewed Aug 18, 2025

cylc/flow/etc/job.sh (outdated, resolved)

oliver-sanders reviewed Aug 18, 2025

dpmatthews requested changes Oct 21, 2025

cylc/flow/scripts/profiler.py (outdated, resolved)

cylc/flow/scripts/profiler.py (outdated, resolved)

oliver-sanders requested a review from dpmatthews November 13, 2025 13:35

oliver-sanders reviewed Nov 13, 2025

cylc/flow/scripts/profiler.py (outdated, resolved)

dpmatthews approved these changes Nov 13, 2025

ChrisPaulBennett closed this Nov 17, 2025

ChrisPaulBennett reopened this Nov 17, 2025

ChrisPaulBennett force-pushed the cylc_profiler branch from 3320697 to f5e5af8 Compare November 20, 2025 14:53

oliver-sanders reviewed Dec 1, 2025

ChrisPaulBennett added 2 commits December 3, 2025 16:45

Initial profiler implementation (non working)

125ed0c

Changed the name of the profiler module. Linting Profiler sends KB instead of bytes Time Series now working CPU/Memory Logging working

ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from a61828c to a2fbded Compare December 4, 2025 09:36

Time Series now working

36de59f

ChrisPaulBennett force-pushed the cylc_profiler branch from a2fbded to 61cbe0b Compare December 4, 2025 09:38

ChrisPaulBennett added 3 commits December 4, 2025 09:39

Code review changes

12c347d

Adding test coverage

e793c00

Added custom exception for cylc profiler

b4a32cd

ChrisPaulBennett force-pushed the cylc_profiler branch from 61cbe0b to b4a32cd Compare December 4, 2025 09:39

Linting

d310c12
