-
Notifications
You must be signed in to change notification settings - Fork 95
CPU and Max RSS Analysis tools #6663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
oliver-sanders
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
fb1b12b to
c5d30b3
Compare
30a7bb0 to
7091711
Compare
oliver-sanders
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| # You should have received a copy of the GNU General Public License | ||
| # along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
| #------------------------------------------------------------------------------- | ||
| # cylc profile test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.
I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.
We should test the jobs stderr for the line(s) written by the profiler script complaining of the fault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The profiler actually fails in this test, but the test passes anyway because it doesn't check whether the profiler did anything useful.
I've had a crack at a test here: ChrisPaulBennett#1
A couple of the sub-tests don't pass at the moment because the cpu/memory are not returned if the job fails.
|
(please ignore the manylinux test failures, we'll be removing this test on master shortly) |
4f3d03a to
49fcbc8
Compare
|
I'm getting lots of failures with this (admittedly nasty) workflow on localhost: [task parameters]
time = 1..10
reps = 1..5
[scheduling]
cycling mode = integer
[[graph]]
R1 = task<time><reps>
[runtime]
[[task<time><reps>]]
script = sleep $CYLC_TASK_PARAM_timeAbout 2/3 of tasks have Full Traceback |
|
Note, it's not really valid to configure the profiler for the localhost platform as the job isn't running in a cgroup, but jobs that exit faster than the profiler's poll interval is an edge case that we should handle. |
68b0687 to
66acd1f
Compare
Probably need some user safety rails/warnings about that |
It's difficult for us to say which job runners do or do not support cgroup profiling. The best we can do is to document it. |
|
I'm not sure how to deal with the linting failure. My Perl is rusty, at best. |
|
Works fine for me: $ ctb -v tests/functional/jobscript/02-profiler.t -p '*'
ok 1 - 02-profiler-validate
ok 2 - 02-profiler-run
ok 20179 ms ( 0.01 usr 0.01 sys + 3.16 cusr 1.10 csys = 4.28 CPU)
[12:56:44]
All tests successful.
Files=1, Tests=2, 24 wallclock secs ( 0.02 usr 0.01 sys + 3.16 cusr 1.10 csys = 4.29 CPU)
Result: PASS
$ git diff
diff --git a/tests/functional/jobscript/02-profiler.t b/tests/functional/jobscript/02-profiler.t
index 1d8dbc548..601d12971 100644
--- a/tests/functional/jobscript/02-profiler.t
+++ b/tests/functional/jobscript/02-profiler.t
@@ -16,7 +16,7 @@
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#-------------------------------------------------------------------------------
# cylc profile test
-REQUIRE_PLATFORM='runner:?(pbs|slurm)'
+export REQUIRE_PLATFORM='runner:?(pbs|slurm)'
. "$(dirname "$0")/test_header"
#-------------------------------------------------------------------------------
set_test_number 2
$ etc/bin/shellchecker
$ echo $?
0 |
So it does. Weird. Anyway, done. |
|
We seem to have agreement on the validity cgroups field(s) we're polling, we should be able to push forward with this quickly now. Last couple of outstanding comments:
Then we can get Dave to ok the cgroup stuff and we're away... |
How do I do that? I couldn't see a way to do it. Since I'm not calling the function, its the registered function for sigkill *EDIT. Sorry, I didn't see the pull request, I'll go through it now |
cylc/flow/scripts/profiler.py
Outdated
| try: | ||
| # Get the cgroup information for the current process | ||
| with open('/proc/' + str(pid) + '/cgroup', 'r') as f: | ||
| result = f.read() | ||
| result = PID_REGEX.search(result).group() | ||
| return result | ||
| except FileNotFoundError as err: | ||
| raise FileNotFoundError( | ||
| '/proc/' + str(pid) + '/cgroup not found') from err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is catching a FileNotFoundError and raising a near identical FileNotFoundError in it's place.
I think the intention was to replace a scary looking traceback with a more informative error. If so, try this out (note you'll likely need to import cylc.exceptions.CylcError first):
try:
# Get the cgroup information for the current process
with open('/proc/' + str(pid) + '/cgroup', 'r') as f:
result = f.read()
result = PID_REGEX.search(result).group()
return result
except FileNotFoundError as err:
- raise FileNotFoundError(
- '/proc/' + str(pid) + '/cgroup not found') from err
+ raise CylcError('CGroup file not found: {err}') from NoneCylcErroris the class that (almost) all Cylc exceptions inherit from.- tldr; If you want a short clean error message, use
CylcErroror a subclass of it. If you want a scary traceback, use a plain exception. CylcErrors get special treatment, thestr(exc)gets written to stderr in red text. The traceback is not displayed unless running with--debug.- Note the
from Nonehides the parent exception, preventing it from appearing in the traceback.
dpmatthews
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cgroups usage looks sensible
3320697 to
f5e5af8
Compare
Changed the name of the profiler module. Linting Profiler sends KB instead of bytes Time Series now working CPU/Memory Logging working
Initial profiler implementation (non working) Changed the name of the profiler module. Linting Profiler sends KB instead of bytes Time Series now working CPU/Memory Logging working Adding profiler unit tests updating tests Fail gracefully if cgroups cannot be found Revert "Fail gracefully if cgroups cannot be found" This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e. Linting Modifying unit tests Linting Changed the name of the profiler module. Profiler sends KB instead of bytes Time Series now working
a61828c to
a2fbded
Compare
a2fbded to
61cbe0b
Compare
61cbe0b to
b4a32cd
Compare
This apart of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.
This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.
This adds a python profiler script. This profiler will will be ran by cylc in the same crgroup as the cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the sql db file.
Linked to;
cylc/cylc-ui#2100
cylc/cylc-uiserver#675
Check List
CONTRIBUTING.mdand added my name as a Code Contributor.setup.cfg(andconda-environment.ymlif present).?.?.xbranch.