-
Notifications
You must be signed in to change notification settings - Fork 101
ceph-pull-requests: omit run-rbd-unit-tests unless pr matches @ceph/rbd codeowner #2361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
some obstacles: i don't really understand how bash functions work, and i'm pretty sure i'm using them wrong
most patterns in CODEOWNERS start with a / which is meant to match an absolute path, but the list of "changed files" we're matching against don't start with slashes: other patterns like |
c37484d to
3afd0fb
Compare
|
the matching for codeowners follows different rules than the bash pattern matching this is documented in CODEOWNERS syntax, which uses a subset of the gitignore PATTERN FORMAT. the biggest difference is that as much as i hate regex, it felt like the right tool to faithfully reproduce this matching. and i tried to get it right so we don't have to deal with bugs here. this logic is complicated enough that it probably deserves some unit tests, but i don't know of a good way to do that while it wasn't absolutely necessary, i did also unify the |
|
i was able to do some smoke testing with the following external script match.sh: #!/bin/bash
. /home/cbodley/ceph-build/scripts/build_utils.sh
matches_codeowner() {
local filenames=$1
local owner=$2
local patterns=$(patterns_for_codeowner "$owner")
if no_filenames_match "$filenames" "$patterns"; then return 1; fi
return 0
}
filenames="$(< /dev/stdin)"
owner="$1"
if matches_codeowner "$filenames" "$owner"; then
echo "codeowner match"
else
echo "no codeowner match"
fifor example, |
djgalloway
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually I am fine with this. I am no bash expert, however.
can you think of a good way to test this before merge? i could prepare some ceph branches to test for regressions in the doc-only/container-only cases, along with rbd and non-rbd branches to test the new logic |
Ah, right, I did say I'd do that. Yeah, let me push this change to a new job and we can test it. |
|
i pushed the following 4 branches: i can open draft prs for those if you like |
|
@cbodley https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/ Commenting "jenkins test pr-2361" on a PR in ceph.git will trigger this ⬆️ job. |
cbodley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some initial testing
|
I'm looking at the output in the Jenkins console log and my head is spinning. I'd agree with Ilya's suggestions but also ask for some additional |
@djgalloway i've pushed some updates to this pull request, but they don't seem to be taking effect when i retrigger the job |
|
@cbodley that's expected; to test, one copies the branch and pushes the job by hand, so anything in the job (the yaml, the scripts, etc.) has to be repushed to take effect (where 'push' here means "jenkins-jobs update") |
|
thanks @dmick. could i ask you or @djgalloway to please update/replace that job so i can keep testing with "jenkins test pr-2361"? |
Done |
|
cool, all 4 test prs do the right thing now: rbd ceph/ceph#63076 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/9/console rgw ceph/ceph#63077 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/8/consoleFull rgw pr with "rbd" label added https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/10/console container ceph/ceph#63078 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/6/console docs ceph/ceph#63079 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/7/console once the rgw build completes, i'll make sure the ctest |
|
squashed commits, no other changes |
taking forever, but i can confirm that the rbd job did run all 6 versions of run-rbd-unit-tests:
and that the rgw job scheduled 6 fewer tests:
|
|
FWIW, I'd like to move the code that figures out which tests to trigger to a Github action, so that we don't have to bother Jenkins at all, but I haven't figured out how Github would communicate that to Jenkins. It might be that we'd need to change the trigger mechanism for ceph-dev-new-build from GitHub PullRequest Builder (a plugin that's no longer maintained and has known security issues) to the generic "webhook" plugin, so that we could craft a webhook packet that contains the results of the classification. That would be much more efficient. |
@dmick am i correct that this would only apply to the docs-only/container-only checks that may skip the 'make check' job altogether? this rbd-specific logic just filters out some test cases from 'make check', so i think it still makes to run here if you're planning to move the docs/container stuff out anyways, i can drop the "scripts: use codeowner matching for docs/containers" commit to make that easier to port |
|
So far it's just a vague dream of moving whatever decisions can easily be moved to github. I haven't tried to define its scope until I understand what capabilities I've got for webhooking. It may be just a pipe dream for now, so don't let it super-affect your work here. |
|
i've been writing some test cases for the regex stuff and finding issues, will follow up |
|
thanks to David, i was able to retest with the final two commits ( the logs are quite verbose due to the are folks satisfied with the state of this pr? if so i can squash the commits and merge (and again, my most profuse apologies for inflicting you with regex here) |
Signed-off-by: Casey Bodley <[email protected]>
the .github/CODEOWNERS file contains a list of filename patterns and their associated teams patterns_for_codeowner() greps that CODEOWNERS file for the matching owner and returns all associated filename patterns codeowners_pattern_to_regex() applies the CODEOWNERS syntax rules to the given patterns, and regex operator =~ performs the matching pr_matches_codeowner() passes those regex patterns to no_filenames_match() to test for a match against any of the filenames changed by the pull request Signed-off-by: Casey Bodley <[email protected]>
pr_only_for() renamed to all_filenames_match() and takes patterns as a line-separated string to match the output of 'grep' from patterns_for_codeowner() Signed-off-by: Casey Bodley <[email protected]>
Signed-off-by: Casey Bodley <[email protected]>
…bd codeowner ceph-pull-requests and ceph-pull-requests-arm64 use pr_matches_codeowner() to check for @ceph/rbd ownership. if there is no match, use the CHECK_MAKEOPTS environment variable to pass an additional '-E run-rbd-unit-tests' argument to ctest this causes ctest to exclude the following long-running tests: > run-rbd-unit-tests-N.sh > run-rbd-unit-tests-0.sh > run-rbd-unit-tests-1.sh > run-rbd-unit-tests-61.sh > run-rbd-unit-tests-109.sh > run-rbd-unit-tests-127.sh Fixes: https://tracker.ceph.com/issues/70993 Signed-off-by: Casey Bodley <[email protected]>
Signed-off-by: Casey Bodley <[email protected]>
even if the pull request doesn't modify any files owned by @ceph/rbd, the "rbd" label can be added to opt in to its unit tests Signed-off-by: Casey Bodley <[email protected]>
test cases for codeowners_pattern_to_regex(), though they aren't run automatically Signed-off-by: Casey Bodley <[email protected]>
|
squashed/rebased, no other changes |
|
@cbodley what a nice surprise! @idryomov just pointed me to this PR from this other one (5 years old), which I'm rebasing to use as the foundation for a conditional test runner. Instead of focusing on a specific test, like
The approach would be slightly different that this:
Challenges/Open questions:
|
|
thanks @epuertat,
the reason that i like CODEOWNERS for this is that we're using it for github reviews/labels, so teams already have an incentive to keep that up-to-date while gitattributes would save us from this regex mess for CODEOWNERS, it's a new mapping that we'd need to maintain separately
i'm skeptical that this is what we actually want what if i modify a file under src/common that's included/linked into several other components? a bug in my changes may not be caught unless the unit tests from those components run too to make this reliable, it sounds like we would have to track and add the labels from all dependencies for every source/header. how do you suggest we do this, when each pr could change those dependencies? without a reliable mapping, we'd risk merging changes that break the required checks for other components in my view, avoiding this kind of disruption is the primary purpose of required checks in github: don't break the build, don't break something obvious that can be caught by unit test Ilya and i discussed the rbd unit tests as a special case because, while they do have a version that runs against rados in teuthology, the version in 'make check' runs against a mock librados. that means there's little risk that changes to common files would break them. and when there is a potential risk, the 'rbd' label can be added to run them the next time 'make check' is triggered |
Right! I initially thought about
Well, that's one of the open questions that I mentioned, and it basically comes down to establishing a set of rules (e.g.: only trigger conditional testing on PRs that only contain tagged files). In any case, the ultimate goal of this proposal goes beyond skipping tests. It's about making CI smarter/more efficient and reducing feedback loop/resource usage. For example:
That's why I mentioned that full builds & checks would probably still be required as post-merge jobs on the target branch. I thought that there was some ongoing initiative to improve Developer Experience. Brute force CI is an impediment to that. With some work and bold decisions we might reduce CI times from 1h-2h to 30 mins or even less than that. |
this pr is finally tested and working, so i'd prefer to limit its scope to https://tracker.ceph.com/issues/70993. i'm willing to adapt it to .gitattributes once @idryomov is satisfied with the rbd coverage (including backports), but that could follow in a separate pr i'm also keen to discuss larger-scale initiatives, but they'll need both consensus and resourcing. when we talk about big changes to ci, i worry that we don't factor in the cost of continued maintenance which falls on the infrastructure team in my opinion, the easiest/cheapest solution is to move integration tests and longer-running unit tests out of github checks entirely and instead rely on teuthology which should already be part of pr validation. teuthology was built for integration testing, so has several advantages over github checks - the biggest being that test failures can be tracked and prioritized without disrupting the whole project our tests are never going to be perfect. as long as the github checks are 'required', we should be picky about what runs there
we could accomplish this without elaborate changes to cmake, github, or jenkins. but i've now tried and failed to get consensus to move two of the outliers, the rbd and api tests. it will be difficult to make bold decisions if teams aren't willing to change their processes so yes, if the only compromise is conditional builds/tests so that teams can continue piling tests into github checks, then lets build on codeowners/labels/gitattributes. but i don't think this is sustainable, and we'll all be paying for the complexity |
just sharing ceph/ceph#63306 as an example of how easy this is in teuthology |
👍🏼 I'm ok then!
That's why I think moving to GH Actions or any other CI/CD aaS platform would be our best choice.
The moment we increase the testing overhead (by moving things outside of automated CI) and we make it opinionable (right now Pulpito only shows 2 green runs out of 25), we're immediately decreasing the quality of Ceph:
In other words: the more automated and objective our CI is, the less likely we are to let issues slip through. BTW, disruptive-ness has little to do with either Jenkins or Teuthology: API tests were never considered "disruptive" until they became required (they had been running in PRs for a couple of years before). On the contrary, it was left up to developers to check the API test results and decide (exactly the same as with Teuthology). And they became required exactly because 4 regressions from different teams were introduced in a lapse of 2 months with failing tests. To me, we have to make our CI faster, more stable and smarter. Shifting things to a context where we can relax our quality rules is to lie to us. |
|
@epuertat those issues sound like management issues to me. "Don't allow merging code with failing tests" is something that can be applied by humans, too, it doesn't have to be a robot. The issue is always that robots are stupid, and if the CI-required test failures aren't treated as emergencies, they slow every single developer for one team's mistakes. Not only mistakes....piling more and more into CI tests slows every single developer, even when there aren't test failures. We can choose not to lower quality of Ceph by requiring compliance to test results and test maintenance; punishing innocents is not an acceptable tradeoff for careful oversight. Also, of course the infrastructure team is often blamed first for a failed build/check, regardless of where the code is; putting into workflows just makes it that much harder to diagnose and iterate on. |
thanks @epuertat, it sounds like this may be our main point of contention. teuthology is how we ensure the quality of our releases, so if it isn't up to that task then we're already failing. Ceph is a large and complex distributed system, so is very hard to test extensively. we're never going to fit everything into our github pipeline, and i really don't think that should be a goal. testing is far more valuable when the deployment looks like a real ceph cluster
part of this may be that other teams didn't really buy into the vision of a stable ceph api, and weren't given extra resources to support the committment that you've asked of them but more generally, this is the difficulty of software development at scale. we have several teams working independently, and need a way to detect and resolve conflicting changes. integration tests between components is indeed the solution to that each team needs to be accountable to the others that depend on them, and that accountability includes the maintenance of tests i see these challenges as social ones, which have a tendency to get messy. it's far easier to reach for technical solutions like the 'required' option in github, but that doesn't address the underlying issues i don't want rgw changes to break the dashboard or anything else. but regressions happen, and they don't need to be escalated to project-wide disaster status. if they're in teuthology, we can assign and priorize them according to severity, and they can even block releases when necessary but the amount of disruption caused by required checks is doing more harm than good at this point, and it's not because they're catching regressions |
sorry, this was a bit hyperbolic. but still, i think it's unrealistic to expect that we can get our existing tests 100% green. our employers just aren't going to invest the resources necessary to get there |
@epuertat have you discussed this part with the infra team? if you're interested, there's a new #ceph-build-sig channel on slack and weekly meeting about ci, mostly focused around the containerization initiative. the sense i got from this week is that, while some triggers are moving into github actions, moving away from jenkins entirely wouldn't be tractable |
The thing is that we already tried that and it didn't work. It wasn't an arbitrary decision made out of thin air. Some automatic gating has to happen: whether it's building, running unit-tests or something else. Leaving that decision to human judgement (when you have an automated process) can lead to mistakes: my first contribution to the Ceph project just moved some Python code from 1 file to another... No one could have imagined that moving Python code could cause a segmentation fault. As a example, API tests aren't that comprehensive. They just do some basic e2e/functional testing on Ceph components on a vstart cluster. It's 293 tests and 1084 assertions that ensure that Ceph does what it's supposed to do on every PR. Only 3-4 of those assertions are randomly failing. And it all takes 30-40 mins + Ceph build (we could reduce that even more. If you compare that to the overhead of Teuthology (~3-5 hours), it's a no brainer. BTW, I see that after skipping the top 2-3 flaky tests, API is now much more stable.
@dmick I think you pointed out here 2 important and connected issues: blame culture (instead of understanding and appreciating someone else's contribution), and silo mentality/lack of co-ownership. And I do agree here that these are in fact management issues. A failure in Ceph should matter to all of us (no matter which team I'm on). And improving Ceph quality should be a concern for all of us. |
@cbodley I think we should distinguish what we can achieve now from what should be our goal. And in this very case, we can get both things. CI can't just do unit testing. It should at least ensure that Ceph runs and perform some functional/integration and even e2e tests. As I mentioned in my reply to @dmick, the cost-benefit of API tests is enormous: you get most Ceph features tested in 30-40 mins on every code push (compare that to a comprehensive Teuthology run). That said, we (Dashboard + every component team) can make them even more lightweight.
BTW I just realized that this has became (or I led it to) a discussion on API tests... And that should be probably better handled in the CSC meeting, where this discussion is pending. My bad. I'll send an email on this topic for the CSC. |
the .github/CODEOWNERS file contains a list of filename patterns and their associated teams
patterns_for_codeowner()greps that CODEOWNERS file for the matching owner and returns all associated filename patternscodeowners_pattern_to_regex()applies the CODEOWNERS syntax rules to the given patterns, andmatch_file_or_directory()performs the matchingpr_matches_codeowner()passes those regex patterns tono_filenames_match()to test for a match against any of the filenames changed by the pull requestceph-pull-requests and ceph-pull-requests-arm64 use
pr_matches_codeowner()to check for@ceph/rbdownership. if there is no match, use theCHECK_MAKEOPTSenvironment variable to pass an additional '-E run-rbd-unit-tests' argument to ctestthis causes ctest to exclude the following long-running tests:
Fixes: https://tracker.ceph.com/issues/70993