Skip to content

Conversation

@cbodley
Copy link
Contributor

@cbodley cbodley commented Apr 29, 2025

the .github/CODEOWNERS file contains a list of filename patterns and their associated teams

patterns_for_codeowner() greps that CODEOWNERS file for the matching owner and returns all associated filename patterns

codeowners_pattern_to_regex() applies the CODEOWNERS syntax rules to the given patterns, and match_file_or_directory() performs the matching

pr_matches_codeowner() passes those regex patterns to no_filenames_match() to test for a match against any of the filenames changed by the pull request

ceph-pull-requests and ceph-pull-requests-arm64 use pr_matches_codeowner() to check for @ceph/rbd ownership. if there is no match, use the CHECK_MAKEOPTS environment variable to pass an additional '-E run-rbd-unit-tests' argument to ctest

this causes ctest to exclude the following long-running tests:

run-rbd-unit-tests-N.sh
run-rbd-unit-tests-0.sh
run-rbd-unit-tests-1.sh
run-rbd-unit-tests-61.sh
run-rbd-unit-tests-109.sh
run-rbd-unit-tests-127.sh

Fixes: https://tracker.ceph.com/issues/70993

@cbodley
Copy link
Contributor Author

cbodley commented Apr 29, 2025

some obstacles:

i don't really understand how bash functions work, and i'm pretty sure i'm using them wrong

pr_only_for() requires all filenames to match at least one pattern, but pr_matches_codeowner() should succeed if at least one filename matches

pr_matches_codeowner() passes a multiline string (the output of grep) where pr_only_for() expects an array

most patterns in CODEOWNERS start with a / which is meant to match an absolute path, but the list of "changed files" we're matching against don't start with slashes:

changed files:
qa/suites/rgw/upgrade/1-install/quincy/install.yaml
qa/suites/rgw/upgrade/1-install/quincy/overrides.yaml

other patterns like README* should match any /path/to/README*

@cbodley cbodley force-pushed the wip-70993 branch 4 times, most recently from c37484d to 3afd0fb Compare April 30, 2025 13:46
@cbodley
Copy link
Contributor Author

cbodley commented Apr 30, 2025

the matching for codeowners follows different rules than the bash pattern matching if [[ $f == $p ]] used in pr_only_for()

this is documented in CODEOWNERS syntax, which uses a subset of the gitignore PATTERN FORMAT. the biggest difference is that * won't matches /s, but there's a special ** syntax that will

as much as i hate regex, it felt like the right tool to faithfully reproduce this matching. and i tried to get it right so we don't have to deal with bugs here. this logic is complicated enough that it probably deserves some unit tests, but i don't know of a good way to do that

while it wasn't absolutely necessary, i did also unify the docs_pr_only() and container_pr_only() to use the codeowners matching too. the docs_pr_only() check could potentially rely on patterns_for_codeowner "@ceph/doc-writers" instead of maintaining its own patterns, but i didn't make that change because the patterns aren't exactly the same

@cbodley
Copy link
Contributor Author

cbodley commented Apr 30, 2025

i was able to do some smoke testing with the following external script match.sh:

#!/bin/bash

. /home/cbodley/ceph-build/scripts/build_utils.sh

matches_codeowner() {
  local filenames=$1
  local owner=$2
  local patterns=$(patterns_for_codeowner "$owner")
  if no_filenames_match "$filenames" "$patterns"; then return 1; fi
  return 0
}

filenames="$(< /dev/stdin)"
owner="$1"

if matches_codeowner "$filenames" "$owner"; then
  echo "codeowner match"
else
  echo "no codeowner match"
fi

for example,

$ export WORKSPACE=/home/cbodley # for "$WORKSPACE/ceph/.github/CODEOWNERS"
$ echo 'qa/suites/rgw/verify.yaml' | ./match.sh "@ceph/rgw"
...
codeowner match
$ echo 'qa/suites/rgw/verify.yaml' | ./match.sh "@ceph/rbd"
...
no codeowner match

@cbodley cbodley marked this pull request as ready for review April 30, 2025 14:10
Copy link
Contributor

@djgalloway djgalloway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually I am fine with this. I am no bash expert, however.

@cbodley
Copy link
Contributor Author

cbodley commented Apr 30, 2025

Conceptually I am fine with this. I am no bash expert, however.

can you think of a good way to test this before merge? i could prepare some ceph branches to test for regressions in the doc-only/container-only cases, along with rbd and non-rbd branches to test the new logic

@djgalloway
Copy link
Contributor

Conceptually I am fine with this. I am no bash expert, however.

can you think of a good way to test this before merge? i could prepare some ceph branches to test for regressions in the doc-only/container-only cases, along with rbd and non-rbd branches to test the new logic

Ah, right, I did say I'd do that. Yeah, let me push this change to a new job and we can test it.

@cbodley
Copy link
Contributor Author

cbodley commented Apr 30, 2025

@djgalloway
Copy link
Contributor

@cbodley https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/

Commenting "jenkins test pr-2361" on a PR in ceph.git will trigger this ⬆️ job.

Copy link
Contributor Author

@cbodley cbodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some initial testing

@djgalloway
Copy link
Contributor

I'm looking at the output in the Jenkins console log and my head is spinning. I'd agree with Ilya's suggestions but also ask for some additional CI_DEBUG output echos.

@cbodley
Copy link
Contributor Author

cbodley commented Apr 30, 2025

Commenting "jenkins test pr-2361" on a PR in ceph.git will trigger this ⬆️ job.

@djgalloway i've pushed some updates to this pull request, but they don't seem to be taking effect when i retrigger the job

@dmick
Copy link
Member

dmick commented Apr 30, 2025

@cbodley that's expected; to test, one copies the branch and pushes the job by hand, so anything in the job (the yaml, the scripts, etc.) has to be repushed to take effect (where 'push' here means "jenkins-jobs update")

@cbodley
Copy link
Contributor Author

cbodley commented May 1, 2025

thanks @dmick. could i ask you or @djgalloway to please update/replace that job so i can keep testing with "jenkins test pr-2361"?

@djgalloway
Copy link
Contributor

thanks @dmick. could i ask you or @djgalloway to please update/replace that job so i can keep testing with "jenkins test pr-2361"?

Done

@cbodley
Copy link
Contributor Author

cbodley commented May 1, 2025

cool, all 4 test prs do the right thing now:

rbd ceph/ceph#63076 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/9/console

+ echo 'filename '\''src/librbd/empty.h'\'' matched pattern '\''/src/librbd'\'''
filename 'src/librbd/empty.h' matched pattern '/src/librbd'
+ return 1
+ return 0
+ echo 'Pull request changes files owned by @ceph/rbd.'
Pull request changes files owned by @ceph/rbd.

rgw ceph/ceph#63077 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/8/consoleFull

+ echo 'No rbd files changed. No need to run rbd unit tests.'
No rbd files changed. No need to run rbd unit tests.
+ export 'CHECK_MAKEOPTS=-E run-rbd-unit-tests'

rgw pr with "rbd" label added https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/10/console

+ echo 'Pull request contains rbd label.'
Pull request contains rbd label.

container ceph/ceph#63078 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/6/console

+ CONTAINER_ONLY=true
+ [[ false = true ]]
+ [[ true = true ]]
+ echo 'Only the doc/ or container/ dir changed.  No need to run make check.'
Only the doc/ or container/ dir changed.  No need to run make check.

docs ceph/ceph#63079 https://jenkins.ceph.com/view/all/job/preserve-ceph-pull-requests/7/console

+ return 0
+ DOCS_ONLY=true
...
+ echo 'Only the doc/ or container/ dir changed.  No need to run make check.'
Only the doc/ or container/ dir changed.  No need to run make check.

once the rgw build completes, i'll make sure the ctest -E run-rbd-unit-tests argument was applied correctly

@cbodley
Copy link
Contributor Author

cbodley commented May 1, 2025

squashed commits, no other changes

@cbodley
Copy link
Contributor Author

cbodley commented May 1, 2025

once the rgw build completes, i'll make sure the ctest -E run-rbd-unit-tests argument was applied correctly

taking forever, but i can confirm that the rbd job did run all 6 versions of run-rbd-unit-tests:

311/316 Test 32: run-rbd-unit-tests-61.sh .................. Passed 3365.51 sec

and that the rgw job scheduled 6 fewer tests:

1/310 Test 1: validate-options .......................... Passed 1.42 sec

@dmick
Copy link
Member

dmick commented May 1, 2025

FWIW, I'd like to move the code that figures out which tests to trigger to a Github action, so that we don't have to bother Jenkins at all, but I haven't figured out how Github would communicate that to Jenkins. It might be that we'd need to change the trigger mechanism for ceph-dev-new-build from GitHub PullRequest Builder (a plugin that's no longer maintained and has known security issues) to the generic "webhook" plugin, so that we could craft a webhook packet that contains the results of the classification. That would be much more efficient.

@cbodley
Copy link
Contributor Author

cbodley commented May 1, 2025

FWIW, I'd like to move the code that figures out which tests to trigger to a Github action

@dmick am i correct that this would only apply to the docs-only/container-only checks that may skip the 'make check' job altogether? this rbd-specific logic just filters out some test cases from 'make check', so i think it still makes to run here

if you're planning to move the docs/container stuff out anyways, i can drop the "scripts: use codeowner matching for docs/containers" commit to make that easier to port

@dmick
Copy link
Member

dmick commented May 1, 2025

So far it's just a vague dream of moving whatever decisions can easily be moved to github. I haven't tried to define its scope until I understand what capabilities I've got for webhooking. It may be just a pipe dream for now, so don't let it super-affect your work here.

@cbodley
Copy link
Contributor Author

cbodley commented May 1, 2025

i've been writing some test cases for the regex stuff and finding issues, will follow up

@cbodley
Copy link
Contributor Author

cbodley commented May 7, 2025

thanks to David, i was able to retest with the final two commits (squash! scripts: add pr_matches_codeowner() and scripts: add test_codeowners.sh) and the results were the same as above for all 4 prs 👍

the logs are quite verbose due to the set -ex in build_utils.sh, but could be helpful if i need to debug issues here

are folks satisfied with the state of this pr? if so i can squash the commits and merge

(and again, my most profuse apologies for inflicting you with regex here)

cbodley added 8 commits May 14, 2025 12:09
the .github/CODEOWNERS file contains a list of filename patterns and
their associated teams

patterns_for_codeowner() greps that CODEOWNERS file for the matching
owner and returns all associated filename patterns

codeowners_pattern_to_regex() applies the CODEOWNERS syntax rules to
the given patterns, and regex operator =~ performs the matching

pr_matches_codeowner() passes those regex patterns to no_filenames_match()
to test for a match against any of the filenames changed by the pull request

Signed-off-by: Casey Bodley <[email protected]>
pr_only_for() renamed to all_filenames_match() and takes patterns as a
line-separated string to match the output of 'grep' from
patterns_for_codeowner()

Signed-off-by: Casey Bodley <[email protected]>
…bd codeowner

ceph-pull-requests and ceph-pull-requests-arm64 use
pr_matches_codeowner() to check for @ceph/rbd ownership. if there is no
match, use the CHECK_MAKEOPTS environment variable to pass an additional
'-E run-rbd-unit-tests' argument to ctest

this causes ctest to exclude the following long-running tests:

> run-rbd-unit-tests-N.sh
> run-rbd-unit-tests-0.sh
> run-rbd-unit-tests-1.sh
> run-rbd-unit-tests-61.sh
> run-rbd-unit-tests-109.sh
> run-rbd-unit-tests-127.sh

Fixes: https://tracker.ceph.com/issues/70993

Signed-off-by: Casey Bodley <[email protected]>
even if the pull request doesn't modify any files owned by @ceph/rbd,
the "rbd" label can be added to opt in to its unit tests

Signed-off-by: Casey Bodley <[email protected]>
test cases for codeowners_pattern_to_regex(), though they aren't run
automatically

Signed-off-by: Casey Bodley <[email protected]>
@cbodley
Copy link
Contributor Author

cbodley commented May 14, 2025

squashed/rebased, no other changes

@epuertat
Copy link
Member

epuertat commented May 15, 2025

@cbodley what a nice surprise! @idryomov just pointed me to this PR from this other one (5 years old), which I'm rebasing to use as the foundation for a conditional test runner.

Instead of focusing on a specific test, like rbd-unit-tests, it'd provide a generic framework to trigger tests based on modified files between 2 git commits. For that it'd use the following tooling:

  1. cmake tests need to be tagged with the LABELS property. With the referred PR, it'd be as simple as:
    add_ceph_test(run-rbd-unit-tests-N.sh   ... N LABELS "rbd")
  2. Labels and paths need to be added to .gitattributes as new custom attributes (e.g.: ctest-labels:
    src/librbd/** ctest-labels=rbd
  3. git diff --name-only ... (as you did here) would give us the list of changed files.
  4. git check-attr handles attributes in .gitattributes:
    # git check-attr ctest-labels src/librbd/ImageCtx.cc
    src/librbd/ImageCtx.cc: ctest-labels: rbd
  • With all the above, we could trigger only tests labels for the modified files (this would need some extra tweak to deal with corner cases and multiple labels):
    MODIFIED_FILES=$(git diff --name-only)
    LABELS=$(git check-attr ctest-labels "$MODIFIED_FILES")
    ctest -L "$LABELS"

The approach would be slightly different that this:

  • No dependency on any Github stuff (like CODEOWNERS or GH API), only pure git standard tooling.
  • No dependency on Jenkins, but cmake/ctest.
  • Works both for Jenkins and local builds.
  • No need to mess with file parsing or path patterns, .gitattributes and git check-attr are already there.
  • Another benefit of LABELS is that we can start tagging tests that don't require compilation (e.g.: Python linting -> LABEL=no-build-required) and running them before (or in parallel) of the main ninja check.
  • gitattributes mapping could also be used to launch API tests only of the modified component (unfortunately Python unittest doesn't support test labelling from scratch, but it's not hard to implement).

Challenges/Open questions:

  • Dealing with multiple labels (trivial shell tokenization).
  • It would mostly benefit PRs that don't modify common files. This could be further refined to only test core subsystems (e.g.: msgr), but that's a risky move.
  • How to deal with fallback scenarios (e.g.: file modified without label attached, or conflicting labels). Until improved, this could be solved with a drastic decision: if no ctest-labels or more than ones found (e.g.: rbd and rgw), then trigger all tests.
  • If we find that tagging all tests and paths is challenging, we can start with a negative approach (e.g.: if rgw label found, then exclude cephfs and rbd, ctest --label-exclude $EXCLUDED_LABELS).
  • As a safety net, we might need to trigger a full make check post-merge job. This would still be more efficient, since it involves 1 full make check vs. the N updates that PRs usually go through.

@cbodley
Copy link
Contributor Author

cbodley commented May 15, 2025

thanks @epuertat, git check-attr looks like a cool feature

Labels and paths need to be added to .gitattributes as new custom attributes

the reason that i like CODEOWNERS for this is that we're using it for github reviews/labels, so teams already have an incentive to keep that up-to-date

while gitattributes would save us from this regex mess for CODEOWNERS, it's a new mapping that we'd need to maintain separately

With all the above, we could trigger only tests labels for the modified files

i'm skeptical that this is what we actually want

what if i modify a file under src/common that's included/linked into several other components? a bug in my changes may not be caught unless the unit tests from those components run too

to make this reliable, it sounds like we would have to track and add the labels from all dependencies for every source/header. how do you suggest we do this, when each pr could change those dependencies? without a reliable mapping, we'd risk merging changes that break the required checks for other components

in my view, avoiding this kind of disruption is the primary purpose of required checks in github: don't break the build, don't break something obvious that can be caught by unit test

Ilya and i discussed the rbd unit tests as a special case because, while they do have a version that runs against rados in teuthology, the version in 'make check' runs against a mock librados. that means there's little risk that changes to common files would break them. and when there is a potential risk, the 'rbd' label can be added to run them the next time 'make check' is triggered

@epuertat
Copy link
Member

the reason that i like CODEOWNERS for this is that we're using it for github reviews/labels, so teams already have an incentive to keep that up-to-date

while gitattributes would save us from this regex mess for CODEOWNERS, it's a new mapping that we'd need to maintain separately

Right! I initially thought about CODEOWNERS (BTW we already have another codetree mapping file over there: .github/labeler.yml), but I felt that .gitattributes might be better for an fresh implementation (for example, we don't want to trigger rbd tests if the modified files are /doc/rbd, or /qa/rbd). Anyway, I'm ok with either.

With all the above, we could trigger only tests labels for the modified files

i'm skeptical that this is what we actually want

what if i modify a file under src/common that's included/linked into several other components? a bug in my changes may not be caught unless the unit tests from those components run too

to make this reliable, it sounds like we would have to track and add the labels from all dependencies for every source/header. how do you suggest we do this, when each pr could change those dependencies? without a reliable mapping, we'd risk merging changes that break the required checks for other components

Well, that's one of the open questions that I mentioned, and it basically comes down to establishing a set of rules (e.g.: only trigger conditional testing on PRs that only contain tagged files).

In any case, the ultimate goal of this proposal goes beyond skipping tests. It's about making CI smarter/more efficient and reducing feedback loop/resource usage. For example:

  • Conditional builds:
    • E.g.: if a PR ONLY contains [RBD|RGW|Cephfs] changes, only build and test [RBD|RGW|Cephfs] stuff.
    • E.g.: if a PR ONLY contains Python changes don't trigger C++ builds .
  • Conditional API tests:
    • E.g.: if a PR ONLY contains [RBD|RGW|Cephfs] changes, only run [RBD|RGW|Cephfs] API tests.

in my view, avoiding this kind of disruption is the primary purpose of required checks in github: don't break the build, don't break something obvious that can be caught by unit test

That's why I mentioned that full builds & checks would probably still be required as post-merge jobs on the target branch.

I thought that there was some ongoing initiative to improve Developer Experience. Brute force CI is an impediment to that. With some work and bold decisions we might reduce CI times from 1h-2h to 30 mins or even less than that.

@cbodley
Copy link
Contributor Author

cbodley commented May 16, 2025

In any case, the ultimate goal of this proposal goes beyond skipping tests. It's about making CI smarter/more efficient and reducing feedback loop/resource usage.

this pr is finally tested and working, so i'd prefer to limit its scope to https://tracker.ceph.com/issues/70993. i'm willing to adapt it to .gitattributes once @idryomov is satisfied with the rbd coverage (including backports), but that could follow in a separate pr

i'm also keen to discuss larger-scale initiatives, but they'll need both consensus and resourcing. when we talk about big changes to ci, i worry that we don't factor in the cost of continued maintenance which falls on the infrastructure team

in my opinion, the easiest/cheapest solution is to move integration tests and longer-running unit tests out of github checks entirely and instead rely on teuthology which should already be part of pr validation. teuthology was built for integration testing, so has several advantages over github checks - the biggest being that test failures can be tracked and prioritized without disrupting the whole project

our tests are never going to be perfect. as long as the github checks are 'required', we should be picky about what runs there

With some work and bold decisions we might reduce CI times from 1h-2h to 30 mins or even less than that.

we could accomplish this without elaborate changes to cmake, github, or jenkins. but i've now tried and failed to get consensus to move two of the outliers, the rbd and api tests. it will be difficult to make bold decisions if teams aren't willing to change their processes

so yes, if the only compromise is conditional builds/tests so that teams can continue piling tests into github checks, then lets build on codeowners/labels/gitattributes. but i don't think this is sustainable, and we'll all be paying for the complexity

@cbodley
Copy link
Contributor Author

cbodley commented May 16, 2025

Conditional API tests:
E.g.: if a PR ONLY contains [RBD|RGW|Cephfs] changes, only run [RBD|RGW|Cephfs] API tests.

just sharing ceph/ceph#63306 as an example of how easy this is in teuthology

@epuertat
Copy link
Member

this pr is finally tested and working, so i'd prefer to limit its scope to tracker.ceph.com/issues/70993. i'm willing to adapt it to .gitattributes once @idryomov is satisfied with the rbd coverage (including backports), but that could follow in a separate pr

👍🏼 I'm ok then!

i'm also keen to discuss larger-scale initiatives, but they'll need both consensus and resourcing. when we talk about big changes to ci, i worry that we don't factor in the cost of continued maintenance which falls on the infrastructure team

That's why I think moving to GH Actions or any other CI/CD aaS platform would be our best choice.

in my opinion, the easiest/cheapest solution is to move integration tests and longer-running unit tests out of github checks entirely and instead rely on teuthology which should already be part of pr validation. teuthology was built for integration testing, so has several advantages over github checks - the biggest being that test failures can be tracked and prioritized without disrupting the whole project

The moment we increase the testing overhead (by moving things outside of automated CI) and we make it opinionable (right now Pulpito only shows 2 green runs out of 25), we're immediately decreasing the quality of Ceph:

  • "This PR doesn't need a teuthology run because..."
  • "The 5 teuthology failures seem unrelated..."

In other words: the more automated and objective our CI is, the less likely we are to let issues slip through.

BTW, disruptive-ness has little to do with either Jenkins or Teuthology: API tests were never considered "disruptive" until they became required (they had been running in PRs for a couple of years before). On the contrary, it was left up to developers to check the API test results and decide (exactly the same as with Teuthology). And they became required exactly because 4 regressions from different teams were introduced in a lapse of 2 months with failing tests.

To me, we have to make our CI faster, more stable and smarter. Shifting things to a context where we can relax our quality rules is to lie to us.

@dmick
Copy link
Member

dmick commented May 16, 2025

@epuertat those issues sound like management issues to me. "Don't allow merging code with failing tests" is something that can be applied by humans, too, it doesn't have to be a robot.

The issue is always that robots are stupid, and if the CI-required test failures aren't treated as emergencies, they slow every single developer for one team's mistakes. Not only mistakes....piling more and more into CI tests slows every single developer, even when there aren't test failures.

We can choose not to lower quality of Ceph by requiring compliance to test results and test maintenance; punishing innocents is not an acceptable tradeoff for careful oversight.

Also, of course the infrastructure team is often blamed first for a failed build/check, regardless of where the code is; putting into workflows just makes it that much harder to diagnose and iterate on.

@cbodley
Copy link
Contributor Author

cbodley commented May 16, 2025

The moment we increase the testing overhead (by moving things outside of automated CI) and we make it opinionable (right now Pulpito only shows 2 green runs out of 25), we're immediately decreasing the quality of Ceph:

thanks @epuertat, it sounds like this may be our main point of contention. teuthology is how we ensure the quality of our releases, so if it isn't up to that task then we're already failing. Ceph is a large and complex distributed system, so is very hard to test extensively. we're never going to fit everything into our github pipeline, and i really don't think that should be a goal. testing is far more valuable when the deployment looks like a real ceph cluster

BTW, disruptive-ness has little to do with either Jenkins or Teuthology: API tests were never considered "disruptive" until they became required (they had been running in PRs for a couple of years before). On the contrary, it was left up to developers to check the API test results and decide (exactly the same as with Teuthology). And they became required exactly because 4 regressions from different teams were introduced in a lapse of 2 months with failing tests.

part of this may be that other teams didn't really buy into the vision of a stable ceph api, and weren't given extra resources to support the committment that you've asked of them

but more generally, this is the difficulty of software development at scale. we have several teams working independently, and need a way to detect and resolve conflicting changes. integration tests between components is indeed the solution to that

each team needs to be accountable to the others that depend on them, and that accountability includes the maintenance of tests

i see these challenges as social ones, which have a tendency to get messy. it's far easier to reach for technical solutions like the 'required' option in github, but that doesn't address the underlying issues

i don't want rgw changes to break the dashboard or anything else. but regressions happen, and they don't need to be escalated to project-wide disaster status. if they're in teuthology, we can assign and priorize them according to severity, and they can even block releases when necessary

but the amount of disruption caused by required checks is doing more harm than good at this point, and it's not because they're catching regressions

@cbodley
Copy link
Contributor Author

cbodley commented May 16, 2025

but the amount of disruption caused by required checks is doing more harm than good at this point, and it's not because they're catching regressions

sorry, this was a bit hyperbolic. but still, i think it's unrealistic to expect that we can get our existing tests 100% green. our employers just aren't going to invest the resources necessary to get there

@cbodley
Copy link
Contributor Author

cbodley commented May 16, 2025

i'm also keen to discuss larger-scale initiatives, but they'll need both consensus and resourcing. when we talk about big changes to ci, i worry that we don't factor in the cost of continued maintenance which falls on the infrastructure team

That's why I think moving to GH Actions or any other CI/CD aaS platform would be our best choice.

@epuertat have you discussed this part with the infra team? if you're interested, there's a new #ceph-build-sig channel on slack and weekly meeting about ci, mostly focused around the containerization initiative. the sense i got from this week is that, while some triggers are moving into github actions, moving away from jenkins entirely wouldn't be tractable

@epuertat
Copy link
Member

@epuertat those issues sound like management issues to me. "Don't allow merging code with failing tests" is something that can be applied by humans, too, it doesn't have to be a robot.

The thing is that we already tried that and it didn't work. It wasn't an arbitrary decision made out of thin air. Some automatic gating has to happen: whether it's building, running unit-tests or something else. Leaving that decision to human judgement (when you have an automated process) can lead to mistakes: my first contribution to the Ceph project just moved some Python code from 1 file to another... No one could have imagined that moving Python code could cause a segmentation fault.

As a example, API tests aren't that comprehensive. They just do some basic e2e/functional testing on Ceph components on a vstart cluster. It's 293 tests and 1084 assertions that ensure that Ceph does what it's supposed to do on every PR. Only 3-4 of those assertions are randomly failing. And it all takes 30-40 mins + Ceph build (we could reduce that even more. If you compare that to the overhead of Teuthology (~3-5 hours), it's a no brainer.

BTW, I see that after skipping the top 2-3 flaky tests, API is now much more stable.

they slow every single developer for one team's mistakes
of course the infrastructure team is often blamed first for a failed build/check, regardless of where the code is;

@dmick I think you pointed out here 2 important and connected issues: blame culture (instead of understanding and appreciating someone else's contribution), and silo mentality/lack of co-ownership. And I do agree here that these are in fact management issues. A failure in Ceph should matter to all of us (no matter which team I'm on). And improving Ceph quality should be a concern for all of us.

@epuertat
Copy link
Member

thanks @epuertat, it sounds like this may be our main point of contention. teuthology is how we ensure the quality of our releases, so if it isn't up to that task then we're already failing. Ceph is a large and complex distributed system, so is very hard to test extensively. we're never going to fit everything into our github pipeline, and i really don't think that should be a goal. testing is far more valuable when the deployment looks like a real ceph cluster

@cbodley I think we should distinguish what we can achieve now from what should be our goal. And in this very case, we can get both things. CI can't just do unit testing. It should at least ensure that Ceph runs and perform some functional/integration and even e2e tests. As I mentioned in my reply to @dmick, the cost-benefit of API tests is enormous: you get most Ceph features tested in 30-40 mins on every code push (compare that to a comprehensive Teuthology run). That said, we (Dashboard + every component team) can make them even more lightweight.

part of this may be that other teams didn't really buy into the vision of a stable ceph api, and weren't given extra resources to support the committment that you've asked of them

BTW I just realized that this has became (or I led it to) a discussion on API tests... And that should be probably better handled in the CSC meeting, where this discussion is pending. My bad. I'll send an email on this topic for the CSC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants