test: initial implementation of SDK e2e #488

base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
Walkthrough

Adds end-to-end Kubeflow distributed FashionMNIST tests: a Jupyter notebook, Kubernetes-based test orchestration, cluster and RBAC preparation utilities, notebook lifecycle helpers, a sanity test entry, and a .gitignore entry to exclude `.vscode/*`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Test as Test Runner
participant K8s as Kubernetes Cluster
participant Notebook as Notebook Pod
participant Trainer as Kubeflow Trainer
Test->>K8s: Ensure JobSet CRD & trainer controller ready
Test->>K8s: Create test namespace
Test->>K8s: Ensure Notebook RBAC (SA, ClusterRole, Role)
Test->>K8s: Create ConfigMap (notebook bytes) & PVC
Test->>K8s: Deploy Notebook CR with Papermill command
rect rgb(235,245,255)
Note over Notebook,Trainer: Notebook execution and trainer submission
Notebook->>Notebook: Install papermill + extra packages
Notebook->>Trainer: Query runtimes & submit CustomTrainer
Trainer->>Trainer: Run distributed training (workers)
Trainer->>Notebook: Stream job status/logs
Notebook->>Notebook: Write completion marker (SUCCESS/FAILURE)
end
Test->>Notebook: Poll marker file until SUCCESS/FAILURE
alt SUCCESS
Test->>K8s: Delete Notebook, PVC, ConfigMap (cleanup)
else FAILURE or timeout
Test->>K8s: Gather logs, delete resources
end
Test->>Test: Report test result
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
Actionable comments posted: 7
🧹 Nitpick comments (4)
tests/trainer/utils/utils_cluster_prep.go (2)
53-55: Consider making expected runtimes configurable.

The hardcoded list of ClusterTrainingRuntimes (torch-cuda-241, etc.) conflicts with the PR's stated requirement for a "universal" runtime, and the TODO comment suggests this list will change. Consider accepting expected runtime names as a parameter or reading from an environment variable to make the test more flexible:

```diff
-func EnsureTrainerClusterReady(t *testing.T, test Test) {
+func EnsureTrainerClusterReady(t *testing.T, test Test, expectedRuntimes ...string) {
+    if len(expectedRuntimes) == 0 {
+        // Default runtimes if none specified
+        expectedRuntimes = []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"}
+    }
     ...
-    for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
+    for _, name := range expectedRuntimes {
```
64-66: Hardcoded ServiceAccount name reduces reusability.

The ServiceAccount name "jupyter-nb-kube-3aadmin" is hardcoded and matches common.NOTEBOOK_CONTAINER_NAME, but this tight coupling limits flexibility for other test scenarios. Consider accepting the SA name as a parameter:

```diff
-func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
+func EnsureNotebookRBAC(t *testing.T, test Test, namespace string, saName string) {
     t.Helper()
-    saName := "jupyter-nb-kube-3aadmin"
```

Alternatively, if this SA name is always tied to notebooks, accept it from common.NOTEBOOK_CONTAINER_NAME explicitly to make the dependency clear.

tests/trainer/resources/mnist.ipynb (1)

17-125: Consider adding training validation.

The training function completes successfully but doesn't validate that the model actually learned (e.g., checking that the loss decreased or accuracy improved). For a sanity test, consider adding basic validation.
For example, after training completes, you could add:
```python
# At the end of train_fashion_mnist, before dist.destroy_process_group()
if dist.get_rank() == 0:
    if loss.item() > initial_loss:
        print("WARNING: Loss did not decrease during training")
    print(f"Final loss: {loss.item():.6f}")
```

tests/trainer/utils/utils_notebook.go (1)
49-53: Unused helper function.

CreateNotebookFromBytes is defined but not used anywhere in this PR. The test in fashion_mnist_tests.go creates the ConfigMap and Notebook separately. Consider either:

- Using this helper in fashion_mnist_tests.go to reduce duplication
- Removing it if it's not needed yet
- Adding a comment explaining it's for future use
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore (1 hunks)
- tests/trainer/kubeflow_sdk_test.go (1 hunks)
- tests/trainer/resources/mnist.ipynb (1 hunks)
- tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
- tests/trainer/utils/utils_cluster_prep.go (1 hunks)
- tests/trainer/utils/utils_notebook.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/trainer/utils/utils_notebook.go (4)
- tests/common/support/test.go (1): Test (34-45)
- tests/common/notebook.go (3): ContainerSize (72-72), CreateNotebook (104-185), NOTEBOOK_CONTAINER_NAME (37-37)
- tests/common/support/core.go (4): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240), GetPods (111-116)
- tests/common/support/support.go (1): TestTimeoutLong (35-35)

tests/trainer/kubeflow_sdk_test.go (2)
- tests/common/test_tag.go (2): Tags (32-40), Sanity (48-50)
- tests/trainer/sdk_tests/fashion_mnist_tests.go (1): RunFashionMnistCpuDistributedTraining (39-86)

tests/trainer/sdk_tests/fashion_mnist_tests.go (8)
- tests/common/support/test.go (2): T (90-102), With (61-63)
- tests/trainer/utils/utils_cluster_prep.go (2): EnsureTrainerClusterReady (33-56), EnsureNotebookRBAC (60-83)
- tests/common/environment.go (2): GetNotebookUserName (70-76), GetNotebookUserToken (78-84)
- tests/common/support/rbac.go (1): CreateUserRoleBindingWithClusterRole (206-238)
- tests/common/support/core.go (3): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240)
- tests/trainer/utils/utils_notebook.go (3): BuildPapermillShellCmd (36-46), WaitForNotebookPodRunning (56-64), PollNotebookCompletionMarker (67-85)
- tests/common/notebook.go (4): CreateNotebook (104-185), ContainerSizeSmall (75-75), DeleteNotebook (187-190), Notebooks (192-204)
- tests/common/support/support.go (2): TestTimeoutLong (35-35), TestTimeoutDouble (36-36)

tests/trainer/utils/utils_cluster_prep.go (4)
- tests/common/support/test.go (2): T (90-102), Test (34-45)
- tests/common/support/client.go (1): Client (39-50)
- tests/common/support/core.go (2): ServiceAccount (194-200), ServiceAccounts (207-219)
- tests/common/support/rbac.go (4): CreateClusterRole (48-70), CreateClusterRoleBinding (135-169), CreateRole (27-46), CreateRoleBinding (72-102)
🪛 GitHub Actions: Verify Generated Files and Import Organization
tests/trainer/utils/utils_notebook.go
[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.
tests/trainer/utils/utils_cluster_prep.go
[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.
🔇 Additional comments (5)
.gitignore (1)
2-2: LGTM!

Adding .vscode/* to gitignore is a standard best practice that prevents user-specific IDE settings and workspace configurations from being committed to version control. This is especially helpful for e2e test development where contributors may use VSCode.

tests/trainer/kubeflow_sdk_test.go (1)

26-30: LGTM!

Clean test entry point that properly tags the test as Sanity and delegates to the actual test implementation. The comment on line 29 provides a clear extension point for additional tests.

tests/trainer/sdk_tests/fashion_mnist_tests.go (2)
tests/trainer/sdk_tests/fashion_mnist_tests.go (2)
39-86: Well-structured test implementation.The test follows good practices:
- Proper test namespace isolation
- Prerequisite checks before execution
- RBAC setup
- Resource cleanup with defer
- Clear error messages
The flow is logical and the use of helper utilities keeps the test readable.
33-36: Relative path verified: notebook file found at expected location.

The notebook file exists at tests/trainer/resources/mnist.ipynb, confirming the relative path "resources/" resolves correctly when tests run from the package directory.

tests/trainer/utils/utils_notebook.go (1)
36-46: LGTM!

The Papermill shell command construction is clean and handles optional extra packages appropriately. The `set -e` ensures failures are caught, and the marker file pattern is a good approach for async completion detection.
| "for runtime in client.list_runtimes():\n", | ||
| " print(runtime)\n", | ||
| " if runtime.name == \"universal\": # Update to actual universal image runtime once available\n", | ||
| " torch_runtime = runtime" | ||
| ] |
🧩 Analysis chain
Hardcoded runtime name needs to be resolved.
The comment on line 158 indicates this uses a placeholder runtime name "universal" that needs updating. This conflicts with the hardcoded runtime checks in utils_cluster_prep.go (lines 53-54) which expect different runtime names.
The PR description mentions needing a custom clusterTrainingRuntime named "universal", but the cluster prep utility checks for different names. Ensure consistency between:
- The runtime name expected by this notebook (line 158)
- The runtimes validated in EnsureTrainerClusterReady (utils_cluster_prep.go:53)
- The actual runtime available in the test environment
Consider parameterizing the runtime name or documenting the expected setup more clearly.
---
🏁 Script executed:
```shell
#!/bin/bash
# Extract and examine the notebook file structure and content around lines 156-160
fd -e ipynb -path '*/tests/trainer/resources/mnist.ipynb' --exec cat {} \; | python3 << 'EOF'
import json
import sys
try:
data = json.load(sys.stdin)
cells = data.get('cells', [])
# Flatten all cells and count lines
line_count = 0
for cell_idx, cell in enumerate(cells):
source = cell.get('source', [])
if isinstance(source, str):
source = source.split('\n')
for line_idx, line in enumerate(source):
line_count += 1
# Print lines 150-165 to capture context around 156-160
if 150 <= line_count <= 165:
print(f"Line {line_count}: {line}")
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
EOF
```

Length of output: 217
🏁 Script executed:

```shell
#!/bin/bash
# Examine utils_cluster_prep.go around lines 53-54
fd -e go -name 'utils_cluster_prep.go' --exec cat {} \; | head -70 | tail -25
```

Length of output: 230
🏁 Script executed:

```shell
#!/bin/bash
# Search for runtime-related configurations and definitions
rg -n "runtime" -i --type go -A2 -B2 | head -100
```

Length of output: 7893
---
🏁 Script executed:
```shell
#!/bin/bash
# Find and read the notebook file directly
find . -name "mnist.ipynb" -path "*/tests/trainer/resources/*" -type f
```

Length of output: 37
🏁 Script executed:

```shell
#!/bin/bash
# Once found, extract notebook content and look for "universal" runtime reference
find . -name "mnist.ipynb" -path "*/tests/trainer/resources/*" -type f -exec python3 << 'EOF' {} \;
import json
import sys
try:
with open(sys.argv[1], 'r') as f:
notebook = json.load(f)
# Search through all cells for "universal" string
for cell_idx, cell in enumerate(notebook.get('cells', [])):
source = cell.get('source', [])
if isinstance(source, str):
source_lines = source.split('\n')
else:
source_lines = source
for line_idx, line in enumerate(source_lines):
if 'universal' in line.lower() or 'torch_runtime' in line.lower():
print(f"Cell {cell_idx}, Line {line_idx}: {line}")
except Exception as e:
print(f"Error processing {sys.argv[1]}: {e}", file=sys.stderr)
EOF
```

Length of output: 218
🏁 Script executed:

```shell
#!/bin/bash
# Also search codebase for references to "universal" runtime
rg -n "universal" --type go
```

Length of output: 113
---
🏁 Script executed:
```shell
#!/bin/bash
# Search notebook file for "universal" using grep/rg
rg -n "universal" ./tests/trainer/resources/mnist.ipynb
```

Length of output: 108
🏁 Script executed:

```shell
#!/bin/bash
# Search for torch_runtime usage in notebook
rg -n "torch_runtime\|runtime" ./tests/trainer/resources/mnist.ipynb | head -30
```

Length of output: 0
🏁 Script executed:

```shell
#!/bin/bash
# Extract notebook with cat and parse JSON properly
cat ./tests/trainer/resources/mnist.ipynb | python3 << 'PYTHON_EOF'
import json
import sys
try:
nb = json.load(sys.stdin)
for idx, cell in enumerate(nb.get('cells', [])):
source = cell.get('source', [])
if isinstance(source, list):
content = ''.join(source)
else:
content = source
# Look for runtime-related content
if 'universal' in content.lower() or 'torch_runtime' in content.lower():
lines = content.split('\n')
for line_num, line in enumerate(lines):
if 'universal' in line.lower() or 'torch_runtime' in line.lower():
print(f"Cell {idx}: {line}")
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
PYTHON_EOF
```

Length of output: 48
Update notebook runtime selection to match validated test runtimes.
The notebook hardcodes a check for runtime.name == "universal" at line 158, but the test utility (utils_cluster_prep.go lines 53-54) validates only torch-cuda-241, torch-cuda-251, torch-rocm-241, and torch-rocm-251 runtimes are available. The "universal" runtime is marked as a future addition (see the TODO at utils_cluster_prep.go line 52). Since "universal" does not currently exist in the test environment, the notebook's torch_runtime variable will remain unset, causing failures in subsequent code.
Align the notebook to use one of the currently available runtimes (e.g., "torch-cuda-241") or parameterize it with a fallback mechanism.
🤖 Prompt for AI Agents
In tests/trainer/resources/mnist.ipynb around lines 156 to 160, the notebook
checks for runtime.name == "universal" which doesn't exist in the test
environment and leaves torch_runtime unset; change the selection to match
validated runtimes (for example "torch-cuda-241") or implement a fallback loop
that picks the first available runtime from the allowed set
["torch-cuda-241","torch-cuda-251","torch-rocm-241","torch-rocm-251"]; update
the condition to test membership in that list and set torch_runtime accordingly
so downstream cells always have a valid runtime.
```diff
@@ -0,0 +1,83 @@
/*
```
Fix import organization per pipeline failure.
The pipeline check failed because imports are not properly sorted.
Run make imports to fix the import organization as indicated by the pipeline failure.
🧰 Tools
🪛 GitHub Actions: Verify Generated Files and Import Organization
[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 1 to 1 the import
statements are not properly organized causing the CI pipeline to fail; run the
project's import formatter (make imports) or manually sort and group imports
according to the project's gofmt/goimports rules (standard libs first, blank
line, third-party, blank line, internal packages) and ensure there are no unused
imports, then save and re-run the pipeline.
```go
if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
    t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
}
```
Replace kubectl shell command with Kubernetes client API.
Using exec.Command("kubectl", ...) is brittle and introduces external dependencies. The test framework already provides a Kubernetes client via test.Client().
Apply this diff to use the Kubernetes API client:
```diff
- if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
-     t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
- }
+ _, err := test.Client().Dynamic().Resource(schema.GroupVersionResource{
+     Group:    "apiextensions.k8s.io",
+     Version:  "v1",
+     Resource: "customresourcedefinitions",
+ }).Get(test.Ctx(), "jobsets.jobset.x-k8s.io", metav1.GetOptions{})
+ test.Expect(err).NotTo(HaveOccurred(), "JobSet CRD missing")
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 37 to 39, replace the
exec.Command("kubectl", ...) call with a Kubernetes API call using the provided
test.Client(): call
test.Client().ApiextensionsV1().CustomResourceDefinitions().Get(ctx,
"jobsets.jobset.x-k8s.io", metav1.GetOptions{}) (using a context, e.g.,
context.Background()), and if it returns an error fail the test with t.Fatalf
including the error; remove the exec.Command usage and its output handling and
ensure the file imports context and metav1 (and any apiextensions client types
if missing).
This is a good suggestion!
Will switch to it most likely but after comments from others.
@MStokluska, sounds good! Feel free to implement the change once you've gathered feedback from the other reviewers. I'm here if you need any help with the implementation.
This code is just temporary and will be removed once Trainer v2 is onboarded into ODH/RHOAI, right?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
If so, then we can keep it as it is now, so as not to invest much effort into something that will be deleted soon.
```go
if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
    t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
}
```
Replace kubectl shell command with Kubernetes client API and remove hardcoded namespace.
Using exec.Command("kubectl", ...) is brittle. Additionally, the namespace "opendatahub" is hardcoded, reducing reusability.
Apply this diff to use the Kubernetes API client with Eventually for polling:
```diff
- if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
-     t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
- }
+ odhNamespace := "opendatahub" // Consider making this configurable
+ test.Eventually(func(g Gomega) {
+     deploy, err := test.Client().Core().AppsV1().Deployments(odhNamespace).Get(test.Ctx(), "kubeflow-trainer-controller-manager", metav1.GetOptions{})
+     g.Expect(err).NotTo(HaveOccurred())
+     g.Expect(deploy.Status.Conditions).To(ContainElement(MatchFields(IgnoreExtras, Fields{
+         "Type":   Equal(appsv1.DeploymentAvailable),
+         "Status": Equal(corev1.ConditionTrue),
+     })))
+ }, 180*time.Second).Should(Succeed(), "Trainer controller not available")
```

Committable suggestion skipped: line range outside the PR's diff.
same as above
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Fetch all review comments from this PR to find the "above" comment being referenced
gh pr view 488 --repo opendatahub-io/distributed-workloads --json comments,reviews --jq '.reviews[].comments[] | select(.author.login == "coderabbitai" or .author.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .author.login}' | head -50
```

Length of output: 25
🏁 Script executed:

```shell
#!/bin/bash
# Try fetching review comments using the GitHub API directly
gh api repos/opendatahub-io/distributed-workloads/pulls/488/comments --jq '.[] | select(.user.login == "coderabbitai[bot]" or .user.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .user.login}' | head -100
```

Length of output: 18212
@MStokluska, understood! I'll wait for your updates after you've gathered feedback from the team. Feel free to reach out if you need any assistance with the implementation.
🐰✨
```go
// Ensure ServiceAccount exists
saName := "jupyter-nb-kube-3aadmin"
sa := &corev1.ServiceAccount{ObjectMeta: metav1.ObjectMeta{Name: saName, Namespace: namespace}}
_, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
```
Ignoring Create error could mask issues.
The error from ServiceAccounts(...).Create(...) is silently ignored, which might hide actual problems (e.g., permission issues).
Consider logging the error or checking if it's an AlreadyExists error:
```diff
- _, _ = test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
+ _, err := test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
+ if err != nil && !errors.IsAlreadyExists(err) {
+     test.T().Logf("Warning: failed to create ServiceAccount: %v", err)
+ }
```
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
_, err := test.Client().Core().CoreV1().ServiceAccounts(namespace).Create(test.Ctx(), sa, metav1.CreateOptions{})
if err != nil && !errors.IsAlreadyExists(err) {
    test.T().Logf("Warning: failed to create ServiceAccount: %v", err)
}
```
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around line 66, the code currently
ignores the error returned by ServiceAccounts(...).Create(...); capture the
returned error, check if it's nil or an apierrors.IsAlreadyExists(err) and only
treat non-nil/non-AlreadyExists errors as failures (fail the test or log the
error), otherwise proceed; update the call to handle the error accordingly (use
t.Fatalf/t.Fatalf-like helper or test.Logger.Errorf) so permission or creation
failures are not silently swallowed.
```go
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
    var finalErr error
    test.Eventually(func() bool {
        out, err := exec.Command("kubectl", "-n", namespace, "exec", podName, "-c", containerName, "--", "cat", marker).CombinedOutput()
        if err != nil {
            return false
        }
        switch strings.TrimSpace(string(out)) {
        case "SUCCESS":
            return true
        case "FAILURE":
            finalErr = fmt.Errorf("Notebook execution failed")
            return true
        default:
            return false
        }
    }, timeout).Should(BeTrue(), "Notebook did not reach definitive state")
    return finalErr
}
```
Replace kubectl shell command with Kubernetes client API.
Using exec.Command("kubectl", ...) to read the marker file is brittle and adds external dependencies.
Consider using the Kubernetes API to read the file contents via the pod exec subresource:
```go
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
    var finalErr error
    test.Eventually(func() bool {
        req := test.Client().Core().CoreV1().RESTClient().
            Get().
            Namespace(namespace).
            Resource("pods").
            Name(podName).
            SubResource("exec").
            VersionedParams(&corev1.PodExecOptions{
                Container: containerName,
                Command:   []string{"cat", marker},
                Stdout:    true,
                Stderr:    true,
            }, scheme.ParameterCodec)
        exec, err := remotecommand.NewSPDYExecutor(test.Config(), "POST", req.URL())
        if err != nil {
            return false
        }
        var stdout, stderr bytes.Buffer
        err = exec.Stream(remotecommand.StreamOptions{
            Stdout: &stdout,
            Stderr: &stderr,
        })
        if err != nil {
            return false
        }
        switch strings.TrimSpace(stdout.String()) {
        case "SUCCESS":
            return true
        case "FAILURE":
            finalErr = fmt.Errorf("Notebook execution failed")
            return true
        default:
            return false
        }
    }, timeout).Should(BeTrue(), "Notebook did not reach definitive state")
    return finalErr
}
```

🤖 Prompt for AI Agents
In tests/trainer/utils/utils_notebook.go around lines 67 to 85, replace the
brittle exec.Command("kubectl", ...) call with a Kubernetes client-based pod
exec using test.Client() and the pod exec subresource: build a REST GET request
against pods/{name}/exec with corev1.PodExecOptions (container, command ["cat",
marker], Stdout/Stderr true), create an SPDY executor with
remotecommand.NewSPDYExecutor(test.Config(), "POST", req.URL()), stream into
bytes.Buffer for stdout/stderr, handle executor creation/stream errors by
returning false (and record finalErr when stdout trimmed equals "FAILURE"),
return true on "SUCCESS" or "FAILURE", and ensure you add necessary imports
(bytes, corev1, scheme, remotecommand) and preserve the existing
Eventually/timeout behavior and return finalErr at the end.
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/trainer/utils/utils_notebook.go (2)
17-32: Fix import organization per pipeline failure.

The pipeline check failed because imports are not properly sorted. Run `make imports` to fix the import organization.

66-85: Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and requires kubectl in PATH. This should use the Kubernetes client API via the pod exec subresource as detailed in the previous review comment.
🧹 Nitpick comments (1)
tests/trainer/utils/utils_notebook.go (1)
55-64: Eliminate redundant GetPods call.

Lines 62-63 repeat the same GetPods call that was already executed in the Eventually block (lines 58-60), which is inefficient. Apply this diff to store and reuse the pods result:

```diff
 func WaitForNotebookPodRunning(test Test, namespace string) (string, string) {
     labelSelector := fmt.Sprintf("notebook-name=%s", common.NOTEBOOK_CONTAINER_NAME)
+    var pods []corev1.Pod
     test.Eventually(func() []corev1.Pod {
-        return GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+        pods = GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+        return pods
     }, TestTimeoutLong).Should(HaveLen(1), "Expected exactly one notebook pod")
-    pods := GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
     return pods[0].Name, pods[0].Spec.Containers[0].Name
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/trainer/utils/utils_cluster_prep.go (1 hunks)
- tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/utils/utils_cluster_prep.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/utils/utils_notebook.go (4)
- tests/common/support/test.go (1): Test (34-45)
- tests/common/notebook.go (3): ContainerSize (72-72), CreateNotebook (104-185), NOTEBOOK_CONTAINER_NAME (37-37)
- tests/common/support/core.go (4): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240), GetPods (111-116)
- tests/common/support/support.go (1): TestTimeoutLong (35-35)
🔇 Additional comments (2)
tests/trainer/utils/utils_notebook.go (2)
34-46: LGTM! Papermill command construction is sound.

The shell command properly installs dependencies, executes the notebook with papermill, writes a completion marker, and uses `sleep infinity` to keep the container running for inspection. The use of `set -e` ensures proper error propagation.

48-53: LGTM! Notebook setup orchestration is correct.

The function properly orchestrates the creation of a ConfigMap, PVC (10Gi is reasonable for test scenarios), and Notebook CR, passing through all necessary parameters.
| " dataset = datasets.FashionMNIST(\n", | ||
| " \"./data\",\n", | ||
| " train=True,\n", | ||
| " download=True,\n", |
This will become an issue for running tests on disconnected clusters.
In Trainer v1 tests we uploaded dataset on AWS S3, it is downloaded from there if AWS env variables are declared - https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/kfto/resources/kfto_sdk_mnist.py#L67
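A rough sketch of how that pattern could carry over to this notebook; the environment variable names and the boto3-based download are illustrative assumptions, not the actual Trainer v1 helper:

```python
import os

from torchvision import datasets, transforms

def load_fashion_mnist():
    # Prefer a private S3 mirror when the (illustrative) AWS variables are set,
    # so the test also works on disconnected clusters.
    bucket = os.environ.get("AWS_STORAGE_BUCKET")
    endpoint = os.environ.get("AWS_DEFAULT_ENDPOINT")
    if bucket and endpoint:
        import boto3  # assumed to be installed in the runtime image
        s3 = boto3.client(
            "s3",
            endpoint_url=endpoint,
            aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
            aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        )
        os.makedirs("./data/FashionMNIST/raw", exist_ok=True)
        for name in (
            "train-images-idx3-ubyte.gz",
            "train-labels-idx1-ubyte.gz",
            "t10k-images-idx3-ubyte.gz",
            "t10k-labels-idx1-ubyte.gz",
        ):
            s3.download_file(bucket, f"FashionMNIST/raw/{name}", f"./data/FashionMNIST/raw/{name}")
        download = False  # raw files are already in place
    else:
        download = True  # fall back to the public download
    return datasets.FashionMNIST("./data", train=True, download=download, transform=transforms.ToTensor())
```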
| "for runtime in client.list_runtimes():\n", | ||
| " print(runtime)\n", | ||
| " if runtime.name == \"universal\": # Update to actual universal image runtime once available\n", | ||
| " torch_runtime = runtime" |
Why this approach instead of getting universal runtime by client.get_runtime("universal")?
That notebook is basically a copy of the upstream test that the SDK upstream relies on. I wanted to keep it as close as possible to the original, but I guess there's no harm in moving to get_runtime. Thanks Karel!
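A small sketch of that change; the assumption that get_runtime raises when the name is absent, and the torch-* fallback scan, should be verified against the installed SDK version:

```python
# Direct lookup instead of scanning list_runtimes(); fall back to any
# torch-* runtime if "universal" is not installed yet.
try:
    torch_runtime = client.get_runtime("universal")
except Exception:
    torch_runtime = next(
        (r for r in client.list_runtimes() if r.name.startswith("torch-")),
        None,
    )
assert torch_runtime is not None, "no suitable training runtime found"
```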
| " \"cpu\": 2,\n", | ||
| " \"memory\": \"8Gi\",\n", | ||
| " },\n", | ||
| " packages_to_install=[\"torchvision\"],\n", |
I think it would be good to install a specific version, to make sure that a future upgrade doesn't break the test.
yes I think it makes sense. Thanks.
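A sketch of the pinned variant; resources_per_node and packages_to_install mirror the snippet quoted above, while the train()/CustomTrainer call shape, num_nodes, and the exact torchvision version are assumptions:

```python
from kubeflow.trainer import CustomTrainer  # import path assumed

job_id = client.train(
    runtime=torch_runtime,
    trainer=CustomTrainer(
        func=train_fashion_mnist,
        num_nodes=2,  # matches the 2-node CPU training described in the PR
        resources_per_node={
            "cpu": 2,
            "memory": "8Gi",
        },
        # Pinned so a future torchvision release can't silently break the test;
        # 0.20.1 is a placeholder, align it with the torch build in the runtime image.
        packages_to_install=["torchvision==0.20.1"],
    ),
)
```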
```go
}
// TODO: Extend / tweak with universal image runtime once available
for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
    test.Expect(found[name]).To(BeTrue(), fmt.Sprintf("Expected ClusterTrainingRuntime '%s' not found", name))
```
Test verifying what TrainingRuntimes are available is already implemented in https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/trainer/custom_training_runtimes_test.go#L63
Is there any specific reason to check it here too?
Initially I was planning on making this test a smoke test; given that your runtime check is also smoke, I wasn't sure which one would run first, so I decided to add the additional check to capture a missing-runtime error if it happens.
But since I've moved it to sanity, and I'm assuming sanity will run after smoke, I guess it's fine to remove this check? WDYT?
Yes, can be removed
```go
// EnsureNotebookRBAC sets up the Notebook ServiceAccount and RBAC so that notebooks can
// read ClusterTrainingRuntimes (cluster-scoped), and create/read TrainJobs and pod logs in the namespace.
func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
```
I would rather prefer passing the user token into the SDK client to provide the needed access.
IMHO that is the approach the majority of customers would take.
Agree, thanks Karel
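A very rough sketch of the token-based approach; whether TrainerClient accepts a client_configuration argument is an unverified assumption about the Kubeflow SDK, and the host/token variable names are illustrative:

```python
import os

from kubernetes import client as k8s
from kubeflow.trainer import TrainerClient  # import path assumed

# Build a bearer-token configuration from the test user's credentials.
cfg = k8s.Configuration()
cfg.host = os.environ["CLUSTER_API_URL"]  # illustrative variable name
cfg.api_key = {"authorization": "Bearer " + os.environ["NOTEBOOK_USER_TOKEN"]}

# Hypothetical: hand the configuration to the SDK client instead of relying on
# RBAC granted to the notebook ServiceAccount.
client = TrainerClient(client_configuration=cfg)
```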
```go
// PollNotebookCompletionMarker polls the given marker file inside the notebook pod until SUCCESS/FAILURE or timeout.
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
    var finalErr error
    test.Eventually(func() bool {
```
Thinking about whether it would be simpler to print the status into the Workbench log directly.
It could potentially mix with other log messages produced by the Workbench, but it would avoid having to access the Pod filesystem.
WDYT?
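On the notebook side that could be as simple as emitting a uniquely prefixed status line; the prefix and the run_training placeholder below are hypothetical, and the test would then match the prefix in the pod log instead of exec'ing cat on a marker file:

```python
STATUS_PREFIX = "E2E-NOTEBOOK-STATUS"  # hypothetical, just needs to be unique in the log

try:
    run_training()  # placeholder for the notebook's training cells
    print(f"{STATUS_PREFIX}: SUCCESS", flush=True)
except Exception as exc:
    print(f"{STATUS_PREFIX}: FAILURE: {exc}", flush=True)
    raise  # keep papermill's own failure signal too
```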
```go
// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)
```
Would it make sense to add an assertion checking that the TrainJob is created and successfully finished?
Yes, I think this is very important :)
Thanks
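One possible shape for that check, done from the notebook itself; the get_job accessor and the status values are assumptions about the Kubeflow SDK surface and must be verified, and job_id stands for whatever handle train() returned:

```python
import time

deadline = time.time() + 15 * 60  # generous ceiling for the 2-node CPU run
status = None
while time.time() < deadline:
    job = client.get_job(name=job_id)  # hypothetical accessor
    status = getattr(job, "status", None)
    if status in ("Succeeded", "Complete", "Failed"):
        break
    time.sleep(10)

assert status in ("Succeeded", "Complete"), f"TrainJob did not complete: {status}"
```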
Addition of initial e2e sanity test for kubeflow sdk
Description
As part of this work we are adding the initial e2e test for the Kubeflow SDK (2-node CPU FashionMNIST training).
The goal is to make it as easy as possible to extend with further notebook-based tests.
How Has This Been Tested?
- "TEST_TIER": "Sanity"
- "ODH_NAMESPACE": "opendatahub"
- "NOTEBOOK_USER_NAME": "xyz"
- "NOTEBOOK_USER_TOKEN": "some_token"
- "NOTEBOOK_IMAGE": "quay.io/bgallagher/universal-image:v2"

Note: The PR will not work without the trainer v2 controller, CRDs, and a custom clusterTrainingRuntime created that uses the universal image under the hood (our dev image: quay.io/bgallagher/universal-image:v2).
Merge criteria:
Summary by CodeRabbit
Tests
Chores