Commit f3697a9

MichaelClifford authored and leseb committed
incorporate #118 from redhat-et/ilab-on-ocp
Signed-off-by: Michael Clifford <[email protected]>
Co-authored-by: Michael Clifford <[email protected]>
Co-authored-by: Sébastien Han <[email protected]>
1 parent cbf4a40 commit f3697a9

2 files changed: +139 -1 lines changed

instructlab/standalone/README.md

Lines changed: 138 additions & 0 deletions
@@ -9,6 +9,144 @@ of models without relying on centralized orchestration tools like KubeFlow.
 The `standalone.py` tool provides support for fetching generated SDG (Synthetic Data Generation) data from an AWS S3 compatible object store.
 While AWS S3 is supported, alternative object storage solutions such as Ceph, Nooba, and MinIO are also compatible.

+## Overall end-to-end workflow
+
+```text
++-------------------------------+
+|        Kubernetes Job         |
+|        "data-download"        |
++-------------------------------+
+|        Init Container         |
+| "download-data-object-store"  |
+|   (Fetches data from object   |
+|           storage)            |
++-------------------------------+
+|        Main Container         |
+|     "sdg-data-preprocess"     |
+|   (Processes the downloaded   |
+|             data)             |
++-------------------------------+
+                |
+                v
++-------------------------------+
+|    "watch for completion"     |
++-------------------------------+
+                |
+                v
++-----------------------------------+
+|  PytorchJob CR training phase 1   |
+|                                   |
+|      +---------------------+      |
+|      |     Master Pod      |      |
+|      |     (Trains and     |      |
+|      |   Coordinates the   |      |
+|      |     distributed     |      |
+|      |      training)      |      |
+|      +---------------------+      |
+|                 |                 |
+|                 v                 |
+|      +---------------------+      |
+|      |    Worker Pod 1     |      |
+|      |  (Handles part of   |      |
+|      |    the training)    |      |
+|      +---------------------+      |
+|                 |                 |
+|                 v                 |
+|      +---------------------+      |
+|      |    Worker Pod 2     |      |
+|      |  (Handles part of   |      |
+|      |    the training)    |      |
+|      +---------------------+      |
++-----------------------------------+
+                |
+                v
++-------------------------------+
+|     "wait for completion"     |
++-------------------------------+
+                |
+                v
++-----------------------------------+
+|  PytorchJob CR training phase 2   |
+|                                   |
+|      +---------------------+      |
+|      |     Master Pod      |      |
+|      |     (Trains and     |      |
+|      |   Coordinates the   |      |
+|      |     distributed     |      |
+|      |      training)      |      |
+|      +---------------------+      |
+|                 |                 |
+|                 v                 |
+|      +---------------------+      |
+|      |    Worker Pod 1     |      |
+|      |  (Handles part of   |      |
+|      |    the training)    |      |
+|      +---------------------+      |
+|                 |                 |
+|                 v                 |
+|      +---------------------+      |
+|      |    Worker Pod 2     |      |
+|      |  (Handles part of   |      |
+|      |    the training)    |      |
+|      +---------------------+      |
++-----------------------------------+
+                |
+                v
++-------------------------------+
+|     "wait for completion"     |
++-------------------------------+
+                |
+                v
++-------------------------------+
+|        Kubernetes Job         |
+|        "eval-mt-bench"        |
++-------------------------------+
+|        Init Container         |
+|      "run-eval-mt-bench"      |
+|  (Runs evaluation on MT Bench)|
++-------------------------------+
+|        Main Container         |
+|  "output-eval-mt-bench-scores"|
+|  (Outputs evaluation scores)  |
++-------------------------------+
+                |
+                v
++-------------------------------+
+|     "wait for completion"     |
++-------------------------------+
+                |
+                v
++-------------------------------+
+|        Kubernetes Job         |
+|         "eval-final"          |
++-------------------------------+
+|        Init Container         |
+|       "run-eval-final"        |
+|    (Runs final evaluation)    |
++-------------------------------+
+|        Main Container         |
+|  "output-eval-final-scores"   |
+|   (Outputs final evaluation   |
+|            scores)            |
++-------------------------------+
+                |
+                v
++-------------------------------+
+|     "wait for completion"     |
++-------------------------------+
+                |
+                v
++-------------------------------+
+|        Kubernetes Job         |
+|    "trained-model-upload"     |
++-------------------------------+
+|        Main Container         |
+|  "upload-data-object-store"   |
+|  (Uploads the trained model to|
+|      the object storage)      |
++-------------------------------+
+```
+
 ## Requirements

 The `standalone.py` script is designed to run within a Kubernetes environment. The following requirements must be met:

instructlab/standalone/standalone.py

Lines changed: 1 addition & 1 deletion
@@ -3212,4 +3212,4 @@ def upload_trained_model(ctx: click.Context):
 logger.info("Failed to load kube config. Trying in-cluster config")
 kubernetes.config.load_incluster_config()

-cli()
+cli()
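
The workflow diagram added to the README in this commit describes a strictly sequential pipeline: each stage is launched, then watched until it completes, before the next one starts. The sketch below illustrates only that control flow. It is not taken from `standalone.py`: the `launch` and `is_done` callables, the helper names, and the two `training-phase-*` labels are hypothetical stand-ins (the diagram names the Kubernetes Jobs but not the PytorchJob resources), and a real implementation would presumably query the Kubernetes API instead.

```python
import time
from typing import Callable, List, Sequence

# Job names come from the README diagram; the two training labels are
# hypothetical, since the diagram does not name the PytorchJob resources.
STAGES: List[str] = [
    "data-download",
    "training-phase-1",
    "training-phase-2",
    "eval-mt-bench",
    "eval-final",
    "trained-model-upload",
]


def wait_for_completion(
    is_done: Callable[[], bool],
    timeout_s: float = 3600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll is_done() until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while not is_done():
        if clock() >= deadline:
            raise TimeoutError("stage did not complete before the timeout")
        sleep(poll_s)


def run_pipeline(
    stages: Sequence[str],
    launch: Callable[[str], None],
    is_done: Callable[[str], bool],
    **wait_kwargs,
) -> List[str]:
    """Launch each stage in order, blocking until it completes."""
    completed: List[str] = []
    for name in stages:
        launch(name)  # hypothetical: create the Job or PytorchJob
        wait_for_completion(lambda: is_done(name), **wait_kwargs)
        completed.append(name)
    return completed
```

Passing `sleep` and `clock` as parameters keeps the polling loop testable without real delays; any stage that fails to complete raises `TimeoutError` and stops the pipeline before later stages run.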
