Commit 5f6b092

Add support for kserve
Signed-off-by: Daniel J Walsh <[email protected]>
1 parent b9ffbf8 commit 5f6b092

File tree

- docs/ramalama-serve.1.md
- ramalama/cli.py
- ramalama/kserve.py
- ramalama/model.py
- test/system/040-serve.bats

5 files changed: +209 -6 lines changed

docs/ramalama-serve.1.md

Lines changed: 66 additions & 3 deletions
@@ -69,8 +69,9 @@ Generate specified configuration format for running the AI Model as a service
 
 | Key | Description |
 | ------------ | -------------------------------------------------------------------------|
-| quadlet | Podman supported container definition for running AI Model under systemd |
+| kserve | KServe YAML definition for running the AI Model as a KServe service in Kubernetes |
 | kube | Kubernetes YAML definition for running the AI Model as a service |
+| quadlet | Podman supported container definition for running AI Model under systemd |
 | quadlet/kube | Kubernetes YAML definition for running the AI Model as a service and Podman supported container definition for running the Kube YAML specified pod under systemd|
 
 #### **--help**, **-h**
@@ -112,7 +113,7 @@ On Nvidia based GPU systems, RamaLama defaults to using the
 `nvidia-container-runtime`. Use this option to override this selection.
 
 #### **--port**, **-p**
-port for AI Model server to listen on. It must be available. If not specified,
+port for AI Model server to listen on. It must be available. If not specified,
 the serving port will be 8080 if available, otherwise a free port in 8081-8090 range.
 
 #### **--privileged**
@@ -159,7 +160,7 @@ llama.cpp explains this as:
 
 The higher the number is the more creative the response is, but more likely to hallucinate when set too high.
 
-Usage: Lower numbers are good for virtual assistants where we need deterministic responses. Higher numbers are good for roleplay or creative tasks like editing stories
+Usage: Lower numbers are good for virtual assistants where we need deterministic responses. Higher numbers are good for roleplay or creative tasks like editing stories
 
 #### **--threads**, **-t**
 Maximum number of cpu threads to use.
@@ -187,6 +188,68 @@ CONTAINER ID IMAGE COMMAND CREATED
 3f64927f11a5 quay.io/ramalama/ramalama:latest /usr/bin/ramalama... 17 seconds ago Up 17 seconds 0.0.0.0:8082->8082/tcp ramalama_YMPQvJxN97
 ```
 
+### Generate kserve service off of OCI Model car quay.io/ramalama/granite:1.0
+```
+./bin/ramalama serve --port 8081 --generate kserve oci://quay.io/ramalama/granite:1.0
+Generating kserve runtime file: granite-1.0-kserve-runtime.yaml
+Generating kserve file: granite-1.0-kserve.yaml
+
+$ cat granite-1.0-kserve-runtime.yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: llama.cpp-runtime
+spec:
+  annotations:
+    prometheus.io/port: '8081'
+    prometheus.io/path: '/metrics'
+  multiModel: false
+  supportedModelFormats:
+  - autoSelect: true
+    name: vLLM
+  containers:
+  - name: kserve-container
+    image: quay.io/ramalama/ramalama:latest
+    command:
+    - python
+    - -m
+    - vllm.entrypoints.openai.api_server
+    args:
+    - "--port=8081"
+    - "--model=/mnt/models"
+    - "--served-model-name={.Name}"
+    env:
+    - name: HF_HOME
+      value: /tmp/hf_home
+    ports:
+    - containerPort: 8081
+      protocol: TCP
+
+$ cat granite-1.0-kserve.yaml
+# RamaLama quay.io/ramalama/granite:1.0 AI Model Service
+# kubectl create -f to import this kserve file into Kubernetes.
+#
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: huggingface-quay.io/ramalama/granite:1.0
+spec:
+  predictor:
+    model:
+      modelFormat:
+        name: vLLM
+      storageUri: "oci://quay.io/ramalama/granite:1.0"
+      resources:
+        limits:
+          cpu: "6"
+          memory: 24Gi
+          nvidia.com/gpu: "1"
+        requests:
+          cpu: "6"
+          memory: 24Gi
+          nvidia.com/gpu: "1"
+```
+
 ### Generate quadlet service off of HuggingFace granite Model
 ```
 $ ramalama serve --name MyGraniteServer --generate=quadlet granite
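Editor's note: for readers who want to sanity-check the generated files before importing them, a minimal sketch (assuming PyYAML is installed; it is not part of this change) that parses the generated InferenceService file and prints the fields kubectl will act on:

```
import yaml  # PyYAML, assumed available; not a dependency added by this commit

# Load the InferenceService definition produced by `ramalama serve --generate kserve`.
with open("granite-1.0-kserve.yaml") as f:
    svc = yaml.safe_load(f)

print(svc["kind"])                                      # InferenceService
print(svc["spec"]["predictor"]["model"]["storageUri"])  # oci://quay.io/ramalama/granite:1.0
```

As the generated header comment says, the files are then imported with `kubectl create -f granite-1.0-kserve-runtime.yaml` and `kubectl create -f granite-1.0-kserve.yaml`.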

ramalama/cli.py

Lines changed: 6 additions & 1 deletion
@@ -861,7 +861,12 @@ def serve_parser(subparsers):
     )
     parser.add_argument(
         "--generate",
-        choices=["quadlet", "kube", "quadlet/kube"],
+        choices=[
+            "kserve",
+            "kube",
+            "quadlet",
+            "quadlet/kube",
+        ],
         help="generate specified configuration format for running the AI Model as a service",
     )
     parser.add_argument(
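Editor's note: the new choice surfaces through argparse's built-in validation; a minimal standalone sketch (not the actual ramalama parser, which defines many more options):

```
import argparse

# Stripped-down reproduction of the --generate option added above.
parser = argparse.ArgumentParser(prog="ramalama serve")
parser.add_argument(
    "--generate",
    choices=["kserve", "kube", "quadlet", "quadlet/kube"],
    help="generate specified configuration format for running the AI Model as a service",
)

print(parser.parse_args(["--generate", "kserve"]).generate)  # -> kserve
# Any other value (e.g. "bogus") makes argparse exit with an "invalid choice"
# error, which is the message the updated test in 040-serve.bats matches.
```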

ramalama/kserve.py

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+import os
+
+from ramalama.common import get_accel_env_vars
+
+def create_yaml(template_str, params):
+    return(template_str.format(**params))
+
+
+KSERVE_RUNTIME_TMPL = """
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: {runtime}-runtime
+spec:
+  annotations:
+    prometheus.io/port: '{port}'
+    prometheus.io/path: '/metrics'
+  multiModel: false
+  supportedModelFormats:
+  - autoSelect: true
+    name: vLLM
+  containers:
+  - name: kserve-container
+    image: {image}
+    command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
+    args: ["--port={port}", "--model=/mnt/models", "--served-model-name={name}"]
+    env:
+    - name: HF_HOME
+      value: /tmp/hf_home
+    ports:
+    - containerPort: {port}
+      protocol: TCP
+"""
+
+KSERVE_MODEL_SERVICE = """\
+# RamaLama {name} AI Model Service
+# kubectl create -f to import this kserve file into Kubernetes.
+#
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: huggingface-{name}
+spec:
+  predictor:
+    model:
+      modelFormat:
+        name: vLLM
+      storageUri: "oci://{model}"
+      resources:
+        limits:
+          cpu: "6"
+          memory: 24Gi{gpu}
+        requests:
+          cpu: "6"
+          memory: 24Gi{gpu}
+"""
+
+
+class Kserve:
+    def __init__(self, model, chat_template_path, image, args, exec_args):
+        self.ai_image = model
+        if hasattr(args, "MODEL"):
+            self.ai_image = args.MODEL
+        self.ai_image = self.ai_image.removeprefix("oci://")
+        if args.name:
+            self.name = args.name
+        else:
+            self.name = os.path.basename(self.ai_image)
+
+        self.model = model.removeprefix("oci://")
+        self.args = args
+        self.exec_args = exec_args
+        self.image = image
+        self.runtime = args.runtime
+
+    def generate(self):
+        env_var_string = ""
+        for k, v in get_accel_env_vars().items():
+            env_var_string += f"Environment={k}={v}\n"
+
+        _gpu = ""
+        if os.getenv("CUDA_VISIBLE_DEVICES") != "":
+            _gpu = 'nvidia.com/gpu'
+        elif os.getenv("HIP_VISIBLE_DEVICES") != "":
+            _gpu = 'amd.com/gpu'
+
+        outfile = self.name + "-kserve-runtime.yaml"
+        outfile = outfile.replace(":", "-")
+        print(f"Generating kserve runtime file: {outfile}")
+
+        # In your generate() method:
+        yaml_content = create_yaml(
+            KSERVE_RUNTIME_TMPL,
+            {
+                'runtime' : self.runtime,
+                'model' : self.model,
+                'gpu' : _gpu if _gpu else "",
+                'port' : self.args.port,
+                'image' : self.image,
+                'name' : self.name,
+            }
+        )
+        with open(outfile, 'w') as c:
+            c.write(yaml_content)
+
+        outfile = self.name + "-kserve.yaml"
+        outfile = outfile.replace(":", "-")
+        print(f"Generating kserve file: {outfile}")
+        yaml_content = create_yaml(
+            KSERVE_MODEL_SERVICE,
+            {
+                'name': self.name,
+                'model': self.model,
+                'gpu':_gpu if _gpu else "",
+            }
+        )
+        with open(outfile, 'w') as c:
+            c.write(yaml_content)
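Editor's note: a minimal sketch of how the new class is driven (a hypothetical standalone invocation; inside RamaLama the object is built by Model.kserve() in ramalama/model.py, shown below):

```
from argparse import Namespace

from ramalama.kserve import Kserve

# Hypothetical stand-in for the parsed CLI arguments; only the attributes
# that Kserve actually reads are set here.
args = Namespace(
    MODEL="oci://quay.io/ramalama/granite:1.0",
    name=None,          # falls back to the basename of the image ("granite:1.0")
    port=8081,
    runtime="llama.cpp",
)

kserve = Kserve(
    "oci://quay.io/ramalama/granite:1.0",    # model
    None,                                    # chat_template_path (accepted but unused)
    "quay.io/ramalama/ramalama:latest",      # image
    args,
    exec_args=None,
)
# Writes granite-1.0-kserve-runtime.yaml and granite-1.0-kserve.yaml,
# matching the example shown in docs/ramalama-serve.1.md above.
kserve.generate()
```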

ramalama/model.py

Lines changed: 8 additions & 1 deletion
@@ -23,6 +23,7 @@
 from ramalama.console import EMOJI
 from ramalama.engine import Engine, dry_run
 from ramalama.gguf_parser import GGUFInfoParser
+from ramalama.kserve import Kserve
 from ramalama.kube import Kube
 from ramalama.model_inspect import GGUFModelInfo, ModelInfoBase
 from ramalama.model_store import ModelStore
@@ -553,7 +554,9 @@ def handle_runtime(self, args, exec_args, exec_model_path):
 
     def generate_container_config(self, model_path, chat_template_path, args, exec_args):
         self.image = accel_image(CONFIG, args)
-        if args.generate == "quadlet":
+        if args.generate == "kserve":
+            self.kserve(model_path, chat_template_path, args, exec_args)
+        elif args.generate == "quadlet":
             self.quadlet(model_path, chat_template_path, args, exec_args)
         elif args.generate == "kube":
             self.kube(model_path, chat_template_path, args, exec_args)
@@ -613,6 +616,10 @@ def serve(self, args, quiet=False):
 
         self.execute_command(model_path, exec_args, args)
 
+    def kserve(self, model, chat_template_path, args, exec_args):
+        kserve = Kserve(model, chat_template_path, self.image, args, exec_args)
+        kserve.generate()
+
     def quadlet(self, model, chat_template, args, exec_args):
         quadlet = Quadlet(model, chat_template, self.image, args, exec_args)
         quadlet.generate()

test/system/040-serve.bats

Lines changed: 11 additions & 1 deletion
@@ -197,7 +197,17 @@ verify_begin=".*run --rm"
 
     rm tinyllama.container
    run_ramalama 2 serve --name=${name} --port 1234 --generate=bogus tiny
-    is "$output" ".*error: argument --generate: invalid choice: 'bogus' (choose from.*quadlet.*kube.*quadlet/kube.*)" "Should fail"
+    is "$output" ".*error: argument --generate: invalid choice: 'bogus' (choose from.*kserve.*kube.*quadlet.*quadlet/kube.*)" "Should fail"
+}
+
+@test "ramalama serve --generate=kserve" {
+    model=smollm:135m
+    fixed_model=$(echo $model | tr ':' '-')
+    name=c_$(safename)
+    run_ramalama pull ${model}
+    run_ramalama -q serve --port 1234 --generate=kserve ${model}
+    is "$output" "Generating kserve runtime file: ${fixed_model}-kserve-runtime.yaml.*" "generate kserve runtime file"
+    is "$output" ".*Generating kserve file: ${fixed_model}-kserve.yaml" "generate kserve file"
 }
 
 @test "ramalama serve --generate=quadlet and --generate=kube with OCI" {
