
Conversation

@rhatdan (Member) commented Feb 24, 2025

Summary by Sourcery

This pull request introduces support for generating KServe configurations, allowing users to deploy AI models on Kubernetes using the KServe framework. It adds a new kserve option to the ramalama serve --generate command, which generates the necessary YAML files for deploying a model as a KServe service.

New Features:

  • Adds support for generating KServe YAML definitions for running AI models as a service in Kubernetes, enabling deployment and management of models using the KServe framework.

Enhancements:

  • The ramalama serve command now accepts a --generate kserve option to generate KServe YAML files.
  • The generated KServe YAML files include resource requests and limits for CPU, memory, and GPU (if available).

@sourcery-ai bot (Contributor) commented Feb 24, 2025

Reviewer's Guide by Sourcery

This pull request adds support for generating KServe YAML definitions, enabling users to deploy AI models as KServe services within a Kubernetes environment. It introduces a new Kserve class responsible for generating the necessary YAML files and integrates it into the existing ramalama serve command-line interface.

Sequence diagram for generating KServe configuration

sequenceDiagram
    participant CLI
    participant Model
    participant Kserve

    CLI->>Model: serve(args)
    Model->>Model: execute_command(model_path, exec_args, args)
    alt args.generate == "kserve"
        Model->>Model: kserve(model_path, args, exec_args)
        Model->>Kserve: __init__(model_path, image, args, exec_args)
        Kserve->>Kserve: generate()
        Kserve-->>Model: True
        Model-->>CLI: None
    else
        Model-->>CLI: None
    end
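For orientation, here is a minimal sketch of the interface implied by the guide and the diagram above; the constructor signature comes from the diagram, while the attribute handling and the dispatch described in the comments are assumptions rather than the PR's exact code.

```python
class Kserve:
    def __init__(self, model_path, image, args, exec_args):
        # Names follow the sequence diagram; what is stored and how is assumed.
        self.model_path = model_path  # resolved path to the model to serve
        self.image = image            # container image for the kserve-container
        self.args = args              # parsed CLI arguments (port, name, ...)
        self.exec_args = exec_args    # command the serving container should run

    def generate(self):
        # Writes the KServe ServingRuntime and InferenceService YAML files,
        # then reports success back to the caller (see diagram).
        return True
```

On the Model side, execute_command() branches on args.generate == "kserve" and delegates to Kserve(...).generate(), as the diagram shows.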

File-Level Changes

Change: Introduces support for generating KServe YAML definitions for deploying AI models as KServe services in Kubernetes.

Details:
  • Adds 'kserve' as a choice to the --generate argument in the serve command.
  • Creates a new Kserve class to handle the generation of KServe YAML files.
  • Implements the generate method in the Kserve class to create the necessary YAML files for deploying a model with KServe.
  • Adds logic to the generate_container_config method to call the new KServe functionality.
  • Adds documentation for generating kserve service.

Files:
  docs/ramalama-serve.1.md
  ramalama/model.py
  ramalama/cli.py
  ramalama/kserve.py


@sourcery-ai bot left a comment

Hey @rhatdan - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider generating the KServe runtime file only when necessary, instead of every time.
  • The KServe code duplicates some logic from other modules; consider refactoring to share code.
Here's what I looked at during the review
  • 🟡 General issues: 3 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


env_var_string += f"Environment={k}={v}\n"

_gpu = ""
if os.getenv("CUDA_VISIBLE_DEVICES") != "":

issue (bug_risk): GPU env var check may be flawed.

The condition os.getenv("CUDA_VISIBLE_DEVICES") != "" will return True even if the variable is not set (i.e. returns None). Consider using a check such as if os.getenv("CUDA_VISIBLE_DEVICES") to better capture whether the variable is defined and non-empty.
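A minimal illustration of the difference (standard library behavior, not code from the PR):

```python
import os

# When the variable is unset, os.getenv() returns None, and None != "" is True,
# so the original check wrongly takes the GPU branch.
os.environ.pop("CUDA_VISIBLE_DEVICES", None)
print(os.getenv("CUDA_VISIBLE_DEVICES") != "")   # True: unset, yet the check passes
print(bool(os.getenv("CUDA_VISIBLE_DEVICES")))   # False: unset or empty fails, as intended
```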

resources:
  limits:
    cpu: "6"
    memory: 24Gi{gpu}

issue (bug_risk): Potential undefined variable 'gpu'.

If neither CUDA_VISIBLE_DEVICES nor HIP_VISIBLE_DEVICES is set, the variable 'gpu' will not be defined before it's used in the f-string. Initializing 'gpu' to an empty string by default would prevent a potential NameError.
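One way to address this, sketched with the resource keys used elsewhere in this review (the YAML indentation here is an assumption):

```python
import os

gpu = ""  # default, so the f-string below can never raise NameError
if os.getenv("CUDA_VISIBLE_DEVICES"):
    gpu = '\n    nvidia.com/gpu: "1"'
elif os.getenv("HIP_VISIBLE_DEVICES"):
    gpu = '\n    amd.com/gpu: "1"'

resources = f"""resources:
  limits:
    cpu: "6"
    memory: 24Gi{gpu}"""
```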

outfile = self.name + "-kserve-runtime.yaml"
outfile = outfile.replace(":", "-")
print(f"Generating kserve runtime file: {outfile}")
with open(outfile, 'w') as c:

issue (complexity): Consider using a templating engine like Jinja2 to generate the YAML files, which will reduce code duplication and improve readability.

Consider abstracting YAML creation into a dedicated templating helper to reduce the inline formatting repetition. For example, you might use Jinja2 templates (or PyYAML with dictionaries) to consolidate and reuse the YAML structure. Here’s a concise example using Jinja2:

from jinja2 import Template

def create_yaml(template_str, **params):
    return Template(template_str).render(**params)

# Define your runtime YAML template once.
KSERVE_RUNTIME_TMPL = """
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{ runtime }}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {{ model }}"
    opendatahub.io/recommended-accelerators: '["{{ gpu }}"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '{{ port }}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {{ image }}
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args: ["--port={{ port }}", "--model=/mnt/models", "--served-model-name={{ name }}"]
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {{ port }}
          protocol: TCP
"""

# In your generate() method:
yaml_content = create_yaml(KSERVE_RUNTIME_TMPL,
                           runtime=self.runtime,
                           model=self.model,
                           gpu=_gpu if _gpu else "",
                           port=self.args.port,
                           image=self.image,
                           name=self.name)
outfile = (self.name + "-kserve-runtime.yaml").replace(":", "-")  # parenthesize so ':' in the model name is replaced too
with open(outfile, 'w') as c:
    c.write(yaml_content)

Repeat a similar approach for the second YAML. This not only reduces repetition but also improves readability and maintainability.

Contributor
IMHO: jinja2 is not fast; maybe a more readable f-string template would be a better way.

Contributor

You raise a good point about performance. While Jinja2 is powerful for templating, it can introduce overhead. An alternative approach using f-strings can indeed provide a more lightweight and potentially faster solution for YAML generation.

Here's a refined approach using f-strings that maintains readability while avoiding the performance overhead of Jinja2:

def generate_kserve_runtime_yaml(self):
    gpu = 'nvidia.com/gpu: "1"' if os.getenv("CUDA_VISIBLE_DEVICES") else \
          'amd.com/gpu: "1"' if os.getenv("HIP_VISIBLE_DEVICES") else ""
    
    env_vars = "\n".join(f"        - name: {k}\n          value: {v}" for k, v in get_env_vars().items())
    
    yaml_content = f"""
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {self.runtime}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {self.model}"
spec:
  containers:
    - name: kserve-container
      image: {self.image}
      resources:
        requests:
          {gpu}
      env:
{env_vars}
"""
    
    outfile = f"{self.name}-kserve-runtime.yaml".replace(":", "-")
    with open(outfile, 'w') as c:
        c.write(yaml_content)

This approach:

  • Uses f-strings for string interpolation
  • Keeps the code concise and readable
  • Avoids the performance overhead of a full templating engine
  • Provides a clean way to handle environment variables and GPU resources

The key advantages are simplicity, performance, and maintainability. It addresses the original suggestion of reducing code duplication while using a more Pythonic approach that aligns with your performance concerns.


name: vLLM
containers:
- name: kserve-container
image: {self.image}


This code looks wrong: as far as I understand from the example output, it will use ramalama as the image rather than vLLM.
Moreover, vLLM images are accelerator-specific, so the same if/else condition used to produce the GPU requirements should be applied here to select the right image (see the sketch below).
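A hypothetical sketch of that idea, not the PR's code: reuse the accelerator checks that drive the GPU resource requests to pick a matching image. The image names are the ones mentioned in this thread; the CPU fallback is a placeholder.

```python
import os

def select_kserve_image(vllm_version: str = "v0.8.5") -> str:
    if os.getenv("CUDA_VISIBLE_DEVICES"):
        return f"vllm/vllm-openai:{vllm_version}"   # CUDA-only community image (see below)
    if os.getenv("HIP_VISIBLE_DEVICES"):
        return "quay.io/modh/vllm:rhoai-2.19-rocm"  # ROCm build mentioned later in the thread
    return "quay.io/ramalama/ramalama:latest"       # placeholder CPU image, name assumed
```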

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: {self.runtime}-runtime


I think this is misleading: it produces a llama.cpp-runtime value, but this is supposed to be vLLM.

Member Author (@rhatdan)

You would need to specify --runtime vllm, and then you get the correct output. Is kserve totally tied to vllm, or is there the ability to support multiple runtimes?


KServe is multi-runtime: there is a list of runtimes included OOTB, but it is also dynamic, so you can bring your own runtime.

Member Author (@rhatdan)

Link to the list? Are vllm and llama.cpp included?

Member Author (@rhatdan)

https://kserve.github.io/website/0.10/modelserving/v1beta1/serving_runtime/

This looks like something different; would we need to hard-code the most common one in the tool? Which one would that be? It does not look like kserve gets down to the llama.cpp or vllm level.

| Key | Description |
| ------------ | -------------------------------------------------------------------------|
| quadlet | Podman supported container definition for running AI Model under systemd |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |


Suggested change
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes using vLLM |

@rhatdan force-pushed the kserve branch 2 times, most recently from 4eaadcf to 7f86ca2 on March 12, 2025 12:44
@rhatdan force-pushed the kserve branch 4 times, most recently from 622e665 to 5f6b092 on April 29, 2025 19:28
@rhatdan force-pushed the kserve branch 3 times, most recently from 4c14c21 to a135ddd on April 30, 2025 15:04
@rhatdan changed the title from "[WIP] Add support for kserve" to "Add support for kserve" on Apr 30, 2025
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: llama.cpp-cuda-runtime


Should this be granite instead of llama?

Member Author (@rhatdan)

That is the runtime. But maybe granite makes more sense.

name: vLLM
containers:
- name: kserve-container
image: quay.io/ramalama/cuda:0.8


Is this the correct image? I see that CPUs are used in the generated InferenceService

Member Author (@rhatdan)

What should the InferenceService look like if it were using nvidia/cuda?


The official community image for vLLM is on DockerHub, and the CUDA image is vllm/vllm-openai:${version}, e.g. vllm/vllm-openai:v0.8.5. I think we should use them as a starting point (it is CUDA only).

I suggest considering this as an additional parameter, like --vllm-version.

Note 1:
We have a quay.io build that looks like quay.io/vllm/vllm:0.8.4.20250429, but it is still in progress.

Note 2:
RHOAI product builds use a different org, and the images look like
quay.io/modh/vllm:rhoai-2.19-cuda / quay.io/modh/vllm:rhoai-2.19-rocm,
but I don't think we want to use product images here.
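A possible shape for that flag, sketched with argparse; the flag name follows the suggestion above, while the default tag and the wiring are assumptions, not the PR's code.

```python
import argparse

parser = argparse.ArgumentParser(prog="ramalama serve")
parser.add_argument(
    "--vllm-version",
    default="v0.8.5",  # example tag only
    help="vLLM image tag used in the generated KServe ServingRuntime "
         "(vllm/vllm-openai:<version>)",
)

args = parser.parse_args(["--vllm-version", "v0.8.5"])
image = f"vllm/vllm-openai:{args.vllm_version}"
print(image)  # vllm/vllm-openai:v0.8.5
```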


### Generate kserve service off of OCI Model car quay.io/ramalama/granite:1.0
```
$ ramalama serve --pull=never --threads 10 --port 8081 --generate kserve oci://quay.io/rhatdan/granite
```


Is Ramalama only designed for single-node serving?

Member Author (@rhatdan)

RamaLama is just a tool to launch AI models in containers; the idea would be to generate Kubernetes content to allow models to run across multiple nodes, but that is not something RamaLama would do from the command line.


I think we can consider this for another iteration: given that ramalama is mainly designed for local use, I don't expect a user will run a model big enough to require a multi-node GPU setup.


Comment on lines +241 to +246
limits:
  cpu: "10"
  memory: 24Gi
requests:
  cpu: "10"
  memory: 24Gi


We need at least the GPU requirement too.

Suggested change
limits:
  cpu: "10"
  memory: 24Gi
requests:
  cpu: "10"
  memory: 24Gi
limits:
  cpu: "10"
  memory: 24Gi
  nvidia.com/gpu: '1'
requests:
  cpu: "10"
  memory: 24Gi
  nvidia.com/gpu: '1'

Signed-off-by: Daniel J Walsh <[email protected]>
@github-actions

A friendly reminder that this PR had no activity for 30 days.
