Add support for kserve #877
base: main
Conversation
Reviewer's Guide by Sourcery

This pull request adds support for generating KServe YAML definitions, enabling users to deploy AI models as KServe services within a Kubernetes environment. It introduces a new `Kserve` class that generates the required configuration files.

Sequence diagram for generating KServe configurations:

```mermaid
sequenceDiagram
    participant CLI
    participant Model
    participant Kserve
    CLI->>Model: serve(args)
    Model->>Model: execute_command(model_path, exec_args, args)
    alt args.generate == "kserve"
        Model->>Model: kserve(model_path, args, exec_args)
        Model->>Kserve: __init__(model_path, image, args, exec_args)
        Kserve->>Kserve: generate()
        Kserve-->>Model: True
        Model-->>CLI: None
    else
        Model-->>CLI: None
    end
```
Hey @rhatdan - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider generating the KServe runtime file only when necessary, instead of every time.
- The KServe code duplicates some logic from other modules; consider refactoring to share code.
Here's what I looked at during the review
- 🟡 General issues: 3 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
ramalama/kserve.py (Outdated)

```python
env_var_string += f"Environment={k}={v}\n"

_gpu = ""
if os.getenv("CUDA_VISIBLE_DEVICES") != "":
```
issue (bug_risk): GPU env var check may be flawed.
The condition `os.getenv("CUDA_VISIBLE_DEVICES") != ""` will return True even if the variable is not set (i.e. `os.getenv` returns `None`). Consider using a check such as `if os.getenv("CUDA_VISIBLE_DEVICES")` to better capture whether the variable is defined and non-empty.
```yaml
resources:
  limits:
    cpu: "6"
    memory: 24Gi{gpu}
```
issue (bug_risk): Potential undefined variable `gpu`.
If neither `CUDA_VISIBLE_DEVICES` nor `HIP_VISIBLE_DEVICES` is set, the variable `gpu` will not be defined before it's used in the f-string. Initializing `gpu` to an empty string by default would prevent a potential `NameError`.
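A minimal sketch of how both issues above could be addressed together; the helper name is illustrative and not part of the PR:

```python
import os


def detect_gpu_resource() -> str:
    """Return a Kubernetes resource line for the detected accelerator, or "" if none."""
    gpu = ""  # default so the value is always defined, avoiding a NameError
    # os.getenv() returns None when the variable is unset, so a plain
    # truthiness check covers both "unset" and "set to an empty string".
    if os.getenv("CUDA_VISIBLE_DEVICES"):
        gpu = 'nvidia.com/gpu: "1"'
    elif os.getenv("HIP_VISIBLE_DEVICES"):
        gpu = 'amd.com/gpu: "1"'
    return gpu
```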
ramalama/kserve.py (Outdated)

```python
outfile = self.name + "-kserve-runtime.yaml"
outfile = outfile.replace(":", "-")
print(f"Generating kserve runtime file: {outfile}")
with open(outfile, 'w') as c:
```
issue (complexity): Consider using a templating engine like Jinja2 to generate the YAML files, which will reduce code duplication and improve readability.
Consider abstracting YAML creation into a dedicated templating helper to reduce the inline formatting repetition. For example, you might use Jinja2 templates (or PyYAML with dictionaries) to consolidate and reuse the YAML structure. Here's a concise example using Jinja2:

```python
from jinja2 import Template


def create_yaml(template_str, **params):
    return Template(template_str).render(**params)


# Define your runtime YAML template once.
KSERVE_RUNTIME_TMPL = """
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{ runtime }}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {{ model }}"
    opendatahub.io/recommended-accelerators: '["{{ gpu }}"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '{{ port }}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {{ image }}
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args: ["--port={{ port }}", "--model=/mnt/models", "--served-model-name={{ name }}"]
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {{ port }}
          protocol: TCP
"""

# In your generate() method:
yaml_content = create_yaml(
    KSERVE_RUNTIME_TMPL,
    runtime=self.runtime,
    model=self.model,
    gpu=_gpu if _gpu else "",
    port=self.args.port,
    image=self.image,
    name=self.name,
)
# Apply the ":" replacement to the whole filename, not just the suffix.
outfile = (self.name + "-kserve-runtime.yaml").replace(":", "-")
with open(outfile, 'w') as c:
    c.write(yaml_content)
```

Repeat a similar approach for the second YAML. This not only reduces repetition but also improves readability and maintainability.
IMHO: jinja2 is not fast; maybe a more readable f"" string template would be the way to go.
Here's a helpful response to the conversation:
You raise a good point about performance. While Jinja2 is powerful for templating, it can introduce overhead. An alternative approach using f-strings can indeed provide a more lightweight and potentially faster solution for YAML generation.
Here's a refined approach using f-strings that maintains readability while avoiding the performance overhead of Jinja2:
```python
def generate_kserve_runtime_yaml(self):
    # Truthiness checks handle both unset and empty env vars.
    gpu = 'nvidia.com/gpu: "1"' if os.getenv("CUDA_VISIBLE_DEVICES") else \
          'amd.com/gpu: "1"' if os.getenv("HIP_VISIBLE_DEVICES") else ""
    env_vars = "\n".join(f"        - name: {k}\n          value: {v}"
                         for k, v in get_env_vars().items())
    yaml_content = f"""
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {self.runtime}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {self.model}"
spec:
  containers:
    - name: kserve-container
      image: {self.image}
      resources:
        requests:
          {gpu}
      env:
{env_vars}
"""
    outfile = f"{self.name}-kserve-runtime.yaml".replace(":", "-")
    with open(outfile, 'w') as c:
        c.write(yaml_content)
```

This approach:
- Uses f-strings for string interpolation
- Keeps the code concise and readable
- Avoids the performance overhead of a full templating engine
- Provides a clean way to handle environment variables and GPU resources
The key advantages are simplicity, performance, and maintainability. It addresses the original suggestion of reducing code duplication while using a more Pythonic approach that aligns with your performance concerns.
Would you like me to elaborate on any part of this implementation?
ramalama/kserve.py (Outdated)

```yaml
      name: vLLM
  containers:
    - name: kserve-container
      image: {self.image}
```
This code looks wrong: as far as I understand from the example code, it will use ramalama as the image and not vLLM.
Moreover, vLLM images are specific to each accelerator, so the same if/else condition used to produce the GPU requirements should be applied here to select the right image.
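A hedged sketch of what accelerator-aware image selection could look like; the helper name and the ROCm image tag are assumptions, not part of the PR:

```python
import os


def select_kserve_image(default_image: str) -> str:
    """Pick a serving image that matches the detected accelerator.

    The image references below are only examples drawn from this thread;
    the ROCm tag in particular is an assumption.
    """
    if os.getenv("CUDA_VISIBLE_DEVICES"):
        return "quay.io/ramalama/cuda:0.8"
    if os.getenv("HIP_VISIBLE_DEVICES"):
        return "quay.io/ramalama/rocm:0.8"  # assumed ROCm counterpart
    return default_image  # fall back to the CPU image
```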
ramalama/kserve.py (Outdated)

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {self.runtime}-runtime
```
I think this is misleading, it produces a llama.cpp-runtime value but this is supposed to be vLLM
You would need to specify --runtime vllm and then you get the correct output. Is kserve totally tied to vllm? Or is there ability to support multiple runtimes?
KServe is multi-runtime: there is a list of runtimes included OOTB, but it is also dynamic, so you can bring your own runtime.
Link to the list? vllm and llama.cpp included?
https://kserve.github.io/website/0.10/modelserving/v1beta1/serving_runtime/
This looks like something different; would we need to hard-code the most common one in the tool? Which one is that? It does not look like KServe gets down to the llama.cpp or vllm level.
docs/ramalama-serve.1.md (Outdated)

| Key     | Description                                                                        |
| ------- | ---------------------------------------------------------------------------------- |
| quadlet | Podman supported container definition for running AI Model under systemd          |
| kserve  | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
Suggested change:

```diff
-| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
+| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes using vLLM |
```
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: llama.cpp-cuda-runtime
```
Should this be granite instead of llama?
That is the runtime. But maybe granite makes more sense.
```yaml
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/ramalama/cuda:0.8
```
Is this the correct image? I see that CPUs are used in the generated InferenceService
What should the InferenceService look like if it was using nvidia/cuda?
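For illustration only, a GPU-backed InferenceService might look roughly like the sketch below; the name, runtime, and storage URI are placeholders drawn from this thread, not output of this PR:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite                                  # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: llama.cpp-cuda-runtime            # placeholder runtime name
      storageUri: oci://quay.io/rhatdan/granite  # placeholder model reference
      resources:
        limits:
          cpu: "10"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "10"
          memory: 24Gi
          nvidia.com/gpu: "1"
```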
The official community image for vLLM is on DockerHub, and the cuda image is `vllm/vllm-openai:${version}`, like `vllm/vllm-openai:v0.8.5`. I think we should use them as the starting point (it is CUDA only).
I suggest considering this as an additional parameter, like `--vllm-version`.
Note 1: We have a quay.io build that looks like `quay.io/vllm/vllm:0.8.4.20250429`, but it is still in progress.
Note 2: RHOAI product builds use a different org and the images look like `quay.io/modh/vllm:rhoai-2.19-cuda` / `quay.io/modh/vllm:rhoai-2.19-rocm`, but I don't think we want to use product images here.
```
### Generate kserve service off of OCI Model car quay.io/ramalama/granite:1.0
$ ramalama serve --pull=never --threads 10 --port 8081 --generate kserve oci://quay.io/rhatdan/granite
```
Is Ramalama only designed for single-node serving?
RamaLama is just a tool to launch AI models in containers; the idea would be to generate Kubernetes content to allow models to run across multiple nodes, but that is not something RamaLama would do from the command line.
I think we can consider this for another iteration: given that RamaLama is mainly designed for local use, I don't expect a user will run a model big enough to require a multi-node GPU setup.
```yaml
limits:
  cpu: "10"
  memory: 24Gi
requests:
  cpu: "10"
  memory: 24Gi
```
We need at least the GPU required too
Suggested change:

```diff
 limits:
   cpu: "10"
   memory: 24Gi
+  nvidia.com/gpu: '1'
 requests:
   cpu: "10"
   memory: 24Gi
+  nvidia.com/gpu: '1'
```
Signed-off-by: Daniel J Walsh <[email protected]>
A friendly reminder that this PR had no activity for 30 days.
Summary by Sourcery

This pull request introduces support for generating KServe configurations, allowing users to deploy AI models on Kubernetes using the KServe framework. It adds a new `kserve` option to the `ramalama serve --generate` command, which generates the necessary YAML files for deploying a model as a KServe service.

New Features:
- Support for generating KServe YAML definitions to deploy AI models as KServe services in Kubernetes.

Enhancements:
- The `ramalama serve` command now accepts a `--generate kserve` option to generate KServe YAML files.