
Conversation

@rhatdan (Member) commented Feb 24, 2025

Summary by Sourcery

This pull request introduces support for generating KServe configurations, allowing users to deploy AI models on Kubernetes using the KServe framework. It adds a new kserve option to the ramalama serve --generate command, which generates the necessary YAML files for deploying a model as a KServe service.

New Features:

  • Adds support for generating KServe YAML definitions for running AI models as a service in Kubernetes, enabling deployment and management of models using the KServe framework.

Enhancements:

  • The ramalama serve command now accepts a --generate kserve option to generate KServe YAML files.
  • The generated KServe YAML files include resource requests and limits for CPU, memory, and GPU (if available).

@sourcery-ai bot (Contributor) commented Feb 24, 2025

Reviewer's Guide by Sourcery

This pull request adds support for generating KServe YAML definitions, enabling users to deploy AI models as KServe services within a Kubernetes environment. It introduces a new Kserve class responsible for generating the necessary YAML files and integrates it into the existing ramalama serve command-line interface.

Sequence diagram for generating KServe configuration

sequenceDiagram
    participant CLI
    participant Model
    participant Kserve

    CLI->>Model: serve(args)
    Model->>Model: execute_command(model_path, exec_args, args)
    alt args.generate == "kserve"
        Model->>Model: kserve(model_path, args, exec_args)
        Model->>Kserve: __init__(model_path, image, args, exec_args)
        Kserve->>Kserve: generate()
        Kserve-->>Model: True
        Model-->>CLI: None
    else
        Model-->>CLI: None
    end
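For orientation, here is a minimal sketch of the interface implied by the guide and the diagram above; the constructor signature comes from the diagram, while the attribute handling and the dispatch described in the comments are assumptions rather than the PR's exact code.

```python
class Kserve:
    def __init__(self, model_path, image, args, exec_args):
        # Names follow the sequence diagram; what is stored and how is assumed.
        self.model_path = model_path  # resolved path to the model to serve
        self.image = image            # container image for the kserve-container
        self.args = args              # parsed CLI arguments (port, name, ...)
        self.exec_args = exec_args    # command the serving container should run

    def generate(self):
        # Writes the KServe ServingRuntime and InferenceService YAML files,
        # then reports success back to the caller (see diagram).
        return True
```

On the Model side, execute_command() branches on args.generate == "kserve" and delegates to Kserve(...).generate(), as the diagram shows.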

File-Level Changes

Change: Introduces support for generating KServe YAML definitions for deploying AI models as KServe services in Kubernetes.

Details:
  • Adds 'kserve' as a choice to the --generate argument in the serve command.
  • Creates a new Kserve class to handle the generation of KServe YAML files.
  • Implements the generate method in the Kserve class to create the necessary YAML files for deploying a model with KServe.
  • Adds logic to the generate_container_config method to call the new KServe functionality.
  • Adds documentation for generating kserve service.

Files:
  docs/ramalama-serve.1.md
  ramalama/model.py
  ramalama/cli.py
  ramalama/kserve.py


@sourcery-ai bot left a comment

Hey @rhatdan - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider generating the KServe runtime file only when necessary, instead of every time.
  • The KServe code duplicates some logic from other modules; consider refactoring to share code.
Here's what I looked at during the review
  • 🟡 General issues: 3 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


env_var_string += f"Environment={k}={v}\n"

_gpu = ""
if os.getenv("CUDA_VISIBLE_DEVICES") != "":

issue (bug_risk): GPU env var check may be flawed.

The condition os.getenv("CUDA_VISIBLE_DEVICES") != "" will return True even if the variable is not set (i.e. returns None). Consider using a check such as if os.getenv("CUDA_VISIBLE_DEVICES") to better capture whether the variable is defined and non-empty.
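A minimal illustration of the difference (standard library behavior, not code from the PR):

```python
import os

# When the variable is unset, os.getenv() returns None, and None != "" is True,
# so the original check wrongly takes the GPU branch.
os.environ.pop("CUDA_VISIBLE_DEVICES", None)
print(os.getenv("CUDA_VISIBLE_DEVICES") != "")   # True: unset, yet the check passes
print(bool(os.getenv("CUDA_VISIBLE_DEVICES")))   # False: unset or empty fails, as intended
```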

resources:
  limits:
    cpu: "6"
    memory: 24Gi{gpu}

issue (bug_risk): Potential undefined variable 'gpu'.

If neither CUDA_VISIBLE_DEVICES nor HIP_VISIBLE_DEVICES is set, the variable 'gpu' will not be defined before it's used in the f-string. Initializing 'gpu' to an empty string by default would prevent a potential NameError.
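One way to address this, sketched with the resource keys used elsewhere in this review (the YAML indentation here is an assumption):

```python
import os

gpu = ""  # default, so the f-string below can never raise NameError
if os.getenv("CUDA_VISIBLE_DEVICES"):
    gpu = '\n    nvidia.com/gpu: "1"'
elif os.getenv("HIP_VISIBLE_DEVICES"):
    gpu = '\n    amd.com/gpu: "1"'

resources = f"""resources:
  limits:
    cpu: "6"
    memory: 24Gi{gpu}"""
```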

outfile = self.name + "-kserve-runtime.yaml"
outfile = outfile.replace(":", "-")
print(f"Generating kserve runtime file: {outfile}")
with open(outfile, 'w') as c:

issue (complexity): Consider using a templating engine like Jinja2 to generate the YAML files, which will reduce code duplication and improve readability.

Consider abstracting YAML creation into a dedicated templating helper to reduce the inline formatting repetition. For example, you might use Jinja2 templates (or PyYAML with dictionaries) to consolidate and reuse the YAML structure. Here’s a concise example using Jinja2:

from jinja2 import Template

def create_yaml(template_str, **params):
    return Template(template_str).render(**params)

# Define your runtime YAML template once.
KSERVE_RUNTIME_TMPL = """
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {{ runtime }}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {{ model }}"
    opendatahub.io/recommended-accelerators: '["{{ gpu }}"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '{{ port }}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: {{ image }}
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args: ["--port={{ port }}", "--model=/mnt/models", "--served-model-name={{ name }}"]
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: {{ port }}
          protocol: TCP
"""

# In your generate() method:
yaml_content = create_yaml(KSERVE_RUNTIME_TMPL,
                           runtime=self.runtime,
                           model=self.model,
                           gpu=_gpu if _gpu else "",
                           port=self.args.port,
                           image=self.image,
                           name=self.name)
outfile = (self.name + "-kserve-runtime.yaml").replace(":", "-")  # parenthesize so ':' in the model name is replaced too
with open(outfile, 'w') as c:
    c.write(yaml_content)

Repeat a similar approach for the second YAML. This not only reduces repetition but also improves readability and maintainability.

Contributor
IMHO: jinja2 is not fast; maybe a more readable f-string template would be a better way.

Contributor

You raise a good point about performance. While Jinja2 is powerful for templating, it can introduce overhead. An alternative approach using f-strings can indeed provide a more lightweight and potentially faster solution for YAML generation.

Here's a refined approach using f-strings that maintains readability while avoiding the performance overhead of Jinja2:

def generate_kserve_runtime_yaml(self):
    gpu = 'nvidia.com/gpu: "1"' if os.getenv("CUDA_VISIBLE_DEVICES") else \
          'amd.com/gpu: "1"' if os.getenv("HIP_VISIBLE_DEVICES") else ""
    
    env_vars = "\n".join(f"        - name: {k}\n          value: {v}" for k, v in get_env_vars().items())
    
    yaml_content = f"""
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {self.runtime}-runtime
  annotations:
    openshift.io/display-name: "KServe ServingRuntime for {self.model}"
spec:
  containers:
    - name: kserve-container
      image: {self.image}
      resources:
        requests:
          {gpu}
      env:
{env_vars}
"""
    
    outfile = f"{self.name}-kserve-runtime.yaml".replace(":", "-")
    with open(outfile, 'w') as c:
        c.write(yaml_content)

This approach:

  • Uses f-strings for string interpolation
  • Keeps the code concise and readable
  • Avoids the performance overhead of a full templating engine
  • Provides a clean way to handle environment variables and GPU resources

The key advantages are simplicity, performance, and maintainability. It addresses the original suggestion of reducing code duplication while using a more Pythonic approach that aligns with your performance concerns.


name: vLLM
containers:
- name: kserve-container
image: {self.image}


This code looks wrong: as far as I understand from the example output, it will use ramalama as the image rather than vLLM.
Moreover, vLLM images are accelerator-specific, so the same if/else condition used to produce the GPU requirements should be applied here to select the right image (see the sketch below).
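A hypothetical sketch of that idea, not the PR's code: reuse the accelerator checks that drive the GPU resource requests to pick a matching image. The image names are the ones mentioned in this thread; the CPU fallback is a placeholder.

```python
import os

def select_kserve_image(vllm_version: str = "v0.8.5") -> str:
    if os.getenv("CUDA_VISIBLE_DEVICES"):
        return f"vllm/vllm-openai:{vllm_version}"   # CUDA-only community image (see below)
    if os.getenv("HIP_VISIBLE_DEVICES"):
        return "quay.io/modh/vllm:rhoai-2.19-rocm"  # ROCm build mentioned later in the thread
    return "quay.io/ramalama/ramalama:latest"       # placeholder CPU image, name assumed
```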

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: {self.runtime}-runtime


I think this is misleading: it produces a llama.cpp-runtime value, but this is supposed to be vLLM.

Member Author (@rhatdan)

You would need to specify --runtime vllm, and then you get the correct output. Is kserve totally tied to vllm, or is there the ability to support multiple runtimes?


KServe is multi-runtime: there is a list of runtimes included OOTB, but it is also dynamic, so you can bring your own runtime.

Member Author (@rhatdan)

Link to the list? Are vllm and llama.cpp included?

Member Author (@rhatdan)

https://kserve.github.io/website/0.10/modelserving/v1beta1/serving_runtime/

This looks like something different; would we need to hard-code the most common one in the tool? Which one would that be? It does not look like kserve gets down to the llama.cpp or vllm level.

| Key | Description |
| ------------ | -------------------------------------------------------------------------|
| quadlet | Podman supported container definition for running AI Model under systemd |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |


Suggested change
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes |
| kserve | Kserve YAML definition for running the AI Model as a kserve service in Kubernetes using vLLM |

@rhatdan force-pushed the kserve branch 2 times, most recently from 4eaadcf to 7f86ca2 on March 12, 2025 12:44
@rhatdan force-pushed the kserve branch 4 times, most recently from 622e665 to 5f6b092 on April 29, 2025 19:28
@rhatdan force-pushed the kserve branch 3 times, most recently from 4c14c21 to a135ddd on April 30, 2025 15:04
@rhatdan changed the title from "[WIP] Add support for kserve" to "Add support for kserve" on Apr 30, 2025
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: llama.cpp-cuda-runtime


Should this be granite instead of llama?

Member Author (@rhatdan)

That is the runtime. But maybe granite makes more sense.

name: vLLM
containers:
- name: kserve-container
image: quay.io/ramalama/cuda:0.8


Is this the correct image? I see that CPUs are used in the generated InferenceService

Member Author (@rhatdan)

What should the InferenceService look like if it were using nvidia/cuda?


The official community image for vLLM is on DockerHub, and the CUDA image is vllm/vllm-openai:${version}, e.g. vllm/vllm-openai:v0.8.5. I think we should use them as a starting point (it is CUDA only).

I suggest considering this as an additional parameter, like --vllm-version.

Note 1:
We have a quay.io build that looks like quay.io/vllm/vllm:0.8.4.20250429, but it is still in progress.

Note 2:
RHOAI product builds use a different org, and the images look like
quay.io/modh/vllm:rhoai-2.19-cuda / quay.io/modh/vllm:rhoai-2.19-rocm,
but I don't think we want to use product images here.
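A possible shape for that flag, sketched with argparse; the flag name follows the suggestion above, while the default tag and the wiring are assumptions, not the PR's code.

```python
import argparse

parser = argparse.ArgumentParser(prog="ramalama serve")
parser.add_argument(
    "--vllm-version",
    default="v0.8.5",  # example tag only
    help="vLLM image tag used in the generated KServe ServingRuntime "
         "(vllm/vllm-openai:<version>)",
)

args = parser.parse_args(["--vllm-version", "v0.8.5"])
image = f"vllm/vllm-openai:{args.vllm_version}"
print(image)  # vllm/vllm-openai:v0.8.5
```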


### Generate kserve service off of OCI Model car quay.io/ramalama/granite:1.0
```
$ ramalama serve --pull=never --threads 10 --port 8081 --generate kserve oci://quay.io/rhatdan/granite
```


Is Ramalama only designed for single-node serving?

Member Author (@rhatdan)

RamaLama is just a tool to launch AI models in containers; the idea would be to generate Kubernetes content to allow models to run across multiple nodes, but that is not something RamaLama would do from the command line.


I think we can consider this for another iteration: given that ramalama is mainly designed for local use, I don't expect a user will run a model big enough to require a multi-node GPU setup.


Comment on lines +241 to +246
limits:
  cpu: "10"
  memory: 24Gi
requests:
  cpu: "10"
  memory: 24Gi


We need at least the GPU requirement too.

Suggested change
limits:
  cpu: "10"
  memory: 24Gi
requests:
  cpu: "10"
  memory: 24Gi
limits:
  cpu: "10"
  memory: 24Gi
  nvidia.com/gpu: '1'
requests:
  cpu: "10"
  memory: 24Gi
  nvidia.com/gpu: '1'

Signed-off-by: Daniel J Walsh <[email protected]>
@github-actions

A friendly reminder that this PR had no activity for 30 days.
