# Add support for kserve #877
@@ -69,8 +69,9 @@ Generate specified configuration format for running the AI Model as a service

| Key          | Description                                                                        |
| ------------ | ---------------------------------------------------------------------------------- |
| kserve       | KServe YAML definition for running the AI Model as a KServe service in Kubernetes  |
| kube         | Kubernetes YAML definition for running the AI Model as a service                   |
| quadlet      | Podman supported container definition for running AI Model under systemd           |
| quadlet/kube | Kubernetes YAML definition for running the AI Model as a service and Podman supported container definition for running the Kube YAML specified pod under systemd |

#### **--help**, **-h**

@@ -112,7 +113,7 @@ On Nvidia based GPU systems, RamaLama defaults to using the
`nvidia-container-runtime`. Use this option to override this selection.

#### **--port**, **-p**
Port for the AI Model server to listen on. It must be available. If not specified,
the serving port will be 8080 if available; otherwise, a free port in the 8081-8090 range.
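A rough sketch of that fallback rule (not the actual RamaLama implementation, just an illustration of "try 8080, otherwise the first free port in 8081-8090"):

```python
import socket

def pick_serving_port(preferred=8080, fallback=range(8081, 8091)):
    """Return the preferred port if it is free, otherwise the first free fallback port."""
    for port in [preferred, *fallback]:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))  # bind succeeds only if the port is available
                return port
            except OSError:
                continue
    raise RuntimeError("no free port available in the 8080-8090 range")
```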

#### **--privileged**
@@ -159,7 +160,7 @@ llama.cpp explains this as:

The higher the number, the more creative the response, but it is also more likely to hallucinate when set too high.

Usage: Lower numbers are good for virtual assistants where we need deterministic responses. Higher numbers are good for roleplay or creative tasks like editing stories.
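For illustration, the same parameter can also be overridden per request by a client once the model is being served. A minimal sketch, assuming the runtime exposes an OpenAI-compatible endpoint on port 8081 and serves the model under the name `granite` (both assumptions, matching the KServe example further below):

```python
import json
import urllib.request

# Hypothetical client request: a low temperature keeps the answer close to deterministic.
payload = {
    "model": "granite",
    "messages": [{"role": "user", "content": "Summarize what KServe does in one sentence."}],
    "temperature": 0.2,
}
req = urllib.request.Request(
    "http://127.0.0.1:8081/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```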

#### **--threads**, **-t**
Maximum number of CPU threads to use.

@@ -187,6 +188,64 @@ CONTAINER ID IMAGE COMMAND CREATED
```
3f64927f11a5 quay.io/ramalama/ramalama:latest /usr/bin/ramalama... 17 seconds ago Up 17 seconds 0.0.0.0:8082->8082/tcp ramalama_YMPQvJxN97
```

### Generate kserve service off of OCI Model car quay.io/ramalama/granite:1.0
```
$ ramalama serve --pull=never --threads 10 --port 8081 --generate kserve oci://quay.io/rhatdan/granite
Generating kserve runtime file: granite-cuda-kserve-runtime.yaml
Generating kserve file: granite-cuda-kserve.yaml

$ cat granite-cuda-kserve-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: llama.cpp-cuda-runtime
  annotations:
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8081'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
  - autoSelect: true
    name: vLLM
  containers:
  - name: kserve-container
    image: quay.io/ramalama/cuda:0.8
    command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
    args: ["--port=8081", "--model=/mnt/models", "--served-model-name=granite"]
    env:
    - name: HF_HOME
      value: /tmp/hf_home
    ports:
    - containerPort: 8081
      protocol: TCP

$ cat granite-cuda-kserve.yaml
# RamaLama granite AI Model Service
# kubectl create -f to import this kserve file into Kubernetes.
#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-granite
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: "oci://quay.io/rhatdan/granite"
      resources:
        limits:
          cpu: "10"
          memory: 24Gi
        requests:
          cpu: "10"
          memory: 24Gi
```

Review discussion on this example:

> **Reviewer** (on `name: llama.cpp-cuda-runtime`): Should this be granite instead of llama?
>
> **Author:** That is the runtime. But maybe granite makes more sense.

> **Reviewer** (on `image: quay.io/ramalama/cuda:0.8`): Is this the correct image? I see that CPUs are used in the generated InferenceService.
>
> **Author:** What should the InferenceService look like if it was using nvidia/cuda?
>
> **Reviewer:** The official community image for vLLM is on DockerHub and the cuda image is […]. I suggest considering this as an additional parameter like […].

> **Reviewer** (on lines +241 to +246, the `resources` block): We need at least the GPU required too.
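As a quick sanity check before handing the generated files to `kubectl create -f`, they can be parsed to confirm they are well-formed YAML. A minimal sketch, assuming PyYAML is installed and using the filenames from the example above:

```python
import yaml  # PyYAML, assumed to be installed

# Hypothetical check: parse each generated manifest and report its Kubernetes kind.
for path in ("granite-cuda-kserve-runtime.yaml", "granite-cuda-kserve.yaml"):
    with open(path) as f:
        docs = [d for d in yaml.safe_load_all(f) if d]
    print(path, "->", [d.get("kind") for d in docs])
```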
### Generate quadlet service off of HuggingFace granite Model
```
$ ramalama serve --name MyGraniteServer --generate=quadlet granite
```

@@ -0,0 +1,125 @@
```python
import os

from ramalama.common import get_accel, get_accel_env_vars


def create_yaml(template_str, params):
    """Fill in a YAML template with the given parameters."""
    return template_str.format(**params)


KSERVE_RUNTIME_TMPL = """
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: {runtime}-runtime
  annotations:
    opendatahub.io/recommended-accelerators: '["{gpu}"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '{port}'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
  - autoSelect: true
    name: vLLM
  containers:
  - name: kserve-container
    image: {image}
    command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
    args: ["--port={port}", "--model=/mnt/models", "--served-model-name={name}"]
    env:
    - name: HF_HOME
      value: /tmp/hf_home
    ports:
    - containerPort: {port}
      protocol: TCP
"""

KSERVE_MODEL_SERVICE = """\
# RamaLama {name} AI Model Service
# kubectl create -f to import this kserve file into Kubernetes.
#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-{name}
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: "oci://{model}"
      resources:
        limits:
          cpu: "{threads}"
          memory: 24Gi{gpu}
        requests:
          cpu: "{threads}"
          memory: 24Gi{gpu}
"""


class Kserve:
    def __init__(self, model, chat_template_path, image, args, exec_args):
        self.ai_image = model
        if hasattr(args, "MODEL"):
            self.ai_image = args.MODEL
        # Strip the transport prefix so only the OCI image reference remains.
        self.ai_image = self.ai_image.removeprefix("oci://")
        if args.name:
            self.name = args.name
        else:
            self.name = os.path.basename(self.ai_image)

        self.model = model.removeprefix("oci://")
        self.args = args
        self.exec_args = exec_args
        self.image = image
        self.runtime = args.runtime

    def generate(self):
        # Accelerator environment variables; collected here but not yet used by
        # the KServe templates.
        env_var_string = ""
        for k, v in get_accel_env_vars().items():
            env_var_string += f"Environment={k}={v}\n"

        # Pick the Kubernetes GPU resource name from the visible-devices
        # environment variables; default to no GPU resource.
        _gpu = ""
        if os.getenv("CUDA_VISIBLE_DEVICES"):
            _gpu = 'nvidia.com/gpu'
        elif os.getenv("HIP_VISIBLE_DEVICES"):
            _gpu = 'amd.com/gpu'

        outfile = f"{self.name}-{get_accel()}-kserve-runtime.yaml"
        outfile = outfile.replace(":", "-")
        print(f"Generating kserve runtime file: {outfile}")

        yaml_content = create_yaml(
            KSERVE_RUNTIME_TMPL,
            {
                'runtime': self.runtime + "-" + get_accel(),
                'model': self.model,
                'gpu': _gpu,
                'port': self.args.port,
                'image': self.image,
                'name': self.name,
            },
        )
        with open(outfile, 'w') as c:
            c.write(yaml_content)

        outfile = f"{self.name}-{get_accel()}-kserve.yaml"
        outfile = outfile.replace(":", "-")
        print(f"Generating kserve file: {outfile}")
        yaml_content = create_yaml(
            KSERVE_MODEL_SERVICE,
            {
                'name': self.name,
                'model': self.model,
                'threads': self.args.threads,
                # Append a GPU resource line under limits/requests only when a GPU was detected.
                'gpu': f"\n          {_gpu}: '1'" if _gpu else "",
            },
        )
        with open(outfile, 'w') as c:
            c.write(yaml_content)
```

> **Reviewer** (on the `memory: 24Gi{gpu}` lines of `KSERVE_MODEL_SERVICE`): issue (bug_risk): Potential undefined variable `gpu`. If neither CUDA_VISIBLE_DEVICES nor HIP_VISIBLE_DEVICES is set, the variable `gpu` will not be defined before it is used in the f-string. Initializing `gpu` to an empty string by default would prevent a potential NameError.
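For reference, a minimal sketch of how this class might be driven outside the RamaLama CLI. The `SimpleNamespace` stands in for the parsed arguments, and every field shown is an assumption about what `__init__` and `generate()` read, not the real argument plumbing:

```python
from types import SimpleNamespace

# Hypothetical driver: attribute names mirror what the Kserve class accesses
# (MODEL, name, port, threads, runtime).
args = SimpleNamespace(
    MODEL="oci://quay.io/rhatdan/granite",
    name=None,            # falls back to the basename of the model reference
    port=8081,
    threads=10,
    runtime="llama.cpp",
)

kserve = Kserve(
    model="oci://quay.io/rhatdan/granite",
    chat_template_path=None,   # not used by generate()
    image="quay.io/ramalama/cuda:0.8",
    args=args,
    exec_args=None,
)
kserve.generate()  # writes granite-<accel>-kserve-runtime.yaml and granite-<accel>-kserve.yaml
```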
> **Reviewer:** Is RamaLama only designed for single-node serving?
>
> **Author:** RamaLama is just a tool to launch AI models in containers; the idea would be to generate Kubernetes content to allow models to run across multiple nodes, but that is not something RamaLama would do from the command line.
>
> **Reviewer:** I think we can consider this for another iteration: given that RamaLama is mainly designed for local use, I don't expect a user will run a model big enough to require a multi-node GPU setup.