llama server fails to find Intel GPU and crashes #1933
Replies: 35 comments
-
Just curious, did you test this directly with llama.cpp as well? I'm finding that the error comes directly from llama.cpp, so it would be good to test it at the engine level first.
-
Thanks for the quick reply. I've tried to run with:

> ramalama --nocontainer serve tinyllama
Traceback (most recent call last):
  File "/home/charles/.local/share/pipx/venvs/ramalama/libexec/ramalama/ramalama-serve-core", line 16, in <module>
    main(sys.argv[1:])
  File "/home/charles/.local/share/pipx/venvs/ramalama/libexec/ramalama/ramalama-serve-core", line 8, in main
    from ramalama.common import exec_cmd
ModuleNotFoundError: No module named 'ramalama'

but this may be a completely different install error...
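One thing that might rule out a broken pipx environment (an assumption on my side, not something from the thread; the paths above suggest ramalama was installed via pipx) is forcing a reinstall before retrying:

# standard pipx commands; recreate the venv that holds the ramalama package
> pipx reinstall ramalama
# or, equivalently:
> pipx install --force ramalama
> ramalama --nocontainer serve tinyllama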
-
Unfortunately not. RamaLama uses llama.cpp or vLLM under the hood, and since the error came directly from llama.cpp instead of RamaLama, it would be good if you could test it directly in llama.cpp first. @ericcurtin Do you know who maintains the compatibility table and if they can test this? I was wondering whether RamaLama is failing to pass through the GPU or llama.cpp is failing to detect the GPU.
-
I understand the logic, but I would need more context regarding the llama.cpp install in order to make use of the Intel GPU. Moreover, I'm just starting on a freshly installed system and would appreciate keeping it somewhat clean of one-off installs (I was pretty happy about the container solution). I could also try pulling a LocalAI image and running a server; LocalAI also depends on llama.cpp and should fail the same way if llama.cpp is the culprit (provided I find an image with the same llama.cpp version).
-
I feel you! Unfortunately I don't have an Intel GPU to test this problem, so I can only count on you haha.
Please update your result once you have it. We can also try spinning up an ephemeral container with the required build tools already installed to quickly test llama.cpp's build with the Intel GPU, and tear it down once complete.
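A minimal sketch of such an ephemeral test, assuming the docker.io/intel/oneapi-basekit image (which ships the oneAPI compilers and sycl-ls; git/cmake may need an apt-get install inside it) and the SYCL flags from llama.cpp's build docs:

> podman run --rm -it --device /dev/dri docker.io/intel/oneapi-basekit:latest bash
# inside the container:
> git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
> source /opt/intel/oneapi/setvars.sh
> cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
> cmake --build build --config Release -j
> sycl-ls                                  # the Arc iGPU should show up as a level_zero device
> ./build/bin/llama-cli -m /path/to/any-model.gguf -p "Hello" -n 16 -ngl 99   # smoke test with full offload

Since the container is started with --rm, everything is torn down as soon as you exit the shell.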
-
Just a follow-up on this: I installed a LocalAI docker image (details below) and I could run a mistral-7b model making use of my Intel GPU with llama.cpp as the backend (verified via intel_gpu_top).

> docker image list
REPOSITORY        TAG               IMAGE ID       CREATED      SIZE
localai/localai   v3.0.0-sycl-f32   1e973dbe4525   7 days ago   16.5GB

> docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED        STATUS                   PORTS                                       NAMES
2e995866de10   localai/localai:v3.0.0-sycl-f32   "/build/entrypoint.s…"   16 hours ago   Up 5 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   local-ai

> docker exec local-ai sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) Graphics 12.71.4 [1.6.32567+18]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 165H OpenCL 3.0 (Build 0) [2025.19.3.0.17_230222]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [25.05.32567]

> docker exec local-ai ls /dev/dri
card1
renderD128

I'm not sure I understand how the image is built, but after a bit of searching it seems it's pinned to this commit of llama.cpp (a commit you may find familiar ;) ):

> docker exec -ti local-ai grep -rnwi '.' -e 'CPPLLAMA_VERSION'
./Makefile:9:CPPLLAMA_VERSION?=8d947136546773f6410756f37fcc5d3e65b8135d
+ some other hits
-
Hey @taronaeo, I struggle to keep up with all the issues, PRs, etc. The community maintains the compatibility table, because there's such a wide array of hardware that no single person can test all of it, so we depend on the community to test and update the table as appropriate. For me, when I'm enabling hardware, I find the best route is the one you suggested: get it working with just llama.cpp, no containers (a lot of the issues are at the llama.cpp level), then make it play nicely with RamaLama and containers.
-
Alternatively, as I mentioned earlier, "We can also try spinning up an ephemeral container with the required build tools already installed to quickly test llama.cpp's build with the Intel GPU, and tear it down once complete." I think this would provide a clearer answer as to which component is failing, and we can narrow down further from there.
-
Yep, I couldn't find the actual backend used for Intel GPUs in RamaLama's docs either. Sure, happy to try the containerized build of llama.cpp if this can be helpful, but I would need more guidance. I'm a bit lost regarding backend selection and how to write a minimal test (just using llama-cli?) afterwards. The llama.cpp build documentation doesn't mention Linux as an option for Vulkan-based builds. The SYCL backend seems to work in Docker as mentioned above (I have not tested running the docker container via podman). Or did you want to test even a simple CPU backend?
-
It does mention it, but admittedly it's a little hidden: it's under the "Without Docker" header, below the MSYS2 section. TL;DR, you can build most llama.cpp backends the same way:
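A minimal sketch of that generic pattern, assuming the current GGML_* backend switches from llama.cpp's build docs (pick one configure line for the backend you want):

> cmake -B build -DGGML_VULKAN=ON
# or, for SYCL, after sourcing the oneAPI environment:
# cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
> cmake --build build --config Release -j
> ./build/bin/llama-cli -m /path/to/model.gguf -p "Hello" -n 16 -ngl 99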
-
I'm also struggling with GPU acceleration on my machine. Maybe have a look at https://github.com/eleiton/ollama-intel-arc and how this project incorporates ipex-llm. This docker image works for me, but I would prefer to use ramalama if ipex-llm support is possible.
-
In fact, Vulkan is one of the most mature llama.cpp backends on Linux.
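A quick host-side sanity check that the Vulkan stack actually sees the iGPU (assuming Ubuntu's vulkan-tools package; not something the thread ran):

> sudo apt-get install vulkan-tools   # provides vulkaninfo
> vulkaninfo --summary                # the Intel Arc iGPU should be listed as a physical device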
-
I'm speculating, but try podman. What is the OS here? We don't test docker often.
-
I'm running this in a … Podman also fails; as before, the Vulkan backend works just fine.
-
Could you try …
-
@afazekas PTAL
-
(ramalama-0.11.0-1.fc42.noarch), podman. Both the vulkan and the intel-gpu images worked on my laptop. Probably a good idea to try to run the Intel diag tools from a similar container to see if it at least lists the device.
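One way to do that check, assuming the Intel oneAPI base image (sycl-ls is not shipped in the ramalama intel-gpu image, as a later comment shows) and the same --device flag RamaLama passes:

> podman run --rm --device /dev/dri docker.io/intel/oneapi-basekit:latest sycl-ls
# the iGPU should appear as a level_zero:gpu and/or opencl:gpu entry; if it doesn't,
# the problem is at the device-passthrough/permission level rather than inside llama.cpp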
-
This is still broken for me on the latest … I tried to start the podman container manually, but it just hangs.
-
Have you tried to run …
-
Hi,
just tried what @rhatdan asked and it works:

> podman run docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 165H OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]

On the other hand:

> podman run quay.io/ramalama/intel-gpu:latest sycl-ls
:: initializing oneAPI environment ...
entrypoint.sh: BASH_VERSION = 5.2.37(1)-release
args: Using "$@" for setvars.sh arguments: sycl-ls
:: compiler -- latest
:: mkl -- latest
:: tbb -- latest
:: umf -- latest
:: oneAPI environment initialized ::
/usr/bin/entrypoint.sh: line 6: exec: sycl-ls: not found
-
Coming back to this, I think the issue was hiding in plain sight, and is related to the permissions set on /dev/dri/renderXXX.

First, coming back to my previous comment: I actually forgot to pass the --device flag:

> podman run --device /dev/dri/ docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) Graphics 12.71.4 [1.6.32224.500000]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 165H OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.52.32224.5]

Now coming back to our issue with ramalama: mudler/LocalAI#3437 mentioned rendering device permissions. In that issue the permissions on the render node were the suspect, and on my host they look like this:

> ls -la /dev/dri
crw-rw----+ 1 root video  226,   1 Sep 11 09:18 card1
crw-rw----+ 1 root render 226, 128 Sep 11 09:18 renderD128

Trying to give rw access to others on the render node:

> sudo chmod o+rw /dev/dri/renderD128
> ramalama serve tinyllama
:: initializing oneAPI environment ...
entrypoint.sh: BASH_VERSION = 5.2.37(1)-release
args: Using "$@" for setvars.sh arguments: llama-server --port 8084 --model /mnt/models/tinyllama --no-warmup --jinja --chat-template-file /mnt/models/chat_template_converted --log-colors --alias tinyllama --ctx-size 2048 --temp 0.8 --cache-reuse 256 -ngl 999 --threads 11 --host 0.0.0.0
:: compiler -- latest
:: mkl -- latest
:: tbb -- latest
:: umf -- latest
:: oneAPI environment initialized ::
error while handling argument "--log-colors": error: unkown value for --log-colors: '--alias'
usage:
  --log-colors [on|off|auto]  Set colored logging ('on', 'off', or 'auto', default: 'auto')
                              'auto' enables colors when output is to a terminal
                              (env: LLAMA_LOG_COLORS)
to show complete usage, run with -h

so the run now gets past the GPU error and fails on an unrelated llama-server argument-parsing issue instead. I'm however not satisfied with this solution, which seems insecure. And I guess running the podman container with it added to the render group would be preferable.
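A sketch of that group-based alternative, which the rest of the thread converges on (standard commands; --group-add keep-groups needs the crun runtime, podman's default on recent distributions):

> sudo chmod o-rw /dev/dri/renderD128      # undo the world-writable workaround
> sudo usermod -aG render $USER            # join the group that owns the render node
# log out and back in, then verify:
> id -nG | grep -w render
> ls -la /dev/dri/renderD128
# when launching the container manually, keep the supplementary groups:
> podman run --rm --device /dev/dri --group-add keep-groups quay.io/ramalama/intel-gpu:latest ls -la /dev/dri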
-
Have you already tried using …
-
I had done it at the podman level, and tried several variations on group-related commands, starting from the command spat out by ramalama --dry-run.
-
Is your user part of the render group?
-
Nope, I should have pointed this out indeed. The default …
-
How does it work on the host? Can your user access the device file?
-
If the user cannot access the device on the host, it won't automatically get access to it in the container. If access to the device is limited to root (and you want to keep it that way), then the container must be rootful.
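A quick host-side check along those lines (plain coreutils, nothing ramalama-specific):

> stat -c '%U:%G %a' /dev/dri/renderD128   # owner, group and mode of the render node
> id -nG                                   # groups the current user belongs to
> test -r /dev/dri/renderD128 && test -w /dev/dri/renderD128 && echo "render node accessible" || echo "no access"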
-
I just found this Intel doc on setting up permissions: https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-hpc-cluster/2025-1/step-4-set-up-user-permissions-for-using-the.html#SET-PERMISSIONS

It states that users need to be added to the render group to access the GPU device files. So I guess I'll just add my current user to the render group. This leaves two questions: …
-
OK, this looks like a configuration issue, so moving to discussion.
-
Just adding my user to the render group was not enough; I also had to add "--keep-groups" to get rid of the error. However, this just gets rid of the error but still fails to use the GPU. If I set up a ramalama server and then query the API from the host via:

curl http://localhost:8087/v1/completions -d '{
  "model": "qwen3:4b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

the inference runs with a single CPU core at 100% and no GPU usage (monitored via gputop).
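A few checks that might narrow down why inference stays on the CPU (standard tools only; the container name below is a placeholder, take it from podman ps):

> sudo intel_gpu_top                              # watch the iGPU while a completion runs
> podman ps                                       # find the running ramalama container
> podman logs <container-name>                    # do the llama-server startup logs list a SYCL/Level-Zero device, or fall back to CPU?
> podman exec <container-name> ls -la /dev/dri    # is the render node visible (and readable) inside the container?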
-
Issue Description
I'd like to use my Intel iGPU (my CPU is bundled with an iGPU, ID55 Intel Arc Meteor Lake, which is listed in ramalama's compatibility table) to run local chats, on an Ubuntu 24.04 system with podman installed.
I however fail to serve/run any model: the container crashes upon start and doesn't even get listed in podman ps --all afterward. When using ramalama run amodel, the terminal hangs until I enter any command and then fails with Error: could not connect to: http://127.0.0.1:8080/v1/chat/completions, no matter how long I wait before entering an input. After reading #1568 I've tried ramalama serve amodel instead and get … Any pointers on how to debug?
Thanks for the help!
Steps to reproduce the issue
Describe the results you received
This fails with the following output:
It seems that ramalama has downloaded the correct image for my config:
> podman images
REPOSITORY                   TAG   IMAGE ID       CREATED      SIZE
quay.io/ramalama/intel-gpu   0.9   eac7acb20df9   6 days ago   3.3 GB

Using --dry-run to get the podman command I get:

> ramalama --dry-run serve tinyllama
podman run --rm --label ai.ramalama.model=tinyllama --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8082 --label ai.ramalama.command=serve --device /dev/dri --device /dev/accel -e INTEL_VISIBLE_DEVICES=1 -p 8082:8082 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer --label ai.ramalama --name ramalama_k9gI9kV2vt --env=HOME=/tmp --init --mount=type=bind,src=/home/charles/.local/share/ramalama/store/ollama/tinyllama/tinyllama/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/charles/.local/share/ramalama/store/ollama/tinyllama/tinyllama/snapshots/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/intel-gpu:0.9 /usr/libexec/ramalama/ramalama-serve-core llama-server --port 8082 --model /mnt/models/model.file --no-warmup --jinja --log-colors --alias tinyllama --ctx-size 2048 --temp 0.8 --cache-reuse 256 -ngl 999 --threads 11 --host 0.0.0.0

I've tried to create a container without the llama-server launch command:
> podman run --rm --label ai.ramalama.model=tinyllama --label ai.ramalama.engine=podman --label ai.ramalama.runtime=llama.cpp --label ai.ramalama.port=8082 --label ai.ramalama.command=serve --device /dev/dri --device /dev/accel -e INTEL_VISIBLE_DEVICES=1 -p 8082:8082 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer --label ai.ramalama --name ramalama_k9gI9kV2vt --env=HOME=/tmp --init --mount=type=bind,src=/home/charles/.local/share/ramalama/store/ollama/tinyllama/tinyllama/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/charles/.local/share/ramalama/store/ollama/tinyllama/tinyllama/snapshots/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816/chat_template_converted,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/intel-gpu:0.9

and that gets created without error. The GPU devices seem to be correctly passed to the container:
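One way to do that check, assuming the container created above is still running (the name comes from the --name flag in the dry-run command):

> podman exec ramalama_k9gI9kV2vt ls -la /dev/dri
> podman exec ramalama_k9gI9kV2vt ls -la /dev/accel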
Describe the results you expected
Well, something that finds my iGPU and doesn't crash :)
And if it doesn't find my GPU, I would expect it to still run on the CPU without crashing, but issue a warning.
ramalama info output
Upstream Latest Release
No
Additional environment details
No response
Additional information
I have stumbled upon a similar issue in the LocalAI project.
sycl-ls is not installed in the ramalama container, and I did not know how to interact with the Intel APIs from bash with the tools already installed in the container.

Passing the complete device names --device /dev/dri/card1 and --device /dev/dri/renderD128 along with ramalama serve did not help.