Best practices for deploying and scaling Pathway LLM app templates on Kubernetes clusters #122
Replies: 2 comments 1 reply
Adding a bit more context on why I asked these questions: I'm in the middle of shaping the cluster layout for a real-time RAG/LLM setup, and before I lock in the pod sizes, node pool mix, and autoscaling logic, I want to understand what has actually worked for people running Pathway in production. I'm not after exact tuning numbers; mostly I'm trying to avoid going down a path that's known to cause trouble later. If there are a few "rules of thumb" you've seen teams rely on (like how they normally size their first worker pods, what they watch for when enabling HPA, or common mistakes in splitting CPU-heavy vs memory-heavy components), that would already help me line things up properly from the start. Really appreciate any practical notes you can share. Getting these foundations right usually saves a lot of rework once traffic ramps up. Thank you!
Hello @nholuongut, While we'd be interested in hearing more about your case (collection size, number of users, etc.), I hope that the following answers will give you a general intuition.
Please let us know if you have any follow-up questions.
Hi Pathway team 👋,
I’ve been going through the llm-app templates and exploring how they behave in a Kubernetes setup (AKS/EKS/GKE). I’m planning to run them as part of a larger ML/DevOps cluster, so I wanted to check a few practical things from your experience.
For the worker pods, does Pathway generally scale better with many small pods (horizontal scaling) or with fewer, larger pods (vertical scaling)? I've seen both patterns work, depending on the engine.
Do you have any rough CPU/memory ranges that tend to be stable for real workloads (e.g., RAG pipelines with fairly heavy traffic)? I’m trying to avoid guessing too much during initial sizing.
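For concreteness, here's the kind of starting point I have in mind — a minimal Deployment sketch where the image name, replica count, and resource figures are all placeholders I made up, not anything I've seen recommended for Pathway:

```yaml
# Hypothetical worker Deployment -- image, replicas, and resource numbers
# are illustrative placeholders, to be refined against real metrics.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pathway-worker
spec:
  replicas: 3                  # "many small pods" pattern; raise for horizontal scale
  selector:
    matchLabels:
      app: pathway-worker
  template:
    metadata:
      labels:
        app: pathway-worker
    spec:
      containers:
        - name: worker
          image: my-registry/pathway-llm-app:latest   # placeholder image
          resources:
            requests:
              cpu: "1"         # initial guess; adjust after observing load
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
```

Mainly I'd like to know whether requests/limits in this general shape are sane for a RAG pipeline, or whether memory should dominate.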
For autoscaling, is CPU usually enough, or do people rely more on custom metrics like queue depth / processing delay?
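If custom metrics are the way to go, I was picturing something like the following `autoscaling/v2` HPA, combining CPU with a queue-depth metric exposed through a metrics adapter (the metric name `queue_depth` and all thresholds are assumptions on my part):

```yaml
# Hypothetical HPA combining CPU utilization with a custom per-pod metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pathway-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pathway-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # placeholder threshold
    - type: Pods
      pods:
        metric:
          name: queue_depth           # assumed metric served by a custom-metrics adapter
        target:
          type: AverageValue
          averageValue: "100"         # placeholder target
```

Does scaling on something like processing delay or queue depth behave better for Pathway than plain CPU?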
When running on mixed node pools (compute-optimized vs memory-optimized), is there any recommended placement strategy for the Pathway workers vs vector DB or storage components?
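For placement, I was assuming something along these lines in the worker pod spec — node labels and taint keys here are invented for illustration and would need to match whatever the actual node pools use:

```yaml
# Hypothetical placement fragment: pin workers to a compute-optimized pool
# that is tainted for this workload. Label and taint names are assumptions.
spec:
  nodeSelector:
    node-pool: compute-optimized      # assumed node label
  tolerations:
    - key: workload                   # assumed taint on the pool
      operator: Equal
      value: pathway
      effect: NoSchedule
```

The open question for me is whether the vector DB / storage components should instead land on the memory-optimized pool, or be kept off the worker nodes entirely.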
And finally, any common issues when putting llm-app behind an API Gateway or Ingress (timeouts, concurrency limits, pooling, etc.)?
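On the Ingress side, I expect long-lived requests to need raised timeouts. As a sketch (assuming the ingress-nginx controller; hostname, service name, port, and timeout values are all placeholders):

```yaml
# Hypothetical Ingress with extended proxy timeouts for long LLM requests.
# Annotations are ingress-nginx specific; other controllers use different knobs.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"   # placeholder seconds
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  rules:
    - host: llm-app.example.com       # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-app         # placeholder service
                port:
                  number: 8000        # placeholder port
```

Are there other gotchas beyond timeouts — e.g., connection pooling or per-pod concurrency limits?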
I’m asking because I’m designing a real-time RAG/LLM deployment and want to align with what tends to work well in practice for Pathway users.
Thanks a lot!
Best,
Nho Luong