Replies: 9 comments · 12 replies
-
@apenney I'd say the whole thing really depends on your application code, rather than the framework itself.
This makes perfect sense. The more threads you have, the more those threads will try to acquire the GIL (and thus fight each other) to run Python code. In the WSGI world, though, that's the only way to "do work" while you're waiting on GIL-free I/O calls. So it's about trading some latency for increased throughput.
I would rather rely on k8s to scale horizontally – as written in the readme – so I wouldn't go higher than 1-2 workers per pod. You can scale (or auto-scale) the number of pods at that point. If I had to suggest a configuration, and presuming your application has some form of I/O, I would rather:
I guess this is one of the reasons for #610. Once that lands, it will be easier to track what's happening on the Granian runtime.
ASGI is a completely different beast. All the things I said about blocking threads won't matter in that case, as everything is running on the Python event loop, and that will control concurrency over everything else.
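Purely as an illustration of that kind of setup (the concrete configuration suggested above isn't shown in the thread, and constructor argument names vary between Granian versions), a low-worker-count programmatic launch, scaled by adding pods rather than workers, might look roughly like this:

```python
# Hypothetical sketch only: a small Granian launch intended to be replicated
# horizontally by Kubernetes (more pods) instead of vertically (more workers
# or threads per pod). Argument names may differ across Granian versions.
from granian import Granian
from granian.constants import Interfaces

if __name__ == "__main__":
    Granian(
        "myapp.wsgi:application",  # hypothetical import path to the WSGI callable
        address="0.0.0.0",
        port=8000,
        interface=Interfaces.WSGI,
        workers=2,  # 1-2 workers per pod; add pods (auto-scaling) to go further
    ).serve()
```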
-
@gi0baro I am hoping I can bug you about a new issue that's cropped up. We've stuck with Granian for a month and things are mostly good, but I do have this one weird problem. Over time the pods build up a large amount of memory usage, and eventually CPU usage skyrockets and users report slowness (which is correlated in frontend metrics). Over the weekend I would see connections to CloudFront reach the ALB and then stall for 60 seconds trying to fetch /login from a pod. It made no sense, and eventually I restarted all the pods (which had only been up for 2.5 days), which cleared the issue up completely.
You can see when I restarted things, dropping a bunch of pods with extra-high memory usage back to normal. What's weird is that I set GRANIAN_WORKERS_LIFETIME=3600 and can confirm that I see workers routinely restart, so I don't really understand how memory usage can climb forever. Is there any way you can think of that memory could be retained despite workers being recycled on a regular basis? Is there a pool of memory that sits outside the individual workers? I am Python-challenged, just fumbling my way through this issue, so any suggestions are gratefully appreciated!
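One way to narrow down the "is there a pool of memory outside the workers?" question is to compare the resident memory of the master process and the workers inside a pod. A minimal sketch, assuming psutil is available in the image and that matching on "granian" in the command line actually finds the right processes:

```python
# Sketch: print RSS per Granian-related process, to show whether the growth
# lives in the workers (which GRANIAN_WORKERS_LIFETIME recycles) or in the
# long-lived master process. The "granian" match is an assumption about how
# the processes are named; adjust it to your launcher if needed.
import psutil

def granian_memory_report(match: str = "granian") -> None:
    total = 0
    for proc in psutil.process_iter(["pid", "ppid", "cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        if match not in cmdline or mem is None:
            continue
        total += mem.rss
        print(f"pid={proc.info['pid']} ppid={proc.info['ppid']} rss={mem.rss / 1e6:.1f} MB")
    print(f"total RSS across matched processes: {total / 1e6:.1f} MB")

if __name__ == "__main__":
    granian_memory_report()
```

If the master's RSS stays flat while the workers saw-tooth with the lifetime recycling, the growth is coming from somewhere else (for example, a pod-level memory metric that also counts page cache).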
-
I noticed something weird: almost every pod has one process stuck at 70-120% CPU, while the rest sit at around 15%. If I do something like: The stuck worker remains stuck. I have: So I feel like this worker should be force-killed after 120 seconds no matter what, but it never disappears. We might be building up stuck processes over time until things degrade.
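If the stuck process is a Python worker, one cheap way to see where it is wedged is to have every worker dump all of its thread stacks when it receives a signal, then send that signal to the stuck PID from inside the pod. A minimal standard-library sketch (the choice of SIGUSR1 is arbitrary and assumes nothing else in the app already uses it):

```python
# Sketch: run this once at application import time. Sending SIGUSR1 to a
# worker (e.g. `kill -USR1 <pid>` from inside the pod) then writes the Python
# stack of every thread in that worker to stderr. Unix-only; if a thread is
# stuck inside a C extension, the dump still shows the Python frame that
# called into it.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True, chain=False)
```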
-
I saw this pattern in many pods: a worker that was stuck and had a lifespan of 12h or more. I was unable to force these workers to recycle: And the state of the threads:
-
We made some progress: we ditched granian_server.py and went direct with ASGI again. Clearly we weren't passing the thread-timeout env vars through, so we were getting eternal threads. I don't see the stuck threads with ASGI, maybe just because the recycling works, which at least solves the problem. I will try to figure out more about how to debug a stuck WSGI thread, as I'm sure our software is doing something terrible that causes threads to get wedged.
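On the "how to debug a stuck WSGI thread" front, one option (independent of anything Granian itself provides) is a small WSGI wrapper that tracks in-flight requests per thread, plus a watchdog that periodically dumps the stack of anything running too long. A rough standard-library sketch; the names and thresholds here are made up:

```python
# Sketch: wrap the WSGI app once at startup. Each request registers itself by
# thread ident; a daemon watchdog thread periodically dumps the stack of any
# request that has been in flight longer than `warn_after` seconds, which is
# exactly the "wedged thread" case.
import sys
import threading
import time
import traceback

_inflight = {}  # thread ident -> (request path, start time)
_lock = threading.Lock()

def wrap_app(app, warn_after=30.0, check_every=10.0):
    def watchdog():
        while True:
            time.sleep(check_every)
            now = time.monotonic()
            frames = sys._current_frames()
            with _lock:
                items = list(_inflight.items())
            for ident, (path, started) in items:
                if now - started < warn_after:
                    continue
                frame = frames.get(ident)
                stack = "".join(traceback.format_stack(frame)) if frame else "<no frame>"
                print(f"request {path} stuck for {now - started:.0f}s on thread {ident}:\n{stack}",
                      file=sys.stderr)

    threading.Thread(target=watchdog, daemon=True, name="wsgi-watchdog").start()

    def middleware(environ, start_response):
        ident = threading.get_ident()
        with _lock:
            _inflight[ident] = (environ.get("PATH_INFO", ""), time.monotonic())
        try:
            return app(environ, start_response)
        finally:
            with _lock:
                _inflight.pop(ident, None)

    return middleware
```

Note this only times the application call itself; a thread wedged inside native code will still show up as a frozen Python frame in the dump.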
-
@apenney In general, I think there's a lot going on with your custom launcher. My feelings are:
-
Pretty sure, based on those profiles, I found the cause of at least one of our issues: the Datadog WAF stuff is just... broken? I disabled it and our p99.9 latency plummeted, so that's one more mystery solved. It may also be what messes with the threads, since it injects itself as well.
-
The rest of our woes (stuck Granian threads that couldn't be killed by Granian) were also caused by Datadog, this time by the profiler. No idea whether the issue is with them or something unique to Granian (I blame Datadog myself; this stuff works terribly), but that was ultimately our root cause. Just adding this update in case anyone else ever has to troubleshoot something similar.







-
We recently switched from Gunicorn to Granian (we had a previous attempt at this switch back in October that failed, and we came begging you for help at the time).
So far (I say, before we hit peak traffic) I think we're holding up OK, but I thought it might make sense to discuss the settings I ended up with after doing a bunch of load tests, get your opinion, and talk about whether this would change for ASGI (flipping to ASGI would be the next goal).
Right now I'm running WSGI with a bunch of settings that end up looking like:
Every time I increased BLOCKING_THREADS or RUNTIME_THREADS, or tried to add more backpressure or anything, I ended up getting worse results. Now, this was a somewhat artificial test (50 virtual users with k6 running a pretty basic workflow), but I was surprised to consistently end up with lower latency and better results with fewer and fewer threads.
We're running 6-10 pods with 6 workers each right now. I already saw issues overnight where a user hammered one pod with requests to an endpoint that eventually times out; our health checks then couldn't respond and the pod was marked unhealthy.
This feels wrong! I thought I'd just come and run these settings past you, @gi0baro, and see how bad they feel to you.
Beyond that, I have the question of "how can I determine better values for this beyond artificial tests?" It's hard for me to tell whether threads are starved, and I don't know if I should be trying to get some sort of perf report off one of these nodes (if I even can, since it's running in Kubernetes), some other kind of profiler, etc., to answer questions like these.
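On the "are threads starved?" question specifically: short of attaching a real sampling profiler such as py-spy to a pod, a crude in-process signal is to sample every thread's current frame at a fixed interval and count how many are inside application code at once; if that count sits pinned at the size of the blocking thread pool, the pool is saturated. A rough sketch, where the "/app/" path filter is an assumption about where your code lives in the container and would need adjusting:

```python
# Sketch: a poor man's sampling profiler. Every `interval` seconds it looks at
# all thread stacks in the current process and counts how many are executing
# application code (identified here by a naive path substring check).
import sys
import threading
import time

def start_saturation_sampler(app_path="/app/", interval=5.0):
    def sample():
        while True:
            time.sleep(interval)
            frames = sys._current_frames()
            busy = sum(
                1 for frame in frames.values()
                if app_path in frame.f_code.co_filename
            )
            print(f"[sampler] {busy}/{len(frames)} threads currently in application code",
                  file=sys.stderr)

    threading.Thread(target=sample, daemon=True, name="saturation-sampler").start()
```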