Replies: 9 comments · 12 replies
-
@apenney I'd say the whole thing really depends on your application code, rather than the framework itself.
This makes perfect sense. The more threads you have, the more those threads will try to acquire the GIL (and thus fight each other) to run Python code. In the WSGI world, though, that's the only way to "do work" while you're waiting on GIL-free I/O calls. So it's about trading some latency for increased throughput.
I would rather rely on k8s to scale horizontally – as written in the readme – so I wouldn't go higher than 1-2 workers per pod. You can scale (or auto-scale) the number of pods at that point. If I had to suggest a configuration, and presuming your application has some form of I/O, I would rather:
I guess this is one of the reasons for #610. Once that lands, it will be easier to track what's happening on the Granian runtime.
ASGI is a completely different beast. All the things I said about blocking threads won't matter in that case, as everything is running on the Python event loop, and that will control concurrency over everything else.
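Purely as an illustration of that kind of setup (the concrete configuration suggested above isn't shown in the thread, and constructor argument names vary between Granian versions), a low-worker-count programmatic launch, scaled by adding pods rather than workers, might look roughly like this:

```python
# Hypothetical sketch only: a small Granian launch intended to be replicated
# horizontally by Kubernetes (more pods) instead of vertically (more workers
# or threads per pod). Argument names may differ across Granian versions.
from granian import Granian
from granian.constants import Interfaces

if __name__ == "__main__":
    Granian(
        "myapp.wsgi:application",  # hypothetical import path to the WSGI callable
        address="0.0.0.0",
        port=8000,
        interface=Interfaces.WSGI,
        workers=2,  # 1-2 workers per pod; add pods (auto-scaling) to go further
    ).serve()
```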
-
@gi0baro I am hoping I can bug you about a new issue that's cropped up. We've stuck with Granian for a month and things are mostly good, but I do have this one weird problem. Over time the pods build up a large amount of memory usage, and eventually CPU usage skyrockets and users report slowness (which is correlated in frontend metrics). Over the weekend I would see connections to CloudFront reach the ALB and then stall for 60 seconds trying to fetch /login from a pod. It made no sense, and eventually I restarted all the pods (which had only been up for 2.5 days), which cleared the issue up completely.
You can see when I restarted things, dropping a bunch of pods with extra-high memory usage back to normal. What's weird is that I set GRANIAN_WORKERS_LIFETIME=3600 and can confirm that I see workers routinely restart, so I don't really understand how memory usage can climb forever. Is there any way you can think of that memory could be retained despite workers being recycled on a regular basis? Is there a pool of memory that sits outside the individual workers? I am Python-challenged, just fumbling my way through this issue, so any suggestions are gratefully appreciated!
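One way to narrow down the "is there a pool of memory outside the workers?" question is to compare the resident memory of the master process and the workers inside a pod. A minimal sketch, assuming psutil is available in the image and that matching on "granian" in the command line actually finds the right processes:

```python
# Sketch: print RSS per Granian-related process, to show whether the growth
# lives in the workers (which GRANIAN_WORKERS_LIFETIME recycles) or in the
# long-lived master process. The "granian" match is an assumption about how
# the processes are named; adjust it to your launcher if needed.
import psutil

def granian_memory_report(match: str = "granian") -> None:
    total = 0
    for proc in psutil.process_iter(["pid", "ppid", "cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        if match not in cmdline or mem is None:
            continue
        total += mem.rss
        print(f"pid={proc.info['pid']} ppid={proc.info['ppid']} rss={mem.rss / 1e6:.1f} MB")
    print(f"total RSS across matched processes: {total / 1e6:.1f} MB")

if __name__ == "__main__":
    granian_memory_report()
```

If the master's RSS stays flat while the workers saw-tooth with the lifetime recycling, the growth is coming from somewhere else (for example, a pod-level memory metric that also counts page cache).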
-
I noticed something weird: almost every pod has one process stuck at 70-120% CPU, while the rest sit at around 15%. If I do something like: The stuck worker remains stuck. I have: So I feel like this worker should be force-killed after 120 seconds no matter what, but it never disappears. We might be building up stuck processes over time until things degrade.
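If the stuck process is a Python worker, one cheap way to see where it is wedged is to have every worker dump all of its thread stacks when it receives a signal, then send that signal to the stuck PID from inside the pod. A minimal standard-library sketch (the choice of SIGUSR1 is arbitrary and assumes nothing else in the app already uses it):

```python
# Sketch: run this once at application import time. Sending SIGUSR1 to a
# worker (e.g. `kill -USR1 <pid>` from inside the pod) then writes the Python
# stack of every thread in that worker to stderr. Unix-only; if a thread is
# stuck inside a C extension, the dump still shows the Python frame that
# called into it.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True, chain=False)
```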
-
I saw this pattern in many pods: a worker that was stuck and had a lifespan of 12h or more. I was unable to force these workers to recycle: And the state of the threads:
-
We made some progress: we ditched granian_server.py and went direct with ASGI again. Clearly we weren't passing the thread-timeout env vars through, so we were getting eternal threads. I don't see the stuck threads with ASGI, maybe just because the recycling works, which at least solves the problem. I will try to figure out more about how to debug a stuck WSGI thread, as I'm sure our software is doing something terrible that causes threads to get wedged.
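On the "how to debug a stuck WSGI thread" front, one option (independent of anything Granian itself provides) is a small WSGI wrapper that tracks in-flight requests per thread, plus a watchdog that periodically dumps the stack of anything running too long. A rough standard-library sketch; the names and thresholds here are made up:

```python
# Sketch: wrap the WSGI app once at startup. Each request registers itself by
# thread ident; a daemon watchdog thread periodically dumps the stack of any
# request that has been in flight longer than `warn_after` seconds, which is
# exactly the "wedged thread" case.
import sys
import threading
import time
import traceback

_inflight = {}  # thread ident -> (request path, start time)
_lock = threading.Lock()

def wrap_app(app, warn_after=30.0, check_every=10.0):
    def watchdog():
        while True:
            time.sleep(check_every)
            now = time.monotonic()
            frames = sys._current_frames()
            with _lock:
                items = list(_inflight.items())
            for ident, (path, started) in items:
                if now - started < warn_after:
                    continue
                frame = frames.get(ident)
                stack = "".join(traceback.format_stack(frame)) if frame else "<no frame>"
                print(f"request {path} stuck for {now - started:.0f}s on thread {ident}:\n{stack}",
                      file=sys.stderr)

    threading.Thread(target=watchdog, daemon=True, name="wsgi-watchdog").start()

    def middleware(environ, start_response):
        ident = threading.get_ident()
        with _lock:
            _inflight[ident] = (environ.get("PATH_INFO", ""), time.monotonic())
        try:
            return app(environ, start_response)
        finally:
            with _lock:
                _inflight.pop(ident, None)

    return middleware
```

Note this only times the application call itself; a thread wedged inside native code will still show up as a frozen Python frame in the dump.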
-
@apenney In general, I think there's a lot going on with your custom launcher. My feelings are:
-
Pretty sure, based on those profiles, I found the cause of at least one of our issues: the Datadog WAF stuff is just... broken? I disabled it and our p99.9 latency plummeted, so that's one more mystery solved. It may also be what messes with the threads, since it injects itself as well.
-
The rest of our woes (stuck Granian threads that couldn't be killed by Granian) were also caused by Datadog, this time by the profiler. No idea whether the issue is with them or something unique to Granian (I blame Datadog myself; this stuff works terribly), but that was ultimately our root cause. Just adding this update in case anyone else ever has to troubleshoot something similar.







-
We recently switched from Gunicorn to Granian (we had a previous attempt at this switch back in October that failed, and we came begging you for help at the time).
So far (I say, before we hit peak traffic) I think we're holding up OK, but I thought it might make sense to discuss the settings I ended up with after doing a bunch of load tests, get your opinion, and talk about whether this would change for ASGI (flipping to ASGI would be the next goal).
Right now I'm running WSGI with a bunch of settings that end up looking like:
Every time I increased BLOCKING_THREADS or RUNTIME_THREADS, or tried to add more backpressure or anything, I ended up getting worse results. Now, this was a somewhat artificial test (50 virtual users with k6 running a pretty basic workflow), but I was surprised to consistently end up with lower latency and better results with fewer and fewer threads.
We're running 6-10 pods with 6 workers each right now. I already saw issues overnight where a user hammered one pod with requests to an endpoint that eventually times out; our health checks then couldn't respond and the pod was marked unhealthy.
This feels wrong! I thought I'd just come and run these settings past you, @gi0baro, and see how bad they feel to you.
Beyond that, I have the question of "how can I determine better values for this beyond artificial tests?" It's hard for me to tell whether threads are starved, and I don't know if I should be trying to get some sort of perf report off one of these nodes (if I even can, since it's running in Kubernetes), some other kind of profiler, etc., to answer questions like these.
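On the "are threads starved?" question specifically: short of attaching a real sampling profiler such as py-spy to a pod, a crude in-process signal is to sample every thread's current frame at a fixed interval and count how many are inside application code at once; if that count sits pinned at the size of the blocking thread pool, the pool is saturated. A rough sketch, where the "/app/" path filter is an assumption about where your code lives in the container and would need adjusting:

```python
# Sketch: a poor man's sampling profiler. Every `interval` seconds it looks at
# all thread stacks in the current process and counts how many are executing
# application code (identified here by a naive path substring check).
import sys
import threading
import time

def start_saturation_sampler(app_path="/app/", interval=5.0):
    def sample():
        while True:
            time.sleep(interval)
            frames = sys._current_frames()
            busy = sum(
                1 for frame in frames.values()
                if app_path in frame.f_code.co_filename
            )
            print(f"[sampler] {busy}/{len(frames)} threads currently in application code",
                  file=sys.stderr)

    threading.Thread(target=sample, daemon=True, name="saturation-sampler").start()
```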