Skip to content

fix(bot): 端口冲突预检 + /health 上报真实版本#1566

Open
ZaynJarvis wants to merge 1 commit intomainfrom
fix/bot-port-in-use-and-version
Open

fix(bot): 端口冲突预检 + /health 上报真实版本#1566
ZaynJarvis wants to merge 1 commit intomainfrom
fix/bot-port-in-use-and-version

Conversation

@ZaynJarvis
Copy link
Copy Markdown
Collaborator

@ZaynJarvis ZaynJarvis commented Apr 18, 2026

修两个排障时被坑的点

1. --with-bot 端口被占时不报错

bootstrap.py 启动 vikingbot 子进程后只 sleep(2) 看进程死没死。但 vikingbot 要 ~5s 才走到 uvicorn bind,2s 时还在 import,进程当然是活的 → openviking-server 以为成功了,HTTP 代理转发到 18790,那里旧的 vikingbot 还在响应。从外面看一切正常,实际升级根本没生效。

修复:fork 前用 socket.connect 试一下 18790,被占就直接 exit 1 + 打 lsof 提示。两层都加(bootstrap.pyvikingbot gateway)。

2. /health 永远返回 version: "0.1.3"

vikingbot/__init__.py 里写死的。改成从 importlib.metadata.version("openviking") 动态读。

测试

  • 端口空闲 → 静默通过
  • 端口被占 → 毫秒级 exit 1,错误信息带 lsof 提示
  • /health 正确反映已安装版本

🤖 Generated with Claude Code

- bootstrap.py / vikingbot gateway: abort with a clear message when
  18790 is already held, before fork / before uvicorn startup.
  Avoids the silent fail-open caused by the existing 2s subprocess
  poll firing before uvicorn's bind attempt (~5s).
- vikingbot/__init__.py: read __version__ from openviking package
  metadata instead of a hardcoded "0.1.3" that lied in /health.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 80
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Fix version reporting for vikingbot

Relevant files:

  • bot/vikingbot/init.py

Sub-PR theme: Add port conflict pre-check for vikingbot

Relevant files:

  • bot/vikingbot/cli/commands.py
  • openviking/server/bootstrap.py

⚡ Recommended focus areas for review

Port Check Limitation

The bot port pre-check uses a hardcoded default port (18790) instead of the actual configured bot port. This will miss conflicts if the bot is running on a non-default port.

VIKINGBOT_DEFAULT_PORT = 18790


def _abort_if_port_in_use(port: int, label: str) -> None:
    """Exit with a clear message if anything is already listening on ``port``.

    Without this, ``--with-bot`` would spawn a vikingbot subprocess that
    silently fails to bind while a stale process keeps serving traffic —
    the operator believes they upgraded but the old binary still answers.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        try:
            s.connect(("127.0.0.1", port))
            in_use = True
        except (ConnectionRefusedError, socket.timeout, OSError):
            in_use = False
    if in_use:
        print(
            f"Error: {label} port {port} is already in use.\n"
            f"  A previous process is still bound — refusing to start a duplicate.\n"
            f"  Identify it:  lsof -nP -iTCP:{port} -sTCP:LISTEN\n"
            f"  Kill it, then retry.",
            file=sys.stderr,
        )
        sys.exit(1)
Code Duplication

The _abort_if_port_in_use function is duplicated in both commands.py and bootstrap.py. This could lead to inconsistent behavior if the function is updated in one place but not the other.

def _abort_if_port_in_use(port: int, label: str) -> None:
    """Exit with a clear message if anything is already listening on ``port``.

    Without this check, a stale process holding the port keeps serving
    traffic while a freshly-started gateway silently fails to bind — the
    operator believes they upgraded but the old (potentially unpatched)
    binary is still answering requests.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        try:
            s.connect(("127.0.0.1", port))
            in_use = True
        except (ConnectionRefusedError, socket.timeout, OSError):
            in_use = False
    if in_use:
        print(
            f"Error: {label} port {port} is already in use.\n"
            f"  A previous process is still bound — refusing to start a duplicate.\n"
            f"  Identify it:  lsof -nP -iTCP:{port} -sTCP:LISTEN\n"
            f"  Kill it, then retry.",
            file=sys.stderr,
        )
        sys.exit(1)

@github-actions
Copy link
Copy Markdown

PR Code Suggestions ✨

No code suggestions found for the PR.

@qin-ctx
Copy link
Copy Markdown
Collaborator

qin-ctx commented Apr 18, 2026

api 测试流水线有报错

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants