Skip to content

Code review: findings across CLI, reporter, Flask backend, infra, docs & CI #58

@NikolayS

Description

@NikolayS

Summary

Comprehensive review of the postgresai repo covering architecture, code quality, tests, CI, docs, dependency management, and security. Most of the project is in decent shape — a strong quality framework exists, CodeQL + gitleaks are wired up, and the bulk of the checkup/reporter logic is well commented. However, there are a number of real issues worth fixing. They are grouped below by severity.

Scope examined: cli/ (TypeScript), reporter/ (Python), monitoring_flask_backend/ (Python Flask), config/, docker-compose*.yml, .gitlab-ci.yml, quality/, tests/, terraform/, postgres_ai_helm/, top-level docs.

Tooling run locally: ruff check (0.15.8), git log, module import checks, grep/find across the tree.


🔴 Critical / High

1. PostgreSQL version support gap — CI, conftest, and README disagree

The README advertises PostgreSQL 14-18 (badge + Requirements section), but:

  • conftest.py:9 only searches for binaries in versions ["16", "15", "14", "13", "12", "11"] — never finds PG 17 or PG 18 (the current latest stable).
  • quality/gitlab-ci-quality.yml:225 quality:pg-version-matrix runs only PG_VERSION: ["14", "15", "16"].
  • No CI coverage for PG 17 or PG 18.

This is a correctness/QA claim the project cannot currently back up. Either update the matrix + conftest to cover 17/18, or adjust the README badge to the versions actually tested.

2. Large "planning" doc committed to repo root

TMP_MON_PREVIEWS.md (57 KB, 1000+ lines) is a dated (2026-01-25, v1.7.0) implementation plan for preview environments. The TMP_ prefix and the content ("Status: Implementation Complete - Awaiting VM Provisioning", embedded docker-compose samples at lines 1074, 1347) strongly suggest this should not be checked in. The pre-commit check-added-large-files rule (.pre-commit-config.yaml:11 — maxkb=500) would now block this file, but it was committed earlier.

Action: move to the wiki / issue / internal docs and delete from the repo.

3. README is structurally duplicated / inconsistent

README.md reads like two READMEs glued together:

  • Lines 28-213: modern/lean postgresai README.
  • Lines 228-534: emoji-heavy legacy doc (## 🎯 Use cases, ## 🔧 Management commands, ## 🔄 Upgrading, ## ☸️ Kubernetes deployment, ## 📋 Checkup reports, ## 🌐 Access points, etc.) repeating and partly contradicting the earlier sections.

Concrete problems:

  • Two Roadmaps (one in the clean half, one further down at line 456).
  • A stray <div align="center"> opens at line 224 and closes only at line 535, so the entire bottom half is inadvertently centre-aligned.
  • The testing section at line 464 says "Python-based report generation lives under reporter/" but the CLI is TypeScript and the project's CLI coverage badge (line 10) points at cli:node:tests. Both are true — the README just doesn't explain the mix.
  • The 🧪 Testing section references tests/reporter/ as the entire suite; there are actually 6 test roots (tests/reporter/, tests/monitoring_flask_backend/, tests/e2e/, tests/lock_waits/, tests/settings/, tests/compliance_vectors/).
  • .env.example:6 pins PGAI_TAG=0.14.0, README line 332 also mentions 0.14.0, but cli/package.json:3 and pgai/package.json:3 both have "version": "0.0.0-dev.0". There's no single source of truth for the shipped version.

Action: de-duplicate README.md, remove the stray centre <div>, and document the Python/TypeScript split in one place.

4. Repository URLs point to GitLab only

cli/package.json:9, cli/package.json:12, pgai/package.json:9, pgai/package.json:12 all point to gitlab.com/postgres-ai/postgres_ai. The package is also hosted at github.com/postgres-ai/postgresai (this issue is being filed there), and the README references GitLab for issues while the GitHub repo is the mirror target (.github/workflows/mirror-to-gitlab.yml). Decide which is canonical and make it consistent — at minimum fix bugs.url/repository.url in the npm manifests.

5. Dependency fragmentation & drift

Three separate Python dependency files, no pyproject.toml:

File pytest boto3 requests-aws4auth
reporter/requirements.txt (not pinned) 1.34.69 1.2.3
reporter/requirements-dev.txt 8.3.5
monitoring_flask_backend/requirements.txt 9.0.2 1.34.69 1.3.1

Two different pytest majors, two different requests-aws4auth versions, and boto3 pinned to a ~1-year-old release on both sides. No top-level pyproject.toml → no single dev-env bootstrap. Consider consolidating into pyproject.toml with optional extras ([project.optional-dependencies]) and aligning pins.

6. Massive single-file modules (god-objects)

File LOC Notes
cli/bin/postgres-ai.ts 4,836 Single CLI entry file; commander tree likely needs splitting into subcommand modules.
reporter/postgres_reports.py 5,166 One class (PostgresReportGenerator) with 78 methods; several functions >150 LOC.
cli/lib/checkup.ts 1,744
monitoring_flask_backend/app.py 1,536 10 routes + SQL allowlist logic + truncation helpers all in one module.
cli/lib/init.ts 1,131
cli/lib/issues.ts 1,060

These aren't bugs, but they are the main obstacle to onboarding new contributors and to targeted testing. Suggest extracting one check family per module (e.g. reporter/generators/hX.py, reporter/generators/fX.py) and splitting the Commander tree in postgres-ai.ts into per-command files.


🟠 Medium

7. zip() without strict= on paired time series

reporter/postgres_reports.py:2002, 2021, 2379 (and two more, B905 ×5 total) zip exec/plan or read/write values that are assumed to line up hour-by-hour. If Prometheus returns a different number of points for the two series (stale scrape, gap), results are silently wrong, not errored:

hourly_total_time = [e + p for e, p in zip(exec_values, plan_values)]

Add strict=True (Python 3.10+) or explicit length validation.

8. Bare except Exception: pass swallowing failures

Silent swallows with no logging:

  • reporter/postgres_reports.py:178 — file read failure returns None silently.
  • reporter/postgres_reports.py:4407S110 — metrics parsing failure hidden by comment-only justification.
  • monitoring_flask_backend/app.py:228 — smart query truncation falls back silently.
  • tests/lock_waits/test_lock_waits_metric.py:170, 334, 342 — three occurrences in a single test.

At minimum add logger.debug(…, exc_info=True) so these failures are traceable in production.

9. Flask backend posture

  • monitoring_flask_backend/app.py:403 /health leaks PROMETHEUS_URL to any caller (unauthenticated) — could help an attacker pivot to internal Prometheus.
  • monitoring_flask_backend/app.py:272 Prometheus client defaults to disable_ssl=True; OK for a Docker-internal URL but nothing validates that PROMETHEUS_URL actually points at localhost before disabling TLS. A misconfigured env var would MITM happily.
  • No security headers (X-Content-Type-Options, X-Frame-Options, basic CSP) on the CSV/JSON endpoints. Add a @app.after_request handler.
  • Most routes (/pgss_metrics/csv, /query_texts, /btree_bloat/csv, /table_info/csv, /query_info_metrics, /metrics, /debug/metrics, /version) have no authentication. If the container is not behind an authenticating reverse proxy, these are open. Document the required deployment contract or add optional basic-auth.
  • monitoring_flask_backend/app.py:1536 app.run(host='0.0.0.0', …) — only used when run directly (gunicorn is the real entrypoint) but still ruff-flagged S104; pin to localhost for dev.

10. No Python linting or type-checking in CI

Release-readiness runs bun run typecheck for the CLI, but there is no ruff/mypy/black/flake8 gate for the 6 700+ lines of Python. Running ruff check . --select E,W,B,S surfaces 3 158 issues today, including:

  • 5 × S608 hardcoded SQL (monitoring_flask_backend/app.py:181, 185, 218, test code)
  • 4 × S110 try/except/pass
  • 5 × B905 zip-without-strict (see Text review for errors #7)
  • 9 × B007 unused loop control variable
  • 27 × F841 unused variable
  • 35 × F401 unused import
  • 419 × W293 blank-line-with-whitespace
  • 1 071 × E501 line-too-long
  • 1 × S104 bind-all-interfaces

These are individually small but cumulatively indicate code hygiene drift. Recommend adding ruff check + mypy to .pre-commit-config.yaml and to .gitlab-ci.yml (non-blocking at first, then block on a pinned baseline).

11. CLI smoke job doesn't actually test behaviour

.gitlab-ci.yml:220-250 (cli:node:smoke) validates that the binary runs (--help, mon status --help, prepare-db --print-sql | grep CREATE). It does not assert any CLI output correctness. The real unit tests live in cli:node:tests (bun test) — that's fine, but then the smoke job is misnamed; it's closer to a "binary sanity" check.

12. quality:sql-safety and quality:connection-safety are advisory only

Both have allow_failure: true (quality/gitlab-ci-quality.yml:69, 153). For a PostgreSQL observability tool that executes SQL against customer databases, these checks should eventually be hard gates on main.

13. Test-file names leak coverage-farming history

tests/reporter/test_final_coverage_push.py and tests/reporter/test_final_push_to_85.py exist as top-level files. The names advertise that they were written to hit a coverage threshold rather than to verify behaviour. They should be renamed by concern (e.g. test_parse_memory_value.py, test_format_bytes.py) and merged into the existing per-feature suites.

14. Commander version jump + Bun/Node dual runtime

cli/package.json:44 has "commander": "^14.0.3" and "typescript": "^6.0.2" — both are noticeably ahead of what's typical and may not yet be widely tested with bun. Also the build script rewrites the shebang from #!/usr/bin/env bun to #!/usr/bin/env node via inline node -e; this works but is fragile. Consider a small scripts/postbuild.ts instead.


🟡 Low / polish

15. Makefile is trivial

Makefile (lines 1-15) only exposes up, up-local, down, logs. No test, lint, format, build, install-dev. Given how scattered the build steps are (bun in cli/, pip in reporter/ + monitoring_flask_backend/, helm in postgres_ai_helm/), a thin Makefile would materially help contributors.

16. No issue or PR templates

.github/ has dependabot.yml, codeql-analysis.yml, mirror-to-gitlab.yml, but no ISSUE_TEMPLATE/*.md and no pull_request_template.md. The ecosystem (Cursor, Claude Code) discovers these and they raise the quality of first-issue reports.

17. Cross-module import reaches into Flask backend from reporter

reporter/postgres_reports.py:47 imports monitoring_flask_backend.promql_utils.escape_promql_label with a try/except fallback. Putting the small utility in a shared common/ or publishing it once would eliminate the cross-component coupling and the inline fallback.

18. requirements-aws4auth and boto3 used for AMP auth are not gated behind a feature flag

Both packages ship unconditionally in reporter/requirements.txt even though AMP (AWS-managed Prometheus) support is optional. Moving them to an [amp] extra keeps the default install slim.

19. Misc code smells found by ruff

  • reporter/postgres_reports.py:426 B007 — loop variable label_name never used.
  • tests/compliance_vectors/test_vm_auth.py:116, 125 S607subprocess.run(['helm', …]) uses partial path.
  • Trailing-whitespace / blank-line whitespace spread across hundreds of lines; a single ruff --fix pass will clean it up.

20. Infra nits

  • docker-compose.yml:11postgres-ai-configs:${PGAI_TAG:?…} hard-fails if PGAI_TAG is missing, but there's no hint in the error message to point at .env.example. README has recovery guidance but the error message does not.
  • docker-compose.yml:261 cAdvisor runs privileged: true — necessary for cAdvisor, but worth an inline comment justifying it for security auditors.
  • docker-compose.yml:148 --sink=prometheus://0.0.0.0:9091/pgwatch — literal 0.0.0.0 as a client target reads like a typo; confirm this is "listen on all" vs "connect to all".

What worked well (so we don't regress)

  • quality/QUALITY_ENGINEERING_GUIDE.md + quality/scripts/release-readiness.sh are a solid gate: 70 % Python / 60 % CLI coverage minimums enforced, with 9 other checks (build, types, schemas, gitleaks, large-file, docs).
  • Secret-handling is genuinely careful: gitleaks pre-commit (.githooks/pre-commit), gitleaks pre-commit hook repo, gitleaks.toml with synthetic-secret allowlist, no real secrets committed.
  • /execute-query debug endpoint is defence-in-depth'd: hidden behind ENABLE_DEBUG, constant-time hmac.compare_digest auth, SQL allowlist + quote-aware comment stripper + multi-statement reject + connect_timeout=10 + statement_timeout=10s + read-only transaction. This is the highest-quality code in the repo.
  • CLI schema validation (cli/test/schema-validation.test.ts) against JSON schemas in reporter/schemas/ gives cross-language guarantees.
  • CodeQL + dependabot + mirror-to-gitlab are wired up.

Suggested triage

Priority Items
Do first #1 PG-version mismatch, #2 remove TMP_MON_PREVIEWS.md, #3 de-duplicate README, #4 fix repo URLs
Next #5 consolidate deps, #7 zip(strict=), #8 except-pass logging, #9 Flask hardening, #10 ruff/mypy in CI
Backlog #6 split god-modules, #11-14 CI polish, #15-20 DX & cleanups

Happy to open PRs for any of the "Do first" items — they are small and mostly mechanical.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions