Skip to content

Road to 1.0: comprehensive audit + offer to contribute fixes #307

Description

@n-rodriguez

Hi 👋

While evaluating crystal-pg for a Crystal project, I ran a full multi-dimension audit toward a 1.0 tag (correctness, concurrency / fiber-safety, error model, perf/memory/CPU, PostgreSQL protocol completeness, public API / SemVer surface, tests / CI). I want to lead with the honest part: the architecture is genuinely good. A clean two-layer split — the wire protocol in PQ::* cleanly separated from the crystal-db driver in PG::* — correct binary decoders for the large majority of types, SCRAM-SHA-256 + channel binding (rare and well done), logical replication (pgoutput), and a real round-trip spec suite against Postgres 14–18. The bones are solid.

This is meant as a constructive map of what stands between today and a 1.0 that can be promised — and I'm offering to contribute the fixes, starting with the blockers and the most-requested security/recovery gaps.

📄 Full audit (canonical, cited file:line): https://gist.github.com/n-rodriguez/4cbd8a46b0892adc8305156c70273a50

Summary by tier

Tier Theme Headline
0 Blockers PG::Interval#to_spans calls methods that don't exist (won't compile when called); .size instead of .bytesize corrupts the startup/password packet for non-ASCII; a stray pp debug print in replication; CI triggers on main while the default branch is master (push CI never fires); a decode error poisons a pooled connection silently
1 Security verify-ca/verify-full are accepted but the TLS context is hard-coded VerifyMode::NONE — no cert/hostname verification (silent MITM exposure); an over-permissive .pgpass is warned about but read anyway; PGSSLMODE/PGSSL* env vars are ignored
2 Correctness & concurrency RowDescription table_oid/column fields swapped + truncated; infinity/-infinity timestamps unhandled; Numeric#to_s fractional truncation is fragile; two array code paths disagree on non-1 lower bounds; the LISTEN/NOTIFY loop shares the socket read path with queries without a lock; @@location_cache is mutated cross-fiber without synchronization; two disjoint exception trees (PQ::PQError is not under PG::Error)
3 Protocol completeness verify-* unimplemented; NegotiateProtocolVersion unhandled; no CancelRequest; missing types (time/timetz, inet/cidr/macaddr, money, ranges); connect_timeout/options not honored (a dead host hangs forever); replication: no slot creation, proto pinned to v1, ErrorFrame ignored
4 Perf / memory / CPU decoder resolved by per-value OID hash lookup instead of per-column; no real prepared statements (Parse on every exec — the unnamed statement); Array(Slice?) + per-column copy per row; text param encoding allocates per value
5 API / SemVer the whole PQ::* namespace is :nodoc: but public in practice (callbacks return PQ::Notification/Notice); PQ::Connection#soc + protocol I/O are public; sslmode is a Symbol; Geo::Path is inconsistent with its sibling records; PG::Error is :nodoc: yet is the root users must rescue; dead pre-Crystal-1.0 monkeypatches
6 Tests / CI / packaging CI is nix-only with a single Crystal version; pq/frame.cr, pq/query.cr, ext/openssl.cr, pg/record.cr have zero coverage; auth & escape-error specs are disabled by default; depends on crystal-db which is itself pre-1.0

All headline compile / wire-format / security findings were verified by hand against the source.

How this maps to existing PRs and issues

Several long-standing issues corroborate these findings — they're not theoretical:

The takeaway: Tier 0 and Tier 1 have no competing PRs but multiple corroborating issues — that's where I'd like to start.

Proposed plan

I'll open one child issue per tier (linked below) so each is independently discussable, and I'd like to start with a PR for Tier 0 (the blockers) — each with a failing spec first, plus the CI branch fix so they can't recur — and then Tier 1 (verify-full, closing #240/#237/#231).

Two questions before I send code:

  1. Spec framework for new specs — stdlib spec (what the suite uses today)?
  2. For the completeness gaps (Tier 3) and security (Tier 1): prefer to implement them (real TLS verification, missing types, query cancel), or to stop advertising them (reject verify-* and unsupported sslmodes loudly) until they exist? Happy either way.

Thanks for building and maintaining this — it's the reference Postgres driver for Crystal, and I'd be glad to help push it toward 1.0. 🙏

ℹ️ Transparency: this audit was produced with Claude Opus 4.8 [1M] (multi-pass static review), then the headline compile / wire-format / security findings were verified by hand against the source.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions