
Add optimized secp384r1 (NIST P-384) ECDSA implementation #973

Draft
tamashi095 wants to merge 4 commits into MystenLabs:main from tamashi095:secp384r1

Conversation

@tamashi095


Description

Adds a secp384r1 module implementing ECDSA over NIST P-384, mirroring the secp256r1 module's architecture. It is ~3.6x faster on verify and ~2.4x faster on sign than the RustCrypto p384 crate on the same inputs.

Closes the alignment questions in #972 — please see that issue for the design questions (low-s policy, experimental gating, no-recovery scope) before review.

Motivation: Sui's new 0x2::ecdsa_p384::secp384r1_verify native (sui#26934, for Apple App Attest / Android Key Attestation cert chains) verifies via the p384 crate and shipped with a 54,000 gas base priced off its ~532µs verify (~12.8x fastcrypto's secp256r1). With this implementation the ratio drops to ~3.45x secp256r1, supporting a gas base around ~15,000.

Benchmarks

Criterion, Apple Silicon, rustc 1.92, identical keys/messages/signatures for both implementations (cargo bench --features experimental --bench secp384r1):

| op | fastcrypto (this PR) | RustCrypto p384 0.13.1 | speedup |
|---|---|---|---|
| verify | 143.5 µs | 518.6 µs | 3.61x |
| sign | 119.1 µs | 284.1 µs | 2.39x |

What drives it: the p384 crate uses fully constant-time complete-addition formulas with no precomputation; this PR reuses fastcrypto's WindowedScalarMultiplier (256-point precomputed generator table; Straus interleaved double-mul with width-5 sliding window) over arkworks Jacobian arithmetic (ark-secp384r1 0.4.0), exactly as secp256r1 does. Tuning was benchmarked: width 6 measured ~3% slower than width 5; a 512-point table gained <1% (not worth 2x memory) — so the constants match secp256r1.
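To make the interleaving idea concrete, here is a minimal sketch of a Straus-style double multiplication `a*G + b*Q`. It is deliberately simplified and is not the code in this PR: the "points" are integers modulo a small prime (a toy additive group), windows are fixed rather than sliding, and the width is 4 so it divides 64 evenly. The names `table`, `window`, and `double_mul` are hypothetical.

```rust
// Toy stand-in for curve points: the additive group of integers modulo a prime.
// Hypothetical parameters, chosen only so the sketch runs with std alone.
const M: u64 = 1_000_000_007;
const WIDTH: u32 = 4; // window width in bits (the PR uses 5 with sliding windows)

fn add(p: u64, q: u64) -> u64 {
    (p + q) % M
}

fn double(p: u64) -> u64 {
    add(p, p)
}

/// Precomputes [0*base, 1*base, ..., (2^WIDTH - 1)*base].
fn table(base: u64) -> Vec<u64> {
    let mut t = vec![0u64; 1 << WIDTH];
    for i in 1..t.len() {
        t[i] = add(t[i - 1], base % M);
    }
    t
}

/// Extracts the i-th WIDTH-bit window of k, counting from the least significant end.
fn window(k: u64, i: u32) -> usize {
    ((k >> (i * WIDTH)) & ((1u64 << WIDTH) - 1)) as usize
}

/// Computes a*G + b*Q with one shared chain of doublings (Straus/Shamir trick).
/// `g_table` is the precomputed table for the fixed base G.
fn double_mul(a: u64, g_table: &[u64], b: u64, q: u64) -> u64 {
    let q_table = table(q); // per-call table for the variable base
    let mut acc = 0u64; // group identity
    for i in (0..64 / WIDTH).rev() {
        // Both scalars share these doublings; that is where the saving comes from.
        for _ in 0..WIDTH {
            acc = double(acc);
        }
        acc = add(acc, g_table[window(a, i)]);
        acc = add(acc, q_table[window(b, i)]);
    }
    acc
}
```

The generator table is built once and reused across calls, while the table for `Q` is rebuilt per call; a larger generator table trades memory for fewer additions, which is the trade-off the tuning numbers above describe.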

Design notes

  • Exact p384-crate equivalence: accepts/rejects exactly the same signatures as p384 (asserted per-vector on both wycheproof sets) and produces byte-identical RFC6979/SHA-384 signatures. Consequently, and unlike secp256r1, high-s signatures are accepted and sign does not normalize s: the use case is verifying externally produced X.509/attestation signatures, so rejecting high-s would break real certificate chains. This is documented in the module docs; signatures are therefore malleable ((r, s) -> (r, n - s)).
  • Gated behind experimental with an explicit not-yet-audited note (open question in #972: Sui consuming an experimental module for a production native is a real tension).
  • No recoverable.rs — public-key recovery isn't needed for certificate verification (v1 scope).
  • New Sha384 hash function added to hash.rs; sign/verify are generic over digest length (verify_with_hash::<Sha256, 32> etc.); DefaultHash is SHA-384 matching the p384 crate.
  • Side-channel posture matches secp256r1: signing uses the same constant-time fixed-window path as secp256r1's signer; verification (public data) uses the same vartime sliding-window double-mul.
  • Dependencies added: p384 0.13, ark-secp384r1 0.4.0; cargo deny check bans licenses sources is clean.
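The malleability mentioned in the first design note can be illustrated on the fixed 96-byte `r || s` encoding. The sketch below is an illustration only, not code from this PR: `negate_s`, `malleate`, and `is_high_s` are hypothetical helper names, `N` is the published P-384 group order, and the functions assume `0 < s < n` without checking it.

```rust
/// Order n of the P-384 group, big-endian.
const N: [u8; 48] = [
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xc7, 0x63, 0x4d, 0x81, 0xf4, 0x37, 0x2d, 0xdf,
    0x58, 0x1a, 0x0d, 0xb2, 0x48, 0xb0, 0xa7, 0x7a,
    0xec, 0xec, 0x19, 0x6a, 0xcc, 0xc5, 0x29, 0x73,
];

/// Computes n - s by big-endian schoolbook subtraction (assumes 0 < s < n).
fn negate_s(s: &[u8; 48]) -> [u8; 48] {
    let mut out = [0u8; 48];
    let mut borrow = 0i16;
    for i in (0..48).rev() {
        let mut d = N[i] as i16 - s[i] as i16 - borrow;
        if d < 0 {
            d += 256;
            borrow = 1;
        } else {
            borrow = 0;
        }
        out[i] = d as u8;
    }
    out
}

/// Copies the s half (bytes 48..96) out of a fixed-width signature.
fn s_of(sig: &[u8; 96]) -> [u8; 48] {
    let mut s = [0u8; 48];
    s.copy_from_slice(&sig[48..]);
    s
}

/// Maps r || s to its twin r || (n - s); both verify under the same key and message.
fn malleate(sig: &[u8; 96]) -> [u8; 96] {
    let mut out = *sig;
    out[48..].copy_from_slice(&negate_s(&s_of(sig)));
    out
}

/// True when s > n - s, i.e. s lies in the upper half of the scalar range.
/// Big-endian arrays compare lexicographically, which matches numeric order.
fn is_high_s(sig: &[u8; 96]) -> bool {
    let s = s_of(sig);
    s > negate_s(&s)
}
```

A scheme with a low-s rule would reject any signature for which `is_high_s` returns true; this module accepts both twins, so callers that need a unique signature per message have to normalize themselves.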

Test plan

46 new tests, all green (cargo test --all-features):

  • Wycheproof EcdsaSecp384r1Sha384 + EcdsaSecp384r1Sha512 (no SHA-256 set exists upstream); every vector additionally asserted to accept/reject identically to the p384 crate.
  • RFC 6979 A.2.6 KATs (P-384/SHA-384, 'sample' and 'test').
  • Equivalence: byte-for-byte sign equality with p384 (fixed vectors + 64-case proptest incl. mutation rejection), cross-verification both directions, high-s acceptance test.
  • Scheme tests mirroring secp256r1_tests.rs (serde/base64/ordering/zeroize-on-drop/display-elision/batch/...), group + conversion + multiplier tests, runnable doctest.

CI gates verified locally: cargo fmt --check, cargo xclippy (zero warnings), license headers, cargo deny check bans licenses sources, cargo build --benches --features experimental,copy_key,unsecure_schemes, scripts/changed-files.sh. (The only failing test in the full workspace run is the pre-existing fastcrypto-zkp zk_login e2e test, which calls an external rapidsnark service that returned HTTP 500 — unrelated.)

Honest limits

This does not reach secp256r1's ~42µs: P-384's 6-limb field makes each multiplication ~2.3x dearer and there are 1.5x more scalar bits to process, so ~3.4x secp256r1's cost is in line with curve-size scaling. What kept the p384 crate above that scaling was the missing precomputation/window machinery, which this PR adds. Further gains would need a specialized P-384 field backend (assembly or fiat-crypto-style dedicated reduction), which I deliberately avoided in favor of vetted arkworks arithmetic.
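The scaling estimate can be written out as a quick check. The figures are the ones quoted above; reading the ~2.3x per-multiplication factor as roughly the schoolbook limb ratio $(6/4)^2 = 2.25$ is an assumption of this note, not a measurement.

```latex
\frac{t_{\text{P-384}}}{t_{\text{P-256}}}
  \approx \underbrace{2.3}_{\text{cost per field multiplication}}
  \times \underbrace{\tfrac{384}{256}}_{\text{scalar bits}}
  = 2.3 \times 1.5 \approx 3.45,
\qquad
\frac{143.5\ \mu\text{s}}{42\ \mu\text{s}} \approx 3.42 .
```

The estimated and measured ratios land within a few percent of each other, which is what "in line with curve-size scaling" means here.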

Mirrors the secp256r1 module architecture: p384 crate types for
encoding and RFC6979 nonce generation, ark-secp384r1 for field and
curve arithmetic, and a WindowedScalarMultiplier with a precomputed
generator table for fast fixed-base and double-base scalar
multiplication.

Unlike secp256r1, high-s signatures are accepted and sign does not
normalize s, matching the RustCrypto p384 crate exactly (required for
verifying X.509/attestation certificate chains).

Gated behind the experimental feature (not yet audited).

Each iteration signs a random message with both implementations
(asserting byte equality) and verifies a signature variant (valid,
bit-flipped, edge-case scalars, random bytes, wrong key, malleated)
with both, asserting identical accept/reject decisions for SHA-256,
SHA-384 and SHA-512 digests. Runs 100 iterations in CI; longer
sessions via SECP384R1_FUZZ_ITERATIONS.
@tamashi095

Author

Added a differential fuzz test (fuzz_differential_against_p384_crate) that drives this implementation and the p384 crate with identical inputs: every iteration asserts byte-identical RFC6979 signing, then verifies a signature variant — valid, single-bit-flipped, edge-case scalars (1, 2, n−1, >n), pure random bytes, wrong-key, or malleated (r, n−s) — with both implementations, asserting identical accept/reject decisions for SHA-256, SHA-384 and SHA-512 digests (exercising the bits2field pad/truncate paths).
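For readers unfamiliar with the pad/truncate paths, the sketch below shows the usual ECDSA convention for fitting a digest to a 48-byte field element: a shorter digest is left-padded with zeros and a longer one keeps only its leftmost 48 bytes. The function name `digest_to_field_bytes` is hypothetical, and the sketch omits any minimum-length check a real implementation may apply.

```rust
const FIELD_BYTES: usize = 48; // P-384 scalars are 384 bits wide

/// Fits a message digest into a fixed 48-byte big-endian buffer.
fn digest_to_field_bytes(digest: &[u8]) -> [u8; FIELD_BYTES] {
    let mut out = [0u8; FIELD_BYTES];
    if digest.len() >= FIELD_BYTES {
        // SHA-512 case: keep the leftmost 48 bytes, drop the rest.
        out.copy_from_slice(&digest[..FIELD_BYTES]);
    } else {
        // SHA-256 case: left-pad with zeros so the integer value is unchanged.
        out[FIELD_BYTES - digest.len()..].copy_from_slice(digest);
    }
    out
}
```

With SHA-384 the digest already has the right length, so only the SHA-256 and SHA-512 runs exercise the two branches.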

Ran a 250,000-iteration release-mode session locally (~250k differential sign comparisons, ~1.3M differential verify decisions): zero disagreements, zero panics (seed ca9eda6b156a2da4e449246fafb3f8e37c40dcb2edb4840792c56ed79f5a8822, 634s). The committed test runs 100 iterations in CI with a fresh printed seed each run; longer sessions via SECP384R1_FUZZ_ITERATIONS.

Since the two pipelines share no math code (arkworks Jacobian + Straus windows here vs fiat-crypto field + complete formulas in p384), agreement on random and adversarial inputs is strong evidence of behavioural equivalence beyond the wycheproof/RFC 6979 fixed vectors.
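The shape of such a differential test can be sketched with std alone. Everything below is a stand-in rather than the committed test: the two "implementations" are a naive and a square-and-multiply modular power, the PRNG is a small xorshift written inline, and `run_differential` and `iterations` are hypothetical names. Only the environment variable name is taken from this PR.

```rust
use std::env;

/// Minimal xorshift64 generator so a printed seed reproduces a run.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

const P: u64 = 1_000_000_007;

/// Reference implementation: repeated multiplication.
fn pow_reference(base: u64, exp: u32) -> u64 {
    let mut acc = 1u64;
    for _ in 0..exp {
        acc = acc * (base % P) % P;
    }
    acc
}

/// Optimized implementation: square-and-multiply.
fn pow_optimized(mut base: u64, mut exp: u32) -> u64 {
    let mut acc = 1u64;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Iteration count: 100 by default, overridable for longer sessions.
fn iterations() -> u64 {
    env::var("SECP384R1_FUZZ_ITERATIONS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(100)
}

/// Feeds identical random inputs to both implementations and counts disagreements.
fn run_differential(seed: u64, iterations: u64) -> u64 {
    let mut rng = XorShift64(seed | 1); // xorshift state must be non-zero
    let mut disagreements = 0;
    for _ in 0..iterations {
        let base = rng.next_u64();
        let exp = (rng.next_u64() % 512) as u32;
        if pow_reference(base, exp) != pow_optimized(base, exp) {
            disagreements += 1;
        }
    }
    disagreements
}
```

A test would call `run_differential(seed, iterations())`, print the seed, and assert that the result is zero, so any failure can be replayed from the log.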

@tamashi095

Author

A few additional notes for reviewers, surfaced while double-checking the PR for gaps:

  1. Benchmarks are Apple Silicon only. The 3.6x/2.4x ratios should be re-validated on x86_64 (validator hardware) before any gas repricing in Sui — arkworks' generic Montgomery code and the p384 crate's fiat-crypto codegen may scale differently across ISAs. The bench (cargo bench --features experimental --bench secp384r1) compares both implementations on identical inputs, so reproducing on a target machine is one command.

  2. Worst-case vs average-case verify time. Verification is variable-time (public inputs only, same posture as secp256r1). There is no adversarial blow-up input — precomputation sizes are fixed and the main loop is bounded by the scalar bit length — but timing varies by roughly ±10% with the scalar bit patterns, so a gas price derived from these numbers should use the worst case, whereas the p384 crate's constant-time verify is flat by construction.

  3. DER parsing is out of scope here. Attestation certificate chains carry DER-encoded ECDSA signatures; like the other fastcrypto schemes, Secp384r1Signature only accepts the fixed 96-byte r || s encoding. If Sui wants to drop its direct p384 dependency entirely, either fastcrypto grows a from_der constructor or the caller keeps doing DER->fixed conversion — happy to add from_der if you prefer.

  4. Zeroization hygiene matches secp256r1. Secret material passes through unzeroized stack temporaries in the arkworks conversion layer during signing (as it does in secp256r1's signer); the cached key bytes and key objects zeroize on drop. Documented here for completeness rather than as a regression.

Also added the module to the README scheme list (including the high-s acceptance note, since the README is where the other curves' low-s rules are documented).
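As a rough illustration of note 2, the sketch below scales a gas base linearly with verify latency and then adds headroom for the timing spread. Linear scaling is an assumption of this sketch, not a statement about how Sui prices natives, and `scaled_gas` is a hypothetical helper.

```rust
/// Scales a gas base by the latency ratio, then adds worst-case headroom.
/// `spread` is the fractional timing variation, e.g. 0.10 for +10%.
fn scaled_gas(base_gas: f64, old_micros: f64, new_micros: f64, spread: f64) -> f64 {
    base_gas * (new_micros / old_micros) * (1.0 + spread)
}
```

With the figures quoted in this PR (54,000 gas at ~532 µs, 143.5 µs here), `scaled_gas(54_000.0, 532.0, 143.5, 0.0)` gives about 14,566 and a 10% spread raises that to about 16,022, so pricing off the worst case rather than the mean moves the result by roughly 1,500 gas.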

Attestation certificate chains carry DER-encoded ECDSA signatures.
Parsing accepts exactly the same encodings as the p384 crate, which is
asserted per-vector on the wycheproof DER test cases.
@tamashi095

Author

Resolved note 3 from the previous comment: added Secp384r1Signature::from_der (2acf028) so Sui can drop its direct p384 dependency entirely — attestation certificate chains carry DER-encoded signatures, and requiring callers to do their own DER conversion would have kept p384 in the dependency tree.

Parsing delegates to the p384 crate's strict DER parser, so it accepts exactly the same encodings (asserted per-vector on the wycheproof DER test cases, which include the BER-variant and integer-mangling cases, for both test sets). as_ref on a DER-parsed signature returns the fixed 96-byte r || s form. to_der is intentionally omitted — the use case only imports certificate signatures; happy to add it if you'd like a symmetric API.
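For context on what the DER-to-fixed conversion involves, here is a minimal std-only sketch. It is not the p384 crate's parser and not the code behind `from_der`; `der_to_fixed`, `read_integer`, and `write_scalar` are hypothetical names. It assumes short-form lengths only (enough for P-384, where the sequence body is at most 102 bytes) and leaves range checks on r and s to the caller.

```rust
const SCALAR_BYTES: usize = 48;

/// Reads one DER INTEGER, returning (value bytes, remaining input).
fn read_integer(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let (&tag, rest) = input.split_first()?;
    let (&len, rest) = rest.split_first()?;
    let len = len as usize;
    if tag != 0x02 || len == 0 || len >= 0x80 || rest.len() < len {
        return None;
    }
    let (value, rest) = rest.split_at(len);
    // Negative integers are never valid signature components.
    if value[0] & 0x80 != 0 {
        return None;
    }
    // Strict DER: a leading zero is allowed only to clear the sign bit.
    if len > 1 && value[0] == 0x00 && value[1] & 0x80 == 0 {
        return None;
    }
    Some((value, rest))
}

/// Right-aligns an integer into a 48-byte big-endian slot.
fn write_scalar(value: &[u8], out: &mut [u8]) -> Option<()> {
    // Drop the single sign-padding zero byte if present.
    let value = if value.len() > 1 && value[0] == 0 { &value[1..] } else { value };
    if value.len() > SCALAR_BYTES {
        return None;
    }
    out[SCALAR_BYTES - value.len()..].copy_from_slice(value);
    Some(())
}

/// Converts SEQUENCE { INTEGER r, INTEGER s } into the fixed r || s form.
fn der_to_fixed(der: &[u8]) -> Option<[u8; 96]> {
    let (&tag, rest) = der.split_first()?;
    let (&len, body) = rest.split_first()?;
    if tag != 0x30 || len >= 0x80 || body.len() != len as usize {
        return None;
    }
    let (r, rest) = read_integer(body)?;
    let (s, rest) = read_integer(rest)?;
    if !rest.is_empty() {
        return None; // trailing bytes inside the sequence
    }
    let mut out = [0u8; 96];
    write_scalar(r, &mut out[..SCALAR_BYTES])?;
    write_scalar(s, &mut out[SCALAR_BYTES..])?;
    Some(out)
}
```

Rejecting non-minimal integers and trailing bytes is what separates strict DER from the BER variants that the wycheproof vectors probe; delegating to an existing strict parser, as this PR does, avoids re-deriving those rules.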

@tamashi095

Author

Resolved note 1 (Apple Silicon-only benchmarks): re-ran the same bench on x86_64 — an AMD EPYC dedicated-performance VM (Fly.io, 2 dedicated vCPUs, rust 1.92, cargo bench -p fastcrypto --features experimental --bench secp384r1 on this branch). The speedup ratio holds across ISAs, slightly better on x86 than on ARM:

| op | fastcrypto | RustCrypto p384 | speedup |
|---|---|---|---|
| verify (x86_64 EPYC) | 402.4 µs | 1512.0 µs | 3.76x |
| sign (x86_64 EPYC) | 335.6 µs | 840.3 µs | 2.50x |
| verify (Apple Silicon) | 143.5 µs | 518.6 µs | 3.61x |
| sign (Apple Silicon) | 119.1 µs | 284.1 µs | 2.39x |

Absolute numbers on the EPYC VM are ~2.8x the Apple Silicon ones across the board (virtualized, conservative clocks — the hypervisor masks the EPYC generation), which is exactly why the ratio, not the absolute latency, is the portable claim. Both implementations were measured back-to-back on the same core in the same run, and the ours/p384 ratio agrees within ~4% across the two ISAs. Final gas numbers should still come from Sui's own reference hardware, but the speedup is not an artifact of ARM codegen.
