feat: split localpi token status rates #14

Open
osolmaz wants to merge 2 commits into main from feat/token-status-phase-rates

osolmaz commented Jun 26, 2026

Member

Summary

Localpi was showing one token speed number for a whole turn.
That mixed prompt processing time with output generation time.
This change makes the Pi token status extension report generation speed separately, and adds a final prefill rate when usage data is available.
The first streamed assistant output serves as the phase boundary, because it is the only boundary Pi exposes to this extension.

What Changed

The token status extension now tracks when assistant output first appears.
Before that point, it treats the turn as prefill/time-to-first-output; after that point, it reports generation speed using output tokens over generation elapsed time.

  • Added firstOutputAt to per-turn token status state.
  • Changed live status from one whole-turn tok/s value to gen ... tok/s after output begins.
  • Added final prefill ... tok/s when usage input/cache data is available.
  • Kept existing output, input, cache, elapsed, and context status fields.
  • Updated the runtime spec and extension regression test.
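
As a rough illustration of the phase split described above, here is a minimal sketch. Only `firstOutputAt` is a name taken from this PR; the `TurnTiming` shape, the `startedAt` field, and both helper functions are hypothetical and are not the extension's actual code.

```typescript
// Hypothetical per-turn state. Only `firstOutputAt` is named in the PR;
// `startedAt` is an assumed field for illustration.
interface TurnTiming {
  startedAt: number; // ms timestamp when the turn began
  firstOutputAt?: number; // ms timestamp of the first streamed assistant output
}

// Record the phase boundary once, the first time streamed output is seen.
function markFirstOutput(state: TurnTiming, now: number): void {
  if (state.firstOutputAt === undefined) {
    state.firstOutputAt = now;
  }
}

// Generation rate: output tokens divided by time elapsed since first output.
// Returns undefined while the turn is still in prefill/time-to-first-output.
function generationRate(
  state: TurnTiming,
  outputTokens: number,
  now: number,
): number | undefined {
  if (state.firstOutputAt === undefined) return undefined;
  const seconds = (now - state.firstOutputAt) / 1000;
  return seconds > 0 ? outputTokens / seconds : undefined;
}
```

In this sketch the live status would show no `gen ... tok/s` value until `markFirstOutput` has run, which matches the "after output begins" wording in the list above.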

Testing

The changed extension source transpiles, the focused extension tests pass, and the TypeScript project builds.
The full local check fails on this machine because an unrelated runtime test discovers live local model servers.

  • npm run format passed.
  • npm run typecheck passed.
  • npm test -- tests/extensions.test.ts tests/extension-source.test.ts passed.
  • npm run build passed.
  • npm run check failed only in tests/runtime.test.ts > runtime resolution > selects profile aliases for providers with discovery disabled; the failure shows live LM Studio/vLLM models from 127.0.0.1:1234 and 127.0.0.1:8000 being included in catalog data.

Risks

This is a display-only change in the generated Pi extension.
The main limitation is that Pi does not expose DS4's internal prompt-sync boundary here, so localpi uses first streamed output as the observable boundary.

  • If a provider does not stream message_update events, final generation timing falls back to whole-turn elapsed time.
  • Prefill speed is only shown when final usage data includes input token counts.
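
The two limitations above could be sketched as follows. The function names and parameters are hypothetical, and treating a missing first-output timestamp as "no prefill rate" is an assumption of this sketch rather than something the PR states.

```typescript
// Hypothetical helper: seconds attributed to generation at the end of a turn.
// If no streamed output was ever observed, fall back to whole-turn elapsed time.
function finalGenerationSeconds(
  startedAt: number,
  firstOutputAt: number | undefined,
  endedAt: number,
): number {
  const from = firstOutputAt ?? startedAt;
  return (endedAt - from) / 1000;
}

// Hypothetical helper: final prefill rate in tokens per second.
// Returns undefined when usage data has no input token count, or (an assumption
// of this sketch) when no first-output boundary was observed.
function finalPrefillRate(
  inputTokens: number | undefined,
  startedAt: number,
  firstOutputAt: number | undefined,
): number | undefined {
  if (inputTokens === undefined || firstOutputAt === undefined) return undefined;
  const seconds = (firstOutputAt - startedAt) / 1000;
  return seconds > 0 ? inputTokens / seconds : undefined;
}
```

Returning `undefined` instead of `0` lets the status line omit the `prefill ... tok/s` field entirely instead of showing a misleading zero.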

osolmaz commented Jun 26, 2026

Member Author

Final report:

Implemented the localpi token status phase split and pushed the branch.
The extension now reports live generation speed separately from prefill/time-to-first-output and reports final prefill rate from the uncached input buckets when usage data is available.

Validation:

  • npm run format passed.
  • npm run lint passed locally during npm run check before the unrelated runtime-test failure, and passed in Codex review.
  • npm run typecheck passed.
  • npm test -- tests/extensions.test.ts tests/extension-source.test.ts passed.
  • npm run build passed.
  • npm run check locally reached the full test suite but failed in tests/runtime.test.ts > runtime resolution > selects profile aliases for providers with discovery disabled because this machine has live local model servers on 127.0.0.1:1234 and 127.0.0.1:8000 contaminating catalog data.
  • codex review --base main first found a valid cache bucket issue; that was fixed in f0617d5.
  • codex review --base main after the fix found no actionable correctness issues.
  • GitHub CI ci / test passed.

PR is ready for human review/merge.
