[Feature] Implement /v1/embeddings endpoint for OpenAI-compatible API #4550

Open

ZhijunLStudio wants to merge 1 commit into InternLM:main from ZhijunLStudio:feat/embeddings-endpoint

Conversation

@ZhijunLStudio (Contributor)

Motivation

The /v1/embeddings endpoint is a standard OpenAI API endpoint supported by vLLM, SGLang, and TGI. Many downstream tools (LangChain, LlamaIndex, RAG pipelines) depend on it to generate text embeddings. Currently, lmdeploy's /v1/embeddings is a stub that returns an "Unsupported by turbomind" error.

The infrastructure to pass last_hidden_state through the pipeline already exists at the high level (Response, EngineOutput, GenOut all have the field), but the PyTorch engine's internal pipeline never populates it.

Modification

API layer

  • lmdeploy/serve/openai/protocol.py: Add encoding_format field to EmbeddingsRequest (supports float and base64)
  • lmdeploy/serve/openai/api_server.py: Replace stub with full implementation that calls engine with max_new_tokens=1 + output_last_hidden_state='all', applies mean pooling across input sequence, and returns EmbeddingsResponse
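
For illustration, the encoding_format handling might look like the following (a hypothetical helper, not the actual api_server.py code; the base64 branch packs little-endian float32, which is the convention the OpenAI embeddings API uses):

import base64
import struct
from typing import List, Union

def encode_embedding(embedding: List[float],
                     encoding_format: str = 'float') -> Union[List[float], str]:
    # 'float' returns the raw list of floats; 'base64' packs the vector
    # as little-endian float32 and base64-encodes the bytes.
    if encoding_format == 'base64':
        packed = struct.pack(f'<{len(embedding)}f', *embedding)
        return base64.b64encode(packed).decode('ascii')
    return embedding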

PyTorch engine pipeline (threading hidden states from model forward to API response)

  • lmdeploy/pytorch/messages.py: Add output_last_hidden_state field to SamplingParam, add return_last_hidden_states property to SchedulerSequence, replace unsupported warning with validation
  • lmdeploy/pytorch/engine/inputs_maker.py: Add __need_hidden_states check and pass return_last_hidden_states flag
  • lmdeploy/pytorch/engine/model_agent/agent.py: Add last_hidden_states to BatchedOutputs, capture full-sequence hidden states in _async_model_forward before postprocessing slices to the last token, and mean pool per sequence (sketched after this list)
  • lmdeploy/pytorch/engine/engine.py: Add last_hidden_states field to InferOutput
  • lmdeploy/pytorch/engine/engine_loop.py: Thread hidden states through _send_resp and _make_infer_outputs
  • lmdeploy/pytorch/engine/engine_instance.py: Pass last_hidden_state to EngineOutput
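
The capture-then-pool step works roughly like this (an illustrative sketch assuming the batch arrives as a flattened [total_tokens, hidden_dim] tensor; the name pool_hidden_states is hypothetical):

import torch

def pool_hidden_states(hidden_states: torch.Tensor,
                       seq_lengths: list) -> list:
    # hidden_states: [total_tokens, hidden_dim], captured before
    # postprocessing slices the batch down to each sequence's last token.
    pooled = []
    offset = 0
    for length in seq_lengths:
        # Mean pool each sequence's contiguous span of the flat batch.
        pooled.append(hidden_states[offset:offset + length].mean(dim=0))
        offset += length
    return pooled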

Tested with

  • Qwen3-8B on PyTorch backend: single/multi input, cosine similarity (cat/cat-like=0.9754 > cat/stock=0.9478), empty input validation, base64 encoding
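
The similarity ordering can be reproduced with a short client script (a sketch assuming the server from the usage example below is running; response parsing follows the standard OpenAI embeddings schema):

import numpy as np
import requests

resp = requests.post('http://localhost:23333/v1/embeddings',
                     json={'model': 'qwen3',
                           'input': ['cat', 'cat-like', 'stock']})
vecs = [np.array(d['embedding']) for d in resp.json()['data']]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Expect cosine(cat, cat-like) > cosine(cat, stock).
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))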

BC-breaking

No. The new endpoint is additive. Existing TurboMind output_last_hidden_state support is unchanged.

Use cases

# Start server
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch

# Get embeddings
curl -X POST http://localhost:23333/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "input": ["Hello", "World"]}'

Checklist

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  • The documentation has been modified accordingly, like docstring or example tutorials.

Add support for the standard OpenAI embeddings endpoint that extracts
last hidden states from the model and applies mean pooling. This enables
downstream tools (LangChain, LlamaIndex, RAG pipelines) to use lmdeploy
for text embedding generation.

Changes:
- Replace stub /v1/embeddings with full implementation supporting
  float and base64 encoding formats
- Thread last_hidden_states through the PyTorch engine pipeline
  (BatchedOutputs -> InferOutput -> EngineOutput)
- Capture full-sequence hidden states before postprocessing slices
  to last token, and mean pool per-sequence in the engine
- Add output_last_hidden_state to SamplingParam with validation
- Tested end-to-end with Qwen3-8B: cosine similarity ordering
  is correct (0.9754 > 0.9478)
