[Feature] Implement /v1/embeddings endpoint for OpenAI-compatible API #4550

Open

ZhijunLStudio wants to merge 1 commit into InternLM:main from ZhijunLStudio:feat/embeddings-endpoint

Conversation

@ZhijunLStudio (Contributor)

Motivation

The /v1/embeddings endpoint is a standard OpenAI API endpoint supported by vLLM, SGLang, and TGI. Many downstream tools (LangChain, LlamaIndex, RAG pipelines) depend on it to generate text embeddings. Currently, lmdeploy's /v1/embeddings is a stub that returns an "Unsupported by turbomind" error.

The infrastructure to pass last_hidden_state through the pipeline already exists at the high level (Response, EngineOutput, GenOut all have the field), but the PyTorch engine's internal pipeline never populates it.

Modification

API layer

  • lmdeploy/serve/openai/protocol.py: Add encoding_format field to EmbeddingsRequest (supports float and base64)
  • lmdeploy/serve/openai/api_server.py: Replace stub with full implementation that calls engine with max_new_tokens=1 + output_last_hidden_state='all', applies mean pooling across input sequence, and returns EmbeddingsResponse
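
For illustration, the encoding_format handling might look like the following (a hypothetical helper, not the actual api_server.py code; the base64 branch packs little-endian float32, which is the convention the OpenAI embeddings API uses):

import base64
import struct
from typing import List, Union

def encode_embedding(embedding: List[float],
                     encoding_format: str = 'float') -> Union[List[float], str]:
    # 'float' returns the raw list of floats; 'base64' packs the vector
    # as little-endian float32 and base64-encodes the bytes.
    if encoding_format == 'base64':
        packed = struct.pack(f'<{len(embedding)}f', *embedding)
        return base64.b64encode(packed).decode('ascii')
    return embedding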

PyTorch engine pipeline (threading hidden states from model forward to API response)

  • lmdeploy/pytorch/messages.py: Add output_last_hidden_state field to SamplingParam, add return_last_hidden_states property to SchedulerSequence, replace unsupported warning with validation
  • lmdeploy/pytorch/engine/inputs_maker.py: Add __need_hidden_states check and pass return_last_hidden_states flag
  • lmdeploy/pytorch/engine/model_agent/agent.py: Add last_hidden_states to BatchedOutputs, capture full-sequence hidden states in _async_model_forward before postprocessing slices to the last token, and mean pool per sequence (sketched after this list)
  • lmdeploy/pytorch/engine/engine.py: Add last_hidden_states field to InferOutput
  • lmdeploy/pytorch/engine/engine_loop.py: Thread hidden states through _send_resp and _make_infer_outputs
  • lmdeploy/pytorch/engine/engine_instance.py: Pass last_hidden_state to EngineOutput
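
The capture-then-pool step works roughly like this (an illustrative sketch assuming the batch arrives as a flattened [total_tokens, hidden_dim] tensor; the name pool_hidden_states is hypothetical):

import torch

def pool_hidden_states(hidden_states: torch.Tensor,
                       seq_lengths: list) -> list:
    # hidden_states: [total_tokens, hidden_dim], captured before
    # postprocessing slices the batch down to each sequence's last token.
    pooled = []
    offset = 0
    for length in seq_lengths:
        # Mean pool each sequence's contiguous span of the flat batch.
        pooled.append(hidden_states[offset:offset + length].mean(dim=0))
        offset += length
    return pooled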

Tested with

  • Qwen3-8B on PyTorch backend: single/multi input, cosine similarity (cat/cat-like=0.9754 > cat/stock=0.9478), empty input validation, base64 encoding
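
The similarity ordering can be reproduced with a short client script (a sketch assuming the server from the usage example below is running; response parsing follows the standard OpenAI embeddings schema):

import numpy as np
import requests

resp = requests.post('http://localhost:23333/v1/embeddings',
                     json={'model': 'qwen3',
                           'input': ['cat', 'cat-like', 'stock']})
vecs = [np.array(d['embedding']) for d in resp.json()['data']]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Expect cosine(cat, cat-like) > cosine(cat, stock).
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))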

BC-breaking

No. The new endpoint is additive. Existing TurboMind output_last_hidden_state support is unchanged.

Use cases

# Start server
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch

# Get embeddings
curl -X POST http://localhost:23333/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "input": ["Hello", "World"]}'

Checklist

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  • The documentation has been modified accordingly, like docstring or example tutorials.

Add support for the standard OpenAI embeddings endpoint that extracts
last hidden states from the model and applies mean pooling. This enables
downstream tools (LangChain, LlamaIndex, RAG pipelines) to use lmdeploy
for text embedding generation.

Changes:
- Replace stub /v1/embeddings with full implementation supporting
  float and base64 encoding formats
- Thread last_hidden_states through the PyTorch engine pipeline
  (BatchedOutputs -> InferOutput -> EngineOutput)
- Capture full-sequence hidden states before postprocessing slices
  to last token, and mean pool per-sequence in the engine
- Add output_last_hidden_state to SamplingParam with validation
- Tested end-to-end with Qwen3-8B: cosine similarity ordering
  is correct (0.9754 > 0.9478)
