Add streaming support for chat: fix async generator bug and add documentation #7187
Conversation
Implements real-time streaming for chat UI elements.

Backend changes:
- Modified chat._send_prompt to stream chunks via _send_message for async generators
- Added UUID generation for tracking streaming messages
- Sends incremental chunks with 'stream_chunk' type and 'is_final' flag
- Updated docstring to reflect streaming support

Frontend changes:
- Added MarimoIncomingMessageEvent listener to receive streaming chunks
- Implemented streamingStateRef to track backend message_id and frontend message index
- Creates placeholder messages and updates them in real-time as chunks arrive
- Added host prop to Chatbot component for event listening

AI model changes:
- Converted mo.ai.llm.openai to async generator with stream=True
- Yields accumulated content as tokens arrive from the API
- Enables automatic streaming for built-in OpenAI models

The implementation uses existing SendUIElementMessage infrastructure for bidirectional communication via WebSockets. Async generator functions now automatically stream responses to the frontend in real-time. Resolves the TODO at line 232 in chat.py about streaming support.
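As a sketch of what this enables on the user side, a custom async-generator model along these lines now streams each yielded value to the frontend (the model function and its body here are illustrative, not part of the diff):

```python
import asyncio

import marimo as mo


async def my_model(messages, config):
    # Each yielded value is forwarded to the frontend via _send_message
    # as a "stream_chunk" payload (with a message_id and is_final flag),
    # so the chat UI updates in real time.
    reply = f"You said: {messages[-1].content}"
    accumulated = ""
    for word in reply.split():
        accumulated += word + " "
        await asyncio.sleep(0.05)  # simulate token latency
        yield accumulated  # yield the full text so far (server-side accumulation)


chatbot = mo.ui.chat(my_model)
chatbot
```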
Backend: send streaming chunks via WebSocket messages (_send_message)
- Async generators now stream responses to frontend in real-time
- Each chunk includes message_id, content, and is_final flag
- Updated documentation to reflect streaming support

Frontend: listen for streaming chunks via MarimoIncomingMessageEvent
- Track streaming state and update UI as chunks arrive
- Accumulate and display content incrementally
- Support both streaming and non-streaming responses

Built-in models: add stream parameter to mo.ai.llm.openai
- Defaults to False for backward compatibility
- When True, streams tokens from OpenAI API as async generator
- Supports both streaming and non-streaming modes

Examples: add streaming chat examples
- streaming_custom.py: shows custom async generator streaming
- streaming_openai.py: shows OpenAI API streaming with stream=True
- Updated README.md with streaming documentation

Streaming creates a ChatGPT-like experience where responses appear token-by-token as they're generated, improving perceived responsiveness.
- Test verifies streaming async generators send chunks via _send_message
- Validates message structure (type, message_id, content, is_final)
- Confirms intermediate chunks have is_final=False
- Confirms final chunk has is_final=True and accumulated content
Change 'return' to 'yield' in non-streaming path to fix SyntaxError. In Python, async generators cannot use 'return' with a value - they must consistently use 'yield' throughout the function.
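A minimal repro of the rule (plain Python, not the PR code): a `return <value>` anywhere in an async generator is rejected at compile time, so both paths have to yield.

```python
import asyncio

# Invalid -- the body contains "yield", making this an async generator,
# and "return <value>" in an async generator is a SyntaxError:
#
#   async def respond(prompt, stream):
#       if not stream:
#           return "full response"   # SyntaxError
#       yield "chunk"


# Valid -- yield in both branches:
async def respond(prompt: str, stream: bool):
    if not stream:
        yield f"full response to {prompt!r}"
    else:
        for word in ("streamed", "response"):
            yield word


async def main() -> None:
    async for chunk in respond("hi", stream=True):
        print(chunk)


asyncio.run(main())
```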
- Remove trailing whitespace - Fix return statement to return tuple with chatbot only - Use single quotes in markdown code example to avoid escaping issues when marimo auto-saves the notebook
- Add 'Streaming Responses' section explaining real-time streaming - Show how to enable streaming with built-in models (stream=True) - Show how to implement streaming with custom async generators - Include links to streaming examples in the repo
- Add stream parameter to anthropic, google, groq, and bedrock models - Implement streaming logic for each model using their native APIs - Update documentation to list all models that support streaming - All models now follow the same pattern: stream=True enables real-time responses
- Add commented-out stream=True parameter to all example files - Shows users how to enable streaming without changing default behavior - Covers: OpenAI, Anthropic, Google (Gemini), Groq, and Bedrock examples
- Add noqa comment for unused buffers parameter - Replace == False with 'not' operator - Replace == True with direct boolean check
for word in words:
    accumulated += word + " "
    yield accumulated
should this be just yield word? seems like the accumulation happens client-side
The accumulation must happen server-side. Each yield sends the full accumulated text to the frontend, which displays it. If we yielded just word, the UI would only show individual words instead of the progressively building response. This is consistent with how all built-in models work (OpenAI, Anthropic, Google, etc.).
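The difference in one small sketch (illustrative only): yielding bare words would make the UI show one word at a time, while yielding the accumulated string shows the reply growing.

```python
words = ["the", "reply", "grows"]


def per_word():
    for word in words:
        yield word  # UI would show just "the", then "reply", then "grows"


def accumulated_text():
    text = ""
    for word in words:
        text += word + " "
        yield text  # UI shows "the ", "the reply ", "the reply grows "
```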
chunk_text = str(latest_response)

# Send incremental update to frontend
self._send_message(
could we DRY this up? to:
self._send_message(
    {
        "type": "stream_chunk",
        "message_id": message_id,
        "content": accumulated_text,
        "is_final": latest_response is not None,
    },
    buffers=None,
)
Good catch! Simplified by eliminating the chunk_text variable and directly using accumulated_text. Fixed in d71c725.
marimo/_ai/llm/_impl.py (outdated)

return response.choices[0].message.content
if self.stream:
    # Stream the response
    response = litellm_completion(
i think we can write response = litellm_completion( once with stream=self.stream and then just handle the response differently in the if/else
Done! Unified the completion call using stream=self.stream. The different response types are handled in the if/else branches. Fixed in d71c725.
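For reference, the resulting shape is roughly the following (a sketch assuming litellm's `completion` API and the helper-method split adopted later in this PR; names are illustrative, not the exact diff):

```python
from typing import Any, Generator, Union

from litellm import completion as litellm_completion


def complete(
    model: str,
    messages: list[dict[str, Any]],
    stream: bool,
    max_tokens: int = 4096,
) -> Union[str, Generator[str, None, None]]:
    # Single call site; only the response handling differs below.
    response = litellm_completion(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        stream=stream,
    )
    if stream:
        return _stream_response(response)
    return response.choices[0].message.content


def _stream_response(response: Any) -> Generator[str, None, None]:
    # Streaming chunks carry incremental deltas; yield the accumulated
    # text so the UI always shows the full reply so far.
    accumulated = ""
    for chunk in response:
        delta = chunk.choices[0].delta.content or ""
        accumulated += delta
        yield accumulated
```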
marimo/_ai/llm/_impl.py (outdated)

response = client.models.generate_content_stream(
    model=self.model,
    contents=google_messages,
    config={
could we DRY up this config and pull it out above
Agreed! Extracted the config dict to a single generation_config variable. Fixed in d71c725.
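For reference, the extracted config roughly looks like this (a sketch against the google-genai client API; field values and the wrapper function are illustrative):

```python
from typing import Any

from google import genai  # google-genai SDK


def generate(
    client: genai.Client,
    model: str,
    google_messages: list[Any],
    system_message: str,
    max_tokens: int,
    stream: bool,
) -> Any:
    # Built once and shared by both the streaming and non-streaming calls.
    generation_config = {
        "system_instruction": system_message,
        "max_output_tokens": max_tokens,
    }
    if stream:
        return client.models.generate_content_stream(
            model=model, contents=google_messages, config=generation_config
        )
    return client.models.generate_content(
        model=model, contents=google_messages, config=generation_config
    )
```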
- Change delete icon to rotate-cw icon for clearing chat history
- Add disabled state when no messages exist (consistent with send button)
- Improves UX clarity: trash icon was confusing as it looked like deleting the cell rather than resetting the conversation
- Individual message delete buttons still use trash icon appropriately
Additional UI Improvement: Chat Reset Icon Change

Changed the chat reset button icon from trash to rotate-clockwise to improve UX clarity. See commit: 4b987ea
…dels

Problem:
- When stream=False, chat models returned None instead of strings
- Python treats any function with 'yield' as a generator, even if the yield is in an unexecuted branch
- This caused __call__ to return a generator object instead of a string when stream=False

Solution:
- Extract streaming logic into separate _stream_response() helper methods
- __call__ now only contains return statements (no yield)
- When stream=True: returns generator from helper method
- When stream=False: returns string directly
- Maintains backward compatibility (kept sync def __call__)

Changes:
- All chat models (OpenAI, Anthropic, Google, Groq, Bedrock) updated
- Added sync generator support to chat UI (inspect.isgenerator)
- Added test_chat_sync_generator_streaming() test
- Both streaming and non-streaming modes now work correctly
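The gotcha in isolation (plain Python, not the PR code): any def whose body contains yield compiles to a generator function, so the non-streaming return never reaches the caller as a string. Extracting the yield into a helper restores the expected return types.

```python
def broken(stream: bool):
    # "yield" appears in the body, so calling broken() ALWAYS returns a
    # generator object -- even when stream is False, the return below only
    # sets StopIteration.value instead of handing the caller a str.
    if stream:
        yield "chunk"
        return
    return "full response"


def _stream_response():
    yield "chunk"


def fixed(stream: bool):
    # No yield here, so the function returns exactly what it says:
    # a generator when streaming, a plain string otherwise.
    if stream:
        return _stream_response()
    return "full response"


print(type(broken(False)))  # <class 'generator'>  -- the bug
print(type(fixed(False)))   # <class 'str'>
```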
Changes:
- Refactor chat UI to use single _handle_streaming_response() method for both sync and async generators (eliminated ~70 lines of duplication)
- Fix streaming cutoff: always yield final accumulated result even when last chunk has no content (fixes incomplete responses)
- Add test_chat_streaming_complete_response() to catch cutoff bugs
- Increase default max_tokens from 100 to 4096 tokens (100 was way too low, caused stories/responses to cut off mid-sentence)

The streaming cutoff fix ensures that when OpenAI/other APIs send final chunks with no content (just finish_reason), we still capture the complete accumulated text. The high default (4096 tokens, roughly 3000 words) matches industry standards and prevents artificial truncation while still providing reasonable cost control.
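One way the cutoff fix can look (a sketch, not the exact diff): make sure the accumulated text is yielded one last time even if the provider's final chunk carries only finish_reason.

```python
from typing import Any, Generator


def _stream_response(response: Any) -> Generator[str, None, None]:
    accumulated = ""
    for chunk in response:
        delta = chunk.choices[0].delta.content
        if delta:
            accumulated += delta
            yield accumulated
    # Providers often end the stream with a content-less chunk (only
    # finish_reason set); yield once more so the complete text is never lost.
    yield accumulated
```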
Add return type annotations to all _stream_response() and _handle_streaming_response() methods to satisfy linting: - _stream_response() -> Generator[str, None, None] - _handle_streaming_response() -> str Also added Generator to typing imports.
Ruff requires type-only imports like Generator to be in a TYPE_CHECKING block to avoid runtime overhead.
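For reference, the guarded import then looks something like this:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for the type checker; no runtime cost.
    from typing import Generator


def _stream_response(words: list[str]) -> Generator[str, None, None]:
    accumulated = ""
    for word in words:
        accumulated += word + " "
        yield accumulated
```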
Update documentation to clarify that both sync and async generators are supported for streaming chat responses: - docs/api/inputs/chat.md: Added sync generator example alongside async - chat.py docstring: Clarified both sync and async generators work Built-in models (OpenAI, Anthropic, etc.) use sync generators internally, while custom models can use either depending on their needs.
@adamwdraper looks great! few build errors to fix, but otherwise good to merge afterwards. we can followup as a team to figure out the correct default for
- Fix circular type reference in ChatPlugin.tsx by using inferred types - Remove forbidden non-null assertion in chat-ui.tsx streaming handler - Extract frontendMessageIndex before null check to avoid non-null assertion operator
- Add type annotations to all _stream_response methods (openai, anthropic, google, groq, bedrock)
- Fix union-attr error in openai response by using cast(Any, response)
- Add missing return statement in anthropic __call__ for else branch
- Fix google generate_content config type with cast(Any, generation_config)
- Add type annotations to chat._handle_streaming_response
- Fix return type in chat._handle_streaming_response to handle None case
- Replace deprecated 'gemini-1.5-pro-latest' with 'gemini-2.5-flash' - Add google-genai>=1.20.0 dependency to script metadata - Update generated_with version to 0.17.8
Summary
This PR adds real-time streaming support to mo.ui.chat, enabling ChatGPT-like experiences where responses appear word-by-word as they're generated.

What's New
✨ Streaming Support for Chat Models
Users can now stream responses from chat models in real-time:
With built-in models (OpenAI, etc.):
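For example (the model name here is illustrative; stream is the parameter this PR adds):

```python
import marimo as mo

chatbot = mo.ui.chat(
    mo.ai.llm.openai(
        "gpt-4o-mini",
        system_message="You are a helpful assistant.",
        stream=True,  # stream tokens as they arrive
    )
)
chatbot
```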
With custom models using sync generators:
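For example, in the spirit of streaming_custom.py (a sketch, not the exact example file):

```python
import time

import marimo as mo


def my_model(messages, config):
    # A sync generator: yield the accumulated reply so the chat UI shows
    # the response growing word by word.
    words = ["This", "response", "streams", "progressively."]
    accumulated = ""
    for word in words:
        accumulated += word + " "
        time.sleep(0.1)  # simulate generation latency
        yield accumulated


chatbot = mo.ui.chat(my_model)
chatbot
```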
📚 Documentation
- Added a "Streaming Responses" section to the chat docs
- Links to the streaming examples (streaming_openai.py and streaming_custom.py)

🎯 Examples
- streaming_openai.py: Demonstrates streaming with OpenAI models
- streaming_custom.py: Shows how to build custom streaming chatbots

Implementation Details
- Added a stream parameter to the openai ChatModel class
- Backend streams responses to the frontend as stream_chunk messages
- The final chunk is marked with an is_final: true flag
- Added test_chat_streaming_sends_messages to verify streaming behavior
Users will see responses appear progressively in the chat interface, creating a more engaging and responsive experience similar to ChatGPT, Claude, and other modern AI chat interfaces.