Conversation

@adamwdraper adamwdraper commented Nov 16, 2025

[Screen recording: Kapture 2025-11-18 at 09 08 50]

Summary

This PR adds real-time streaming support to mo.ui.chat, enabling ChatGPT-like experiences where responses appear word-by-word as they're generated.

What's New

✨ Streaming Support for Chat Models

Users can now stream responses from chat models in real-time:

With built-in models (OpenAI, etc.):

import marimo as mo

chat = mo.ui.chat(
    mo.ai.llm.openai(
        "gpt-4o",
        stream=True,  # Enable streaming! Uses a sync generator internally
    )
)

With custom models (sync or async generators):

# Custom models can use EITHER sync or async generators
def my_streaming_model(messages, config):
    accumulated = ""
    for word in str(messages[-1].content).split():
        accumulated += word + " "
        yield accumulated  # Sync generator - works!

async def my_async_model(messages, config):
    accumulated = ""
    for word in str(messages[-1].content).split():
        accumulated += word + " "
        yield accumulated  # Async generator - also works!
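
Either flavor can be passed directly to mo.ui.chat (usage sketch for the illustrative models above):

chat = mo.ui.chat(my_streaming_model)  # or mo.ui.chat(my_async_model)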

📚 Documentation

  • Added comprehensive "Streaming Responses" section to chat API docs
  • Explains both built-in model streaming and the custom generator approach (sync or async)
  • Includes links to working examples (streaming_openai.py and streaming_custom.py)

🎯 Examples

  • streaming_openai.py: Demonstrates streaming with OpenAI models
  • streaming_custom.py: Shows how to build custom streaming chatbots
  • Fixed quote escaping in examples to prevent file corruption on auto-save

Implementation Details

  • Added stream parameter to openai ChatModel class
  • Implemented async generator support in chat UI element
  • Frontend receives incremental updates via stream_chunk messages
  • Final message marked with is_final: true flag
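
For reference, each incremental update is a small payload like the following (a sketch assembled from the fields named above and in the review discussion further down, not a frozen schema):

import uuid

message_id = str(uuid.uuid4())    # generated once per streamed response
chunk = {
    "type": "stream_chunk",
    "message_id": message_id,
    "content": "Once upon a ",    # full accumulated text so far
    "is_final": False,            # True only on the final chunk
}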

Testing

  • Added test_chat_streaming_sends_messages to verify streaming behavior
  • Manually tested both OpenAI and custom streaming examples
  • Code passes lint and format checks

User Experience

Users will see responses appear progressively in the chat interface, creating a more engaging and responsive experience similar to ChatGPT, Claude, and other modern AI chat interfaces.

Implements real-time streaming for chat UI elements:

Backend changes:
- Modified chat._send_prompt to stream chunks via _send_message for async generators
- Added UUID generation for tracking streaming messages
- Sends incremental chunks with 'stream_chunk' type and 'is_final' flag
- Updated docstring to reflect streaming support

Frontend changes:
- Added MarimoIncomingMessageEvent listener to receive streaming chunks
- Implemented streamingStateRef to track backend message_id and frontend message index
- Creates placeholder messages and updates them in real-time as chunks arrive
- Added host prop to Chatbot component for event listening

AI model changes:
- Converted mo.ai.llm.openai to async generator with stream=True
- Yields accumulated content as tokens arrive from the API
- Enables automatic streaming for built-in OpenAI models

The implementation uses existing SendUIElementMessage infrastructure
for bidirectional communication via WebSockets. Async generator functions
now automatically stream responses to the frontend in real-time.

Resolves the TODO at line 232 in chat.py about streaming support.
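
A condensed sketch of that backend flow (illustrative names and a simplified send callback, not the exact marimo implementation):

import uuid
from typing import AsyncIterator, Callable

async def stream_to_frontend(
    chunks: AsyncIterator[str],
    send_message: Callable[[dict], None],
) -> str:
    # Forward each accumulated chunk, then mark the last message as final
    # so the frontend can replace its placeholder with the complete text.
    message_id = str(uuid.uuid4())
    accumulated = ""
    async for accumulated in chunks:
        send_message(
            {
                "type": "stream_chunk",
                "message_id": message_id,
                "content": accumulated,
                "is_final": False,
            }
        )
    send_message(
        {
            "type": "stream_chunk",
            "message_id": message_id,
            "content": accumulated,
            "is_final": True,
        }
    )
    return accumulated
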
- Backend: Send streaming chunks via WebSocket messages (_send_message)
  - Async generators now stream responses to frontend in real-time
  - Each chunk includes message_id, content, and is_final flag
  - Updated documentation to reflect streaming support

- Frontend: Listen for streaming chunks via MarimoIncomingMessageEvent
  - Track streaming state and update UI as chunks arrive
  - Accumulate and display content incrementally
  - Support both streaming and non-streaming responses

- Built-in models: Add stream parameter to mo.ai.llm.openai
  - Defaults to False for backward compatibility
  - When True, streams tokens from OpenAI API as async generator
  - Supports both streaming and non-streaming modes

- Examples: Add streaming chat examples
  - streaming_custom.py: Shows custom async generator streaming
  - streaming_openai.py: Shows OpenAI API streaming with stream=True
  - Updated README.md with streaming documentation

Streaming creates a ChatGPT-like experience where responses appear
token-by-token as they're generated, improving perceived responsiveness.

- Test verifies streaming async generators send chunks via _send_message
- Validates message structure (type, message_id, content, is_final)
- Confirms intermediate chunks have is_final=False
- Confirms final chunk has is_final=True and accumulated content

Change 'return' to 'yield' in non-streaming path to fix SyntaxError.
In Python, async generators cannot use 'return' with a value - they must
consistently use 'yield' throughout the function.
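
A quick illustration of that rule (standalone snippet; the exact error wording may vary slightly across Python versions, and a bare 'return' without a value remains legal):

snippet = """
async def stream_words():
    yield "chunk"
    return "done"
"""

try:
    compile(snippet, "<example>", "exec")
except SyntaxError as err:
    print(err.msg)  # 'return' with value in async generator
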
- Remove trailing whitespace
- Fix return statement to return tuple with chatbot only
- Use single quotes in markdown code example to avoid escaping issues
  when marimo auto-saves the notebook

- Add 'Streaming Responses' section explaining real-time streaming
- Show how to enable streaming with built-in models (stream=True)
- Show how to implement streaming with custom async generators
- Include links to streaming examples in the repo

@github-actions github-actions bot added the documentation label Nov 16, 2025

github-actions bot commented Nov 16, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@adamwdraper (Author)

I have read the CLA Document and I hereby sign the CLA

- Add stream parameter to anthropic, google, groq, and bedrock models
- Implement streaming logic for each model using their native APIs
- Update documentation to list all models that support streaming
- All models now follow the same pattern: stream=True enables real-time responses

- Add commented-out stream=True parameter to all example files
- Shows users how to enable streaming without changing default behavior
- Covers: OpenAI, Anthropic, Google (Gemini), Groq, and Bedrock examples

- Add noqa comment for unused buffers parameter
- Replace == False with 'not' operator
- Replace == True with direct boolean check

@adamwdraper changed the title from "Fix streaming support for chat: async generator bug and add documentation" to "Add streaming support for chat: async generator bug and add documentation" Nov 16, 2025

for word in words:
    accumulated += word + " "
    yield accumulated

Contributor:

should this be just yield word? seems like the accumulation happens client-side

@adamwdraper (Author) replied Nov 17, 2025:

The accumulation must happen server-side. Each yield sends the full accumulated text to the frontend, which displays it. If we yielded just word, the UI would only show individual words instead of the progressively building response. This is consistent with how all built-in models work (OpenAI, Anthropic, Google, etc.).

chunk_text = str(latest_response)

# Send incremental update to frontend
self._send_message(

Contributor:

could we DRY this up? to:

                self._send_message(
                    {
                        "type": "stream_chunk",
                        "message_id": message_id,
                        "content": accumulated_text,
                        "is_final": latest_response is not None,
                    },
                    buffers=None,
                )

@adamwdraper (Author):

Good catch! Simplified by eliminating the chunk_text variable and directly using accumulated_text. Fixed in d71c725.

return response.choices[0].message.content
if self.stream:
    # Stream the response
    response = litellm_completion(

Contributor:

I think we can write response = litellm_completion( once with stream=self.stream and then just handle the response differently in the if/else

@adamwdraper (Author):

Done! Unified the completion call using stream=self.stream. The different response types are handled in the if/else branches. Fixed in d71c725.
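
A minimal sketch of the unified pattern, assuming litellm's OpenAI-compatible response shapes (function and variable names are illustrative, not the exact marimo code):

from typing import Generator, Union
from litellm import completion as litellm_completion

def respond(model: str, messages: list, stream: bool) -> Union[str, Generator[str, None, None]]:
    # Single completion call; only the handling of `response` differs by mode.
    response = litellm_completion(model=model, messages=messages, stream=stream)
    if stream:
        return _accumulate(response)            # generator of growing text
    return response.choices[0].message.content  # plain string

def _accumulate(response) -> Generator[str, None, None]:
    accumulated = ""
    for chunk in response:
        accumulated += chunk.choices[0].delta.content or ""
        yield accumulated                       # full text so far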

response = client.models.generate_content_stream(
    model=self.model,
    contents=google_messages,
    config={

Contributor:

could we DRY up this config and pull it out above

@adamwdraper (Author):

Agreed! Extracted the config dict to a single generation_config variable. Fixed in d71c725.

- Change delete icon to rotate-cw icon for clearing chat history
- Add disabled state when no messages exist (consistent with send button)
- Improves UX clarity: trash icon was confusing as it looked like deleting the cell rather than resetting the conversation
- Individual message delete buttons still use trash icon appropriately

@adamwdraper (Author)

Additional UI Improvement: Chat Reset Icon Change

Changed the chat reset button icon from trash to rotate-clockwise to improve UX clarity.

Why this change?
The trash icon was confusing as it looked like it would delete the cell itself rather than reset the conversation. The rotate icon better represents the "reset/restart" action.

Additional improvements:

  • Reset button is now disabled when there are no messages (consistent with send button behavior)
  • Individual message delete buttons still use the trash icon appropriately

See commit: 4b987ea

…dels

Problem:
- When stream=False, chat models returned None instead of strings
- Python treats any function with 'yield' as a generator, even if the
  yield is in an unexecuted branch
- This caused __call__ to return a generator object instead of a string
  when stream=False

Solution:
- Extract streaming logic into separate _stream_response() helper methods
- __call__ now only contains return statements (no yield)
- When stream=True: returns generator from helper method
- When stream=False: returns string directly
- Maintains backward compatibility (kept sync def __call__)

Changes:
- All chat models (OpenAI, Anthropic, Google, Groq, Bedrock) updated
- Added sync generator support to chat UI (inspect.isgenerator)
- Added test_chat_sync_generator_streaming() test
- Both streaming and non-streaming modes now work correctly
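
The gotcha in isolation (a standalone illustration, not the marimo code): any 'yield' anywhere in a function body makes Python treat the whole function as a generator function, even if that branch never runs.

def respond(stream: bool):
    if stream:
        yield "chunk"
    else:
        return "full text"   # callers never receive this as a plain string

print(respond(False))        # <generator object respond at 0x...>, not a string

# Fix: keep 'yield' in a helper so the dispatching function stays ordinary.
def respond_fixed(stream: bool):
    if stream:
        return _stream()
    return "full text"

def _stream():
    yield "chunk"

print(respond_fixed(False))  # full text
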
Changes:
- Refactor chat UI to use single _handle_streaming_response() method
  for both sync and async generators (eliminated ~70 lines of duplication)
- Fix streaming cutoff: always yield final accumulated result
  even when last chunk has no content (fixes incomplete responses)
- Add test_chat_streaming_complete_response() to catch cutoff bugs
- Increase default max_tokens from 100 to 4096 tokens
  (100 was way too low, caused stories/responses to cut off mid-sentence)

The streaming cutoff fix ensures that when OpenAI/other APIs send
final chunks with no content (just finish_reason), we still capture
the complete accumulated text. The high default (4096 tokens ~3000 words)
matches industry standards and prevents artificial truncation while still
providing reasonable cost control.

Add return type annotations to all _stream_response() and
_handle_streaming_response() methods to satisfy linting:
- _stream_response() -> Generator[str, None, None]
- _handle_streaming_response() -> str

Also added Generator to typing imports.

Ruff requires type-only imports like Generator to be in a
TYPE_CHECKING block to avoid runtime overhead.

Update documentation to clarify that both sync and async generators
are supported for streaming chat responses:

- docs/api/inputs/chat.md: Added sync generator example alongside async
- chat.py docstring: Clarified both sync and async generators work

Built-in models (OpenAI, Anthropic, etc.) use sync generators internally,
while custom models can use either depending on their needs.

@mscolnick (Contributor)

@adamwdraper looks great! few build errors to fix, but otherwise good to merge afterwards.

we can followup as a team to figure out the correct default for stream

- Fix circular type reference in ChatPlugin.tsx by using inferred types
- Remove forbidden non-null assertion in chat-ui.tsx streaming handler
- Extract frontendMessageIndex before null check to avoid non-null assertion operator

adamwdraper and others added 2 commits November 18, 2025 16:13

- Add type annotations to all _stream_response methods (openai, anthropic, google, groq, bedrock)
- Fix union-attr error in openai response by using cast(Any, response)
- Add missing return statement in anthropic __call__ for else branch
- Fix google generate_content config type with cast(Any, generation_config)
- Add type annotations to chat._handle_streaming_response
- Fix return type in chat._handle_streaming_response to handle None case
- Replace deprecated 'gemini-1.5-pro-latest' with 'gemini-2.5-flash'
- Add google-genai>=1.20.0 dependency to script metadata
- Update generated_with version to 0.17.8

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)
