@SiqiLi-Fighting
Collaborator

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

  • Please use English, otherwise it will be closed.
  • The purpose of the PR, or link existing issues this PR will resolve.
  • The test plan, such as providing test command.
  • (Optional) The necessary documentation update.

@SiqiLi-Fighting marked this pull request as draft on November 12, 2025 13:37
@SiqiLi-Fighting self-assigned this on Nov 12, 2025
@gemini-code-assist

Summary of Changes

Hello @SiqiLi-Fighting, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for integrating Multi-Token Prediction (MTP) into the EAGLE speculative decoding framework, with the goal of faster language model inference. The changes touch several system components, from low-level attention kernels to high-level scheduling and model management, and are intended to let models generate text faster without reducing output quality. MTP-specific model configurations and supporting data structures are central to the targeted gains in throughput and latency.
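For reviewers less familiar with speculative decoding, the accept-or-reject step behind "faster without reducing quality" can be sketched as follows. This is a minimal illustration of the greedy variant only, not the code in this pull request: the function name `accept_draft_tokens` and its arguments are hypothetical, and EAGLE itself verifies a tree of candidates rather than a single chain.

```python
from typing import List


def accept_draft_tokens(draft: List[int], target_greedy: List[int]) -> List[int]:
    """Greedy speculative verification for a single chain of draft tokens.

    draft:         tokens proposed by the draft (MTP) head.
    target_greedy: target_greedy[i] is the target model's greedy token given
                   the context followed by draft[:i]; it therefore has
                   len(draft) + 1 entries, all produced in one forward pass.

    Returns the tokens to emit: the longest prefix of `draft` that the target
    agrees with, plus one token chosen by the target itself.
    """
    if len(target_greedy) != len(draft) + 1:
        raise ValueError("target_greedy must have len(draft) + 1 entries")

    accepted: List[int] = []
    for i, token in enumerate(draft):
        if token != target_greedy[i]:
            break  # first disagreement: discard this and all later draft tokens
        accepted.append(token)

    # The target's own prediction at the stopping position is always valid,
    # so every verification step emits at least one token.
    accepted.append(target_greedy[len(accepted)])
    return accepted
```

Under greedy decoding the emitted tokens match what the target model would have produced on its own, which is why output quality is preserved; the speedup comes from emitting several tokens per target forward pass whenever the draft is right.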

Highlights

  • Multi-Token Prediction (MTP) for EAGLE Speculative Decoding: This pull request introduces the design and initial implementation of Multi-Token Prediction (MTP) as an enhancement to the existing EAGLE speculative decoding algorithm. MTP aims to improve inference throughput by letting the model predict multiple tokens per step.
  • Core Architectural Changes: The implementation includes modifications to the attention kernel to support custom masking for parallel verification, updates to model configurations to recognize MTP-capable architectures (e.g., LlamaForCausalLMEagle3), and the integration of new MTP logic within the scheduling and model worker components.
  • New EAGLE-Specific Components: New files have been added to define EAGLE-specific model layers and structures, utility functions for managing speculative decoding inputs and outputs, and Pallas kernels for efficient tree building and verification processes.
  • Precompilation and Memory Management: The system now includes precompilation steps for speculative decoding and refined memory pool allocation strategies to handle the multi-token sequences efficiently, including backing up and restoring allocator states.
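The "custom masking for parallel verification" mentioned above can be illustrated with a small sketch. It assumes the draft tree is described by a parent-index array, which is a common representation but not necessarily the one used in this pull request; `build_tree_mask` is a hypothetical helper written in plain Python, not the Pallas kernel itself.

```python
from typing import List


def build_tree_mask(parents: List[int]) -> List[List[bool]]:
    """Build an attention mask over the nodes of a draft token tree.

    parents[i] is the index of node i's parent, or -1 for a root node.
    Nodes are assumed to be listed so that every parent precedes its children.

    mask[i][j] is True when node i may attend to node j, that is, when j is
    node i itself or one of its ancestors. Sibling branches stay hidden from
    each other, so all branches can be verified in one forward pass.
    """
    n = len(parents)
    for i, parent in enumerate(parents):
        if not -1 <= parent < i:
            raise ValueError(f"node {i} has invalid parent {parent}")

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk from the node up to its root
            mask[i][j] = True
            j = parents[j]
    return mask


# Example tree: node 0 is the root, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
example_mask = build_tree_mask([-1, 0, 0, 1])
```

In this example node 3 attends to nodes 0, 1, and 3 but not to node 2, which belongs to a different branch. In a real kernel this tree mask would be combined with full visibility of the already committed prefix in the KV cache.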
@SiqiLi-Fighting added the enhancement (New feature or request) label on Nov 12, 2025