Conversation

@federicobrancasi
Collaborator

Add transformer model support, fix Multi-Head Attention quantization and enhance the pipeline

Summary

  • Added comprehensive support for transformer-based architectures (CCT, ViT) to DeepQuant.
  • Fixed critical Multi-Head Attention (MHA) quantization issues that prevented proper handling of packed projections and batch-first layouts.
  • Enhanced the quantization pipeline with new utilities and improved error handling.

Key Changes

  1. New model support
  • Implemented Compact Convolutional Transformer (CCT) architecture
  • Added Vision Transformer (ViT-B/32) support
  • Created comprehensive test suites: TestCCT.py, TestCCTPretrained.py, TestVitB32.py, TestVitB32Pretrained.py
  • Added a ResNet-50 pretrained test to validate pipeline scalability on larger networks
  2. Multi-Head Attention fixes, needed to make ViT's QuantMultiheadAttention work (see the packed-projection sketch after this list)
  • Fixed MHA handling in CustomForwards/MultiHeadAttention.py to properly support:
    • Packed in_proj weights (common in transformers)
    • Batch-first tensor layouts with correct transposition
    • View operations that previously caused dimension mismatches
  • Added support for self-attention scenarios where Q, K, V use the same input
  3. Quantization improvements (see the rounding and equivalence-check sketch after this list)
  • Changed the rounding policy from round to floor in the quantization modules for a more accurate integer representation
  • Added a checkEquivalence flag to brevitasToTrueQuant() that automatically validates quantized models against the original model's outputs
  4. ONNX Runtime enhancements (see the session-setup sketch after this list)
  • Updated to ONNX opset version 18 so that LayerNormalization is folded properly, as required by Deeploy
  • Implemented deterministic session handling via a function that disables all graph optimizations, ensuring exact reproducibility for Deeploy verification
  5. Infrastructure improvements
  • Fixed .view(-1) operations in tensor recording that caused shape inference issues
  • Included CIFAR-10 dataset for CI testing of CCT models
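
A minimal sketch of the packed-projection handling described in item 2, assuming a standard torch.nn.MultiheadAttention module; names such as `mha`, `embed_dim`, and `x` are illustrative and not taken from CustomForwards/MultiHeadAttention.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_heads = 64, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Packed projection: in_proj_weight stacks W_q, W_k, W_v into one
# (3 * embed_dim, embed_dim) tensor, so it has to be split before each
# projection can be handled (and quantized) on its own.
w_q, w_k, w_v = torch.chunk(mha.in_proj_weight, 3, dim=0)
b_q, b_k, b_v = torch.chunk(mha.in_proj_bias, 3, dim=0)

# Self-attention: Q, K and V are all projections of the same input tensor.
x = torch.randn(2, 10, embed_dim)   # (batch, seq, embed) because batch_first=True
q, k, v = F.linear(x, w_q, b_q), F.linear(x, w_k, b_k), F.linear(x, w_v, b_v)

# batch_first tensors are transposed to (seq, batch, embed) before the
# attention math, which expects the sequence dimension first.
q, k, v = (t.transpose(0, 1) for t in (q, k, v))
```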
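
The two pipeline changes in item 3, sketched with illustrative helpers; `quantize_floor` and `check_equivalence` are hypothetical names standing in for the DeepQuant internals and the checkEquivalence flag of brevitasToTrueQuant():

```python
import torch

def quantize_floor(x: torch.Tensor, scale: float, zero_point: int = 0) -> torch.Tensor:
    # floor() replaces round() so the integer codes match the arithmetic
    # expected by the true-quant export.
    return torch.floor(x / scale) + zero_point

def check_equivalence(original: torch.nn.Module, true_quant: torch.nn.Module,
                      example_input: torch.Tensor, atol: float = 1e-5) -> bool:
    # What a checkEquivalence-style flag boils down to: run both models on
    # the same input and compare their outputs element-wise.
    with torch.no_grad():
        return torch.allclose(original(example_input), true_quant(example_input), atol=atol)
```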
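
A sketch of the opset-18 export and deterministic session setup from item 4, using a toy LayerNorm model as a stand-in for the actual CCT/ViT exports:

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Tiny stand-in network containing a LayerNorm.
model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16)).eval()
example_input = torch.randn(1, 16)

# Opset 18 keeps LayerNormalization as a single node, as Deeploy expects.
torch.onnx.export(model, example_input, "model.onnx", opset_version=18)

# Disable every graph optimization so ONNX Runtime executes exactly the
# exported graph, making results reproducible for Deeploy verification.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession("model.onnx", opts, providers=["CPUExecutionProvider"])
outputs = session.run(None, {session.get_inputs()[0].name: example_input.numpy()})
```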

federicobrancasi and others added 30 commits April 19, 2025 10:48
* Initial commit fbrancasi/dev

* Working Resnet18

* Codebase Refactor

* update Resnet18 test

* Fix CI

* Minor Fixes
The current version of DeepQuant assumes that all linear
modules have a bias and otherwise skips unifying the Dequant
nodes.
This modification enables unification of Dequant blocks even
when there is no `biasDequantNode`.
The implementation is incomplete, as it assumes that the input
Dequant zeroPoint is 0.
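
As a minimal illustration of why the zeroPoint-is-0 assumption in the commit above matters (scales and shapes here are made up, not repository code): with both zero points at 0, the weight and input Dequant scales fold into a single output Dequant scale even when there is no `biasDequantNode`.

```python
import torch

s_w, s_x = 0.05, 0.1                          # weight / input Dequant scales
W_q = torch.randint(-8, 8, (4, 3)).float()    # integer weight codes
x_q = torch.randint(-8, 8, (3,)).float()      # integer input codes

# Separate Dequant nodes feeding a bias-free float linear op ...
y_separate = (s_w * W_q) @ (s_x * x_q)
# ... give the same result as one unified output Dequant with scale s_w * s_x,
# but only because both zero points are 0; a non-zero input zeroPoint would
# add an offset term that cannot be absorbed into a single scale.
y_unified = (s_w * s_x) * (W_q @ x_q)
assert torch.allclose(y_separate, y_unified)
```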
@Xeratec self-requested a review June 26, 2025 08:09