This workshop covers the latest developments in novel methodologies for Extreme Quantization and Binary Neural Networks and their applications to Computer Vision, bringing together a diverse group of researchers working in several related areas.
Authors are welcome to submit full 8-page papers or short 2-page extended abstracts on any of the following topics:
Paper submission deadline: | July 3rd, 2025 (11:59pm PST) |
Decisions: | July 11th, 2025 (11:59pm PST) |
Camera ready papers due: | August 17th, 2025 (11:59pm PST) |
Extended abstract submission: | August 11th, 2025 (11:59pm PST) |
Extended abstract decisions: | August 17th, 2025 (11:59pm PST) |
Workshop Date: | October 20th, 2025 |
Please upload submissions at: link
The Workshop will take place on the 20th of October according to the following schedule. All times are in GMT-10 (Hawaii Standard Time).
8:15 - 8:20 | Opening remarks and workshop kickoff |
8:20 - 8:50 | Invited talk: Amir Gholami - XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Large language models power many applications today, but their deployment is increasingly limited by memory bottlenecks rather than computation. While GPU compute performance continues to grow rapidly, memory capacity and bandwidth improvements lag behind, creating a “memory wall” that slows down inference. In this talk, I will introduce XQuant, a new approach designed to break through this barrier. Rather than storing large key–value caches, XQuant quantizes and stores compact input activations, and then reconstructs the key–value states on the fly. This shift yields substantial memory savings, up to 7–12× compared to standard FP16 baselines, while retaining near-original model accuracy. Building on this, I will discuss XQuant-CL, which leverages the surprising similarity of activations across layers to push memory compression even further. Together, these techniques demonstrate a forward-looking path: by trading a modest increase in computation for dramatic reductions in memory, we can make LLM inference far more efficient and scalable. [A minimal illustrative sketch of this activation-caching idea appears after the schedule.] |
8:50 - 9:20 | Invited talk: Mohammad Rastegari - TBD
[TBD] |
9:20 - 9:30 | Oral presentation |
9:30 - 9:40 | Oral presentation |
9:40 - 10:50 | Poster Session |
10:50 - 11:20 | Invited talk: Song Han - TBD
[TBD] |
11:20 - 11:50 | Invited talk: Raghu Krishnamoorthi - TBD
[TBD] |
11:50 - 12:00 | Oral presentation |
12:00 - 12:05 | Closing remarks and conclusions |
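The XQuant abstract above describes caching quantized input activations and rematerializing the key–value states at attention time instead of storing a full KV cache. The Python sketch below is only an illustration of that general idea under stated assumptions, not the authors' implementation: the uniform quantizer, the 4-bit setting, and the names d_model, W_k, W_v, and x_cache are hypothetical choices made for the example.

```python
# Illustrative sketch (not XQuant itself): cache low-bit input activations X
# and rematerialize K and V on the fly instead of storing an FP16 KV cache.
import numpy as np

def quantize(x, bits=4):
    """Uniform quantization to `bits` (an assumed scheme, for illustration only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
d_model, seq_len = 64, 8
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32)  # key projection
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32)  # value projection

# Standard KV caching keeps K and V: 2 * d_model FP16 values per token.
# Here we keep only a quantized copy of the layer input X per token.
x_cache = []
for _ in range(seq_len):
    x_t = rng.standard_normal(d_model).astype(np.float32)  # activation of a new token
    x_cache.append(quantize(x_t, bits=4))

# At attention time, rematerialize K and V from the cached activations.
X_hat = np.stack([dequantize(q, s) for q, s in x_cache])  # (seq_len, d_model)
K = X_hat @ W_k   # recomputed on the fly rather than read from a cache
V = X_hat @ W_v

print("int8 values cached per token:", d_model,
      "vs FP16 values per token with a standard KV cache:", 2 * d_model)
```

In this toy setting the memory saving comes from keeping one low-bit d_model-wide vector per token instead of two FP16 vectors, at the cost of recomputing the two projections at attention time, which mirrors the compute-for-memory trade-off the abstract describes.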
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models Keda TAO, Haoxuan You, Yang Sui, Can Qin, Huan Wang | [Download] |
Binary SqueezeNet: Enhancing Parameter Efficiency in Binary Neural Networks Salih Atabey, Erdem Akagündüz | [Download] |
Ultra-Efficient and Effective LLMs with Multi-Boolean Architectures Ba-Hien Tran, Van Minh Nguyen | [Download] |
PREFILT: Prefiltering for Fully Quantized Image Restoration Neural Networks Denis Makhov, Ruslan Ostapets, Irina Zhelavskaya, Dehua Song | [Download] |
Mitigating GELU Quantization Errors via Activation Distribution Shaping in Vision Transformer Wakayama Hiroyuki, Naoki Okamoto, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi | [Download] |
Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers Lucas Maisonnave, Karim Haroun, Tom Pegeot | [Download] |
MoPEQ: Mixture of Mixed Precision Quantized Experts Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani | [Download] |
PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang | [Download] |
Enhancing Generalization in Data-free Quantization via Mixup-class Prompting Jiwoong Park, Chaeun Lee, Yongseok Choi, Sein Park, Deokki Hong, Jungwook Choi | [Download] |
HC-PTQ: Poincaré-Based Hyperbolic Clustering for Data-Free Quantization of Vision Transformers Raffaele Mineo, Simone Palazzo, Concetto Spampinato, Francesco Rundo | [Download] |
Extreme Compression of Adaptive Neural Images Leo Hoshikawa, Marcos V. Conde, Takeshi Ohashi, Atsushi Irie | [Download] |
Certifying Robustness of Binary Neural Networks Using Sparse Polynomial Optimization Jianting Yang, Srecko Durasinovic, Victor Magron, Jean B. Lasserre, Jun Zhao | [Download] |
MSQ: Memory-Efficient Bit Sparsification Quantization Seokho Han, Seo Yeon Yoon, Jinhee Kim, Dongwei Wang, Kang Eun Jeon, Huanrui Yang, Jong Hwan Ko | [Download] |
Gradient-Free Training of Quantized Neural Networks Noa Cohen, Dotan Di Castro, Omkar Joglekar, Shir Kozlovsky, Vladimir Tchuiev, Michal Moshkovitz | [Download] |
VAR-Q: Tuning-free Quantized KV Caching for Visual Autoregressive Models Boxun Xu, Zihu Wang, Yu Wang, Zirui Liu, Peng Li | [Download] |
42 Million FPS on CIFAR-10 with Convolutional Differentiable Logic Gate Networks Felix Petersen, Hilde Kuehne, Christian Borgelt, Julian Welzel, Stefano Ermon | [Download] |
Probabilistic dynamic quantization for memory constrained devices Gabriele Santini, Francesco Paissan, Elisabetta Farella | [Download] |
Samsung AI
Meta Reality Labs
ETH
QMUL
University of Cambridge and Flower Labs
QMUL and Samsung AI