
Pre-Conference Talk by DU Cunxiao | GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding


Speaker(s):


DU Cunxiao
PhD Candidate
School of Computing and Information Systems
Singapore Management University

Date: 11 July 2024, Thursday

Time: 5:00pm – 5:30pm

Venue: Meeting Room 4.4, Level 4, School of Computing and Information Systems 1, Singapore Management University, 80 Stamford Road, Singapore 178902

We look forward to seeing you at this research seminar.

Please register by 10 July 2024.

About the Talk

Speculative decoding is a relatively new decoding framework that leverages small and efficient draft models to reduce the latency of LLMs. In this study, we introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding to further improve the decoding speed of a frozen LLM. Specifically, GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM, while CaPE is a proposal expansion method that uses the draft model's confidence scores to help select additional candidate tokens for verification. Extensive experiments on different benchmarks demonstrate that our proposed GliDe draft model significantly reduces the expected decoding latency. Additional evaluation using walltime reveals that GliDe can accelerate Vicuna models up to 2.17x and further extend the improvement to 2.61x with CaPE. We will release our code, data, and the trained draft models. 
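To illustrate the general draft-then-verify idea behind speculative decoding (this is a minimal greedy sketch of the standard framework, not the authors' GliDe/CaPE code; the toy `target_next` and `draft_next` functions are assumptions standing in for real language models):

```python
# Minimal sketch of greedy speculative decoding.
# A cheap draft model proposes k tokens; the target model then verifies
# them and keeps the longest matching prefix, plus one corrected token.
# target_next/draft_next are toy stand-ins: each maps a token-id sequence
# to the next token id.

def speculative_decode(target_next, draft_next, prompt, k=4, max_new=10):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft: propose k candidate tokens autoregressively
        #    with the cheap draft model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: the target model checks each drafted position
        #    (in a real LLM this is a single batched forward pass,
        #    which is where the speedup comes from).
        accepted = []
        for i in range(k):
            t = target_next(tokens + accepted)
            accepted.append(t)       # target's token is always kept
            if t != draft[i]:        # mismatch: discard the rest of
                break                # the draft and redraft from here
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new]
```

When the draft model agrees with the target often, several tokens are accepted per target pass; GliDe improves that agreement by letting the draft model reuse the target LLM's cached keys and values, and CaPE widens the set of candidates verified per position using the draft model's confidence.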

This is a Pre-Conference talk for The Forty-first International Conference on Machine Learning (ICML 2024).
 

About the Speaker

Cunxiao DU is a PhD candidate under the supervision of Prof. JIANG Jing.