Google Research Introduces "Speculative Cascades" Method to Accelerate LLMs

On September 11, 2025, the Google Research team published a paper on "speculative cascades," a new hybrid method for speeding up large language model (LLM) inference and reducing its computational cost. The method combines a cascade of models of different sizes with speculative execution: a small, fast model speculatively generates a draft of the response, and a larger, more accurate model then verifies the entire draft in a single parallel pass, which is far faster than generating the response token by token.

Draft tokens that pass verification are accepted; at the first error, the larger model supplies the correct token and drafting resumes from there. Because the powerful model is engaged only at these key moments, the approach can save up to 80% of the computation while maintaining high-quality output. Speculative cascades could make advanced LLMs significantly cheaper and more accessible for real-time applications.
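To make the draft-and-verify loop concrete, below is a minimal Python sketch of one speculative round using two toy "models" over a tiny vocabulary. Everything here is an illustrative assumption, not code from the paper: the names (small_lm, large_lm, DRAFT_LEN, speculative_round) and the toy distributions are invented, and the acceptance test shown is the standard speculative-decoding rule, which speculative cascades generalize with a more flexible deferral rule.

```python
import random

# Toy stand-ins for the two models: each maps a token prefix to a
# fixed probability distribution over a small vocabulary. In a real
# system these would be neural LMs of very different sizes.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
DRAFT_LEN = 4  # tokens the small model drafts per round (illustrative)

def _toy_dist(prefix, salt):
    rng = random.Random(hash((tuple(prefix), salt)))
    weights = [rng.random() + 0.05 for _ in VOCAB]  # strictly positive
    total = sum(weights)
    return [w / total for w in weights]

def small_lm(prefix):  # fast, cheap drafter (hypothetical)
    return _toy_dist(prefix, "small")

def large_lm(prefix):  # slow, accurate verifier (hypothetical)
    return _toy_dist(prefix, "large")

def sample(dist, rng):
    return rng.choices(range(len(VOCAB)), weights=dist, k=1)[0]

def speculative_round(prefix, rng):
    """One draft-then-verify round; returns the tokens to append."""
    # 1) The small model speculatively drafts DRAFT_LEN tokens.
    draft, ctx = [], list(prefix)
    for _ in range(DRAFT_LEN):
        tok = sample(small_lm(ctx), rng)
        draft.append(tok)
        ctx.append(tok)

    # 2) The large model scores every draft position. In a real LLM
    #    this is a single parallel forward pass, which is the source
    #    of the speed-up over token-by-token generation.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        p_small = small_lm(ctx)[tok]
        p_large = large_lm(ctx)[tok]
        # Standard speculative-decoding acceptance test. Speculative
        # cascades replace this with a deferral rule that can accept
        # "good enough" tokens even when the models disagree.
        if rng.random() < min(1.0, p_large / p_small):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First rejection: the large model corrects the token and
            # the rest of the draft is discarded. (The full algorithm
            # resamples from a residual distribution; sampling directly
            # from the large model here is a simplification.)
            accepted.append(sample(large_lm(ctx), rng))
            break
    return accepted

if __name__ == "__main__":
    rng = random.Random(0)
    out, prefix = [], [0]  # start from an arbitrary token
    while len(out) < 12:
        out.extend(speculative_round(prefix + out, rng))
    print(" ".join(VOCAB[t] for t in out))
```

The point the sketch illustrates is the cost structure: the expensive model is consulted once per round to score a whole draft, rather than once per generated token, and every accepted draft token is a token the large model never had to generate itself.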
