LLM Speculative Decoding for AI Inference Acceleration
Summary
The USPTO published patent application US20260093960A1 by inventors Yao Cui Fehlis and Jalal Uddin Mahmud, disclosing a method for accelerating large language model inference using speculative decoding. The invention uses a neural network to output speculative decoding parameters iteratively, minimizing cumulative runtime. The patent covers CPC classifications G06N 3/047, G06F 40/284, and G06N 3/092.
What changed
The patent application discloses a method for accelerating LLM inference through speculative decoding. In each iteration, a first neural network selects one set of speculative decoding parameters from multiple candidate sets, and speculative decoding with that set generates subsequent tokens that are appended to the tokens from the prompt or from the previous iteration's output. The runtime of each iteration is collected, and iterations repeat until the updated token sequence reaches a maximum length. The neural network is trained to output parameter sets that minimize the sum of runtimes across iterations. Application No. 18901142 was filed on September 30, 2024.
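The iterative loop described above can be sketched in plain Python. Everything here is illustrative: `select_params`, `speculative_decode`, and the candidate parameter sets (draft lengths) are hypothetical stand-ins, not names or values from the application itself.

```python
import random
import time

# Hypothetical candidate parameter sets the controller network could
# choose among (e.g., how many draft tokens to propose per step).
PARAM_SETS = [{"draft_len": k} for k in (2, 4, 8)]

def select_params(tokens):
    """Stand-in for the first neural network: pick one parameter set."""
    return random.choice(PARAM_SETS)

def speculative_decode(tokens, params):
    """Stand-in for draft-then-verify decoding: returns extended tokens."""
    return tokens + [0] * params["draft_len"]

def generate(prompt_tokens, max_len=32):
    tokens = list(prompt_tokens)
    runtimes = []
    while len(tokens) < max_len:
        params = select_params(tokens)           # network outputs a parameter set
        start = time.perf_counter()
        tokens = speculative_decode(tokens, params)
        runtimes.append(time.perf_counter() - start)  # collect per-iteration runtime
    # Training would adjust select_params to minimize sum(runtimes)
    # accumulated over many such rollouts.
    return tokens[:max_len], sum(runtimes)

out, total = generate([1, 2, 3])
```

The loop structure mirrors the claim: choose parameters, decode speculatively, record runtime, stop at the maximum token length.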
This patent application represents a technical disclosure in the AI acceleration space rather than a regulatory action. Technology companies developing LLMs, AI inference systems, or related hardware should review the claims to assess potential licensing implications or design-around considerations. No compliance actions or deadlines are associated with this document as it is a published patent application rather than a regulatory requirement.
Archived snapshot
On Apr 2, 2026, GovPing captured this document from the original source. If the source has since changed or been removed, this is the text as it existed at that time.
LARGE LANGUAGE MODEL INFERENCING ACCELERATION TECHNIQUES
Application: US20260093960A1 | Kind: A1 | Published: Apr 02, 2026
Inventors
Yao Cui Fehlis, Jalal Uddin Mahmud
Abstract
A method includes generating a plurality of tokens from a prompt to a large language model (LLM). The method includes, in one or more iterations, using a first neural network to output a set of speculative decoding parameters selected from a plurality of sets of speculative decoding parameters. Additionally, in the one or more iterations, the method includes performing speculative decoding using the set of speculative decoding parameters to generate a subsequent plurality of tokens appended to the plurality of tokens from the prompt or from a previous iteration to generate an updated plurality of tokens and collecting a runtime of the speculative decoding. The one or more iterations are repeated until the updated plurality of tokens reaches a maximum token length. The first neural network is trained to output sets of speculative decoding parameters to minimize a sum of runtimes during the one or more iterations.
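The abstract's training objective (minimizing the sum of per-iteration runtimes) and the CPC class G06N 3/092 (reinforcement learning) suggest the controller could be trained with a policy-gradient update. Below is a minimal REINFORCE-style sketch under that assumption; the three candidate parameter sets, their simulated runtimes, the baseline, and the learning rate are all invented for illustration and are not from the application.

```python
import math
import random

random.seed(0)  # deterministic toy run

logits = [0.0, 0.0, 0.0]  # one logit per candidate parameter set

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def simulated_total_runtime(choice):
    """Stand-in for a measured rollout; pretend set 1 is fastest."""
    return [3.0, 1.0, 2.0][choice] + random.uniform(0.0, 0.1)

lr = 0.5
for _ in range(500):
    probs = softmax(logits)
    choice = random.choices(range(3), weights=probs)[0]
    reward = -simulated_total_runtime(choice)  # lower runtime = higher reward
    baseline = -2.0                            # fixed baseline for variance reduction
    for i in range(3):
        # Policy-gradient update for a categorical policy.
        grad = (1.0 if i == choice else 0.0) - probs[i]
        logits[i] += lr * (reward - baseline) * grad

best = max(range(3), key=lambda i: softmax(logits)[i])
```

After training, the policy concentrates on the parameter set with the lowest measured runtime, which is the behavior the claimed objective (minimizing the sum of runtimes) would induce.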
CPC Classifications
G06N 3/047 G06F 40/284 G06N 3/092
Filing Date
2024-09-30
Application No.
18901142
About this page
Source document text, dates, docket IDs, and authority are extracted directly from USPTO.
The plain-English summary, classification, and "what to do next" steps are AI-generated from the original text. Cite the source document, not the AI analysis.