LARGE LANGUAGE MODEL INFERENCING ACCELERATION TECHNIQUES
Inventors
Yao Cui Fehlis, Jalal Uddin Mahmud
Abstract
A method includes generating a plurality of tokens from a prompt to a large language model (LLM). The method includes, in one or more iterations, using a first neural network to output a set of speculative decoding parameters selected from a plurality of sets of speculative decoding parameters. Additionally, in the one or more iterations, the method includes performing speculative decoding using the set of speculative decoding parameters to generate a subsequent plurality of tokens appended to the plurality of tokens from the prompt or from a previous iteration to generate an updated plurality of tokens, and collecting a runtime of the speculative decoding. The one or more iterations are repeated until the updated plurality of tokens reaches a maximum token length. The first neural network is trained to output sets of speculative decoding parameters that minimize a sum of the runtimes collected during the one or more iterations.
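The following is a minimal Python sketch of the iteration loop the abstract describes. The candidate parameter sets, the selector network interface, and the speculative_decode helper are hypothetical placeholders (the filing does not specify them); the sketch only illustrates how parameter selection, timed speculative decoding, and token appending would compose.

```python
import time

# Hypothetical candidate sets of speculative decoding parameters
# (e.g., number of draft tokens per round); not from the filing.
PARAM_SETS = [
    {"draft_tokens": 2},
    {"draft_tokens": 4},
    {"draft_tokens": 8},
]


def generate_with_adaptive_speculation(prompt_tokens, selector_net,
                                       speculative_decode, max_len):
    """Iteratively extend prompt_tokens via speculative decoding.

    selector_net: the "first neural network" from the abstract; given the
        current token sequence, returns an index into PARAM_SETS.
    speculative_decode: runs one round of speculative decoding with the
        chosen parameters and returns the newly generated tokens.
    Both callables are assumed interfaces, not the patented implementation.
    """
    tokens = list(prompt_tokens)
    runtimes = []  # per-iteration runtimes; their sum is the training objective

    while len(tokens) < max_len:
        # 1. Select one set of speculative decoding parameters.
        params = PARAM_SETS[selector_net(tokens)]

        # 2. Perform speculative decoding and collect its runtime.
        start = time.perf_counter()
        new_tokens = speculative_decode(tokens, **params)
        runtimes.append(time.perf_counter() - start)

        # 3. Append the new tokens to form the updated plurality of tokens.
        tokens.extend(new_tokens)

    # The selector network is trained so that sum(runtimes) is minimized.
    return tokens, sum(runtimes)
```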
CPC Classifications
Filing Date
2024-09-30
Application No.
18901142