LARGE LANGUAGE MODEL INFERENCING ACCELERATION TECHNIQUES
Inventors
Yao Cui Fehlis, Jalal Uddin Mahmud
Abstract
A method includes generating a plurality of tokens from a prompt to a large language model (LLM). The method includes, in one or more iterations, using a first neural network to output a set of speculative decoding parameters selected from a plurality of sets of speculative decoding parameters. Additionally, in the one or more iterations, the method includes performing speculative decoding using the set of speculative decoding parameters to generate a subsequent plurality of tokens appended to the plurality of tokens from the prompt or from a previous iteration to generate an updated plurality of tokens, and collecting a runtime of the speculative decoding. The one or more iterations are repeated until the updated plurality of tokens reaches a maximum token length. The first neural network is trained to output sets of speculative decoding parameters that minimize a sum of the runtimes collected during the one or more iterations.
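The following is a minimal Python sketch of the iteration loop the abstract describes. The candidate parameter sets, the selector network interface, and the speculative_decode helper are hypothetical placeholders (the filing does not specify them); the sketch only illustrates how parameter selection, timed speculative decoding, and token appending would compose.

```python
import time

# Hypothetical candidate sets of speculative decoding parameters
# (e.g., number of draft tokens per round); not from the filing.
PARAM_SETS = [
    {"draft_tokens": 2},
    {"draft_tokens": 4},
    {"draft_tokens": 8},
]


def generate_with_adaptive_speculation(prompt_tokens, selector_net,
                                       speculative_decode, max_len):
    """Iteratively extend prompt_tokens via speculative decoding.

    selector_net: the "first neural network" from the abstract; given the
        current token sequence, returns an index into PARAM_SETS.
    speculative_decode: runs one round of speculative decoding with the
        chosen parameters and returns the newly generated tokens.
    Both callables are assumed interfaces, not the patented implementation.
    """
    tokens = list(prompt_tokens)
    runtimes = []  # per-iteration runtimes; their sum is the training objective

    while len(tokens) < max_len:
        # 1. Select one set of speculative decoding parameters.
        params = PARAM_SETS[selector_net(tokens)]

        # 2. Perform speculative decoding and collect its runtime.
        start = time.perf_counter()
        new_tokens = speculative_decode(tokens, **params)
        runtimes.append(time.perf_counter() - start)

        # 3. Append the new tokens to form the updated plurality of tokens.
        tokens.extend(new_tokens)

    # The selector network is trained so that sum(runtimes) is minimized.
    return tokens, sum(runtimes)
```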
CPC Classifications
Filing Date
2024-09-30
Application No.
18901142