EFFICIENT ESTIMATION & VERIFICATION WITH EARLY EXITS
Applicants
GOOGLE LLC
Inventors
SCHUSTER, Tal; KOROTKOV, Ivan; JI, Ziwei; KIM, Seungyeon
Abstract
One example aspect is directed to a computer-implemented method (400) for performing model decoding with reduced latency. The method includes obtaining (402) a pre-trained sequence processing model comprising a plurality of layers. The method includes modifying (404) the sequence processing model to include an adapter layer (106) that is configured to receive and process an intermediate representation generated by a particular intermediate layer of the plurality of layers to predict an output token. The method includes training (406) the adapter layer while holding the plurality of layers of the sequence processing model frozen. The method includes deploying (408) the sequence processing model for speculative decoding, in which the adapter layer, the particular intermediate layer, and those layers (104) of the plurality of layers that precede the particular intermediate layer perform speculative token decoding, and those layers (108) of the plurality of layers that are subsequent to the particular intermediate layer perform token verification.
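The deployment step (408) can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration and is not taken from the application: the integer "hidden state", the toy layer functions, the names `EarlyExitModel`, `build_toy_model`, `speculative_decode`, and `greedy_decode`, and the choice of greedy (argmax-style) acceptance. The adapter is hard-coded rather than trained, so step (406) is not shown, and verification runs one position at a time, whereas a practical system would typically verify all drafted positions in a single parallel pass.

```python
from typing import Callable, List, Sequence

VOCAB_SIZE = 8  # hypothetical toy vocabulary
Layer = Callable[[int], int]


def run_layers(layers: Sequence[Layer], hidden: int) -> int:
    """Apply each layer in order to a toy integer hidden state."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


class EarlyExitModel:
    """Toy stand-in: early layers + adapter draft, all layers + head verify."""

    def __init__(self, early: List[Layer], late: List[Layer],
                 adapter: Layer, head: Layer) -> None:
        self.early, self.late = early, late
        self.adapter, self.head = adapter, head

    @staticmethod
    def embed(context: Sequence[int]) -> int:
        # Hypothetical embedding: mixes the last token with the position.
        return context[-1] + 3 * len(context)

    def draft_next(self, context: Sequence[int]) -> int:
        """Early exit: run only the early layers, then the adapter."""
        return self.adapter(run_layers(self.early, self.embed(context)))

    def verify_next(self, context: Sequence[int]) -> int:
        """Full pass: run every layer, then the original output head."""
        hidden = run_layers(self.early + self.late, self.embed(context))
        return self.head(hidden)


def build_toy_model() -> EarlyExitModel:
    early = [lambda h: 3 * h + 1, lambda h: h ^ 5]
    # The late layer only changes some states, so drafts are often right.
    late = [lambda h: h + 1 if h % 4 == 0 else h]
    to_token = lambda h: h % VOCAB_SIZE
    return EarlyExitModel(early, late, adapter=to_token, head=to_token)


def greedy_decode(model: EarlyExitModel, prompt: Sequence[int],
                  num_new_tokens: int) -> List[int]:
    """Baseline: one full pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(model.verify_next(tokens))
    return tokens


def speculative_decode(model: EarlyExitModel, prompt: Sequence[int],
                       num_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Draft with the early exit, then keep only what the full model confirms."""
    tokens = list(prompt)
    target_len = len(tokens) + num_new_tokens
    while len(tokens) < target_len:
        # 1) Speculate draft_len tokens using the early exit only.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(model.draft_next(tokens + draft))
        # 2) Verify: tokens always holds the prefix plus accepted drafts.
        for guess in draft:
            verified = model.verify_next(tokens)
            tokens.append(verified)  # the full model's token is always kept
            if verified != guess or len(tokens) == target_len:
                break  # first mismatch discards the rest of the draft
    return tokens
```

Under these assumptions the speculative output matches plain greedy decoding with the full model token for token, for example `speculative_decode(build_toy_model(), [1, 2], 10) == greedy_decode(build_toy_model(), [1, 2], 10)`. Any latency benefit comes from how often the drafted tokens are accepted, which this toy does not measure.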
IPC Classifications
Designated States
AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LI, LT, LU, LV, MC, ME, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR