EFFICIENT ESTIMATION & VERIFICATION WITH EARLY EXITS

Publication: EP4711981A1 | Kind: A1 | Date: Mar 18, 2026

Applicants

GOOGLE LLC

Inventors

SCHUSTER, Tal; KOROTKOV, Ivan; JI, Ziwei; KIM, Seungyeon

Abstract

One example aspect is directed to a computer-implemented method (400) for performing model decoding with reduced latency. The method includes obtaining (402) a pre-trained sequence processing model comprising a plurality of layers. The method includes modifying (404) the sequence processing model to contain an adapter layer (106) that is configured to receive and process an intermediate representation generated by a particular intermediate layer of the plurality of layers to predict an output token. The method includes training (406) the adapter layer while holding the plurality of layers of the sequence processing model frozen. The method includes deploying (408) the sequence processing model for speculative decoding in which the adapter layer, the particular intermediate layer, and the plurality of layers (104) that precede the particular intermediate layer perform speculative token decoding and the plurality of layers (108) that are subsequent to the particular intermediate layer perform token verification.
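The draft-then-verify split described in the abstract can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the claimed implementation: `early_layers` plays the frozen layers up to and including the intermediate (exit) layer, `late_layers` plays the frozen layers after it, `lm_head` is the model's normal output head, and `adapter` is a cheap predictor that reads the intermediate representation directly. The hidden state is a single integer, and greedy decoding is assumed. The point of the sketch is only that verification by the late layers makes the speculative output identical to ordinary full-model decoding.

```python
from typing import List, Tuple

VOCAB = 7  # hypothetical toy vocabulary size


def early_layers(context: List[int]) -> int:
    """Stand-in for the frozen layers up to and including the exit layer."""
    h = 0
    for tok in context:
        h = (h * 31 + tok + 1) % 1009
    return h


def late_layers(h: int) -> int:
    """Stand-in for the frozen layers after the exit layer."""
    return h + 1 if h % 5 == 0 else h


def lm_head(h: int) -> int:
    """Stand-in for the model's normal output head."""
    return h % VOCAB


def adapter(h: int) -> int:
    """Stand-in for the adapter: predicts a token from the intermediate
    representation alone. It disagrees with the full model when h % 5 == 0."""
    return h % VOCAB


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: every token goes through all layers."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(lm_head(late_layers(early_layers(seq))))
    return seq


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens with early layers + adapter, then verify with late layers.

    Returns the decoded sequence and the number of verification rounds.
    """
    seq = list(prompt)
    target = len(prompt) + n_new
    rounds = 0
    while len(seq) < target:
        # Draft phase: only the early layers and the adapter run.
        ctx, draft, hiddens = list(seq), [], []
        for _ in range(k):
            h = early_layers(ctx)
            tok = adapter(h)
            hiddens.append(h)  # cache the intermediate representation
            draft.append(tok)
            ctx.append(tok)
        # Verify phase: the late layers consume the cached intermediates.
        # A real model could batch these positions into one pass.
        rounds += 1
        for h, tok in zip(hiddens, draft):
            verified = lm_head(late_layers(h))
            seq.append(verified)  # always keep the full model's token
            if verified != tok or len(seq) == target:
                break  # later drafts were conditioned on a wrong token
    return seq, rounds


if __name__ == "__main__":
    out, rounds = speculative_decode([1, 2, 3], n_new=20, k=4)
    print(out == greedy_decode([1, 2, 3], 20), rounds)
```

Because the drafting path reuses the model's own early layers, the intermediate representations computed while drafting are the same ones the late layers need for verification, so no separate draft model is involved in this sketch. The toy omits the adapter training step and any sampling-based acceptance rule.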

IPC Classifications

G06N 3/045 20230101AFI20260128BHEP
G06N 3/084 20230101ALI20260128BHEP
G06N 3/096 20230101ALI20260128BHEP
G06N 3/0442 20230101ALN20260128BHEP
G06N 3/0464 20230101ALN20260128BHEP
G06N 3/0495 20230101ALN20260128BHEP
G06N 3/09 20230101ALN20260128BHEP
G06N 3/094 20230101ALN20260128BHEP

Designated States

AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LI, LT, LU, LV, MC, ME, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR
