KEY-VALUE CACHE COMPRESSION BASED ON GAUGE TRANSFORMATION
Assignee
Intel Corporation
Inventors
Hong Wang
Abstract
A KV cache for transformer models may be compressed through gauge transformation, entropy encoding, or rank-r approximation. Transformation matrices may be determined for gauge transformation of an attention layer. The query weight matrix and key weight matrix of an attention head may be transformed using one transformation matrix. The value weight matrix and output weight matrix of the head may be transformed using another transformation matrix. The gauge transformation may produce canonicalized weights. The attention layer may be updated with the canonicalized weights. The canonicalized model may be executed, and canonicalized KV data may be produced during the execution. A portion of the canonicalized KV data may be further compressed using entropy encoding and then stored in a cold tail cache. The rest of the canonicalized KV data may be stored in a hot window cache. The canonicalized KV data may be further compressed based on rank-r approximation, either before or after gauge transformation.
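The abstract's gauge transformation pairs the query/key weights under one transformation matrix and the value/output weights under another, leaving the attention output unchanged. A minimal NumPy sketch of this invariance follows; the dimensions, the single-head attention function, and the randomly drawn transformation matrices `T` and `S` are illustrative assumptions, not the patent's method for choosing canonicalizing transforms.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 8, 4

# Hypothetical per-head weight matrices.
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

# Hypothetical invertible transformation matrices:
# T for the query/key pair, S for the value/output pair.
T = rng.standard_normal((d_head, d_head))
S = rng.standard_normal((d_head, d_head))

# Gauge-transformed ("canonicalized") weights. Q and K share T,
# V and O share S, so T and S cancel inside the attention computation.
W_Q_c = W_Q @ T
W_K_c = W_K @ np.linalg.inv(T).T
W_V_c = W_V @ S
W_O_c = np.linalg.inv(S) @ W_O

def attention(x, Wq, Wk, Wv, Wo):
    """Single-head scaled dot-product attention."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Wq.shape[1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V @ Wo

x = rng.standard_normal((seq, d_model))
out_orig = attention(x, W_Q, W_K, W_V, W_O)
out_canon = attention(x, W_Q_c, W_K_c, W_V_c, W_O_c)
print(np.allclose(out_orig, out_canon))
```

Because Q'K'ᵀ = (xW_Q T)(xW_K T⁻ᵀ)ᵀ = QKᵀ and V'W_O' = (xW_V S)(S⁻¹W_O) = VW_O, the canonicalized model reproduces the original output while the KV data it emits can be shaped (e.g., energy-concentrated) for entropy encoding or rank-r approximation.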
CPC Classifications
Filing Date
2025-11-21
Application No.
19396765