KEY-VALUE CACHE COMPRESSION BASED ON GAUGE TRANSFORMATION
Assignee
Intel Corporation
Inventors
Hong Wang
Abstract
A KV cache for transformer models may be compressed through gauge transformation, entropy encoding, or rank-r approximation. Transformation matrices may be determined for gauge transformation of an attention layer. The query weight matrix and key weight matrix of an attention head may be transformed using one transformation matrix. The value weight matrix and output weight matrix of the head may be transformed using another transformation matrix. The gauge transformation may produce canonicalized weights. The attention layer may be updated with the canonicalized weights. The canonicalized model may be executed, and canonicalized KV data may be produced during the execution. A portion of the canonicalized KV data may be further compressed using entropy encoding and then stored in a cold tail cache. The rest of the canonicalized KV data may be stored in a hot window cache. The canonicalized KV data may be further compressed based on rank-r approximation, either before or after gauge transformation.
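The abstract's gauge transformation pairs the query/key weights under one transformation matrix and the value/output weights under another, leaving the attention output unchanged. A minimal NumPy sketch of this invariance follows; the dimensions, the single-head attention function, and the randomly drawn transformation matrices `T` and `S` are illustrative assumptions, not the patent's method for choosing canonicalizing transforms.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 8, 4

# Hypothetical per-head weight matrices.
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

# Hypothetical invertible transformation matrices:
# T for the query/key pair, S for the value/output pair.
T = rng.standard_normal((d_head, d_head))
S = rng.standard_normal((d_head, d_head))

# Gauge-transformed ("canonicalized") weights. Q and K share T,
# V and O share S, so T and S cancel inside the attention computation.
W_Q_c = W_Q @ T
W_K_c = W_K @ np.linalg.inv(T).T
W_V_c = W_V @ S
W_O_c = np.linalg.inv(S) @ W_O

def attention(x, Wq, Wk, Wv, Wo):
    """Single-head scaled dot-product attention."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Wq.shape[1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V @ Wo

x = rng.standard_normal((seq, d_model))
out_orig = attention(x, W_Q, W_K, W_V, W_O)
out_canon = attention(x, W_Q_c, W_K_c, W_V_c, W_O_c)
print(np.allclose(out_orig, out_canon))
```

Because Q'K'ᵀ = (xW_Q T)(xW_K T⁻ᵀ)ᵀ = QKᵀ and V'W_O' = (xW_V S)(S⁻¹W_O) = VW_O, the canonicalized model reproduces the original output while the KV data it emits can be shaped (e.g., energy-concentrated) for entropy encoding or rank-r approximation.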
CPC Classifications
Filing Date
2025-11-21
Application No.
19396765