Pyramid Key-Value Cache Compression for Transformer Models
Summary
USPTO published patent application US20260099695A1 on April 9, 2026, for a method of operating transformer models with algorithmic key-value cache memory allocation across decoding layers. The invention allocates a fixed memory budget progressively across layers, with higher layers receiving smaller cache allocations. Each layer independently determines maximum key-value vector pairs based on its allocated cache.
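The application does not pin down the exact allocation formula, only that higher layers receive progressively smaller shares of a fixed budget. A minimal sketch, assuming a linearly decreasing (arithmetic) split as one illustrative scheme:

```python
def pyramid_budgets(total_budget: int, num_layers: int) -> list[int]:
    """Split a fixed KV-cache budget across decoding layers so that
    progressively higher (later) layers receive progressively smaller
    allocations.

    The linearly decreasing split used here is an assumption for
    illustration; the application only requires that allocations
    decrease with layer depth.
    """
    # Weight layer i by (num_layers - i), so layer 0 gets the largest share.
    weights = [num_layers - i for i in range(num_layers)]
    total_weight = sum(weights)
    budgets = [total_budget * w // total_weight for w in weights]
    # Distribute any rounding remainder to the lowest layers,
    # preserving the nonincreasing order.
    remainder = total_budget - sum(budgets)
    for i in range(remainder):
        budgets[i % num_layers] += 1
    return budgets
```

Each layer would then size its cache from its own entry in the returned list, independently of the others.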
What changed
The application claims a system that allocates a fixed key-value cache memory budget across a transformer model's decoding layers, with progressively higher layers receiving progressively smaller allocations; each layer independently caps the number of key-value vector pairs it retains during token decoding operations.
This publication affects AI researchers, machine learning engineers, and technology companies developing transformer-based models. If granted, the patent would provide intellectual property protection for cache compression techniques relevant to optimizing large language model inference and deployment. The publication itself carries no compliance deadlines or regulatory obligations.
Archived snapshot
On Apr 17, 2026, GovPing captured this document from the original source. If the source has since changed or been removed, this is the text as it existed at that time.
PYRAMID KEY-VALUE CACHE COMPRESSION FOR TRANSFORMER MODELS
Application: US20260099695A1 · Kind: A1 · Published: Apr 09, 2026
Inventors
Wen XIAO, Wei XIONG, Abedelkader ASI, Zefan CAI
Abstract
A method for operating a transformer model includes algorithmically allocating a fixed budget for a key-value cache between multiple decoding layers per an allocation scheme that ensures progressively higher decoding layers in the transformer model are allocated progressively smaller quantities of cache memory. The method further includes configuring each of the multiple decoding layers of the transformer model to retain no more than a maximum number of key-value vector pairs in the key-value cache during a token decoding operation, the maximum number of key-value vector pairs being independently determined for each decoding layer of the multiple decoding layers based on the cache memory that is allocated to the decoding layer.
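The abstract requires only that each decoding layer retain no more than its independently determined maximum number of key-value pairs; it does not specify what happens when that cap is reached. A minimal sketch of per-layer cap enforcement, assuming a simple evict-oldest policy (the eviction rule is an illustrative assumption, not the claimed method):

```python
class LayerKVCache:
    """A per-layer key-value cache capped at a fixed number of pairs.

    Enforces the abstract's constraint that a layer retains no more
    than its allocated maximum of key-value vector pairs during token
    decoding. The evict-oldest policy is an assumption for
    illustration only.
    """
    def __init__(self, max_pairs: int):
        self.max_pairs = max_pairs
        self.pairs = []  # list of (key_vector, value_vector) tuples

    def append(self, key, value):
        """Add a new KV pair, evicting the oldest if over budget."""
        self.pairs.append((key, value))
        if len(self.pairs) > self.max_pairs:
            self.pairs.pop(0)  # cap is enforced after every append
```

Under the claimed scheme, a model would instantiate one such cache per decoding layer, each with a different `max_pairs` derived from that layer's share of the fixed overall budget.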
CPC Classifications
G06N 3/045
Filing Date
2024-10-09
Application No.
18910974
About this page
Source document text, dates, docket IDs, and authority are extracted directly from USPTO.
The summary, classification, recommended actions, deadlines, and penalty information are AI-generated from the original text and may contain errors. Always verify against the source document.