Bridging LLMs of Differing Sizes to Reduce Latency
Summary
USPTO published patent application US20260099528A1 by inventor Brett Barros for methods of reducing LLM latency by using a smaller LLM to generate immediate responses while a larger LLM produces refined content starting from the smaller model's output. The larger model generates a refined portion succeeding the initial content, which can then be rendered to the user. Alternative implementations use default text strings or predefined templates selected via natural language understanding of the user query.
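As a rough illustration of the claimed flow (the smaller model drafts, a portion is rendered at once, and the larger model continues from that portion), here is a minimal Python sketch. The model functions and sample strings are invented stand-ins for illustration, not APIs or text from the filing.

```python
# Hypothetical sketch of the bridging flow described in the application.
# The two model functions are canned stand-ins, not real LLM calls.

def small_llm(query: str) -> str:
    """Stand-in for the low-latency smaller LLM's initial content."""
    return "Paris is the capital of France. It is a big city."

def large_llm(query: str, prefix: str) -> str:
    """Stand-in for the larger LLM, which generates content that starts
    with the already-rendered prefix and appends a refined portion."""
    return prefix + " It sits on the Seine and is home to about 2.1 million people."

def respond(query: str, render) -> str:
    # 1. The smaller LLM produces initial content quickly.
    draft = small_llm(query)
    # 2. A portion of that content (here, the first sentence) is rendered
    #    immediately as the response to the user query.
    portion = draft.split(". ")[0] + "."
    render(portion)
    # 3. The larger LLM generates refined content beginning with the
    #    rendered portion; only the refined suffix is rendered after it.
    refined = large_llm(query, portion)
    render(refined[len(portion):])
    return refined

chunks = []
answer = respond("What is the capital of France?", chunks.append)
```

Because the larger model's output is constrained to begin with the already-rendered portion, the refinement can be appended without rewriting text the user has already seen.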
What changed
USPTO published patent application US20260099528A1 for LLM latency reduction technology. The application discloses methods where a smaller LLM generates initial content responsive to user queries, allowing immediate rendering of a portion as a response. A larger LLM then generates refined content beginning with that portion and including additional refined content. Alternative embodiments describe using default text strings or templates selected via natural language understanding instead of a smaller LLM.
Technology companies developing LLM-based applications or chatbots may benefit from reviewing this patent filing to understand potential claims around latency reduction techniques. The application has no immediate compliance implications as it represents a patent application rather than a granted patent.
Archived snapshot
Apr 18, 2026: GovPing captured this document from the original source. If the source has since changed or been removed, this is the text as it existed at that time.
LLM LATENCY REDUCTION VIA BRIDGING MULTIPLE LLMS OF DIFFERING SIZES
Publication US20260099528A1 (Kind Code A1), published Apr 09, 2026
Inventors
Brett Barros
Abstract
Implementations utilize a smaller LLM to generate content responsive to a user query and cause a portion of the generated content to be rendered as an immediate response to the user query. Implementations further utilize a larger LLM to generate content that starts with the portion of the generated content and that includes a refined portion succeeding the portion of the generated content. The refined portion can be rendered succeeding the portion of the generated content. In some implementations, instead of using the smaller LLM, the portion of the generated content rendered as the immediate response can be generated based on a default text string or a template, where the template is selected from a plurality of predefined templates based on a natural language understanding of the user query.
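The alternative embodiment, which selects a predefined template via natural language understanding of the query rather than invoking a smaller LLM, might look like the minimal sketch below. The intents, template strings, and keyword matching are invented for illustration and do not appear in the filing.

```python
# Hypothetical sketch of the template-based alternative: a predefined
# template, selected by a crude intent classifier, supplies the
# immediately rendered portion instead of a smaller LLM's output.

TEMPLATES = {
    "weather": "Here is the forecast you asked about:",
    "definition": "Here is a definition:",
    "default": "Let me look into that.",
}

def classify_intent(query: str) -> str:
    """Toy natural-language-understanding step (keyword matching)."""
    q = query.lower()
    if "weather" in q or "forecast" in q:
        return "weather"
    if q.startswith(("what is", "define")):
        return "definition"
    return "default"

def immediate_response(query: str) -> str:
    """Select the template rendered while the larger LLM runs."""
    return TEMPLATES[classify_intent(query)]
```

For example, `immediate_response("What is entropy?")` would return the definition template, which is rendered at once while the larger model generates the full answer.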
CPC Classifications
G06F 16/3344 G06F 16/338 G06F 40/289 G06F 40/35 G06N 3/0475
Filing Date
2025-12-11
Application No.
19416474
Related changes
About this page
Source document text, dates, docket IDs, and authority are extracted directly from USPTO.
The summary, classification, recommended actions, deadlines, and penalty information are AI-generated from the original text and may contain errors. Always verify against the source document.