USPTO Patent Grant: End-to-End Segmentation ASR Model

ChangeBridge: Patent Grants - AI & Computing (G06N)

Published March 24th, 2026

Detected March 25th, 2026

Summary

The USPTO has granted patent US12586579B2 to Google LLC for an automatic speech recognition (ASR) model that combines end-to-end segmentation with a two-pass cascaded encoder. The patent describes a system in which a first encoder-decoder pass detects the ends of speech segments and emits timestamps, and a second encoder-decoder pass uses those timestamps to produce its recognition output.

What changed

The United States Patent and Trademark Office (USPTO) has granted patent US12586579B2, titled 'End-to-end segmentation in a two-pass cascaded encoder automatic speech recognition model,' to Google LLC. The patent covers an ASR model architecture that unifies an end-to-end segmenter with a two-pass cascaded encoder. In the first pass, an encoder generates a higher-order feature representation from the acoustic frames, and a decoder produces a probability distribution at each output step, indicates whether that step is the end of a speech segment, and emits an end-of-speech timestamp. In the second pass, a second encoder takes the first-pass representation and the end-of-speech timestamp and generates a second higher-order feature representation, from which a second decoder generates a second probability distribution.

This patent grant is an intellectual property development and does not impose direct regulatory obligations on businesses. However, companies that develop or use speech recognition technology, particularly in the AI and computing sectors, should be aware of it. The grant may affect future product development, licensing strategies, and infringement risk assessments. The patent was filed on November 17, 2023, and granted on March 24, 2026.

Source document (simplified)

End-to-end segmentation in a two-pass cascaded encoder automatic speech recognition model

Grant: US12586579B2 | Kind: B2 | Granted: Mar 24, 2026

Assignee

Google LLC

Inventors

Wenqian Ronny Huang, Shuo-yiin Chang, Tara N. Sainath, Yanzhang He

Abstract

A unified end-to-end segmenter and two-pass automatic speech recognition (ASR) model includes a first encoder, a first decoder, a second encoder, and a second decoder. The first encoder is configured to receive a sequence of acoustic frames and generate a first higher order feature representation. The first decoder is configured to receive the first higher order feature representation and generate, at each of a plurality of output steps, a first probability distribution and an indication of whether the output step corresponds to an end of speech segment, and emit an end of speech timestamp. The second encoder is configured to receive the first higher order feature representation and the end of speech timestamp, and generate a second higher order feature representation. The second decoder is configured to receive the second higher order feature representation and generate a second probability distribution.
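The data flow in the abstract can be sketched in a few lines of Python. This is only a toy illustration of how the four components connect, not the patented implementation: the feature computations, the silence-threshold rule for segment ends, the 10 ms frame shift, the three-symbol vocabulary, and every function name below are assumptions made for the example.

```python
import math
from typing import List, Tuple

FRAME_SHIFT_MS = 10            # assumed frame shift; not stated in the source
VOCAB = ["<blank>", "a", "b"]  # hypothetical toy vocabulary


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_encoder(frames: List[List[float]]) -> List[List[float]]:
    """Toy stand-in: map each acoustic frame to [mean, peak] features."""
    return [[sum(f) / len(f), max(f)] for f in frames]


def first_decoder(
    features: List[List[float]], silence_threshold: float = 0.1
) -> Tuple[List[List[float]], List[bool], List[int]]:
    """Per output step: a distribution and an end-of-segment flag.

    Toy rule (an assumption): a segment ends on the first quiet step
    that follows speech. Each flagged step emits a timestamp in ms.
    """
    dists, flags, timestamps_ms = [], [], []
    in_speech = False
    for step, (mean, peak) in enumerate(features):
        dists.append(softmax([-mean, mean, peak]))
        is_end = in_speech and peak < silence_threshold
        in_speech = peak >= silence_threshold
        flags.append(is_end)
        if is_end:
            timestamps_ms.append(step * FRAME_SHIFT_MS)
    return dists, flags, timestamps_ms


def second_encoder(
    features: List[List[float]], timestamps_ms: List[int]
) -> List[List[float]]:
    """Use first-pass features plus timestamps: append a per-segment mean."""
    boundaries = [t // FRAME_SHIFT_MS for t in timestamps_ms] + [len(features)]
    out, start = [], 0
    for end in boundaries:
        segment = features[start:end]
        if segment:
            seg_mean = sum(f[0] for f in segment) / len(segment)
            out.extend(f + [seg_mean] for f in segment)
        start = end
    return out


def second_decoder(features: List[List[float]]) -> List[List[float]]:
    """Second-pass distribution over the toy vocabulary at each step."""
    return [softmax(f) for f in features]


def recognize(frames: List[List[float]]):
    """Run both passes; return first-pass output, timestamps, second-pass output."""
    first_features = first_encoder(frames)
    first_dists, _flags, timestamps_ms = first_decoder(first_features)
    second_features = second_encoder(first_features, timestamps_ms)
    return first_dists, timestamps_ms, second_decoder(second_features)


# Two bursts of "speech", each followed by a quiet frame.
frames = [[0.5, 0.9], [0.4, 0.8], [0.0, 0.05], [0.6, 0.7], [0.0, 0.02]]
first_dists, timestamps_ms, second_dists = recognize(frames)
print(timestamps_ms)  # [20, 40]
```

The point of the sketch is the wiring: the second encoder consumes the first encoder's output rather than the raw audio, and the end-of-speech timestamps from the first decoder determine how the second pass groups steps into segments.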

CPC Classifications

G10L 15/063; G10L 15/16; G10L 15/22; G10L 15/05; G10L 15/02; G10L 15/32; G10L 2015/0631; G10L 15/197; G10L 2015/025; G10L 15/28; G10L 15/30; G10L 15/19; G10L 15/167; G10L 15/183; G10L 15/26; G06N 20/00; G06N 5/00