Automatic speech recognition accuracy with multimodal embeddings search

Grant US12579995B2 · Kind: B2 · Mar 17, 2026

Assignee

Google LLC

Inventors

Christopher Li, Kyle Scott Kastner, Yuan Wang, Zhehuai Chen, Andrew Maxwell Rosenberg, Heng Su, Qian Chen, Leonid Aleksandrovich Velikovich, Patrick Maxim Rondon, Diamantino Antonio Caseiro, Zelin Wu

Abstract

A method includes receiving training data that includes a set of transcribed speech utterances where each respective transcribed speech utterance is paired with a corresponding transcription. For each respective transcribed speech utterance, the method includes generating an encoded audio representation and an encoded textual representation, generating a higher order audio feature representation for a corresponding encoded audio representation, generating a higher order textual feature representation for a corresponding encoded textual representation, and determining a loss for the respective transcribed speech utterance based on the higher order audio feature representation and the higher order textual feature representation. The method also includes training a speech encoder and a text encoder of a correction model based on the loss determined for each transcribed speech utterance of the set of transcribed speech utterances.
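The training flow in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method. The abstract does not say how the higher order feature representations are produced or which loss is used, so mean pooling over frames and a symmetric contrastive (InfoNCE-style) loss over a batch stand in for those steps here. The names `mean_pool`, `contrastive_loss`, and `temperature` are hypothetical, and the speech and text encoders are assumed to have already produced per-frame and per-token vectors.

```python
import math


def mean_pool(frames):
    """Assumed stand-in for the 'higher order feature representation' step:
    average a sequence of encoder output vectors into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(audio_encoded, text_encoded, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired utterances.

    audio_encoded[i] and text_encoded[i] are the encoded audio and encoded
    text (lists of vectors) for the same transcribed utterance. The loss
    choice is an assumption; the abstract only says the loss is based on
    the two higher order representations.
    """
    audio = [l2_normalize(mean_pool(a)) for a in audio_encoded]
    text = [l2_normalize(mean_pool(t)) for t in text_encoded]
    n = len(audio)

    # sim[i][j]: temperature-scaled cosine similarity of audio i and text j.
    sim = [[dot(audio[i], text[j]) / temperature for j in range(n)]
           for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        # Treat each row as logits whose correct class is the paired item.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_on_diagonal(sim)
                  + cross_entropy_on_diagonal(sim_transposed))


# Toy batch: two utterances, 2-dimensional encoder outputs.
audio_batch = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
text_batch = [[[2.0, 0.0]], [[0.0, 3.0]]]

loss_paired = contrastive_loss(audio_batch, text_batch)
loss_mismatched = contrastive_loss(audio_batch, text_batch[::-1])
```

In this toy batch, the correctly paired inputs yield a much smaller loss than the mismatched ones. That difference is the signal that would drive gradient updates to the speech encoder and text encoder in an actual training loop.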

CPC Classifications

G10L 25/30, G10L 15/26, G10L 15/063, G10L 15/16, G06N 3/045

Filing Date

2023-06-29

Application No.

18344007

Claims
