Determining audio and video representations using self-supervised learning

Grant US12586350B2 Kind: B2 Mar 24, 2026

Assignee

Adobe Inc.

Inventors

Simon Jenni, John Collomosse

Abstract

Embodiments are disclosed for training a system to generate audio and video representations using self-supervised learning. The method may include receiving a video signal including an audio component and a video component. A first machine learning model is trained to determine a representation of the audio component using a contrastive learning task and a temporal learning task. A second machine learning model is trained to determine a representation of the video component using the contrastive learning task and the temporal learning task. By training the machine learning models using both contrastive learning tasks and temporal learning tasks, the machine learning models learn short-term features, long-term features, and semantic features of input data.
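
The abstract does not state how the two tasks are formulated or combined, so the following is only an illustrative sketch under stated assumptions, not the claimed method. It assumes the contrastive task is an InfoNCE-style loss that pairs the i-th audio embedding with the i-th video embedding, the temporal task is a binary "is clip A before clip B?" prediction, and the two losses are summed with a weight. All function names and parameters (`contrastive_loss`, `temporal_loss`, `combined_loss`, `temperature`, `temporal_weight`) are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def log_softmax_at(logits, index):
    # Numerically stable log-softmax evaluated at one position.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z


def contrastive_loss(audio_embs, video_embs, temperature=0.1):
    """Assumed InfoNCE-style loss (audio-to-video direction only):
    the i-th audio clip should score highest against the i-th video clip."""
    a = [normalize(e) for e in audio_embs]
    v = [normalize(e) for e in video_embs]
    n = len(a)
    total = 0.0
    for i in range(n):
        logits = [dot(a[i], v[j]) / temperature for j in range(n)]
        total -= log_softmax_at(logits, i)
    return total / n


def temporal_loss(order_logits, order_labels):
    """Assumed temporal task: binary cross-entropy on logits that predict
    whether a pair of clips appears in the correct temporal order."""
    total = 0.0
    for z, y in zip(order_logits, order_labels):
        # Stable form of BCE with logits.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(order_logits)


def combined_loss(audio_embs, video_embs, order_logits, order_labels,
                  temporal_weight=1.0):
    # Assumed combination: a weighted sum of the two task losses.
    return (contrastive_loss(audio_embs, video_embs)
            + temporal_weight * temporal_loss(order_logits, order_labels))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    video = [[1.0, 0.0], [0.0, 1.0]]
    print(combined_loss(audio, video, [2.0, -2.0], [1, 0]))
```

In this sketch, correctly paired audio and video embeddings yield a small contrastive term, while mismatched pairs yield a large one. The temporal term is computed from separate order-prediction logits, which in a real system would come from a prediction head on top of each model.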

CPC Classifications

G06V 10/764 G06V 10/82 G06V 20/46 G06N 3/045

Filing Date

2023-01-31

Application No.

18162544
