Determining audio and video representations using self-supervised learning
Assignee
Adobe Inc.
Inventors
Simon Jenni, John Collomosse
Abstract
Embodiments are disclosed for training a system to generate audio and video representations using self-supervised learning. The method may include receiving a video signal including an audio component and a video component. A first machine learning model is trained to determine a representation of the audio component using a contrastive learning task and a temporal learning task. A second machine learning model is trained to determine a representation of the video component using the contrastive learning task and the temporal learning task. By training the machine learning models using both contrastive learning tasks and temporal learning tasks, the machine learning models learn short-term features, long-term features, and semantic features of input data.
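As an illustration only, the combined objective described in the abstract might be sketched as follows. The abstract does not specify the form of either task, so everything concrete here is an assumption: the contrastive task is written as a symmetric InfoNCE loss between per-clip audio and video embeddings, the temporal task as a three-way classification of clip order (before, overlapping, after), and the names `contrastive_loss`, `temporal_order_loss`, `total_loss`, `temperature`, and `temporal_weight` are hypothetical. The embeddings would come from the first (audio) and second (video) machine learning models, which are not modeled here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each embedding to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(audio_emb, video_emb, temperature=0.1):
    # Assumed form: symmetric InfoNCE. Row i of audio_emb and row i of
    # video_emb come from the same clip (positive pair); all other rows
    # in the batch act as negatives.
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature
    idx = np.arange(len(a))
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2v + loss_v2a)

def temporal_order_loss(order_logits, order_labels):
    # Assumed pretext task: classify the temporal relation between two
    # clips (0 = before, 1 = overlapping, 2 = after) with cross-entropy.
    logp = log_softmax(order_logits, axis=1)
    return -logp[np.arange(len(order_labels)), order_labels].mean()

def total_loss(audio_emb, video_emb, order_logits, order_labels,
               temporal_weight=1.0):
    # Both tasks contribute to a single training objective; the weighting
    # is a hypothetical hyperparameter.
    return (contrastive_loss(audio_emb, video_emb)
            + temporal_weight * temporal_order_loss(order_logits, order_labels))
```

In this sketch the contrastive term pulls together audio and video embeddings of the same clip, while the temporal term would require the representations to carry ordering information; in an actual system each modality's model could be trained against both terms, as the abstract describes.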
CPC Classifications
Filing Date
2023-01-31
Application No.
18162544
Claims
20