SYSTEMS AND METHODS FOR VIDEO-LANGUAGE NEURAL NETWORKS

Application US20260080681A1 Kind: A1 Mar 19, 2026

Inventors

Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles Duque

Abstract

Embodiments described herein provide a vision-language neural network framework that outputs a text response to a user text query relating to the media content of a video input. Specifically, the vision-language neural network may comprise (1) a vision encoder (ViT) that transforms each frame of the video input into a set of tokens, (2) a frame-level tokenizer that reduces the number of tokens, (3) a temporal encoder that builds video-level token representations, and (4) an autoregressive LLM that generates a text output based on the video tokens and text prompt tokens.
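
The four stages named in the abstract can be illustrated as a token-count walk-through. This is only a minimal sketch under stated assumptions, not the claimed implementation: the function names (encode_frame, mean_pool, video_to_llm_input), the embedding width, and all token counts are hypothetical, and plain mean pooling stands in for both the frame-level tokenizer and the temporal encoder, whose actual designs the abstract does not specify.

```python
import random

DIM = 8  # hypothetical embedding width, chosen only for illustration


def encode_frame(frame_id, num_tokens=16, dim=DIM):
    """Stand-in for the vision encoder: one frame -> num_tokens vectors."""
    rng = random.Random(frame_id)  # deterministic placeholder features
    return [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]


def mean_pool(tokens, out_count):
    """Average contiguous groups of tokens down to out_count tokens."""
    assert len(tokens) % out_count == 0, "sketch assumes an even split"
    group = len(tokens) // out_count
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        # Average each embedding dimension across the tokens in the chunk.
        pooled.append([sum(col) / group for col in zip(*chunk)])
    return pooled


def video_to_llm_input(frame_ids, prompt_embeddings, per_frame=4, video_tokens=8):
    # (1) Vision encoder: each frame becomes a set of tokens.
    per_frame_tokens = [encode_frame(f) for f in frame_ids]
    # (2) Frame-level tokenizer (assumed: pooling) reduces tokens per frame.
    reduced = [mean_pool(t, per_frame) for t in per_frame_tokens]
    # (3) Temporal encoder (assumed: pooling across frames) builds
    #     video-level tokens from the reduced per-frame tokens.
    flat = [tok for frame in reduced for tok in frame]
    video = mean_pool(flat, video_tokens)
    # (4) The LLM would consume video tokens followed by text prompt tokens.
    return video + prompt_embeddings


prompt = [[0.0] * DIM for _ in range(5)]       # 5 placeholder prompt tokens
sequence = video_to_llm_input(range(8), prompt)
print(len(sequence), len(sequence[0]))          # sequence length, embedding width
```

With 8 frames of 16 tokens each, the sketch reduces 128 frame tokens to 32, then to 8 video-level tokens, and appends 5 prompt tokens, so the sequence handed to the language model has 13 entries. The point is the shrinking token budget across stages, not the particular pooling operator.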

CPC Classifications

G06V 20/41, G06F 40/40, G06N 3/084, G06V 10/778, G06V 10/82, G06V 20/46

Filing Date

2025-01-30

Application No.

19041811
