SYSTEMS AND METHODS FOR VIDEO-LANGUAGE NEURAL NETWORKS
Application
US20260080681A1
Kind: A1
Mar 19, 2026
Inventors
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles Duque
Abstract
Embodiments described herein provide a vision-language neural network framework that outputs a text response to a user text query relating to the media content of a video input. Specifically, the vision-language neural network may comprise (1) a vision encoder (e.g., a vision transformer, ViT) that transforms each frame of the video input into a set of tokens, (2) a frame-level tokenizer that reduces the number of tokens, (3) a temporal encoder that builds video-level token representations, and (4) an autoregressive large language model (LLM) that generates a text output based on the video tokens and the text prompt tokens.
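The four stages named in the abstract can be illustrated by tracing how many tokens each stage hands to the next. The following is a minimal sketch for illustration only, not the claimed implementation: the names `PipelineConfig` and `trace_token_counts` are hypothetical, every size is an assumed value (the abstract states none), and the assumption that the temporal encoder emits a fixed number of video-level tokens is ours.

```python
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    # All sizes below are hypothetical; the abstract does not state any.
    num_frames: int = 8          # frames sampled from the video input
    patches_per_frame: int = 64  # tokens the vision encoder emits per frame
    tokens_per_frame: int = 16   # tokens kept per frame by the frame-level tokenizer
    video_tokens: int = 32       # tokens emitted by the temporal encoder (assumed fixed)
    prompt_tokens: int = 12      # tokens in the user text query


def trace_token_counts(cfg: PipelineConfig) -> dict:
    """Return the token count leaving each stage of the pipeline."""
    if cfg.tokens_per_frame > cfg.patches_per_frame:
        raise ValueError("the frame-level tokenizer should not add tokens")

    # (1) Vision encoder: every frame becomes a set of tokens.
    after_vision_encoder = cfg.num_frames * cfg.patches_per_frame
    # (2) Frame-level tokenizer: fewer tokens per frame.
    after_frame_tokenizer = cfg.num_frames * cfg.tokens_per_frame
    # (3) Temporal encoder: video-level token representation.
    after_temporal_encoder = cfg.video_tokens
    # (4) LLM input: video tokens together with the text prompt tokens.
    llm_input = after_temporal_encoder + cfg.prompt_tokens

    return {
        "vision_encoder": after_vision_encoder,
        "frame_tokenizer": after_frame_tokenizer,
        "temporal_encoder": after_temporal_encoder,
        "llm_input": llm_input,
    }


if __name__ == "__main__":
    for stage, count in trace_token_counts(PipelineConfig()).items():
        print(f"{stage}: {count} tokens")
```

Under these assumed sizes, the sketch shows why stages (2) and (3) matter: the sequence the LLM must attend over is far shorter than the raw per-frame token stream produced by the vision encoder.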
CPC Classifications
G06V 20/41
G06F 40/40
G06N 3/084
G06V 10/778
G06V 10/82
G06V 20/46
Filing Date
2025-01-30
Application No.
19041811