USPTO Patent for Text-to-Image Generation Using Language Models

ChangeBridge: Patent Grants - AI & Computing (G06N)

Published March 24th, 2026

Detected March 25th, 2026

Summary

The USPTO has granted Salesforce, Inc. a patent (US12585919B2) for systems and methods related to text-to-image generation using language models. The patent describes a mechanism to integrate pre-trained language models into text-to-image generation models, enhancing image generation capabilities based on textual prompts.


What changed

The United States Patent and Trademark Office (USPTO) has granted patent US12585919B2 to Salesforce, Inc. The patent covers "Systems and methods for text-to-image generation using language models," detailing a mechanism to replace existing text encoders with more powerful pre-trained language models. This involves training a translation network to map features from the language model output into the target text encoder's space, preserving the language model's structure while enabling its use in text-to-image generation.
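The idea of training a translation network can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the patented method: the summary does not specify the network architecture or training loss, so the linear map, the mean-squared-error objective, and all names (`train_translation`, `lm_features`, `encoder_features`) are hypothetical. Random vectors stand in for the frozen language model's output features and for the target text encoder's features.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(W, inputs, targets):
    """Mean squared error between translated inputs and targets."""
    total, count = 0.0, 0
    for x, y in zip(inputs, targets):
        pred = matvec(W, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
        count += len(y)
    return total / count

def train_translation(inputs, targets, lr=0.05, epochs=200):
    """Fit a linear map W with per-sample gradient descent on squared error.

    Assumption: a linear layer and an MSE loss are illustrative choices only.
    The language model itself is never updated; only W is trained.
    """
    d_in, d_out = len(inputs[0]), len(targets[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            pred = matvec(W, x)
            err = [p - t for p, t in zip(pred, y)]
            for i in range(d_out):
                for j in range(d_in):
                    W[i][j] -= lr * err[i] * x[j]
    return W

# Synthetic stand-ins (hypothetical): 4-dim "language model" features and
# 3-dim "target text encoder" features related by an unknown linear map.
rng = random.Random(0)
D_LM, D_ENC, N = 4, 3, 32
hidden_map = [[rng.uniform(-1, 1) for _ in range(D_LM)] for _ in range(D_ENC)]
lm_features = [[rng.uniform(-1, 1) for _ in range(D_LM)] for _ in range(N)]
encoder_features = [matvec(hidden_map, x) for x in lm_features]

zero_map = [[0.0] * D_LM for _ in range(D_ENC)]
mse_before = mse(zero_map, lm_features, encoder_features)
W = train_translation(lm_features, encoder_features)
mse_after = mse(W, lm_features, encoder_features)
print(f"MSE before training: {mse_before:.4f}, after: {mse_after:.2e}")
```

In this sketch, the translated features `matvec(W, x)` would be passed to the image generator in place of the original text encoder's output, which is why the language model can be swapped in without retraining the generator.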

This patent grant is primarily of interest to technology companies and researchers in the AI and computing fields. While it does not impose new regulatory obligations or compliance deadlines on regulated entities, it signifies a development in AI technology that may influence future product development and intellectual property strategies. Companies operating in AI-driven image generation should be aware of this patented technology, particularly if their products utilize similar integration methods.

Source document (simplified)


Systems and methods for text-to-image generation using language models

Grant: US12585919B2, Kind: B2, Mar 24, 2026

Assignee

Salesforce, Inc.

Inventors

Ning Yu, Can Qin, Chen Xing, Shu Zhang, Stefano Ermon, Caiming Xiong, Ran Xu

Abstract

Embodiments described herein provide a mechanism for replacing existing text encoders in text-to-image generation models with more powerful pre-trained language models. Specifically, a translation network is trained to map features from the pre-trained language model output into the space of the target text encoder. The training preserves the rich structure of the pre-trained language model while allowing it to operate within the text-to-image generation model. The resulting modularized text-to-image model receives a prompt and generates an image representing the features contained in the prompt.