USPTO Patent Grant for Policy Neural Network Agent Control

ChangeBridge: Patent Grants - AI & Computing (G06N)

Published March 24th, 2026

Detected March 25th, 2026

Summary

The USPTO has granted patent US12585941B2 to GDM Holding LLC for a method of training a policy neural network to control an agent. The patent, filed on January 7, 2022, describes training with best response policy iteration: at each training iteration, training data is generated by controlling the agent with an improved policy, and the policy neural network is then updated by training on that data.

What changed

The United States Patent and Trademark Office (USPTO) has issued patent US12585941B2, titled "Training a policy neural network for controlling an agent using best response policy iteration," to GDM Holding LLC. The patent, granted on March 24, 2026, with a filing date of January 7, 2022, describes a method for training a policy neural network by repeatedly updating it over a plurality of training iterations. At each iteration, training data is generated by controlling the agent with an improved policy, and a best response computation is performed using a candidate policy derived from the policy neural networks of one or more preceding iterations together with a candidate value neural network. The policy neural network is then updated by training on that data.

This patent grant is an intellectual property matter and does not impose direct regulatory obligations on businesses. It does, however, cover techniques in AI and machine learning, specifically agent control and policy optimization. Companies developing AI systems, particularly those that use neural networks for agent control, may wish to review the patent's claims to understand the scope of the granted rights and to confirm that their own development activities do not infringe them.

Source document (simplified)

Training a policy neural network for controlling an agent using best response policy iteration

Grant: US12585941B2 | Kind: B2 | Granted: Mar 24, 2026

Assignee

GDM Holding LLC

Inventors

Thomas William Anthony, Thomas Edward Eccles, Andrea Tacchetti, János Kramár, Ian Michael Gemp, Thomas Chalmers Hudson, Nicolas Pierre Mickaël Porcel, Marc Lanctot, Julien Perolat, Richard Everett, Thore Kurt Hartwig Graepel, Yoram Bachrach

Abstract

Methods, systems and apparatus, including computer programs encoded on computer storage media, for training a policy neural network by repeatedly updating the policy neural network at each of a plurality of training iterations. One of the methods includes generating training data for the training iteration by controlling the agent in accordance with an improved policy that selects actions in response to input state representations. A best response computation is performed using (i) a candidate policy generated from respective policy neural networks as of one or more preceding iterations and (ii) a candidate value neural network. The candidate value neural network is configured to generate a value output that is an estimate of a value of the environment being in the state characterized by a state representation to complete a particular task. The policy neural network is updated by training the policy neural network on the training data.
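
The loop described in the abstract can be illustrated with a small sketch. This is not the patented method or its claims; it is a toy illustration under several assumptions the source does not state: the environment is a single-state rock-paper-scissors game, the "policy neural network" is a three-entry table of logits, the "candidate value neural network" is replaced by a direct payoff lookup, and the candidate policy is picked uniformly at random from earlier snapshots. All function and variable names are hypothetical.

```python
import math
import random

# Hypothetical toy setting: a single-state, two-player zero-sum matrix game
# (rock-paper-scissors). PAYOFF[a][b] is our reward for action a vs. action b.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]
NUM_ACTIONS = 3


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def value_estimate(my_action, other_action):
    # Stand-in for the candidate value neural network (assumption): the value
    # of the resulting state is the known payoff, not a learned estimate.
    return PAYOFF[my_action][other_action]


def best_response(candidate_policy, rng, num_samples=32):
    # Sampled best response: score each of our actions against opponent
    # actions drawn from the candidate policy, then pick the highest scorer.
    others = rng.choices(range(NUM_ACTIONS), weights=candidate_policy,
                         k=num_samples)
    scores = [sum(value_estimate(a, b) for b in others) / num_samples
              for a in range(NUM_ACTIONS)]
    return max(range(NUM_ACTIONS), key=lambda a: scores[a])


def train(num_iterations=200, episodes_per_iteration=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [1.0, 0.0, 0.0]        # tabular stand-in for the policy network
    snapshots = [softmax(logits)]   # policies as of preceding iterations
    for _ in range(num_iterations):
        # 1. Generate training data by acting with the improved policy, here
        #    a best response to a candidate policy from an earlier iteration.
        data = []
        for _ in range(episodes_per_iteration):
            candidate = rng.choice(snapshots)  # uniform choice is an assumption
            data.append(best_response(candidate, rng))
        # 2. Update the policy by training on the data: one cross-entropy
        #    gradient step toward the empirical distribution of chosen actions.
        probs = softmax(logits)
        for a in range(NUM_ACTIONS):
            target = data.count(a) / len(data)
            logits[a] += lr * (target - probs[a])
        snapshots.append(softmax(logits))
    return softmax(logits), snapshots
```

Calling `train()` returns the final action probabilities and the list of per-iteration policy snapshots; because the random generator is seeded, a run with a fixed `seed` is reproducible. A real system would replace the logits table and payoff lookup with trained neural networks and a multi-step environment.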