SINGLE TRAJECTORY POLICY OPTIMIZATION FOR GENERATIVE MACHINE LEARNING MODELS

Application US20260087409A1 Kind: A1 Mar 26, 2026

Inventors

Bilal Piot, Pierre Richemond, Yunhao Tang, Daniele Calandriello, Zhaohan Guo, Gil Shamir, Tianqi Liu, Rishabh Joshi, Lior Shani, Eugene Tarassov, Remi Munos, Bernardo Avila Pires, Lucas Joseph Spangher, Mohammad Gheshlaghi Azar, Rafael Mitkov Rafailov

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a generative machine learning model to perform a machine learning task. In one aspect, a method comprises, at each of a sequence of training iterations for a target generative machine learning model: obtaining a plurality of training examples that each include an example prompt, an example data item, and a quality score for the example data item; determining likelihoods of the target generative machine learning model generating the example data items for the training examples; determining expected quality scores for the training examples; and training the target generative machine learning model to optimize an objective function that depends on (i) the likelihoods of the target generative machine learning model generating the example data items for the training examples and (ii) a difference between the quality scores and the expected quality scores for the training examples.
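The abstract does not give the form of the objective function, so the following is only an illustrative sketch of one objective with the stated dependencies. It assumes a baseline-style form in which each example's log-likelihood is weighted by the difference between its quality score and its expected quality score; the function name `advantage_weighted_loss`, its parameters, and the specific formula are hypothetical and are not taken from the application.

```python
from typing import Sequence


def advantage_weighted_loss(
    log_likelihoods: Sequence[float],
    quality_scores: Sequence[float],
    expected_scores: Sequence[float],
) -> float:
    """Hypothetical loss over one batch of training examples.

    Assumed form (not stated in the abstract):
        loss = -mean((quality - expected_quality) * log_likelihood)
    Minimizing it raises the likelihood of data items that scored above
    their expected quality and lowers it for items that scored below.
    """
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("at least one training example is required")
    if len(quality_scores) != n or len(expected_scores) != n:
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for log_p, score, expected in zip(
        log_likelihoods, quality_scores, expected_scores
    ):
        # Difference between the observed and the expected quality score.
        difference = score - expected
        total += difference * log_p

    # Negate so that a gradient-descent optimizer can minimize the result.
    return -total / n


if __name__ == "__main__":
    # Two made-up examples: one scored above expectation, one below.
    loss = advantage_weighted_loss(
        log_likelihoods=[-1.0, -2.0],
        quality_scores=[1.0, 0.0],
        expected_scores=[0.5, 0.5],
    )
    print(loss)
```

In an actual training loop the log-likelihoods would come from the target model and the loss would be differentiated with respect to the model parameters; plain floats are used here only to keep the sketch self-contained.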

CPC Classifications

G06N 20/00

Filing Date

2025-05-22

Application No.

19216677
