SINGLE TRAJECTORY POLICY OPTIMIZATION FOR GENERATIVE MACHINE LEARNING MODELS
Inventors
Bilal Piot, Pierre Richemond, Yunhao Tang, Daniele Calandriello, Zhaohan Guo, Gil Shamir, Tianqi Liu, Rishabh Joshi, Lior Shani, Eugene Tarassov, Remi Munos, Bernardo Avila Pires, Lucas Joseph Spangher, Mohammad Gheshlaghi Azar, Rafael Mitkov Rafailov
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a generative machine learning model to perform a machine learning task. In one aspect, a method comprises, at each of a sequence of training iterations for a target generative machine learning model: obtaining a plurality of training examples that each include an example prompt, an example data item, and a quality score for the example data item; determining likelihoods of the target generative machine learning model generating the example data items for the training examples; determining expected quality scores for the training examples; and training the target generative machine learning model to optimize an objective function that depends on (i) the likelihoods of the target generative machine learning model generating the example data items for the training examples and (ii) a difference between the quality scores and the expected quality scores for the training examples.
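The quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the objective function, so the version below assumes one possible form: the mean of each example's log-likelihood weighted by the difference between its quality score and its expected quality score. The names TrainingExample and surrogate_objective are hypothetical, and the log-likelihoods and expected quality scores are taken as inputs because the abstract does not say how they are computed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrainingExample:
    """One training example as described in the abstract (hypothetical type)."""
    prompt: str
    data_item: str
    quality_score: float


def surrogate_objective(
    examples: Sequence[TrainingExample],
    log_likelihoods: Sequence[float],
    expected_scores: Sequence[float],
) -> float:
    """Assumed objective: mean of (quality - expected quality) * log-likelihood.

    log_likelihoods[i] is the log-likelihood of the target model generating
    examples[i].data_item given examples[i].prompt. This weighting is an
    illustrative assumption, not the form claimed in the application.
    """
    if not examples:
        raise ValueError("at least one training example is required")
    if not (len(examples) == len(log_likelihoods) == len(expected_scores)):
        raise ValueError("inputs must have the same length")

    total = 0.0
    for example, log_p, expected in zip(examples, log_likelihoods, expected_scores):
        # Difference between the observed and the expected quality score.
        score_gap = example.quality_score - expected
        total += score_gap * log_p
    return total / len(examples)


batch = [
    TrainingExample("prompt A", "response A", quality_score=1.0),
    TrainingExample("prompt B", "response B", quality_score=0.0),
]
objective_value = surrogate_objective(
    batch, log_likelihoods=[-1.0, -2.0], expected_scores=[0.5, 0.5]
)
```

Under this assumed form, maximizing the value raises the likelihood of data items that score above their expected quality and lowers the likelihood of those that score below it. Each example contributes on its own, so no paired comparison between data items is needed.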
CPC Classifications
Filing Date
2025-05-22
Application No.
19216677