ALIGNMENT OF NEURAL NETWORKS USING ARCHITECTURAL MODIFICATIONS AND TRAINING EXAMPLES
Inventors
Xiangyu Qi, Xiao Ma, Ahmad Beirami
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for aligning the output of a pre-trained generative neural network. In one aspect, the pre-trained generative neural network is adapted by introducing one or more filter layers. Each filter layer processes a filter layer input, comprising an output from a stack of the pre-trained neural network layers, in accordance with trainable parameters of the filter layer to generate a filter layer output. A next neural network layer after the stack of pre-trained neural network layers is configured to process at least the filter layer output. The trainable parameters of the filter layer(s) are adjusted using a training objective that increases the likelihood of the adapted neural network generating aligned responses to a plurality of training requests, while the pre-trained parameters of the pre-trained neural network layers are kept fixed.
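The arrangement described in the abstract can be illustrated with a minimal sketch in plain Python. Everything specific here is an assumption made for illustration and is not taken from the application: the toy sizes `D` and `V`, the random stand-ins `W_STACK` and `W_NEXT` for pre-trained weights, the residual form of the filter layer (`h + A h`, with `filter_A` initialized to zero), and the use of a negative log-likelihood loss over a single "aligned" target token per request.

```python
import math
import random

random.seed(0)
D, V = 4, 3  # hypothetical hidden width and toy vocabulary size


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Stand-ins for pre-trained parameters; the training loop never writes to them.
W_STACK = rand_matrix(D, D)  # the stack of pre-trained layers (one layer here)
W_NEXT = rand_matrix(V, D)   # the next layer after the stack

# Filter layer parameters: the only trainable values in this sketch.
# Zero initialization makes the filter layer an identity map at the start.
filter_A = [[0.0] * D for _ in range(D)]


def forward(x):
    h = [math.tanh(z) for z in matvec(W_STACK, x)]           # stack output
    f = [hi + ai for hi, ai in zip(h, matvec(filter_A, h))]  # filter layer output
    logits = matvec(W_NEXT, f)                               # next layer consumes f
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return h, [e / total for e in exps]


def mean_nll(data):
    # Average negative log-likelihood of the aligned target token.
    return sum(-math.log(forward(x)[1][y]) for x, y in data) / len(data)


def train(data, steps=300, lr=0.05):
    for _ in range(steps):
        for x, y in data:
            h, probs = forward(x)
            # Gradient of the loss with respect to the logits: softmax minus one-hot.
            d_logits = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]
            # Backpropagate through the frozen next layer to the filter output.
            d_f = [sum(W_NEXT[k][i] * d_logits[k] for k in range(V)) for i in range(D)]
            # Update only the filter layer: dL/dA[i][j] = d_f[i] * h[j].
            for i in range(D):
                for j in range(D):
                    filter_A[i][j] -= lr * d_f[i] * h[j]


# Toy "training requests": (input vector, index of the aligned target token).
requests = [([1.0, 0.0, 0.5, -0.5], 0), ([0.0, 1.0, -0.5, 0.5], 2)]
frozen_before = ([row[:] for row in W_STACK], [row[:] for row in W_NEXT])
loss_before = mean_nll(requests)
train(requests)
loss_after = mean_nll(requests)
```

In this sketch the gradient step touches only `filter_A`, so the stand-in pre-trained weights are identical before and after training, while the likelihood of the target tokens increases. A practical system would presumably use an automatic differentiation framework and sequence-level targets rather than hand-written gradients over single tokens.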
CPC Classifications
Filing Date
2025-10-01
Application No.
19346875