Training Transformers Using Sliceout
Assignee
Cohere Inc.
Inventors
Aidan GOMEZ, Seoyeon YOO
Abstract
A system for training a neural network using dropout with slicing operations preserves the regularization effects of dropout while speeding up computation and reducing the memory requirements of training. Instead of randomly dropping individual weights connected to neurons in a neural network, the system slices contiguous memory segments of weight matrices. For transformer models, the approach first receives input data consisting of a sequence of elements. Input embedding vectors with positional encoding are generated from the input data. The transformer model is then trained by passing the input embedding vectors through its neural network layers. In the linear layers, some weight matrices are sliced (e.g., masked) so that a contiguous section of each sliced matrix is kept and used for training, while the rest of the matrix is never accessed.
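The slicing idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the PyTorch-style (d_out, d_in) weight layout, the inverted-dropout-style rescaling, and the function name `sliceout_linear` are all assumptions made for this sketch.

```python
import numpy as np

def sliceout_linear(x, W, keep, rng):
    """Linear layer with Sliceout-style slicing (illustrative sketch).

    Instead of masking randomly scattered weights as in standard dropout,
    keep one contiguous block of `keep` rows of W; the rest of the matrix
    is never read, so the matmul is smaller and cheaper.
    W has shape (d_out, d_in), PyTorch-style (an assumption of this sketch).
    """
    d_out = W.shape[0]
    start = int(rng.integers(0, d_out - keep + 1))  # random slice offset
    W_slice = W[start:start + keep]  # contiguous rows in row-major storage
    # Rescale so expected activation magnitudes stay comparable, mirroring
    # inverted dropout (an assumption of this sketch, not from the patent).
    return (x @ W_slice.T) * (d_out / keep), start

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))    # batch of 2, input dim 8
W = rng.standard_normal((16, 8))   # output dim 16, input dim 8
out, start = sliceout_linear(x, W, keep=4, rng=rng)
print(out.shape)  # only the kept slice is computed
```

Because the kept block is contiguous, the sliced matmul operates on a plain memory view rather than a gather over scattered weights, which is the source of the claimed speed and memory savings.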
CPC Classifications
Filing Date
2025-12-08
Application No.
19412214