Layered gradient accumulation and modular pipeline parallelism for improved training of machine learning models
Assignee
ServiceNow, Inc.
Inventors
Joel Lamy-Poirier
Abstract
A method is provided including: (i) assigning sequentially-ordered layers of a machine learning model to a plurality of compute nodes, each of the layers being assigned to exactly one of the nodes; (ii) dividing training data into micro-batches; (iii) forward-propagating the micro-batches through the model, each node operating in parallel to generate respective activation states for the micro-batches with their assigned layers, and with the activation states being communicated between the nodes according to the layers' sequential ordering; and (iv) backward-propagating the micro-batches through the model, each node operating in parallel to generate respective error states for the micro-batches with their assigned layers, with the error states being communicated between the nodes according to the layers' reverse sequential ordering, wherein each of the nodes completes the backward-propagation of all micro-batches through a given layer prior to performing backward-propagation through any layer that precedes the given layer in the sequential ordering.
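The schedule described in steps (i)–(iv) can be illustrated with a hypothetical single-process sketch (all names and the toy linear-layer model are illustrative assumptions, not the patented implementation): layers are partitioned across "nodes", every micro-batch is forward-propagated in layer order, and the backward pass accumulates each layer's gradient over all micro-batches before any preceding layer is visited — the layered gradient accumulation of the claim.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain of linear layers y = x @ W.T, partitioned over two "nodes".
# (Illustrative stand-in for the model; the patent does not specify layer types.)
layers = [rng.standard_normal((4, 4)) * 0.1 for _ in range(4)]
node_of_layer = [0, 0, 1, 1]              # (i) each layer assigned to exactly one node

data = rng.standard_normal((8, 4))
micro_batches = np.split(data, 4)         # (ii) divide training data into micro-batches

# (iii) forward propagation: activations flow between nodes in layer order;
# acts[m][l] holds the input to layer l for micro-batch m.
acts = [[mb] for mb in micro_batches]
for l, W in enumerate(layers):
    for m in range(len(micro_batches)):
        acts[m].append(acts[m][l] @ W.T)

# (iv) backward propagation with layered gradient accumulation: a layer's
# gradient is fully accumulated over ALL micro-batches before the loop
# advances to any layer that precedes it in the sequential ordering.
errors = [a[-1].copy() for a in acts]     # stand-in for dL/d(output) per micro-batch
grads = [np.zeros_like(W) for W in layers]
for l in reversed(range(len(layers))):
    for m in range(len(micro_batches)):
        grads[l] += errors[m].T @ acts[m][l]  # accumulate across micro-batches
        errors[m] = errors[m] @ layers[l]     # propagate error state upstream
```

In a real multi-node deployment the two loops over micro-batches would run in parallel on separate devices, with activation and error states communicated between neighboring nodes; the single-process ordering above only demonstrates the claimed per-layer completion constraint.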
CPC Classifications
Filing Date
2022-02-09
Application No.
17/668,200
Claims
19