RESILIENT OPTIMIZER STATES FOR FULLY SHARDED DATA PARALLEL
Inventors
Lianjie Cao, Saeed Rashidi, Garrett Goon, Paolo Faraboschi, Puneet Sharma
Abstract
Systems and methods are provided for failure resiliency in distributed training of machine learning (ML) models. Examples include a plurality of compute nodes, each storing a respective optimizer shard of a plurality of optimizer shards, and a first compute node storing a first optimizer shard of optimizer states. The first compute node can also store optimizer shard portions. Each optimizer shard portion can be received from a respective compute node of the plurality of compute nodes and can be a replica of a portion of the respective optimizer shard stored at that compute node. Responsive to a failure of a compute node of the plurality of compute nodes, the first compute node can update the first optimizer shard with the optimizer shard portion corresponding to the failed compute node, and the ML model can be trained based on the updated first optimizer shard.
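The replication and recovery flow described in the abstract can be illustrated with a minimal pure-Python sketch. It is a toy simulation under stated assumptions, not the claimed implementation: the names `Node`, `split`, `replicate`, and `recover` are hypothetical, optimizer states are modeled as a dictionary from parameter index to a single float, each node is assumed to send one distinct portion of its shard to every peer, and replicas are taken once rather than refreshed after each optimizer step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node holding its own shard plus peer replicas."""
    rank: int
    shard: dict                                    # param index -> optimizer state
    replicas: dict = field(default_factory=dict)   # source rank -> shard portion


def split(shard, parts):
    """Split a shard into `parts` disjoint portions (round-robin over sorted keys)."""
    keys = sorted(shard)
    return [{k: shard[k] for k in keys[i::parts]} for i in range(parts)]


def replicate(nodes):
    """Assumed scheme: each node sends one distinct portion of its shard to each peer."""
    for src in nodes:
        peers = [n for n in nodes if n.rank != src.rank]
        for peer, portion in zip(peers, split(src.shard, len(peers))):
            peer.replicas[src.rank] = portion


def recover(nodes, failed_rank):
    """Each survivor folds the failed node's replicated portion into its own shard."""
    survivors = [n for n in nodes if n.rank != failed_rank]
    for node in survivors:
        node.shard.update(node.replicas.pop(failed_rank, {}))
    return survivors


# Four nodes, three optimizer-state entries per shard (illustrative values only).
nodes = [Node(r, {i: 0.5 * i for i in range(3 * r, 3 * r + 3)}) for r in range(4)]
replicate(nodes)
survivors = recover(nodes, failed_rank=2)

# After recovery, the surviving shards together cover every optimizer state.
merged = {}
for node in survivors:
    merged.update(node.shard)
print(sorted(merged) == list(range(12)))
```

In this sketch each survivor absorbs only the portion it already holds locally, so no state has to be fetched from the failed node. Training could then continue on the enlarged shards, assuming the replicas were current at the time of failure.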
CPC Classifications
Filing Date
2025-12-11
Application No.
19416964