RESILIENT OPTIMIZER STATES FOR FULLY SHARDED DATA PARALLEL

Application US20260099411A1 Kind: A1 Apr 09, 2026

Inventors

Lianjie Cao, Saeed Rashidi, Garrett Goon, Paolo Faraboschi, Puneet Sharma

Abstract

Systems and methods are provided for failure resiliency in distributed training of machine learning (ML) models. Examples include a plurality of compute nodes storing optimizer shards of a plurality of optimizer shards and a first compute node storing a first optimizer shard of optimizer states. The first compute node can store optimizer shard portions, each of which can be received from a respective compute node of the plurality of compute nodes and can be a replica of a portion of a respective optimizer shard of the plurality of optimizer shards, stored at the respective compute node. Responsive to a failure of a compute node of the plurality of compute nodes, the first compute node can update the first optimizer shard with an optimizer shard portion corresponding to the failed compute node and the ML model can be trained based on the updated first optimizer shard.
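The replication and recovery flow described in the abstract can be illustrated with a minimal single-process sketch. It is not the claimed implementation. The names `Node`, `split`, `replicate`, and `recover` are hypothetical, optimizer states are modeled as plain lists of floats, and the choice to send each peer a distinct contiguous portion of a shard is an assumption made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node holding one optimizer shard plus peer replicas."""
    rank: int
    shard: list[float]  # this node's own optimizer shard
    # peer rank -> replica of a portion of that peer's shard
    replicas: dict[int, list[float]] = field(default_factory=dict)


def split(shard: list[float], parts: int) -> list[list[float]]:
    """Split a shard into `parts` contiguous portions whose sizes differ by at most 1."""
    base, extra = divmod(len(shard), parts)
    portions, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        portions.append(shard[start:end])
        start = end
    return portions


def replicate(nodes: list[Node]) -> None:
    """Each node sends one distinct portion of its shard to every other node (assumed layout)."""
    for src in nodes:
        peers = [n for n in nodes if n.rank != src.rank]
        for peer, portion in zip(peers, split(src.shard, len(peers))):
            peer.replicas[src.rank] = list(portion)


def recover(nodes: list[Node], failed_rank: int) -> list[Node]:
    """On failure, each survivor extends its own shard with the replica portion
    it holds for the failed node, so the survivors jointly cover all states."""
    survivors = [n for n in nodes if n.rank != failed_rank]
    for node in survivors:
        node.shard = node.shard + node.replicas.pop(failed_rank, [])
    return survivors


# Four nodes, six optimizer-state values each (values 0..23 stand in for real states).
nodes = [Node(rank=r, shard=[float(v) for v in range(6 * r, 6 * r + 6)]) for r in range(4)]
replicate(nodes)
survivors = recover(nodes, failed_rank=2)
recovered_state = sorted(v for n in survivors for v in n.shard)
```

In this sketch the three survivors each absorb two of the failed node's six values, so `recovered_state` again contains all 24 values. A real system would also need to refresh replicas as the optimizer states change and to re-replicate after recovery; those steps are outside what the abstract describes and are omitted here.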

CPC Classifications

G06F 11/2028; G06N 3/098

Filing Date

2025-12-11

Application No.

19416964
