Resilient Optimizer States for Fully Sharded Data Parallel Distributed ML Training
Summary
The USPTO published patent application US20260099411A1, which covers systems and methods for failure resiliency in distributed machine learning model training. Under the described approach, compute nodes store replicated portions of one another's optimizer shards so that, after a node failure, the surviving nodes can reconstruct the lost optimizer state from those replicas. The application names five inventors and was filed on December 11, 2025.
What changed
The application, published on April 9, 2026, discloses systems for maintaining resilient optimizer states in fully sharded data parallel distributed ML training environments. Each compute node holds its own optimizer shard plus replicated portions of the shards held by other nodes. When a node fails, a surviving node updates its own shard with the replicated portion it holds for the failed node, and training proceeds on the updated shard. The apparent aim is to let training continue without the delay of a full checkpoint restore.
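The replicate-and-recover idea can be illustrated with a small simulation. This is only a sketch based on the abstract, not the applicants' implementation: the names `Node`, `split_even`, `replicate`, and `recover` are hypothetical, the optimizer state is modeled as a flat list of numbers, and "updating" a shard is assumed to mean appending the failed node's replicated portion to the survivor's own shard.

```python
from dataclasses import dataclass, field


def split_even(values, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


@dataclass
class Node:
    rank: int
    shard: list  # this node's own optimizer shard
    # peer rank -> replica of one portion of that peer's shard
    replicas: dict = field(default_factory=dict)


def replicate(nodes):
    """Each node sends a distinct portion of its shard to every other node."""
    for src in nodes:
        peers = [n for n in nodes if n.rank != src.rank]
        for peer, portion in zip(peers, split_even(src.shard, len(peers))):
            peer.replicas[src.rank] = list(portion)


def recover(nodes, failed_rank):
    """Survivors fold the failed node's replicated portions into their own shards."""
    survivors = [n for n in nodes if n.rank != failed_rank]
    for node in survivors:
        # Assumption: "update" means appending the held portion to the local shard.
        node.shard = node.shard + node.replicas.pop(failed_rank, [])
    # Shards changed, so refresh the replicas among the remaining nodes.
    replicate(survivors)
    return survivors


# Hypothetical setup: 12 optimizer-state entries sharded across 4 nodes.
state = list(range(12))
nodes = [Node(rank=r, shard=chunk) for r, chunk in enumerate(split_even(state, 4))]
replicate(nodes)

# Node 2 fails; the three survivors together still hold the full state.
survivors = recover(nodes, failed_rank=2)
recovered = sorted(v for n in survivors for v in n.shard)
print(recovered == state)
```

In this sketch each node keeps only a fraction of every peer's shard, so the replication overhead per node is roughly one extra shard rather than a full copy of the optimizer state. A real system would also need to redistribute parameters and gradients and to handle replica staleness between optimizer steps, which the abstract does not detail.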
Technology companies developing distributed training infrastructure, AI research organizations, and cloud service providers offering ML compute resources should monitor this filing for competitive intelligence. The patent's claims covering shard replication and dynamic state reconstruction in distributed training could affect how companies design fault-tolerant ML training pipelines. If granted, the patent may influence approaches to optimizer state management in large-scale model training systems.
What to do next
- Monitor for patent grant and claims examination outcomes
- Review for freedom-to-operate implications if developing similar distributed ML training systems
Archived snapshot
Apr 9, 2026: GovPing captured this document from the original source. If the source has since changed or been removed, this is the text as it existed at that time.
RESILIENT OPTIMIZER STATES FOR FULLY SHARDED DATA PARALLEL
Application US20260099411A1 Kind: A1 Apr 09, 2026
Inventors
Lianjie Cao, Saeed Rashidi, Garrett Goon, Paolo Faraboschi, Puneet Sharma
Abstract
Systems and methods are provided for failure resiliency in distributed training of machine learning (ML) models. Examples include a plurality of compute nodes storing optimizer shards of a plurality of optimizer shards and a first compute node storing a first optimizer shard of optimizer states. The first compute node can store optimizer shard portions, each of which can be received from a respective compute node of the plurality of compute nodes and can be a replica of a portion of a respective optimizer shard of the plurality of optimizer shards, stored at the respective compute node. Responsive to a failure of a compute node of the plurality of compute nodes, the first compute node can update the first optimizer shard with an optimizer shard portion corresponding to the failed compute node and the ML model can be trained based on the updated first optimizer shard.
CPC Classifications
G06F 11/2028 G06N 3/098
Filing Date
2025-12-11
Application No.
19416964
About this page
Source document text, dates, docket IDs, and authority are extracted directly from USPTO.
The plain-English summary, classification, and "what to do next" steps are AI-generated from the original text. Cite the source document, not the AI analysis.