Huawei Fault Processing Method for Training Systems
Summary
The European Patent Office has published patent application EP4711932A1 by Huawei Technologies Co., Ltd. The application describes a fault processing method for training systems in which a host-side chip and multiple device-side chips jointly execute a training task. The method aims to shorten the time needed to recover a training task after a fault by reusing the result already computed on the host side.
What changed
EP4711932A1 is a published patent application from Huawei Technologies Co., Ltd. for a fault processing method in training systems. The method targets systems in which a first chip on the host side and a plurality of second chips on the device side collaboratively execute a training task. When a fault occurs, the host-side chip saves a fault file before the device-side chips that are still in a normal state stop executing their subtasks. It then synchronizes that file to a rescheduled device-side chip, which continues the interrupted subtask. Because the host-side execution result can be reused during recovery, the application claims shorter recovery time and higher recovery efficiency for the training task.
As this is a patent application, it does not impose direct compliance obligations on regulated entities. However, it represents a technological development in the field of AI training systems. Companies involved in developing or utilizing such systems, particularly those with distributed computing architectures, may find the described fault tolerance mechanisms relevant for their internal R&D or operational strategies. No immediate actions or deadlines are associated with this publication.
Source document (simplified)
FAULT PROCESSING METHOD AND RELATED DEVICE
Publication: EP4711932A1 | Kind: A1 | Mar 18, 2026
Applicants
Huawei Technologies Co., Ltd.
Inventors
HAO, Ripei; CAI, Zhifang; SONG, Xin
Abstract
This application provides a fault processing method, applied to a training system. The training system includes a first chip on a host side and a plurality of second chips on a device side, and the first chip and the plurality of second chips are configured to collaboratively execute a training task. The training task includes a first subtask and a plurality of second subtasks, and execution of the second subtask depends on an execution result of the first subtask. The method includes the following: the first chip executes the first subtask. When a fault occurs, the first chip saves a fault file before a second chip in a normal state on the device side stops executing the second subtask. The first chip synchronizes the fault file to a rescheduled chip on the device side, so that the rescheduled chip continues to execute the second subtask. In the method, the first chip on the host side may not stop executing the first subtask. In this way, the execution result of the first subtask on the host side can be reused to recover the training task, to shorten recovery time of the training task, and improve recovery efficiency of the training task.
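The recovery flow in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the applicant's implementation: the class names `HostChip` and `DeviceChip`, the `run_step` helper, the arithmetic stand-ins for the subtasks, and the contents of the fault file are all hypothetical, since the abstract does not specify them. The sketch also runs sequentially, so it does not model the timing condition that the fault file is saved before the healthy device-side chips stop.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceChip:
    """Hypothetical device-side ("second") chip that runs one second subtask."""
    chip_id: int
    healthy: bool = True
    state: dict = field(default_factory=dict)

    def run_second_subtask(self, host_result: int) -> int:
        if not self.healthy:
            raise RuntimeError(f"chip {self.chip_id} is faulty")
        # Stand-in for real work that depends on the first subtask's result.
        self.state["output"] = host_result * 2 + self.chip_id
        return self.state["output"]


@dataclass
class HostChip:
    """Hypothetical host-side ("first") chip that runs the first subtask."""
    first_subtask_runs: int = 0

    def run_first_subtask(self, step: int) -> int:
        self.first_subtask_runs += 1
        return step + 100  # stand-in for the real host-side computation

    def save_fault_file(self, step: int, host_result: int, failed_chip: int) -> dict:
        # Assumed content: the abstract does not say what the fault file holds.
        return {"step": step, "host_result": host_result, "failed_chip": failed_chip}

    def sync_fault_file(self, fault_file: dict, target: DeviceChip) -> None:
        # Copy the saved data to the rescheduled chip.
        target.state["restored_from"] = dict(fault_file)


def run_step(host: HostChip, devices: list, spare: DeviceChip, step: int) -> dict:
    """Run one training step; on a device fault, recover onto the spare chip."""
    host_result = host.run_first_subtask(step)
    outputs = {}
    for slot, chip in enumerate(devices):
        try:
            outputs[slot] = chip.run_second_subtask(host_result)
        except RuntimeError:
            # The host saves the fault file and syncs it to the rescheduled chip.
            fault_file = host.save_fault_file(step, host_result, chip.chip_id)
            host.sync_fault_file(fault_file, spare)
            devices[slot] = spare
            # The first subtask is not rerun; its saved result is reused.
            outputs[slot] = spare.run_second_subtask(fault_file["host_result"])
    return outputs


host = HostChip()
devices = [DeviceChip(0), DeviceChip(1, healthy=False), DeviceChip(2)]
spare = DeviceChip(9)
print(run_step(host, devices, spare, step=0))
print("first subtask executions:", host.first_subtask_runs)
```

In this toy run the host-side subtask executes once even though one device-side chip fails, which is the reuse property the abstract credits for the shorter recovery time.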
IPC Classifications
G06F 11/20 (2006.01); G06F 11/14 (2026.01); G06F 11/07 (2006.01); G06F 11/30 (2006.01); G06N 3/098 (2023.01)
Designated States
AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LI, LT, LU, LV, MC, ME, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR