Huawei Fault Processing Method for Training Systems

ChangeBridge: EPO Bulletin - AI & Computing (G06N)

Published March 18th, 2026

Detected March 24th, 2026

Summary

The European Patent Office has published patent application EP4711932A1 by Huawei Technologies Co., Ltd. The application describes a fault processing method for AI training systems in which a host-side chip and multiple device-side chips jointly execute a training task. The method aims to shorten the time needed to recover a training task after a fault occurs.


What changed

EP4711932A1 covers training systems in which a host-side chip and multiple device-side chips collaboratively execute a training task. The task is split into a first subtask, run on the host-side chip, and second subtasks, run on the device-side chips, whose execution depends on the result of the first subtask. When a fault occurs, the host-side chip saves a fault file before the healthy device-side chips stop executing their second subtasks, then synchronizes that file to a rescheduled device-side chip so that chip can continue the second subtask. Because the host-side result is reused during recovery, the application states that recovery time is shortened and recovery efficiency is improved.
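The ordering described above can be sketched in a few lines of Python. This is an illustrative model only, not the implementation claimed in the application: the class names (`HostChip`, `DeviceChip`), the `handle_fault` function, the event labels, and the contents of the fault file are all assumptions made for this example, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceChip:
    """Hypothetical stand-in for one device-side ("second") chip."""
    chip_id: str
    healthy: bool = True
    running: bool = True
    restored_from: Optional[dict] = None  # fault file received from the host


@dataclass
class HostChip:
    """Hypothetical stand-in for the host-side ("first") chip."""
    first_subtask_runs: int = 0
    first_subtask_result: Optional[str] = None
    fault_file: Optional[dict] = None

    def run_first_subtask(self) -> str:
        # Placeholder for the host-side work the second subtasks depend on.
        self.first_subtask_runs += 1
        self.first_subtask_result = f"host-result-{self.first_subtask_runs}"
        return self.first_subtask_result

    def save_fault_file(self, step: int) -> dict:
        # Assumed contents: the step reached plus the host-side result.
        self.fault_file = {"step": step, "host_result": self.first_subtask_result}
        return self.fault_file


def handle_fault(host: HostChip, devices: List[DeviceChip],
                 spare: DeviceChip, step: int) -> List[str]:
    """Return the ordered list of recovery events for one fault."""
    events: List[str] = []

    # 1. The host saves the fault file while healthy chips are still running.
    host.save_fault_file(step)
    events.append("host_saved_fault_file")

    # 2. Healthy device-side chips stop only after the file is saved.
    for chip in devices:
        if chip.healthy and chip.running:
            chip.running = False
            events.append(f"{chip.chip_id}_stopped")

    # 3. The host synchronizes the fault file to the rescheduled chip,
    #    which continues the second subtask from the saved state.
    spare.restored_from = dict(host.fault_file)
    spare.running = True
    events.append(f"{spare.chip_id}_resumed")
    return events


host = HostChip()
host.run_first_subtask()
devices = [DeviceChip("dev0"), DeviceChip("dev1", healthy=False, running=False)]
spare = DeviceChip("spare0", running=False)
events = handle_fault(host, devices, spare, step=7)
print(events)
print(host.first_subtask_runs)
```

In this sketch the host-side result is computed once and carried in the fault file, so the rescheduled chip picks it up without the first subtask being rerun. What happens to the healthy chips after they stop is not modeled, because the abstract does not describe it.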

As this is a patent application, it does not impose direct compliance obligations on regulated entities. However, it represents a technological development in the field of AI training systems. Companies involved in developing or utilizing such systems, particularly those with distributed computing architectures, may find the described fault tolerance mechanisms relevant for their internal R&D or operational strategies. No immediate actions or deadlines are associated with this publication.

Source document (simplified)


FAULT PROCESSING METHOD AND RELATED DEVICE

Publication: EP4711932A1; Kind: A1; Published: Mar 18, 2026

Applicants

Huawei Technologies Co., Ltd.

Inventors

HAO, Ripei; CAI, Zhifang; SONG, Xin

Abstract

This application provides a fault processing method, applied to a training system. The training system includes a first chip on a host side and a plurality of second chips on a device side, and the first chip and the plurality of second chips are configured to collaboratively execute a training task. The training task includes a first subtask and a plurality of second subtasks, and execution of the second subtask depends on an execution result of the first subtask. The method includes the following steps. The first chip executes the first subtask. When a fault occurs, the first chip saves a fault file before a second chip in a normal state on the device side stops executing the second subtask. The first chip synchronizes the fault file to a rescheduled chip on the device side, so that the rescheduled chip continues to execute the second subtask. In the method, the first chip on the host side need not stop executing the first subtask. In this way, the execution result of the first subtask on the host side can be reused to recover the training task, which shortens the recovery time of the training task and improves its recovery efficiency.

IPC Classifications

G06F 11/20 20060101AFI20250110BHEP
G06F 11/14 20260101ALI20250110BHEP
G06F 11/07 20060101ALI20250110BHEP
G06F 11/30 20060101ALI20250110BHEP
G06N 3/098 20230101ALI20250110BHEP

Designated States

AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LI, LT, LU, LV, MC, ME, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR
