EPO Patent Bulletin

FAULT PROCESSING METHOD AND RELATED DEVICE

Publication: EP4711932A1, Kind: A1, Mar 18, 2026

Applicants

Huawei Technologies Co., Ltd.

Inventors

HAO, Ripei; CAI, Zhifang; SONG, Xin

Abstract

This application provides a fault processing method applied to a training system. The training system includes a first chip on a host side and a plurality of second chips on a device side, which are configured to collaboratively execute a training task. The training task includes a first subtask and a plurality of second subtasks, and execution of each second subtask depends on an execution result of the first subtask. In the method, the first chip executes the first subtask. When a fault occurs, the first chip saves a fault file before a second chip in a normal state on the device side stops executing its second subtask. The first chip then synchronizes the fault file to a rescheduled chip on the device side, so that the rescheduled chip continues to execute the second subtask. Because the first chip on the host side need not stop its own execution, the execution result of the first subtask on the host side can be reused to recover the training task, which shortens the recovery time and improves the recovery efficiency of the training task.
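The ordering described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the claimed implementation: the class names (`HostChip`, `DeviceChip`, `FaultFile`), the function `handle_fault`, the contents of the fault file, and the idea of swapping in a single spare chip are all hypothetical and do not come from the publication.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FaultFile:
    """Hypothetical snapshot of host-side state saved when a fault occurs."""
    step: int
    first_subtask_result: List[float]


@dataclass
class DeviceChip:
    """A 'second chip' on the device side (illustrative model only)."""
    chip_id: str
    healthy: bool = True
    running: bool = True
    restored_from: Optional[FaultFile] = None


@dataclass
class HostChip:
    """The 'first chip' on the host side; it executes the first subtask."""
    step: int = 0
    result: List[float] = field(default_factory=list)
    events: List[str] = field(default_factory=list)

    def run_first_subtask(self, batch: List[float]) -> None:
        # Stand-in for whatever the first subtask computes (assumption).
        self.step += 1
        self.result = [2.0 * x for x in batch]

    def save_fault_file(self) -> FaultFile:
        self.events.append("save_fault_file")
        return FaultFile(step=self.step, first_subtask_result=list(self.result))


def handle_fault(host: HostChip, chips: List[DeviceChip],
                 faulty_id: str, spare: DeviceChip) -> List[DeviceChip]:
    # 1. The host saves the fault file BEFORE healthy device chips stop.
    fault_file = host.save_fault_file()

    # 2. Healthy chips stop their second subtask; the faulty chip is replaced.
    rescheduled: List[DeviceChip] = []
    for chip in chips:
        chip.running = False
        if chip.chip_id == faulty_id:
            chip.healthy = False
            rescheduled.append(spare)
        else:
            host.events.append(f"stop:{chip.chip_id}")
            rescheduled.append(chip)

    # 3. The host synchronizes the fault file to the rescheduled chips, which
    #    resume the second subtask. The host itself is never reset, so its
    #    first-subtask result is reused rather than recomputed.
    for chip in rescheduled:
        chip.restored_from = fault_file
        chip.running = True
    return rescheduled
```

In this sketch, calling `handle_fault(host, chips, "d1", spare)` leaves `host.step` and `host.result` untouched, which is the property the abstract credits for the shorter recovery time.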

IPC Classifications

G06F 11/20 20060101AFI20250110BHEP
G06F 11/14 20260101ALI20250110BHEP
G06F 11/07 20060101ALI20250110BHEP
G06F 11/30 20060101ALI20250110BHEP
G06N 3/098 20230101ALI20250110BHEP

Designated States

AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LI, LT, LU, LV, MC, ME, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR