FAULT PROCESSING METHOD AND RELATED DEVICE
Applicants
Huawei Technologies Co., Ltd.
Inventors
HAO, Ripei, CAI, Zhifang, SONG, Xin
Abstract
This application provides a fault processing method, applied to a training system. The training system includes a first chip on a host side and a plurality of second chips on a device side, and the first chip and the plurality of second chips are configured to collaboratively execute a training task. The training task includes a first subtask and a plurality of second subtasks, and execution of the second subtask depends on an execution result of the first subtask. The method includes: The first chip executes the first subtask. When a fault occurs, the first chip saves a fault file before a second chip in a normal state on the device side stops executing the second subtask. The first chip synchronizes the fault file to a rescheduled chip on the device side, so that the rescheduled chip continues to execute the second subtask. In the method, the first chip on the host side may not stop executing the second subtask. In this way, the execution result of the first subtask on the host side can be reused to recover the training task, to shorten recovery time of the training task, and improve recovery efficiency of the training task.
IPC Classifications
Designated States
AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LI, LT, LU, LV, MC, ME, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR