SYSTEMS FOR TRAINING ARTIFICIAL INTELLIGENCE MODEL AND CHECKPOINT FILE STORAGE METHODS
Applicants
Alipay (Hangzhou) Information Technology Co., Ltd.
Inventors
LIU, Jian; GU, Shuwei; ZHAN, Xiaojun; RUAN, Ruoyi
Abstract
One or more embodiments of this specification provide a system for training an artificial intelligence model and a checkpoint file storage method. The system includes a model training module and a first cache module. The model training module is configured to: read a dataset needed for training from the first cache module, to execute a training task of an artificial intelligence model, where computation of the training task is performed by a GPU chip, and in a process of executing the training task, generate a checkpoint file and send the checkpoint file to the first cache module. The first cache module is configured to: identify a type of obtained to-be-stored data; and if the type of the to-be-stored data is a dataset, first write the to-be-stored data into a local buffer, and then store the to-be-stored data in a local hard disk from the local buffer; or if the type of the to-be-stored data is a checkpoint file, directly store the to-be-stored data in the local hard disk.
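The type-dependent storage path described in the abstract can be illustrated with a minimal sketch. It is not the applicants' implementation. The names `FirstCacheModule`, `DataType`, `store`, and `buffer` are hypothetical, the local buffer is modeled as an in-memory dictionary, and the local hard disk is modeled as a directory. The abstract states that the cache module identifies the data type itself; here the caller passes the type explicitly for simplicity. Whether buffered data is kept after it has been written to disk is not stated in the abstract, so keeping it is an assumption.

```python
import tempfile
from enum import Enum
from pathlib import Path


class DataType(Enum):
    """Hypothetical labels for the two kinds of to-be-stored data."""
    DATASET = "dataset"
    CHECKPOINT = "checkpoint"


class FirstCacheModule:
    """Illustrative stand-in for the first cache module (all names assumed)."""

    def __init__(self, disk_dir):
        self.disk_dir = Path(disk_dir)
        self.disk_dir.mkdir(parents=True, exist_ok=True)
        # In-memory stand-in for the local buffer.
        self.buffer = {}

    def store(self, data_type, name, payload):
        """Route the payload according to its type."""
        if data_type is DataType.DATASET:
            # Dataset: write into the local buffer first, then to disk.
            self.buffer[name] = payload
            self._write_to_disk(name, self.buffer[name])
        elif data_type is DataType.CHECKPOINT:
            # Checkpoint file: bypass the buffer and write directly to disk.
            self._write_to_disk(name, payload)
        else:
            raise ValueError(f"unknown data type: {data_type!r}")

    def _write_to_disk(self, name, payload):
        (self.disk_dir / name).write_bytes(payload)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = FirstCacheModule(tmp)
        cache.store(DataType.DATASET, "shard-0.bin", b"training samples")
        cache.store(DataType.CHECKPOINT, "step-100.ckpt", b"model weights")
        print(sorted(p.name for p in Path(tmp).iterdir()))
        print(sorted(cache.buffer))
```

In this sketch both kinds of data end up on disk, but only the dataset shard passes through the buffer. Skipping the buffer for checkpoint files keeps large checkpoint writes from occupying buffer space used for dataset reads, which is one plausible reading of the design in the abstract.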
IPC Classifications
Designated States
AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LI, LT, LU, LV, MC, ME, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR