WO2025112801A1 - Deep learning model training method and deep learning model training system

Info

Publication number
WO2025112801A1
Authority
WO
WIPO (PCT)
Prior art keywords
distributed
deep learning
model
training
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/118478
Other languages
English (en)
Chinese (zh)
Inventor
林哲宇
赵汉宇
肖文聪
李永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alicloud Apsara Information Technology Co Ltd
Original Assignee
Hangzhou Alicloud Apsara Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Alicloud Apsara Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/098 Distributed learning, e.g. federated learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present disclosure relate to the field of deep learning technology, and in particular to a deep learning model training method and a deep learning model training system.
  • the embodiments of the present disclosure provide a deep learning model training method.
  • One or more embodiments of the present disclosure also relate to another deep learning model training method, a deep learning model training system, a deep learning model training device, another deep learning model training device, a computing device, a computer-readable storage medium, and a computer program to solve the technical defects existing in the prior art.
  • a deep learning model training method comprising:
  • the deep learning model is distributedly trained based on the sample data set, and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • another deep learning model training method is provided, which is applied to a cloud-side device, where the cloud-side device includes a plurality of distributed nodes and a storage medium; the method includes:
  • according to a preset distributed training strategy, multiple distributed nodes are called to perform distributed training on the deep learning model based on a sample data set, and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored in a storage medium according to target storage parameters, wherein the target storage parameters are determined based on model specification information of the deep learning model and the preset distributed training strategy;
  • multiple distributed nodes are triggered to stop distributed training of the deep learning model, and target model parameters currently stored in the storage medium are determined;
  • a deep learning model training system comprising a management and control unit and a plurality of distributed nodes, the plurality of distributed nodes comprising a first distributed node, the first distributed node being any one of the plurality of distributed nodes;
  • the control unit is used to obtain the initial deep learning model and sample data set, build multiple distributed data based on the deep learning model and sample data set according to the preset distributed training strategy, and distribute the multiple distributed data to each distributed node;
  • the first distributed node is used to perform distributed training on the deep learning model based on the sample data set; and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored therein according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • a deep learning model training device comprising:
  • a first acquisition module is configured to acquire an initial deep learning model and a sample data set
  • the first training module is configured to perform distributed training on the deep learning model based on the sample data set according to a preset distributed training strategy, and store the model parameters of the deep learning model according to the target storage parameters during the adjustment parameter calculation process of the distributed training, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • another deep learning model training device is provided, which is applied to a cloud-side device, wherein the cloud-side device includes a plurality of distributed nodes and a storage medium; the device includes:
  • a second acquisition module is configured to acquire an initial deep learning model and a sample data set
  • a second training module is configured to call multiple distributed nodes according to a preset distributed training strategy, perform distributed training on the deep learning model based on the sample data set, and store the model parameters of the deep learning model to a storage medium according to a target storage parameter during the calculation of adjustment parameters of the distributed training, wherein the target storage parameter is determined based on model specification information of the deep learning model and the preset distributed training strategy;
  • a stop module is configured to trigger multiple distributed nodes to stop the distributed training of the deep learning model when an abnormality in the deep learning model training is identified, and determine the target model parameters currently stored in the storage medium;
  • a parameter acquisition module is configured to acquire target model parameters from a storage medium when receiving a request to resume training
  • the recovery module is configured to call multiple distributed nodes to resume distributed training of the deep learning model based on the target model parameters.
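  • The stop, parameter acquisition, and recovery steps above can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: the function `train`, the toy loss (theta - 3)^2, the `storage` dictionary standing in for the storage medium, and the simulated anomaly are hypothetical and are not taken from the disclosure.

```python
def train(theta, start, steps, storage, fail_at=None, lr=0.1):
    """Run iterations [start, steps); store theta before each update."""
    for i in range(start, steps):
        # Store the pre-update parameters together with the iteration index.
        storage["theta"], storage["iteration"] = theta, i
        if i == fail_at:
            raise RuntimeError(f"simulated training anomaly at iteration {i}")
        grad = 2.0 * (theta - 3.0)  # gradient of the toy loss (theta - 3)^2
        theta = theta - lr * grad
    return theta


storage = {}
try:
    train(0.0, 0, 20, storage, fail_at=7)
except RuntimeError:
    # Training stopped: read the stored parameters and resume from them.
    resumed = train(storage["theta"], storage["iteration"], 20, storage)

uninterrupted = train(0.0, 0, 20, {})
print(resumed == uninterrupted)
```

  • Because the stored value is the parameter state from before the interrupted iteration, the resumed run repeats exactly that iteration and ends in the same state as a run that was never interrupted.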
  • a computing device including:
  • the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions;
  • when the computer-executable instructions are executed by the processor, the steps of the above method are implemented.
  • a computer-readable storage medium which stores computer-executable instructions, and the instructions implement the steps of the above method when executed by a processor.
  • a computer program is provided, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above method.
  • an initial deep learning model and a sample data set are obtained; distributed training is performed on the deep learning model based on the sample data set according to a preset distributed training strategy, and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy, and the iterative law of distributed training is fully considered.
  • the model parameters of the deep learning model are stored according to the target storage parameters, and the storage process of the model parameters and the adjustment parameter calculation process in the distributed training are overlapped, which fully saves performance overhead, and completes the real-time storage of the deep learning model parameters in a manner close to zero performance overhead, so that the deep learning model training has high fault tolerance and high efficiency.
  • FIG. 1 is a flow chart of a deep learning model training method provided by an embodiment of the present disclosure
  • FIG. 2 is a system architecture diagram of a deep learning model training method provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a deep learning model training method provided by an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of another deep learning model training method provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of the structure of a deep learning model training system provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of the structure of a deep learning model training device provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of the structure of another deep learning model training device provided by an embodiment of the present disclosure.
  • FIG. 9 is a structural block diagram of a computing device provided by an embodiment of the present disclosure.
  • although the terms first, second, etc. may be used to describe various information in one or more embodiments of the present disclosure, this information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • the first may also be referred to as the second, and similarly, the second may also be referred to as the first.
  • the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • the user information including but not limited to user device information, user personal information, etc.
  • data including but not limited to data used for analysis, stored data, displayed data, etc.
  • the collection, use and processing of data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and provide corresponding operation entrances for users to choose to authorize or refuse.
  • a large model refers to a deep learning model with large-scale model parameters, which usually contains hundreds of millions, tens of billions, hundreds of billions, trillions, or even more than ten trillion model parameters.
  • a large model can also be called a foundation model.
  • the large model is pre-trained with large-scale unlabeled corpus to produce a pre-trained model with more than 100 million parameters.
  • This model can adapt to a wide range of downstream tasks, and the model has good generalization ability, such as a large-scale language model (Large Language Model, LLM for short), a multi-modal pre-training model, etc.
  • the large model can be widely used in natural language processing (NLP), computer vision and other fields. Specifically, it can be applied to computer vision tasks such as visual question answering (VQA), image captioning (IC), and image generation, as well as natural language processing tasks such as text-based sentiment classification, text summary generation, and machine translation.
  • the main application scenarios of the large model include digital assistants, intelligent robots, search, online education, office software, e-commerce, intelligent design, etc.
  • Natural Language Processing is an important direction in the fields of computer science and artificial intelligence. Its purpose is to enable computers to understand and use human language to perform useful tasks. Natural language processing is mainly used in machine translation, speech recognition, text analysis, text question answering and other fields.
  • Hyperparameters are fixed parameters calculated during the model training process. They can be understood as a strategy for updating model parameters and are used to control the update process of model parameters.
  • Gradient weight refers to the relationship between the gradient in the model and the model parameters.
  • the gradient describes the direction and rate of change of the loss function with the model parameters.
  • the weight refers to the degree of influence of the model parameters on the loss function. It determines the speed at which the gradient descent algorithm updates the parameters. A larger gradient weight leads to faster parameter updates.
  • Grid search parameter tuning: a commonly used hyperparameter tuning strategy that finds the target hyperparameter combination by traversing all possible hyperparameter combinations. Specifically, a hyperparameter space is first defined, containing the possible value range of each hyperparameter. The space is then divided into a series of subspaces, each corresponding to a set of hyperparameter combinations. Next, each hyperparameter combination in each subspace is applied to the model and the performance of the model is measured. Finally, the target hyperparameter combination is selected as the basis for updating the model parameters.
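  • The traversal described above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: `grid_search`, `toy_loss`, and the candidate values are hypothetical, and `toy_loss` stands in for training the model and measuring its performance (lower is better).

```python
from itertools import product


def grid_search(space, evaluate):
    """Try every combination in `space` and return the best one.

    `space` maps a hyperparameter name to its candidate values;
    `evaluate` returns a score where lower is better.
    """
    names = list(space)
    best_combo, best_score = None, float("inf")
    for values in product(*(space[name] for name in names)):
        combo = dict(zip(names, values))
        score = evaluate(combo)
        if score < best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score


def toy_loss(h):
    # Hypothetical stand-in for "train the model and measure its performance".
    return (h["learning_rate"] - 0.01) ** 2 + (h["batch_size"] - 32) ** 2 * 1e-6


space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}
best, score = grid_search(space, toy_loss)
print(best, score)
```

  • The cost grows with the product of the candidate counts (here 3 × 3 = 9 evaluations), which is why exhaustive traversal is only practical for small hyperparameter spaces.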
  • Bayesian optimization: a commonly used hyperparameter tuning strategy based on Bayes' theorem.
  • Bayes' theorem is a result of probability theory used to estimate the probability of an event; it can be used to infer the target optimization value of hyperparameters.
  • In Bayesian parameter tuning, a prior distribution representing existing knowledge of the hyperparameters is first defined. The prior distribution is then combined with the observed experimental results to obtain the posterior distribution, from which the target optimization value of the hyperparameter can be estimated.
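  • The prior-to-posterior step can be written out as a short worked equation. The symbols are assumptions introduced here for illustration: h is a candidate hyperparameter value, D is the set of observed experimental results, p(h) is the prior, and p(h | D) is the posterior. Taking the posterior mode as the target value is one common choice, not something the disclosure specifies.

```latex
p(h \mid D) = \frac{p(D \mid h)\, p(h)}{\sum_{h'} p(D \mid h')\, p(h')},
\qquad
h^{*} = \arg\max_{h} \; p(h \mid D)
```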
  • GPU (Graphics Processing Unit): a processor that has been widely used as computing hardware for model training because its parallel structure enables efficient matrix operations.
  • TPU (Tensor Processing Unit)
  • FPGA (Field-Programmable Gate Array)
  • ASIC (Application Specific Integrated Circuit)
  • the Central Processing Unit (CPU) can also be used for model training, but it is usually not as efficient as the GPU and TPU.
  • Deep Self-Attention Model (Transformer): A deep learning architecture based on the attention mechanism (Attention) for processing sequence data such as natural language.
  • Bidirectional Encoder Representations from Transformers (BERT): A special Transformer model trained using a bidirectional Transformer encoder and large-scale unlabeled text data. BERT’s outstanding performance has made it a standard baseline for many natural language processing tasks.
  • Large Language Model (LLM): These models generally contain a multi-layer neural network whose input is a text sequence and whose output is the task result text generated by performing a specific natural language processing task on that text sequence.
  • Pre-training means that before a specific task, the model has been trained and learned to process a large amount of language data in advance. By pre-training the models, they can capture more complex language and semantic rules, thereby performing well in various natural language processing tasks and reducing the demand for large-scale data for specific tasks.
  • Distributed training: A method of training deep learning models using multiple distributed nodes, which can greatly increase training speed and reduce computation time.
  • Model Parallel: A distributed training strategy that splits a model into multiple parts and deploys each part to different distributed nodes for training. This method can effectively solve the problem of too many parameters in the model.
  • Data Parallel (DP)
  • Pipeline Parallel: A distributed training strategy that lies between model parallelism and data parallelism. The core idea is to decompose a large model into multiple layers and combine them into a pipeline to perform forward propagation and back propagation calculations, thereby reducing the memory usage of a single card and reducing communication overhead.
  • Remote Direct Memory Access (RDMA): A technology for directly reading and writing memory between remote computers, which can greatly improve the speed of data transmission in the network.
  • Network card: A hardware device that provides the physical connection between a computer and a network and transmits and receives data packets to complete data transmission.
  • PCIe (Peripheral Component Interconnect Express)
  • Inter-GPU Express: A fast communication channel between two or more GPUs that provides higher bandwidth and lower latency for data transfer between GPUs.
  • a channel is used to transmit data signals and can be a physical channel or a logical channel.
  • a physical channel is composed of a transmission medium and related communication equipment, and is used to transmit actual data signals;
  • a logical channel is a logical path realized through an intermediate node on the basis of a physical channel, that is, a logical path formed between the sender and the receiver.
  • the distributed training of deep learning models mainly builds multiple distributed data from the sample set or the model parameters of the deep learning model and distributes them to different distributed nodes. Iterative training is performed on each distributed node, and each iteration trains the model by propagation calculation followed by parameter update.
  • once a training anomaly occurs (such as a hardware anomaly, system anomaly, network anomaly, or other unknown anomaly), distributed training must be re-executed if the updated model parameters have not been stored, which is unacceptable for deep learning models with large performance overhead.
  • although the updated model parameters can be stored at specific checkpoints (for example, after the parameter update of each iteration), when the deep learning model has a large number of parameters, training must wait until the model parameters have been stored before it can continue.
  • this time overhead often reaches several minutes or even more than ten minutes, so model parameters cannot be stored frequently.
  • the time overhead of resuming distributed training may reach several hours. Therefore, there is an urgent need for a stable and efficient deep learning model training method that stores the updated model parameters in advance at low performance overhead, so that when a training anomaly occurs no recalculation is required after distributed training resumes.
  • the present disclosure provides a deep learning model training method.
  • the present disclosure also involves another deep learning model training method, a deep learning model training system, a deep learning model training device, another deep learning model training device, a computing device, a computer-readable storage medium and a computer program, which are described in detail one by one in the following embodiments.
  • FIG. 1 shows a flow chart of a deep learning model training method provided by an embodiment of the present disclosure, including the following specific steps:
  • Step 102 Obtain an initial deep learning model and sample data set.
  • any distributed node includes computing hardware for model training, such as a GPU, an NPU, an FPGA, an ASIC, or a CPU.
  • a deep learning model refers to a type of machine learning model based on a deep neural network structure, which is widely used in multiple fields such as vision, speech, and natural language processing.
  • a deep learning model has large-scale model parameters, and includes but is not limited to: language processing models, image processing models, speech processing models, code processing models, etc.
  • the language processing model can perform one or more natural language processing tasks, including but not limited to: machine translation tasks, speech recognition tasks, text analysis tasks, or text question and answer tasks.
  • the language processing model can be regarded as: a translation model, a speech recognition model, a text analysis model, and a text question and answer model, etc.
  • in terms of model structure, the language processing model can be a Transformer model, a BERT model, a large language model, etc., which are not limited here.
  • the sample data set is a collection of sample data of the deep learning model used for training, and the sample data set includes large-scale sample data.
  • the sample data set can be a labeled sample data set or an unlabeled sample data set.
  • the sample data can be data of different modalities.
  • for example, if the deep learning model is to be trained into a model with a text processing function, the sample data is sample text of the text modality; if it is to be trained into a model with an audio processing function, the sample data is sample audio of the audio modality; if it is to be trained into a model with an image processing function, the sample data is sample images of the image modality; and if it is to be trained into a model with a numerical processing function, the sample data is sample numerical values of the numerical modality.
  • the sample data set can be obtained from a sample database, such as an open source sample database, or it can be artificially constructed, such as generated using a generative model, and it can also be obtained from a historical database, such as obtaining historical query text and historical answer text from a historical database to construct a sample text set, which is not limited here.
  • the trained target large language model has a text question-answering function and can perform text question-answering tasks.
  • the initial large language model is obtained from the model library, and a sample text set is obtained from the open source sample database.
  • the sample text set includes 10,000,000 sample question-answering text pairs.
  • Step 104 According to the preset distributed training strategy, the deep learning model is distributedly trained based on the sample data set, and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • Distributed training is a training method that uses multiple distributed nodes to train deep learning models.
  • Model training is iterative training, and one iteration process includes adjustment parameter calculation and model parameter update.
  • Adjustment parameters are hyperparameters for adjusting model parameters
  • adjustment parameter calculation is the calculation of hyperparameters for adjusting model parameters, including but not limited to: propagation calculation, probability calculation (updating model parameters through Bayesian optimization), grid calculation (updating model parameters through grid search) and diffusion derivation process (using diffusion model for forward diffusion process and reverse derivation process).
  • the distributed training strategy is a strategy that uses a distributed computing framework to split the large-scale training task of the deep learning model into multiple small-scale training tasks and distributes them to multiple distributed nodes for execution, including but not limited to: data parallel strategy, model parallel strategy and pipeline parallel strategy.
  • the distributed training strategy is pre-set according to the model attributes, sample data sets and/or training tasks of the deep learning model. For example, the model layers of the deep learning model have execution restrictions in sequence, which makes it difficult to split the training, so the data parallel strategy is adopted. For another example, the data scale of the sample data set is too large, so the data parallel strategy is adopted.
  • the use of data parallel strategy may cause network bottlenecks, and the use of model parallel strategy may be too complex, so the pipeline parallel strategy is adopted.
  • Different distributed training strategies correspond to different iterative characteristics of model training. For example, when the data parallel strategy is adopted, the amount of sample data on each distributed node is small while the model parameter specifications are large, so the time overhead of a single adjustment parameter calculation is high and the number of calculations is small.
  • for another example, when the model parallel strategy is adopted, the amount of sample data on each distributed node is large while the model parameter specifications are small, so the time overhead of a single adjustment parameter calculation is low and the number of calculations is large.
  • Different iteration characteristics lead to different time costs for adjustment parameter calculation and for storage. The target storage parameters therefore need to be fine-tuned so that the storage process overlaps with the adjustment parameter calculation process of distributed training.
  • the model specification information is information about the model parameter specifications of the deep learning model, including but not limited to: model parameter quantity, benchmark model, etc.
  • the model parameter quantity is the parameter specification quantity of the model parameter, for example, the model parameter quantity of the deep learning model is a specification of 10 billion.
  • the benchmark model is a specific deep learning model architecture, for example, in natural language processing, the benchmark model of the deep learning model is the Transformer model, the BERT model, and the large language model, etc.
  • the storage is model parameter storage executed synchronously with the adjustment parameter calculation process; that is, the storage process and the adjustment parameter calculation process have similar time overheads.
  • for example, let the time overhead of the adjustment parameter calculation process be T and the time overhead of the storage process be T'. If T is less than T', the storage is still being executed when the model parameter update process begins, and once a training anomaly occurs it is difficult to restore the model training. For details, see the following description.
  • the storage process can be in the forward propagation calculation process, in the reverse propagation calculation process, or in the forward propagation calculation process and the reverse propagation calculation process, which is not limited here.
  • the target storage parameters are configuration parameters for storing model parameters.
  • the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy, including but not limited to: target storage model parameter specifications, target storage frequency, target storage read and write speed, target storage bandwidth, etc.
  • if the model parameters are stored during the model parameter update process, updated and unupdated model parameters are mixed in storage; once a training anomaly occurs, model training cannot be restored from this mixture of updated and unupdated model parameters.
  • the current model parameter is θ_{i-1}, and the i-th iteration training is performed.
  • the current model parameter θ_{i-1} is stored.
  • the model parameter θ_{i-1} can be obtained to re-execute the i-th iteration training. If the current model parameter θ_{i-1} is stored during the model parameter update process, part of the updated model parameter θ_i will be introduced, and thus the i-th iteration training cannot be re-executed in the event of a training anomaly.
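  • The requirement that storage finish before the parameter update begins can be sketched with a background thread. This is an illustrative assumption, not the disclosure's implementation: `train_iteration`, `toy_grads`, the list `theta`, and the in-memory `checkpoints` list are hypothetical names, and a thread stands in for whatever asynchronous copy mechanism a real system would use.

```python
import threading


def train_iteration(theta, compute_grads, lr, checkpoints):
    """Run one iteration, storing the pre-update parameters during the computation."""
    snapshot = []
    saver = threading.Thread(target=snapshot.extend, args=(theta,))
    saver.start()                   # storage starts alongside the computation
    grads = compute_grads(theta)    # adjustment parameter calculation: only reads theta
    saver.join()                    # storage must finish before theta is modified
    checkpoints.append(snapshot)
    for j, g in enumerate(grads):   # model parameter update, in place
        theta[j] -= lr * g


def toy_grads(theta):
    # Gradient of the hypothetical loss sum((p - 1)^2).
    return [2.0 * (p - 1.0) for p in theta]


theta = [0.0, 4.0]
checkpoints = []
for _ in range(2):
    train_iteration(theta, toy_grads, 0.5, checkpoints)
print(checkpoints)
```

  • Because the update modifies `theta` in place, moving `saver.join()` after the update loop would allow a snapshot that mixes old and new values, which is the failure mode described above.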
  • the deep learning model is distributedly trained based on the sample data set.
  • the specific method is: according to the preset distributed training strategy, multiple distributed data are constructed and distributed to the distributed nodes, and iterative training is performed, wherein the iterative training includes adjustment parameter calculation and model parameter update.
  • the model parameters of the deep learning model are stored according to the target storage parameters, specifically in the following manner: according to the target storage parameters, the model parameters of the deep learning model are stored in the storage medium.
  • the storage medium is a hardware device for storing model parameters, including but not limited to: computing hardware cache (GPU cache, NPU cache, FPGA cache, ASIC cache and CPU cache), memory, hard disk and distributed persistent storage array.
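  • Storing to more than one of the media listed above can be sketched as a two-tier store: a fast in-memory copy plus a persisted file. The class `TieredStore`, the JSON file format, and the file name `params.json` are assumptions chosen for a self-contained illustration; the disclosure does not specify a format.

```python
import json
import os
import tempfile


class TieredStore:
    """Keep the newest snapshot in memory and persist a copy to disk."""

    def __init__(self, directory):
        self.memory = None
        self.path = os.path.join(directory, "params.json")

    def save(self, iteration, params):
        self.memory = {"iteration": iteration, "params": list(params)}
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.memory, f)
        # Swap the finished file into place so a reader never sees a partial write.
        os.replace(tmp, self.path)

    def load(self):
        # Prefer the fast tier; fall back to disk if it has been lost.
        if self.memory is not None:
            return self.memory
        with open(self.path) as f:
            return json.load(f)


with tempfile.TemporaryDirectory() as d:
    store = TieredStore(d)
    store.save(3, [0.5, -1.25])
    store.memory = None  # simulate losing the in-memory tier
    recovered = store.load()
print(recovered)
```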
  • the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • the specific method is: based on the model specification information of the deep learning model and the preset distributed training strategy, time cost analysis is performed to obtain the target storage parameters.
  • the time cost analysis is to analyze the time cost of adjusting the parameter calculation process and the model parameter storage process in iterative training, and then determine the target storage parameters that can be stored.
  • The preset distributed training strategy is a data parallel strategy. Based on the model parameter amount of the large language model, on the order of 10^13, and the data parallel strategy, a time cost analysis is performed to determine the time cost t1 of the forward propagation calculation process and the time cost t2 of the back propagation calculation process of one iteration. Based on the time costs and the model parameter amount, the target storage model parameter specifications are determined to be P1 and P2. According to the data parallel strategy, the 10,000,000 sample question-answer text pairs in the sample text set are divided into 64 distributed data, which are distributed to 64 distributed nodes, each with a GPU deployed. On any distributed node, the distributed data is divided into 16 small batches, and the GPU performs iterative training on the 16 small batches.
  • The model parameters θ of the large language model are stored in the GPU cache, memory and hard disk of the distributed node in turn according to the target storage model parameter specifications P1 and P2.
  • the target large language model with text question-answering function is obtained, and the target large language model is deployed on the cloud-side device of the website of the large language model to provide users with virtual character dialogue function.
  • an initial deep learning model and a sample data set are obtained; distributed training is performed on the deep learning model based on the sample data set according to a preset distributed training strategy, and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy, and the iterative law of distributed training is fully considered.
  • the model parameters of the deep learning model are stored according to the target storage parameters, and the storage process of the model parameters and the adjustment parameter calculation process in the distributed training are overlapped, which fully saves performance overhead, and completes the real-time storage of the deep learning model parameters in a manner close to zero performance overhead, so that the deep learning model training has high fault tolerance and high efficiency.
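  • The overlap between parameter storage and the propagation calculation can be sketched as below. This is only an illustration under stated assumptions: a one-parameter model, a Python background thread standing in for the storage path, and a hypothetical `train_with_overlapped_storage` function; the disclosure does not prescribe this mechanism.

```python
import copy
import threading

def train_with_overlapped_storage(theta, batches, lr, store):
    """Each iteration: snapshot the pre-update parameters in a background
    thread while the gradient is computed, then update after the snapshot."""
    for i, (x, y) in enumerate(batches, start=1):
        def save(step=i - 1, frozen=theta):
            store[step] = copy.deepcopy(frozen)
        writer = threading.Thread(target=save)
        writer.start()                 # storage runs concurrently with ...
        pred = theta[0] * x            # ... forward propagation
        grad = 2.0 * (pred - y) * x    # ... and back propagation
        writer.join()                  # storage must finish before the update
        theta[0] -= lr * grad          # parameter update
    return theta

store = {}
final = train_with_overlapped_storage([0.0], [(1.0, 2.0), (1.0, 2.0)], 0.25, store)
```

  • Joining the writer before the update is what keeps each stored snapshot free of partially updated values.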
  • In step 104, performing distributed training on the deep learning model based on the sample data set according to a preset distributed training strategy includes the following specific steps:
  • According to a preset distributed training strategy, multiple distributed data are constructed based on the deep learning model and the sample data set, wherein the preset distributed training strategy includes a model parallel training strategy or a data parallel training strategy;
  • The multiple distributed data are distributed to the distributed nodes;
  • On the first distributed node, a propagation calculation is performed based on the distributed data to obtain a gradient weight, wherein the first distributed node is any one of the multiple distributed nodes;
  • Based on the gradient weight on each distributed node, the model parameters of the deep learning model are updated, and when the preset training end conditions are met, a trained deep learning model is obtained.
  • Distributed data refers to partial data distributed to distributed nodes for execution. If the entire model training is understood as a large-scale training task, distributed data refers to the task data of the small-scale training tasks obtained by splitting the large-scale training task.
  • Different distributed training strategies are used to construct different distributed data. For example, the model parallel strategy is used to split the model parameters of the deep learning model into multiple parts to construct distributed data. Another example is that the data parallel strategy is used to split the sample data set into multiple parts to construct distributed data.
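  • The two ways of constructing distributed data can be sketched as follows. The function names and the dictionary layout are hypothetical, and the assumption that a model parallel node receives the whole sample set is made only for illustration.

```python
def split_evenly(items, num_parts):
    """Split `items` into contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_parts)
    parts, start = [], 0
    for k in range(num_parts):
        end = start + base + (1 if k < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts

def build_distributed_data(strategy, layers, samples, num_nodes):
    if strategy == "data_parallel":
        # every node gets the full model and one shard of the samples
        return [{"layers": layers, "samples": shard}
                for shard in split_evenly(samples, num_nodes)]
    if strategy == "model_parallel":
        # every node gets one slice of the model and all samples
        return [{"layers": part, "samples": samples}
                for part in split_evenly(layers, num_nodes)]
    raise ValueError(f"unknown strategy: {strategy}")

layers = ["l1", "l2", "l3", "l4"]
dp = build_distributed_data("data_parallel", layers, list(range(10)), 4)
mp = build_distributed_data("model_parallel", layers, list(range(10)), 4)
```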
  • Propagation calculation is the process of determining the gradient weight based on sample data using a deep learning model, including the forward propagation calculation process and the back propagation calculation process.
  • Forward propagation calculation is the process of inputting sample data into a deep learning model and outputting predicted data.
  • Back propagation calculation is the process of determining the loss value from the predicted data and then propagating it back through the deep learning model to determine the gradient weights of each model layer.
  • Model parameter updating is the process of adjusting the model parameters of each model layer of the model based on the gradient weights. For example, in the forward propagation calculation process, sample data X is input into a large language model and predicted data Z is output.
  • In the back propagation calculation process, the loss value is determined based on the predicted data Z and propagated back through the large language model to determine the gradient weights of the n model layers of the large language model.
  • In the model parameter update process of this example, the model parameters of the n model layers in the large language model are adjusted based on the gradient weights.
  • Gradient weights assign a different weight to the model parameter update of each model layer during the back propagation calculation process, thereby controlling the update amplitude and the update speed of the parameters of each model layer. The larger the parameter update value of a model layer, the faster its parameters are updated, and vice versa.
  • the preset training end conditions are preset judgment conditions for stopping training, including but not limited to: a preset number of iterations, a preset loss value threshold, a preset training time, and a preset model convergence condition.
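  • A check of the preset training end conditions might look like the sketch below. The function name, parameter names and the rule that any single condition suffices are assumptions; a model convergence test is omitted because the disclosure does not define one.

```python
def should_stop(iteration, loss, elapsed_s,
                max_iterations=None, loss_threshold=None, max_seconds=None):
    """Return True when any configured end condition is met."""
    if max_iterations is not None and iteration >= max_iterations:
        return True
    if loss_threshold is not None and loss <= loss_threshold:
        return True
    if max_seconds is not None and elapsed_s >= max_seconds:
        return True
    return False
```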
  • According to the preset distributed training strategy, multiple distributed data are constructed based on the deep learning model and the sample data set.
  • the specific method is: according to the preset distributed training strategy, the model parameters and/or the sample data set of the deep learning model are divided to obtain multiple distributed data.
  • Propagation calculation is performed based on distributed data to obtain gradient weights.
  • the specific method is as follows: sample data in the distributed data is input into the deep learning model, forward propagation calculation is performed to obtain predicted data, loss value is determined based on sample data and predicted data, loss value is reversely input into the deep learning model, back propagation calculation is performed to obtain gradient weights.
  • the model parameters of the deep learning model are updated. Specifically, based on the gradient weights on each distributed node, the model parameters of the deep learning model are updated by the gradient update method.
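  • The forward propagation, back propagation and gradient update steps can be made concrete with a small sketch. It assumes a linear model with a squared error loss and plain gradient descent; the disclosure does not fix the model, the loss or the update rule, and what it calls gradient weights appear here simply as gradients.

```python
def forward(weights, x):
    """Forward propagation: predicted value for one sample."""
    return sum(w * xi for w, xi in zip(weights, x))

def backward(weights, x, target):
    """Back propagation for the squared error loss (pred - target)^2."""
    pred = forward(weights, x)
    loss = (pred - target) ** 2
    grads = [2.0 * (pred - target) * xi for xi in x]
    return loss, grads

def update(weights, grads, lr):
    """Gradient update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.0, 0.0]
loss, grads = backward(weights, x=[1.0, 2.0], target=1.0)
weights = update(weights, grads, lr=0.125)
```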
  • sample question-answer text pairs in the sample text set are divided to obtain 64 distributed data.
  • the 64 distributed data are distributed to 64 distributed nodes, each of which is deployed with a GPU.
  • any distributed data is divided into 16 small batches.
  • the sample question text X in the small batch of distributed data is input into the large language model, and the forward propagation calculation is performed to obtain the predicted answer text Z′.
  • the loss value Loss is determined, and the loss value is reversely input into the large language model, and the back propagation calculation is performed to obtain the gradient weight.
  • The model parameters θ of the large language model are updated through the gradient update method.
  • the trained target large language model is obtained.
  • the target large language model has a text question and answer function.
  • According to the preset distributed training strategy, multiple distributed data are constructed based on the deep learning model and the sample data set, wherein the preset distributed training strategy includes a model parallel training strategy or a data parallel training strategy; the multiple distributed data are distributed to the distributed nodes; on the first distributed node, propagation calculation is performed based on the distributed data to obtain the gradient weight, wherein the first distributed node is any one of the multiple distributed nodes; based on the gradient weight on each distributed node, the model parameters of the deep learning model are updated, and when the preset training end conditions are met, a trained deep learning model is obtained.
  • According to the preset distributed training strategy, multiple distributed data are constructed, distributed training is performed on multiple distributed nodes, gradient weights are obtained, and the model parameters of the deep learning model are updated, thereby improving the efficiency of model training.
  • the distributed data includes multiple batches of distributed data
  • a propagation calculation is performed based on the distributed data of the current batch to obtain a gradient weight
  • The distributed data of the current batch is updated, and the process returns to the step of performing, on the first distributed node, propagation calculation based on the distributed data of the current batch to obtain the gradient weight.
  • the current batch of distributed data is the distributed data of the batch for training the deep learning model in the current iterative training process.
  • the distributed data is divided into 16 batches, and 16 iterative trainings need to be performed.
  • One iterative training process includes a forward propagation calculation process, a backpropagation calculation process, and a parameter update process.
  • a communication process (a process of integrating gradient weights) is also included.
  • the current process is the i-th iterative training process
  • the current batch of distributed data is the i-th batch of distributed data
  • the current model parameters are the model parameters θ_{i-1} after the (i-1)-th update.
  • the current batch of distributed data is updated, that is, the i-th batch of distributed data is updated to the i+1-th batch of distributed data.
  • Propagation calculation is performed based on the current batch of distributed data to obtain gradient weights.
  • the specific method is: input the sample data in the current batch of distributed data into the deep learning model, perform forward propagation calculation, obtain predicted data, determine the loss value based on the sample data and the predicted data, reversely input the loss value into the deep learning model, perform back propagation calculation, and obtain gradient weights.
  • The sample question text X_i in the i-th batch of distributed data is input into the large language model (whose current model parameters are θ_{i-1}, the parameters after the (i-1)-th update), and the forward propagation calculation is performed to obtain the predicted answer text Z′_i.
  • The loss value Loss is determined and propagated back through the large language model, and the back propagation calculation is performed to obtain the gradient weight.
  • The model parameters of the large language model are updated from θ_{i-1} to θ_i through the gradient update method, and the current batch of distributed data is updated from the i-th batch to the (i+1)-th batch.
  • The process then continues from the step of inputting the sample question text X_{i+1} in the (i+1)-th batch of distributed data into the large language model.
  • a trained target large language model is obtained, and the target large language model has a text question and answer function.
  • On the first distributed node, the propagation calculation is performed based on the current batch of distributed data to obtain the gradient weight; the distributed data of the current batch is then updated, and the process returns to the step of performing the propagation calculation on the first distributed node based on the current batch of distributed data.
  • the target storage parameters corresponding to the propagation calculation process are determined.
  • the number of forward and backward calculations and the corresponding time overhead required for each batch of distributed data in the training process of the deep learning model under different distributed training strategies are also different.
  • When a data parallel strategy is adopted, the amount of sample data on each distributed node is small while the model parameter specification is large, so the time overhead of one propagation calculation is high and the number of propagation calculations is small.
  • When a model parallel strategy is adopted, the amount of sample data on each distributed node is large while the model parameter specification is small, so the time overhead of one propagation calculation is low and the number of propagation calculations is large. The target storage parameters therefore need to be determined at a fine granularity based on the number and time overhead of propagation calculations, so that the storage process of the model parameters in each iterative training process overlaps with the propagation calculation process in distributed training.
  • The number and time overhead of propagation calculations are predicted, specifically with prediction tools such as the torch.distributed module, the torch.profiler tool, and the tf.data API.
  • the target storage parameters corresponding to the propagation calculation process are determined.
  • the specific method is: with the time overhead of the model parameter storage process not exceeding the time overhead of the propagation calculation process as the goal, based on the number of propagation calculations and the time overhead, the target storage parameters corresponding to the propagation calculation process are determined.
  • Using the torch.distributed module, for each batch of distributed data, based on the model parameter amount of the large language model, on the order of 10^13, and the data parallel strategy, the number of propagation calculations is predicted to be 16, the time overhead of the forward propagation calculation process is predicted to be t1, and the time overhead of the back propagation calculation process is predicted to be t2.
  • the target storage model parameter specifications corresponding to the propagation calculation process are determined to be P1 and P2.
  • the number of propagation calculations and time overhead are predicted; based on the number of propagation calculations and time overhead, the target storage parameters corresponding to the propagation calculation process are determined. This ensures the feasibility of subsequent storage of model parameters and more accurately overlaps the storage process of model parameters with the propagation calculation process in distributed training.
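  • One way to turn the predicted time overheads into target storage model parameter specifications is sketched below. The disclosure does not give a formula, so the rule used here (store no more than the storage bandwidth allows within t1 and t2) and the names `plan_storage` and `bandwidth` are assumptions.

```python
def plan_storage(total_bytes, t_forward, t_backward, bandwidth):
    """Choose P1 and P2 (bytes stored during forward and back propagation)
    so that each write takes no longer than the computation it overlaps."""
    p1 = min(total_bytes, int(t_forward * bandwidth))
    p2 = min(total_bytes - p1, int(t_backward * bandwidth))
    fits_in_one_iteration = (p1 + p2) == total_bytes
    return p1, p2, fits_in_one_iteration

# Hypothetical numbers: 1000 bytes of parameters, t1 = 2 s, t2 = 4 s.
slow = plan_storage(total_bytes=1000, t_forward=2.0, t_backward=4.0, bandwidth=100.0)
fast = plan_storage(total_bytes=1000, t_forward=2.0, t_backward=4.0, bandwidth=200.0)
```

  • When the plan does not fit in one iteration, as in the `slow` case, the remaining parameters would have to be stored over later iterations or at a lower storage frequency.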
  • the gradient weights on each distributed node are integrated through the communication channels between the distributed nodes.
  • the communication channel between distributed nodes is a channel connection for data transmission between distributed nodes. It can be a physical connection, such as a network card, PCIe topology, or optical fiber, or a virtual connection, such as a high-speed channel between GPUs or between distributed nodes.
  • the communication channel can be implemented through RDMA technology to improve the transmission speed.
  • The high-speed channel constructed by the network cards on the 64 distributed nodes uses RDMA technology to integrate the gradient weights on each distributed node. Based on the gradient weights on each distributed node, the model parameters θ of the large language model are updated through the gradient update method.
  • the communication channels between distributed nodes are used to integrate the gradient weights on each distributed node.
  • the communication process of iterative training is completed in a centralized manner through the communication channels, which improves the efficiency and stability of model training.
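  • The integration of gradient weights across nodes can be sketched as an element-wise mean, which is the usual result of an all-reduce in data parallel training. The disclosure does not state which reduction is used, so averaging is an assumption, and the nodes are simulated in a single process.

```python
def integrate_gradients(per_node_grads):
    """Average gradients element-wise across nodes (an all-reduce-style mean)."""
    num_nodes = len(per_node_grads)
    return [sum(vals) / num_nodes for vals in zip(*per_node_grads)]

node_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg = integrate_gradients(node_grads)

theta = [10.0, 10.0]
theta = [t - 0.5 * g for t, g in zip(theta, avg)]  # gradient update, lr = 0.5
```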
  • any distributed node includes a first communication channel connected to a storage medium and a second communication channel connected to other distributed nodes;
  • In step 104, storing the model parameters of the deep learning model according to the target storage parameters includes the following specific steps:
  • the model parameters of the deep learning model are stored in the storage medium through the first communication channel;
  • Integrating the gradient weights on each distributed node through the communication channel between the distributed nodes includes the following specific steps:
  • the gradient weights on each distributed node are integrated through the second communication channel.
  • The storage medium is independent of the distributed nodes, so that an abnormality on any distributed node does not cause data loss, thereby ensuring the reliability of the entire distributed system. Therefore, the distributed nodes need to establish a connection with the storage medium.
  • a communication channel is established for data storage.
  • A communication channel is needed in the process of storing model parameters, and a communication channel is also needed in the communication process, that is, when integrating the gradient weights on each distributed node. For large-scale deep learning models, it is difficult for a single channel to serve both processes. Therefore, the communication channels of the two processes need to be distinguished and the two data transmission processes isolated, to avoid introducing additional collective communications and to ensure that there is no interference with distributed training.
  • the first communication channel connecting the storage medium is a communication channel used for the model parameter storage process, including physical connections and virtual connections between distributed nodes and storage media, such as network cards, PCIe topology, optical fibers, read and write channels (data buses) between distributed nodes and storage media, etc.
  • The second communication channel connected to other distributed nodes is a communication channel used for the communication process, including physical connections and virtual connections between distributed nodes, such as network cards, PCIe topology, optical fibers, high-speed channels between GPUs, high-speed channels between distributed nodes, etc.
  • the second communication channel can be implemented through RDMA technology, which improves the transmission speed.
  • the first communication channel is the physical connection between the network cards on the 64 distributed nodes and the distributed persistent storage array.
  • The model parameters θ of the large language model are stored in the distributed persistent storage array through the first communication channel.
  • The second communication channel consists of the physical connections between the network cards on the 64 distributed nodes and the PCIe topology into which the GPUs are inserted on each distributed node, together with the virtual connections formed by the high-speed channels between GPUs and the high-speed channels between distributed nodes.
  • The gradient weights on each distributed node are integrated. Based on the gradient weights on each distributed node, the model parameters θ of the large language model are updated through the gradient update method.
  • the model parameters of the deep learning model are stored in the storage medium through the first communication channel; the gradient weights on each distributed node are integrated through the second communication channel.
  • the two data transmission processes of model parameter storage and communication are isolated to avoid introducing additional collective communications, ensuring that there is no interference with distributed training, improving the reliability of distributed training, and improving the effect of model training.
  • the propagation calculation includes forward propagation calculation and backward propagation calculation
  • Before storing the model parameters of the deep learning model according to the target storage parameters in step 104, the following specific steps are also included:
  • the model parameters of the deep learning model are stored according to the second target storage parameters.
  • the first target storage parameter is a configuration parameter for storing model parameters in the forward propagation calculation process.
  • the first target storage parameter is determined based on the model specification information of the deep learning model and the preset distributed training strategy, including but not limited to: first target storage model parameter specifications, first target storage frequency, first target storage read and write speed, first target storage bandwidth, etc.
  • the second target storage parameters are configuration parameters for storing model parameters during the back-propagation calculation process.
  • the second target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy, including but not limited to: second target storage model parameter specifications, second target storage frequency, second target storage read and write speed, second target storage bandwidth, etc.
  • the specific method is: for each batch of distributed data, based on the model specification information of the deep learning model and the preset distributed training strategy, predict the number and time overhead of forward propagation calculation and reverse propagation calculation, and based on the number and time overhead of forward propagation calculation and reverse propagation calculation, determine the first target storage parameter corresponding to the forward propagation calculation and the second target storage parameter corresponding to the reverse propagation calculation process.
  • the model parameters of the deep learning model are stored according to the first target storage parameters.
  • the specific method is: during the forward propagation calculation process, the model parameters of the deep learning model are stored in the storage medium according to the first target storage parameters.
  • the model parameters of the deep learning model are stored according to the second target storage parameters.
  • the specific method is: during the back propagation calculation process, the model parameters of the deep learning model are stored in the storage medium according to the second target storage parameters.
  • the first target storage model parameter specification corresponding to the forward propagation calculation process is determined to be P1 and the second target storage model parameter specification corresponding to the reverse propagation calculation process is determined to be P2.
  • the model parameter θ of the large language model is stored in the GPU cache, memory and distributed persistent storage array of the distributed node.
  • the model parameter θ of the large language model is stored in the GPU cache, memory and distributed persistent storage array of the distributed node.
  • the first target storage parameter corresponding to the forward propagation calculation process and the second target storage parameter corresponding to the reverse propagation calculation process are determined; in the forward propagation calculation process, the model parameters of the deep learning model are stored according to the first target storage parameter; in the reverse propagation calculation process, the model parameters of the deep learning model are stored according to the second target storage parameter.
  • the target storage parameters corresponding to forward propagation and reverse propagation are divided more finely, and the storage process of the model parameters is overlapped with the forward propagation calculation process and the reverse propagation calculation process in distributed training, so as to complete the real-time storage of the deep learning model parameters in a manner close to zero performance overhead, so that the deep learning model training has high fault tolerance and high efficiency.
  • storing the model parameters of the deep learning model according to the target storage parameters in step 104 includes the following specific steps:
  • Model parameters of the deep learning model are stored in multiple storage media according to target storage parameters and storage performance priorities of the multiple storage media.
  • Different storage media have different storage performance, including but not limited to: read and write speed and persistence.
  • the read and write speed of memory is higher than that of hard disk, but the persistence of memory is lower than that of hard disk.
  • memory has lower abnormal tolerance.
  • the model parameters of the deep learning model are stored in multiple storage media.
  • the specific method is: according to the target storage parameters and the storage performance priorities of multiple storage media, a storage strategy is determined, and according to the storage strategy, the model parameters of the deep learning model are stored in multiple storage media.
  • The disclosed embodiment provides a storage strategy that establishes a storage medium hierarchy, from top to bottom: GPU cache, memory and hard disk. (1) Higher-level storage media with faster read and write speeds are used as much as possible, to maximize the parameter specification that can be stored and the anomaly recovery capability; (2) even if an upper-level storage medium is unavailable, the lower-level, more persistent storage medium can still be relied on to ensure persistent storage of the current model parameters; (3) overhead is saved through asynchronous execution between upper and lower layers.
  • the model parameters of the deep learning model are stored in a multi-level storage medium, including the following specific steps: according to the target storage parameters, the model parameters of the deep learning model are stored in a first storage medium and a second storage medium respectively, wherein the reading and writing speed of the first storage medium is faster than that of the second storage medium.
  • The model parameters θ of the large language model are stored in the GPU cache, memory and hard disk of the distributed nodes respectively.
  • the model parameters of the deep learning model are stored in a multi-level storage medium, including the following specific steps: according to the target storage parameters, the model parameters of the deep learning model are stored in a first storage medium, so that the model parameters are transferred from the first storage medium to the second storage medium, wherein the reading and writing speed of the first storage medium is faster than that of the second storage medium.
  • The model parameters θ of the large language model are stored in the GPU cache, then transferred from the GPU cache to the memory, and then transferred from the memory to the hard disk.
  • the model parameters of the deep learning model are stored in multiple storage media.
  • the storage performance of different storage media is fully utilized to achieve high-frequency model parameter storage, and the current model parameters are stored, saving overhead.
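  • The hierarchy of storage media can be sketched with in-memory dictionaries standing in for the GPU cache, memory and hard disk. The `TieredStore` class and its methods are hypothetical; in particular, `flush` runs synchronously here, whereas the strategy above calls for asynchronous transfer between layers.

```python
import copy

class TieredStore:
    """Tiers ordered fastest first, e.g. GPU cache, memory, hard disk."""

    def __init__(self, tier_names):
        self.tier_names = list(tier_names)
        self.tiers = {name: {} for name in self.tier_names}

    def save(self, step, params):
        # write to the fastest tier only; lower tiers are filled by flush()
        self.tiers[self.tier_names[0]][step] = copy.deepcopy(params)

    def flush(self, step):
        # cascade the snapshot downward, tier by tier
        for upper, lower in zip(self.tier_names, self.tier_names[1:]):
            self.tiers[lower][step] = copy.deepcopy(self.tiers[upper][step])

    def load(self, step, unavailable=()):
        # read from the fastest tier that is available and holds the snapshot
        for name in self.tier_names:
            if name not in unavailable and step in self.tiers[name]:
                return name, copy.deepcopy(self.tiers[name][step])
        raise KeyError(step)

store = TieredStore(["gpu_cache", "memory", "disk"])
store.save(7, [0.1, 0.2])
store.flush(7)
tier, params = store.load(7, unavailable=("gpu_cache", "memory"))
```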
  • the method further includes the following specific steps:
  • Upon receiving a request to resume training, obtaining stored target model parameters, wherein the request to resume training is generated after determining that the training anomaly of the distributed training has been recovered, and the target model parameters are model parameters stored before the training anomaly occurs;
  • Training exceptions are abnormal situations that occur during model training. Abnormal situations may cause model training to fail to run normally or affect the performance and accuracy of the trained model, including but not limited to: hardware abnormalities, system abnormalities, network abnormalities, or other unknown abnormalities. Training abnormality recovery refers to taking corresponding measures to resume model training when training abnormalities occur. Resume training request is an instruction request for resuming model training.
  • The target model parameters are the model parameters stored before the training anomaly occurs. For example, after completing the (i-1)-th iteration training, the target model parameters are θ_{i-1}, and the i-th iteration training is performed. During the propagation calculation process, the target model parameters θ_{i-1} are stored. After the storage is completed, if a training anomaly occurs, the model parameters θ_{i-1} can be obtained and the i-th iteration training is performed again.
  • the stored target model parameters are obtained by: obtaining the stored target model parameters from a storage medium.
  • The stored target model parameters θ are obtained from the hard disk. Based on the target model parameters θ, the distributed training of the large language model is resumed to obtain a trained target large language model, which has a text question-answering function.
  • the stored target model parameters are obtained, where the request to resume training is generated after determining that the training anomaly of distributed training has been recovered, and the target model parameters are the model parameters stored before the training anomaly occurs; based on the target model parameters, the distributed training of the deep learning model is resumed.
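  • Resuming from the stored target model parameters can be sketched as follows. The one-parameter model, the `run` function and the dictionary used as a checkpoint are hypothetical; the point is only that restarting from the parameters stored before the anomaly reproduces the result of an uninterrupted run.

```python
def run(theta, batches, lr, checkpoints, start=0, fail_at=None):
    """Train from batch index `start`; checkpoint theta before each iteration."""
    for i in range(start, len(batches)):
        checkpoints["step"], checkpoints["theta"] = i, theta
        if fail_at == i:
            raise RuntimeError("training anomaly")
        x, y = batches[i]
        theta = theta - lr * 2.0 * (theta * x - y) * x
    return theta

batches = [(1.0, 2.0), (2.0, 4.0), (1.0, 2.0)]
expected = run(0.0, batches, 0.125, {})   # uninterrupted reference run

ckpt = {}
try:
    run(0.0, batches, 0.125, ckpt, fail_at=2)
except RuntimeError:
    # resume-training request: reload the stored parameters and continue
    resumed = run(ckpt["theta"], batches, 0.125, ckpt, start=ckpt["step"])
```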
  • FIG2 shows a system architecture diagram of a deep learning model training method provided by an embodiment of the present disclosure, as shown in FIG2 :
  • the system architecture includes multi-layer storage media, from top to bottom: GPU cache, memory, and hard disk.
  • From top to bottom the multi-layer storage media are increasingly persistent, and from bottom to top their read and write speeds are increasingly fast.
  • the model parameters of the deep learning model are stored from the GPU to the GPU cache for non-persistent high-speed reading and writing.
  • the model parameters of the deep learning model are stored from the GPU to the memory for non-persistent high-speed reading and writing.
  • the model parameters of the deep learning model are transferred from the memory to the hard disk for persistent storage.
  • the model parameters of the deep learning model are obtained from the multi-layer storage media and the model training is performed again.
  • FIG3 shows a flow chart of a deep learning model training method provided by an embodiment of the present disclosure, as shown in FIG3 :
  • each iteration includes propagation calculation, parameter update and parameter storage.
  • the propagation calculation and parameter update are completed, and the updated model parameters are stored.
  • the i-th iteration is started, and the propagation calculation and parameter update are also completed, and the updated model parameters are stored.
  • each iteration includes propagation calculation, communication integration and parameter updating process.
  • the model parameters are stored, and then the communication integration and parameter updating process are performed, and the i-th iteration is started.
  • the model parameters are stored, and then the communication integration and parameter updating process are performed.
  • FIG. 4 shows a schematic diagram of a deep learning model training method provided by an embodiment of the present disclosure. As shown in FIG. 4:
  • the system includes multiple distributed nodes (two distributed nodes in the figure) and storage media (distributed persistent storage array in the figure), and any distributed node includes memory, multiple graphics processing units, a first network card, and a second network card.
  • a high-speed communication channel is established between each graphics processing unit in the distributed node, a first communication channel is established between the first network card on the distributed node and the distributed storage node, and a second communication channel is established between the second network cards of each distributed node.
  • FIG. 5 shows a flowchart of another deep learning model training method provided by an embodiment of the present disclosure, which is applied to a cloud-side device, wherein the cloud-side device includes a plurality of distributed nodes and a storage medium; the method includes the following specific steps:
  • Step 502: Obtain an initial deep learning model and a sample data set.
  • Step 504: According to the preset distributed training strategy, call multiple distributed nodes to perform distributed training on the deep learning model based on the sample data set, and, in the process of calculating the adjustment parameters of the distributed training, store the model parameters of the deep learning model in the storage medium according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • Step 506: When an abnormality in the deep learning model training is identified, trigger the multiple distributed nodes to stop the distributed training of the deep learning model, and determine the target model parameters currently stored in the storage medium.
  • Step 508: When a request to resume training is received, obtain the target model parameters from the storage medium.
  • Step 510: Call the multiple distributed nodes to resume the distributed training of the deep learning model based on the target model parameters.
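  • Steps 504 to 510 can be sketched as a toy control flow. Everything here is an illustrative assumption: the exception `TrainingAnomaly`, the functions `run_training` and `resume_training`, the `storage` dictionary standing in for the storage medium, and the "add 1.0" stand-in for a parameter update.

```python
class TrainingAnomaly(Exception):
    """Raised by the toy loop to simulate a training abnormality."""


def run_training(params, num_steps, storage, fail_at=None, start_step=0):
    for step in range(start_step, num_steps):
        # Step 504: store the parameters as of the start of this iteration.
        storage["latest"] = {"step": step, "params": dict(params)}
        if step == fail_at:
            # Step 506: training stops; storage["latest"] holds the target parameters.
            raise TrainingAnomaly(f"simulated failure at step {step}")
        params["w"] += 1.0  # stand-in for one parameter update
    return params


def resume_training(num_steps, storage):
    # Steps 508-510: fetch the stored target parameters and continue from them.
    ckpt = storage["latest"]
    return run_training(dict(ckpt["params"]), num_steps, storage,
                        start_step=ckpt["step"])


storage = {}
try:
    run_training({"w": 0.0}, 5, storage, fail_at=3)
except TrainingAnomaly:
    recovered = resume_training(5, storage)
```

  • In this sketch the resumed run repeats only the interrupted iteration, so the result matches an uninterrupted five-step run.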
  • the embodiments of the present disclosure are applied to a cloud-side device with a distributed training function.
  • the cloud-side device is a networked cloud device, which is a virtual device composed of multiple distributed nodes and storage media.
  • Any distributed node includes computing hardware for model training, such as a GPU, NPU, FPGA, ASIC, or CPU.
  • the target large language model obtained by training has a text question and answer function and can perform text question and answer tasks.
  • the initial large language model is obtained from the model library, and a sample text set is obtained from the open source sample database.
  • the sample text set includes 10,000,000 sample question and answer text pairs, and the preset distributed training strategy is a data parallel strategy.
  • a time overhead analysis is performed to determine that the time overhead of the forward propagation calculation process of one iteration is t1 and the time overhead of the backward propagation calculation process is t2.
  • the target storage model parameter specifications are obtained as P1 and P2.
  • the 10,000,000 sample question and answer text pairs in the sample text set are divided.
  • 64 distributed data are constructed and distributed to 64 distributed nodes.
  • a GPU is deployed on each distributed node.
  • any distributed data is divided into 16 small batches, and the GPU is used to perform iterative training of 16 small batches.
  • the model parameters θ of the large language model are stored in the GPU cache, memory and hard disk of the distributed node in turn according to the target storage model parameter specifications P1 and P2.
  • the 64 distributed nodes are triggered to stop the distributed training of the large language model, and the target model parameters θ currently stored in the hard disk are determined.
  • the stored target model parameters θ are obtained from the hard disk.
  • the distributed training of the large language model is resumed to obtain the trained target large language model.
  • the target large language model has a text question-answering function.
  • the target large language model is deployed on the cloud-side device of the website of the large language model to provide users with a virtual character dialogue function.
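  • The partitioning in this example (10,000,000 pairs over 64 nodes, then 16 mini-batches per node) can be worked through with a short sketch. The helper `split_evenly` and the rule of spreading any remainder one item at a time are assumptions; the disclosure does not state how uneven splits are handled.

```python
def split_evenly(total, parts):
    """Return `parts` sizes that sum to `total` and differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


node_shards = split_evenly(10_000_000, 64)       # one shard per distributed node
mini_batches = split_evenly(node_shards[0], 16)  # 16 mini-batches on one node
```

  • Each node receives 156,250 pairs. Because 156,250 is not divisible by 16, this sketch yields ten mini-batches of 9,766 pairs and six of 9,765.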
  • an initial deep learning model and a sample data set are obtained; according to a preset distributed training strategy, multiple distributed nodes are called to perform distributed training on the deep learning model based on the sample data set, and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored in a storage medium according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy; when an abnormality in the deep learning model training is identified, multiple distributed nodes are triggered to stop the distributed training of the deep learning model, and the target model parameters currently stored in the storage medium are determined; when a request to resume training is received, the target model parameters are obtained from the storage medium; multiple distributed nodes are called to resume the distributed training of the deep learning model based on the target model parameters.
  • the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy, and the iterative law of distributed training is fully considered.
  • the model parameters of the deep learning model are stored in the storage medium according to the target storage parameters, so that the storage process of the model parameters and the adjustment parameter calculation process in distributed training are overlapped, and the performance overhead is fully saved.
  • the real-time storage of the deep learning model parameters is completed, so that the deep learning model training has high fault tolerance and high efficiency.
  • the stored target model parameters are obtained from the storage medium, and the distributed training of the deep learning model is resumed, which increases the fault tolerance of the deep learning model training and avoids re-training the deep learning model. Training remains stable while training efficiency is preserved and training costs are reduced.
  • the storage medium includes a plurality of storage media with different storage performances
  • step 504 during the distributed training adjustment parameter calculation process, the model parameters of the deep learning model are stored in the storage medium according to the target storage parameters, including the following specific steps:
  • obtaining the target model parameters from the storage medium in step 508 includes the following specific steps:
  • the target model parameters are obtained from the second storage medium, wherein the storage performance priority of the first storage medium is higher than that of the second storage medium.
  • the first storage medium is the memory
  • the second storage medium is the hard disk.
  • the read and write speed of the memory is higher than that of the hard disk, so restoring from the memory is more efficient when the target model parameters are available there. However, the memory is non-persistent storage, so the target model parameters may be missing from it, in which case they need to be obtained from the hard disk.
  • the stored target model parameters θ are obtained from the memory, and if they are not obtained, the target model parameters θ are obtained from the hard disk.
  • the target model parameters are obtained from the first storage medium; if the target model parameters are not obtained, the target model parameters are obtained from the second storage medium, wherein the storage performance priority of the first storage medium is higher than that of the second storage medium.
  • the differences in storage performance priority among the storage media are fully utilized. When a training anomaly occurs, the target model parameters are first sought from the storage medium with the higher storage performance priority, while the storage medium with the lower storage performance priority guarantees persistent storage of the target model parameters, which improves the efficiency of model training while ensuring its reliability.
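  • The "first storage medium, then second storage medium" lookup can be sketched as an ordered fallback. The function `load_target_parameters`, the tier names, and the use of dictionaries in place of real memory and disk reads are assumptions made for illustration.

```python
def load_target_parameters(tiers):
    """Try each tier in priority order; return (tier_name, params)."""
    for name, read in tiers:
        try:
            params = read()
        except (KeyError, OSError):
            params = None            # this tier does not hold the parameters
        if params is not None:
            return name, params
    raise LookupError("no stored target model parameters found")


memory = {}                      # volatile tier, emptied by the failure
disk = {"latest": {"w": 0.38}}   # durable tier (a dict stands in for files)

tier, params = load_target_parameters([
    ("memory", lambda: memory["latest"]),
    ("disk", lambda: disk["latest"]),
])
```

  • Here the memory tier has lost its contents, so the lookup falls through to the disk tier.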
  • the cloud-side device further includes a first communication channel connected to each distributed node and a second communication channel connected to the storage medium;
  • step 504 according to the preset distributed training strategy, multiple distributed nodes are called to perform distributed training on the deep learning model based on the sample data set, including the following specific steps:
  • multiple distributed nodes are called through the first communication channel to perform distributed training on the deep learning model based on the sample data set.
  • step 504 storing the model parameters of the deep learning model to the storage medium according to the target storage parameters includes the following specific steps:
  • the second communication channel connected to the storage medium is a communication channel used for the model parameter storage process, including physical channels and logical channels between distributed nodes and storage media, such as network cards, PCIe topology, optical fibers, read and write channels (data buses) between distributed nodes and storage media, etc.
  • the first communication channel connected to other distributed nodes is a communication channel used for the communication process, including physical channels and logical channels between distributed nodes, for example, network cards, PCIe topology, optical fibers, high-speed channels between GPUs within a distributed node, high-speed channels between distributed nodes, etc.
  • the first communication channel can be implemented by RDMA technology, which improves the transmission speed.
  • the second communication channel is a physical channel between the network cards on the 64 distributed nodes and the distributed persistent storage array.
  • the model parameters θ of the large language model are stored in the distributed persistent storage array through the second communication channel.
  • the first communication channel comprises the physical channel between the network cards on the 64 distributed nodes and the PCIe topology into which the GPU on each distributed node is inserted, together with the logical high-speed channels between GPUs and between the distributed nodes.
  • the gradient weights on each distributed node are integrated using RDMA technology. Based on the gradient weights on each distributed node, the model parameters θ of the large language model are updated through the gradient update method.
  • multiple distributed nodes are called through the first communication channel to perform distributed training on the deep learning model based on the sample data set; according to the target storage parameters, the model parameters of the deep learning model are stored in the storage medium through the second communication channel.
  • the two data transmission processes of model parameter storage and communication are isolated to avoid the introduction of additional collective communications, ensure that there is no interference with distributed training, improve the reliability of distributed training, and improve the effect of model training.
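  • As a loose software analogy for isolating the two transfers, each kind of traffic can be given its own queue and worker so that neither waits behind the other. The function `start_channel`, the channel names, and the use of threads and queues are assumptions; the disclosure describes separate network cards and physical channels, which this sketch does not model.

```python
import queue
import threading


def start_channel(name, log):
    """Start a worker that drains its own queue; returns (queue, thread)."""
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            if item is None:      # sentinel: shut the channel down
                break
            log.append((name, item))

    t = threading.Thread(target=worker)
    t.start()
    return q, t


log = []
grad_q, grad_t = start_channel("gradient-channel", log)
ckpt_q, ckpt_t = start_channel("storage-channel", log)

grad_q.put("gradients step 0")      # collective communication traffic
ckpt_q.put("parameters step 0")     # parameter storage traffic
for q, t in ((grad_q, grad_t), (ckpt_q, ckpt_t)):
    q.put(None)
    t.join()
```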
  • the present disclosure also provides a deep learning model training system embodiment
  • FIG. 6 shows a structural schematic diagram of a deep learning model training system provided by an embodiment of the present disclosure.
  • the system includes a management and control unit 602 and multiple distributed nodes, and the multiple distributed nodes include a first distributed node 604, and the first distributed node 604 is any one of the multiple distributed nodes;
  • the management and control unit 602 is used to obtain an initial deep learning model and a sample data set, build multiple distributed data based on the deep learning model and the sample data set according to a preset distributed training strategy, and distribute the multiple distributed data to each distributed node;
  • the first distributed node 604 is used to perform distributed training on the deep learning model based on the sample data set; and in the process of calculating the adjustment parameters of the distributed training, the model parameters of the deep learning model are stored therein according to the target storage parameters, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • the system further includes a persistent storage medium, and the first distributed node 604 includes a non-persistent storage medium;
  • the first distributed node 604 is also used to store the model parameters of the deep learning model to a non-persistent storage medium according to the target storage parameters during the adjustment parameter calculation process of distributed training, so as to transfer the model parameters of the deep learning model from the non-persistent storage medium to the persistent storage medium.
  • the adjustment parameter is a gradient weight
  • the adjustment parameter calculation is a propagation calculation
  • the first distributed node 604 further includes a first communication channel connected to each distributed node and a second communication channel connected to the storage medium;
  • the first distributed node 604 is further used to integrate the gradient weights on each distributed node through the first communication channel;
  • the first distributed node 604 is also used to send the model parameters of the deep learning model from the non-persistent storage medium to the persistent storage medium for storage through the second communication channel.
  • target storage parameters are determined based on model specification information of the deep learning model and a preset distributed training strategy, fully considering the iterative law of distributed training.
  • the model parameters of the deep learning model are stored according to the target storage parameters.
  • the storage process of the model parameters and the adjustment parameter calculation process in distributed training are overlapped, which fully saves performance overhead. Real-time storage of deep learning model parameters is completed in a manner close to zero performance overhead, making deep learning model training highly fault-tolerant and efficient.
  • the above is a schematic scheme of a deep learning model training system of this embodiment. It should be noted that the technical scheme of the deep learning model training system and the technical scheme of the deep learning model training method described above belong to the same concept, and the details not described in detail in the technical scheme of the deep learning model training system can be found in the description of the technical scheme of the deep learning model training method described above.
  • a first acquisition module 702 is configured to acquire an initial deep learning model and a sample data set
  • the first training module 704 is configured to perform distributed training on the deep learning model based on the sample data set according to a preset distributed training strategy, and store the model parameters of the deep learning model according to the target storage parameters during the calculation of the adjustment parameters of the distributed training, wherein the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy.
  • the first training module 704 is further configured to:
  • according to the preset distributed training strategy, multiple distributed data are constructed based on the deep learning model and the sample data set, wherein the preset distributed training strategy includes a model parallel training strategy or a data parallel training strategy; the multiple distributed data are distributed to each distributed node; on the first distributed node, propagation calculation is performed based on the distributed data to obtain the gradient weight, wherein the first distributed node is any one of the multiple distributed nodes; based on the gradient weights on each distributed node, the model parameters of the deep learning model are updated, and when the preset training end conditions are met, a trained deep learning model is obtained.
  • the distributed data includes multiple batches of distributed data
  • the first training module 704 is further configured to:
  • a propagation calculation is performed based on the distributed data of the current batch to obtain a gradient weight
  • the device also includes:
  • the iteration module is configured to update the distributed data of the current batch, return to the step of performing propagation calculation on the first distributed node based on the distributed data of the current batch, and obtain the gradient weight.
  • the device further comprises:
  • the storage parameter determination module is configured to predict the number of propagation calculations and the time overhead for each batch of distributed data based on the model specification information of the deep learning model and the preset distributed training strategy; based on the number of propagation calculations and the time overhead, determine the target storage parameters corresponding to the propagation calculation process.
  • the device further comprises:
  • the gradient weight integration module is configured to integrate the gradient weights on each distributed node through the communication channel between the distributed nodes.
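  • Integrating the gradient weights and updating the parameters can be sketched as below. Averaging across nodes (an all-reduce mean) followed by a plain gradient descent step is an assumption; the disclosure says only that the gradient weights are integrated and the parameters updated. The function names and the numbers are hypothetical.

```python
def integrate_gradients(per_node_grads):
    """Average gradients element-wise across nodes (an all-reduce mean)."""
    n = len(per_node_grads)
    return [sum(vals) / n for vals in zip(*per_node_grads)]


def update_parameters(params, grads, lr):
    """Apply one plain gradient descent step."""
    return [p - lr * g for p, g in zip(params, grads)]


# Gradients computed independently on four nodes for a two-parameter model.
node_grads = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0], [2.0, 4.0]]
avg = integrate_gradients(node_grads)
new_params = update_parameters([0.5, 0.5], avg, lr=0.25)
```

  • With these inputs the averaged gradient is [2.0, 1.0] and the updated parameters are [0.0, 0.25].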
  • any distributed node includes a first communication channel connected to a storage medium and a second communication channel connected to other distributed nodes;
  • the first training module 704 is further configured to:
  • the model parameters of the deep learning model are stored in the storage medium through the first communication channel;
  • the gradient weight integration module is further configured as:
  • the gradient weights on each distributed node are integrated through the second communication channel.
  • the propagation calculation includes forward propagation calculation and backward propagation calculation
  • the device also includes:
  • a forward and backward storage parameter determination module is configured to determine a first target storage parameter corresponding to a forward propagation calculation process and a second target storage parameter corresponding to a backward propagation calculation process based on model specification information of a deep learning model and a preset distributed training strategy;
  • the first training module 704 is further configured as follows:
  • the model parameters of the deep learning model are stored according to the first target storage parameters; during the backward propagation calculation process, the model parameters of the deep learning model are stored according to the second target storage parameters.
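  • One possible way to derive a first and second target storage parameter is sketched below. It assumes, purely for illustration, that each storage parameter is the number of bytes that fits into the corresponding phase at a given transfer bandwidth; the disclosure does not give this formula. The function `storage_budget` and all numeric values are hypothetical.

```python
def storage_budget(total_bytes, t_forward, t_backward, bandwidth):
    """Split a checkpoint of total_bytes across the two propagation phases.

    Each phase gets at most bandwidth * duration bytes; any remainder is
    returned so the caller can decide how to handle it.
    """
    p1 = min(total_bytes, int(bandwidth * t_forward))
    p2 = min(total_bytes - p1, int(bandwidth * t_backward))
    return p1, p2, total_bytes - p1 - p2


p1, p2, leftover = storage_budget(
    total_bytes=14_000_000_000,   # hypothetical checkpoint size in bytes
    t_forward=1.0,                # forward propagation time, seconds
    t_backward=2.0,               # backward propagation time, seconds
    bandwidth=5_000_000_000,      # hypothetical bytes per second
)
```

  • With these numbers, 5 GB is stored during the forward pass and the remaining 9 GB during the backward pass, leaving nothing outstanding when the parameter update begins.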
  • the device further comprises:
  • the training resumption module is configured to obtain the stored target model parameters upon receiving a training resumption request, wherein the training resumption request is generated after determining that the distributed training has recovered from a training anomaly, and the target model parameters are the model parameters stored before the training anomaly occurs; based on the target model parameters, the distributed training of the deep learning model is resumed.
  • target storage parameters are determined based on the model specification information of the deep learning model and a preset distributed training strategy, fully considering the iterative law of distributed training.
  • the model parameters of the deep learning model are stored according to the target storage parameters.
  • the storage process of the model parameters is overlapped with the propagation calculation process in distributed training, which fully saves performance overhead.
  • the real-time storage of the deep learning model parameters is completed in a manner close to zero performance overhead, making the deep learning model training highly fault-tolerant and efficient.
  • the above is a schematic scheme of a deep learning model training device of this embodiment. It should be noted that the technical scheme of the deep learning model training device and the technical scheme of the deep learning model training method described above belong to the same concept, and the details not described in detail in the technical scheme of the deep learning model training device can be found in the description of the technical scheme of the deep learning model training method described above.
  • the present disclosure also provides a deep learning model training device embodiment.
  • FIG. 8 shows a schematic diagram of the structure of another deep learning model training device provided by one embodiment of the present disclosure.
  • the device is applied to a cloud-side device, and the cloud-side device includes multiple distributed nodes and a storage medium; the device includes:
  • a second acquisition module 802 is configured to acquire an initial deep learning model and a sample data set
  • the second training module 804 is configured to call multiple distributed nodes according to a preset distributed training strategy, perform distributed training on the deep learning model based on the sample data set, and store the model parameters of the deep learning model to the storage medium according to the target storage parameter during the adjustment parameter calculation process of the distributed training, wherein the target storage parameter is determined based on the model specification information of the deep learning model and the preset distributed training strategy;
  • the stop module 806 is configured to trigger the multiple distributed nodes to stop the distributed training of the deep learning model when an abnormality in the deep learning model training is identified, and determine the target model parameters currently stored in the storage medium;
  • the parameter acquisition module 808 is configured to acquire the target model parameters from the storage medium when receiving the request to resume training;
  • the recovery module 810 is configured to call multiple distributed nodes to resume the distributed training of the deep learning model based on the target model parameters.
  • the storage medium includes a plurality of storage media with different storage performances
  • the second training module 804 is further configured to:
  • the parameter acquisition module 808 is further configured as follows:
  • the target model parameters are obtained from the first storage medium; if the target model parameters are not obtained, the target model parameters are obtained from the second storage medium, wherein the storage performance priority of the first storage medium is higher than that of the second storage medium.
  • the cloud-side device further includes a first communication channel connected to each distributed node and a second communication channel connected to the storage medium;
  • the second training module 804 is further configured to:
  • multiple distributed nodes are called through the first communication channel to perform distributed training on the deep learning model based on the sample data set; according to the target storage parameters, the model parameters of the deep learning model are stored in the storage medium through the second communication channel.
  • the stored target model parameters are obtained from the storage medium, and the distributed training of the deep learning model is resumed, which increases the fault tolerance of the deep learning model training and avoids re-training of the deep learning model. Training remains stable while training efficiency is preserved and training costs are reduced.
  • the above is a schematic scheme of a deep learning model training device of this embodiment. It should be noted that the technical scheme of the deep learning model training device and the technical scheme of the deep learning model training method described above belong to the same concept, and the details not described in detail in the technical scheme of the deep learning model training device can be found in the description of the technical scheme of the deep learning model training method described above.
  • FIG. 9 shows a block diagram of a computing device provided by an embodiment of the present disclosure.
  • the components of the computing device 900 include but are not limited to a memory 910 and a processor 920.
  • the processor 920 is connected to the memory 910 via a bus 930, and a database 950 is used to store data.
  • the computing device 900 also includes an access device 940 that enables the computing device 900 to communicate via one or more networks 960.
  • networks 960 include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet.
  • the access device 940 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, and a Near Field Communication (NFC) interface.
  • the above components of the computing device 900 and other components not shown in FIG. 9 may also be connected to each other, for example, through a bus. It should be understood that the computing device structure block diagram shown in FIG. 9 is only for illustrative purposes and is not intended to limit the scope of the present disclosure. Those skilled in the art may add or replace other components as needed.
  • the computing device 900 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or other types of mobile devices, or a stationary computing device such as a desktop computer or a personal computer (PC).
  • the computing device 900 may also be a mobile or stationary server.
  • the above is a schematic scheme of a computing device of this embodiment. It should be noted that the technical scheme of the computing device and the technical scheme of the above-mentioned deep learning model training method belong to the same concept, and the details not described in detail in the technical scheme of the computing device can be found in the description of the technical scheme of the above-mentioned deep learning model training method.
  • An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, can implement the steps of the above-mentioned deep learning model training method.
  • the above is a schematic scheme of a computer-readable storage medium of this embodiment. It should be noted that the technical scheme of the storage medium and the technical scheme of the above-mentioned deep learning model training method belong to the same concept, and the details not described in detail in the technical scheme of the storage medium can be found in the description of the technical scheme of the above-mentioned deep learning model training method.
  • An embodiment of the present disclosure also provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above-mentioned deep learning model training method.
  • the above is a schematic scheme of a computer program of this embodiment. It should be noted that the technical scheme of the computer program and the technical scheme of the above-mentioned deep learning model training method belong to the same concept, and the details not described in detail in the technical scheme of the computer program can be found in the description of the technical scheme of the above-mentioned deep learning model training method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure relate to a deep learning model training method and a deep learning model training system. The deep learning model training method comprises: acquiring an initial deep learning model and a sample data set; and performing distributed training on the deep learning model based on the sample data set according to a preset distributed training strategy, and, during the adjustment parameter calculation of the distributed training, storing model parameters of the deep learning model based on target storage parameters, the target storage parameters being determined based on model specification information of the deep learning model and the preset distributed training strategy. Because the target storage parameters are determined based on the model specification information of the deep learning model and the preset distributed training strategy, the iteration patterns of distributed training are fully taken into account; and because the model parameters of the deep learning model are stored based on the target storage parameters during the adjustment parameter calculation, high efficiency is achieved while the deep learning model training attains high fault tolerance.
PCT/CN2024/118478 2023-11-30 2024-09-12 Deep learning model training method and deep learning model training system Pending WO2025112801A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311636365.9A CN117669700B (zh) 2023-11-30 2023-11-30 Deep learning model training method and deep learning model training system
CN202311636365.9 2023-11-30

Publications (1)

Publication Number Publication Date
WO2025112801A1 (fr) 2025-06-05

Family

ID=90080170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/118478 Pending WO2025112801A1 (fr) 2023-11-30 2024-09-12 Deep learning model training method and deep learning model training system

Country Status (2)

Country Link
CN (1) CN117669700B (fr)
WO (1) WO2025112801A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121412032A (zh) * 2025-09-25 2026-01-27 北京邮电大学 Distributed parallel training method for large models, program product, computing node, and storage medium

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
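CN117669700B (zh) * 2023-11-30 2025-05-09 杭州阿里云飞天信息技术有限公司 Deep learning model training method and deep learning model training system
CN120780536A (zh) * 2024-04-09 2025-10-14 华为技术有限公司 Backup method and apparatus thereof
CN118312326B (zh) * 2024-06-06 2024-08-09 新华策(北京)科技有限公司 Efficient distributed large-model training method and system
CN118612219A (zh) * 2024-06-13 2024-09-06 中国电信股份有限公司技术创新中心 Communication method for distributed training and related device
CN119150804B (zh) * 2024-11-14 2025-03-18 之江实验室 Model training and service execution method, apparatus, storage medium, and device
CN119149245B (zh) * 2024-11-14 2025-05-13 浪潮电子信息产业股份有限公司 Service processing method, test system, medium, and product
CN119760428B (zh) * 2024-12-19 2025-09-23 北京百度网讯科技有限公司 Anomaly detection method, apparatus, and device based on a pipeline-parallel training strategy
CN120031107A (zh) * 2025-01-21 2025-05-23 中煤科工开采研究院有限公司 Method, apparatus, device, and storage medium for training and deploying a machine learning model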

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
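CN109754060A (zh) * 2017-11-06 2019-05-14 阿里巴巴集团控股有限公司 Training method and apparatus for a neural network machine learning model
CN111788585A (zh) * 2019-01-16 2020-10-16 华为技术有限公司 Training method and system for a deep learning model
CN113515370A (zh) * 2021-04-28 2021-10-19 之江实验室 Distributed training method for large-scale deep neural networks
CN113705801A (zh) * 2020-05-22 2021-11-26 华为技术有限公司 Training apparatus and method for a neural network model, and related device
CN117093871A (zh) * 2023-10-16 2023-11-21 之江实验室 Evaluation method and system for deep learning distributed training
CN117669700A (zh) * 2023-11-30 2024-03-08 杭州阿里云飞天信息技术有限公司 Deep learning model training method and deep learning model training system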

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
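CN114035937B (zh) * 2021-10-15 2024-11-26 北京潞晨科技有限公司 Artificial-intelligence-based distributed training and inference method, system, device, and readable storage medium
CN113961351B (zh) * 2021-10-28 2022-12-30 北京百度网讯科技有限公司 Distributed training method, apparatus, device, and storage medium for a deep learning model
CN115018072A (zh) * 2022-06-06 2022-09-06 上海商汤智能科技有限公司 Model training method, apparatus, device, and storage medium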

Also Published As

Publication number Publication date
CN117669700A (zh) 2024-03-08
CN117669700B (zh) 2025-05-09

Similar Documents

Publication Publication Date Title
WO2025112801A1 (fr) Deep learning model training method and deep learning model training system
US12613927B2 (en) Framework for optimization of machine learning architectures
EP4163833A1 (fr) Deep neural network model design enhanced by real-time proxy evaluation feedback
US12367248B2 (en) Hardware-aware machine learning model search mechanisms
US20190354868A1 (en) Multi-task neural networks with task-specific paths
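EP3938963A1 (fr) Scheduling computation graphs using neural networks
CN110622178A (zh) Learning neural network structure
CN112541124A (zh) Method, apparatus, device, medium, and program product for generating a multi-task model
JP2021505993A (ja) Robust gradient weight compression scheme for deep learning applications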
US12443839B2 (en) Hyperparameter transfer via the theory of infinite-width neural networks
Shan et al. Cognitive memory in large language models
WO2018099084A1 (fr) Method, device, chip, and system for training a neural network model
US20230267307A1 (en) Systems and Methods for Generation of Machine-Learned Multitask Models
CN112149809A (zh) Method and device for determining model hyperparameters, computing device, and medium
Tanghatari et al. Federated learning by employing knowledge distillation on edge devices with limited hardware resources
WO2023174189A1 (fr) Method and apparatus for classifying nodes of a graph network model, and device and storage medium
CN110968692A (zh) Text classification method and system
US20250148280A1 (en) Techniques for learning co-engagement and semantic relationships using graph neural networks
CN114780997A (zh) Data processing method, apparatus, device, and medium
CN115564041A (zh) Neural network model training system, method, and related device
US20250373615A1 (en) Context-aware permission reduction
WO2025101527A1 (fr) Techniques for learning co-engagement and semantic relationships using graph neural networks
US20250315719A1 (en) Performance evaluation of generative question-answering systems
US20260119845A1 (en) Low complexity prefix processing in language modeling
Mu et al. Boosting the convergence of reinforcement learning-based auto-pruning using historical data

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application Ref document number: 24895899; Country of ref document: EP; Kind code of ref document: A1