WO2020147142A1 - Training method and system for a deep learning model - Google Patents
Training method and system for a deep learning model
- Publication number
- WO2020147142A1 (application PCT/CN2019/072895)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gradient
- deep learning
- learning model
- iteration
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/098—Distributed learning, e.g. federated learning
Definitions
- This application relates to the field of artificial intelligence, and more specifically, to a training method of a deep learning model and a system for executing the training method.
- Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence.
- Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
- Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
- Deep learning is a learning technology based on deep neural network algorithms.
- The training process of a deep learning model includes forward propagation (FP) calculation and back propagation (BP) calculation.
- FP calculation computes the output of each layer of neurons according to the parameter matrix corresponding to that layer of neurons.
- BP calculation computes the gradient corresponding to each layer of neurons according to the error between the predicted value generated by the FP calculation and the prior knowledge, so that the parameter matrix corresponding to each layer of neurons can be corrected according to the gradient calculated by BP in the FP calculation of the next iteration.
- Each deep learning model needs to synchronize the gradients generated by each BP calculation.
- The method of gradient synchronization in the training process of traditional distributed deep learning models leads to low training efficiency.
- The present application provides a method for training a deep learning model, which improves the training efficiency of the deep learning model by adjusting the order in which the gradients calculated by BP in the current iteration are transmitted to the parameter storage space.
- A method for training a deep learning model is provided.
- The method is applied to a training system.
- The training system includes N deep learning models.
- Each of the deep learning models includes n layers of neurons.
- The training process of each deep learning model includes multiple iterations.
- Each iteration includes forward propagation (FP) calculation and back propagation (BP) calculation, where N is a positive integer greater than 1, and n is a positive integer greater than 1.
- The method includes: generating N first gradient sets in the BP calculation of the jth iteration of the N deep learning models; adjusting, in the process of generating the N first gradient sets, the communication sequence of the gradients included in each first gradient set; sending the gradients included in the N first gradient sets to the parameter storage space of the training system according to the adjusted communication sequence of the gradients included in each first gradient set; obtaining a second gradient set according to the N first gradient sets stored in the parameter storage space; and correcting the parameter matrix of each layer of neurons of each deep learning model according to the gradients included in the second gradient set, for use in the FP calculation of the (j+1)th iteration of each deep learning model.
- Each first gradient set includes the gradient corresponding to the parameter matrix of each layer of neurons of one deep learning model, where j is a positive integer greater than 0.
- N depths can be calculated respectively.
- A weighted average can be calculated over the gradients of each layer of neurons included in the N first gradient sets, which yields the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models.
- The average gradients corresponding to the parameter matrices of the layers of neurons form the second gradient set. That is, the second gradient set includes the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models.
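- The weighted average that forms the second gradient set can be written compactly as follows. The symbols are assumptions introduced here for illustration and do not come from the source: g_{i,k}^{j} is the gradient of layer i produced by the k-th deep learning model in iteration j, and α_k is the averaging weight of that model. With equal weights α_k = 1/N this reduces to the plain mean.

```latex
\bar{g}_i^{\,j} = \sum_{k=1}^{N} \alpha_k \, g_{i,k}^{\,j},
\qquad \sum_{k=1}^{N} \alpha_k = 1,
\qquad 1 \le i \le n
```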
- Adjusting the order in which the gradients obtained by the BP calculation of the current iteration are transmitted to the parameter storage space can reduce the iteration time of the deep learning model in the current iteration and improve the iteration efficiency of the deep learning model.
- The sending order is adjusted so that the gradient corresponding to the parameter matrix of the a-th layer of neurons is sent to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons, where b is less than or equal to n, a is less than b, and a is a positive integer greater than zero.
- Sending the gradient corresponding to the parameter matrix of the a-th layer of neurons to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons can reduce the cost.
- The communication sequence of the gradients included in each first gradient set can be adjusted according to a gradient communication strategy, where the gradient communication strategy is set according to at least one of the following parameters: the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model, and the time required by each layer of neurons of the deep learning model in the FP calculation.
- the deep learning model is any one or more of the N deep learning models.
- The gradient communication strategy may be calculated according to the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of the b-th layer of neurons, and the time between the moment the gradient corresponding to the parameter matrix of the b-th layer of neurons starts to be sent to the parameter storage space and the completion of the FP calculation of the (b-1)th layer of neurons in the (j+1)th iteration of the deep learning model.
- According to the gradient communication strategy, the sending order is adjusted so that the gradient corresponding to the parameter matrix of the a-th layer of neurons is sent to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons.
- the gradient communication strategy includes: the order in which each gradient in the first gradient set is transmitted to the parameter storage area.
- According to the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model, and the time required by each layer of neurons of the deep learning model in the FP calculation, the sending order can be adjusted so that the gradient corresponding to the parameter matrix of the a-th layer of neurons is sent to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons, and so that the gradient corresponding to the parameter matrix of the b-th layer of neurons is sent to the parameter storage space before the (b-1)th layer of neurons completes the corresponding FP calculation in the (j+1)th iteration.
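- The effect of sending lower-layer gradients first can be illustrated with a small timeline simulation. This is only a sketch under simplifying assumptions that are not taken from the source: a single communication channel, no preemption of a transfer in progress, a layer's FP in the next iteration starting as soon as that layer's gradient has arrived and the previous layer's FP has finished, and made-up per-layer times and sizes. The function name `simulate_iteration` and all of its parameters are hypothetical.

```python
def simulate_iteration(bp_time, fp_time, grad_size, bandwidth, low_layers_first):
    """Return BP time of this iteration plus FP time of the next one.

    Lists are indexed so that position 0 is layer 1 and position n-1 is layer n.
    """
    n = len(bp_time)

    # BP runs from layer n down to layer 1, so layer n's gradient is ready first.
    ready = [0.0] * n
    t = 0.0
    for i in reversed(range(n)):
        t += bp_time[i]
        ready[i] = t

    # Send gradients over one channel, one layer at a time.
    arrived = [0.0] * n
    pending = set(range(n))
    clock = 0.0
    while pending:
        available = [i for i in pending if ready[i] <= clock]
        if not available:
            clock = min(ready[i] for i in pending)  # wait for the next gradient
            continue
        # Generation order sends the earliest-produced gradient (highest layer);
        # the adjusted order sends the lowest available layer first.
        nxt = min(available) if low_layers_first else max(available)
        clock += grad_size[nxt] / bandwidth
        arrived[nxt] = clock
        pending.remove(nxt)

    # FP of the next iteration runs from layer 1 up to layer n.
    t = 0.0
    for i in range(n):
        t = max(t, arrived[i]) + fp_time[i]
    return t


# Illustrative numbers only (4 layers, larger gradients in the upper layers).
bp = [1.0, 1.0, 1.0, 1.0]
fp = [1.0, 1.0, 1.0, 1.0]
sizes = [1.0, 1.0, 4.0, 4.0]

generation_order_time = simulate_iteration(bp, fp, sizes, 1.0, low_layers_first=False)
low_first_time = simulate_iteration(bp, fp, sizes, 1.0, low_layers_first=True)
print(generation_order_time, low_first_time)
```

- In this toy setting the adjusted order finishes earlier because the transfer of the large upper-layer gradients overlaps with the FP calculation of the lower layers.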
- the method further includes: obtaining an iteration time of the deep learning model; and adjusting the gradient communication strategy according to the iteration time.
- the acquired iteration time of the deep learning model may be the sum of the BP calculation time in the current iteration process and the FP calculation time in the next iteration process. That is, the iteration time of the deep learning model is the BP calculation time of the Lth iteration of the deep learning model and the FP calculation time of the L+1th iteration, and L is a positive integer greater than j.
- the deep learning model is any one or more of the N deep learning models.
- The gradient communication strategy of the deep learning model can be adjusted according to the fed-back iteration time of the deep learning model, so that the optimal gradient communication strategy can be determined according to the actual iteration time and the iterative training speed of the deep learning model can be improved.
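- One simple way to use the fed-back iteration time is to compare candidate strategies by their measured iteration time and keep the fastest. The sketch below assumes the iteration time is the BP time of one iteration plus the FP time of the next, as described above; the function names, the strategy names, and the timing values are hypothetical placeholders, not measurements.

```python
def iteration_time(bp_seconds, next_fp_seconds):
    """Iteration time: BP time of iteration L plus FP time of iteration L+1."""
    return bp_seconds + next_fp_seconds


def pick_strategy(measurements):
    """Choose the strategy with the smallest average measured iteration time.

    measurements maps a strategy name to a list of (bp_seconds, next_fp_seconds).
    """
    averages = {
        name: sum(iteration_time(bp, fp) for bp, fp in runs) / len(runs)
        for name, runs in measurements.items()
    }
    best = min(averages, key=averages.get)
    return best, averages


# Placeholder timings for two hypothetical strategies.
feedback = {
    "generation_order": [(4.0, 11.0), (4.0, 11.0)],
    "low_layers_first": [(4.0, 9.0), (4.0, 9.0)],
}
best_strategy, average_times = pick_strategy(feedback)
print(best_strategy, average_times)
```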
- A training system for a deep learning model is provided. The training system includes N deep learning models, a gradient communication module, a gradient update module, a correction module, and a parameter storage space.
- Each deep learning model includes n layers of neurons, the training process of each deep learning model includes multiple iterations, and each iteration includes forward propagation (FP) calculation and back propagation (BP) calculation, where N is a positive integer greater than 1, and n is a positive integer greater than 1;
- each of the N deep learning models is used to generate a first gradient set in the BP calculation of the jth iteration, and each first gradient set includes the gradient corresponding to the parameter matrix of each layer of neurons of the respective deep learning model, where j is a positive integer greater than 0;
- the gradient communication module is configured to adjust the communication sequence of the gradients included in each first gradient set, and to send the gradients included in the N first gradient sets to the parameter storage space of the training system according to the adjusted communication sequence of the gradients included in each first gradient set;
- the gradient update module is configured to obtain a second gradient set according to the N first gradient sets stored in the parameter storage space;
- the correction module is configured to correct the parameter matrix of each layer of neurons of each deep learning model according to the gradients included in the second gradient set, for use in the FP calculation of the (j+1)th iteration of each deep learning model.
- the gradient communication module may include two sub-modules, one is an adjustment sub-module, which is used to adjust the communication sequence of the gradients included in each first gradient set.
- the other is a communication sub-module, which is configured to send the gradients included in the N first gradient sets to the parameter storage space of the training system according to the adjusted communication sequence of the gradients included in each first gradient set.
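- The split into an adjustment sub-module and a communication sub-module can be sketched as two small classes. This is a minimal illustration, not the structure of the actual system: the class names, the use of a plain list as the parameter storage space, and the "lowest layer first" ordering policy are all assumptions made here.

```python
class AdjustmentSubmodule:
    """Decides the order in which the gradients of one first gradient set are sent."""

    def adjust(self, gradient_set):
        # gradient_set maps layer index -> gradient. BP produces layer n first;
        # this illustrative policy sends lower layers first instead.
        return sorted(gradient_set)


class CommunicationSubmodule:
    """Sends gradients to the parameter storage space in the adjusted order."""

    def __init__(self, parameter_storage):
        self.parameter_storage = parameter_storage  # hypothetical: a plain list

    def send(self, model_id, gradient_set, order):
        for layer in order:
            self.parameter_storage.append((model_id, layer, gradient_set[layer]))


class GradientCommunicationModule:
    def __init__(self, parameter_storage):
        self.adjuster = AdjustmentSubmodule()
        self.communicator = CommunicationSubmodule(parameter_storage)

    def push(self, model_id, gradient_set):
        order = self.adjuster.adjust(gradient_set)
        self.communicator.send(model_id, gradient_set, order)
        return order


storage = []
module = GradientCommunicationModule(storage)
sent_order = module.push(model_id=0, gradient_set={3: 0.3, 2: 0.2, 1: 0.1})
print(sent_order, storage)
```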
- the correction module may be a module in a parameter server, or may be a module in at least one model training server.
- When the correction module is in the parameter server, the correction module corrects the parameter matrix of each layer of neurons of any deep learning model according to the gradients included in the second gradient set, and stores the corrected parameter matrix corresponding to each layer of neurons in the parameter storage space of the parameter server, so that at least one model training server can obtain the corrected parameter matrix from the parameter storage space during the next iteration of model training.
- When the correction module is in at least one model training server, after the at least one model training server obtains the second gradient set from the parameter storage space of the parameter server, the correction module can correct the parameter matrix of each layer of neurons of any deep learning model according to the second gradient set, for use in the FP calculation of the (j+1)th iteration of any deep learning model of the training system.
- Adjusting the order in which the gradients obtained by the BP calculation of the current iteration are transmitted to the parameter storage space can reduce the iteration time of the deep learning model in the current iteration and improve the iteration efficiency of the deep learning model.
- The gradient communication module is specifically configured to adjust the sending order so that the gradient corresponding to the parameter matrix of the a-th layer of neurons is sent to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons, where b is less than or equal to n, a is less than b, and a is a positive integer greater than zero.
- the gradient communication module is specifically configured to adjust the communication sequence of the gradients included in each first gradient set according to the gradient communication strategy.
- The gradient communication strategy is set according to at least one of the following parameters: the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model, and the time required by each layer of neurons of the deep learning model in the FP calculation.
- the deep learning model is any one or more of the N deep learning models.
- The system further includes a feedback module.
- The feedback module is configured to obtain the iteration time of the deep learning model and feed the obtained iteration time back to the gradient communication module.
- the acquired iteration time of the deep learning model may be the sum of the BP calculation time in the current iteration process and the FP calculation time in the next iteration process.
- the gradient communication module is further configured to adjust the gradient communication strategy according to the iteration time of the deep learning model fed back by the feedback module.
- the feedback module is a collection of feedback modules in each model training server in at least one model training server.
- A training system for a deep learning model is provided. The training system includes at least one computing node, each computing node includes a memory and at least one processor, and the memory is used to store program instructions. When the training system is running, the at least one processor of the at least one computing node executes the program instructions in the memory to perform the method in the first aspect or any one of the possible implementation manners of the first aspect.
- the training system of the deep learning model includes a parameter server and at least one model training server.
- one model training server can be used as a computing node
- the N deep learning modules and gradient communication modules can be respectively run in at least one model training server.
- the gradient update module can run in the parameter server in the training system.
- the correction module can run in at least one model training server or can also run in a parameter server.
- the feedback module runs in at least one model training server.
- The gradient communication module may be a collection of the gradient communication modules in each of the at least one model training server.
- The correction module may be a collection of the correction modules in each of the at least one model training server.
- the feedback module may be a collection of feedback modules in each model training server in at least one model training server.
- the deep learning model training system includes a model training server.
- one model training server includes at least one processor, where one processor can be used as a computing node, and N deep learning modules, gradient communication modules, gradient update modules, and correction modules can be run in at least one processor respectively.
- the feedback module runs in at least one processor of a model training server.
- the gradient communication module, gradient update module, correction module, and feedback module may be at least one processor in a model training server.
- A non-transitory readable storage medium is provided, including program instructions. When the program instructions are executed by at least one computing node, the at least one computing node performs the method in the first aspect or any one of the possible implementation manners of the first aspect.
- A computer program product is provided, including program instructions.
- When the program instructions are executed by at least one computing node, the at least one computing node performs the method in the first aspect or any one of the possible implementation manners of the first aspect.
- Fig. 1 is a schematic block diagram of a deep learning model 100 provided by an embodiment of the present application.
- FIG. 2 is a schematic structural diagram of a distributed training system 200 of a deep learning model 100 provided by an embodiment of the present application.
- Fig. 3 is a schematic block diagram of communication between each model training server and a parameter server according to an embodiment of the present application.
- FIG. 4 is a schematic structural diagram of a distributed training system 400 of a deep learning model 100 provided by an embodiment of the present application.
- Fig. 5 is a schematic flowchart of a method for training a deep learning model provided by an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of a training system for a deep learning model provided by an embodiment of the present application.
- FIG. 7 is a schematic flowchart of a method for accelerating training of a deep learning model provided by an embodiment of the present application.
- (A) in FIG. 8 is a comparison diagram of the iterative time effect of the method for accelerating the training of the deep learning model provided by the embodiment of the present application.
- (B) in FIG. 8 is a comparison diagram of the iterative time effect of the method for accelerating the training of the deep learning model provided by the embodiment of the present application.
- deep learning is a learning technology based on deep neural network algorithms.
- the deep learning model includes an input layer, a hidden layer, and an output layer, which use multiple nonlinear transformations to process data.
- A neural network is a model that imitates the behavioral characteristics of animal neural networks. Depending on the complexity of the system, such a network processes information by adjusting the interconnections among a large number of internal nodes.
- A deep neural network can be understood as a neural network with multiple hidden layers; there is no special metric for "multiple" here. Theoretically, a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks. Training a deep neural network is the process of learning the parameter matrices. The ultimate goal is to obtain the parameter matrix of each layer of neurons in the trained deep neural network (the parameter matrix of each layer of neurons includes the weight corresponding to each neuron in that layer).
- Fig. 1 is a schematic block diagram of a deep learning model 100 provided by an embodiment of the present application.
- the deep learning model 100 may include an input layer 110, a hidden layer 120, and an output layer 130.
- The following description takes as an example a hidden layer 120 that includes n (n greater than 1) layers of neurons.
- Each of the input layer 110, the output layer 130, and the hidden layer 120 includes one or more neurons.
- For illustration, the input layer 110 includes two neurons, each of the n layers in the hidden layer 120 includes three neurons, and the output layer 130 includes one neuron.
- the deep learning model 100 shown in FIG. 1 may be a fully connected neural network, or a convolutional neural network (convolutional neural network, CNN).
- the deep learning model 100 is a fully connected neural network model.
- the deep learning model 100 is a CNN model.
- the deep learning model 100 may include forward propagation (FP) calculation and back propagation (BP) calculation.
- In the FP calculation, training data (such as pixel information of an input image) is used as the input of the input layer 110, and a prediction result is output from the output layer 130 after the input passes through the multiple layers of neurons in the hidden layer 120.
- Each layer of neurons in the hidden layer 120 corresponds to a parameter matrix.
- The product of the input of the input layer 110 and the parameter matrix of the first layer of neurons is used as the input of the first layer of neurons of the hidden layer 120, and this input passes through the activation function of the first layer of neurons to produce the output value of the first layer of neurons.
- the product of the output value of the neuron of the first layer of the hidden layer 120 and the parameter matrix of the neuron of the second layer is used as the input of the neuron of the second layer of the hidden layer 120.
- a prediction result is finally output from the output layer 130.
- Each parameter matrix formed by the weights obtained through training can extract pixel information from the image to be inferred that is input by the user, thereby helping the deep learning model 100 perform correct inference on that image.
- In the FP calculation of the jth iteration, the input of each of the three neurons in layer 1 is the weighted sum of the inputs (i_1, i_2) of the input layer 110, using that neuron's weights in the parameter matrix of layer 1, and the output of each neuron is the activation function applied to that neuron's input.
- Equivalently, the input of the neurons in layer 1 can be expressed as the product of the input (i_1, i_2) and the parameter matrix of the neurons in layer 1 during the jth iteration, and the output can be expressed as the activation function applied to that input.
- j indicates the number of iterations, which is generally equal to the number of times the input layer 110 obtains inputs (i_1, i_2).
- In general, the input of the neurons in the i-th layer can be expressed as the product of the output of the neurons in the (i-1)th layer and the parameter matrix of the neurons in the i-th layer, and the output of the neurons in the i-th layer can be expressed as the activation function applied to that input, where 1 ≤ i ≤ n.
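- The layer-by-layer FP calculation described above can be sketched for the network shape of Fig. 1 (two inputs, hidden layers of three neurons, one output neuron). This is a minimal sketch: the sigmoid activation, the weight values, the use of only two hidden layers, the treatment of the output layer as one more parameter matrix, and the function names are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Activation function; the source does not name one, sigmoid is assumed."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights):
    """FP for one layer: weighted sum per neuron, then the activation function."""
    neuron_inputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return [sigmoid(z) for z in neuron_inputs]


def forward(inputs, parameter_matrices):
    """Feed the output of each layer into the next layer."""
    outputs = inputs
    for weights in parameter_matrices:
        outputs = layer_forward(outputs, weights)
    return outputs


# Made-up parameter matrices: one row of weights per neuron.
parameter_matrices = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],                    # hidden layer 1
    [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.3, 0.3, 0.3]],     # hidden layer 2
    [[0.5, 0.5, 0.5]],                                       # output neuron
]
prediction = forward([1.0, 0.0], parameter_matrices)
print(prediction)
```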
- The parameter matrix of each layer in the deep learning model 100 can be updated according to the difference between the current predicted value and the prior knowledge (of course, there is usually an initialization process before the first update, which initializes the parameter matrix corresponding to each layer of neurons in the hidden layer 120 of the deep learning model 100).
- The error BP algorithm is used to correct the weights of the parameter matrices in the deep learning model 100 during training, so that the error loss of the deep learning model 100 becomes smaller and smaller.
- BP calculation is an error-driven backward pass that aims to obtain the optimal parameter matrix of each layer of neurons.
- The training data input by the user may include the data used as input and the manually provided prediction result corresponding to that data (the prior knowledge).
- the deep learning model 100 is applied to the field of image recognition.
- the training data input to the deep learning model 100 is the pixel information of the image, and the prior knowledge corresponding to the training data is the label "dog" of the image.
- the training data is input to the input layer 110, and after the FP calculation of the deep learning model 100, the predicted value output by the output layer 130 is compared with prior knowledge. For example, if the predicted value output by the output layer 130 is "cat", the parameter matrix of each layer in the deep learning model 100 can be updated according to the error between the predicted value and the prior knowledge "dog".
- The BP calculation can calculate the error E between the output predicted value o_1 and the prior knowledge.
- The weights in the parameter matrix of each layer of neurons in the deep learning model 100 can be corrected according to the error E.
- Specifically, the correction can be made by calculating the gradient of the weights in each parameter matrix; the gradient is obtained by taking the derivative of the error E with respect to the weights in the parameter matrix of the i-th layer, where 1 ≤ i ≤ n.
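- In symbols, the gradient and a typical correction step can be written as follows. The notation is assumed here for illustration because the original formulas are not available: w_i^{j} denotes the parameter matrix of layer i during the jth iteration, and η is a hypothetical step size (learning rate). The source only states that the weights are corrected according to the gradient; plain gradient descent is one common way to do so.

```latex
\Delta w_i^{\,j} = \frac{\partial E}{\partial w_i^{\,j}}, \qquad
w_i^{\,j+1} = w_i^{\,j} - \eta \, \Delta w_i^{\,j}, \qquad 1 \le i \le n
```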
- The (j+1)th iteration is similar to the jth iteration: the FP calculation is performed first, and then the BP calculation is performed.
- In the FP calculation of the (j+1)th iteration, the weights in the parameter matrix are corrected according to the gradient calculated by the BP of the jth iteration, and the predicted output value is calculated according to the corrected parameter matrix.
- In the BP calculation of the (j+1)th iteration, the gradient of the weights in the parameter matrix is calculated according to the error E between the output value calculated by the FP in the (j+1)th iteration and the prior knowledge, so that the weights in the parameter matrix can be corrected again during the (j+2)th iteration.
- The weights in the parameter matrix are continuously corrected, so that the output value predicted by the deep learning model 100 comes as close as possible to the prior knowledge of the training data.
- In the FP calculation of the (j+1)th iteration, the parameter matrix of the neurons in the i-th layer is the parameter matrix of the jth iteration corrected according to the gradient calculated by the BP of the jth iteration.
- For the process of calculating the input and output of each layer of neurons, refer to the description of the FP calculation of the jth iteration above, which is not repeated here.
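- The alternation of FP and BP across iterations can be sketched with a single weight. This toy example assumes a one-neuron linear model, a squared error E = 0.5 * (prediction - target)^2, and a fixed learning rate; none of these choices come from the source. As in the description above, the FP of each iteration first applies the gradient computed by the BP of the previous iteration.

```python
def train(x, target, weight=0.0, learning_rate=0.1, iterations=20):
    """Run FP then BP in each iteration and record the error per iteration."""
    errors = []
    gradient = 0.0  # no gradient exists before the first BP
    for _ in range(iterations):
        # FP: correct the weight with the previous iteration's gradient, then predict.
        weight -= learning_rate * gradient
        prediction = weight * x
        errors.append(0.5 * (prediction - target) ** 2)
        # BP: derivative of the error with respect to the weight.
        gradient = (prediction - target) * x
    return weight, errors


final_weight, errors = train(x=2.0, target=4.0)
print(final_weight, errors[0], errors[-1])
```

- With these made-up values the error shrinks in every iteration and the weight approaches 2, the value at which the prediction matches the target.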
- the training process (including the FP calculation process and the BP calculation process) of the aforementioned deep learning model 100 in the embodiment of the present application may be completed in a training system including at least one computing node.
- the at least one computing node may be at least one model training server, or at least one processor in a model training server.
- the following describes the scene of training the deep learning model 100 in conjunction with FIGS. 2 to 4.
- FIG. 2 is a schematic structural diagram of a distributed training system 200 of a deep learning model 100 provided by an embodiment of the present application.
- the distributed training system 200 shown in FIG. 2 may include a model training server 210, a model training server 220, a model training server 230, a parameter server 240, and a cloud storage 250.
- Distributed deep learning training aims to increase computing resources by using multiple computing nodes, and to iterate the trained model through multiple computing nodes to improve the training speed of deep learning models.
- the distributed training system 200 may include at least one model training server, where one model training server may serve as a computing node.
- Figure 2 uses three model training servers as examples for illustration.
- the structure of the model training server 220 and the model training server 230 are similar to the model training server 210, and the model training server 210 will be described in detail below.
- (1) Model training server 210:
- It includes: at least one processor, a memory 213, an input/output interface 214, a communication interface 215, and a bus 216.
- At least one processor may be connected to the memory 213.
- the memory 213 can be used to store the program code and training data.
- The memory 213 may be a storage unit inside the at least one processor, an external storage unit independent of the at least one processor, or a component that includes both a storage unit inside the at least one processor and an external storage unit independent of the at least one processor.
- The memory 213 can be a solid state drive (SSD), a hard disk drive (HDD), a read-only memory (ROM), a random access memory (RAM), etc.
- the at least one processor can obtain program code and training data from the memory 213, and train the deep learning model 100.
- The at least one processor may perform iterative calculations (for example, the FP calculation and BP calculation shown in FIG. 1) according to the program code and training data, and may also send (push) the gradient of the weights in the parameter matrix calculated by BP to the parameter server 240 in the distributed training system 200.
- the at least one processor may include two types of processors.
- One type of processor includes at least one data processor 211, and the other type of processor includes at least one iterative processor 212.
- The data processor 211 may be a central processing unit (CPU), and the iterative processor 212 may be an embedded neural network processor (neural-network processing unit, NPU) or a graphics processing unit (GPU).
- The iterative processor 212 runs the deep learning model 100 and, in the BP calculation of the deep learning model 100, calculates the gradient of the weights in the parameter matrix of each layer of neurons.
- The data processor 211 can be used to send (push) the gradient calculated by BP to the parameter server 240.
- The data processor 211 may run a gradient communication module 2111, and the iterative processor 212 may run the deep learning model 100.
- A feedback module 2121 may run in the iterative processor 212.
- The correction module can run in the data processor 211 or in the parameter server 240.
- For example, the correction module 2112 runs in the data processor 211.
- Alternatively, the correction module 2412 runs in the data processor 241 of the parameter server 240.
- The model training server 210 may further include a bus 216.
- The memory 213, the input/output interface 214, and the communication interface 215 may be connected to the at least one processor (for example, the data processor 211 and the iterative processor 212) through the bus 216.
- The bus 216 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, etc.
- The bus 216 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one line is used to represent it in FIG. 2, but this does not mean that there is only one bus or one type of bus.
- The parameter storage space in the memory 243 may store the gradients of the weights sent by the model training server 210, the model training server 220, and the model training server 230 respectively.
- At least one data processor 241 may be connected to the memory 243, and the data processor 241 may be a CPU, for example.
- The at least one data processor 241 can obtain from the memory 243 the gradients sent by the model training server 210, the model training server 220, and the model training server 230 respectively, process the multiple gradients pushed by the multiple model training servers, and store the processed gradient in the memory 243.
- For example, the at least one data processor 241 may perform a weighted average calculation on the multiple gradients pushed by the multiple model training servers to obtain the gradient average, and store the gradient average in the memory 243. Besides the weighted average calculation, other algorithms can also be used to obtain the gradient average from the multiple gradients.
- The data processor 241 may run a gradient update module 2411.
- A correction module 2412 may also run in the data processor 241.
- For each module running in the data processor 241, please refer to the description of FIG. 6, which will not be repeated here.
- After the data processor 241 in the parameter server 240 calculates the gradient average, for the FP calculation of the (j+1)th iteration it also needs to correct the parameter matrix of the (j+1)th iteration according to the gradient average and store the corrected parameter matrix in the parameter storage space of the memory 243, for use by the model training server 210, the model training server 220, and the model training server 230 in the (j+1)th round of training.
- Alternatively, the iterative processor 212 in the model training server 210 may calculate the parameter matrix. In the FP calculation of the (j+1)th iteration, the iterative processor 212 pulls the calculated gradient average from the memory 243, calculates the parameter matrix of the (j+1)th iteration according to the gradient average, and stores it in the memory 243 so that the model training server 210 can use it in the (j+1)th round of training.
- the parameter server 240 may further include an iterative processor 242, and the deep learning model 100 may run in the iterative processor 242.
- the iterative processor 242 may be an NPU or a GPU.
- The iterative processor 242 may also calculate the gradient average from the gradients of the weights sent by the model training server 210, the model training server 220, and the model training server 230 respectively, and store the calculated gradient average in the memory 243. The iterative processor 242 can also calculate the parameter matrix of the (j+1)th iteration according to the gradient average and store it in the memory 243, so that the model training server 210, the model training server 220, and the model training server 230 can use it in the (j+1)th round of training.
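- The update rule that turns a gradient average into the parameter matrix of the (j+1)th iteration does not survive in the text above. The following is a minimal sketch that assumes a plain gradient-descent step; the learning rate `lr` and the function name `apply_gradient_average` are illustrative assumptions and are not taken from the application.

```python
from typing import List

Matrix = List[List[float]]


def apply_gradient_average(weights: Matrix, grad_avg: Matrix, lr: float = 0.1) -> Matrix:
    """Return the parameter matrix for iteration j+1.

    Assumption: plain gradient descent, W_(j+1) = W_j - lr * grad_avg.
    The source only states that the matrix is calculated "according to"
    the gradient average, so the exact rule here is hypothetical.
    """
    if len(weights) != len(grad_avg):
        raise ValueError("weights and grad_avg must have the same shape")
    updated: Matrix = []
    for w_row, g_row in zip(weights, grad_avg):
        if len(w_row) != len(g_row):
            raise ValueError("weights and grad_avg must have the same shape")
        updated.append([w - lr * g for w, g in zip(w_row, g_row)])
    return updated


# One layer with a 2x2 parameter matrix and its averaged gradient.
w_next = apply_gradient_average(
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 0.0], [0.0, 2.0]],
    lr=0.5,
)
print(w_next)
```

- Whether this step runs in the parameter server or in each model training server only changes where `w_next` is stored; the arithmetic is the same in both variants described above.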
- the system 200 may further include a cloud storage 250.
- the cloud storage 250 can be used as an external storage, and users can store program codes and training data on the external storage.
- Taking the model training server 210 as an example, during the operation of the at least one processor, the program code and data stored in the cloud storage 250 may first be stored in the memory 213, so that the at least one processor can obtain the program code and training data from the memory 213 and train the deep learning model 100 according to them.
- The data stored in the cloud storage 250 may include the training data, the prior knowledge corresponding to the training data, the initial values of the parameter matrices corresponding to the neurons of each layer in the hidden layer 120 of each deep learning model 100, etc.
- The training process of the jth iteration and the (j+1)th iteration of the deep learning model 100 is taken as an example.
- In the model training server 210, at least one data processor 211 can push the gradient corresponding to the neurons of the i-th hidden layer, calculated by the at least one iterative processor 212, to the memory 243 of the parameter server 240.
- Likewise, in the BP calculation, at least one data processor in the model training server 220 can push its calculated gradient to the memory 243 in the parameter server 240, and at least one data processor in the model training server 230 can push its calculated gradient to the memory 243 of the parameter server 240.
- At least one iterative processor 242 in the parameter server 240 can obtain the stored gradients from the memory 243, calculate the gradient average from them, and store the gradient average in the parameter storage space of the memory 243 for use by the model training server 210, the model training server 220, and the model training server 230 in the (j+1)th round of training. For the specific calculation process, please refer to the embodiment corresponding to FIG. 2.
- the parameters stored in the parameter storage space of the memory 243 include:
- At least one iterative processor 242 may also obtain the stored data from the memory 243, calculate the parameter matrix of the (j+1)th iteration according to it, and store the parameter matrix in the memory 243, so that the model training server 210 performs the BP calculation according to it in the (j+1)th round of training. Therefore, in some embodiments, the parameter storage space of the memory 243 also stores:
- Multiple model training servers can obtain the stored parameters from the parameter server and calculate the predicted output value from the input value (training data) and the parameter matrix.
- In the BP calculation, the model training server 210 pulls the stored data from the memory 243 in the parameter server 240, calculates according to it the parameter matrix corresponding to the neurons of the i-th hidden layer in the (j+1)th iteration, and calculates the predicted output value from the input value and the parameter matrix.
- The model training server 220 pulls the stored data from the parameter server 240 in the BP calculation.
- The model training server 230 pulls the stored data from the parameter server 240 in the BP calculation.
- The memory 243 in the parameter server 240 stores the data, and the model training server 210, the model training server 220, and the model training server 230 can pull the stored data from the parameter server 240 in the BP calculation.
- The following describes in detail the scenario of training the deep learning model 100 in which the distributed training system includes one model training server, and the model training server includes at least one processor that can be used as a computing node.
- FIG. 4 is a schematic structural diagram of a distributed training system 400 of a deep learning model 100 provided by an embodiment of the present application.
- the distributed training system 400 may include: a model training server 410.
- the model training server 410 may include: at least one processor, a memory 414, an input/output interface 415, a communication interface 416, and a bus 417.
- At least one processor may be connected to the memory 414.
- the memory 414 can be used to store the program code and training data.
- the at least one processor can obtain program codes and training data from the memory 414, and train the deep learning model 100.
- At least one processor may include two types of processors.
- One type of processor includes at least one data processor 411, and the other type of processor includes at least one iterative processor.
- the data processor 411 may be a CPU
- the iterative processor may be an NPU or a GPU.
- The model training server 410 may include at least one iterative processor.
- The iterative processor 412 and the iterative processor 413 are used as examples for illustration.
- The iterative processor 412 and the iterative processor 413 each run the deep learning model 100; in the BP calculation of the deep learning model 100, each can calculate the gradient of the weights in the parameter matrix of the neurons of each layer and store the calculated gradient in the memory 414 via the bus 417.
- the iterative processor 412 may also run a feedback module 4121, and similarly, the iterative processor 413 may also run a feedback module 4131.
- For the feedback modules running in the iterative processor 412 and the iterative processor 413, please refer to the description of FIG. 6, which will not be repeated here.
- At least one data processor 411 may run a gradient communication module 4111, a gradient update module 4112, and a correction module 4113.
- For specific information about each module running in the data processor 411, please refer to the description of FIG. 6, which will not be repeated here.
- The at least one data processor 411 can also obtain, from the memory 414 through the bus 417, the gradients stored by the iterative processor 412 and the iterative processor 413, calculate the gradient average from the multiple gradients calculated by the iterative processor 412 and the iterative processor 413, and store the gradient average in the memory 414 through the bus 417 so that the iterative processor 412 and the iterative processor 413 can use it in the (j+1)th round of training.
- The iterative processor 412 can calculate the gradient corresponding to each layer of the deep learning model 100 and can store the calculated gradient corresponding to the neurons of the i-th hidden layer in the parameter storage space of the memory 414 via the bus 417. In the same way, the iterative processor 413 can also store its calculated gradient corresponding to the neurons of the i-th hidden layer in the parameter storage space of the memory 414 via the bus 417.
- The data processor 411 can obtain the stored gradients from the parameter storage space of the memory 414 via the bus 417, calculate from them the gradient average for the neurons of the i-th hidden layer, and store the gradient average in the parameter storage space of the memory 414 via the bus 417.
- The iterative processor 412 and the iterative processor 413 may each obtain the gradient average from the parameter storage space of the memory 414 through the bus 417, calculate the parameter matrix of the (j+1)th iteration based on the gradient average, and perform the FP calculation according to the modified parameter matrix.
- The data processor 411 may also obtain the stored gradient average from the parameter storage space of the memory 414 through the bus 417, calculate the parameter matrix of the (j+1)th iteration based on the gradient average, and store the parameter matrix in the parameter storage space of the memory 414, so that in the (j+1)th iteration the iterative processor 412 and the iterative processor 413 can each obtain the parameter matrix from the parameter storage space of the memory 414 through the bus 417 and perform the FP calculation.
- the distributed training system 400 further includes a cloud storage 420, where the cloud storage 420 is connected to the model training server 410.
- the cloud storage 420 can be used as an external storage, and the user can store program codes and training data on the external storage.
- The program code and training data stored in the cloud storage 420 may first be stored in the memory 414, so that the at least one processor can obtain the program code and training data from the memory 414 and iteratively train the deep learning model 100 according to them.
- In the BP calculation, each deep learning model calculates the gradient of each layer of neurons in the direction from the nth layer of neurons to the first layer of neurons, and sends the calculated gradient of each layer of neurons to the parameter storage space.
- The closer a layer of neurons is to the output layer of the deep learning model, the larger the dimension of its parameter matrix, the larger the gradient corresponding to that parameter matrix, and the longer it takes to send the gradient to the parameter storage space.
- The deep learning model obtains the stored average values from the parameter storage space in the direction from the first layer of neurons to the nth layer of neurons.
- Therefore, the deep learning model needs to wait until the gradient corresponding to the parameter matrix of the first layer of neurons has been transferred to the parameter storage space before starting the FP calculation of the next iteration. If, in the BP calculation of the current iteration, the gradients are sent to the parameter storage space in the order in which they are generated (from the nth layer of neurons to the first layer of neurons), the deep learning model waits a long time before starting the next iteration, and its iterative training efficiency is low.
- The training method for a deep learning model proposed in the embodiments of this application can adjust the order in which the gradients calculated by BP during the current iteration are transmitted to the parameter storage space, which reduces the communication time of each iteration of the deep learning model and improves the efficiency of deep learning model training.
- FIG. 5 is a schematic flowchart of a method for training a deep learning model provided by an embodiment of the present application.
- the method may include steps 510-550, and steps 510-550 will be described in detail below.
- Step 510 N deep learning models respectively generate N first gradient sets in the BP calculation of the jth iteration.
- the training system in the embodiment of the present application may include N deep learning models, where N is a positive integer greater than zero.
- The training process of each deep learning model may include multiple iterations, and the training process of each iteration may include FP calculation and BP calculation.
- The training system may be the distributed training system 200 shown in FIG. 2 or the distributed training system 400 shown in FIG. 4.
- In the BP calculation, each deep learning model calculates the gradient corresponding to each layer of neurons in the direction from the nth layer of neurons to the first layer of neurons, forming the first gradient set, where i, the index of a layer of neurons, is a positive integer greater than 0 and less than or equal to n.
- Any one of the first gradient sets may include the gradient of the parameter matrix corresponding to each of the n layers of neurons of the deep learning model; that is, it includes n gradients.
- Step 520 Determine a gradient adjustment strategy according to the training meta information.
- The adjustment sub-module in the gradient communication module may determine the gradient adjustment strategy according to the parameters included in the training meta-information input by the user, so that the communication sub-module in the gradient communication module can send the gradients included in the N first gradient sets to the parameter storage space based on the determined gradient adjustment strategy.
- The training meta-information may include any of the following parameters: the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model, and the time required by each layer of neurons of the deep learning model in the FP calculation.
- The gradient communication strategy may be determined according to at least one of the above parameters.
- The gradient communication strategy includes the order in which each gradient in the first gradient set is transmitted to the parameter storage area. A detailed description is given below in conjunction with FIG. 6 and FIG. 7 and is not repeated here.
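- The text does not state how the three kinds of training meta-information are combined into a sending order. The following is a minimal sketch of one possible rule, written purely as an assumption: a layer's gradient is deferred until after the first layer's gradient only if, sent back to back with the other deferred gradients, it would still arrive before that layer is reached in the next FP calculation. The function name `choose_send_order` and the greedy rule are hypothetical.

```python
from typing import Dict, List


def choose_send_order(bandwidth: float,
                      grad_size: Dict[int, float],
                      fp_time: Dict[int, float]) -> List[int]:
    """Return layer indices in the order their gradients should be sent.

    Hypothetical heuristic, not the application's rule: a gradient is deferred
    past layer 1 only if it can still arrive before its own layer's FP step.
    Time is measured from the moment the next FP calculation starts, which
    assumes the channel is free once layer 1's gradient has arrived.
    """
    n = len(grad_size)
    comm = {k: grad_size[k] / bandwidth for k in grad_size}

    # fp_start[k]: time after the next FP begins at which layer k is reached.
    fp_start: Dict[int, float] = {}
    elapsed = 0.0
    for k in range(1, n + 1):
        fp_start[k] = elapsed
        elapsed += fp_time[k]

    def feasible(deferred: List[int]) -> bool:
        # Deferred gradients are sent back to back, lowest layer first.
        sent = 0.0
        for k in sorted(deferred):
            sent += comm[k]
            if sent > fp_start[k]:
                return False
        return True

    deferred: List[int] = []
    for k in range(n, 1, -1):  # layer 1 is never deferred
        if feasible(deferred + [k]):
            deferred.append(k)

    # Non-deferred layers keep BP generation order (n down to 1).
    prompt = [k for k in range(n, 0, -1) if k not in deferred]
    return prompt + sorted(deferred)


order = choose_send_order(
    bandwidth=100.0,
    grad_size={1: 100.0, 2: 100.0, 3: 200.0, 4: 400.0},
    fp_time={1: 2.0, 2: 2.0, 3: 2.0, 4: 2.0},
)
print(order)
```

- With these made-up numbers the sketch returns `[2, 1, 3, 4]`: the two largest gradients (layers 3 and 4) are moved behind layer 1, while layer 2 stays ahead because deferring it as well would make layer 4 arrive too late.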
- Step 510 can be performed first and then step 520, or step 520 can be performed first and then step 510, or step 510 and step 520 can be performed simultaneously. This embodiment of the application does not specifically limit this.
- Step 530: In the process of generating the first gradient set, adjust the order in which the gradient corresponding to the parameter matrix of the a-th layer of neurons is sent to the parameter storage space.
- Specifically, the gradient corresponding to the parameter matrix of the a-th layer of neurons is adjusted to be sent to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons, where b is less than or equal to n, a is less than b, and a is a positive integer greater than zero.
- For example, suppose the gradient communication strategy indicates that the gradient of one layer is to be sent to the parameter storage space after the gradients of several layers that BP generates later. When that gradient is generated, it is held back rather than transmitted; each of the later gradients is transmitted once it has been generated, and so on, until the last of them has been generated and transferred to the parameter storage space, after which the previously generated gradient is sent to the parameter storage space. Therefore, adjusting the transmission order of the gradients included in the first gradient set does not require waiting for all the gradients in the first gradient set to be generated.
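- The hold-back behaviour described above can be sketched as a small generator. This is an illustration under assumptions, not the application's implementation: the name `transmit_in_strategy_order`, the representation of the strategy as a list of layer indices, and the string placeholders standing in for gradients are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def transmit_in_strategy_order(
    generated: Iterable[Tuple[int, object]],
    send_order: List[int],
) -> Iterator[Tuple[int, object]]:
    """Yield (layer, gradient) pairs in strategy order as gradients appear.

    `generated` yields gradients in BP order (layer n down to layer 1).
    A gradient is buffered until every layer scheduled ahead of it has been
    transmitted, so sending starts before BP has finished.
    """
    pending: Dict[int, object] = {}
    next_pos = 0
    for layer, grad in generated:
        pending[layer] = grad
        # Flush everything that is now allowed to go out.
        while next_pos < len(send_order) and send_order[next_pos] in pending:
            nxt = send_order[next_pos]
            yield nxt, pending.pop(nxt)
            next_pos += 1


log: List[str] = []


def bp_gradients() -> Iterator[Tuple[int, str]]:
    # Stand-in for BP on a 4-layer model: layer 4 first, layer 1 last.
    for layer in (4, 3, 2, 1):
        log.append(f"generated {layer}")
        yield layer, f"grad{layer}"


# Assumed strategy: hold layer 4 back until layer 1 has been sent.
for layer, _grad in transmit_in_strategy_order(bp_gradients(), [3, 2, 1, 4]):
    log.append(f"sent {layer}")

print(log)
```

- The log interleaves generation and sending: layer 3 is sent before layer 2 has even been generated, and layer 4, although generated first, goes out last.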
- The deep learning module of the deep learning model 100 can generate the first gradient set in the BP calculation of the jth iteration and, in the process of generating the first gradient set, store the calculated gradient corresponding to the parameter matrix of each layer of neurons in the memory 213.
- The gradient communication module in the data processor 211 is used to determine a gradient communication strategy.
- The gradient communication strategy indicates the communication sequence of the gradients included in each first gradient set.
- Based on the determined gradient communication strategy, the gradient communication module adjusts the gradient corresponding to the parameter matrix of the a-th layer of neurons, included in the first gradient set stored in the memory 213, to be sent to the parameter storage space in the memory 243 of the parameter server 240 before the gradient corresponding to the parameter matrix of the b-th layer of neurons.
- The deep learning module of the deep learning model 100 can generate the first gradient set in the BP calculation of the jth iteration and, in the process of generating the first gradient set, store the calculated gradient corresponding to the parameter matrix of each layer of neurons in the memory 414.
- Based on the determined gradient communication strategy, the gradient communication module in the data processor 411 adjusts the gradient corresponding to the parameter matrix of the a-th layer of neurons, included in the first gradient set stored in the memory 414, to be sent to the parameter storage space of the memory 414 before the gradient corresponding to the parameter matrix of the b-th layer of neurons.
- the gradient communication module may be a collection of gradient communication modules of each model training server in at least one model training server in the system 200.
- the gradient communication module may include two sub-modules, one is an adjustment sub-module for determining a gradient adjustment strategy.
- the other is a communication sub-module, which is used to send the gradients included in the N first gradient sets to the parameter storage space according to the gradient adjustment strategy.
- Step 540 In the j+1th iteration, each deep learning model obtains the second gradient set from the parameter storage space.
- the gradient update module 2411 in the data processor 241 in the parameter server 240 obtains the second gradient set according to the N first gradient sets stored in the parameter storage space of the memory 243.
- The gradient update module 4112 in the data processor 411 in the model training server 410 obtains the second gradient set according to the N first gradient sets stored in the parameter storage space of the memory 414.
- Specifically, a weighted average can be calculated separately for the gradients of each layer of neurons included in the N first gradient sets, yielding the gradient average corresponding to the parameter matrix of each layer of neurons of the N deep learning models. The gradient averages corresponding to the parameter matrices of all layers of neurons form the second gradient set; that is, the second gradient set includes the gradient average corresponding to the parameter matrix of each layer of neurons of the N deep learning models.
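- The per-layer averaging that produces the second gradient set can be sketched as follows. The representation of a gradient set as a dictionary from layer index to a flattened list of floats, the optional `weights` argument, and the name `build_second_gradient_set` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence

GradientSet = Dict[int, List[float]]  # layer index -> flattened gradient


def build_second_gradient_set(
    first_sets: Sequence[GradientSet],
    weights: Optional[Sequence[float]] = None,
) -> GradientSet:
    """Average the N first gradient sets layer by layer.

    With weights=None every model contributes equally; otherwise weights[m]
    is the (assumed) weight of model m in the weighted average.
    """
    if not first_sets:
        raise ValueError("at least one first gradient set is required")
    if weights is None:
        weights = [1.0] * len(first_sets)
    if len(weights) != len(first_sets):
        raise ValueError("one weight per first gradient set is required")
    total = float(sum(weights))
    if total == 0.0:
        raise ValueError("weights must not sum to zero")

    second: GradientSet = {}
    for layer, reference in first_sets[0].items():
        acc = [0.0] * len(reference)
        for w, grad_set in zip(weights, first_sets):
            for k, g in enumerate(grad_set[layer]):
                acc[k] += w * g
        second[layer] = [a / total for a in acc]
    return second


# N = 3 models, n = 2 layers.
second_set = build_second_gradient_set([
    {1: [1.0, 2.0], 2: [0.0]},
    {1: [3.0, 4.0], 2: [3.0]},
    {1: [5.0, 6.0], 2: [6.0]},
])
print(second_set)
```

- Because each layer is averaged independently, the averaging for a layer can begin as soon as all N gradients for that layer have arrived, regardless of the order in which the other layers are transmitted.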
- Step 550 In the j+1th iteration of each deep learning model, perform FP calculation according to the second gradient set.
- The model training system may include a correction module, which corrects the parameter matrix of each layer of neurons of any deep learning model according to the gradients included in the second gradient set, for use in the FP calculation of the (j+1)th iteration of any deep learning model of the training system.
- the correction module may be in the data processor 241 of the parameter server 240 or in the data processor 211 of the model training server 210.
- the data processor 211 of the model training server 210 includes a correction module 2112.
- The gradient communication module 2111 of the model training server 210 can obtain the second gradient set from the parameter storage space in the memory 243, and the parameter matrix of each layer of neurons of the deep learning model is corrected according to the gradient average of the parameter matrix corresponding to each layer of neurons in the second gradient set, so that in the next iteration of the BP calculation the input and output corresponding to each layer of neurons are calculated according to the corrected parameter matrix.
- The data processor 241 in the parameter server 240 includes a correction module 2412.
- The correction module 2412 can obtain the second gradient set from the parameter storage space in the memory 243, correct the parameter matrix of each layer of neurons of the deep learning model according to the gradient average of the parameter matrix corresponding to each layer of neurons in the second gradient set, and store the set of corrected parameter matrices in the parameter storage space of the memory 243, so that in the next iteration of the BP calculation the gradient communication module 2111 of the model training server 210 can obtain the set of corrected parameter matrices from the parameter storage space in the memory 243 and calculate the input and output corresponding to each layer of neurons according to it.
- the correction module may be a collection of correction modules of each model training server in at least one model training server in the system 200.
- the data processor 411 in the model training server 410 may include a correction module 4113.
- The correction module 4113 may obtain the second gradient set from the parameter storage space in the memory 414, correct the parameter matrix of each layer of neurons of the deep learning model according to the gradient average of the parameter matrix corresponding to each layer of neurons in the second gradient set, and store the set of corrected parameter matrices in the parameter storage space of the memory 414, so that in the next iteration of the BP calculation the gradient communication module 4111 of the model training server 410 can obtain the set of corrected parameter matrices from the parameter storage space in the memory 414 and calculate the input and output corresponding to each layer of neurons according to it.
- By adjusting the order in which the gradients calculated by BP during the current iteration are transmitted to the parameter storage space, the communication time of the iteration can be reduced and the efficiency of model training improved, without affecting the training convergence accuracy of the deep learning model.
- The embodiments of the present application do not specifically limit the gradient communication strategy used in the iterative process of each deep learning model. It can be set according to empirical rules, or it can be a gradient communication strategy obtained by other methods, such as an intelligent gradient communication strategy based on reinforcement learning. A specific implementation of adjusting the communication sequence of the n gradients included in the first gradient set is described in detail below in conjunction with FIG. 6.
- Fig. 6 is a schematic diagram of the architecture of a training system for a deep learning model provided by an embodiment of the present application.
- the system architecture may include a user side and a cloud platform side.
- the user side can input at least one of the deep learning model 100, training meta information 660, and training data 670 to the cloud platform side through an interface.
- The training meta-information 660 may include the communication bandwidth between the deep learning model 100 and the parameter storage space 640, the size of the gradient corresponding to the parameter matrix of each layer of neurons in the deep learning model 100, and the time required by each layer of neurons of the deep learning model 100 in the FP calculation.
- the training data 670 may include training data as input and prediction results corresponding to the training data provided by people.
- the deep learning model 100 can be sent to the cloud platform side via an interface from the user side, or can be a model stored on the cloud platform side, which is not specifically limited in this application.
- the cloud platform side may include: a gradient communication module 620, a local memory 630, a parameter storage space 640, and a deep learning model 100.
- the cloud platform side may further include cloud storage 610.
- the cloud storage 610 may store the deep learning model 100, training meta information 660, and training data 670 sent by the user side.
- the cloud platform side may further include a feedback module 650.
- the gradient communication module 620 on the cloud platform side may include an adjustment sub-module 621 and a communication sub-module 622.
- the adjustment sub-module 621 may be used to execute the method in step 520
- the communication sub-module 622 may be used to execute the method in step 530.
- the feedback module 650 on the cloud platform side may be used to execute the method in step 540.
- The cloud platform side may correspond to the distributed training system 200 shown in FIG. 2 or the distributed training system 400 shown in FIG. 4.
- The gradient communication module 620 shown in FIG. 6 corresponds to the set consisting of the gradient communication module 2111 in the model training server 210 and the gradient communication modules running in the model training server 220 and the model training server 230 in FIG. 2.
- The feedback module 650 corresponds to the set consisting of the feedback module 2121 in the model training server 210 and the feedback modules running in the model training server 220 and the model training server 230 in FIG. 2.
- Alternatively, the feedback module 650 corresponds to the set of the feedback module 4121 in the iterative processor 412 and the feedback module 4131 in the iterative processor 413 in FIG. 4.
- the adjustment sub-module 621 may determine the gradient communication strategy according to the training meta-information 660. For details, please refer to the description in step 520, which will not be repeated here.
- the deep learning module in the deep learning model 100 obtains training data 670 from the cloud storage 610, and starts iterative training of the model according to the training data 670.
- the communication sub-module 622 may send the gradients stored in the local memory 630 to the parameter storage area 640 in the adjusted order according to the gradient communication strategy determined by the adjustment sub-module 621.
- When the deep learning model 100 starts the FP calculation of the next iteration, it can obtain the stored data from the parameter storage area 640 and start the FP calculation based on that data.
- For details, please refer to the description of step 530, which will not be repeated here.
- The local memory 630 is a collection of the memory 213 in the model training server 210 and the memories in the model training server 220 and the model training server 230 in the distributed training system 200.
- The parameter storage area 640 corresponds to the memory 243 in the parameter server 240 in the distributed training system 200.
- the feedback module 650 may obtain the iteration time of the deep learning model 100, and the iteration time may be the BP calculation time of the deep learning model 100 in this iteration and the FP calculation time of the next iteration.
- the feedback module 650 may obtain the BP calculation time of the Lth iteration and the FP calculation time of the L+1th iteration of the deep learning model 100, where L is a positive integer greater than j.
- the feedback module 650 can feed back the iteration time of the deep learning model 100 to the adjustment sub-module 621 in the gradient communication module 620.
- After receiving the iteration time of the deep learning model 100 fed back by the feedback module 650, the adjustment sub-module 621 may adjust the determined gradient communication strategy so that subsequent training iterations run faster.
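- The feedback loop between the feedback module and the adjustment sub-module can be sketched as a simple try-each-then-keep-the-fastest tuner. The application leaves the adjustment rule open, so the class `StrategyTuner`, its methods, and the selection rule are assumptions; the iteration times fed in below are made-up numbers.

```python
from typing import Dict, List, Sequence


class StrategyTuner:
    """Pick a gradient sending order from measured iteration times.

    Hypothetical rule: try every candidate order once, then keep using the
    one with the lowest most recently measured iteration time.
    """

    def __init__(self, candidates: Sequence[Sequence[int]]) -> None:
        if not candidates:
            raise ValueError("at least one candidate strategy is required")
        self.candidates: List[List[int]] = [list(c) for c in candidates]
        self.times: Dict[int, float] = {}  # candidate index -> last measured time
        self.current = 0

    def strategy(self) -> List[int]:
        """Sending order to use for the next iteration."""
        return self.candidates[self.current]

    def feedback(self, iteration_time: float) -> None:
        """Record the time fed back for the strategy that was just used."""
        self.times[self.current] = iteration_time
        untried = [i for i in range(len(self.candidates)) if i not in self.times]
        if untried:
            self.current = untried[0]
        else:
            self.current = min(self.times, key=lambda i: self.times[i])


tuner = StrategyTuner([[3, 2, 1], [2, 1, 3]])
used = []
for measured in (9.0, 5.0, 5.5):  # illustrative iteration times
    used.append(tuner.strategy())
    tuner.feedback(measured)

print(used, tuner.strategy())
```

- Here the default order is tried first, the reordered strategy second, and the tuner then stays with the reordered strategy because its measured iteration time is lower.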
- Fig. 7 is a schematic flowchart of a method for training a deep learning model provided by an embodiment of the present application.
- the method shown in Fig. 7 may include steps 710-730, and steps 710-730 will be described in detail below.
- Step 710 Initialize the gradient communication strategy of the deep learning model.
- The gradient corresponding to the FC layer neurons in the last layer of the model is adjusted to be transferred to the parameter storage space after the gradient corresponding to the parameter matrix of the first layer of neurons has been sent to the parameter storage space.
- Once the gradient corresponding to the parameter matrix of the first layer of neurons has been transferred to the parameter storage space, the FP calculation of the next iteration can start, so the gradient corresponding to the parameter matrix of the last layer of neurons can be transmitted to the parameter storage space during the FP calculation of the next iteration.
- (a) in Fig. 8 shows at least one deep learning model 100 transmitting the gradients in the first gradient set calculated by the BP of the j-th iteration to the parameter storage space in sequence, from the n-th layer of neurons to the first layer.
- t_n represents the communication time required to transmit the gradient corresponding to the n-th layer of neurons calculated by BP to the parameter storage space.
- t_{n-1} represents the communication time required to transmit the gradient corresponding to the (n-1)-th layer of neurons calculated by BP to the parameter storage space.
- t_1 represents the communication time required to transmit the gradient corresponding to the first layer of neurons calculated by BP to the parameter storage space.
- (b) in Fig. 8 shows at least one deep learning model 100 in which, while BP calculates the corresponding gradients from the n-th layer of neurons to the first layer, the gradient corresponding to the parameter matrix of the n-th layer of neurons is rescheduled, according to the communication scheduling strategy, to be sent to the parameter storage space after the gradient corresponding to the parameter matrix of the first-layer neurons has been sent.
- the time for triggering the deep learning model to perform the next iteration of the FP calculation is the time when the gradient corresponding to the neuron of the first layer in the j-th iteration of the BP calculation is transmitted to the parameter storage space.
- in general, the time required for BP to calculate a gradient is lower than the time required to transmit the calculated gradient to the parameter storage space. If the BP calculation takes longer than the transmission, the time from the start of the BP calculation of the j-th iteration to triggering the FP calculation of the (j+1)-th iteration will be longer than T_1 or T_2.
- the method for training a deep learning model proposed in the embodiments can adjust the order in which the gradients calculated by BP during this iteration are transmitted to the parameter storage space, thereby reducing the communication time of this iteration and improving the efficiency of model training.
- Step 720 The deep learning model performs iterative training.
- the above-mentioned initial gradient communication strategy is written into the adjustment sub-module 621 in the gradient communication module 620 of FIG. 6.
- the communication submodule 622 sends the gradients in the first gradient set stored in the local memory 630 to the parameter storage space 640 according to the gradient communication strategy determined by the adjustment submodule 621.
- the feedback module 650 may obtain the iteration time of the deep learning model 100 from the deep learning module of the deep learning model 100, and the iteration time may be the BP calculation time of the deep learning model 100 in this iteration and the FP calculation time of the next iteration. And the iteration time of the deep learning model 100 can be fed back to the adjustment sub-module 621 in the gradient communication module 620.
- Step 730 Tuning the gradient communication strategy in the deep learning model.
- the adjustment sub-module 621 may adjust the determined gradient communication strategy after receiving the iteration time of the deep learning model 100 fed back by the feedback module 650, so that the subsequent iteration training speed is faster. After multiple iterations and adjustments of the gradient communication strategy, until the optimal gradient communication strategy is found, then this optimal gradient communication strategy is used in the subsequent training steps to perform iterative training of the deep learning model 100.
- the training speed of the deep learning model is improved only by adjusting the communication sequence of the gradient without affecting the training convergence accuracy of the deep learning model.
- current deep learning models basically all rely on the BP algorithm (including the backpropagation through time, BPTT, algorithm)
- most open source deep learning engines (such as TensorFlow) are also implemented on the basis of the BP algorithm
- the method for accelerating deep learning model training provided by the embodiments of the present application has wide application value.
- the above embodiments can be implemented in whole or in part by software, hardware, firmware, or any other combination.
- the above-described embodiments may be fully or partially implemented in the form of computer program products.
- the computer program product includes one or more computer instructions.
- the processes or functions described in the embodiments of the present application are generated in whole or in part.
- the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that contains one or more collections of available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
- the semiconductor medium may be a solid state drive (SSD).
- the embodiments of the present application also provide a computer program product, the computer program product includes: program instructions, when the program instructions run on a computer, the computer executes the methods in the above aspects.
- the embodiments of the present application also provide a computer-readable medium, the computer-readable medium stores program instructions, and when the program instructions are run on a computer, the computer executes the methods in the foregoing aspects.
- computer-readable media may include, but are not limited to: magnetic storage devices (e.g., hard disks, floppy disks, or magnetic tapes), optical discs (e.g., compact discs (CD) and digital versatile discs (DVD)), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), cards, sticks, or key drives).
- various storage media described herein may represent one or more devices and/or other machine-readable media for storing information.
- machine-readable medium may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
- the disclosed system, device, and method may be implemented in other ways.
- the device embodiments described above are only schematic.
- the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical, or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the technical solution of the present application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product
- the computer software product is stored in a storage medium and includes several instructions used to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
- the foregoing storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Image Analysis (AREA)
- Machine Translation (AREA)
- Debugging And Monitoring (AREA)
Abstract
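A training method for a deep learning model, the method comprising: generating N first gradient sets in the BP calculation of the j-th iteration of N deep learning models; adjusting the communication order of the gradients included in each first gradient set, so that the gradients are not sent to the parameter storage space in the order in which they were generated; and sending the gradients included in the N first gradient sets to the parameter storage space according to the adjusted communication order. By adjusting the order in which the gradients obtained in the current iteration are transmitted to the parameter storage space, the method improves the training efficiency of the deep learning model.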
Description
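This application claims priority to Chinese patent application No. 201910041235.8, filed with the China Patent Office on January 16, 2019 and entitled "A distributed training method and system for a deep learning model", which is incorporated herein by reference in its entirety.
This application relates to the field of artificial intelligence and, more specifically, to a training method for a deep learning model and a system that performs the training method.
Artificial intelligence (AI) is the theory, methods, technology, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, AI is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. AI also studies the design principles and implementation methods of intelligent machines so that machines can perceive, reason, and make decisions. Research in AI includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
In the AI field, deep learning is a learning technique based on deep neural network algorithms. A deep learning model includes a forward propagation (FP) calculation and a back propagation (BP) calculation. The FP calculation computes the output of each layer of neurons from the parameter matrix corresponding to that layer. The BP calculation computes the gradient corresponding to each layer of neurons from the error between the predicted value produced by the FP calculation and the prior knowledge, so that in the FP calculation of the next iteration the parameter matrix of each layer can be corrected according to the gradient computed by BP.
Because training data is usually very large, deep learning models are generally trained in a distributed manner, with multiple deep learning models jointly completing training on the training data. The models therefore need to synchronize the gradients generated by each BP calculation in order to train synchronously. The gradient synchronization method used in traditional distributed training makes training inefficient.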
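Summary of the Invention
This application provides a training method for a deep learning model that improves training efficiency by adjusting the order in which the gradients obtained by the BP calculation of the current iteration are transmitted to the parameter storage space.
In a first aspect, a training method for a deep learning model is provided. The method is applied to a training system that includes N deep learning models, each of which includes n layers of neurons. The training process of each deep learning model includes multiple iterations, and each iteration includes a forward propagation (FP) calculation and a back propagation (BP) calculation, where N is a positive integer greater than 1 and n is a positive integer greater than 1. The method includes: generating N first gradient sets in the BP calculation of the j-th iteration of the N deep learning models; adjusting, while the N first gradient sets are being generated, the communication order of the gradients included in each first gradient set; sending the gradients included in the N first gradient sets to the parameter storage space of the training system according to the adjusted communication order; obtaining a second gradient set from the N first gradient sets stored in the parameter storage space; and correcting the parameter matrix of each layer of neurons of each deep learning model according to the gradients included in the second gradient set, for use in the FP calculation of the (j+1)-th iteration of each deep learning model.
It should be understood that each first gradient set includes the gradients corresponding to the parameter matrix of every layer of neurons of one deep learning model, where j is a positive integer greater than 0.
It should also be understood that, after the gradients included in the N first gradient sets have been sent to the parameter storage space of the training system in the adjusted communication order, the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models can be computed.
In one possible implementation, a weighted average is computed over the gradients of each layer of neurons included in the N first gradient sets, giving the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models. These per-layer average gradients form the second gradient set. In other words, the second gradient set includes the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models.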
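In one possible implementation, the gradient corresponding to the parameter matrix of the a-th layer of neurons is rescheduled so that it is sent to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons, where b is less than or equal to n, a is less than b, and a is a positive integer greater than 0.
In the above technical solution, sending the gradient of the a-th layer before the gradient of the b-th layer can reduce the time difference between the end of the BP calculation of the current iteration and the start of the FP calculation of the next iteration, thereby reducing the iteration time of the deep learning model.
In another possible implementation, the communication order of the gradients included in each first gradient set may be adjusted according to a gradient communication strategy, where the gradient communication strategy is set according to at least one of the following parameters: the communication bandwidth between the deep learning model and the parameter storage space; the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model; and the time each layer of neurons of the deep learning model requires in the FP calculation.
It should be noted that the deep learning model is any one or more of the N deep learning models.
Specifically, before the sending order of the gradient of the a-th layer is adjusted, the gradient communication strategy may first be computed from: the communication bandwidth between the deep learning model and the parameter storage space; the size of the gradient corresponding to the parameter matrix of the b-th layer of neurons; and the time between the moment the gradient of the b-th layer starts to be sent to the parameter storage space and the moment the FP calculation of the (b-1)-th layer in the (j+1)-th iteration completes. The gradient of the a-th layer is then rescheduled, according to the gradient communication strategy, to be sent to the parameter storage space before the gradient of the b-th layer.
It should be noted that the gradient communication strategy includes the order in which the gradients in the first gradient set are transmitted to the parameter storage space.
In the above technical solution, the gradient communication strategy can be determined from the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of each layer of neurons, and the time each layer of neurons requires in the FP calculation. The communication order of the gradients in the first gradient set can then be adjusted according to the optimal gradient communication strategy, which makes subsequent training iterations faster and improves the training efficiency of the deep learning model.
In one possible implementation, the gradient of the a-th layer is rescheduled to be sent before the gradient of the b-th layer so that, as far as possible, the gradient of the b-th layer reaches the parameter storage space before the (b-1)-th layer completes its FP calculation in the (j+1)-th iteration.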
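The timing condition above can be turned into a small planning routine. The following is a minimal sketch, not the patented procedure: `LayerInfo`, `plan_send_order`, and the greedy rule are hypothetical, and the sketch assumes that the next FP calculation starts as soon as the layer-1 gradient has been sent and that deferred gradients are transmitted back to back from that moment, lowest layer first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerInfo:
    grad_bytes: float   # size of this layer's gradient
    fp_seconds: float   # time this layer needs in the FP calculation


def plan_send_order(layers: List[LayerInfo], bandwidth: float) -> List[int]:
    """Return 1-based layer indices in the order their gradients are sent.

    Assumed greedy rule: layer b (b > 1) is deferred until after layer 1
    if its transfer would still finish before layers 1..b-1 complete
    their FP calculation in the next iteration.
    """
    n = len(layers)
    deferred: List[int] = []
    queued = 0.0       # transfer time already queued behind layer 1
    fp_elapsed = 0.0   # FP time of layers 1..b-1 in the next iteration
    for b in range(2, n + 1):
        fp_elapsed += layers[b - 2].fp_seconds
        transfer = layers[b - 1].grad_bytes / bandwidth
        if queued + transfer <= fp_elapsed:
            deferred.append(b)
            queued += transfer
    deferred_set = set(deferred)
    # Non-deferred gradients keep the BP generation order (layer n first).
    on_time = [i for i in range(n, 0, -1) if i not in deferred_set]
    return on_time + deferred


if __name__ == "__main__":
    example = [LayerInfo(10.0, 1.0), LayerInfo(10.0, 1.0), LayerInfo(100.0, 1.0)]
    print(plan_send_order(example, bandwidth=10.0))
```

With a slow link the large last-layer gradient cannot be hidden behind the next FP calculation, so it keeps its place at the front; with a fast link every layer above the first can be deferred.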
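In another possible implementation, the method further includes: obtaining the iteration time of the deep learning model; and adjusting the gradient communication strategy according to the iteration time.
It should be understood that the obtained iteration time may be the sum of the BP calculation time of the current iteration and the FP calculation time of the next iteration. That is, the iteration time is the BP calculation time of the L-th iteration plus the FP calculation time of the (L+1)-th iteration, where L is a positive integer greater than j.
It should be noted that the deep learning model is any one or more of the N deep learning models.
In the above technical solution, the gradient communication strategy can be adjusted according to the fed-back iteration time, so that the optimal gradient communication strategy can be determined from the actual iteration time of the deep learning model, improving its iterative training speed.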
第二方面,提供了一种深度学习模型的训练系统,该训练系统包括N个深度学习模型、梯度通信模块、梯度更新模块、修正模块和参数存储空间。每个所述深度学习模型包括n层神经元,每个所述深度学习模型的训练过程包括多次迭代,每次迭代包括前向传播FP计算和反向传播BP计算,其中,N为大于1的正整数,n为大于1的正整数;
所述N个深度学习模型的每个深度学习模块,分别用于在第j次迭代的BP计算中生成第一梯度集合,每个第一梯度集合包括所述每个深度学习模型的每一层神经元的参数矩阵对应的梯度,其中,j为大于0的正整数;
所述梯度通信模块,用于调整每个第一梯度集合包括的梯度的通信顺序,并根据调整之后的每个第一梯度集合包括的梯度的通信顺序,将N个所述第一梯度集合包括的梯度分别发送至所述训练系统的参数存储空间;
所述梯度更新模块,用于根据所述参数存储空间中存储的N个所述第一梯度集合获取第二梯度集合;
所述修正模块,用于根据所述第二梯度集合包括的梯度分别对每个深度学习模型的每一层神经元的参数矩阵进行修正,以用于每个深度学习模型的第(j+1)次迭代的FP计算。
需要说明的是,梯度通信模块中可以包括两个子模块,一个是调整子模块,用于调整每个第一梯度集合包括的梯度的通信顺序。另一个是通信子模块,用于根据调整之后的每个第一梯度集合包括的梯度的通信顺序,将N个所述第一梯度集合包括的梯度分别发送至所述训练系统的参数存储空间。
还需要说明的是,在包括至少一个模型训练服务器和一个参数服务器的分布式模型 训练系统中,修正模块可以是个参数服务器中的一个模块,也可以是至少一个模型训练服务器中的模块。作为一个示例,修正模块在参数服务器中,该修正模块用于根据所述第二梯度集合包括的梯度分别对所述任一深度学习模型的每一层神经元的参数矩阵进行修正,并将修正后的每一层神经元相对应的参数矩阵存储在该参数服务器的参数存储空间中,以便于至少一个模型训练服务器在下一次迭代的模型训练过程中,从参数存储空间中获取修正之后的参数矩阵。作为另一个示例,修正模块在至少一个模型训练服务器中,至少一个模型训练服务器可以从参数服务器的参数存储空间中获取到第二梯度集合之后,可以根据第二梯度集合对任一深度学习模型的每一层神经元的参数矩阵进行修正,以用于所述训练系统的任一深度学习模型的第(j+1)次迭代的FP计算。
在一种可能的实现方式中,梯度通信模块具体用于:将第a层神经元的参数矩阵对应的梯度调整至第b层神经元的参数矩阵对应的梯度发送至所述参数存储空间之前发送至所述参数存储空间,其中,b小于或等于n,a小于b,a为大于0的正整数。
在另一种可能的实现方式中,梯度通信模块具体用于:根据梯度通信策略调整每个第一梯度集合包括的梯度的通信顺序。
其中,所述梯度通信策略根据以下参数中的至少一个设置:所述深度学习模型和所述参数存储空间之间的通信带宽,所述深度学习模型的各层神经元的参数矩阵对应的梯度的大小,所述深度学习模型的每层神经元在FP计算中所需的时间。
需要说明的是,所述深度学习模型是N个深度学习模型中的任意一个或多个。
在另一种可能的实现方式中,所述系统还包括反馈模块,
所述反馈模块,用于获取所述深度学习模型的迭代时间反馈至所述梯度通信模块,并将获取到的迭代时间反馈至梯度通信模块。
应理解,获取到的深度学习模型的迭代时间可以是当前迭代过程中BP计算的时间和下一次迭代过程中FP计算的时间之和。
所述梯度通信模块,还用于根据反馈模块反馈的深度学习模型的迭代时间对所述梯度通信策略进行调整。
应理解,在包括至少一个模型训练服务器和一个参数服务器的分布式模型训练系统中,反馈模块是至少一个模型训练服务器中每一个模型训练服务器中反馈模块的集合。
第三方面,提供了一种深度学习模型的训练系统,所述训练系统包括至少一个计算节点,每个计算节点包括存储器和至少一个处理器,所述存储器用于存储程序指令,所述训练系统运行时,所述至少一个计算节点的至少一个处理器执行所述存储器中的程序指令以执行第一方面或第一方面中任一种可能的实现方式中的方法。
在一种可能的实现方式中,所述深度学习模型的训练系统包括一个参数服务器和至少一个模型训练服务器。其中,一个模型训练服务器可以作为一个计算节点,N个深度学习模块、梯度通信模块可以分别运行于至少一个模型训练服务器中。梯度更新模块可以运行于训练系统中的参数服务器中。修正模块可以运行于至少一个模型训练服务器中或也可以运行于参数服务器中。
在一种可能的实现方式中,在包括一个参数服务器和至少一个模型训练服务器的深度学习模型的训练系统中,反馈模块运行于至少一个模型训练服务器中。
需要说明的是,在包括一个参数服务器和至少一个模型训练服务器的训练系统中,梯度通信模块可以是至少一个模型训练服务器中每一个模型训练服务器中梯度通信模块的集合,修正模块可以是至少一个模型训练服务器中每一个模型训练服务器中修正模块的集合。反馈模块可以是至少一个模型训练服务器中每一个模型训练服务器中反馈模块的集合。
在另一种可能的实现方式中,深度学习模型的训练系统包括一个模型训练服务器。其中,一个模型训练服务器中包括至少一个处理器,其中,一个处理器可以作为一个计算节点,N个深度学习模块、梯度通信模块、梯度更新模块、修正模块可以分别运行于至少一个处理器中。
在一种可能的实现方式中,在包括一个模型训练服务器的深度学习模型的训练系统中,所述反馈模块运行于一个模型训练服务器的至少一个处理器中。
需要说明的是,在包括一个模型训练服务器的训练系统中,梯度通信模块、梯度更新模块、修正模块、反馈模块可以分别是一个模型训练服务器中的至少一个处理器中每个处理器包括的各个上述模块的集合。
第四方面,提供了一种非瞬态的可读存储介质,包括程序指令,当所述程序指令被至少一个计算节点运行时,所述至少一个计算节点执行如第一方面及第一方面中任一种可能的实现方式中的方法。
第五方面,提供了一种计算机程序产品,包括程序指令,当所述程序指令被至少一个计算节点运行时,所述至少一个计算节点执行如第一方面及第一方面中任一种可能的实现方式中的方法。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
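Fig. 1 is a schematic block diagram of a deep learning model 100 provided by an embodiment of this application.
Fig. 2 is a schematic structural diagram of a distributed training system 200 for the deep learning model 100 provided by an embodiment of this application.
Fig. 3 is a schematic block diagram of the communication between the model training servers and the parameter server provided by an embodiment of this application.
Fig. 4 is a schematic structural diagram of a distributed training system 400 for the deep learning model 100 provided by an embodiment of this application.
Fig. 5 is a schematic flowchart of a training method for a deep learning model provided by an embodiment of this application.
Fig. 6 is a schematic architecture diagram of a training system for a deep learning model provided by an embodiment of this application.
Fig. 7 is a schematic flowchart of a method for accelerating the training of a deep learning model provided by an embodiment of this application.
(a) in Fig. 8 is an iteration-time comparison diagram for the method for accelerating the training of a deep learning model provided by an embodiment of this application.
(b) in Fig. 8 is an iteration-time comparison diagram for the method for accelerating the training of a deep learning model provided by an embodiment of this application.
The technical solutions in this application are described below with reference to the accompanying drawings.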
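In the AI field, deep learning is a learning technique based on deep neural network algorithms. A deep learning model includes an input layer, hidden layers, and an output layer, and processes data using multiple nonlinear transformations.
It should be understood that a neural network imitates the behavioral characteristics of animal neural networks. Depending on the complexity of the system, it processes information by adjusting the interconnections among a large number of internal nodes.
It should also be understood that a deep neural network (deep learning model) can be understood as a neural network with multiple hidden layers, where "multiple" has no particular threshold. In theory, a model with more parameters is more complex and has a larger "capacity", meaning that it can complete more complex learning tasks. Training a deep neural network is the process of learning the parameter matrices; the ultimate goal is to obtain the parameter matrix of each layer of neurons of the trained network (the parameter matrix of a layer includes the weight corresponding to each neuron in that layer).
A possible training process of a deep learning model applied to the embodiments of this application is described in detail below with reference to Fig. 1.
Fig. 1 is a schematic block diagram of a deep learning model 100 provided by an embodiment of this application. The deep learning model 100 may include an input layer 110, a hidden layer 120, and an output layer 130.
It should be understood that the embodiments take a hidden layer 120 that includes n layers of neurons (n greater than 1) as an example.
It should also be understood that each of the input layer 110, the output layer 130, and the layers of the hidden layer 120 includes one or more neurons. In Fig. 1, the input layer 110 includes two neurons, each of the n layers of the hidden layer 120 includes three neurons, and the output layer 130 includes one neuron.
The deep learning model 100 shown in Fig. 1 may be a fully connected neural network or a convolutional neural network (CNN). If all neurons of each layer are connected to all neurons of the next layer (none of the weights w is 0), the deep learning model 100 is a fully connected neural network model. If not all neurons of each layer are connected to all neurons of the next layer (some of the weights w are 0), the deep learning model 100 is a CNN model.
Referring to Fig. 1, the deep learning model 100 may include a forward propagation (FP) calculation and a back propagation (BP) calculation.
The FP calculation performed in one computing node is described in detail below.
In the FP calculation, training data is obtained, for example the pixel information of an input image, and is used as the input (i_1, i_2) of the input layer 110 of the deep learning model 100. After passing through the neurons of the hidden layer 120, the input yields a prediction result at the output layer 130. Specifically, each layer of neurons in the hidden layer 120 has a corresponding parameter matrix. The product of the input of the input layer 110 and the parameter matrix of the layer-1 neurons is the input of the layer-1 neurons of the hidden layer 120; passing this input through the activation function of the layer-1 neurons (for example, a sigmoid function) gives the output value of the layer-1 neurons. The product of the output value of the layer-1 neurons and the parameter matrix of the layer-2 neurons is the input of the layer-2 neurons. The calculation continues in the same way until the output layer 130 outputs a prediction result.
In practice, the weights in these parameter matrices need to be corrected through extensive training. The parameter matrices formed by the trained weights can extract pixel information from an image submitted by the user for inference, helping the deep learning model 100 to infer correctly on that image.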
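In the j-th iteration of the FP calculation, the input and the output of the first, second, and third neurons in layer 1 are each given by a formula (the formula images are not preserved in this text). In these formulas, the output of a neuron is the activation function applied to that neuron's input.
In the j-th iteration, the input of the neurons in layer 1 is likewise given by a formula that is not preserved in this text.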
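The FP calculation described above (multiply by a layer's parameter matrix, then apply the activation function, layer by layer) can be illustrated with a short sketch. It assumes a sigmoid activation, which the text names only as an example, no bias terms, and a hypothetical nested-list layout `weights[k][r][c]` for the weight from input c to neuron r of layer k+1.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs: List[float], weights: List[List[List[float]]]) -> List[float]:
    """Propagate `inputs` through every layer and return the final outputs."""
    activations = inputs
    for layer_matrix in weights:
        # Input of each neuron: weighted sum of the previous layer's outputs.
        nets = [sum(w * a for w, a in zip(row, activations)) for row in layer_matrix]
        # Output of each neuron: activation function applied to its input.
        activations = [sigmoid(v) for v in nets]
    return activations


if __name__ == "__main__":
    # Two inputs, one layer of three neurons, then a single output neuron.
    w = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[0.7, 0.8, 0.9]]]
    print(forward([1.0, 2.0], w))
```

The layout mirrors Fig. 1 (two inputs, three neurons per layer, one output) but the weight values are made up for illustration.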
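The BP calculation performed in one computing node is described in detail below.
During the training of the deep learning model 100, the predicted value o_1 output by the output layer 130 should be as close as possible to the prior knowledge of the training data. Prior knowledge, also called the ground truth, generally includes the prediction result that a person has provided for the training data. The parameter matrix of each layer of the deep learning model 100 can therefore be updated by comparing the current predicted value with the prior knowledge and using the difference between them (before the first update there is usually an initialization step, which initializes the parameter matrices corresponding to the layers of neurons of the hidden layer 120). The error BP algorithm is used during training to correct the weights in the parameter matrices so that the error loss of the deep learning model 100 becomes smaller and smaller.
Specifically, there may be an error between the predicted value produced by the FP calculation and the prior knowledge. If the predicted value is greater than the prior knowledge, the weights in the parameter matrices can be adjusted so that the predicted value becomes lower; if it is less than the prior knowledge, the weights can be adjusted so that it becomes higher. The BP calculation is a backward pass driven by the error, aimed at obtaining the optimal parameter matrix for each layer of neurons.
It should be understood that the training data input by the user may include the training data used as input and the corresponding prediction results provided by a person.
As an example, the deep learning model 100 is applied to image recognition. The training data input to the model is the pixel information of an image, and the corresponding prior knowledge is the image's label "dog". The training data is fed to the input layer 110, and after the FP calculation the predicted value output by the output layer 130 is compared with the prior knowledge. For example, if the predicted value is "cat", the parameter matrix of each layer of the deep learning model 100 can be updated according to the error between the predicted value and the prior knowledge "dog".
In the j-th iteration, the BP calculation may compute the error E between the predicted value o_1 and the prior knowledge and, proceeding from the output layer 130 through the hidden layer 120 toward the input layer 110, correct the weights in the parameter matrix of each layer of neurons according to the error E. Specifically, correcting the weights may involve computing the gradient of the weights in each parameter matrix; this gradient may be the derivative of the error E with respect to the weights in the parameter matrix of the i-th layer, where 1 ≤ i ≤ n.
The (j+1)-th iteration of the deep learning model 100 is similar to the j-th: the FP calculation is performed first, then the BP calculation. For example, in the FP calculation of the (j+1)-th iteration, the weights in the parameter matrices are corrected according to the gradient computed by the BP calculation of the j-th iteration, and the predicted output is computed from the corrected parameter matrices. In the BP calculation of the (j+1)-th iteration, the gradient of the weights is computed from the error E between the output value of the FP calculation of the (j+1)-th iteration and the prior knowledge, so that in the (j+2)-th iteration the weights can be corrected again according to that gradient. The weights are corrected continually over many iterations so that the output predicted by the deep learning model 100 comes as close as possible to the prior knowledge of the training data.
Specifically, in the FP calculation of the (j+1)-th iteration, the input and output of the neurons in layer i are computed with the corrected parameter matrix of layer i. For how the input and output of each layer are computed from the parameter matrix, see the description of the FP calculation of the j-th iteration above.
It should be noted that the parameter matrix formulas shown above are one possible implementation; other variants of these formulas also fall within the scope of protection of the embodiments of this application.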
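To make the correction step concrete, the sketch below differentiates an error E with respect to the weights of a single sigmoid neuron and then applies one correction. The squared-error loss, the learning rate, and all function names are assumptions made for illustration; the text does not state which loss or update rule the embodiments use.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights: List[float], x: List[float]) -> float:
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


def error(weights: List[float], x: List[float], target: float) -> float:
    # Assumed loss: E = 0.5 * (prediction - prior knowledge)^2
    return 0.5 * (predict(weights, x) - target) ** 2


def gradient(weights: List[float], x: List[float], target: float) -> List[float]:
    """dE/dw for each weight, via the chain rule through the sigmoid."""
    o = predict(weights, x)
    delta = (o - target) * o * (1.0 - o)
    return [delta * v for v in x]


def correct(weights: List[float], grads: List[float], learning_rate: float) -> List[float]:
    """Move each weight against its gradient so that the error shrinks."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]


if __name__ == "__main__":
    w, x, y = [0.1, -0.2], [1.0, 2.0], 1.0
    g = gradient(w, x, y)
    print(error(w, x, y), error(correct(w, g, 0.5), x, y))
```

In this example the prediction starts below the target, so the correction raises the weights and the error after the update is smaller than before, matching the "adjust the weights so the predicted value becomes higher" case in the text.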
本申请实施例中上述深度学习模型100训练过程(包括FP计算过程以及BP计算过程)可以在包括至少一个计算节点的训练系统中完成。该至少一个计算节点可以是至少一个一个模型训练服务器,也可以是一个模型训练服务器中的至少一个处理器。下面将结合图2至图4,对训练深度学习模型100的场景进行描述。
图2是本申请实施例提供的一种深度学习模型100的分布式训练系统200的示意性结构图。图2所示的分布式训练系统200中可以包括模型训练服务器210、模型训练服务器220、模型训练服务器230、参数服务器240、云存储器250。
一般而言,深度学习模型的精度随着训练数据量增加而上升。但是训练数据量的增加使得计算负荷变大,因此分布式深度学习训练技术应运而生。分布式深度学习训练旨在通过采用多个计算节点来增加计算资源,并通过多个计算节点对训练的模型进行迭代,以提升深度学习模型训练速度。
参见图2,分布式训练系统200可以包括至少一个模型训练服务器,其中,一个模型训练服务器可以作为一个计算节点。为了便于描述,图2中以三个模型训练服务器作为示例进行说明。模型训练服务器220以及模型训练服务器230的结构与模型训练服务器210类似,下面对模型训练服务器210进行详细描述。
(1)模型训练服务器210:
包括:至少一个处理器、存储器213、输入输出接口214、通信接口215、总线216。
其中,至少一个处理器可以与存储器213连接。该存储器213可以用于存储该程序代码和训练数据。该存储器213可以是至少一个处理器内部的存储单元,也可以是与至少一 个处理器独立的外部存储单元,还可以是包括与至少一个处理器内部的存储单元和与至少一个处理器独立的外部存储单元的部件。
存储器213可以是固态硬盘(solid state drive,SSD),也可以是硬盘驱动器(hard disk drive,HDD),还可以是只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)等。
该至少一个处理器可以从存储器213获取程序代码以及训练数据,并对深度学习模型100进行训练。作为示例而非限定,至少一个处理器可以根据程序代码以及训练数据进行迭代计算(例如,进行如图1所示的FP计算以及BP计算),还可以在分布式训练系统200中,将BP计算出的参数矩阵中的权重的梯度
发送(push)至参数服务器240。
具体的,至少一个处理器可以包括两类处理器。其中,一类处理器包括至少一个数据处理器211,另一类处理器包括至少一个迭代处理器212。作为一个示例而非限定,数据处理器211可以是中央处理单元(central processing unit,CPU),迭代处理器212可以是嵌入式神经网络处理器(neural-network process units,NPU),也可以是图像处理器(graphics processing unit,GPU)。
其中,迭代处理器212中运行有深度学习模型100,并在对深度学习模型100进行BP计算中,计算出每一层的神经元的参数矩阵中权重的梯度
数据处理器211可以用于将BP计算出的梯度
发送(push)至参数服务器240。
数据处理器211中可以运行有梯度通信模块2111,迭代处理器212中可以运行有深度学习模块100。可选的,迭代处理器212中可以运行有反馈模块2121。修正模块可以运行于数据处理器211中,也可以运行于参数服务器240中,例如,数据处理器211中运行有修正模块2112,又如,参数服务器240的数据处理器241中运行有修正模块2412。具体的有关数据处理器211中运行的各个模块,请参考图6中的描述,此处不再赘述。
可选的,模型训练服务器210还可以包括总线216。其中,存储器213、输入输出接口214、通信接口215可以通过总线216与至少一个处理器(例如,数据处理器211、迭代处理器212)连接。总线216可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。所述总线216可以分为地址总线、数据总线、控制总线等。为便于表示,图2中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。
(2)参数服务器240:
包括:至少一个数据处理器241、存储器243、输入输出接口244、通信接口245、总线246。
至少一个数据处理器241可以与存储器243连接,数据处理器241例如可以是CPU。至少一个数据处理器241可以从存储器243中获取模型训练服务器210、模型训练服务器220、模型训练服务器230分别发送的梯度
对多个模型训练服务器通过处理器push的多个梯度
进行处理,并可以将处理得到的梯度
存储至存储器243。作为一个示例,至少一个数据处理器241可以对多个模型训练服务器分别push的多个梯度
进行加权平均计算以得到
并将梯度平均值
存储至存储器243。根据多个梯度
获得梯度
的过程 除了采用加权平均计算之外,还可以采用其他算法。
数据处理器241中可以运行有梯度更新模块2411。可选的,数据处理器241中还可以运行有修正模块242。具体的有关数据处理器241中运行的各个模块,请参考图6中的描述,此处不再赘述。
需要说明的是,本申请实施例在第j次迭代过程中,参数服务器240中的数据处理器241计算出梯度平均值
之后,在第(j+1)次迭代中FP计算中,还需要根据梯度平均值
修正第(j+1)次迭代中参数矩阵中的
并将
存入存储器243的参数存储空间以供模型训练服务器210、模型训练服务器220、模型训练服务器230在第(j+1)轮训练中使用。
作为一个示例,可以由参数服务器240中的至少一个数据处理器241计算
并存储至存储器243中,第(j+1)次迭代的FP计算中,多个模型训练服务器可以从存储器243中直接获取(pull)
作为另一个示例,可以由模型训练服务器210中的处理器212计算
在第(j+1)次迭代的FP计算中,迭代处理器212从存储器243中pull计算出的梯度平均值
迭代处理器212根据
计算第(j+1)次迭代中参数矩阵中的
并存储在存储器243中,以便于模型训练服务器210在第(j+1)轮训练中使用。
可选的,在一些实施例中,参数服务器240中还可以包括迭代处理器242,该迭代处理器242中可以运行有深度学习模型100。作为一个示例,迭代处理器242可以是NPU,也可以是GPU。
需要说明的是,迭代处理器242也可以根据模型训练服务器210、模型训练服务器220、模型训练服务器230分别发送的权重的梯度
计算梯度平均值
并将计算出的梯度平均值
存储至存储器243。迭代处理器242还可以根据
计算出第(j+1)次迭代中参数矩阵中的
并将
存储至存储器243中,以便于模型训练服务器210、模型训练服务器220、模型训练服务器230在第(j+1)轮训练中使用。
(3)云存储器250:
可选,在一些实施例中,系统200还可以包括云存储器250。云存储器250可以作为外部存储器,用户可以将程序代码和训练数据存储在该外部存储器上。以模型训练服务器210为例,至少一个处理器在运行过程中,可以先将云存储器250中存储的程序代码和数据存储在存储器213中,以便于至少一个处理可以从存储器213获取程序代码以及训练数据,并可以根据程序代码和训练数据对深度学习模型100进行训练。
需要说明的是,云存储器250中存储的数据可以包括训练数据、该训练数据相对应的先验知识以及各个深度学习训练模型100的隐含层120中的每一层的神经元所对应的参数矩阵的初始值等。
下面结合图3,对图2所示的系统200中各个模型训练服务器与参数服务器240之间进行通信的过程进行进一步的详细说明。
需要说明的是,为了便于描述,图3中没有详细画出多个模型训练服务器以及参数服务器240的内部结构图,具体的请参考图2中的描述,此处不再赘述。
参见图3,以对深度学习模型100进行第j次迭代和第(j+1)次迭代的训练过程为例。在第j次迭代的训练过程中,模型训练服务器210在BP计算中,至少一个数据处理器211可以将至少一个迭代处理器212计算出的第i隐含层的神经元上所对应的梯度
(1)push到参数服务器240的存储器243中。同样的,模型训练服务器220可以将在BP计算中,至少一个数据处理器可以将计算出的梯度
(2)push到参数服务器240中的存储器243中,模型训练服务器230中的至少一个数据处理器可以将计算出的梯度
(3)push到参数服务器240的存储器243中。
参数服务器240中的至少一个迭代处理器242可以从存储器243中获取存储的
根据
计算出梯度平均值
并可以将
存入存储器243的参数存储空间以供模型训练服务器210、模型训练服务器220、模型训练服务器230在第(j+1)轮训练中使用。具体的计算
的过程请参考图2对应的实施例。
可选的,在一些实施例中,至少一个迭代处理器242还可以从存储器243中获取存储的
根据
计算出第(j+1)次迭代中参数矩阵中的
并存储在存储器243中,以便于模型训练服务器210在第(j+1)轮训练中根据
进行BP计算。因此,在一些实施例中,存储器243的参数存储空间中还存储有:
在对深度学习模型100进行第(j+1)次迭代训练的BP计算过程中,多个模型训练服务器可以从参数服务器中获取存储的参数,并通过输入值(训练数据)以及参数矩阵
计算预测的输出值。作为一个示例,模型训练服务器210在BP计算中从参数服务器240中的存储器243pull存储的
根据
计算出第(j+1)次迭代中第i隐含层中的神经元多对应的参数矩阵
并通过输入值以及参数矩阵
计算预测的输出值。同样的,模型训练服务器220在BP计算中从参数服务器240中pull存储的
模型训练服务器230在BP计算中从参数服务器240中pull存储的
作为另一个示例,如果参数服务器240中的存储器243中存储有
模型训练服务器210、模型训练服务器220、模型训练服务器230在BP计算中可以分别从参数服务器242中pull存储的
下面结合图4,以一个分布式训练系统中包括一个模型训练服务器,其中,一个模型训练服务器包括至少一个处理器,该一个处理器可以作为一个计算节点,对训练深度学习模型100的场景进行详细描述。
图4是本申请实施例提供的一种深度学习模型400的分布式训练系统400的示意性结构图。如图4所示,分布式训练系统400可以包括:模型训练服务器410。
其中,模型训练服务器410中可以包括:至少一个处理器、存储器414、输入输出接口415、通信接口416、总线417。
至少一个处理器可以与存储器414连接。该存储器414可以用于存储该程序代码和训练数据。该至少一个处理器可以从存储器414获取程序代码以及训练数据,并对深度学习模型100进行训练。
至少一个处理器可以包括两类处理器。其中,一类处理器包括至少一个数据处理器411,另一类处理器包括至少一个迭代处理器。作为一个示例而非限定,数据处理器411可以是CPU,迭代处理器可以是NPU,也可以是GPU。
应理解,模型训练服务器410中可以包括至少一个迭代处理器,为了便于描述,图4中以迭代处理器412和迭代处理器413作为示例进行说明。
迭代处理器412和迭代处理器413中分别运行有深度学习模型100,并在对深度学习模型100进行BP计算中,可以分别计算出每一层的神经元的参数矩阵中权重的梯度
并分别将计算出的梯度
通过总线417存储至存储器414中。
可选的,在一些实施例中,迭代处理器412中还可以运行有反馈模块4121,同样的,迭代处理器413中还可以运行有反馈模块4131。具体的有关迭代处理器412、迭代处理器413中运行的反馈模块,请参考图6中的描述,此处不再赘述。
至少一个数据处理器411中可以运行有梯度通信模块4111、梯度更新模块4112、修正模块4113,具体的有关数据处理器411中运行的各个模块,请参考图6中的描述,此处不再赘述。
至少一个数据处理器411还可以通过总线417从存储器414中获取迭代处理器412、迭代处理器413存储的梯度
之后,可以根据迭代处理器412、迭代处理器413计算出的多个梯度
计算梯度平均值
并通过总线417将梯度平均值
存储至存储器414中,以便于迭代处理器412、迭代处理器413在第(j+1)轮训练中使用。
具体的,在第j次迭代的BP计算过程中,迭代处理器412可以计算出深度学习模型100每一层神经元相对应的梯度
并可以将计算出的第i隐含层的神经元相对应的梯度
通过总线417存储至存储器414的参数存储空间中。同理,迭代处理器413也可以将计算出的第i隐含层的神经元相对应的梯度
通过总线417存储至存储器414的参数存储空间中。数据处理器411可以总线417从存储器414的参数存储空间中获取存储的梯度
并根据梯度
计算出第i隐含层的神经元相对应的梯度平均值
并通过总线417将
存储至存储器414的参数存储空间中。在第(j+1)次迭代中FP的计算过程中,迭代处理器412、迭代处理器413可以分别通过总线417从存储器414的参数存储空间中获取梯度平均值
根据梯度平均值
计算第(j+1)次迭代中参数矩阵中的
并根据修正之后的参数矩阵
进行FP计算。
可选的,在一些实施例中,数据处理器411还可以通过总线417从存储器414的参数存储空间中获取存储的梯度平均值
根据梯度平均值
计算第(j+1)次迭代中参数矩阵中的
并将
存储至存储器414的参数存储空间中。以便于在第(j+1)次迭代中,迭代处理器412、迭代处理器413可以分别通过总线417从存储器414的参数存储空间中获取
并进行FP计算。
可选,在一些实施例中,分布式训练系统400还包括云存储器420,其中,云存储器420与模型训练服务器410连接。云存储器420可以作为外部存储器,用户可以将程序代码和训练数据存储在该外部存储器上。模型训练服务器410中的至少一个处理器在运行过程中,可以先将云存储器420中存储的程序代码和训练数据存储在存储器414中,以便于至少一个处理器可以从存储器414获取程序代码以及训练数据,并可以根据程序代码和训练数据对深度学习模型100进行迭代训练。
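Figs. 2 to 4 above describe in detail how a deep learning model is trained in the distributed training system 200 or the distributed training system 400. In the BP calculation of the current iteration, each deep learning model computes the gradient of each layer of neurons in the direction from the n-th layer to the first layer and sends the computed gradient of each layer to the parameter storage space. In general, the closer a layer of neurons is to the output layer, the larger the dimensions of its parameter matrix, the larger the corresponding gradient, and the longer it takes to send that gradient to the parameter storage space. In the FP calculation of the next iteration, the deep learning model obtains the stored gradient averages or parameter matrices from the parameter storage space layer by layer, in the direction from the first layer to the n-th layer. The deep learning model therefore has to wait until the gradient corresponding to the parameter matrix of the first layer has been transmitted to the parameter storage space before it can start the FP calculation of the next iteration. If, in the BP calculation of the current iteration, the gradients are sent to the parameter storage space in the order in which they are generated (from the gradient of the n-th layer to the gradient of the first layer), the deep learning model waits a long time before starting the next iteration, and iterative training is inefficient.
The accelerated training of a deep learning model in the embodiments of this application is described in further detail below with reference to Figs. 5 to 7.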
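Fig. 5 is a schematic flowchart of a method for training a deep learning model provided by an embodiment of this application. The method may include steps 510-550, which are described in detail below.
Step 510: N deep learning models each generate a first gradient set in the BP calculation of the j-th iteration, giving N first gradient sets.
The training system in the embodiments of this application may include N deep learning models, where N is a positive integer greater than 0. The training process of each deep learning model may include multiple iterations, and each iteration may include an FP calculation and a BP calculation.
It should be understood that the training system may be the distributed training system 200 shown in Fig. 2 or the distributed training system 400 shown in Fig. 4.
In the BP calculation of the j-th iteration, each deep learning model computes the gradient corresponding to each layer of neurons in the direction from the n-th layer to the first layer, forming a first gradient set, where the layer index i is a positive integer greater than 0 and less than or equal to n. Any first gradient set may include the gradient of the parameter matrix corresponding to each of the n layers of neurons of the deep learning model, that is, n gradients.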
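Step 520: Determine the gradient communication strategy according to the training meta-information.
Before step 510 begins, the adjustment sub-module in the gradient communication module may determine the gradient communication strategy from the parameters included in the training meta-information input by the user, so that the communication sub-module in the gradient communication module can send the gradients included in the N first gradient sets to the parameter storage space based on the determined strategy.
The training meta-information may include any of the following parameters: the communication bandwidth between the deep learning model and the parameter storage space; the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model; and the time each layer of neurons of the deep learning model requires in the FP calculation.
It should be understood that the gradient communication strategy may be determined from at least one of the above parameters. The gradient communication strategy includes the order in which the gradients in the first gradient set are transmitted to the parameter storage space. Details are given below with reference to Figs. 6 and 7.
It is worth noting that no order is imposed between step 510 and step 520: step 510 may be performed before step 520, step 520 may be performed before step 510, or the two steps may be performed at the same time. The embodiments of this application do not specifically limit this.
In the embodiments of this application, while the first gradient set is being generated in step 510, the gradient corresponding to the parameter matrix of the a-th layer of neurons may be rescheduled, based on the gradient communication strategy determined in step 520, to be sent to the parameter storage space before the gradient corresponding to the parameter matrix of the b-th layer of neurons, where b is less than or equal to n, a is less than b, and a is a positive integer greater than 0.
For example, suppose the gradient communication strategy indicates that the gradients are sent to the parameter storage space in the order layer n-1, layer n-2, ..., layer 1, layer n. The gradient of layer n is then held back after it is generated; the gradient of layer n-1 is transmitted once it has been generated, then the gradient of layer n-2, and so on, until the gradient of layer 1 has been generated and transmitted to the parameter storage space. Only then is the previously generated gradient of layer n sent to the parameter storage space. The adjustment of the transmission order of the gradients in the first gradient set therefore does not need to wait until all gradients in the set have been generated.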
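The hold-back behaviour in this example can be sketched as a small streaming routine: gradients arrive in BP order, deferred ones are kept locally, and everything else is forwarded immediately. The function and parameter names are hypothetical, and `send` stands in for whatever transport pushes a gradient to the parameter storage space.

```python
from typing import Callable, Iterable, List, Set, Tuple


def send_with_deferral(
    bp_gradients: Iterable[Tuple[int, object]],
    deferred_layers: Set[int],
    send: Callable[[int, object], None],
) -> None:
    """Forward gradients as BP produces them, holding back the deferred layers.

    `bp_gradients` yields (layer_index, gradient) pairs in BP order, that is,
    layer n first and layer 1 last.
    """
    held: List[Tuple[int, object]] = []
    for layer, grad in bp_gradients:
        if layer in deferred_layers:
            held.append((layer, grad))   # keep in local memory for now
        else:
            send(layer, grad)            # transmit as soon as it exists
    # Layer 1 has now been sent; flush the gradients that were held back.
    for layer, grad in held:
        send(layer, grad)


if __name__ == "__main__":
    log: List[int] = []
    stream = ((layer, "grad-%d" % layer) for layer in range(4, 0, -1))
    send_with_deferral(stream, {4}, lambda layer, grad: log.append(layer))
    print(log)
```

Because the routine consumes a stream, reordering happens while the first gradient set is still being generated, which is the point the paragraph above makes.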
以图2所示的分布式训练系统200为例。模型训练服务器210中的迭代处理器212中,深度学习模型100的深度学习模块可以在第j次迭代BP计算中生成第一梯度集合,并可以在生成第一梯度集合的过程中,将计算出的神经元的参数矩阵对应的梯度存储至存储器213中。数据处理器211中的梯度通信模块用于确定梯度通信策略,该梯度通信策略用于表示每个第一梯度集合包括的梯度的通信顺序,梯度通信模块基于确定的梯度通信策略,将存储在存储器213中的第一梯度集合包括的第a层神经元的参数矩阵对应的梯度调整至第b层神经元的参数矩阵对应的梯度发送至参数服务器240的存储器243中的参数存储空间之前发送至所述参数存储空间。
以图4所示的模型训练服务器410为例。模型训练服务器410中的数据处理器411中的迭代处理器412中,深度学习模型100的深度学习模块可以在第j次迭代BP计算中生成第一梯度集合,并可以在生成第一梯度集合的过程中,将计算出的神经元的参数矩阵对应的梯度存储至内存4121。数据处理器411中的梯度通信模块基于确定的梯度通信策略将存储在内存4121中的第一梯度集合包括的第a层神经元参数矩阵对应的梯度调整至第b层神经元的参数矩阵对应的梯度发送至存储器414的参数存储空间之前发送至所述参数存储空间。
应理解,图2所示的系统200中,梯度通信模块可以是系统200中至少一个模型训练服务器中每一个模型训练服务器的梯度通信模块的集合。
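It should also be understood that the gradient communication module may include two sub-modules: an adjustment sub-module, used to determine the gradient communication strategy, and a communication sub-module, used to send the gradients included in the N first gradient sets to the parameter storage space according to that strategy.
Step 540: In the (j+1)-th iteration, each deep learning model obtains the second gradient set from the parameter storage space.
Take the system 200 shown in Fig. 2 as an example. The gradient update module 2411 in the data processor 241 of the parameter server 240 obtains the second gradient set from the N first gradient sets stored in the parameter storage space of the memory 243.
Take the model training server 410 shown in Fig. 4 as an example. The gradient update module 4112 in the data processor 411 of the model training server 410 obtains the second gradient set from the N first gradient sets stored in the parameter storage space of the memory 414.
In the embodiments of this application, after the gradients included in the N first gradient sets have been sent to the parameter storage space of the training system in the adjusted communication order, the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models can be computed.
As an example, a weighted average is computed over the gradients of each layer of neurons included in the N first gradient sets, giving the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models. These per-layer average gradients form the second gradient set. In other words, the second gradient set includes the average gradient corresponding to the parameter matrix of each layer of neurons of the N deep learning models.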
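A minimal sketch of forming the second gradient set by weighted averaging follows. It assumes each gradient is stored as a flat list of floats and that the per-model weights are supplied by the caller (equal weights by default); the names `aggregate` and `model_weights` are hypothetical.

```python
from typing import List, Optional, Sequence

Gradient = List[float]
GradientSet = List[Gradient]   # one gradient per layer, layer 1 first


def aggregate(
    first_sets: Sequence[GradientSet],
    model_weights: Optional[Sequence[float]] = None,
) -> GradientSet:
    """Combine N first gradient sets into one second gradient set."""
    if model_weights is None:
        model_weights = [1.0] * len(first_sets)
    total = sum(model_weights)
    second_set: GradientSet = []
    for layer in range(len(first_sets[0])):
        size = len(first_sets[0][layer])
        # Weighted average of this layer's gradient across the N models.
        averaged = [
            sum(w * fs[layer][k] for w, fs in zip(model_weights, first_sets)) / total
            for k in range(size)
        ]
        second_set.append(averaged)
    return second_set


if __name__ == "__main__":
    model_a = [[1.0, 3.0], [0.5]]
    model_b = [[3.0, 5.0], [1.5]]
    print(aggregate([model_a, model_b]))
```

Averaging layer by layer means a layer's entry in the second gradient set can be produced as soon as that layer's gradient has arrived from all N models, independently of the other layers.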
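Step 550: In the (j+1)-th iteration, each deep learning model performs the FP calculation according to the second gradient set.
In the embodiments of this application, the model training system may include a correction module, used to correct the parameter matrix of each layer of neurons of any deep learning model according to the gradients included in the second gradient set, for use in the FP calculation of the (j+1)-th iteration of any deep learning model of the training system.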
以图2所示的系统200为例。修正模块可以是在参数服务器240的数据处理器241中,也可以在模型训练服务器210的数据处理器211中。
可选的,在一些实施例中,模型训练服务器210的数据处理器211中包括修正模块2112。模型训练服务器210的梯度通信模块2111可以从存储器243中的参数存储空间中获取第二梯度集合,并根据第二梯度集合中的每一层神经元所对应的参数矩阵的梯度平均值,对深度学习模型的每一层神经元的参数矩阵进行修正。使得在下一次迭代的BP计算中,根据修正之后的参数矩阵
计算每一层神经元对应的输入输出。
可选的,在另一些实施例中,参数服务器240中的数据处理器241中包括修正模块2411。修正模块2411可以从存储器243中的参数存储空间中获取第二梯度集合,并根据第二梯度集合中的每一层神经元所对应的参数矩阵的梯度平均值,对深度学习模型的每一层神经元的参数矩阵进行修正,并将修正之后的包括参数矩阵
的集合存储在存储器243的参数存储空间中。以便于在下一次迭代的BP计算中,模型训练服务器210的梯度通信模块2111可以从存储器243中的参数存储空间中获取修正之后的包括参数矩阵
集合,并根据集合中的
计算每一层神经元对应的输入输出。
需要说明的是,图2所示的系统200中,修正模块可以是系统200中至少一个模型训练服务器中每一个模型训练服务器的修正模块的集合。
以图4所示的模型训练服务器410为例。模型训练服务器410中的数据处理器411可以包括修正模块4113,修正模块4113可以从存储器414中的参数存储空间获取第二梯度集合,并根据第二梯度集合中的每一层神经元所对应的参数矩阵的梯度平均值,对深度学习模型的每一层神经元的参数矩阵进行修正,并将修正之后的包括参数矩阵
的集合存储在存储器414的参数存储空间中。以便于在下一次迭代的BP计算中,模型训练服务器410的梯度通信模块4111可以从存储器414中的参数存储空间中获取修正之后的包括参数矩阵
集合,并根据集合中的
计算每一层神经元对应的输入输出。
本申请实施例对每个深度学习模型迭代过程中的梯度通信策略不作具体限定。可以是根据经验规则进行设置,也可以是兼容其他方式的梯度通信策略,比如基于强化学习的智能梯度通信策略。下面会结合图6对调整第一梯度集合中包括的n个梯度
的通信顺序的具体实现方式进行详细描述。
图6是本申请实施例提供的一种深度学习模型的训练系统的架构示意图,该系统架构 可以包括用户侧和云平台侧。
如图6所示,用户侧可以通过接口将深度学习模型100、训练元信息660、训练数据670中的至少一种数据输入至云平台侧。
其中,训练元信息610中可以包括深度学习模型100和参数存储空间640之间的通信带宽,深度学习模型100的每层神经元的参数矩阵对应的梯度的大小,深度学习模型100的每层神经元在FP计算中所需的时间。训练数据670可以包括作为输入的训练数据以及由人提供的训练数据对应的预测结果。
需要说明的是,深度学习模型100可以是用户侧可以通过接口发送至云平台侧,也可以是云平台侧存储的模型,本申请对此不作具体限定。
云平台侧可以包括:梯度通信模块620、本地存储器630、参数存储空间640、深度学习模型100。
可选的,在一些实施例中,云平台侧还可以包括云存储器610。云存储器610可以存储用户侧发送的深度学习模型100、训练元信息660、训练数据670。
可选的,在一些实施例中,云平台侧还可以包括反馈模块650。
参见图6,云平台侧的梯度通信模块620中可以包括调整子模块621以及通信子模块622。调整子模块621可以用于执行步骤520中的方法,通信子模块622可以用于执行步骤530中的方法。云平台侧的反馈模块650可以用于执行步骤540中的方法。
应理解,平台侧可以对应于图2所示的分布式训练系统200或图4所示的分布式训练系统400。
下面以图2所示的分布式训练系统200为例,对图6所示的云平台侧深度学习模型100的迭代过程进行详细描述。
需要说明的是,对于图2而言,图6所示的梯度通信模块620对应于图2中模型训练服务器210中的梯度通信模块2111以及模型训练服务器220、模型训练服务器230中运行的梯度通信模块的集合。反馈模块650对应于图2中模型训练服务器210中的反馈模块2121和模型训练服务器220、模型训练服务器230中运行的反馈模块的集合。对于图4而言,反馈模块650对应于图4中迭代处理器412中的反馈模块4121和迭代处理器413中的反馈模块4131的集合。
调整子模块621可以根据训练元信息660确定梯度通信策略,具体的请参考步骤520中的描述,此处不再赘述。
在深度学习模型100开始训练时,深度学习模型100中的深度学习模块从云存储器610中获取训练数据670,并根据训练数据670开始模型的迭代训练。在当前迭代的BP计算中,通信子模块622可以用于执行步骤530中的方法。通信子模块622可以根据调整子模块621确定的梯度通信策略将本地存储器630中存储的梯度按照调整之后的顺序发送至参数存储区域640。深度学习模型100在开始下一次迭代的FP计算时,可以从参数存储区域640中获取存储的数据,并根据该数据开始FP计算。具体的请参考步骤530中的描述,此处不再赘述。
应理解,本地存储器630为分布式训练系统200中模型训练服务器210中的存储器213和模型训练服务器220、模型训练服务器230中的存储器的集合。
还应理解,参数存储区域640对应于分布式训练系统200中参数服务器240中的存储 器243。
反馈模块650可以获取深度学习模型100的迭代时间,该迭代时间可以是深度学习模型100在本次迭代的BP计算时间和下一次迭代的FP计算时间。例如,反馈模块650可以获取深度学习模型100的第L次迭代的BP计算的时间和第L+1次迭代的FP计算的时间,L为大于j的正整数。反馈模块650可以将深度学习模型100的迭代时间反馈至梯度通信模块620中的调整子模块621。调整子模块621可以在接收到反馈模块650反馈的深度学习模型100的迭代时间之后,可以对确定的梯度通信策略进行调整,使得后续的迭代训练速度更快。具体的请参考步骤540中的描述,此处不再赘述。
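The training method is explained further below with reference to Fig. 6, taking the face recognition model ResNet50 with TensorFlow as the computing engine as an example.
Fig. 7 is a schematic flowchart of a method for training a deep learning model provided by an embodiment of this application. The method shown in Fig. 7 may include steps 710-730, which are described in detail below.
It should be noted that the example of Fig. 7 is only intended to help those skilled in the art understand the embodiments of this application, not to limit the embodiments to the specific values or scenarios illustrated. Those skilled in the art can obviously make various equivalent modifications or changes based on the example of Fig. 7, and such modifications or changes also fall within the scope of the embodiments of this application.
Step 710: Initialize the gradient communication strategy of the deep learning model.
Take the face recognition model ResNet50 (with ten thousand classes) as an example. The parameter matrix corresponding to the neurons of the last layer of this ResNet50 face model (the fully connected (FC) layer) holds about 78 MB of parameters, roughly half the size of the whole model. The larger the gradient computed for this layer of neurons, the longer the communication time needed to send it to the parameter storage space.
Therefore, at initialization, assuming the ResNet50 face model has n layers of neurons, the gradient corresponding to the FC-layer neurons in the last layer of the model is rescheduled to be transmitted to the parameter storage space after the gradient corresponding to the parameter matrix of the first-layer neurons has been sent. That is, once the gradient of the first layer has been transmitted to the parameter storage space, the FP calculation of the next iteration can start, so that the gradient corresponding to the parameter matrix of the last layer is transmitted to the parameter storage space during the FP calculation of the next iteration.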
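Referring to Fig. 8, (a) in Fig. 8 shows at least one deep learning model 100 transmitting the gradients in the first gradient set computed by the BP calculation of the j-th iteration to the parameter storage space in sequence, from the n-th layer of neurons to the first layer. Here t_n denotes the communication time needed to transmit the gradient of the n-th layer computed by BP to the parameter storage space, t_{n-1} denotes the communication time needed for the gradient of the (n-1)-th layer, and so on, and t_1 denotes the communication time needed for the gradient of the first layer. The FP calculation of the next iteration (the (j+1)-th iteration) is triggered when the gradient of the first layer from the BP calculation of the j-th iteration has been transmitted to the parameter storage space. As (a) in Fig. 8 shows, the time from the point at which the gradient is computed in the BP calculation of the j-th iteration until the FP calculation of the (j+1)-th iteration is triggered is no less than T_1, where T_1 = t_n + t_{n-1} + ... + t_2 + t_1. That is, after the transmission time T_1, the deep learning model can perform the FP calculation of the next iteration.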
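(b) in Fig. 8 shows at least one deep learning model 100 following the accelerated training scheme provided by the embodiments of this application. While BP computes the corresponding gradients from the n-th layer of neurons to the first layer, the gradient corresponding to the parameter matrix of the n-th layer can be rescheduled, according to the communication scheduling strategy, to be sent to the parameter storage space after the gradient corresponding to the parameter matrix of the first layer has been sent. The FP calculation of the next iteration is triggered when the gradient of the first layer from the BP calculation of the j-th iteration has been transmitted to the parameter storage space. As (b) in Fig. 8 shows, the time from the point at which the gradient is computed in the BP calculation of the j-th iteration until the FP calculation of the (j+1)-th iteration is triggered is no less than T_2, where T_2 = (the time the BP calculation of the j-th iteration needs to compute the gradient) + t_{n-1} + ... + t_2 + t_1. That is, after the time T_2, the deep learning model can perform the FP calculation of the next iteration, and the gradient of the n-th layer is transmitted to the parameter storage space after t_1.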
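In general, as processor performance keeps improving, the time BP needs to compute a gradient is lower than the time needed to transmit the computed gradient to the parameter storage space. If the time BP needs to compute a gradient is higher than the time needed to transmit the computed gradient to the parameter storage space, the time from when the layer-n gradient is computed in the BP computation of the j-th iteration until the deep learning model is triggered to perform the FP computation of the (j+1)-th iteration becomes greater than $T_1$ or $T_2$.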
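Because the time $T_2$ at which the deep learning model is triggered to perform the FP computation of the next iteration in FIG. 8(b) is less than the corresponding time $T_1$ in FIG. 8(a), the method for training a deep learning model proposed in the embodiments of this application can adjust, within the current iteration, the order in which the gradients obtained by the BP computation are transmitted to the parameter storage space, thereby reducing the communication time of the current iteration and improving the efficiency of model training.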
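The comparison between FIG. 8(a) and FIG. 8(b) can be reproduced with a small timeline simulation. This is a sketch under stated assumptions: a single link that sends one gradient at a time, a gradient that cannot be sent before BP has produced it, and made-up per-layer times. The function name `fp_trigger_time` is hypothetical, and the result is measured from the start of BP rather than from the moment the layer-n gradient is available.

```python
from typing import Sequence


def fp_trigger_time(
    bp_times: Sequence[float],
    comm_times: Sequence[float],
    order: Sequence[int],
) -> float:
    """Time from the start of BP until the layer-1 gradient has been sent.

    bp_times[k] and comm_times[k] belong to layer k+1. BP computes layer n
    first and layer 1 last. `order` lists 1-based layer numbers in send order.
    """
    n = len(bp_times)
    ready = {}  # layer number -> time at which its gradient is computed
    t = 0.0
    for layer in range(n, 0, -1):
        t += bp_times[layer - 1]
        ready[layer] = t

    link_free = 0.0  # time at which the link can start the next transfer
    for layer in order:
        start = max(link_free, ready[layer])
        link_free = start + comm_times[layer - 1]
        if layer == 1:
            return link_free  # next FP can be triggered here
    raise ValueError("order must contain layer 1")


# Made-up numbers: 4 layers, layer 4 is the large FC layer.
bp = [1, 1, 1, 1]
comm = [2, 2, 2, 8]
print(fp_trigger_time(bp, comm, [4, 3, 2, 1]))  # 15.0, default order as in FIG. 8(a)
print(fp_trigger_time(bp, comm, [3, 2, 1, 4]))  # 8.0, layer 4 deferred as in FIG. 8(b)
```

With these numbers, deferring the large layer-4 gradient lets the next FP computation start at time 8 instead of 15, while the layer-4 transfer overlaps with that FP computation.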
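Step 720: The deep learning model performs iterative training.
The initial gradient communication policy described above is written into the adjustment submodule 621 of the gradient communication module 620 in FIG. 6. According to the gradient communication policy determined by the adjustment submodule 621, the communication submodule 622 sends the gradients in the first gradient set stored in the local memory 630 to the parameter storage space 640.
The feedback module 650 may obtain the iteration time of the deep learning model 100 from the deep learning module of the deep learning model 100. The iteration time may be the BP computation time of the current iteration and the FP computation time of the next iteration. The feedback module 650 may feed the iteration time of the deep learning model 100 back to the adjustment submodule 621 in the gradient communication module 620.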
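Step 730: Tune the gradient communication policy of the deep learning model.
After receiving the iteration time of the deep learning model 100 fed back by the feedback module 650, the adjustment submodule 621 may adjust the determined gradient communication policy so that subsequent iterations train faster. The iterations and policy adjustments are repeated until the optimal gradient communication policy is found, and that policy is then used for the iterative training of the deep learning model 100 in the remaining training steps.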
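One way to picture the tuning loop is to try several candidate send orders and keep the one with the shortest measured iteration time. The search method is not specified in this application; trying a fixed list of candidates, the name `tune_comm_order`, and the `measure_iteration_time` callback standing in for the feedback module are all assumptions for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def tune_comm_order(
    candidates: Iterable[Sequence[int]],
    measure_iteration_time: Callable[[Sequence[int]], float],
) -> Tuple[List[int], float]:
    """Return the candidate send order with the smallest measured time."""
    best_order: List[int] = []
    best_time = float("inf")
    for order in candidates:
        elapsed = measure_iteration_time(order)  # feedback for this policy
        if elapsed < best_time:
            best_order, best_time = list(order), elapsed
    if not best_order:
        raise ValueError("no candidate orders were provided")
    return best_order, best_time


# Stand-in for the feedback module: fixed, made-up timings per candidate order.
fake_times = {(4, 3, 2, 1): 15.0, (3, 2, 1, 4): 8.0, (3, 2, 4, 1): 16.0}
best, elapsed = tune_comm_order(fake_times, lambda order: fake_times[tuple(order)])
print(best, elapsed)  # [3, 2, 1, 4] 8.0
```

In a real system the callback would run one or more training iterations under the candidate policy and return the measured BP plus next-FP time.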
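In the embodiments of this application, the training speed of the deep learning model is improved solely by adjusting the communication order of the gradients, without affecting the convergence accuracy of training. Because current deep learning models essentially all rely on the BP algorithm (including the back propagation through time (BPTT) algorithm), and most open-source deep learning engines (such as TensorFlow) are also implemented on the basis of the BP algorithm, the method for accelerating deep learning model training provided in the embodiments of this application is widely applicable.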
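The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that contains one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
An embodiment of this application further provides a computer program product. The computer program product includes program instructions that, when run on a computer, cause the computer to perform the methods in the foregoing aspects.
An embodiment of this application further provides a computer-readable medium. The computer-readable medium stores program instructions that, when run on a computer, cause the computer to perform the methods in the foregoing aspects.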
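Aspects or features of this application may be implemented as a method, an apparatus, or an article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used in this application covers a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or magnetic tapes), optical discs (for example, compact discs (CD) and digital versatile discs (DVD)), smart cards, and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives). In addition, the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.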
Claims (11)
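- A method for training a deep learning model, wherein a training system to which the method is applied comprises N deep learning models, each deep learning model comprises n layers of neurons, the training process of each deep learning model comprises a plurality of iterations, and each iteration comprises a forward propagation (FP) computation and a back propagation (BP) computation, where N is a positive integer greater than 1 and n is a positive integer greater than 1, the method comprising: generating N first gradient sets in the BP computation of the j-th iteration of the N deep learning models, wherein each first gradient set comprises the gradient corresponding to the parameter matrix of each layer of neurons of one deep learning model, and j is a positive integer greater than 0; adjusting the communication order of the gradients included in each first gradient set; sending the gradients included in the N first gradient sets to a parameter storage space of the training system according to the adjusted communication order of the gradients included in each first gradient set; obtaining a second gradient set according to the N first gradient sets stored in the parameter storage space; and correcting the parameter matrix of each layer of neurons of each deep learning model according to the gradients included in the second gradient set, for use in the FP computation of the (j+1)-th iteration of each deep learning model.
- The method according to claim 1, wherein the adjusting the communication order of the gradients included in each first gradient set comprises: adjusting the gradient corresponding to the parameter matrix of the layer-a neurons to be sent to the parameter storage space before the gradient corresponding to the parameter matrix of the layer-b neurons is sent to the parameter storage space, where b is less than or equal to n, a is less than b, and a is a positive integer greater than 0.
- The method according to claim 1 or 2, wherein the adjusting the communication order of the gradients included in each first gradient set comprises: adjusting the communication order of the gradients included in each first gradient set according to a gradient communication policy, wherein the gradient communication policy is set according to at least one of the following parameters: the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model, and the time required by each layer of neurons of the deep learning model in the FP computation.
- The method according to any one of claims 1 to 3, wherein the adjusting the communication order of the gradients included in each first gradient set comprises: adjusting the communication order of the gradients included in each first gradient set according to a gradient communication policy; and the method further comprises: obtaining the iteration time of the deep learning model, wherein the iteration time comprises the time of the BP computation of the L-th iteration and the time of the FP computation of the (L+1)-th iteration of the deep learning model, and L is a positive integer greater than j; and adjusting the gradient communication policy according to the iteration time.
- A system for training a deep learning model, wherein the training system comprises N deep learning models, a gradient communication module, a gradient update module, a correction module, and a parameter storage space, each deep learning model comprises n layers of neurons, the training process of each deep learning model comprises a plurality of iterations, and each iteration comprises a forward propagation (FP) computation and a back propagation (BP) computation, where N is a positive integer greater than 1 and n is a positive integer greater than 1; each deep learning module of the N deep learning models is configured to generate a first gradient set in the BP computation of the j-th iteration, wherein each first gradient set comprises the gradient corresponding to the parameter matrix of each layer of neurons of the respective deep learning model, and j is a positive integer greater than 0; the gradient communication module is configured to adjust the communication order of the gradients included in each first gradient set, and to send the gradients included in the N first gradient sets to the parameter storage space according to the adjusted communication order of the gradients included in each first gradient set; the gradient update module is configured to obtain a second gradient set according to the N first gradient sets stored in the parameter storage space; and the correction module is configured to correct the parameter matrix of each layer of neurons of each deep learning model according to the gradients included in the second gradient set, for use in the FP computation of the (j+1)-th iteration of each deep learning model.
- The system according to claim 5, wherein the gradient communication module is configured to: adjust the gradient corresponding to the parameter matrix of the layer-a neurons to be sent to the parameter storage space before the gradient corresponding to the parameter matrix of the layer-b neurons is sent to the parameter storage space, where b is less than or equal to n, a is less than b, and a is a positive integer greater than 0.
- The system according to claim 5 or 6, wherein the gradient communication module is configured to: adjust the communication order of the gradients included in each first gradient set according to a gradient communication policy, wherein the gradient communication policy is set according to at least one of the following parameters: the communication bandwidth between the deep learning model and the parameter storage space, the size of the gradient corresponding to the parameter matrix of each layer of neurons of the deep learning model, and the time required by each layer of neurons of the deep learning model in the FP computation.
- The system according to any one of claims 5 to 7, wherein the training system further comprises a feedback module; the feedback module is configured to feed the obtained iteration time of the deep learning model back to the gradient communication module, wherein the iteration time comprises the time of the BP computation of the L-th iteration and the time of the FP computation of the (L+1)-th iteration of the deep learning model, and L is a positive integer greater than j; and the gradient communication module is further configured to adjust the gradient communication policy according to the iteration time.
- A system for training a deep learning model, wherein the training system comprises at least one computing node, each computing node comprises a memory and at least one processor, the memory is configured to store program instructions, and the at least one processor of the at least one computing node executes the program instructions in the memory to perform the operation steps of the method according to any one of claims 1 to 4.
- A non-transitory readable storage medium, comprising program instructions, wherein when the program instructions are run by at least one computing node, the at least one computing node performs the method according to any one of claims 1 to 4.
- A computer program product, comprising program instructions, wherein when the program instructions are run by at least one computing node, the at least one computing node performs the method according to any one of claims 1 to 4.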
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201980000128.9A CN111788585B (zh) | 2019-01-16 | 2019-01-24 | Deep learning model training method and system |
| EP19910606.3A EP3889846A4 (en) | 2019-01-16 | 2019-01-24 | METHOD AND SYSTEM FOR TRAINING DEEP LEARNING MODELS |
| US17/376,722 US20210342696A1 (en) | 2019-01-16 | 2021-07-15 | Deep Learning Model Training Method and System |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910041235.8 | 2019-01-16 | ||
| CN201910041235 | 2019-01-16 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/376,722 Continuation US20210342696A1 (en) | Deep Learning Model Training Method and System | | 2021-07-15 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020147142A1 (zh) | 2020-07-23 |
Family
ID=71613070
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/072895 Ceased WO2020147142A1 (zh) | Deep learning model training method and system | 2019-01-16 | 2019-01-24 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20210342696A1 (zh) |
| EP (1) | EP3889846A4 (zh) |
| CN (1) | CN111788585B (zh) |
| WO (1) | WO2020147142A1 (zh) |
Also Published As
| Publication number | Publication date |
|---|---|
| US20210342696A1 (en) | 2021-11-04 |
| CN111788585B (zh) | 2024-04-12 |
| EP3889846A4 (en) | 2022-06-01 |
| CN111788585A (zh) | 2020-10-16 |
| EP3889846A1 (en) | 2021-10-06 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19910606; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2019910606; Country of ref document: EP; Effective date: 20210630 |
| | NENP | Non-entry into the national phase | Ref country code: DE |

