WO2024012476A1 - Model training method and related device - Google Patents

Model training method and related device

Info

Publication number
WO2024012476A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
training
accuracy range
range
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/106905
Other languages
English (en)
French (fr)
Inventor
于德权
赵寅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to KR1020257004231A (patent KR20250037493A)
Priority to EP23838966.2A (patent EP4550209A4)
Priority to JP2025501773A (patent JP2025522114A)
Publication of WO2024012476A1
Priority to US19/019,814 (patent US20250156712A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a model training method and related equipment.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • A neural network is an artificially established dynamic system with a directed graph as its topological structure. It processes information by producing state responses to continuous or intermittent inputs, and is an information processing system designed to imitate the structure and function of the human brain. After decades of development, artificial neural networks have been widely used in pattern recognition, automatic control, signal processing, assisted decision-making, artificial intelligence, scientific computing and many other fields, and have achieved widespread success. In fields such as image processing, audio and video processing, and natural language processing in particular, artificial neural networks are developing vigorously and play an irreplaceable role.
  • The precision of the data format used is mainly set manually, based on experience.
  • The person configuring the training decides, based on experience, whether each network layer uses 16-bit half-precision floating point (FP16) or 32-bit single-precision floating point (FP32).
  • This application provides a model training method and related equipment, which adjusts the accuracy range used in the process of training the model in real time when the calculated value of the parameter exceeds the accuracy range. It can effectively solve the training stagnation problem caused by overflow faced by low-precision training.
  • the first aspect of the embodiments of this application provides a model training method.
  • This method can be applied to dynamic computational graph scenarios as well as static computational graph scenarios.
  • the dynamic computational graph scenario can be understood as updating the computational graph after each layer of the network structure of the model is calculated.
  • the static computational graph scenario can be understood as updating the computational graph after all layers of the network structure of the model are calculated.
  • the main difference between the two is when the computational graph is updated.
  • the method provided by the embodiments of this application can be applied to the calculations in the computational graph.
  • the method may be executed by the training device, or may be executed by a component of the training device (such as a processor, a chip, or a chip system, etc.).
  • the method includes: obtaining training data; using the training data as input to the model, and calculating parameters in the model training process using a first precision range to obtain a calculated value; and if the calculated value exceeds the first precision range, recalculating the parameters using a second precision range and training the model for one or more iterations using the recalculated parameters, where the second precision range includes the first precision range, or the second precision range partially overlaps with the first precision range.
  • the second precision range is used to recalculate the parameter. That is, automatic real-time adjustment of the accuracy range through the overflow information of parameter calculation values can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation or training failure caused by parameters overflowing the first accuracy range.
  • the embodiments of the present application can use the overflow information of the parameters to adjust the applicable precision range of the parameters in real time, reducing overflow problems caused by low-precision floating-point calculations.
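The overflow-triggered recalculation described above can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes FP16 as the first precision range and FP32 as the second, simulates each format by rounding through Python's `struct` module, and the names `quantize`, `dot`, and `dot_with_fallback` are hypothetical.

```python
import math
import struct


def quantize(x: float, fmt: str) -> float:
    """Round x to a struct float format ('e' = FP16, 'f' = FP32).

    Values too large for the format become +/-inf, mimicking overflow.
    """
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot(xs, ws, fmt: str) -> float:
    """Dot product with every intermediate result rounded to fmt."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = quantize(acc + quantize(x * w, fmt), fmt)
    return acc


def dot_with_fallback(xs, ws, first: str = "e", second: str = "f"):
    """Compute in the first range; recompute in the second on overflow."""
    value = dot(xs, ws, first)
    if math.isinf(value) or math.isnan(value):
        # The calculated value left the first precision range: recalculate.
        return dot(xs, ws, second), second
    return value, first
```

In this sketch the second format is a superset of the first, which corresponds to the case where the second precision range includes the first; a partially overlapping range would need a different pair of formats.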
  • the above model includes multiple network structures, and the step of recalculating the parameters using the second precision range includes: recalculating the parameters starting from the first-layer network structure of the model, using the second precision range.
  • a new accuracy range can be selected to recalculate from the first layer network structure of the model to reduce calculation errors caused by calculated value overflow.
  • the above model includes multiple network structures, and the step of recalculating the parameters using the second precision range includes: recalculating, using the second precision range, the parameters of the current network structure in which the calculated value overflowed.
  • a new precision range can be selected to recalculate the current-layer network structure, reducing calculation errors caused by overflow of calculated values.
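The two recalculation strategies (restart from the first layer, or redo only the layer that overflowed) can be contrasted in a small sketch. Everything here is an illustrative assumption: each "layer" is a single scalar multiplication, FP16 and FP32 stand in for the first and second precision ranges, and `forward_with_recalc` is a hypothetical name.

```python
import math
import struct


def quantize(x: float, fmt: str) -> float:
    """Round x to 'e' (FP16) or 'f' (FP32); out-of-range values become inf."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def forward_with_recalc(x, weights, restart_from_first: bool):
    """Forward pass over scalar layers with overflow-triggered recalculation.

    Returns the final output and the format finally used by each layer.
    """
    fmts = ["e"] * len(weights)   # every layer starts in the first range
    inputs = [x]                  # inputs[i] is the cached input of layer i
    i = 0
    while i < len(weights):
        out = quantize(inputs[i] * weights[i], fmts[i])
        if math.isinf(out) and fmts[i] == "e":
            if restart_from_first:
                fmts = ["f"] * len(weights)   # widen all layers, start over
                inputs = [x]
                i = 0
            else:
                fmts[i] = "f"                 # widen only the overflow layer
            continue                          # recalculate with the new range
        inputs.append(out)
        i += 1
    return inputs[-1], fmts
```

Redoing only the overflow layer reuses the cached input of that layer and keeps the other layers in the cheaper format, while restarting from the first layer recomputes everything in the wider format.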
  • the above parameters are related to the loss function of the model, or to the calculations of the model in the forward propagation process, or to the calculations of the model in the back propagation process.
  • the above model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures in the forward propagation process, where an intermediate feature is the output feature of any one of the multiple network structures; the value of the loss function of the model; and gradients calculated by the multiple network structures in the back propagation process, where the gradients include the gradient of the intermediate features and/or the gradient of the weights of the model.
  • the parameter can be a parameter that needs to be calculated during the forward propagation or back propagation of the model, a parameter output by an individual layer in the model, or a parameter obtained after all layers of the entire model have been calculated, etc.
  • this broadens the scenarios in the training process to which the method applies. In other words, the precision of the calculations involved in the model training process can be adjusted using the method provided by the embodiments of this application.
  • when the parameter includes a gradient, the calculated value is the gradient divided by a scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow; the method also includes: updating the scaling coefficient using a first coefficient, where the updated scaling coefficient replaces the scaling coefficient before the update in the next training iteration of the model.
  • the first coefficient is a positive number less than 1.
  • the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1, and the preset threshold is used to reduce the probability of gradient overflow.
  • the lower limit of the scaling coefficient is set to reduce the risk of subsequent parameter accuracy underflow.
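A minimal sketch of the scaling-coefficient update follows. The text states only that the first coefficient is a positive number less than 1 and that the scaling coefficient has a lower bound of at least 1; combining them as a multiplication clamped from below is an assumption of this sketch, and `update_scale` is a hypothetical name.

```python
def update_scale(scale: float, first_coefficient: float, min_scale: float = 1.0) -> float:
    """Shrink the scaling coefficient, but never below the preset threshold."""
    if not 0.0 < first_coefficient < 1.0:
        raise ValueError("first_coefficient must be a positive number less than 1")
    if min_scale < 1.0:
        raise ValueError("min_scale must be greater than or equal to 1")
    # Assumed update rule: multiply by the first coefficient, clamp from below.
    return max(scale * first_coefficient, min_scale)
```

For example, starting from a scaling coefficient of 1024 with a first coefficient of 0.5, successive updates give 512, 256, and so on, and the value stops decreasing once it reaches the threshold.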
  • when the above parameters include intermediate features calculated during forward propagation, the step of recalculating the parameters using the second precision range includes: using the second precision range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure.
  • the overflow layer is a network structure in which the calculated values of the intermediate features in multiple network structures overflow the first accuracy range.
  • the overflow layer can be recalculated using the second precision range, or the calculation can be restarted from the first-layer network structure. That is, the precision range of only some layers can be modified, or the precision range of all layers can be modified, which makes the solution flexible.
  • the above step of recalculating the parameters using the second precision range includes: using the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
  • the value of the loss function can be recalculated directly using the second precision range, or the calculation can proceed layer by layer from the first-layer network structure until the value of the loss function is obtained. That is, the precision range of only some layers can be modified, or the precision range of all layers can be modified, which makes the solution flexible.
  • the above step of training the model using the recalculated parameters includes: in the Nth iteration of the model training process, obtaining the number of times the multiple network structures in the model overflow under the first precision range, where N is a positive integer greater than or equal to 1; and if the number of overflows is greater than or equal to a second threshold, determining that the initial precision range in the next training iteration is changed from the first precision range to the second precision range, and resetting the number of overflows to zero.
  • the initial precision range is adjusted by recording the number of overflows: when overflows under the first precision range are frequent enough to affect model training, the initial precision range is changed from the first precision range to the second precision range.
  • the above number of overflows includes: the number of overflows of the multiple network structures in the forward propagation process, and/or the number of overflows of the multiple network structures in the back propagation process.
  • the judgment condition for adjusting the precision range (i.e., the number of overflows) can be the number of overflows over the entire training process, or the number of overflows in forward propagation or in back propagation, which improves the applicability of the method.
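The overflow-count rule can be sketched as a small bookkeeping class. This is an illustration under assumptions: the count is taken to accumulate across iterations until it reaches the second threshold (the text says only that it is cleared after a switch), and the class and attribute names are hypothetical.

```python
class OverflowTracker:
    """Tracks overflows and decides the initial precision range of the next iteration."""

    def __init__(self, second_threshold: int):
        self.second_threshold = second_threshold
        self.count = 0
        self.initial_range = "first"

    def end_of_iteration(self, forward_overflows: int, backward_overflows: int = 0) -> str:
        """Record this iteration's overflows; return the range for the next iteration."""
        self.count += forward_overflows + backward_overflows
        if self.count >= self.second_threshold:
            self.initial_range = "second"   # start the next iteration in the wider range
            self.count = 0                  # clear the overflow count
        return self.initial_range
```

Passing only `forward_overflows`, only `backward_overflows`, or both corresponds to counting overflows in forward propagation, in back propagation, or over the whole iteration.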
  • the above parameters are related to the loss function.
  • the above loss function can vary according to the training method of the model.
  • for example, in supervised training, the loss function is used to represent the difference between the output of the model and the label of the training data.
  • the loss function can be a custom function.
  • in unsupervised training, the loss function is used to represent the difference between the output of the model and the input (or a clustering result, etc.).
  • it can be understood that in unsupervised training, the output of the model is expected to be restorable to the input of the model.
  • in that case the label is the training data itself (that is, the output obtained by the model is sent to another network to restore the training data).
  • the loss function is not limited in the embodiments of the present application.
  • the loss function can also be understood as the optimization objective function of the model, and can be set according to actual needs.
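As a small illustration of the two cases, the sketch below uses mean squared error for both. The choice of mean squared error is purely an assumption made for this example (the embodiments do not limit the loss function), and the function names are hypothetical.

```python
def mse(outputs, targets) -> float:
    """Mean squared difference between two equal-length sequences."""
    if len(outputs) != len(targets):
        raise ValueError("length mismatch")
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def supervised_loss(model_output, label) -> float:
    """Difference between the model output and the label of the training data."""
    return mse(model_output, label)


def reconstruction_loss(restored_input, model_input) -> float:
    """Difference between the restored data and the original input."""
    return mse(restored_input, model_input)
```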
  • a second aspect of the embodiment of the present application provides a training device.
  • the training device can be applied to dynamic computing graph scenarios or static computing graph scenarios.
  • the training device includes: an acquisition unit, used to obtain training data; and a calculation unit, used to take the training data as input to the model and calculate parameters in the model training process using the first precision range to obtain calculated values; the calculation unit is also used to recalculate the parameters using the second precision range if the calculated value exceeds the first precision range, and to train the model for one or more iterations using the recalculated parameters, where the second precision range includes the first precision range, or the second precision range partially overlaps with the first precision range.
  • the above model includes multiple network structures, and the calculation unit is specifically configured to recalculate the parameters starting from the first-layer network structure of the model, using the second precision range.
  • the above calculation unit is specifically configured to recalculate, using the second precision range, the parameters of the current network structure in which the calculated value overflowed.
  • the above model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures during the forward propagation process, where an intermediate feature is the output feature of any one of the multiple network structures; the value of the loss function of the model; and gradients calculated by the multiple network structures during the back propagation process, where the gradients include the gradient of the intermediate features and/or the gradient of the weights of the model.
  • when the parameter includes a gradient, the calculated value is the gradient divided by the scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow; the calculation unit is also used to update the scaling coefficient using the first coefficient, where the updated scaling coefficient replaces the scaling coefficient before the update in the next training iteration of the model.
  • the first coefficient is a positive number less than 1; the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.
  • when the above parameters include intermediate features calculated during forward propagation, the calculation unit is specifically configured to use the second precision range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure.
  • the overflow layer is a network structure, among the multiple network structures, in which the calculated value of the intermediate features overflows the first precision range.
  • the calculation unit is specifically configured to use the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
  • the above calculation unit is specifically used to obtain, in the Nth iteration of the model training process, the number of times the multiple network structures in the model overflow under the first precision range, where N is a positive integer greater than or equal to 1; the calculation unit is also specifically used to determine, if the number of overflows is greater than or equal to the second threshold, that the initial precision range in the next training iteration is changed from the first precision range to the second precision range, and to reset the number of overflows to zero.
  • the above number of overflows includes: the number of overflows of the multiple network structures in the forward propagation process, and/or the number of overflows of the multiple network structures in the back propagation process.
  • the third aspect of the present application provides a training device, including: a processor, where the processor is coupled to a memory, and the memory is used to store programs or instructions.
  • when the programs or instructions are executed by the processor, the training device implements the method in the above first aspect or any possible implementation of the first aspect.
  • the fourth aspect of the present application provides a computer-readable medium on which a computer program or instructions are stored.
  • when the computer program or instructions are run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a computer program product, which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
  • this application has the following advantages: when the calculated value of the parameter exceeds the first precision range during the model training process, the second precision range is used to recalculate the parameter. That is, automatic real-time adjustment of the accuracy range through the overflow information of parameter calculation values can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation or training failure caused by parameters overflowing the first accuracy range.
  • the embodiments of the present application can use the overflow information of the parameters to adjust the applicable precision range of the parameters in real time, reducing overflow problems caused by low-precision floating-point calculations.
  • Figure 1 is a schematic structural diagram of the system architecture provided by the embodiment of the present application.
  • Figure 2 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 4 is a structural example diagram of the model provided by the embodiment of the present application.
  • Figure 5 is another schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 6 is another schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 7 is another schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 8 is a schematic structural diagram of the training equipment provided by the embodiment of the present application.
  • Figure 9 is another schematic structural diagram of the training equipment provided by the embodiment of the present application.
  • This application provides a model training method and related equipment, which adjusts the accuracy range used in the process of training the model in real time when the calculated value exceeds the accuracy range. It can effectively solve the problem of training stagnation or training failure caused by overflow in low-precision training. In addition, it has low requirements for the network mixed precision initialization scheme, does not rely on manual experience to customize the initialization scheme, and can automatically adjust the training accuracy layer by layer in real time.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an arithmetic unit that takes $X_s$ and an intercept $b$ as input.
  • the output of the arithmetic unit can be: $f\left(\sum_{s=1}^{n} W_s X_s + b\right)$
  • where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $X_s$, and $b$ is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next layer.
  • the activation function can be a ReLU function.
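The output of a single neural unit can be written out directly. The sketch below follows the symbols used above (inputs X_s, weights W_s, bias b, activation f) and uses ReLU as the activation; the function names are hypothetical.

```python
def relu(z: float) -> float:
    """ReLU activation: passes positive values, clamps negatives to zero."""
    return max(0.0, z)


def neuron_output(xs, ws, b: float, f=relu) -> float:
    """Compute f(sum_s W_s * X_s + b) for one neural unit."""
    return f(sum(w * x for w, x in zip(ws, xs)) + b)
```

With inputs `[1.0, 2.0]`, weights `[0.5, 1.0]` and bias `0.25`, the weighted sum is 2.75 and ReLU leaves it unchanged; with weights `[0.5, -1.0]` the sum is negative and the output is 0.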
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • W is a weight vector, and each value in this vector represents the weight value of a neuron in this layer of neural network.
  • This vector W determines the spatial transformation from the above input space to the output space, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vector W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
  • A neural network can also be called an artificial neural network (ANN): it is an artificially established dynamic system with a directed graph as its topology. It processes information by producing state responses to continuous or intermittent inputs, and is an information processing system designed to imitate the structure and function of the human brain.
  • Neural network forward propagation refers to the calculation process from the input layer through the hidden layers to the output layer. Starting from the input layer and following the topology of the network, the output (activation value) of the previous layer is used as the input of the next layer, and the output of each layer is calculated layer by layer until the final output layer. This process is called forward propagation of the network.
  • Neural network backpropagation is the abbreviation of "error backpropagation". It is a common method used in combination with optimization methods (such as gradient descent) to train artificial neural networks. This method calculates the gradient of the loss function for all weights in the network. This gradient is fed back to the optimization method, which is used to update the weights to minimize the loss function.
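Forward propagation, gradient computation, and the weight update can be seen together in a deliberately tiny sketch. It assumes a model with a single weight, a squared-error loss, and plain gradient descent; these choices and the name `train_single_weight` are illustrative only.

```python
def train_single_weight(x: float, y: float, w: float = 0.0,
                        lr: float = 0.1, steps: int = 50) -> float:
    """Fit pred = w * x to the target y by gradient descent."""
    for _ in range(steps):
        pred = w * x                   # forward propagation
        grad = 2.0 * (pred - y) * x    # dL/dw for the loss L = (pred - y) ** 2
        w -= lr * grad                 # update the weight to reduce the loss
    return w
```

With `x = 1.0` and `y = 3.0`, each step shrinks the error `w - 3` by a factor of 0.8, so the weight approaches 3.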
  • the precision range refers to the precision range of the data type used by the computer. It can refer to the specific accuracy or the dynamic range of the accuracy.
  • the following uses floating point (FP), a data type commonly used in neural networks, as an example. In actual applications, the data type can also be an integer (int) type, such as int8 or int16.
  • FP is mainly used to represent decimals and usually consists of three parts, namely the sign bit, the exponent bit and the mantissa bit.
  • the sign bit can be 1 bit indicating positive or negative, and the exponent bit and mantissa bit can be multiple bits.
  • the mantissa bit represents the precision
  • the exponent bit is used to represent the dynamic range within which the precision can be achieved (referred to as the precision range in the embodiment of this application).
  • Floating-point numbers can usually include three formats, namely half-precision floating-point numbers, single-precision floating-point numbers and double-precision floating-point numbers, as follows.
  • Half-precision floating point: a binary data type used by computers that occupies 16 bits (2 bytes) in computer memory. It is also referred to as FP16.
  • the absolute value range of the values that can be represented by half-precision floating-point numbers is approximately $[6 \times 10^{-8}, 65504]$.
  • the precision of FP16 is $2^{-10}$.
  • Single-precision floating point: a binary data type used by computers that occupies 32 bits (4 bytes) in computer memory. It is also referred to as FP32.
  • the absolute value range of the values that can be represented by single-precision floating-point numbers is approximately $[1.4 \times 10^{-45}, 3.4 \times 10^{38}]$.
  • the precision of FP32 is $2^{-23}$.
  • Double-precision floating point: a binary data type used by computers that occupies 64 bits (8 bytes) in computer memory. It is also referred to as FP64. Double-precision floating-point numbers can represent 15 or 16 significant decimal digits, and the absolute value range of the representable values is approximately $[2.23 \times 10^{-308}, 1.80 \times 10^{308}]$. The precision of FP64 is $2^{-52}$.
  • in FP16, the sign bit occupies 1 bit, the exponent bit occupies 5 bits, and the mantissa bit occupies 10 bits.
  • in FP32, the sign bit occupies 1 bit, the exponent bit occupies 8 bits, and the mantissa bit occupies 23 bits.
  • in FP64, the sign bit occupies 1 bit, the exponent bit occupies 11 bits, and the mantissa bit occupies 52 bits.
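The bit layouts and the resulting ranges can be checked with a short sketch. It assumes the IEEE 754 binary encodings (which Python's `struct` format `e` uses for half precision); the helper names `fp16_fields` and `max_finite` are hypothetical.

```python
import struct
import sys


def fp16_fields(x: float):
    """Return the (sign, exponent, mantissa) bit fields of x encoded as FP16."""
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value of a binary format with these field widths."""
    max_exponent = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exponent
```

For instance, `max_finite(5, 10)` reproduces the FP16 maximum of 65504, and `max_finite(11, 52)` equals `sys.float_info.max` on platforms where Python floats are FP64.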
  • In the embodiments of this application, "overflow" covers both upward overflow and underflow.
  • Upward overflow means that the absolute value of the calculated value is too large and exceeds the maximum value that can be represented by a certain precision range.
  • Underflow means that the absolute value of the calculated value is too small, smaller than the positive value closest to 0 that can be represented by a certain precision range.
  • Overflow may be a storage overflow or a calculation overflow.
  • the embodiments of this application mainly address calculation overflow.
  • Taking FP16 as an example, upward overflow occurs if the calculated value is positive and greater than the largest positive number that FP16 can represent, or if the calculated value is negative and smaller than the smallest negative number that FP16 can represent.
  • Underflow occurs if the calculated value is positive and smaller than the smallest positive number that FP16 can represent, or if the calculated value is negative and greater than the largest negative number that FP16 can represent.
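The overflow and underflow conditions for FP16 reduce to two comparisons on the absolute value. The sketch below is a simplification that ignores rounding at the boundaries; the constants are the largest finite and smallest positive FP16 values, and `classify_fp16` is a hypothetical name.

```python
FP16_MAX = 65504.0             # largest finite FP16 magnitude
FP16_MIN_POSITIVE = 2.0 ** -24  # smallest positive FP16 value, about 6e-8


def classify_fp16(value: float) -> str:
    """Classify a calculated value against the FP16 range."""
    magnitude = abs(value)
    if magnitude > FP16_MAX:
        return "overflow"       # too large, whether positive or negative
    if 0.0 < magnitude < FP16_MIN_POSITIVE:
        return "underflow"      # nonzero but closer to 0 than FP16 can express
    return "in range"
```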
  • the key point of mixed precision is: what strategy is used to decide which parts of the network are trained with high precision and which parts with low precision, so as to preserve accuracy while improving training efficiency.
  • in other words, the key point of mixed precision is how to combine low precision and high precision for training.
  • mixed precision training methods mainly include the following two methods.
  • the first one specifies the calculation accuracy through the type of each layer in the neural network. Some types of layers are calculated with high precision, and some types of layers are calculated with low precision.
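The first approach amounts to a fixed lookup from layer type to precision. The sketch below shows the idea only; the layer-type names and the particular assignments in the table are hypothetical examples, not values taken from this application.

```python
# Hypothetical, hand-written assignment of precision by layer type.
PRECISION_BY_LAYER_TYPE = {
    "conv": "FP16",
    "linear": "FP16",
    "softmax": "FP32",
    "batch_norm": "FP32",
}


def assign_precision(layer_types, default: str = "FP32"):
    """Return the precision for each layer, based only on the layer's type."""
    return [PRECISION_BY_LAYER_TYPE.get(t, default) for t in layer_types]
```

Because the table is written by hand, this is the kind of experience-based configuration that the method described here aims to avoid.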
  • the second one dynamically selects the precision according to whether the quantization error exceeds a threshold.
  • the quantization error can be measured at different points in the network or measured over time as training proceeds. For example, the quantization error can be calculated by comparing training results with baseline values.
  • the baseline value can be determined by a variety of methods, such as training the same network using full-precision floating point values, repeating a subset of the calculations with high precision, and analyzing or sampling statistics specific to the calculations involved.
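The second approach can be sketched as a comparison against a high-precision baseline. Measuring the quantization error as a relative difference is an assumption of this sketch, and `choose_precision` is a hypothetical name.

```python
def choose_precision(low_result: float, baseline: float, threshold: float) -> str:
    """Pick "high" precision when the low-precision result strays too far from the baseline."""
    # Assumed error measure: relative difference, guarded against a zero baseline.
    error = abs(low_result - baseline) / max(abs(baseline), 1e-12)
    return "high" if error > threshold else "low"
```

The `baseline` argument is what makes this approach costly: it has to come from an additional high-precision computation.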
  • the adjustment of training accuracy depends on the accuracy baseline value.
  • This baseline value needs to be repeatedly calculated with high precision or obtained by training the same network with full precision floating point values. This makes the solution still not automated enough, and may significantly increase the computational complexity of network training due to the construction of baseline values. It runs counter to the purpose of using low-precision training in order to save calculations and speed up training.
  • embodiments of the present application provide a model training method and related equipment.
  • the network mixed precision initialization scheme has low requirements and does not rely on manual experience to customize the initialization scheme (for example, there is no need for manual experience to adjust the accuracy at the layer level for each network).
  • the training accuracy can be automatically adjusted layer by layer in real time during the training process based on whether it overflows.
  • an embodiment of the present invention provides a system architecture 100.
  • the data collection device 160 is used to collect training data.
  • the training data may include one or more of the following: images, speech, text, etc.
  • the training data is stored in the database 130, and the training device 120 trains to obtain the target model/rule 101 based on the training data maintained in the database 130.
  • the target model/rules 101 can be used to implement computer vision tasks (eg, classification, segmentation, detection, image generation, etc.).
  • the target model/rule 101 in the embodiment of this application may specifically be a neural network or the like.
  • the training data maintained in the database 130 may not all be collected by the data collection device 160, and may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rules 101 based entirely on the training data maintained by the database 130. It may also obtain training data from the cloud or other places for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rules 101 trained according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in Figure 1 .
  • the execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR) device, a virtual reality (VR) device, or a vehicle-mounted terminal.
  • the execution device 110 can also be a server or a cloud, etc.
  • the execution device 110 is configured with an I/O interface 112 for data interaction with external devices. The user can input data to the I/O interface 112 through the client device 140. The input data corresponds to the training data.
  • Input data in embodiments may also include one or more of the following: images, voice, text, etc.
  • the input data can be input by the user, or uploaded by the user through the shooting device, and of course it can also come from a database, etc., and the details are not limited here.
  • the preprocessing module 113 is used to perform preprocessing (eg, segmentation, selection, transformation, etc.) according to the input data received by the I/O interface 112 .
  • the input data is divided into multiple data blocks (patches).
  • when the execution device 110 preprocesses input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processes, the execution device 110 can call data, code, etc. in the data storage system 150 for corresponding processing. The data, instructions, etc. obtained by that processing can also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing results (eg, classification results, segmentation results, detection results, etc.) to the client device 140, thereby providing them to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve those goals or complete those tasks, thereby providing the user with the desired results.
  • the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send input data to the I/O interface 112. If the automatic sending of input data by the client device 140 requires the user's authorization, the user can set corresponding permissions in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, etc.
  • the client device 140 can also be used as a data collection end to collect the input data fed to the I/O interface 112 and the output results of the I/O interface 112 as new sample data, and store them in the database 130.
  • alternatively, the I/O interface 112 may directly store the input data fed to the I/O interface 112 and the output results of the I/O interface 112 in the database 130 as new sample data, as shown in the figure.
  • Figure 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention.
  • the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110.
  • Figure 2 is a chip hardware structure provided by an embodiment of the present invention.
  • the chip includes a neural network processor 20.
  • the chip can be disposed in the execution device 110 as shown in Figure 1 to complete the calculation work of the calculation module 111.
  • the chip can also be provided in the training device 120 as shown in Figure 1 to complete the training work of the training device 120 and output the target model/rules 101.
  • the neural network processor 20 may be a neural network processor (neural-network processing unit, NPU), a tensor processing unit (TPU), or a graphics processor (graphics processing unit, GPU), etc., which are suitable for large-scale applications.
  • the neural network processor 20 is mounted on the main central processing unit (central processing unit, CPU) (host CPU) as a co-processor, and the main CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 203.
  • the controller 204 controls the arithmetic circuit 203 to extract data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit 203 internally includes multiple processing units (processing engines, PEs).
  • arithmetic circuit 203 is a two-dimensional systolic array.
  • the arithmetic circuit 203 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 203 is a general-purpose matrix processor.
  • the arithmetic circuit 203 obtains the corresponding data of matrix B from the weight memory 202 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit 203 obtains the matrix A data from the input memory 201, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in the accumulator 208.
  • the vector calculation unit 207 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 207 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 207 can store the processed output vectors in the unified memory 206.
  • the vector calculation unit 207 may apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value.
  • vector calculation unit 207 generates normalized values, merged values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 203, such as for use in a subsequent layer in a neural network.
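  • The data flow described above (matrix operation with accumulation, followed by a nonlinear function applied by the vector calculation unit) can be mimicked in software. The sketch below is a functional illustration only and does not model the hardware. The function names and the choice of ReLU as the nonlinear function are assumptions.

```python
def matmul_accumulate(a, b):
    """Multiply matrices a and b, accumulating partial results step by step."""
    rows, inner, cols = len(a), len(b), len(b[0])
    accumulator = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):  # each k contributes one partial result
        for i in range(rows):
            for j in range(cols):
                accumulator[i][j] += a[i][k] * b[k][j]
    return accumulator


def relu(matrix):
    """Nonlinear function applied to the accumulated values (assumed: ReLU)."""
    return [[max(0.0, value) for value in row] for row in matrix]


A = [[1.0, -2.0], [3.0, 4.0]]  # input data (matrix A)
B = [[1.0, 0.0], [0.0, 1.0]]   # weight data (matrix B)
activations = relu(matmul_accumulate(A, B))
```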
  • the unified memory 206 is used to store input data and output data.
  • the storage unit access controller 205 (direct memory access controller, DMAC) transfers the input data in the external memory to the input memory 201 and/or the unified memory 206, stores the weight data in the external memory into the weight memory 202, and stores the data in the unified memory 206 into the external memory.
  • a bus interface unit (BIU) 210 is used to implement interaction between the main CPU, the DMAC and the instruction fetch buffer 209 through the bus.
  • An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
  • the controller 204 is used to call instructions cached in the instruction fetch buffer 209 to control the working process of the computing accelerator.
  • the unified memory 206, the input memory 201, the weight memory 202 and the instruction fetch buffer 209 are all on-chip memories, and the external memory is a memory external to the NPU.
  • the external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM) or other readable and writable memory.
  • the application scenarios of the model training method provided by the embodiments of this application are described.
  • This method can be applied to dynamic computing graph scenarios as well as static computing graph scenarios.
  • the dynamic calculation graph scenario can be understood as updating the calculation graph after each layer of the network structure of the model is calculated.
  • the static calculation graph scenario can be understood as updating the calculation graph after all layer network structures of the model are calculated.
  • the main difference is the update timing of the calculation graph.
  • the model training method (also known as the calculation accuracy adjustment method) provided by the embodiment of the present application can be applied to the computation of the calculation graph in both scenarios.
  • the model training method of the embodiment of the present application will be introduced in detail below with reference to Figure 3.
  • the method may be executed by the training device, or may be executed by a component of the training device (for example, a processor, a chip, or a chip system, etc.).
  • the training device can be a cloud device or a terminal device.
  • it can also be a computer, server or other device with sufficient computing power to execute the model training method, or it can be a system composed of a cloud device and a terminal device.
  • the training method can be executed by the training device 120 in Figure 1 and the neural network processor 20 in Figure 2 .
  • the model training method can be processed by the CPU, or it can be processed by the CPU and GPU together, or it can not use the GPU but use other processors suitable for neural network calculations, which is not limited by this application.
  • FIG. 3 is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • the method may include steps 301 to 303 . Steps 301 to 303 will be described in detail below.
  • Step 301 Obtain training data.
  • the training device may obtain the training data by receiving data sent by other devices (for example, servers, business equipment, etc.), by selecting it from a database, or through user photography and other methods. There are no specific limitations here.
  • the training data in the embodiment of this application may include one or more of the following: images, speech, text, etc.
  • the training data is specifically related to the scenario in which the model is applied. For example: when the function of the model is audio recognition, the specific form of training data can be audio data, etc. Another example: when the function of the model is image classification, the specific form of training data can be image data, etc. Another example: when the role of the model is to predict speech, the specific form of training data can be text data, etc. It can be understood that the above situations are only examples and do not necessarily have a one-to-one correspondence.
  • the specific form of training data can also be image data or text data (for example, in an educational scenario of viewing pictures and playing the corresponding voice, where the role of the model is to recognize the voice corresponding to an image, the specific form of the training data can be image data).
  • the training data can be word vectors corresponding to movies, etc.
  • the above training data can also include data in different modalities at the same time.
  • the training data can include image/video data collected by the camera, and can also include voice/text data of instructions given by the user.
  • the specific form or type of training data is not limited in the embodiments of this application.
  • if the training of the model is supervised training, the training data obtained in this step is training data carrying labels. If the training of the model is unsupervised training, the training data obtained in this step is training data without labels.
  • Step 302 Use the training data as the input of the model, and use the first accuracy range to calculate parameters during the model training process to obtain calculated values.
  • after the training device obtains the training data, it uses the training data as input to the model, and uses the first accuracy range to calculate parameters to obtain calculated values.
  • This parameter is related to the loss function of the model.
  • the loss function is used to represent the difference between the output of the model and the label to which the training data belongs.
  • the loss function can be a custom function. For example, when the task of the model is a classification task, the loss function is used to represent the difference between the output of the model and the input (or clustering result, etc.). Or it can be understood that in unsupervised training, it is hoped that the output of the model can be restored to the input of the model.
  • the label is the training data itself (that is, the output obtained by the model is then sent to another network to restore the training data).
  • the loss function is not limited in the embodiments of the present application.
  • the loss function can also be understood as the optimization objective function of the model, and can be set according to actual needs.
  • the parameters in the embodiments of this application may include one or more of the following: parameters related to the loss function of the model, parameters related to the calculation of the model in the forward propagation process, and parameters related to the calculation of the model in the back propagation process.
  • the model includes multiple network structures.
  • the model in the embodiment of this application is specifically an artificial neural network.
  • the specific number of layers or structures included in the artificial neural network can be set according to actual needs and is not limited here.
  • the precision range in the embodiment of the present application may refer to the precision range of the data type (for example, int, float, etc.).
  • the accuracy range of FP is used as an example for illustrative description. Of course, in actual applications, it can also be other data types (for example, int, etc.).
  • the first precision range is a dynamic reachable range of FP16 precision.
  • FP16 and accuracy range can refer to the explanations in the aforementioned related terms, and will not be repeated here.
  • the accuracy ranges used by each layer of the network structure in the model may be the same or different.
  • the first accuracy range may be refined to the accuracy range of each layer in the model, or the same accuracy range may be used for all layers of the model. That is, the subsequent recalculation of parameters using the second precision range in place of the first precision range can be a precision range adjustment at layer granularity (that is, only the precision range of a specific layer is adjusted, and the precision range of non-specific layers is not adjusted; the specific layer may also be called an overflow layer), or it can be an adjustment of the accuracy range of the entire model (that is, all layers).
  • all layer network structures of the model adopt the same accuracy range, and the first accuracy range refers to the same accuracy range.
  • the embodiments of the present application can be applied to a single accuracy training scenario.
  • the accuracy ranges used by different layer network structures of the model are different; that is, the model uses mixed precision training.
  • the first precision range refers to low precision or high precision in the mixed precision.
  • the default high precision is less likely to cause overflow problems, and the first precision range can specifically refer to the low-precision range of mixed precision.
  • the embodiments of the present application can be applied to mixed precision training scenarios. That is, only the accuracy of specific layers will be adjusted in the future.
  • the parameters are parameters in the forward propagation process.
  • the parameters can be the values of intermediate features or loss functions calculated by multiple network structures during the forward propagation process.
  • the intermediate feature can also be understood as the activation value obtained by multiple network structures using activation functions.
  • the value of this loss function can also be understood as the difference between the model output and the label calculated after the forward propagation of all layer network structures.
  • the model includes a three-layer network structure, namely the input layer (including 3 neurons), the hidden layer (including 4 neurons), and the output layer (including 2 neurons).
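  • $w^{l}_{ij}$ denotes the weight from the j-th neuron in layer l-1 to the i-th neuron in layer l (for example, $w^{2}_{43}$ is the weight from the 3rd neuron in layer 1 to the 4th neuron in layer 2).
  • $b^{l}_{j}$ denotes the bias of the j-th neuron in the l-th layer, and $a^{l}_{j}$ is the activation value of the j-th neuron in layer l.
  • the process of forward propagation can be shown as Formula 1 and Formula 2, where Formula 2 is the matrix form of Formula 1: $a^{l}_{i} = \sigma\left(\sum_{j} w^{l}_{ij} a^{l-1}_{j} + b^{l}_{i}\right)$ (Formula 1) and $a^{l} = \sigma\left(W^{l} a^{l-1} + b^{l}\right)$ (Formula 2).
  • $\sigma$ is the activation function.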
  • the activation value can be calculated layer by layer using the above Formula 2, and finally the output f(x) of the model can be obtained based on the input X. The value of the loss function is then calculated based on the difference between the model's output f(x) and the label value Y of X.
  • the loss function can be mean squared error (MSE) loss, squared absolute error loss, cross entropy loss, hinge loss, etc.
  • the specific settings can be set according to actual needs, and there is no limit to the structure of the loss function.
  • the parameters may include intermediate features and/or the value of the loss function.
  • the activation $a^{l}$ obtained at each layer can be understood as the intermediate feature, and the difference between f(x) and Y is the value of the loss function.
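  • The forward propagation described above can be sketched for the 3-4-2 example network as follows. This is a minimal illustration, not the embodiment's implementation: the sigmoid activation, the MSE loss, the all-zero weights, and all names are assumptions chosen to keep the example easy to check by hand.

```python
import math


def sigmoid(z: float) -> float:
    """Assumed activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights, biases):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    a = x
    for W, b in zip(weights, biases):
        # One layer: weighted sum of the previous activations plus bias, then activation.
        a = [sigmoid(sum(w * a_prev for w, a_prev in zip(row, a)) + b_i)
             for row, b_i in zip(W, b)]
        activations.append(a)
    return activations


def mse(output, label):
    """Mean squared error between the model output and the label."""
    return sum((o - y) ** 2 for o, y in zip(output, label)) / len(output)


# 3 input neurons, 4 hidden neurons, 2 output neurons; all-zero weights and biases.
W1, b1 = [[0.0] * 3 for _ in range(4)], [0.0] * 4
W2, b2 = [[0.0] * 4 for _ in range(2)], [0.0] * 2
acts = forward([1.0, 2.0, 3.0], [W1, W2], [b1, b2])
loss = mse(acts[-1], [0.5, 0.5])
```

  • Here `acts[1]` holds the intermediate features of the hidden layer, and `loss` is the value of the loss function. Both would be candidates for the overflow check described in this application.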
  • the parameters are the parameters in the backpropagation process.
  • the parameters can be the gradients calculated by multiple network structures during the backpropagation process.
  • the gradient includes one or more of the following: the gradient of intermediate features, the gradient of the weights in the model, the gradient of the loss, etc.
  • the backpropagation process can be understood as the process of continuously adjusting the weight through the loss function value.
  • the backpropagation process may include: using an optimizer and a preset learning rate to minimize the loss function and find the best parameters (for example, weights). Specifically, the loss function and the chain rule can be used to take the partial derivatives with respect to the weights; these partial derivatives are the gradients.
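  • As a worked illustration of the chain rule mentioned above, the gradient of the loss with respect to a single weight and a plain gradient-descent update can be written as follows. The symbols are assumptions introduced here: $L$ is the loss, $w^{l}_{ij}$ a weight into the i-th neuron of layer l, $z^{l}_{i}$ that neuron's pre-activation value, $a^{l}_{i} = \sigma(z^{l}_{i})$ its activation, and $\eta$ the learning rate. Other optimizers would use a different update rule.

```latex
\frac{\partial L}{\partial w^{l}_{ij}}
  = \frac{\partial L}{\partial a^{l}_{i}}
    \cdot \frac{\partial a^{l}_{i}}{\partial z^{l}_{i}}
    \cdot \frac{\partial z^{l}_{i}}{\partial w^{l}_{ij}}
  = \frac{\partial L}{\partial a^{l}_{i}} \, \sigma'\!\left(z^{l}_{i}\right) a^{l-1}_{j},
\qquad
w^{l}_{ij} \leftarrow w^{l}_{ij} - \eta \, \frac{\partial L}{\partial w^{l}_{ij}}
```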
  • Step 303 If the calculated value exceeds the first accuracy range, recalculate the parameters using the second accuracy range, and use the recalculated parameters to train the model one or more iterations.
  • if the training device uses the first precision range to calculate the parameters in the above step 302 and the calculated values exceed the first precision range, the second precision range is used to recalculate the parameters, and the recalculated parameters are used to train the model for one or more iterations.
  • the second accuracy range includes the first accuracy range, or the second accuracy range partially overlaps with the first accuracy range.
  • without such an adjustment, the training data involved in the overflow would be discarded and training would stagnate.
  • the method provided by the embodiments of the present application can allow training to continue by adjusting the accuracy range in a timely manner, thereby improving the training efficiency of the model.
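  • Step 303 can be sketched as a range check followed by a recalculation. The sketch below only simulates the idea: Python floats are double precision, so the "ranges" are plain numeric bounds, the second bound is an illustrative wider range rather than an exact hardware limit, and all names are hypothetical.

```python
FIRST_RANGE = (-65504.0, 65504.0)  # assumed FP16-like first accuracy range
SECOND_RANGE = (-1.0e38, 1.0e38)   # illustrative wider second accuracy range


def in_range(value: float, value_range) -> bool:
    low, high = value_range
    return low <= value <= high


def compute_parameter(compute, first_range=FIRST_RANGE, second_range=SECOND_RANGE):
    """Calculate a parameter; fall back to the second range on overflow.

    Returns (value, name_of_range_used).
    """
    value = compute()
    if in_range(value, first_range):
        return value, "first"
    # The calculated value exceeds the first range: recalculate under the second range.
    value = compute()
    if not in_range(value, second_range):
        raise OverflowError("calculated value exceeds the second range as well")
    return value, "second"
```

  • For example, `compute_parameter(lambda: 300.0 * 300.0)` falls back to the second range because 90000 exceeds the assumed first range, so the training step can continue instead of being skipped.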
  • the second accuracy range includes the first accuracy range. Take the first precision range as FP16 as an example, and the second precision range as FP32 as an example.
  • the first accuracy range may be (-65504, 65504), and the second accuracy range may be $[-1.7 \times 10^{38}, 1.7 \times 10^{38}]$.
  • the second accuracy range partially overlaps with the first accuracy range.
  • take the first precision range as FP16 and the second precision range as int16 as an example.
  • the first precision range is (-65504, 65504), and the second precision range can be the integers from -32768 to 32767.
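  • The two relationships between the ranges can be checked numerically, as in the sketch below. It compares only the interval endpoints given in the examples and treats them as closed intervals; it says nothing about which values inside the interval each data type can actually represent. The function name and return strings are hypothetical.

```python
def range_relation(first, second) -> str:
    """Describe how the second accuracy range relates to the first one."""
    first_low, first_high = first
    second_low, second_high = second
    if second_low <= first_low and first_high <= second_high:
        return "second includes first"
    if second_low <= first_high and first_low <= second_high:
        return "ranges overlap"
    return "disjoint"


FP16_RANGE = (-65504.0, 65504.0)
FP32_RANGE = (-1.7e38, 1.7e38)     # endpoints as given in the example above
INT16_RANGE = (-32768.0, 32767.0)
```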
  • the first precision range and the second precision range can also be concepts of relatively high and low ranges. If the first precision range is a high precision range (hereinafter referred to as high precision), the second precision range is a low precision range (hereinafter referred to as low precision). If the first precision range is low precision, the second precision range is high precision. In general, overflow problems occur with low precision.
  • the first precision range is low precision and the second precision range is high precision as an example for illustrative explanation.
  • the adjustment from high precision to low precision, or the adjustment from low precision to high precision, can also be achieved through the method provided by the embodiment of the present application.
  • recalculating parameters using the second accuracy range may include multiple situations, and may include recalculating the parameters starting from the first layer network structure of the model using the second accuracy range. It is also possible to recalculate the parameter using the second precision range from the current network structure where the calculated value overflows. There are no specific restrictions here.
  • the above step of recalculating the parameters using the second precision range specifically includes: using the second precision range to calculate the intermediate features of the overflow layer, or to calculate intermediate features layer by layer starting from the first-layer network structure. The overflow layer is the network structure, among the multiple network structures, whose calculated intermediate-feature values overflow the first accuracy range.
  • the above step of recalculating the parameters using the second accuracy range specifically includes: using the second accuracy range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
  • take a model including a 5-layer network structure as an example. If the calculated value of the layer-4 parameter calculation using the first precision range exceeds the first precision range, the second precision range can be used to recalculate from layer 1 through layer 4. Alternatively, only the parameters of layer 4 can be recalculated using the second precision range.
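  • The two recalculation options in the 5-layer example can be sketched as follows. Each "layer" is reduced to a scalar function so that the control flow is easy to follow. The limits, names, and the `restart_from_first_layer` flag are assumptions for illustration only.

```python
FIRST_LIMIT = 65504.0  # assumed FP16-like magnitude limit of the first range
SECOND_LIMIT = 1.0e38  # illustrative wider limit of the second range


def forward_with_fallback(layers, x, restart_from_first_layer=False):
    """Run layers in order; on overflow of the first range, recalculate.

    Returns (output, precisions), where precisions[i] names the range used by layer i.
    """
    precisions = ["first"] * len(layers)
    inputs = [x]  # inputs[i] is the input of layer i
    i = 0
    while i < len(layers):
        limit = FIRST_LIMIT if precisions[i] == "first" else SECOND_LIMIT
        y = layers[i](inputs[i])
        if abs(y) > limit:
            if precisions[i] == "second":
                raise OverflowError(f"layer {i + 1} overflows the second range")
            if restart_from_first_layer:
                # Option 1: recalculate from layer 1 up to the overflow layer.
                for k in range(i + 1):
                    precisions[k] = "second"
                inputs = [x]
                i = 0
            else:
                # Option 2: recalculate only the overflow layer.
                precisions[i] = "second"
            continue
        inputs.append(y)
        i += 1
    return inputs[-1], precisions


def times_ten(v):
    return v * 10.0


def divide_by_hundred(v):
    return v / 100.0


# Layer 4 produces 100000, which exceeds the assumed first range.
five_layers = [times_ten, times_ten, times_ten, times_ten, divide_by_hundred]
```

  • With the input 10.0, only layer 4 is switched to the second range by default, while `restart_from_first_layer=True` switches layers 1 to 4. The final output is the same in both cases.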
  • when the parameter includes a gradient, the calculated value is the gradient divided by the scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow. The method also includes: using the first coefficient to update the scaling coefficient, where the updated scaling coefficient replaces the scaling coefficient before the update in the next training iteration of the model.
  • the first coefficient is a positive number less than 1.
  • the parameter is recalculated using a second accuracy range that is different from the first accuracy range, and the updated parameters are used to train the model one or more iterations.
  • the specific training process please refer to the description of the forward propagation and back propagation processes mentioned above.
  • the calculation of the loss function in forward propagation is performed by using the parameters calculated in the second precision range.
  • the model is trained one or more iterations with the goal that the value of the loss function is less than a certain threshold to obtain a trained model.
  • the second precision range recalculation may be a precision range adjustment at the layer granularity, or may be a precision range adjustment for the entire model (ie, all layers). That is, if the calculation accuracy of all layers is initialized to the same accuracy range, then the recalculation in this step is for all layers (or it can be understood as adjusting the calculation accuracy of all layers). If the calculation accuracy of all layers is initialized to different accuracy ranges (ie, mixed precision calculation), then the recalculation in this step is for a specific layer (or it can be understood as adjusting the calculation accuracy of a specific layer). The calculation accuracy of this particular layer exceeds the first accuracy range.
  • steps 301 to 303 in this embodiment can be executed one or more times, and steps 301 to 303 can be executed after each update one or more times during the training process of the model. It is also possible to perform steps 301 to 303 again when the preset period or the preset number of times is met.
  • the second precision range is used to recalculate the parameter. That is, automatic real-time adjustment of the accuracy range through the overflow information of parameter calculation values can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
  • the embodiment of the present application can use the overflow information of the parameters to adjust the applicable precision range of the parameters in real time and reduce overflow problems caused by low-precision floating point calculations.
  • if the calculated value of the parameter does not exceed the first precision range, the value of the loss function obtained during the forward propagation process is multiplied by the scaling coefficient, and the first precision range is used for gradient calculation in the back propagation process. If the value of the gradient divided by the scaling coefficient exceeds the first precision range, the first coefficient is used to update the scaling coefficient, and the second precision range is used for recalculation in the forward propagation process and the back propagation process. The first coefficient is a positive number less than 1, and the updated scaling coefficient replaces the pre-update scaling coefficient in the next training iteration of the model.
  • the lower limit of the scaling factor can be limited.
  • the minimum value of the scaling factor is a preset threshold greater than or equal to 1.
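  • The scaling-coefficient update with a lower limit can be sketched as follows. The default first coefficient (0.5), the default minimum (1.0), the FP16-like limit, and the function names are assumptions. The overflow check follows the wording above and tests the gradient divided by the scaling coefficient.

```python
def gradient_overflows(gradient: float, scale: float, limit: float = 65504.0) -> bool:
    """Check whether the gradient divided by the scaling coefficient exceeds the first range."""
    return abs(gradient / scale) > limit


def update_scale(scale: float, overflowed: bool,
                 first_coefficient: float = 0.5, min_scale: float = 1.0) -> float:
    """Shrink the scaling coefficient after an overflow, never below min_scale."""
    if not 0.0 < first_coefficient < 1.0:
        raise ValueError("first_coefficient must be a positive number less than 1")
    if overflowed:
        return max(scale * first_coefficient, min_scale)
    return scale
```

  • The returned value would replace the previous scaling coefficient in the next training iteration.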
  • although the model may have an initial accuracy range (for example, the first accuracy range) set during the first iteration, the initial accuracy range for each subsequent iteration can be adjusted based on the number of iterations and the number of overflows, in order to improve the efficiency of the model in subsequent iterations.
  • the number of overflows for multiple network structures in the model based on the first accuracy range is obtained, and N is a positive integer greater than or equal to 1. If the number of overflows is greater than or equal to the second threshold, it is determined that the initial accuracy range in the next iterative training process is changed from the first accuracy range to the second accuracy range, and the number of overflows is cleared.
  • the number of overflows includes: the number of overflows of multiple network structures in the forward propagation process, and/or the number of overflows of multiple network structures in the back propagation process.
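  • The overflow-count rule above can be sketched with a small counter. The class and attribute names are hypothetical. The threshold corresponds to the second threshold, and the counter is cleared once the initial accuracy range is switched.

```python
class OverflowCounter:
    """Track overflows and decide the initial accuracy range of the next iteration."""

    def __init__(self, threshold: int):
        self.threshold = threshold    # the second threshold
        self.count = 0                # number of overflows observed so far
        self.initial_range = "first"  # initial accuracy range for the next iteration

    def record(self, forward_overflows: int = 0, backward_overflows: int = 0) -> None:
        """Add overflows seen in forward and/or back propagation."""
        self.count += forward_overflows + backward_overflows

    def next_initial_range(self) -> str:
        """Switch to the second range and clear the count once the threshold is reached."""
        if self.count >= self.threshold:
            self.initial_range = "second"
            self.count = 0
        return self.initial_range
```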
  • the parameters in the embodiment shown in FIG. 3 have various situations.
  • the parameters are intermediate features. And the calculation of intermediate features overflows the first accuracy range.
  • the intermediate features can be recalculated using the second precision range.
  • the second precision range can also be used for subsequent loss calculations and/or weight gradient calculations.
  • the adjustment of accuracy calculation can be adjusted for the accuracy range of the overflow parameter, and can also be adjusted for the accuracy range of other parameter calculations in the subsequent training process. There is no specific limit here.
  • the parameter is loss. And the calculation of loss exceeds the first precision range.
  • the second precision range can be used to recalculate the loss starting from this calculation, or starting from one or more calculations that precede this overflowing calculation.
  • the second accuracy range can also be used for subsequent weight gradient calculations.
  • the adjustment of the calculation accuracy can be applied to the accuracy range of the overflowing parameters, to the accuracy range of parameters that were calculated before the overflowing parameters and did not overflow, or to the accuracy range of other parameters calculated later in the training process. This is not limited here.
  • Figure 5 is another model training method provided by an embodiment of the present application.
  • the execution body of this method is similar to the execution body of the embodiment shown in Figure 3 and will not be described again here.
  • the method may include steps 501 to 511. Steps 501 to 511 will be described in detail below.
  • Step 501 Obtain training data.
  • For step 501, reference may be made to the description of step 301 in the embodiment shown in FIG. 3, which will not be described again here.
  • Step 502 Determine whether the number of overflows (num) is greater than or equal to the second threshold (N). If yes, execute step 503. If not, execute step 508.
  • In this embodiment, the number of overflows is described, by way of example, as the number of times the calculated value of each layer of the model overflows the first precision range during the forward and reverse calculations of the model. It can be understood that in practical applications, the number of overflows may instead be the number of times that the intermediate features of each layer of the model overflow the first accuracy range in the forward calculation, or the number of times the gradients overflow the first accuracy range in the reverse calculation. That is, overflows can be counted over the forward and reverse processes together, or separately for the forward or the reverse process. The details are not limited here.
  • the forward calculation in the embodiment of this application refers to the calculation of parameters (for example, intermediate features, loss, etc.) of the model during the forward propagation process.
  • Backward calculation refers to the calculation of parameters (for example, gradients, etc.) of the model during the backpropagation process.
  • the training device determines whether the number of overflows is greater than or equal to the second threshold (N), where N is an integer greater than or equal to 0. If yes, execute step 503. If not, execute step 508.
  • Step 503 Use the first accuracy range to perform forward calculation to obtain the loss (loss).
  • If the number of overflows in the aforementioned step 502 is greater than or equal to the second threshold, execution of this step is triggered.
  • the calculation of loss in this step 503 may refer to the description of step 302 in the embodiment shown in FIG. 3 , and will not be described again here.
  • Step 504 Determine whether the loss overflows. If yes, execute step 509. If not, multiply the loss by the scaling factor and perform step 505.
  • After the training device obtains the loss, it determines whether the loss has overflowed. If yes, execute step 509. If not, multiply the loss by the scaling factor and perform step 505.
  • Step 505 Use the first accuracy range to perform reverse calculation to obtain the weight gradient.
  • If the loss in the aforementioned step 504 does not overflow the first accuracy range, the loss is multiplied by the scaling factor and execution of this step is triggered.
  • Multiplying by the scaling coefficient before step 505 is performed prevents underflow of the calculated values.
  • the description of calculating the weight gradient in this step 505 may refer to the description of step 302 in the embodiment shown in FIG. 3 , and will not be described again here.
  • Step 506 Determine whether the value of the weight gradient divided by the scaling coefficient overflows. If yes, execute step 510. If not, execute step 507.
  • After the training device obtains the weight gradient, it determines whether the value of the weight gradient divided by the scaling coefficient overflows the first accuracy range. If yes, execute step 510. If not, execute step 507.
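The check in steps 504 to 506 resembles conventional loss scaling. The following is a minimal sketch in plain Python, assuming the first accuracy range is FP16 (largest finite value 65504) and that an overflowed value appears as infinity or NaN. The names `FIRST_RANGE_MAX`, `overflows_first_range`, and `unscale_and_check` are hypothetical and do not come from this application.

```python
import math

FIRST_RANGE_MAX = 65504.0  # assumption: the first accuracy range is FP16


def overflows_first_range(value: float) -> bool:
    """Return True if value is NaN or lies outside the assumed first range."""
    return math.isnan(value) or abs(value) > FIRST_RANGE_MAX


def unscale_and_check(scaled_grads, scale):
    """Divide gradients of the scaled loss by the scaling coefficient and
    report whether any resulting value overflows the first range (step 506)."""
    unscaled = [g / scale for g in scaled_grads]
    return unscaled, any(overflows_first_range(g) for g in unscaled)


grads, overflowed = unscale_and_check([1024.0, -2048.0], scale=1024.0)
print(grads, overflowed)
```

If `overflowed` is true, the flow described above would go on to update the scaling coefficient rather than the weights.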
  • Step 507 update weights.
  • If the value of the weight gradient divided by the scaling coefficient in step 506 does not overflow the first accuracy range, and/or after step 508, execution of this step is triggered.
  • Step 508 Use the second accuracy range to perform forward and backward recalculation.
  • If the number of overflows in the aforementioned step 502 is less than the second threshold, and/or after the aforementioned step 509, execution of this step is triggered.
  • For step 508, reference may be made to the description of step 303 in the embodiment shown in FIG. 3, which will not be described again here.
  • Step 509 The cumulative number of overflows is increased by 1 (that is, num+1).
  • If the loss overflows the first precision range in the aforementioned step 504, and/or the value of the weight gradient divided by the scaling coefficient overflows the first precision range, execution of this step is triggered.
  • The number of overflows is recorded so that, in subsequent iterations, the comparison between the number of overflows and the second threshold determines whether to modify the initial accuracy range of each iteration. For example, if the number of overflows is greater than the second threshold, the initial precision range is adjusted from the first precision range to the second precision range.
  • The number of iterations can also be considered when adjusting the initial accuracy range. For example, when the number of iterations reaches 1000 and the number of overflows is greater than 800 (that is, the second threshold is 800), setting the initial accuracy range to the first accuracy range has affected the training of the model. For the accuracy of subsequent model training, the initial accuracy range is adjusted to the second accuracy range, to reduce problems such as training stagnation caused by parameters overflowing the first accuracy range.
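The bookkeeping described for step 509 can be sketched as a small counter object. This is an illustrative sketch only: the class name `PrecisionPolicy`, its field names, and the choice to clear the counter when the range is switched are assumptions, and the default values 1000 and 800 are simply the example figures from the text.

```python
from dataclasses import dataclass


@dataclass
class PrecisionPolicy:
    second_threshold: int = 800    # example value from the text
    check_after_iters: int = 1000  # example value from the text
    initial_range: str = "first"
    iterations: int = 0
    overflow_count: int = 0        # "num" in step 509

    def record_iteration(self, overflowed: bool) -> str:
        """Count one iteration and return the initial range for the next one."""
        self.iterations += 1
        if overflowed:
            self.overflow_count += 1
        if (self.iterations >= self.check_after_iters
                and self.overflow_count > self.second_threshold):
            self.initial_range = "second"
            self.overflow_count = 0  # assumption: counter is cleared on switch
        return self.initial_range


policy = PrecisionPolicy(second_threshold=2, check_after_iters=4)
history = [policy.record_iteration(o) for o in (True, True, True, False)]
print(history)
```

With the small thresholds used in the usage lines, the policy keeps the first range for three iterations and switches to the second range on the fourth.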
  • Step 510 Use the first coefficient to update the scaling coefficient.
  • If the value of the weight gradient divided by the scaling coefficient overflows the first accuracy range, execution of this step is triggered and the first coefficient is used to update the scaling coefficient.
  • the first coefficient is a positive number less than 1.
  • the scaling coefficient is multiplied by a positive number less than 1 for adjustment.
  • Step 511 Determine that the scaling coefficient is greater than or equal to a preset threshold.
  • After the training device determines that the value of the weight gradient divided by the scaling coefficient overflows the first accuracy range and updates the scaling coefficient using the first coefficient, it can judge whether the scaling coefficient is smaller than the preset threshold. If the scaling coefficient is less than the preset threshold, the scaling coefficient is adjusted to the preset threshold. If the scaling coefficient is greater than or equal to the preset threshold, the scaling coefficient is not modified and step 509 is performed.
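Steps 510 and 511 together amount to shrinking the scaling coefficient and then clamping it at a lower bound. A minimal sketch follows; the function name and the default values 0.5 and 1.0 are assumptions chosen for illustration (the text only requires a first coefficient between 0 and 1 and a preset threshold).

```python
def update_scaling_coefficient(scale, first_coefficient=0.5, preset_threshold=1.0):
    """Multiply the scaling coefficient by a factor in (0, 1) (step 510),
    then keep it at or above the preset threshold (step 511)."""
    if not 0.0 < first_coefficient < 1.0:
        raise ValueError("first coefficient must be a positive number less than 1")
    return max(scale * first_coefficient, preset_threshold)


print(update_scaling_coefficient(1024.0))  # halved
print(update_scaling_coefficient(1.5))     # clamped to the preset threshold
```

Clamping keeps repeated overflows from driving the scaling coefficient toward zero, which is the underflow risk the lower limit is meant to reduce.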
  • steps 501 to 511 in this embodiment can be executed multiple times. They can be executed after every one or more updates during the training process of the model, or executed again when a preset period or a preset number of times is reached.
  • automatic real-time adjustment of the accuracy range through the overflow information generated during forward calculation and/or reverse calculation can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
  • the embodiment of the present application can use the overflow information of the parameters to adjust the precision range applicable to the parameters in real time, and reduce overflow problems caused by low-precision floating point calculations.
  • by recording the number of overflows, the initial accuracy range can be adjusted: when the first accuracy range affects model training, the initial accuracy range is adjusted from the first accuracy range to the second accuracy range for the accuracy of subsequent model training.
  • in the reverse calculation process, setting a lower limit on the scaling coefficient reduces the risk of subsequent parameter accuracy underflow.
  • the forward calculation can be rerun using the second precision range, and the backward calculation can be done using the first precision range.
  • the second precision range can be used to re-perform the forward calculation and subsequent reverse calculation. That is, the adjustment accuracy may be adjusted only for overflow calculations, or the accuracy may be adjusted for all calculations. The details are not limited here.
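The overflow-then-recompute idea can be illustrated on a one-weight toy model. This sketch makes several assumptions that are not in the application: the first range is FP16 and the second is FP32, only the representable range (not rounding) is emulated with ordinary Python floats, and all names (`RangeOverflow`, `train_step`, `train_step_with_fallback`) are hypothetical.

```python
import math

FIRST_RANGE_MAX = 65504.0       # assumption: FP16
SECOND_RANGE_MAX = 3.4028235e38  # assumption: FP32 (approximate maximum)


class RangeOverflow(Exception):
    """Raised when a calculated value leaves the active range."""


def check(value, range_max):
    if math.isnan(value) or abs(value) > range_max:
        raise RangeOverflow(value)
    return value


def train_step(w, x, target, lr, range_max):
    """Forward, loss, backward, and weight update for y = w * x."""
    y = check(w * x, range_max)                      # intermediate feature
    loss = check((y - target) ** 2, range_max)       # loss
    grad = check(2.0 * (y - target) * x, range_max)  # weight gradient
    return w - lr * grad, loss


def train_step_with_fallback(w, x, target, lr):
    """Try the first range; on overflow, redo the whole step in the second."""
    try:
        return train_step(w, x, target, lr, FIRST_RANGE_MAX) + ("first",)
    except RangeOverflow:
        return train_step(w, x, target, lr, SECOND_RANGE_MAX) + ("second",)


print(train_step_with_fallback(1.0, 2.0, 1.0, 0.1)[2])
print(train_step_with_fallback(1.0, 300.0, 0.0, 1e-6)[2])
```

This sketch always reruns both the forward and reverse calculations in the second range; rerunning only the overflowing calculation, as the text also allows, would need finer-grained bookkeeping.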
  • Figure 6 is another model training method provided by an embodiment of the present application.
  • the execution body of this method is similar to the execution body of the embodiment shown in Figure 3 and will not be described again here.
  • the method may include steps 601 to 615. Steps 601 to 615 will be described in detail below.
  • Step 601 Obtain training data.
  • Step 602 Determine whether the number of overflows (num) is greater than or equal to the second threshold (N). If yes, execute step 603. If not, execute step 612.
  • For step 601 and step 602, reference may be made to the description of step 501 and step 502 in the embodiment shown in FIG. 5, which will not be described again here.
  • Step 603 Use the first accuracy range to perform forward calculation to obtain intermediate features.
  • If the number of overflows in the aforementioned step 602 is greater than or equal to the second threshold, execution of this step is triggered.
  • Step 604 Determine whether the intermediate features overflow. If yes, execute step 605. If not, execute step 607.
  • After the training device obtains the intermediate features, it determines whether the calculated value of the intermediate features overflows. If yes, execute step 605. If not, execute step 607.
  • Step 605 The cumulative number of overflows is increased by 1 (that is, num+1).
  • If the value of the intermediate features in step 604, the loss in step 608, and/or the weight gradient divided by the scaling coefficient overflows the first accuracy range, execution of this step is triggered.
  • Step 606 Use the second accuracy range to calculate intermediate features of the overflow layer or calculate intermediate features layer by layer starting from the first layer.
  • This step can be understood as: if the intermediate feature overflows the first precision range, the second precision range can be used to perform calculations on the overflow layer or all layers.
  • Step 607 Use the first accuracy range to perform forward calculation to obtain the loss (loss).
  • If the intermediate features in the aforementioned step 604 do not overflow, execution of this step is triggered.
  • Step 608 Determine whether loss overflows. If yes, execute step 613. If not, multiply the scaling factor and perform step 609.
  • Step 609 Use the first accuracy range to perform reverse calculation to obtain the weight gradient.
  • Step 610 Determine whether the value of the weight gradient divided by the scaling coefficient overflows. If yes, execute step 614. If not, execute step 611.
  • Step 611 update the weight.
  • Step 612 Use the second accuracy range to perform forward and backward recalculation.
  • Step 613 The cumulative number of overflows is increased by 1 (that is, num+1).
  • Step 614 Use the first coefficient to update the scaling coefficient.
  • Step 615 Determine that the scaling coefficient is greater than or equal to the preset threshold.
  • For steps 607 to 615, reference may be made to the description of steps 503 to 511 in the embodiment shown in FIG. 5, which will not be described again here.
  • when the calculation of intermediate features overflows the first precision range, the second precision range can be used to calculate the intermediate features of the overflow layer, or the intermediate features can be calculated layer by layer starting from the first layer network structure.
  • the adjustment of the accuracy range can be an adjustment of a specific layer or an adjustment of all layers of the entire network structure.
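The two recomputation options for intermediate features (overflow layer only, or layer by layer from the first layer) can be sketched as follows. The sketch assumes the first range is FP16, models each layer as a plain Python callable on a single float, and only tracks which range each layer ended up using; the function and variable names are hypothetical.

```python
import math

FIRST_RANGE_MAX = 65504.0  # assumption: the first accuracy range is FP16


def exceeds_first_range(value):
    return math.isnan(value) or abs(value) > FIRST_RANGE_MAX


def forward_with_fallback(layers, x, restart_from_first=False):
    """Run layers in order; on overflow, widen the range for the overflow
    layer only, or for all layers with a restart from the first layer."""
    ranges = ["first"] * len(layers)
    overflow_count = 0
    i, h = 0, x
    while i < len(layers):
        out = layers[i](h)
        if ranges[i] == "first" and exceeds_first_range(out):
            overflow_count += 1
            if restart_from_first:
                ranges = ["second"] * len(layers)
                i, h = 0, x
            else:
                ranges[i] = "second"
            continue  # recompute this layer (or the first) with the wider range
        h = out
        i += 1
    return h, ranges, overflow_count


layers = [lambda v: v * 100.0, lambda v: v * 1000.0, lambda v: v / 10.0]
print(forward_with_fallback(layers, 1.0))
print(forward_with_fallback(layers, 1.0, restart_from_first=True))
```

Recomputing only the overflow layer touches less of the network, while restarting from the first layer keeps every layer in the same range; the text leaves the choice open.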
  • automatic real-time adjustment of the accuracy range through the overflow information generated during forward calculation and/or reverse calculation can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
  • the embodiment of the present application can use the overflow information of the parameters to adjust the precision range applicable to the parameters in real time, and reduce overflow problems caused by low-precision floating point calculations.
  • by recording the number of overflows, the initial accuracy range can be adjusted: when the first accuracy range affects model training, the initial accuracy range is adjusted from the first accuracy range to the second accuracy range for the accuracy of subsequent model training.
  • in the reverse calculation process, setting a lower limit on the scaling coefficient reduces the risk of subsequent parameter accuracy underflow.
  • Figure 7 is another model training method provided by an embodiment of the present application.
  • the execution body of this method is similar to the execution body of the embodiment shown in Figure 3 and will not be described again here.
  • the method may include steps 701 to 717. Steps 701 to 717 will be described in detail below.
  • Step 701 Obtain training data.
  • Step 702 Determine whether the number of overflows (num) is greater than or equal to the second threshold (N). If yes, execute step 703. If not, execute step 712.
  • Step 703 Use the first accuracy range to perform forward calculation to obtain intermediate features.
  • Step 704 Determine whether the intermediate features overflow. If yes, execute step 705. If not, execute step 707.
  • Step 705 The cumulative number of overflows is increased by 1 (that is, num+1).
  • Step 706 Use the second precision range to calculate intermediate features of the overflow layer or calculate intermediate features layer by layer starting from the first layer.
  • Step 707 Use the first accuracy range to perform forward calculation to obtain the loss (loss).
  • Step 708 determine whether loss overflows. If yes, execute step 716. If not, multiply the scaling factor and perform step 709.
  • Step 709 Use the first accuracy range to perform reverse calculation to obtain the weight gradient.
  • Step 710 Determine whether the value of the weight gradient divided by the scaling coefficient overflows. If yes, execute step 714. If not, execute step 711.
  • Step 711 update the weight.
  • Step 712 Use the second accuracy range to perform forward and backward recalculation.
  • Step 713 The cumulative number of overflows is increased by 1 (that is, num+1).
  • Step 714 Use the first coefficient to update the scaling coefficient.
  • Step 715 Determine that the scaling coefficient is greater than or equal to the preset threshold.
  • For steps 701 to 715, reference may be made to the description of steps 601 to 615 in the embodiment shown in FIG. 6, which will not be described again here.
  • Step 716 The cumulative number of overflows is increased by 1 (that is, num+1).
  • This step 716 is similar to the aforementioned step 705.
  • the number of overflows accumulated is the number of times that the intermediate features, loss, and weight gradients overflow the first accuracy range during the model training process.
  • Step 717 Recalculate the loss using the second accuracy range, or calculate layer by layer starting from the first layer until the loss is obtained. Then multiply the loss by the scaling factor and execute step 709.
  • This step can be understood as follows: if the loss overflows the first accuracy range, the second accuracy range can be used either to redo only the final calculation that produces the loss, or to redo the calculations of all layers of the model.
  • when the calculation of the loss overflows the first precision range, the loss can be recalculated using the second precision range, or calculated layer by layer starting from the first layer network structure until the loss is obtained.
  • the adjustment of the accuracy range can be an adjustment of a specific layer or an adjustment of all layers of the entire network structure.
  • automatic real-time adjustment of the accuracy range through the overflow information generated during forward calculation and/or reverse calculation can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
  • compared with prior-art methods that use the type of each network layer in the model to determine whether to use high-precision or low-precision floating point numbers, the embodiment of the present application can adjust the precision range applicable to the parameters in real time through the overflow information of the parameters, and reduce overflow problems caused by low-precision floating point number calculations.
  • by recording the number of overflows, the initial accuracy range can be adjusted: when the first accuracy range affects model training, the initial accuracy range is adjusted from the first accuracy range to the second accuracy range for the accuracy of subsequent model training.
  • in the reverse calculation process, setting a lower limit on the scaling coefficient reduces the risk of subsequent parameter accuracy underflow.
  • One embodiment of the training device in the embodiment of the present application includes:
  • an acquisition unit 801, configured to obtain training data;
  • a calculation unit 802, configured to use the training data as input to the model, and use the first accuracy range to calculate parameters during the model training process to obtain calculated values;
  • the calculation unit 802 is also configured to recalculate the parameters using the second accuracy range if the calculated value overflows the first accuracy range, and use the recalculated parameters to train the model for one or more iterations, where the second accuracy range includes the first accuracy range, or the second accuracy range partially overlaps with the first accuracy range.
  • the model includes multiple network structures, and the calculation unit 802 is specifically configured to use the second accuracy range to recalculate parameters starting from the first layer network structure of the model.
  • the calculation unit 802 is specifically configured to use the second precision range to recalculate the parameters starting from the current network structure whose calculated value overflowed.
  • the model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures during the forward propagation process, or the value of the loss function of the model, where an intermediate feature is the output feature of any one of the multiple network structures; and the gradients calculated by the multiple network structures during the backpropagation process, where the gradients include the gradient of intermediate features and/or the weight gradient of the model.
  • when the parameters include the gradients, the calculated value is the gradient divided by the scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow; the calculation unit 802 is also used to update the scaling coefficient using the first coefficient.
  • the updated scaling coefficient is used to replace the pre-updated scaling coefficient for the next iteration of the model training.
  • the first coefficient is a positive number less than 1; the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.
  • the calculation unit 802 is specifically configured to use the second accuracy range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first layer network structure, where the overflow layer is the network structure, among the multiple network structures, whose calculated intermediate features overflow the first accuracy range.
  • the calculation unit 802 is specifically configured to use the second accuracy range to calculate the value of the loss function or calculate layer by layer starting from the first layer network structure until the value of the loss function is obtained.
  • the calculation unit 802 is specifically configured to obtain, in the Nth iteration of the model training process, the number of overflows of the multiple network structures in the model based on the first accuracy range, where N is a positive integer greater than or equal to 1; the calculation unit 802 is specifically configured to, if the number of overflows is greater than or equal to the second threshold, determine that the initial accuracy range in the next iterative training process is changed from the first accuracy range to the second accuracy range, and clear the number of overflows.
  • the number of overflows includes: the number of overflows of multiple network structures in the forward propagation process, and/or the number of overflows of multiple network structures in the back propagation process.
  • the operations performed by each unit in the training device are similar to those described in the aforementioned embodiments shown in FIGS. 1 to 7, and will not be described again here.
  • when the calculated value of the parameter overflows the first precision range during the model training process, the calculation unit 802 recalculates the parameter using the second precision range. That is, automatic real-time adjustment of the accuracy range through the overflow information of parameter calculation values can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
  • the embodiment of the present application can use the overflow information of the parameters to adjust the precision range applicable to the parameters in real time, and reduce overflow problems caused by low-precision floating point calculations.
  • the training device may include a processor 901, a memory 902, and a communication port 903.
  • the processor 901, memory 902 and communication port 903 are interconnected through lines.
  • the memory 902 stores program instructions and data.
  • the memory 902 stores program instructions and data corresponding to the steps executed by the training equipment in the corresponding embodiments shown in FIGS. 1 to 7 .
  • the processor 901 is configured to perform the steps performed by the training device shown in any of the embodiments shown in FIGS. 1 to 7 .
  • the communication port 903 can be used to receive and send data, and to perform steps related to acquisition and reception in any of the embodiments shown in FIGS. 1 to 7 .
  • the training device may include more or fewer components than shown in Figure 9, which is merely an illustrative description in this application and is not limiting.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in various embodiments of this application.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

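This application discloses a model training method. The method is applicable to both dynamic computational graph scenarios and static computational graph scenarios. The method includes: obtaining training data; using the training data as input to a model, and calculating parameters using a first precision range during model training to obtain calculated values; and, if a calculated value overflows the first precision range, recalculating the parameters using a second precision range and using the recalculated parameters to train the model for one or more iterations, where the second precision range includes the first precision range, or the second precision range partially overlaps the first precision range. When the calculated value of a parameter overflows the precision range, the precision range used during model training is adjusted in real time. This can effectively resolve the training stagnation caused by overflow in low-precision training. In addition, the method does not rely on manual experience to customize an initialization scheme, and can automatically adjust the training precision layer by layer in real time.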

Description

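A model training method and related device
This application claims priority to Chinese Patent Application No. 202210832123.6, filed with the Chinese Patent Office on July 15, 2022 and entitled "A model training method and related device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a model training method and a related device.
Background
Artificial intelligence (AI) is the theory, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. A neural network is an artificially built dynamic system with a directed graph as its topology. It processes information by responding, as a state, to continuous or intermittent inputs, and is an information processing system intended to imitate the structure and functions of the human brain. After decades of development, artificial neural networks have been widely used in pattern recognition, automatic control, signal processing, decision support, artificial intelligence, scientific computing, and many other fields, with broad success. In image processing, audio and video processing, natural language processing, and many other fields in particular, artificial neural networks are developing vigorously and playing an irreplaceable role.
Currently, in most neural networks, the precision of the data format used for parameter storage or parameter calculation during training is mainly set based on manual experience. For example, the person configuring the network decides, based on experience, whether each network layer uses 16-bit half-precision floating point (FP16) or 32-bit single-precision floating point (FP32).
However, setting the precision applicable to the network layer structures based on manual experience relies excessively on the expertise of the person doing the configuration.
Summary
This application provides a model training method and a related device. When the calculated value of a parameter overflows the precision range, the precision range used during model training is adjusted in real time. This can effectively resolve the training stagnation caused by overflow in low-precision training.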
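A first aspect of the embodiments of this application provides a model training method. The method is applicable to both dynamic computational graph scenarios and static computational graph scenarios. A dynamic computational graph scenario can be understood as one in which the computational graph is updated after each layer of the model's network structure is computed. A static computational graph scenario can be understood as one in which the computational graph is updated after all layers of the model's network structure are computed. The difference lies mainly in when the computational graph is updated. The method provided in the embodiments of this application is applicable to the computation of the computational graph. The method may be performed by a training device, or by a component of a training device (for example, a processor, a chip, or a chip system). The method includes: obtaining training data; using the training data as input to a model, and calculating parameters using a first precision range during model training to obtain calculated values; and, if a calculated value overflows the first precision range, recalculating the parameters using a second precision range and using the recalculated parameters to train the model for one or more iterations, where the second precision range includes the first precision range, or the second precision range partially overlaps the first precision range.
In the embodiments of this application, when the calculated value of a parameter overflows the first precision range during model training, the parameter is recalculated using the second precision range. That is, the precision range is adjusted automatically and in real time based on the overflow information of the calculated parameter values, which not only reduces the memory occupied by training the model but also improves the training efficiency of the model. This in turn reduces problems such as training stagnation or training failure caused by parameters overflowing the first precision range. In addition, compared with prior-art methods that use the type of each network layer in the model to determine whether to use high-precision or low-precision floating-point numbers, the embodiments of this application can adjust the precision range applicable to the parameters in real time based on their overflow information, and reduce overflow problems caused by low-precision floating-point calculation.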
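Optionally, in a possible implementation of the first aspect, the model includes multiple network structures, and the step of recalculating the parameters using the second precision range includes: recalculating the parameters using the second precision range starting from the first-layer network structure of the model.
In this possible implementation, if the calculated value of the current-layer network structure overflows, the new precision range can be used to recalculate starting from the first-layer network structure of the model, to reduce calculation errors caused by overflow of the calculated value.
Optionally, in a possible implementation of the first aspect, the model includes multiple network structures, and the step of recalculating the parameters using the second precision range includes: recalculating the parameters using the second precision range starting from the current network structure whose calculated value overflowed.
In this possible implementation, if the calculated value of the current-layer network structure overflows, the new precision range can be used to recalculate the current-layer network structure, to reduce calculation errors caused by overflow of the calculated value.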
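Optionally, in a possible implementation of the first aspect, the parameters are related to the loss function of the model, or to the calculation of the model during forward propagation, or to the calculation of the model during backpropagation.
Optionally, in a possible implementation of the first aspect, the model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures during forward propagation, or the value of the loss function of the model, where an intermediate feature is the output feature of any one of the multiple network structures; and gradients calculated by the multiple network structures during backpropagation, where the gradients include the gradient of intermediate features and/or the weight gradient of the model.
In this possible implementation, the parameters may be parameters that need to be calculated during forward propagation or backpropagation in training, parameters output by individual layers of the model, parameters obtained after all layers of the whole model have been computed, and so on. This broadens the scenarios in which the method applies during training. In other words, any calculation involved in model training can use the method provided in the embodiments of this application for precision adjustment.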
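Optionally, in a possible implementation of the first aspect, when the parameters include a gradient, the calculated value is the gradient divided by a scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow. The method further includes: updating the scaling coefficient using a first coefficient, where the updated scaling coefficient replaces the scaling coefficient before the update for the next training iteration of the model, and the first coefficient is a positive number less than 1.
In this possible implementation, if the parameter calculated during backpropagation (that is, the gradient divided by the scaling coefficient) overflows, the weight gradient is recalculated using another precision range and the scaling coefficient is updated, to reduce overflow of calculated values in subsequent training iterations.
Optionally, in a possible implementation of the first aspect, the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1, and the preset threshold is used to reduce the probability of gradient overflow.
In this possible implementation, setting a lower limit on the scaling coefficient during backward calculation reduces the risk of subsequent parameter precision underflow.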
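Optionally, in a possible implementation of the first aspect, when the parameters include intermediate features calculated during forward propagation, the step of recalculating the parameters using the second precision range includes: using the second precision range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure, where the overflow layer is the network structure, among the multiple network structures, whose calculated intermediate features overflow the first precision range.
In this possible implementation, if the intermediate features overflow the first precision range, the second precision range can be used to recalculate the overflow layer, or to restart from the first-layer network structure. That is, the precision range of some layers or of all layers can be modified, which makes the solution flexible.
Optionally, in a possible implementation of the first aspect, when the parameters include the value of the loss function of the model, the step of recalculating the parameters using the second precision range includes: using the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
In this possible implementation, if the value of the loss function overflows the first precision range, the second precision range can be used to recalculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained. That is, the precision range of some layers or of all layers can be modified, which makes the solution flexible.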
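Optionally, in a possible implementation of the first aspect, the step of training the model using the recalculated parameters includes: in the Nth iteration of model training, obtaining the number of overflows of the multiple network structures in the model based on the first precision range, where N is a positive integer greater than or equal to 1; and, if the number of overflows is greater than or equal to a second threshold, determining that the initial precision range in the next training iteration is changed from the first precision range to the second precision range, and clearing the number of overflows.
In this possible implementation, the initial precision range is adjusted by recording the number of overflows. When the first precision range affects model training, the initial precision range is adjusted from the first precision range to the second precision range for the accuracy of subsequent model training.
Optionally, in a possible implementation of the first aspect, the number of overflows includes: the number of overflows of the multiple network structures during forward propagation, and/or the number of overflows of the multiple network structures during backpropagation.
In this possible implementation, the condition for adjusting the precision range (that is, how the number of overflows is determined) may be the number of overflows over the whole training process, or the number of overflows in forward propagation or in backpropagation. This broadens the applicability of the method.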
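Optionally, in a possible implementation of the first aspect, the parameters are related to the loss function. The loss function may differ depending on how the model is trained. For supervised training, the loss function represents the difference between the model's output and the label of the training data. For unsupervised training, the loss function may be a custom function. For example, when the model's task is a classification task, the loss function represents the difference between the model's output and its input (or a clustering result, and so on). Alternatively, in unsupervised training, the aim is for the model's output to be able to reconstruct the model's input; for example, the label is the training data itself (that is, the output obtained by the model is fed to another network to reconstruct the training data). It can be understood that the embodiments of this application do not limit the loss function. The loss function can also be understood as the optimization objective function of the model, and can be set according to actual needs.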
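A second aspect of the embodiments of this application provides a training device. The training device is applicable to both dynamic computational graph scenarios and static computational graph scenarios. The training device includes: an acquisition unit, configured to obtain training data; and a calculation unit, configured to use the training data as input to a model and calculate parameters using a first precision range during model training to obtain calculated values. The calculation unit is further configured to, if a calculated value overflows the first precision range, recalculate the parameters using a second precision range and use the recalculated parameters to train the model for one or more iterations, where the second precision range includes the first precision range, or the second precision range partially overlaps the first precision range.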
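Optionally, in a possible implementation of the second aspect, the model includes multiple network structures, and the calculation unit is specifically configured to recalculate the parameters using the second precision range starting from the first-layer network structure of the model.
Optionally, in a possible implementation of the second aspect, the calculation unit is specifically configured to recalculate the parameters using the second precision range starting from the current network structure whose calculated value overflowed.
Optionally, in a possible implementation of the second aspect, the model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures during forward propagation, or the value of the loss function of the model, where an intermediate feature is the output feature of any one of the multiple network structures; and gradients calculated by the multiple network structures during backpropagation, where the gradients include the gradient of intermediate features and/or the weight gradient of the model.
Optionally, in a possible implementation of the second aspect, when the parameters include a gradient calculated during backpropagation, the calculated value is the gradient divided by a scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow. The calculation unit is further configured to update the scaling coefficient using a first coefficient, where the updated scaling coefficient replaces the scaling coefficient before the update for the next training iteration of the model, the first coefficient is a positive number less than 1, and the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.
Optionally, in a possible implementation of the second aspect, when the parameters include intermediate features calculated during forward propagation, the calculation unit is specifically configured to use the second precision range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure, where the overflow layer is the network structure, among the multiple network structures, whose calculated intermediate features overflow the first precision range.
Optionally, in a possible implementation of the second aspect, when the parameters include the value of the loss function of the model, the calculation unit is specifically configured to use the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
Optionally, in a possible implementation of the second aspect, the calculation unit is specifically configured to obtain, in the Nth iteration of model training, the number of overflows of the multiple network structures in the model based on the first precision range, where N is a positive integer greater than or equal to 1. The calculation unit is specifically configured to, if the number of overflows is greater than or equal to a second threshold, determine that the initial precision range in the next training iteration is changed from the first precision range to the second precision range, and clear the number of overflows.
Optionally, in a possible implementation of the second aspect, the number of overflows includes: the number of overflows of the multiple network structures during forward propagation, and/or the number of overflows of the multiple network structures during backpropagation.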
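A third aspect of this application provides a training device, including a processor coupled to a memory. The memory is configured to store a program or instructions. When the program or instructions are executed by the processor, the training device implements the method in the first aspect or any possible implementation of the first aspect.
A fourth aspect of this application provides a computer-readable medium storing a computer program or instructions. When the computer program or instructions are run on a computer, the computer performs the method in the first aspect or any possible implementation of the first aspect.
A fifth aspect of this application provides a computer program product. When the computer program product is executed on a computer, the computer performs the method in the first aspect or any possible implementation of the first aspect.
For the technical effects of the second, third, fourth, and fifth aspects, or any possible implementation thereof, refer to the technical effects of the first aspect or the different possible implementations of the first aspect. Details are not described again here.
As can be seen from the above technical solutions, this application has the following advantages: when the calculated value of a parameter overflows the first precision range during model training, the parameter is recalculated using the second precision range. That is, the precision range is adjusted automatically and in real time based on the overflow information of the calculated parameter values, which not only reduces the memory occupied by training the model but also improves the training efficiency of the model. This in turn reduces problems such as training stagnation or training failure caused by parameters overflowing the first precision range. In addition, compared with prior-art methods that use the type of each network layer in the model to determine whether to use high-precision or low-precision floating-point numbers, the embodiments of this application can adjust the precision range applicable to the parameters in real time based on their overflow information, and reduce overflow problems caused by low-precision floating-point calculation.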
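Brief Description of Drawings
FIG. 1 is a schematic structural diagram of a system architecture according to an embodiment of this application;
FIG. 2 is a schematic diagram of a chip hardware structure according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a model training method according to an embodiment of this application;
FIG. 4 is an example structural diagram of a model according to an embodiment of this application;
FIG. 5 is another schematic flowchart of a model training method according to an embodiment of this application;
FIG. 6 is another schematic flowchart of a model training method according to an embodiment of this application;
FIG. 7 is another schematic flowchart of a model training method according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a training device according to an embodiment of this application;
FIG. 9 is another schematic structural diagram of a training device according to an embodiment of this application.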
具体实施方式
本申请提供了一种模型训练方法及相关设备,通过在计算值溢出精度范围的情况下,对训练模型过程中使用的精度范围进行实时调整。可以有效解决低精度训练面临的溢出所导致的训练停滞或训练失败问题。另外,对于网络混合精度初始化方案要求低,不依赖人工经验定制初始化方案,可实时自动逐层调整训练精度。
为了便于理解,下面先对本申请实施例主要涉及的相关术语和概念进行介绍。
1、神经网络
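A neural network may be composed of neural units. A neural unit may be an operation unit that takes $X_s$ and an intercept $b$ as inputs, and the output of the operation unit may be:
$$f\left(\sum_{s=1}^{n} W_s X_s + b\right)$$
where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $X_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces nonlinearity into the neural network and converts the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next layer. The activation function may be the ReLU function. A neural network is a network formed by connecting many such single neural units, so the output of one neural unit may be the input of another. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of that field; a local receptive field may be a region composed of several neural units.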
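The single neural unit described above can be sketched in a few lines. This is only an illustration: the function names `relu` and `neuron_output` and the sample numbers are assumptions made for the example, not part of the application.

```python
def relu(z: float) -> float:
    """ReLU activation: passes positive values through, clamps the rest to 0."""
    return max(0.0, z)


def neuron_output(xs, ws, b, f=relu):
    """Output of one neural unit: f(sum_s W_s * X_s + b)."""
    if len(xs) != len(ws):
        raise ValueError("inputs and weights must have the same length")
    return f(sum(w * x for w, x in zip(ws, xs)) + b)


# Hypothetical inputs and weights, chosen only to show both ReLU branches.
inactive = neuron_output([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], b=0.5)
active = neuron_output([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], b=1.5)
```

With these numbers the weighted sum is -0.75, so a bias of 0.5 leaves the unit inactive while a bias of 1.5 activates it.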
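The work of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). At the physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors): 1. raising or reducing dimension; 2. scaling up or down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are performed by Wx, operation 4 by +b, and operation 5 by a(). The word "space" is used because the object being classified is not a single thing but a class of things, and the space is the set of all individuals of that class. W is a weight vector, and each value in the vector represents the weight of one neuron in that layer. The vector W determines the spatial transformation from the input space to the output space described above; that is, the weight W of each layer controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network (the weight matrix formed by the vectors W of many layers). The training process of a neural network is therefore essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.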
神经网络也可以称为人工神经网络(Artificial Neural Network,ANN):是由人工建立的、以有向图为拓扑结构的动态系统,它通过对连续的或断续的输入作为状态响应而进行信息处理,是一种旨在模仿人脑结构及其功能的信息处理系统。经过几十年的发展,人工神经网络已经被广泛使用在模式识别、自动控制、信号处理、辅助决策、人工智能、科学计算等众多领域,并取得了广泛的成功。通常一个网络由输入层、隐藏层、输出层构成。
2、损失函数
在训练神经网络的过程中,因为希望神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么神经网络的训练就变成了尽可能缩小这个loss的过程。
3、前向传播(Forward Propagation)
神经网络前向传播指从输入层经过隐藏层到输出层的计算过程。从输入层开始,按照网络的拓扑结构,将前一层的输出(激活值)作为后一层的输入,逐层计算每一层的输出,直到最后的输出层为止,该过程称为网络的前向传播。
4、反向传播(Backward Propagation)
神经网络反向传播是“误差反向传播”的简称,是一种与最优化方法(如梯度下降法)结合使用的,用来训练人工神经网络的常见方法。该方法对网络中所有权重计算损失函数的梯度。这个梯度会反馈给最优化方法,用来更新权值以最小化损失函数。
5、计算图
计算图通常用箭头表示计算顺序。例如,以函数y=5(a+bc)为例,先计算出bc的值,并将其存在变量i中。再计算a+i,并将a+i存在变量j中。之后计算5*j以得到y的计算结果。
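The evaluation order for y = 5(a + bc) described above can be written out node by node. This is a minimal sketch; the function name `evaluate_graph` and the sample inputs are assumptions for illustration.

```python
def evaluate_graph(a, b, c):
    """Evaluate y = 5 * (a + b * c) in the order the computation graph prescribes."""
    i = b * c   # first node: the product bc is stored in i
    j = a + i   # second node: a + i is stored in j
    y = 5 * j   # third node: the final result
    return {"i": i, "j": j, "y": y}


nodes = evaluate_graph(a=1, b=2, c=3)
```

Keeping the intermediate values i and j is what later allows gradients to be propagated backwards along the same edges.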
6、精度范围
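The precision range refers to the precision range of the data type used by the computer. It may refer to a specific precision, or to the dynamic range of that precision. The following description takes floating-point numbers (floating point, FP), the data type commonly used in neural networks, as an example. In practical applications, the data type may also be an integer type (int), for example int8 or int16.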
FP主要用于表示小数,通常由三部分组成,即符号(sign)位、指数(exponent)位和尾数(mantissa)位。其中,符号位可以是表示正负的1比特(bit),指数位和尾数位可以为多个比特(bits)。一般情况下,尾数位表示精度,指数位用于表示该精度可达的动态范围(本申请实施例中称为精度范围)。浮点数在表示小数时,由于十进制小数在转换为二进制时,存在无法精确转换的情况,而在固定比特的计算机中存储时会被截断,所以浮点数表示小数可能存在精度损失。
浮点数通常可以包括三种格式(format),即半精度浮点数、单精度浮点数和双精度浮点数,具体如下。
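Half-precision floating-point number (half-precision floating-point): a binary data type used by computers that occupies 16 bits (2 bytes) in computer memory, also referred to as FP16. The absolute values that a half-precision floating-point number can represent range approximately over $[6\times10^{-8}, 65504]$. The precision of FP16 is $2^{-10}$.
Single-precision floating-point number (single-precision floating-point): a binary data type used by computers that occupies 32 bits (4 bytes) in computer memory, also referred to as FP32. The absolute values that a single-precision floating-point number can represent range approximately over $[1.4\times10^{-45}, 3.4\times10^{38}]$. The precision of FP32 is $2^{-23}$.
Double-precision floating-point number (double-precision floating-point): a binary data type used by computers that occupies 64 bits (8 bytes) in computer memory, also referred to as FP64. A double-precision floating-point number can represent 15 or 16 significant decimal digits, and the absolute values it can represent range approximately over $[2.23\times10^{-308}, 1.80\times10^{308}]$. The precision of FP64 is $2^{-52}$.
To show the three floating-point formats more intuitively, their structures are listed in Table 1:
Table 1
Format    Total bits    Sign bits    Exponent bits    Mantissa bits
FP16      16            1            5                10
FP32      32            1            8                23
FP64      64            1            11               52
In the 16 bits occupied by FP16, the sign occupies 1 bit, the exponent 5 bits, and the mantissa 10 bits; in the 32 bits occupied by FP32, the sign occupies 1 bit, the exponent 8 bits, and the mantissa 23 bits; in the 64 bits occupied by FP64, the sign occupies 1 bit, the exponent 11 bits, and the mantissa 52 bits.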
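The numeric limits of each format follow from its exponent and mantissa widths. The sketch below assumes the usual IEEE 754 binary layout (biased exponent, implicit leading 1, subnormals); the function name `format_limits` is chosen for illustration only.

```python
def format_limits(exp_bits: int, man_bits: int) -> dict:
    """Derive range and precision of a binary float format from its bit widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        # Largest finite value: all mantissa bits set, largest normal exponent.
        "max_finite": (2.0 - 2.0 ** -man_bits) * 2.0 ** bias,
        # Smallest positive normal value.
        "min_normal": 2.0 ** (1 - bias),
        # Smallest positive subnormal value (closest to 0).
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
        # Spacing between 1.0 and the next representable value.
        "precision": 2.0 ** -man_bits,
    }


FP16 = format_limits(exp_bits=5, man_bits=10)
FP32 = format_limits(exp_bits=8, man_bits=23)
FP64 = format_limits(exp_bits=11, man_bits=52)
```

For FP16 this gives a largest finite value of 65504 and a smallest positive value of about 6×10^-8. For FP64 the value 2.23×10^-308 is the smallest normal number; subnormals extend further toward zero.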
可以理解的是,在实际应用中,为了表示更高精度的浮点数,还可以进一步扩展浮点数的格式以及占用更多比特位的存储格式等。例如,占用128bits的浮点数(可以简称为FP128)等,具体此处不做限定。
7、溢出
本申请实施例中的溢出包括上溢与下溢。其中,上溢是指计算值的绝对值太大,超过某一精度范围能表示的最大值。下溢是指计算值的绝对值太小,小于某一精度范围能表示的最接近0的正数值。
溢出可以包括存储溢出与计算溢出,本申请实施例主要应用于计算溢出的场景。
示例性的,若第一精度范围是1-5,假设两个参数的取值为3,存储的时候是在第一精度范围内的。但是,参数计算有可能溢出第一精度范围。例如,上述两个参数的相加操作,3+3=6,且6大于第一精度范围能表示的最大值5,即参数的计算值溢出第一精度范围。可以理解的是,该示例并不对精度范围的边界取值做限定。
具体的,以FP16为例介绍溢出的情况。上溢包括:若计算值是正数,且该计算值大于FP16能表示的最大正数。或者若计算值是负数,且该计算值小于FP16能表示的最小负数。下溢包括:若计算值是正数,且该计算值小于FP16能表示的最小正数。或者若计算值是负数,且该计算值大于FP16能表示的最大负数。
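The FP16 overflow and underflow cases above can be checked with Python's standard `struct` module, whose `"e"` format is IEEE 754 half precision and raises `OverflowError` for values too large to pack. This is a sketch for finite inputs only; the function name `classify_fp16` is an assumption for illustration.

```python
import struct


def classify_fp16(value: float) -> str:
    """Report whether a finite computed value fits FP16, overflows, or underflows."""
    try:
        packed = struct.pack("<e", value)
    except OverflowError:
        # Magnitude exceeds the largest FP16 value (positive or negative).
        return "overflow"
    stored = struct.unpack("<e", packed)[0]
    if value != 0.0 and stored == 0.0:
        # Magnitude is below the smallest positive FP16 value and rounds to 0.
        return "underflow"
    return "ok"


# Both operands can be stored in FP16, but their sum cannot: a computation overflow.
operand = 40000.0
sum_status = classify_fp16(operand + operand)
```

The last two lines mirror the 3 + 3 = 6 example: each operand fits the range when stored, but the result of the computation does not.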
8、混合精度(Mixed Precision,MP)
目前,大多数的模型使用的是32位单精度浮点数(FP32)来进行训练,而混合精度训练的方法则通过用半精度甚至更低精度结合单精度进行模型训练,从而减少了训练模型所需的内存,同时由于低精度的运算比单精度的运算更快,从而也进一步提高了硬件效率。
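As can be seen from the above, the key to mixed precision is the strategy for deciding which parts of the network are trained with high precision and which with low precision, so as to preserve accuracy while improving training efficiency. In other words, the key to mixed precision is how exactly low precision and high precision are combined during training.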
目前,混合精度训练方法主要包括下述两种方式。
第一种,通过神经网络中各层的类型指定计算精度。某些类型的层用高精度计算,某些类型的层用低精度进行计算。
第二种,通过量化误差是否超过阈值来动态选择精度。其中,量化误差可以在网络中的不同点进行测量或随着训练的进行随时间进行测量。例如:通过将训练结果与基线值进行比较来计算量化误差。其中基线值可以通过多种方法来确定,例如,使用全精度浮点值来训练同一网络,以高精度重复计算的子集,以及分析或采样针对所涉及的计算的数据统计。
然而,上述第一种方式中,由于相同类型的层,在不同网络中,或者在相同网络不同的训练阶段,其对计算精度的要求是不同的。如果按照类型指定计算所适用的精度明显不够灵活,不够智能。
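In the second approach above, adjusting the training precision relies on a precision baseline value. The baseline value must be obtained either by repeated high-precision computation or by training the same network with full-precision floating-point values. This means the scheme is still not sufficiently automatic, and constructing the baseline may significantly increase the computation required for network training, which runs counter to the purpose of low-precision training, namely saving computation and accelerating training.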
鉴于此,本申请实施例提供了一种模型训练方法及相关设备。通过在神经网络中参数的计算值溢出精度范围的情况下,对训练模型过程中使用的精度范围进行实时调整。可以有效解决现有技术中因为低精度训练面临的溢出所导致的训练停滞。另外,对于网络混合精度初始化方案要求低,不依赖人工经验定制初始化方案(例如,无需人工经验逐个网络做层级别的精度调整),可通过是否溢出在训练过程中实时自动逐层调整训练精度。
下面结合附图详细说明本申请实施例提供的模型训练方法及相关设备。
首先介绍本申请实施例提供的系统架构。
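Referring to FIG. 1, an embodiment of this application provides a system architecture 100. As shown in the system architecture 100, a data collection device 160 is configured to collect training data, which in the embodiments of this application may include one or more of the following: images, speech, text, and the like. The training data is stored in a database 130, and a training device 120 trains a target model/rule 101 based on the training data maintained in the database 130. The target model/rule 101 can be used to implement computer vision tasks (for example, classification, segmentation, detection, and image generation). The target model/rule 101 in the embodiments of this application may specifically be a neural network or the like. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the data collection device 160; it may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the target model/rule 101 entirely on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere. The above description should not be taken as a limitation on the embodiments of this application.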
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)设备/虚拟现实(virtual reality,VR)设备,车载终端等。当然,执行设备110还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,输入数据与训练数据对应,本申请实施例中的输入数据也可以包括以下一项或多项:图像、语音、文本等。另外,该输入数据可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库等,具体此处不做限定。
预处理模块113用于根据I/O接口112接收到的输入数据进行预处理(例如,分割、选择、变换等)。例如,对输入数据进行分割处理得到多个数据块(patch)。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果(例如,分类结果、分割结果、检测结果等)返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出值作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出值,作为新的样本数据存入数据库130。
值得注意的是,附图1仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
下面介绍本申请实施例提供的一种芯片硬件结构。
图2为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器20。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。
神经网络处理器20可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器20作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路203,控制器204控制运算电路203提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路203内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路203是二维脉动阵列。运算电路203还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路203是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路203从权重存储器202中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器201中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器208中。
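The accumulation of partial results described above can be illustrated in software. This is a plain-Python sketch and not a model of the operation circuit 203; the tile size and the function name `matmul_accumulate` are assumptions for illustration.

```python
def matmul_accumulate(a, b, tile=2):
    """Compute C = A x B by adding partial products into an accumulator, tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0.0] * m for _ in range(n)]  # plays the role of the accumulator
    for k0 in range(0, k, tile):
        k1 = min(k0 + tile, k)
        for i in range(n):
            for j in range(m):
                # Partial result for this slice of the inner dimension.
                acc[i][j] += sum(a[i][p] * b[p][j] for p in range(k0, k1))
    return acc


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # input matrix
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # weight matrix
C = matmul_accumulate(A, B)
```

After the first tile the accumulator holds only a partial result; it becomes the final matrix C once every slice of the inner dimension has been added.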
向量计算单元207可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元207可以用于神经网络中非卷积/非FC层的网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
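In some implementations, the vector calculation unit 207 can store the processed output vector in the unified memory 206. For example, the vector calculation unit 207 may apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 207 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 203, for example for use in a subsequent layer of the neural network.
The unified memory 206 is used to store input data and output data.
A direct memory access controller (direct memory access controller, DMAC) 205 transfers input data in the external memory to the input memory 201 and/or the unified memory 206, stores weight data in the external memory into the weight memory 202, and stores data in the unified memory 206 into the external memory.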
总线接口单元(bus interface unit,BIU)210,用于通过总线实现主CPU、DMAC和取指存储器209之间进行交互。
与控制器204连接的取指存储器(instruction fetch buffer)209,用于存储控制器204使用的指令。
控制器204,用于调用取指存储器209中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器206,输入存储器201,权重存储器202以及取指存储器209均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
下面结合附图对本申请实施例的模型训练方法和数据处理方法进行详细的介绍。
首先,对本申请实施例提供的模型训练方法的应用场景进行描述。该方法可以适用于动态计算图场景,也可以适用于静态计算图场景。其中,动态计算图场景可以理解为是在模型的每层网络结构计算后,更新计算图。静态计算图场景可以理解为是在模型的所有层网络结构计算后,再更新计算图。区别主要是计算图的更新时机。计算图的计算可以适用本申请实施例提供的模型训练方法(或称为计算精度调整方法)。
下面结合图3对本申请实施例的模型训练方法进行详细介绍。该方法可以由训练设备执行,也可以由训练设备的部件(例如,处理器、芯片、或芯片系统等)执行。该训练设备可以是云端设备,也可以是终端设备,例如,还可以是电脑、服务器等运算能力足以用来执行模型训练方法的设备,也可以是由云端设备和终端设备构成的系统。示例性地,该训练方法可以由图1中的训练设备120、图2中的神经网络处理器20执行。
可选地,该模型训练方法可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
请参阅图3,本申请实施例提供的模型训练方法的一个流程示意图,该方法可以包括步骤301至步骤303。下面对步骤301至步骤303进行详细说明。
步骤301,获取训练数据。
本申请实施例中,训练设备获取训练数据的方式有多种方式,可以是通过接收其他设备(例如,服务器、业务设备等)发送的方式,还可以是从数据库中选取的方式,也可以是通过用户拍摄等方式,具体此处不做限定。
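The training data in the embodiments of this application may include one or more of the following: images, speech, text, and the like. The training data depends on the scenario in which the model is applied. For example, when the model is used for audio recognition, the training data may take the form of audio data. As another example, when the model is used for image classification, the training data may take the form of image data. As a further example, when the model is used to predict speech, the training data may take the form of text data. It can be understood that these cases are only examples and the correspondence is not necessarily one-to-one. For instance, for audio recognition the training data may also be image data or text data (for example, in an educational scenario in which speech is played for a picture, the model identifies the speech corresponding to an image, so the training data may be image data). Other scenarios also exist in practice; for example, when the model is used for movie recommendation, the training data may be word vectors corresponding to movies. In some application scenarios the training data may include data of different modalities at the same time; for example, in an autonomous driving scenario the training data may include image/video data captured by cameras as well as voice/text data of instructions issued by the user. The embodiments of this application do not limit the specific form or type of the training data.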
可以理解的是,若模型的训练为有监督训练,则本步骤获取的训练数据是携带有标签的训练数据。若模型的训练为无监督训练,则本步骤获取的训练数据是未携带有标签的训练数据。
步骤302,将训练数据作为模型的输入,在模型训练过程中使用第一精度范围进行参数的计算以得到计算值;
训练设备获取训练数据之后,将训练数据作为模型的输入,使用第一精度范围进行参数的计算以得到计算值。该参数与该模型的损失函数相关。对于监督训练来说,损失函数用于表示模型的输出与训练数据所属标签之间的差异。对于非监督来说,损失函数可以是自定义的函数。例如,在模型的任务是分类任务的情况下,损失函数用于表示模型的输出与输入(或者聚类结果等)之间的差异。或者理解为,无监督训练中,希望可以将模型的输出还原出模型的输入,例如,标签就是训练数据自己(即,模型得到的输出再送到另一个网络以还原训练数据)。可以理解的是,本申请实施例中对于损失函数并不进行限定,该损失函数也可以理解为是模型的优化目标函数,具体可以根据实际需要设置。
本申请实施例中的参数可以包括下述一项或多项可能:参数与模型的损失函数相关、参数与模型在前向传播过程中的计算相关、参数与模型在反向传播过程中的计算相关。
可选地,该模型包括多个网络结构。另外,本申请实施例中的模型具体是人工神经网络。对于人工神经网络包括的具体层数或结构可以根据实际需要设置,此处不做限定。
本申请实施例中的精度范围(例如,第一精度范围与第二精度范围)可以是指数据类型(例如,int、float等)的精度范围。为了方便后续描述,本申请实施例中仅以精度范围为FP的精度范围为例进行示例性说明。当然,在实际应用中,还可以是其他数据类型(例如,int等)。
可选地,第一精度范围为FP16精度的动态可达范围。其中,关于FP16与精度范围的描述可以参考前述相关术语中的解释,此处不再赘述。
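In the embodiments of this application, the precision ranges used by the layers of the model may be the same or different. In other words, the first precision range may be specified per layer, or all layers of the model may use the same precision range. Accordingly, the later recomputation of parameters with the second precision range in place of the first may be a layer-granularity adjustment (only the precision range of a specific layer is adjusted and that of the other layers is left unchanged; the specific layer may also be called the overflow layer) or an adjustment for the whole model (that is, all layers).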
在一种可能实现的方式中,模型的所有层网络结构都采用同一精度范围,则第一精度范围指该同一精度范围。或者理解为,本申请实施例可以适用于单一精度训练场景。
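In another possible implementation, at least two layers of the model use different precision ranges, for example when the model is trained with mixed precision. The first precision range then refers to either the low precision or the high precision in the mixed-precision setup. In general, for mixed-precision training, overflow is assumed to be unlikely at high precision, so the first precision range may specifically refer to the low-precision range. Put differently, the embodiments of this application are applicable to mixed-precision training scenarios, in which case only specific layers are subsequently adjusted.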
本申请实施例中的参数有多种情况,下面分别描述:
第一种,参数为前向传播过程中的参数。
该种方式下,参数可以是多个网络结构在前向传播过程中计算的中间特征或损失函数的值。其中,该中间特征也可以理解为是多个网络结构使用激活函数获取的激活值。该损失函数的值也可以理解为是所有层网络结构在前向传播结束后计算的模型输出与标签之间的差异值。
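As an example, take the model shown in FIG. 4. The model includes three layers of network structure: an input layer (3 neurons), a hidden layer (4 neurons), and an output layer (2 neurons). Here $w^{l}_{ij}$ is the weight from the j-th neuron of layer l-1 to the i-th neuron of layer l (for example, $w^{2}_{43}$ is the weight from the 3rd neuron of layer 1 to the 4th neuron of layer 2), $b^{l}_{j}$ is the bias of the j-th neuron of layer l, and $a^{l}_{j}$ is the activation value of the j-th neuron of layer l. The forward propagation process may be expressed as Formula 1 and Formula 2, where Formula 2 is the matrix form of Formula 1.
Formula 1: $a^{l}_{i} = \sigma\left(\sum_{j} w^{l}_{ij}\, a^{l-1}_{j} + b^{l}_{i}\right)$;
Formula 2: $a^{l} = \sigma\left(W^{l} a^{l-1} + b^{l}\right)$;
where σ is the activation function.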
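Using Formula 2, the activation values can be computed layer by layer, and the model's output f(x) is finally obtained from the input X. The value of the loss function is then computed from the difference between the output f(x) and the label value Y of X. The loss function may be a mean squared error (Mean Squared Error, MSE) loss, a mean absolute error loss, a cross-entropy loss, a hinge loss, and so on; it may be set according to actual needs, and the structure of the loss function is not limited.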
可以理解的是,图4所示的模型结构只是举例,并不对本申请实施例所提的模型造成限定,且公式一与公式二只是对于前向传播过程中一种表达形式的示例,在实际应用中,前向传播过程还可以有其他表达形式,具体此处不做限定。
上述举例中,参数可以包括中间特征,和/或损失函数的值。其中,每层求得的al可以理解为中间特征,f(x)与Y的差值为损失函数的值。
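The forward pass for a 3-4-2 network like the one in FIG. 4 can be sketched as follows. The sigmoid activation, the constant weights, and the MSE loss are assumptions chosen only to make the example concrete; the application does not fix any of them.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, bias, a_prev, act=sigmoid):
    """One layer: a_i = act(sum_j w_ij * a_prev_j + b_i)."""
    return [act(sum(w * a for w, a in zip(row, a_prev)) + b)
            for row, b in zip(weights, bias)]


def forward(layers, x):
    """Return the intermediate features (per-layer activations), in order."""
    features, a = [], x
    for weights, bias in layers:
        a = layer_forward(weights, bias, a)
        features.append(a)
    return features


def mse(output, target):
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


# Hypothetical parameters: hidden layer is 4x3, output layer is 2x4.
hidden = ([[0.1] * 3 for _ in range(4)], [0.0] * 4)
output = ([[0.2] * 4 for _ in range(2)], [0.0] * 2)
features = forward([hidden, output], x=[1.0, 2.0, 3.0])
loss = mse(features[-1], target=[1.0, 0.0])
```

Each entry of `features` is an intermediate feature in the sense used above, and `loss` is the value of the loss function; either one may be the parameter whose computed value is checked against the first precision range.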
第二种,参数为反向传播过程中的参数。
该种方式下,参数可以是多个网络结构在反向传播过程中计算的梯度。该梯度包括以下一项或多项:中间特征的梯度、模型中权重的梯度、损失的梯度等。
其中,反向传播过程可以理解为是通过损失函数值不断调整权重的过程。
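As an example, continuing the example above, the back-propagation process may include: choosing an optimizer and a preset learning rate, and then finding the optimal parameters (for example, the weights) by minimizing the loss function. Specifically, the partial derivatives of the loss function with respect to the weights may be computed using the chain rule; these partial derivatives are the gradient.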
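The chain-rule computation can be shown on a single sigmoid neuron with a squared-error loss. This is a deliberately small sketch; the model, the learning rate, and all function names are assumptions for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a one-neuron model a = sigmoid(w * x + b)."""
    return (sigmoid(w * x + b) - y) ** 2


def grad_w(w, b, x, y):
    """dL/dw by the chain rule: dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    return 2.0 * (a - y) * a * (1.0 - a) * x


def sgd_step(w, gradient, learning_rate):
    """Plain gradient descent update of one weight."""
    return w - learning_rate * gradient


w, b, x, y = 0.5, 0.1, 2.0, 1.0
g = grad_w(w, b, x, y)
w_new = sgd_step(w, g, learning_rate=0.1)
```

The analytic gradient can be cross-checked against a finite difference of `loss`, and one update step lowers the loss for this input.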
可以理解的是,上述参数的两种情况只是示例,在实际应用中,参数还可以有其他情况,具体此处不做限定。
步骤303,若计算值溢出第一精度范围,则使用第二精度范围重新计算参数,并使用重新计算后的参数对模型进行一次或多次迭代训练。
若训练设备在上述步骤302中,使用第一精度范围计算参数的计算值溢出第一精度范围,则使用第二精度范围重新计算参数,并使用重新计算后的参数对模型进行一次或多次迭代训练。其中,第二精度范围包括第一精度范围,或者第二精度范围与第一精度范围部分重叠。
相对于现有技术中,在某一次迭代中,若精度溢出,会丢弃溢出所使用的训练数据并产生训练停滞的问题。本申请实施例提供的方法可以通过及时对精度范围调整使得训练继续,进而可以提升模型的训练效率。
本申请实施例中第二精度范围包括第一精度范围,或者第二精度范围与第一精度范围部分重叠。
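As an example of the case in which the second precision range includes the first precision range, take FP16 as the first precision range and FP32 as the second. The first precision range may be $[-65504, 65504]$ and the second precision range may be $[-3.4\times10^{38}, 3.4\times10^{38}]$.
As an example of the case in which the second precision range partially overlaps the first precision range, take FP16 as the first precision range and int16 as the second. The first precision range is $[-65504, 65504]$ and the second precision range may be the integers from -32768 to 32767.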
可以理解的是,上述示例并不对精度范围的边界取值做限定。
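The two relationships between precision ranges can be probed by testing which values each data type represents exactly. The sketch below is a sample-based check, not a proof; the helper names and the sample list are assumptions, and `struct` formats `"e"` and `"f"` stand in for FP16 and FP32.

```python
import struct


def float_membership(fmt):
    """Return a predicate telling whether a value is exactly representable in fmt."""
    def member(v):
        try:
            return struct.unpack(fmt, struct.pack(fmt, v))[0] == v
        except OverflowError:
            return False
    return member


def in_int16(v):
    return float(v).is_integer() and -32768 <= v <= 32767


def relation(in_first, in_second, samples):
    """Classify how the second range relates to the first, judged on the samples."""
    shared = any(in_first(s) and in_second(s) for s in samples)
    first_only = any(in_first(s) and not in_second(s) for s in samples)
    if not shared:
        return "no overlap seen"
    return "partial overlap" if first_only else "second includes first"


in_fp16, in_fp32 = float_membership("<e"), float_membership("<f")
SAMPLES = [1.0, 0.5, -2.0, 32767.0, 65504.0, 100000.0]
fp16_vs_fp32 = relation(in_fp16, in_fp32, SAMPLES)
fp16_vs_int16 = relation(in_fp16, in_int16, SAMPLES)
```

On these samples FP32 covers everything FP16 can represent, while FP16 and int16 each hold values the other cannot: 0.5 exists only in FP16, and 32767 only in int16.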
可选地,第一精度范围与第二精度范围也可以是相对高低范围的概念,若第一精度范围为高精度范围(后续简称为高精度),则第二精度范围为低精度范围(后续简称为低精度)。若第一精度范围为低精度,第二精度范围为高精度。一般情况下,低精度会出现溢出问题。
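In the embodiments of this application, the description uses only the case in which the first precision range is low precision and the second precision range is high precision. In practical applications, the method provided by the embodiments of this application may of course also be used for adjustment between high precision and low precision, or for adjustment from low precision to high precision.
Further, recomputing the parameter with the second precision range may be done in several ways: the parameter may be recomputed with the second precision range starting from the first layer of the model, or starting from the current network structure at which the computed value overflowed. This is not limited here.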
在一种可能实现的方式中,当参数包括前向传播过程中计算的中间特征时,上述步骤:使用第二精度范围重新计算参数具体包括:使用第二精度范围计算溢出层的中间特征或从第一层网络结构开始逐层计算中间特征,溢出层为多个网络结构中中间特征的计算值溢出第一精度范围的网络结构。
在另一种可能实现的方式中,当参数包括模型的损失函数的值时,上述步骤:使用第二精度范围重新计算参数,具体包括:使用第二精度范围计算损失函数的值或从第一层网络结构开始逐层计算直至得到损失函数的值。
示例的,以模型包括5层网络结构为例。若使用第一精度范围进行第4层的参数计算的计算值溢出第一精度范围。则可以使用第二精度范围从第1层开始重新计算直至计算到第4层。也可以使用第二精度范围重新计算第4层的参数。
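The two recomputation options in the 5-layer example can be sketched as follows. Layers are modelled as scalar functions, FP16 and FP32 are emulated with the `struct` formats `"e"` and `"f"`, and only upward overflow is detected; all of this, including the name `forward_with_fallback`, is an assumption for illustration.

```python
import struct

FP16, FP32 = "<e", "<f"


def round_to(fmt, value):
    """Round value to the given float format; raises OverflowError if too large."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def forward_with_fallback(layers, x, restart="overflow_layer"):
    """Forward pass in FP16 that recomputes in FP32 when a layer overflows."""
    outputs, precisions = [], []
    for idx, layer in enumerate(layers):
        layer_input = outputs[-1] if outputs else x
        try:
            outputs.append(round_to(FP16, layer(layer_input)))
            precisions.append("fp16")
        except OverflowError:
            if restart == "first_layer":
                # Redo every layer from the first one up to the overflow layer.
                outputs, precisions, value = [], [], x
                for earlier in layers[: idx + 1]:
                    value = round_to(FP32, earlier(value))
                    outputs.append(value)
                    precisions.append("fp32")
            else:
                # Redo only the layer whose computed value overflowed.
                outputs.append(round_to(FP32, layer(layer_input)))
                precisions.append("fp32")
    return outputs, precisions


# Hypothetical 5-layer model whose 4th layer produces 100000, beyond FP16.
layers = [lambda a: a * 10] * 3 + [lambda a: a * 100, lambda a: a / 1000]
per_layer = forward_with_fallback(layers, 1.0, restart="overflow_layer")
from_start = forward_with_fallback(layers, 1.0, restart="first_layer")
```

Both strategies produce the same values here; they differ in how much work is redone in the wider type, which is the trade-off between recomputing only layer 4 and recomputing layers 1 to 4.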
可选地,当参数包括梯度时,计算值为梯度除以缩放系数的值,缩放系数用于减少梯度溢出的概率;该方法还包括:使用第一系数更新缩放系数,更新后的缩放系数用于替换更新前的缩放系数进行模型的下一次迭代训练,第一系数为小于1的正数。
其中,上述的溢出可以参考前述相关术语中的描述,此处不再赘述。计算值溢出第一精度范围也可以理解为是参数的计算值用第一精度范围无法准确表示。可能带来后续的舍入错误,或者狭窄的精度范围带来的溢出错误。
若参数的计算值溢出第一精度范围,则使用与第一精度范围不同的第二精度范围对参数重新进行计算,并使用更新后的参数对模型进行一次或多次迭代训练。具体训练过程可以参考前述前向传播与反向传播过程中的描述。例如,通过使用第二精度范围计算的参数进行前向传播中损失函数的计算。并以损失函数的值小于某个阈值为目标对模型进行一次或多次迭代训练,以得到训练好的模型。
需要注意的是,如前述步骤302中,第二精度范围重新计算可以是层粒度的精度范围调整,也可以是整个模型(即所有层)的精度范围调整。即,若初始化所有层的计算精度为同一精度范围,则本步骤的重新计算是针对所有层的(或者理解为是对所有层的计算精度进行调整)。若初始化所有层的计算精度为不同精度范围(即混合精度计算),则本步骤的重新计算是针对特定层的(或者理解为是对特定层的计算精度进行调整)。该特定层的计算精度溢出第一精度范围。
另外,本实施例中的步骤301至步骤303可以执行一次或多次,可以在模型的训练过程中每更新一次或多次后执行一遍步骤301至步骤303。也可以是满足预设周期或预设次数再执行一遍步骤301至步骤303。
本申请实施例中,在模型训练过程中参数的计算值溢出第一精度范围的情况下,使用第二精度范围重新计算参数。即通过参数计算值的溢出信息进行精度范围的自动实时调整,不仅可以减少训练模型所占用的内存,还可以提升模型的训练效率。进而减少由于参数的计算溢出第一精度范围导致的训练停滞等问题。此外,相较于现有技术中需要使用模型中网络层的类型确定使用高精度浮点数还是低精度浮点数的方法,本申请实施例可以通过参数的溢出信息对参数所适用的精度范围进行实时调整,并减少由于低精度浮点数计算产生的溢出问题。
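Optionally, in the embodiment shown in FIG. 3, if the computed value of the parameter does not overflow the first precision range, the value of the loss function obtained in forward propagation is multiplied by a scaling coefficient, and the gradients in back-propagation are computed using the first precision range. If the value of a gradient divided by the scaling coefficient overflows the first precision range, the scaling coefficient is updated using a first coefficient, and forward and backward propagation are recomputed using the second precision range. The first coefficient is a positive number less than 1, and the updated scaling coefficient replaces the previous one in the next training iteration of the model.
To reduce the risk of later parameter underflow, a lower limit may be imposed on the scaling coefficient; for example, the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.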
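The update of the scaling coefficient with its lower limit can be sketched as a single function. The default values 0.5 and 1.0 and the name `update_scale` are assumptions for illustration; the text only requires a first coefficient in (0, 1) and a preset threshold of at least 1.

```python
def update_scale(scale, first_coefficient=0.5, min_scale=1.0):
    """Shrink the scaling coefficient after an overflow, never below min_scale."""
    if not 0.0 < first_coefficient < 1.0:
        raise ValueError("first_coefficient must be a positive number less than 1")
    if min_scale < 1.0:
        raise ValueError("min_scale must be greater than or equal to 1")
    return max(scale * first_coefficient, min_scale)


# Repeated overflows halve the coefficient until it reaches the lower limit.
history = [8.0]
for _ in range(4):
    history.append(update_scale(history[-1]))
```

Without the lower limit, repeated overflows would drive the coefficient below 1, at which point scaling would shrink gradients and increase the risk of underflow instead of reducing it.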
可选地,由于模型在第一次迭代过程中可能设置有初始的精度范围(例如,第一精度范围),为了提升模型在后续迭代的效率,可以根据迭代次数与溢出次数对每次迭代初始的精度范围进行调整。
具体的,在模型训练过程中的第N次迭代,获取基于第一精度范围对模型中多个网络结构的溢出次数,N为大于或等于1的正整数。若溢出次数大于或等于第二阈值,则确定下一次迭代训练过程中初始的精度范围从第一精度范围更换为第二精度范围,并将溢出次数清零。其中,该溢出次数包括:多个网络结构在前向传播过程中的溢出次数,和/或多个网络结构在反向传播过程中的溢出次数。
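The overflow counter described above can be sketched as a small state holder. The class name `PrecisionSelector`, the string labels, and the threshold value are assumptions for illustration.

```python
class PrecisionSelector:
    """Track overflows and decide the initial precision range for the next iteration."""

    def __init__(self, threshold, first="fp16", second="fp32"):
        self.threshold = threshold  # the "second threshold" in the text
        self.second = second
        self.initial = first        # precision range an iteration starts with
        self.count = 0              # accumulated overflow count

    def record_iteration(self, forward_overflows=0, backward_overflows=0):
        """Add this iteration's overflows; switch and reset once the threshold is hit."""
        self.count += forward_overflows + backward_overflows
        if self.count >= self.threshold:
            self.initial = self.second
            self.count = 0
        return self.initial


selector = PrecisionSelector(threshold=3)
after_first = selector.record_iteration(forward_overflows=1)
after_second = selector.record_iteration(forward_overflows=1, backward_overflows=1)
```

Here forward and backward overflows are summed into one counter; counting them separately would be an equally valid reading of the text.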
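In addition, the parameter in the embodiment shown in FIG. 3 may take several forms. In one possible implementation, the parameter is an intermediate feature and its computation overflows the first precision range. In that case, besides recomputing the intermediate feature with the second precision range, the second precision range may also be used for the subsequent loss computation and/or weight-gradient computation. In other words, the precision adjustment may apply to the overflowing parameter only, or also to the computation of other parameters later in training; this is not limited here. In another possible implementation, the parameter is the loss and its computation overflows the first precision range. In that case, besides recomputing the loss with the second precision range (either recomputing only the loss, or recomputing from the first layer until the loss is obtained; that is, the recomputation may cover only the overflowing computation or also the computations that preceded it), the second precision range may also be used for the subsequent weight-gradient computation. In other words, when a parameter computed midway through training overflows the precision range, the precision adjustment may apply to the overflowing parameter, to parameters computed earlier in training that did not overflow, or to other parameters computed later in training; this is not limited here.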
图5为本申请实施例提供的另一种模型训练方法。该方法的执行主体与图3所示实施例的执行主体类似,此处不再赘述。该方法可以包括步骤501至步骤511。下面对步骤501至步骤511进行详细说明。
步骤501,获取训练数据。
本步骤501可以参考前述图3所示实施例中步骤301的描述,此处不再赘述。
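Step 502: determine whether the overflow count (num) is greater than or equal to the second threshold (N). If yes, perform step 503. If no, perform step 508.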
本申请实施例中,仅以溢出次数包括模型在前向计算与反向计算过程中各层的计算值溢出第一精度范围的次数为例进行示例性描述。可以理解的是,在实际应用中,该溢出次数可以是模型在前向计算中各层的中间特征溢出第一精度范围的次数。也可以是模型在反向计算梯度溢出第一精度范围的次数。即该溢出次数可以是前向与反向整体过程中的计数,也可以是前向或反向过程中的分别计数等,具体此处不做限定。
本申请实施例中的前向计算是指模型在前向传播过程中的参数(例如,中间特征、loss等)计算。反向计算是指模型在反向传播过程中的参数(例如,梯度等)计算。
训练设备判断该溢出次数是否大于或等于第二阈值(N),N为大于或等于0的整数。若是,执行步骤503。若否,执行步骤508。
可以理解的是,在初始的情况下,溢出次数num置0。
步骤503,使用第一精度范围进行前向计算得到损失(loss)。
若前述步骤502中的溢出次数大于或等于第二阈值,则触发本步骤的执行。
本步骤503中loss的计算可以参考前述图3所示实施例中步骤302的描述,此处不再赘述。
步骤504,判断loss是否溢出。若是,执行步骤509。若否,乘以缩放系数,并执行步骤505。
训练设备获取loss之后,判断loss是否溢出。若是,执行步骤509。若否,乘以缩放系数,并执行步骤505。
步骤505,使用第一精度范围进行反向计算得到权重梯度。
若前述步骤504中的loss未溢出第一精度范围,则乘以缩放系数,并触发本步骤的执行。
其中,乘以缩放系数执行步骤505,是为了防止计算值下溢。
本步骤505计算权重梯度的描述可以参考前述图3所示实施例中步骤302的描述,此处不再赘述。
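The reason for multiplying the loss by the scaling coefficient can be seen numerically. Because the gradient of a scaled loss is the scaled gradient, the sketch models the backward pass as a multiplication. FP16 is emulated with the `struct` format `"e"`, the division by the coefficient is done in Python's double precision (an assumption, since the text does not say where it happens), and all names and values are illustrative.

```python
import struct


def to_fp16(value):
    """Round to FP16; raises OverflowError if the magnitude is too large."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def scaled_gradient_fp16(true_gradient, scale):
    """Gradient of (loss * scale) as it would be stored in FP16."""
    return to_fp16(true_gradient * scale)


def unscale(scaled_gradient, scale):
    """Divide by the scaling coefficient to recover the gradient."""
    return scaled_gradient / scale


tiny_gradient = 1e-8
unscaled_storage = to_fp16(tiny_gradient)                # underflows to 0.0
scaled_storage = scaled_gradient_fp16(tiny_gradient, 1024.0)
recovered = unscale(scaled_storage, 1024.0)
```

A coefficient that is too large has the opposite problem: a gradient of 100 scaled by 1024 no longer fits FP16, which is the situation in which the coefficient is reduced.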
步骤506,判断权重梯度除以缩放系数的值是否溢出。若是,执行步骤510。若否,执行步骤507。
训练设备获取权重梯度之后,判断权重梯度除以缩放系数的值是否溢出第一精度范围。若是,执行步骤510。若否,执行步骤507。
步骤507,更新权重。
若前述步骤506中权重梯度除以缩放系数的值未溢出第一精度范围,和/或前述步骤508之后,则触发本步骤的执行。
或者理解为,若权重梯度除以缩放系数的值没有溢出第一精度范围,则使用该值对权重进行迭代更新。
步骤508,使用第二精度范围进行前向与后向的重新计算。
若前述步骤502中的溢出次数小于第二阈值,和/或前述步骤509之后,则触发本步骤的执行。
本步骤508可以参考前述图3所示实施例中步骤303的描述,此处不再赘述。
步骤509,溢出次数累计加1(即num+1)。
若前述步骤504中loss溢出第一精度范围,和/或权重梯度除以缩放系数的值溢出第一精度范围,则触发本步骤的执行。
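Put differently, the overflow count is recorded so that later iterations can compare it with the second threshold and decide whether to change the initial precision range used in each iteration. For example, if the overflow count is greater than the second threshold, the initial precision range is changed from the first precision range to the second. Besides the overflow count, the number of iterations may also be taken into account when adjusting the initial precision range. For example, if the number of iterations has reached 1000 and the overflow count exceeds 800 (that is, the second threshold is 800), using the first precision range as the initial precision range is already affecting model training; to preserve the accuracy of subsequent training, the initial precision range is changed to the second precision range, reducing problems such as training stagnation caused by parameters overflowing the first precision range.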
步骤510,使用第一系数更新缩放系数。
若前述步骤506中权重梯度除以缩放系数的值溢出第一精度范围,则触发本步骤的执行。若权重梯度除以缩放系数的值溢出第一精度范围,则使用第一系数更新缩放系数。该第一系数为小于1的正数。
或者理解为,若权重梯度除以缩放系数的值溢出第一精度范围,则将缩放系数乘以一个小于1的正数进行调整。
步骤511,确定缩放系数大于或等于预设阈值。
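After determining that the weight gradient divided by the scaling coefficient overflows the first precision range, the training device may, when updating the scaling coefficient with the first coefficient, check whether the scaling coefficient is less than the preset threshold. If it is, the scaling coefficient is set to the preset threshold. If the scaling coefficient is greater than or equal to the preset threshold, it is left unchanged, and step 509 is performed.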
本步骤,通过对缩放系数的取值下限进行限定,可以减少后续参数下溢的风险,提升模型训练的稳定性。
可以理解的是,上述设置调整缩放系数的取值下限是为了下次迭代,调整缩放系数之后,执行步骤509。
另外,本实施例中的步骤501至步骤511可以执行多次,可以在模型的训练过程中每更新一次或多次后执行一遍步骤501至步骤511。也可以是满足预设周期或预设次数再执行一遍步骤501至步骤511。
本申请实施例中,一方面,通过前向计算和/或反向计算过程中产生的溢出信息进行精度范围的自动实时调整,不仅可以减少训练模型所占用的内存,还可以提升模型的训练效率。进而减少由于参数的计算溢出第一精度范围导致的训练停滞等问题。此外,相较于现有技术中需要使用模型中网络层的类型确定使用高精度浮点数还是低精度浮点数的方法,本申请实施例可以通过参数的溢出信息对参数所适用的精度范围进行实时调整,并减少由于低精度浮点数计算产生的溢出问题。另一方面,可以对初始的精度范围进行调整,在第一精度范围影响到模型的训练,为了后续模型训练的准确性,将初始的精度范围从第一精度范围调整为第二精度范围。另一方面,在反向计算过程中,通过设置缩放系数的取值下限,减少后续参数精度下溢的风险。
可以理解的是,在图5所示的实施例中,前向计算溢出后的处理有多种情况。例如,若前向计算溢出,则可以使用第二精度范围重新进行前向计算,并使用第一精度范围进行反向计算。例如,若前向计算溢出,则可以使用第二精度范围重新进行前向计算与后续的反向计算。即,调整精度可以是只针对溢出的计算进行调整,也可以是对整体的计算进行精度调整,具体此处不做限定。
图6为本申请实施例提供的另一种模型训练方法。该方法的执行主体与图3所示实施例的执行主体类似,此处不再赘述。该方法可以包括步骤601至步骤615。下面对步骤601至步骤615进行详细说明。
步骤601,获取训练数据。
步骤602,判断溢出次数(num)是否大于或等于第二阈值(N)。若是,执行步骤603。若否,执行步骤612。
本步骤601与步骤602可以参考前述图5所示实施例中步骤501与步骤502的描述,此处不再赘述。
步骤603,使用第一精度范围进行前向计算得到中间特征。
若前述步骤602中的溢出次数大于或等于第二阈值,则触发本步骤的执行。
步骤604,判断中间特征是否溢出。若是,执行步骤605。若否,执行步骤607。
训练设备获取中间特征之后,判断中间特征的计算值是否溢出。若是,执行步骤605。若否,执行步骤607。
步骤605,溢出次数累计加1(即num+1)。
若前述步骤604中中间特征,步骤608中loss,和/或权重梯度除以缩放系数的值溢出第一精度范围,则触发本步骤的执行。
步骤606,使用第二精度范围计算溢出层的中间特征或从第一层开始逐层计算中间特征。
本步骤可以理解为,若中间特征溢出第一精度范围,则可以使用第二精度范围进行溢出层的计算或所有层的计算。
步骤607,使用第一精度范围进行前向计算得到损失(loss)。
若前述步骤604中的中间特征未溢出,则触发本步骤的执行。
步骤608,判断loss是否溢出。若是,执行步骤613。若否,乘以缩放系数,并执行步骤609。
步骤609,使用第一精度范围进行反向计算得到权重梯度。
步骤610,判断权重梯度除以缩放系数的值是否溢出。若是,执行步骤614。若否,执行步骤611。
步骤611,更新权重。
步骤612,使用第二精度范围进行前向与后向的重新计算。
步骤613,溢出次数累计加1(即num+1)。
步骤614,使用第一系数更新缩放系数。
步骤615,确定缩放系数大于或等于预设阈值。
本步骤607至步骤615可以参考前述图5所示实施例中步骤503至步骤511的描述,此处不再赘述。
本申请实施例中,在中间特征的计算溢出第一精度范围的情况下,可以使用第二精度范围计算溢出层的中间特征或从第一层网络结构开始逐层计算中间特征。换句话说,该精度范围的调整可以是特定层的调整,也可以是整个网络结构所有层的调整。一方面,通过前向计算和/或反向计算过程中产生的溢出信息进行精度范围的自动实时调整,不仅可以减少训练模型所占用的内存,还可以提升模型的训练效率。进而减少由于参数的计算溢出第一精度范围导致的训练停滞等问题。此外,相较于现有技术中需要使用模型中网络层的类型确定使用高精度浮点数还是低精度浮点数的方法,本申请实施例可以通过参数的溢出信息对参数所适用的精度范围进行实时调整,并减少由于低精度浮点数计算产生的溢出问题。另一方面,可以对初始的精度范围进行调整,在第一精度范围影响到模型的训练,为了后续模型训练的准确性,将初始的精度范围从第一精度范围调整为第二精度范围。另一方面,在反向计算过程中,通过设置缩放系数的取值下限,减少后续参数精度下溢的风险。
图7为本申请实施例提供的另一种模型训练方法。该方法的执行主体与图3所示实施例的执行主体类似,此处不再赘述。该方法可以包括步骤701至步骤717。下面对步骤701至步骤717进行详细说明。
步骤701,获取训练数据。
步骤702,判断溢出次数(num)是否大于或等于第二阈值(N)。若是,执行步骤703。若否,执行步骤712。
步骤703,使用第一精度范围进行前向计算得到中间特征。
步骤704,判断中间特征是否溢出。若是,执行步骤705。若否,执行步骤707。
步骤705,溢出次数累计加1(即num+1)。
步骤706,使用第二精度范围计算溢出层的中间特征或从第一层开始逐层计算中间特征。
步骤707,使用第一精度范围进行前向计算得到损失(loss)。
步骤708,判断loss是否溢出。若是,执行步骤716。若否,乘以缩放系数,并执行步骤709。
步骤709,使用第一精度范围进行反向计算得到权重梯度。
步骤710,判断权重梯度除以缩放系数的值是否溢出。若是,执行步骤714。若否,执行步骤711。
步骤711,更新权重。
步骤712,使用第二精度范围进行前向与后向的重新计算。
步骤713,溢出次数累计加1(即num+1)。
步骤714,使用第一系数更新缩放系数。
步骤715,确定缩放系数大于或等于预设阈值。
本步骤701至步骤715可以参考前述图6所示实施例中步骤601至步骤615的描述,此处不再赘述。
步骤716,溢出次数累计加1(即num+1)。
本步骤716与前述步骤705类似,换句话说,该溢出次数累积的是中间特征、loss、权重梯度在模型训练过程中溢出第一精度范围的次数。
步骤717,使用第二精度范围重新计算loss或从第一层开始逐层计算直至得到loss。并将loss乘以缩放系数,执行步骤709。
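This step can be understood as follows: if the loss overflows the first precision range, the second precision range may be used either to redo only the final loss computation or to redo the computation of all layers of the model.
In this embodiment of the application, when the computation of the loss overflows the first precision range, the loss may be recomputed with the second precision range, or the computation may be redone layer by layer from the first layer until the loss is obtained. In other words, the precision-range adjustment may apply to a specific layer or to all layers of the network. On the one hand, the precision range is adjusted automatically and in real time based on overflow information produced during forward and/or backward computation, which not only reduces the memory occupied by model training but also improves training efficiency, and in turn reduces problems such as training stagnation caused by parameter computations overflowing the first precision range. In addition, compared with prior-art methods that decide between high-precision and low-precision floating-point numbers according to the type of each network layer, the embodiments of this application adjust the precision range applicable to a parameter in real time based on its overflow information and reduce overflow caused by low-precision floating-point computation. On the other hand, the initial precision range can be adjusted: when the first precision range is affecting model training, the initial precision range is changed from the first precision range to the second to preserve the accuracy of subsequent training. Furthermore, in backward computation, setting a lower limit on the scaling coefficient reduces the risk of later parameter underflow.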
上面对本申请实施例中的模型训练方法进行了描述,下面对本申请实施例中的训练设备进行描述,请参阅图8,本申请实施例中训练设备的一个实施例包括:
获取单元801,用于获取训练数据;
计算单元802,用于将训练数据作为模型的输入,在模型训练过程中使用第一精度范围进行参数的计算以得到计算值;
计算单元802,还用于若计算值溢出第一精度范围,则使用第二精度范围重新计算参数,并使用重新计算后的参数对模型进行一次或多次迭代训练,其中,第二精度范围包括第一精度范围,或者第二精度范围与第一精度范围部分重叠。
可选地,模型包括多个网络结构,计算单元802,具体用于使用第二精度范围从模型的第一层网络结构开始重新计算参数。
可选地,计算单元802,具体用于使用第二精度范围从计算值溢出的当前网络结构重新计算参数。
可选地,模型包括多个网络结构,参数包括以下一项或多项:多个网络结构在前向传播过程中计算的中间特征或模型的损失函数的值,中间特征为多个网络结构中任意一个网络结构的输出特征;多个网络结构在反向传播过程中计算的梯度,梯度包括:中间特征的梯度和/或模型的权重梯度。
可选地,当参数包括反向传播过程中计算的梯度时,计算值为梯度除以缩放系数的值,缩放系数用于减少梯度溢出的概率;计算单元802,还用于使用第一系数更新缩放系数,更新后的缩放系数用于替换更新前的缩放系数进行模型的下一次迭代训练,第一系数为小于1的正数;缩放系数的最小取值为大于或等于1的预设阈值。
可选地,当参数包括前向传播过程中计算的中间特征时,计算单元802,具体用于使用第二精度范围计算溢出层的中间特征或从第一层网络结构开始逐层计算中间特征,溢出层为多个网络结构中中间特征的计算值溢出第一精度范围的网络结构。
可选地,当参数包括模型的损失函数的值时,计算单元802,具体用于使用第二精度范围计算损失函数的值或从第一层网络结构开始逐层计算直至得到损失函数的值。
可选地,计算单元802,具体用于在模型训练过程中的第N次迭代,获取基于第一精度范围对模型中多个网络结构的溢出次数,N为大于或等于1的正整数;计算单元802,具体用于若溢出次数大于或等于第二阈值,则确定下一次迭代训练过程中初始的精度范围从第一精度范围更换为第二精度范围,并将溢出次数清零。
可选地,溢出次数包括:多个网络结构在前向传播过程中的溢出次数,和/或多个网络结构在反向传播过程中的溢出次数。
本实施例中,训练设备中各单元所执行的操作与前述图1至图7所示实施例中描述的类似,此处不再赘述。
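In this embodiment, when the computed value of a parameter overflows the first precision range during model training, the calculation unit 802 recomputes the parameter using the second precision range. In other words, the precision range is adjusted automatically and in real time based on overflow information of the computed values, which not only reduces the memory occupied by model training but also improves training efficiency, and in turn reduces problems such as training stagnation caused by parameter computations overflowing the first precision range. In addition, compared with prior-art methods that decide between high-precision and low-precision floating-point numbers according to the type of each network layer, the embodiments of this application adjust the precision range applicable to a parameter in real time based on its overflow information and reduce overflow caused by low-precision floating-point computation.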
参阅图9,本申请提供的另一种训练设备的结构示意图。该训练设备可以包括处理器901、存储器902和通信端口903。该处理器901、存储器902和通信端口903通过线路互联。其中,存储器902中存储有程序指令和数据。
存储器902中存储了前述图1至图7所示对应的实施方式中,由训练设备执行的步骤对应的程序指令以及数据。
处理器901,用于执行前述图1至图7所示实施例中任一实施例所示的由训练设备执行的步骤。
通信端口903可以用于进行数据的接收和发送,用于执行前述图1至图7所示实施例中任一实施例中与获取、接收相关的步骤。
一种实现方式中,训练设备可以包括相对于图9更多或更少的部件,本申请对此仅仅是示例性说明,并不作限定。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,read-only memory)、随机存取存储器(RAM,random access memory)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (21)

  1. 一种模型训练方法,其特征在于,所述方法包括:
    获取训练数据;
    将所述训练数据作为模型的输入,在所述模型训练过程中使用第一精度范围进行参数的计算以得到计算值;
    若所述计算值溢出所述第一精度范围,则使用第二精度范围重新计算所述参数,并使用重新计算后的参数对所述模型进行一次或多次迭代训练,其中,所述第二精度范围包括所述第一精度范围,或者所述第二精度范围与所述第一精度范围部分重叠。
  2. 根据权利要求1所述的方法,其特征在于,所述模型包括多个网络结构,所述使用第二精度范围重新计算所述参数,包括:
    使用所述第二精度范围从所述模型的第一层网络结构开始重新计算所述参数。
  3. 根据权利要求1所述的方法,其特征在于,所述模型包括多个网络结构,所述使用第二精度范围重新计算所述参数,包括:
    使用所述第二精度范围从所述计算值溢出的当前网络结构重新计算所述参数。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,所述模型包括多个网络结构,所述参数包括以下一项或多项:
    所述多个网络结构在前向传播过程中计算的中间特征或所述模型的损失函数的值,所述中间特征为所述多个网络结构中任意一个网络结构的输出特征;
    所述多个网络结构在反向传播过程中计算的梯度,所述梯度包括:所述中间特征的梯度和/或所述模型的权重梯度。
  5. 根据权利要求4所述的方法,其特征在于,当所述参数包括所述反向传播过程中计算的梯度时,所述计算值为所述梯度除以缩放系数的值,所述缩放系数用于减少所述梯度溢出的概率;所述方法还包括:
    使用第一系数更新所述缩放系数,更新后的缩放系数用于替换更新前的缩放系数进行所述模型的下一次迭代训练,所述第一系数为小于1的正数;所述缩放系数的最小取值为大于或等于1的预设阈值。
  6. 根据权利要求4所述的方法,其特征在于,当所述参数包括所述前向传播过程中计算的中间特征时,所述使用第二精度范围重新计算所述参数,包括:
    使用所述第二精度范围计算溢出层的所述中间特征或从第一层网络结构开始逐层计算所述中间特征,所述溢出层为所述多个网络结构中所述中间特征的计算值溢出所述第一精度范围的网络结构。
  7. 根据权利要求4所述的方法,其特征在于,当所述参数包括所述模型的损失函数的值时,所述使用第二精度范围重新计算所述参数,包括:
    使用所述第二精度范围计算所述损失函数的值或从第一层网络结构开始逐层计算直至得到所述损失函数的值。
  8. 根据权利要求1至7中任一项所述的方法,其特征在于,所述使用重新计算后的参数对所述模型进行训练,包括:
    在所述模型训练过程中的第N次迭代,获取基于所述第一精度范围对所述模型中多个网络结构的溢出次数,N为大于或等于1的正整数;
    若所述溢出次数大于或等于第二阈值,则确定下一次迭代训练过程中初始的精度范围从所述第一精度范围更换为所述第二精度范围,并将所述溢出次数清零。
  9. 根据权利要求8所述的方法,其特征在于,所述溢出次数包括:所述多个网络结构在前向传播过程中的溢出次数,和/或所述多个网络结构在反向传播过程中的溢出次数。
  10. 一种训练设备,其特征在于,所述训练设备包括:
    获取单元,用于获取训练数据;
    计算单元,用于将所述训练数据作为模型的输入,在所述模型训练过程中使用第一精度范围进行参数的计算以得到计算值;
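    the calculation unit is further configured to: if the computed value overflows the first precision range, recompute the parameter using a second precision range, and perform one or more iterations of training on the model using the recomputed parameter, wherein the second precision range includes the first precision range, or the second precision range partially overlaps the first precision range.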
  11. 根据权利要求10所述的训练设备,其特征在于,所述模型包括多个网络结构,所述计算单元,具体用于使用所述第二精度范围从所述模型的第一层网络结构开始重新计算所述参数。
  12. 根据权利要求10所述的训练设备,其特征在于,所述计算单元,具体用于使用所述第二精度范围从所述计算值溢出的当前网络结构重新计算所述参数。
  13. 根据权利要求10至12中任一项所述的训练设备,其特征在于,所述模型包括多个网络结构,所述参数包括以下一项或多项:
    所述多个网络结构在前向传播过程中计算的中间特征或所述模型的损失函数的值,所述中间特征为所述多个网络结构中任意一个网络结构的输出特征;
    所述多个网络结构在反向传播过程中计算的梯度,所述梯度包括:所述中间特征的梯度和/或所述模型的权重梯度。
  14. 根据权利要求13所述的训练设备,其特征在于,当所述参数包括所述反向传播过程中计算的梯度时,所述计算值为所述梯度除以缩放系数的值,所述缩放系数用于减少所述梯度溢出的概率;所述计算单元,还用于使用第一系数更新所述缩放系数,更新后的缩放系数用于替换更新前的缩放系数进行所述模型的下一次迭代训练,所述第一系数为小于1的正数;所述缩放系数的最小取值为大于或等于1的预设阈值。
  15. 根据权利要求13所述的训练设备,其特征在于,当所述参数包括所述前向传播过程中计算的中间特征时,所述计算单元,具体用于使用所述第二精度范围计算溢出层的所述中间特征或从第一层网络结构开始逐层计算所述中间特征,所述溢出层为所述多个网络结构中所述中间特征的计算值溢出所述第一精度范围的网络结构。
  16. 根据权利要求13所述的训练设备,其特征在于,当所述参数包括所述模型的损失函数的值时,所述计算单元,具体用于使用所述第二精度范围计算所述损失函数的值或从第一层网络结构开始逐层计算直至得到所述损失函数的值。
  17. 根据权利要求10至16中任一项所述的训练设备,其特征在于,所述计算单元,具体用于在所述模型训练过程中的第N次迭代,获取基于所述第一精度范围对所述模型中多个网络结构的溢出次数,N为大于或等于1的正整数;
    所述计算单元,具体用于若所述溢出次数大于或等于第二阈值,则确定下一次迭代训练过程中初始的精度范围从所述第一精度范围更换为所述第二精度范围,并将所述溢出次数清零。
  18. 根据权利要求17所述的训练设备,其特征在于,所述溢出次数包括:所述多个网络结构在前向传播过程中的溢出次数,和/或所述多个网络结构在反向传播过程中的溢出次数。
  19. 一种训练设备,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述训练设备执行如权利要求1至9中任一项所述的方法。
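  20. A computer storage medium, comprising computer instructions which, when run on a training device, cause the training device to perform the method according to any one of claims 1 to 9.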
  21. 一种计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至9中任一项所述的方法。
PCT/CN2023/106905 2022-07-15 2023-07-12 一种模型训练方法及相关设备 Ceased WO2024012476A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020257004231A KR20250037493A (ko) 2022-07-15 2023-07-12 모델 훈련 방법 및 관련 디바이스
EP23838966.2A EP4550209A4 (en) 2022-07-15 2023-07-12 MODEL TRAINING METHOD AND ASSOCIATED DEVICE
JP2025501773A JP2025522114A (ja) 2022-07-15 2023-07-12 モデル訓練方法及び関連デバイス
US19/019,814 US20250156712A1 (en) 2022-07-15 2025-01-14 Model training method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210832123.6 2022-07-15
CN202210832123.6A CN117474045A (zh) 2022-07-15 2022-07-15 一种模型训练方法及相关设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/019,814 Continuation US20250156712A1 (en) 2022-07-15 2025-01-14 Model training method and related device

Publications (1)

Publication Number Publication Date
WO2024012476A1 true WO2024012476A1 (zh) 2024-01-18

Family

ID=89535600

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106905 Ceased WO2024012476A1 (zh) 2022-07-15 2023-07-12 一种模型训练方法及相关设备

Country Status (6)

Country Link
US (1) US20250156712A1 (zh)
EP (1) EP4550209A4 (zh)
JP (1) JP2025522114A (zh)
KR (1) KR20250037493A (zh)
CN (1) CN117474045A (zh)
WO (1) WO2024012476A1 (zh)



Also Published As

Publication number Publication date
EP4550209A1 (en) 2025-05-07
KR20250037493A (ko) 2025-03-17
JP2025522114A (ja) 2025-07-10
EP4550209A4 (en) 2025-10-22
US20250156712A1 (en) 2025-05-15
CN117474045A (zh) 2024-01-30

