WO2024012476A1 - A model training method and related equipment - Google Patents
A model training method and related equipment
- Publication number
- WO2024012476A1 (PCT/CN2023/106905)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- training
- accuracy range
- range
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
Definitions
- This application relates to the field of artificial intelligence, and in particular to a model training method and related equipment.
- Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- A neural network is an artificially established dynamic system with a directed graph as its topological structure. It processes information by producing state responses to continuous or intermittent inputs, and is an information processing system designed to imitate the structure and function of the human brain. After decades of development, artificial neural networks have been widely used in pattern recognition, automatic control, signal processing, assisted decision-making, artificial intelligence, scientific computing and many other fields, and have achieved widespread success. In fields such as image processing, audio and video processing, and natural language processing in particular, artificial neural networks are developing rapidly and play an irreplaceable role.
- The precision of the data format used in training is mainly set based on manual experience.
- The person configuring the training decides from experience whether each network layer uses 16-bit half-precision floating point (FP16) or 32-bit single-precision floating point (FP32).
- This application provides a model training method and related equipment, which adjusts the accuracy range used in the process of training the model in real time when the calculated value of the parameter exceeds the accuracy range. It can effectively solve the training stagnation problem caused by overflow faced by low-precision training.
- the first aspect of the embodiments of this application provides a model training method.
- This method can be applied to dynamic computing graph scenarios as well as static computing graph scenarios.
- the dynamic calculation graph scenario can be understood as updating the calculation graph after each layer of the network structure of the model is calculated.
- the static calculation graph scenario can be understood as updating the calculation graph after all layer network structures of the model are calculated.
- the main difference is the update timing of the calculation graph.
- the method provided by the embodiments of this application can be applied to the calculation of the calculation graph.
- the method may be executed by the training device, or may be executed by a component of the training device (such as a processor, a chip, or a chip system, etc.).
- The method includes: obtaining training data; using the training data as input to the model, and calculating parameters during model training using a first precision range to obtain a calculated value; and, if the calculated value exceeds the first precision range, recalculating the parameters using a second precision range and training the model for one or more iterations using the recalculated parameters. The second precision range includes the first precision range, or partially overlaps with it.
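- The flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the application: `to_precision` and `compute_with_fallback` are hypothetical names, FP16 and FP32 stand in for the first and second precision ranges, and the standard `struct` module is used only to emulate rounding a value into each format.

```python
import struct

# struct format codes for IEEE 754 half precision and single precision
# (assumed stand-ins for the first and second precision ranges).
FP16, FP32 = "e", "f"


def to_precision(value, fmt):
    """Round value into the given format; raise OverflowError if it leaves the range."""
    # struct.pack raises OverflowError when the magnitude is too large for fmt.
    rounded = struct.unpack(fmt, struct.pack(fmt, value))[0]
    if value != 0.0 and rounded == 0.0:
        # A nonzero value was flushed to zero: treat underflow as overflow too.
        raise OverflowError("underflow")
    return rounded


def compute_with_fallback(compute, first=FP16, second=FP32):
    """Calculate a parameter in the first range; recalculate in the second on overflow."""
    try:
        return to_precision(compute(), first), first
    except OverflowError:
        return to_precision(compute(), second), second


print(compute_with_fallback(lambda: 2.0 * 3.0))      # stays in the first range
print(compute_with_fallback(lambda: 300.0 * 300.0))  # recalculated in the second range
```

- The function returns both the value and the format that was finally used, so a caller could record which calculations needed the wider range.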
- When the calculated value of a parameter exceeds the first precision range during model training, the second precision range is used to recalculate the parameter. That is, the precision range is adjusted automatically and in real time based on the overflow information of the calculated parameter values, which both reduces the memory occupied by training the model and improves training efficiency. This reduces problems such as training stagnation or training failure caused by parameters overflowing the first precision range.
- In other words, the embodiments of this application use the overflow information of the parameters to adjust the applicable precision range of the parameters in real time, reducing overflow problems caused by low-precision floating-point calculations.
- The above model includes multiple network structures, and the step of recalculating the parameters using the second precision range includes: recalculating the parameters using the second precision range, starting from the first-layer network structure of the model.
- When the calculated value overflows, a new precision range can be selected and the calculation restarted from the first-layer network structure of the model, reducing calculation errors caused by the overflow.
- The above model includes multiple network structures, and the step of recalculating the parameters using the second precision range includes: recalculating the parameters using the second precision range, starting from the current network structure at which the calculated value overflowed.
- When the calculated value overflows, a new precision range can be selected and the current-layer network structure recalculated, reducing calculation errors caused by the overflow.
- The above parameters are related to the loss function of the model, to the calculations of the model during forward propagation, or to the calculations of the model during backpropagation.
- The above model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures during forward propagation, where an intermediate feature is the output feature of any one of the multiple network structures; the value of the loss function of the model; and gradients calculated by the multiple network structures during backpropagation, where the gradients include the gradients of the intermediate features and/or the gradients of the model weights.
- The parameter can be one that needs to be calculated during forward propagation or backpropagation of the model, one output by an individual layer of the model, or one obtained after all layers of the model have been calculated.
- This broadens the scenarios in which the method applies during training. In other words, the precision of any calculation involved in model training can be adjusted using the method provided by the embodiments of this application.
- When the parameter includes a gradient, the calculated value is the gradient divided by a scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow. The method also includes: updating the scaling coefficient using a first coefficient, where the updated scaling coefficient replaces the previous one in the next training iteration of the model.
- the first coefficient is a positive number less than 1.
- the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1, and the preset threshold is used to reduce the probability of gradient overflow.
- the lower limit of the scaling coefficient is set to reduce the risk of subsequent parameter accuracy underflow.
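- The update of the scaling coefficient can be sketched as follows. The text states only that the first coefficient is a positive number less than 1 and that the scaling coefficient has a lower limit of at least 1; that the update is a multiplication, that it is applied after an overflow, and the names `update_scale`, `first_coefficient` and `min_scale` are assumptions made for illustration.

```python
def update_scale(scale, first_coefficient=0.5, min_scale=1.0):
    """Shrink the scaling coefficient, never letting it fall below the preset threshold.

    Assumption: the update multiplies the scaling coefficient by the first coefficient.
    """
    if not 0.0 < first_coefficient < 1.0:
        raise ValueError("first_coefficient must be a positive number less than 1")
    if min_scale < 1.0:
        raise ValueError("min_scale must be greater than or equal to 1")
    return max(scale * first_coefficient, min_scale)


scale = 1024.0
for _ in range(12):  # hypothetical run in which every iteration overflows
    scale = update_scale(scale)
print(scale)  # clamped at the lower limit
```

- Clamping with `max` is what enforces the lower limit described above; without it, repeated shrinking would drive the coefficient toward zero.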
- When the parameters include intermediate features calculated during forward propagation, the step of recalculating the parameters using the second precision range includes: using the second precision range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure.
- The overflow layer is the network structure, among the multiple network structures, whose calculated intermediate-feature values overflow the first precision range.
- The overflow layer alone can be recalculated using the second precision range, or the calculation can be restarted from the first-layer network structure. That is, the precision range can be changed for some layers or for all layers.
- the solution is flexible.
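- The two recalculation options can be contrasted in a small Python sketch. It is an illustration under assumptions rather than the method of the application: each layer is modelled as a scalar function, FP16 and FP32 (the `struct` codes `"e"` and `"f"`) stand in for the first and second precision ranges, and `to_precision` and `forward` are hypothetical names.

```python
import struct


def to_precision(value, fmt):
    """Round value into fmt; raise OverflowError on upward overflow or underflow."""
    rounded = struct.unpack(fmt, struct.pack(fmt, value))[0]
    if value != 0.0 and rounded == 0.0:
        raise OverflowError("underflow")
    return rounded


def forward(layers, x, first="e", second="f", restart=False):
    """Run the layers in order and return (output, format used by each layer).

    restart=False: only the overflow layer is recalculated in the second range.
    restart=True:  every layer is recalculated in the second range from layer 0.
    """
    formats = [first] * len(layers)
    h, i = to_precision(x, first), 0
    while i < len(layers):
        try:
            out = to_precision(layers[i](h), formats[i])
        except OverflowError:
            if formats[i] == second:
                raise  # the second range overflowed as well; give up
            if restart:
                formats = [second] * len(layers)
                h, i = to_precision(x, second), 0
            else:
                formats[i] = second  # retry this layer only
            continue
        h, i = out, i + 1
    return h, formats


layers = [lambda v: v * 100.0, lambda v: v * 100.0, lambda v: v / 1000.0]
print(forward(layers, 10.0))                # only the second layer is widened
print(forward(layers, 10.0, restart=True))  # all layers are widened
```

- Recalculating only the overflow layer keeps the other layers in the narrower format, while restarting from the first layer is simpler but widens every layer.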
- When the parameters include the value of the loss function, the step of recalculating the parameters using the second precision range includes: using the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
- The value of the loss function alone can be recalculated using the second precision range, or the calculation can proceed layer by layer from the first-layer network structure until the value of the loss function is obtained. That is, the precision range can be changed for some layers or for all layers.
- the solution is flexible.
- The above step of training the model using the recalculated parameters includes: in the Nth iteration of the model training process, obtaining the number of overflows of the multiple network structures in the model under the first precision range, where N is a positive integer greater than or equal to 1; and, if the number of overflows is greater than or equal to a second threshold, changing the initial precision range for the next training iteration from the first precision range to the second precision range and clearing the overflow count to zero.
- The number of overflows is recorded, and when overflows under the first precision range affect model training, the initial precision range is adjusted from the first precision range to the second precision range.
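- The counting rule above can be sketched as a small Python class. The class name `PrecisionScheduler`, the default threshold of 3, and the choice to accumulate the count across iterations until it is cleared are assumptions made for illustration; the text specifies only the comparison against the second threshold, the change of initial precision range, and the reset to zero.

```python
class PrecisionScheduler:
    """Track overflows and decide the initial precision range of the next iteration."""

    def __init__(self, first="FP16", second="FP32", threshold=3):
        self.initial = first        # precision range the next iteration starts with
        self.second = second
        self.threshold = threshold  # the "second threshold" of the text
        self.overflow_count = 0

    def end_iteration(self, overflows_this_iteration):
        """Record this iteration's overflows and return the next initial range."""
        self.overflow_count += overflows_this_iteration
        if self.overflow_count >= self.threshold:
            self.initial = self.second
            self.overflow_count = 0  # clear the overflow count to zero
        return self.initial


scheduler = PrecisionScheduler(threshold=3)
history = [scheduler.end_iteration(1) for _ in range(3)]
print(history)
```

- With one overflow per iteration and a threshold of 3, the first two iterations keep the first precision range and the third switches the initial range to the second.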
- The above number of overflows includes the number of overflows of the multiple network structures during forward propagation and/or the number of overflows of the multiple network structures during backpropagation.
- The judgment condition for adjusting the precision range (i.e., the number of overflows) can be the number of overflows over the entire training process, or the number of overflows in forward propagation or in backpropagation, which improves the applicability of the method.
- the above parameters are related to the loss function.
- the above loss function can vary according to the training method of the model.
- the loss function is used to represent the difference between the output of the model and the label to which the training data belongs.
- the loss function can be a custom function.
- the loss function is used to represent the difference between the output of the model and the input (or clustering result, etc.).
- it can be understood that in unsupervised training, it is hoped that the output of the model can be restored to the input of the model.
- the label is the training data itself (that is, the output obtained by the model is then sent to another network to restore the training data).
- the loss function is not limited in the embodiments of the present application.
- the loss function can also be understood as the optimization objective function of the model, and can be set according to actual needs.
- a second aspect of the embodiment of the present application provides a training device.
- the training device can be applied to dynamic computing graph scenarios or static computing graph scenarios.
- The training device includes: an acquisition unit, used to obtain training data; and a calculation unit, used to take the training data as input to the model and calculate parameters during model training using a first precision range to obtain a calculated value. The calculation unit is also used to recalculate the parameters using a second precision range if the calculated value exceeds the first precision range, and to train the model for one or more iterations using the recalculated parameters, where the second precision range includes the first precision range, or partially overlaps with it.
- The above model includes multiple network structures, and the calculation unit is specifically configured to recalculate the parameters using the second precision range, starting from the first-layer network structure of the model.
- The above calculation unit is specifically configured to recalculate the parameters using the second precision range, starting from the current network structure at which the calculated value overflowed.
- The above model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures during forward propagation, where an intermediate feature is the output feature of any one of the multiple network structures; the value of the loss function of the model; and gradients calculated by the multiple network structures during backpropagation, where the gradients include the gradients of the intermediate features and/or the gradients of the model weights.
- When the parameter includes a gradient, the calculated value is the gradient divided by a scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow. The calculation unit is also used to update the scaling coefficient using a first coefficient, and the updated scaling coefficient replaces the previous one in the next training iteration of the model.
- The first coefficient is a positive number less than 1, and the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.
- When the above parameters include intermediate features calculated during forward propagation, the calculation unit is specifically configured to use the second precision range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure.
- The overflow layer is the network structure, among the multiple network structures, whose calculated intermediate-feature values overflow the first precision range.
- When the above parameters include the value of the loss function, the calculation unit is specifically configured to use the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
- The above calculation unit is specifically configured to obtain, in the Nth iteration of the model training process, the number of overflows of the multiple network structures in the model under the first precision range, where N is a positive integer greater than or equal to 1. If the number of overflows is greater than or equal to a second threshold, the calculation unit determines that the initial precision range for the next training iteration changes from the first precision range to the second precision range, and clears the number of overflows to zero.
- The above number of overflows includes the number of overflows of the multiple network structures during forward propagation and/or the number of overflows of the multiple network structures during backpropagation.
- A third aspect of this application provides a training device, including a processor coupled to a memory, where the memory is used to store programs or instructions.
- When the processor executes the programs or instructions, the training device implements the method in the above first aspect or in any possible implementation of the first aspect.
- A fourth aspect of this application provides a computer-readable medium on which a computer program or instructions are stored.
- When the computer program or instructions are run on a computer, the computer is caused to execute the method in the foregoing first aspect or in any possible implementation of the first aspect.
- a fifth aspect of the present application provides a computer program product, which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
- this application has the following advantages: when the calculated value of the parameter exceeds the first precision range during the model training process, the second precision range is used to recalculate the parameter. That is, automatic real-time adjustment of the accuracy range through the overflow information of parameter calculation values can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation or training failure caused by parameters overflowing the first accuracy range.
- The embodiments of this application use the overflow information of the parameters to adjust the applicable precision range of the parameters in real time, reducing overflow problems caused by low-precision floating-point calculations.
- Figure 1 is a schematic structural diagram of the system architecture provided by the embodiment of the present application.
- Figure 2 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
- Figure 3 is a schematic flow chart of the model training method provided by the embodiment of the present application.
- Figure 4 is a structural example diagram of the model provided by the embodiment of the present application.
- Figure 5 is another schematic flow chart of the model training method provided by the embodiment of the present application.
- Figure 6 is another schematic flow chart of the model training method provided by the embodiment of the present application.
- Figure 7 is another schematic flow chart of the model training method provided by the embodiment of the present application.
- Figure 8 is a schematic structural diagram of the training equipment provided by the embodiment of the present application.
- Figure 9 is another schematic structural diagram of the training equipment provided by the embodiment of the present application.
- This application provides a model training method and related equipment, which adjusts the accuracy range used in the process of training the model in real time when the calculated value exceeds the accuracy range. It can effectively solve the problem of training stagnation or training failure caused by overflow in low-precision training. In addition, it has low requirements for the network mixed precision initialization scheme, does not rely on manual experience to customize the initialization scheme, and can automatically adjust the training accuracy layer by layer in real time.
- the neural network can be composed of neural units.
- The neural unit can refer to an arithmetic unit that takes $X_s$ and an intercept $b$ as input.
- The output of the arithmetic unit can be: $f\left(\sum_{s=1}^{n} W_s X_s + b\right)$
- Here $s = 1, 2, \ldots, n$, and $n$ is a natural number greater than 1.
- $W_s$ is the weight of $X_s$.
- $b$ is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next layer.
- The activation function can be a ReLU function.
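- The output of a single neural unit as described above (a weighted sum of the inputs plus the bias, passed through the activation function) can be illustrated with a short Python sketch. The function names `relu` and `neuron_output` and the sample values are chosen here for illustration only.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)


def neuron_output(xs, ws, b, f=relu):
    """Compute f(sum_s W_s * X_s + b) for one neural unit."""
    if len(xs) != len(ws):
        raise ValueError("each input needs exactly one weight")
    return f(sum(w * x for w, x in zip(ws, xs)) + b)


print(neuron_output([1.0, 2.0], [0.5, 1.0], 0.25))   # weighted sum is positive
print(neuron_output([1.0, 2.0], [0.5, -1.0], 0.25))  # negative sum is clamped by ReLU
```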
- a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
- the local receptive field can be an area composed of several neural units.
- W is a weight vector, and each value in this vector represents the weight value of a neuron in this layer of neural network.
- This vector W determines the spatial transformation from the above input space to the output space, that is, the weight W of each layer controls how to transform the space.
- the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vector W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
- A neural network can also be called an artificial neural network (ANN): it is an artificially established dynamic system with a directed graph as its topology. It processes information by producing state responses to continuous or intermittent inputs, and is an information processing system designed to imitate the structure and function of the human brain.
- Neural network forward propagation refers to the calculation process from the input layer through the hidden layers to the output layer. Starting from the input layer and following the topology of the network, the output (activation value) of the previous layer is used as the input of the next layer, and the output of each layer is calculated layer by layer until the final output layer. This process is called forward propagation of the network.
- Neural network backpropagation is the abbreviation of "error backpropagation". It is a common method used in combination with optimization methods (such as gradient descent) to train artificial neural networks. This method calculates the gradient of the loss function for all weights in the network. This gradient is fed back to the optimization method, which is used to update the weights to minimize the loss function.
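- The interplay of forward propagation, gradient calculation and weight update can be shown on the smallest possible case. The sketch below assumes a model with a single weight `w`, a squared-error loss and plain gradient descent; the name `train_step` and the learning rate are illustrative choices, not taken from the application.

```python
def train_step(w, x, y, lr=0.1):
    """One training step for the one-weight model pred = w * x."""
    pred = w * x                   # forward propagation
    loss = (pred - y) ** 2         # squared-error loss
    grad = 2.0 * (pred - y) * x    # backpropagation: d(loss)/d(w)
    return w - lr * grad, loss     # gradient descent update


w = 0.0
losses = []
for _ in range(100):
    w, loss = train_step(w, x=1.0, y=2.0)
    losses.append(loss)
print(w, losses[0], losses[-1])
```

- Each step moves the weight against the gradient, so the loss shrinks from one iteration to the next and `w` approaches the value that reproduces the target.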
- the precision range refers to the precision range of the data type used by the computer. It can refer to the specific accuracy or the dynamic range of the accuracy.
- the following takes the commonly used data type in neural networks as floating point (FP) as an example for an exemplary description. In actual applications, the data type can also be an integer (int), such as int8, int16, etc.
- FP is mainly used to represent decimals and usually consists of three parts, namely the sign bit, the exponent bit and the mantissa bit.
- the sign bit can be 1 bit indicating positive or negative, and the exponent bit and mantissa bit can be multiple bits.
- the mantissa bit represents the precision
- the exponent bit is used to represent the dynamic range within which the precision can be achieved (referred to as the precision range in the embodiment of this application).
- Floating-point numbers can usually include three formats, namely half-precision floating-point numbers, single-precision floating-point numbers and double-precision floating-point numbers, as follows.
- Half-precision floating point is a binary data type used by computers. It occupies 16 bits (2 bytes) in computer memory and is also referred to as FP16.
- The absolute values representable by half-precision floating-point numbers range approximately over $[6 \times 10^{-8}, 65504]$.
- The precision of FP16 is $2^{-10}$.
- Single-precision floating point is a binary data type used by computers. It occupies 32 bits (4 bytes) in computer memory and is also referred to as FP32.
- The absolute values representable by single-precision floating-point numbers range approximately over $[1.4 \times 10^{-45}, 3.4 \times 10^{38}]$.
- The precision of FP32 is $2^{-23}$.
- Double-precision floating point is a binary data type used by computers. It occupies 64 bits (8 bytes) in computer memory and is also referred to as FP64. Double-precision floating-point numbers can represent 15 or 16 significant decimal digits, and the absolute values they can represent range approximately over $[2.23 \times 10^{-308}, 1.80 \times 10^{308}]$. The precision of FP64 is $2^{-52}$.
- FP16: the sign bit occupies 1 bit, the exponent bits occupy 5 bits, and the mantissa bits occupy 10 bits.
- FP32: the sign bit occupies 1 bit, the exponent bits occupy 8 bits, and the mantissa bits occupy 23 bits.
- FP64: the sign bit occupies 1 bit, the exponent bits occupy 11 bits, and the mantissa bits occupy 52 bits.
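- The ranges and precisions quoted for the three formats follow from the exponent and mantissa widths. The Python sketch below derives them using the standard IEEE 754 binary-format rules, which are assumed here because the text does not name the standard; `format_limits` is a hypothetical helper name.

```python
import sys


def format_limits(exponent_bits, mantissa_bits):
    """Derive the limits of an IEEE 754 binary format from its field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    eps = 2.0 ** -mantissa_bits                # spacing of values just above 1.0
    return {
        "precision": eps,
        "max": (2.0 - eps) * 2.0 ** bias,      # largest finite value
        "min_normal": 2.0 ** (1 - bias),       # smallest positive normal value
        "min_subnormal": 2.0 ** (1 - bias - mantissa_bits),
    }


FP16 = format_limits(5, 10)
FP32 = format_limits(8, 23)
FP64 = format_limits(11, 52)

# Sanity check against the interpreter's own double-precision limits.
assert FP64["max"] == sys.float_info.max
assert FP64["min_normal"] == sys.float_info.min

for name, limits in (("FP16", FP16), ("FP32", FP32), ("FP64", FP64)):
    print(name, limits)
```

- Each added exponent bit roughly squares the representable range, while each added mantissa bit halves the spacing between neighbouring values; this is why the exponent width governs overflow and the mantissa width governs precision.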
- Overflow in the embodiments of this application covers both upward overflow and underflow.
- Upward overflow means that the absolute value of the calculated value is too large and exceeds the maximum value representable in a given precision range.
- Underflow means that the absolute value of the calculated value is too small, smaller than the positive value closest to 0 that can be represented in a given precision range.
- Overflow may include storage overflow and calculation overflow.
- The embodiments of this application mainly apply to the scenario of calculation overflow.
- Taking FP16 as an example, upward overflow occurs if the calculated value is positive and greater than the largest positive number FP16 can represent, or if it is negative and smaller than the smallest (most negative) number FP16 can represent.
- Underflow occurs if the calculated value is positive and smaller than the smallest positive number FP16 can represent, or if it is negative and greater than the largest negative number (the one closest to zero) FP16 can represent.
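- The FP16 conditions above reduce to two comparisons on the absolute value. The following Python sketch shows this; the function name `classify` and the constant names are hypothetical, and rounding near the boundaries is ignored for simplicity.

```python
FP16_MAX = 65504.0             # largest finite FP16 value
FP16_MIN_POSITIVE = 2.0 ** -24  # smallest positive FP16 value, about 6e-8


def classify(value, max_value=FP16_MAX, min_positive=FP16_MIN_POSITIVE):
    """Report whether a calculated value fits the given precision range."""
    magnitude = abs(value)
    if magnitude > max_value:
        return "overflow"    # too large, for either sign
    if 0.0 < magnitude < min_positive:
        return "underflow"   # nonzero but too close to zero, for either sign
    return "in range"


for v in (70000.0, -70000.0, 1e-9, -1e-9, 0.0, 1.0):
    print(v, classify(v))
```

- An exact zero is treated as in range, since it is representable; only nonzero values that are too small count as underflow.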
- the key point of mixed precision is: what strategy is used to set which parts of the network are trained with high precision and which parts are trained with low precision, so as to ensure accuracy and improve training efficiency.
- the key point of mixed precision is: how to combine single precision and high precision for training.
- mixed precision training methods mainly include the following two methods.
- the first one specifies the calculation accuracy through the type of each layer in the neural network. Some types of layers are calculated with high precision, and some types of layers are calculated with low precision.
- The second dynamically selects the precision according to whether the quantization error exceeds a threshold.
- Quantization error can be measured at different points in the network, or measured over time as training proceeds. For example, quantization error can be calculated by comparing training results with baseline values.
- the baseline value can be determined by a variety of methods, such as training the same network using full-precision floating point values, repeating a subset of the calculations with high precision, and analyzing or sampling statistics specific to the calculations involved.
- the adjustment of training accuracy depends on the accuracy baseline value.
- This baseline value needs to be repeatedly calculated with high precision or obtained by training the same network with full precision floating point values. This makes the solution still not automated enough, and may significantly increase the computational complexity of network training due to the construction of baseline values. It runs counter to the purpose of using low-precision training in order to save calculations and speed up training.
- embodiments of the present application provide a model training method and related equipment.
- the network mixed precision initialization scheme has low requirements and does not rely on manual experience to customize the initialization scheme (for example, there is no need for manual experience to adjust the accuracy at the layer level for each network).
- the training accuracy can be automatically adjusted layer by layer in real time during the training process based on whether it overflows.
- an embodiment of the present invention provides a system architecture 100.
- the data collection device 160 is used to collect training data.
- the training data may include one or more of the following: images, speech, text, etc.
- the training data is stored in the database 130, and the training device 120 trains to obtain the target model/rule 101 based on the training data maintained in the database 130.
- the target model/rules 101 can be used to implement computer vision tasks (eg, classification, segmentation, detection, image generation, etc.).
- the target model/rule 101 in the embodiment of this application may specifically be a neural network or the like.
- the training data maintained in the database 130 may not all be collected by the data collection device 160, and may also be received from other devices.
- the training device 120 may not necessarily train the target model/rules 101 based entirely on the training data maintained by the database 130. It may also obtain training data from the cloud or other places for model training. The above description should not be construed as limiting the embodiments of this application.
- the target model/rules 101 trained according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in Figure 1 .
- the execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR) device, a virtual reality (VR) device, a vehicle-mounted terminal, etc.
- the execution device 110 can also be a server or a cloud, etc.
- the execution device 110 is configured with an I/O interface 112 for data interaction with external devices. The user can input data to the I/O interface 112 through the client device 140. The input data corresponds to the training data.
- Input data in embodiments may also include one or more of the following: images, voice, text, etc.
- the input data can be input by the user, or uploaded by the user through the shooting device, and of course it can also come from a database, etc., and the details are not limited here.
- the preprocessing module 113 is used to perform preprocessing (e.g., segmentation, selection, transformation, etc.) according to the input data received by the I/O interface 112.
- the input data is divided into multiple data blocks (patches).
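As a rough illustration of dividing input data into patches, the sketch below splits a toy two-dimensional image into non-overlapping blocks. The function name `split_into_patches`, the list-of-lists image representation, and the requirement that the image size be a multiple of the patch size are all assumptions made for this example; the text does not specify how patches are formed.

```python
def split_into_patches(image, patch_h, patch_w):
    """Split a 2-D list (rows of pixels) into non-overlapping patches.

    Assumes the image height and width are exact multiples of the patch size.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Take patch_h rows, and patch_w columns from each of those rows.
            patch = [row[left:left + patch_w] for row in image[top:top + patch_h]]
            patches.append(patch)
    return patches


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
patches = split_into_patches(image, 2, 2)                  # four 2x2 patches
```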
- when the execution device 110 preprocesses input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processes, the execution device 110 can call data, code, etc. in the data storage system 150 for corresponding processing, and the data, instructions, etc. obtained by that processing can also be stored in the data storage system 150.
- the I/O interface 112 returns the processing results (e.g., classification results, segmentation results, detection results, etc.) to the client device 140, thereby providing them to the user.
- the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
- the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 .
- the client device 140 can automatically send input data to the I/O interface 112. If automatic sending of input data by the client device 140 requires the user's authorization, the user can set corresponding permissions in the client device 140.
- the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, etc.
- the client device 140 can also be used as a data collection end to collect the input data fed into the I/O interface 112 and the output results produced by the I/O interface 112 as new sample data, and store them in the database 130.
- alternatively, without going through the client device 140, the I/O interface 112 directly stores the input data fed into the I/O interface 112 and the output results produced by the I/O interface 112, as shown in the figure, in the database 130 as new sample data.
- Figure 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention.
- the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110.
- Figure 2 is a chip hardware structure provided by an embodiment of the present invention.
- the chip includes a neural network processor 20.
- the chip can be disposed in the execution device 110 as shown in Figure 1 to complete the calculation work of the calculation module 111.
- the chip can also be provided in the training device 120 as shown in Figure 1 to complete the training work of the training device 120 and output the target model/rules 101.
- the neural network processor 20 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processing unit (GPU), etc., which are suitable for large-scale applications.
- the neural network processor 20 is mounted on the main central processing unit (host CPU) as a co-processor, and the main CPU allocates tasks.
- the core part of the NPU is the arithmetic circuit 203.
- the controller 204 controls the arithmetic circuit 203 to extract data in the memory (weight memory or input memory) and perform operations.
- the arithmetic circuit 203 internally includes multiple processing engines (PEs).
- arithmetic circuit 203 is a two-dimensional systolic array.
- the arithmetic circuit 203 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
- arithmetic circuit 203 is a general-purpose matrix processor.
- the arithmetic circuit 203 fetches the data of matrix B from the weight memory 202 and caches it on each PE in the arithmetic circuit.
- the arithmetic circuit takes the matrix A data from the input memory 201, performs a matrix operation with matrix B, and stores the partial or final result of the resulting matrix in the accumulator 208.
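The role of the accumulator in this matrix operation can be pictured with a plain software loop. The sketch below is only an analogy: it does not model the systolic dataflow or the PE array, and the function name `matmul_with_accumulator` is hypothetical. It simply shows partial products of A and B being summed into an accumulator until the final result is available.

```python
def matmul_with_accumulator(a, b):
    """Compute C = A x B, accumulating one partial product at a time.

    The `accumulator` list plays the role described for accumulator 208:
    it holds partial results until the sum over the shared dimension is complete.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B must match")
    accumulator = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):              # one slice of the shared dimension per step
        for i in range(rows):
            for j in range(cols):
                accumulator[i][j] += a[i][k] * b[k][j]  # add a partial result
    return accumulator


A = [[1.0, 2.0], [3.0, 4.0]]   # stands in for data from the input memory
B = [[5.0, 6.0], [7.0, 8.0]]   # stands in for data from the weight memory
C = matmul_with_accumulator(A, B)
```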
- the vector calculation unit 207 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
- the vector calculation unit 207 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.
- the vector calculation unit 207 can store the processed output vectors to the unified memory 206.
- the vector calculation unit 207 may apply a nonlinear function to the output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value.
- vector calculation unit 207 generates normalized values, merged values, or both.
- the processed output vector can be used as an activation input to the arithmetic circuit 203, such as for use in a subsequent layer in a neural network.
- the unified memory 206 is used to store input data and output data.
- the direct memory access controller (DMAC) 205 transfers input data from the external memory directly to the input memory 201 and/or the unified memory 206, stores weight data from the external memory into the weight memory 202, and stores data from the unified memory 206 into the external memory.
- a bus interface unit (BIU) 210 is used to implement interaction between the main CPU, the DMAC and the instruction fetch buffer 209 through the bus.
- an instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
- the controller 204 is used to call instructions cached in the instruction fetch buffer 209 to control the working process of the computing accelerator.
- the unified memory 206, the input memory 201, the weight memory 202 and the instruction fetch buffer 209 are all on-chip memories, and the external memory is a memory external to the NPU.
- the external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM), or other readable and writable memory.
- the application scenarios of the model training method provided by the embodiments of this application are described.
- This method can be applied to dynamic computing graph scenarios as well as static computing graph scenarios.
- the dynamic calculation graph scenario can be understood as updating the calculation graph after each layer of the network structure of the model is calculated.
- the static calculation graph scenario can be understood as updating the calculation graph after all layer network structures of the model are calculated.
- the main difference is the update timing of the calculation graph.
- the model training method (also known as the calculation accuracy adjustment method) provided by the embodiment of the present application can be applied when computing the calculation graph.
- the model training method of the embodiment of the present application will be introduced in detail below with reference to Figure 3.
- the method may be executed by the training device, or may be executed by a component of the training device (for example, a processor, a chip, or a chip system, etc.).
- the training device can be a cloud device or a terminal device.
- it can also be a computer, server or other device with sufficient computing power to execute the model training method, or it can be a system composed of a cloud device and a terminal device.
- the training method can be executed by the training device 120 in Figure 1 and the neural network processor 20 in Figure 2 .
- the model training method can be processed by the CPU, or it can be processed by the CPU and GPU together, or it can not use the GPU but use other processors suitable for neural network calculations, which is not limited by this application.
- FIG. 3 is a schematic flow chart of a model training method provided by an embodiment of the present application.
- the method may include steps 301 to 303 . Steps 301 to 303 will be described in detail below.
- Step 301 Obtain training data.
- the training device may obtain training data by receiving data sent by other devices (for example, servers, business equipment, etc.), by selecting it from a database, or through user photography and other methods; there are no specific limitations here.
- the training data in the embodiment of this application may include one or more of the following: images, speech, text, etc.
- the training data is specifically related to the scenario in which the model is applied. For example: when the function of the model is audio recognition, the specific form of training data can be audio data, etc. Another example: when the function of the model is image classification, the specific form of training data can be image data, etc. Another example: when the role of the model is to predict speech, the specific form of training data can be text data, etc. It can be understood that the above situations are only examples and do not necessarily have a one-to-one correspondence.
- the specific form of training data can also be image data or text data (for example, in an education scenario of viewing pictures and playing the corresponding voice, where the role of the model is to recognize the voice corresponding to an image, the specific form of the training data can be image data).
- the training data can be word vectors corresponding to movies, etc.
- the above training data can also include data in different modalities at the same time.
- the training data can include image/video data collected by the camera, and can also include voice/text data of instructions given by the user.
- the specific form or type of training data is not limited in the embodiments of this application.
- if the training of the model is supervised training, the training data obtained in this step is training data carrying labels. If the training of the model is unsupervised training, the training data obtained in this step is training data without labels.
- Step 302 Use the training data as the input of the model, and use the first accuracy range to calculate parameters during the model training process to obtain calculated values;
- after the training device obtains the training data, it uses the training data as input to the model, and uses the first accuracy range to calculate parameters to obtain calculated values.
- This parameter is related to the loss function of the model.
- the loss function is used to represent the difference between the output of the model and the label to which the training data belongs.
- the loss function can be a custom function. For example, when the task of the model is a classification task, the loss function is used to represent the difference between the output of the model and the input (or clustering result, etc.). Or it can be understood that in unsupervised training, it is hoped that the output of the model can be restored to the input of the model.
- the label is the training data itself (that is, the output obtained by the model is then sent to another network to restore the training data).
- the loss function is not limited in the embodiments of the present application.
- the loss function can also be understood as the optimization objective function of the model, and can be set according to actual needs.
- the parameters in the embodiments of this application may include one or more of the following: parameters related to the loss function of the model, parameters related to the calculation of the model in the forward propagation process, and parameters related to the calculation of the model in the back propagation process.
- the model includes multiple network structures.
- the model in the embodiment of this application is specifically an artificial neural network.
- the specific number of layers or structures included in the artificial neural network can be set according to actual needs and is not limited here.
- the precision range in the embodiment of the present application may refer to the precision range of the data type (for example, int, float, etc.).
- the accuracy range of FP is used as an example for illustrative description. Of course, in actual applications, it can also be other data types (for example, int, etc.).
- the first precision range is a dynamic reachable range of FP16 precision.
- FP16 and accuracy range can refer to the explanations in the aforementioned related terms, and will not be repeated here.
- the accuracy ranges used by each layer of the network structure in the model may be the same or different.
- the first accuracy range may be refined to the accuracy range of each layer in the model, or the same accuracy range may be used for all layers of the model. That is, the subsequent recalculation of parameters using the second precision range in place of the first precision range can be a precision range adjustment at layer granularity (only the precision range of a specific layer, which may also be called an overflow layer, is adjusted, and the precision range of the other layers is not), or an adjustment of the accuracy range of the entire model (that is, all layers).
- all layer network structures of the model adopt the same accuracy range, and the first accuracy range refers to the same accuracy range.
- the embodiments of the present application can be applied to a single accuracy training scenario.
- the accuracy ranges used by the layers of the network structure are different; that is, the model uses mixed-precision training.
- the first precision range refers to low precision or high precision in the mixed precision.
- the default high precision is less likely to cause overflow problems, and the first precision range can specifically refer to the low-precision range of mixed precision.
- the embodiments of the present application can be applied to mixed precision training scenarios. That is, only the accuracy of specific layers is adjusted subsequently.
- the parameters are parameters in the forward propagation process.
- the parameters can be the values of intermediate features or loss functions calculated by multiple network structures during the forward propagation process.
- the intermediate feature can also be understood as the activation value obtained by multiple network structures using activation functions.
- the value of this loss function can also be understood as the difference between the model output and the label calculated after the forward propagation of all layer network structures.
- the model includes a three-layer network structure, namely the input layer (including 3 neurons), the hidden layer (including 4 neurons), and the output layer (including 2 neurons).
- $w^{l}_{ij}$ is the weight from the j-th neuron in layer l-1 to the i-th neuron in layer l (for example, $w^{2}_{43}$ is the weight from the 3rd neuron in layer 1 to the 4th neuron in layer 2)
- $b^{l}_{j}$ is the bias of the j-th neuron in the l-th layer, and $a^{l}_{j}$ is the activation value of the j-th neuron in layer l.
- the process of forward propagation can be shown as Formula 1 and Formula 2, where Formula 2 is the matrix form of Formula 1.
- $\sigma$ is the activation function.
- the activation value can be calculated layer by layer using the above formula 2, and finally the output f(x) of the model can be obtained based on the input X. And calculate the value of the loss function based on the difference between the model's output f(x) and the label value Y of X.
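Formula 1 and Formula 2 are not reproduced in this text. Assuming a standard fully connected network, with $w^{l}_{ij}$ the weight from the j-th neuron in layer l-1 to the i-th neuron in layer l, $b^{l}_{i}$ the bias of the i-th neuron in layer l, $a^{l}_{i}$ its activation value, and $\sigma$ the activation function, the two formulas would plausibly take the following form. This is a reconstruction under that assumption, not the original notation.

```latex
\begin{align}
a^{l}_{i} &= \sigma\Big(\sum_{j} w^{l}_{ij}\, a^{l-1}_{j} + b^{l}_{i}\Big) \tag{1} \\
a^{l} &= \sigma\big(W^{l} a^{l-1} + b^{l}\big) \tag{2}
\end{align}
```

Here $W^{l}$ is the matrix whose entry in row i and column j is $w^{l}_{ij}$, $a^{0}$ is the input X, and the activation of the last layer is the model output f(x).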
- the loss function can be mean squared error (MSE) loss, squared absolute error loss, cross entropy loss, hinge loss, etc.
- the specific settings can be set according to actual needs, and there is no limit to the structure of the loss function.
- the parameters may include intermediate features and/or the value of the loss function.
- the activation $a^{l}$ obtained at each layer can be understood as the intermediate feature, and the difference between f(x) and Y is the value of the loss function.
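A minimal sketch of this forward pass for the three-layer example (3 input neurons, 4 hidden neurons, 2 output neurons) is shown below. The weight values, the sigmoid activation, and the MSE loss are illustrative assumptions; the helper names `dense` and `mse` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, bias, inputs):
    """One layer: a_i = sigmoid(sum_j w_ij * a_j + b_i)."""
    return [sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]


def mse(output, label):
    """Mean squared error between the model output and the label."""
    return sum((o - y) ** 2 for o, y in zip(output, label)) / len(output)


# Illustrative weights: the hidden layer is 4x3, the output layer is 2x4.
W2 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.2, 0.2], [-0.3, 0.1, 0.4]]
b2 = [0.0, 0.1, -0.1, 0.05]
W3 = [[0.3, -0.1, 0.2, 0.0], [-0.2, 0.4, 0.1, 0.1]]
b3 = [0.0, 0.0]

x = [1.0, 0.5, -1.0]     # input layer activations (3 neurons)
y = [1.0, 0.0]           # label value Y of X

a2 = dense(W2, b2, x)    # intermediate feature of the hidden layer
a3 = dense(W3, b3, a2)   # model output f(x)
loss = mse(a3, y)        # value of the loss function
```

In the terms used above, `a2` and `a3` are intermediate features and `loss` is the value of the loss function; any of them may be the "parameter" whose calculated value is checked against the precision range.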
- the parameters are the parameters in the backpropagation process.
- the parameters can be the gradients calculated by multiple network structures during the backpropagation process.
- the gradient includes one or more of the following: the gradient of intermediate features, the gradient of the weights in the model, the gradient of the loss, etc.
- the backpropagation process can be understood as the process of continuously adjusting the weight through the loss function value.
- the backpropagation process may include: using an optimizer and a preset learning rate to minimize the loss function and find the best parameters (for example, weights). Specifically, the loss function and the chain rule can be used to take partial derivatives with respect to the weights; these partial derivatives are the gradient.
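The chain-rule step can be illustrated on a single linear neuron with a squared-error loss. This is a toy sketch: the neuron, the loss, the learning rate, and the plain gradient descent update are assumptions chosen for brevity, and the numerical derivative is included only to check the analytic gradient.

```python
def forward(w, b, x):
    return w * x + b                      # single linear neuron


def loss_fn(w, b, x, y):
    return (forward(w, b, x) - y) ** 2    # squared error


def grad_w(w, b, x, y):
    # Chain rule: dL/dw = dL/df * df/dw = 2 * (f - y) * x
    return 2.0 * (forward(w, b, x) - y) * x


w, b, x, y = 0.5, 0.1, 2.0, 3.0
learning_rate = 0.05

analytic = grad_w(w, b, x, y)
eps = 1e-6
# Central difference approximation of the same partial derivative.
numeric = (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)

loss_before = loss_fn(w, b, x, y)
w_new = w - learning_rate * analytic      # gradient descent weight update
loss_after = loss_fn(w_new, b, x, y)
```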
- Step 303 If the calculated value exceeds the first accuracy range, recalculate the parameters using the second accuracy range, and use the recalculated parameters to train the model for one or more iterations.
- if, in the above step 302, the training device calculates the parameters using the first precision range and the calculated values exceed the first precision range, the second precision range is used to recalculate the parameters, and the recalculated parameters are used to train the model for one or more iterations.
- the second accuracy range includes the first accuracy range, or the second accuracy range partially overlaps with the first accuracy range.
- without such an adjustment, the training data used in the calculation that overflowed is discarded, and training may stagnate.
- the method provided by the embodiments of the present application can allow training to continue by adjusting the accuracy range in a timely manner, thereby improving the training efficiency of the model.
- the second accuracy range includes the first accuracy range. For example, the first precision range is FP16 and the second precision range is FP32.
- the first accuracy range may be (-65504, 65504), and the second accuracy range may be $[-1.7 \times 10^{38}, 1.7 \times 10^{38}]$.
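To make the FP16-to-FP32 fallback concrete, the sketch below uses Python's standard `struct` module, whose `'e'` format is IEEE 754 half precision and whose `'f'` format is single precision; packing a value that is too large for the format raises `OverflowError`. The helper names `to_precision`, `multiply`, and `multiply_with_fallback` are hypothetical, and a single multiplication stands in for the parameter calculation.

```python
import struct


def to_precision(value, fmt):
    """Round `value` to the given struct format ('e' = FP16, 'f' = FP32).

    Raises OverflowError when the value is too large for that format.
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def multiply(a, b, fmt):
    """Multiply two numbers and store the result in the requested precision."""
    return to_precision(a * b, fmt)


def multiply_with_fallback(a, b):
    """Try the first precision range (FP16); on overflow, recompute in FP32."""
    try:
        return multiply(a, b, "e"), "FP16"
    except OverflowError:
        return multiply(a, b, "f"), "FP32"


small = multiply_with_fallback(100.0, 200.0)   # 20000 fits in FP16
large = multiply_with_fallback(300.0, 400.0)   # 120000 exceeds 65504
```

The first call stays in FP16, while the second overflows the FP16 range and is recomputed in FP32 instead of being discarded.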
- the second accuracy range partially overlaps with the first accuracy range.
- take the first precision range as FP16 and the second precision range as int16 as an example: the first precision range is (-65504, 65504), and the second precision range can be the integers from -32768 to 32767.
- the first precision range and the second precision range can also be relative concepts of high and low ranges. If the first precision range is a high precision range (hereinafter referred to as high precision), the second precision range is a low precision range (hereinafter referred to as low precision). If the first precision range is low precision, the second precision range is high precision. In general, overflow problems occur with low precision.
- the first precision range is low precision and the second precision range is high precision as an example for illustrative explanation.
- an adjustment from high precision to low precision, or from low precision to high precision, can also be achieved through the method provided by the embodiment of the present application.
- recalculating parameters using the second accuracy range may include multiple situations, and may include recalculating the parameters starting from the first layer network structure of the model using the second accuracy range. It is also possible to recalculate the parameter using the second precision range from the current network structure where the calculated value overflows. There are no specific restrictions here.
- the above step of recalculating the parameters using the second precision range specifically includes: using the second precision range to calculate the intermediate features of the overflow layer, or to calculate intermediate features layer by layer starting from the first layer of the network structure, where the overflow layer is the network structure, among the multiple network structures, whose calculated intermediate-feature values overflow the first accuracy range.
- the above step of recalculating the parameters using the second accuracy range specifically includes: using the second accuracy range to calculate the value of the loss function directly, or to calculate layer by layer starting from the first layer of the network structure until the value of the loss function is obtained.
- take a model including a 5-layer network structure as an example. If the calculated value of the layer 4 parameters computed using the first precision range exceeds the first precision range, the second precision range can be used to recalculate from layer 1 up to layer 4, or to recalculate only the parameters of layer 4.
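The two recalculation options for the 5-layer example can be sketched as follows. Each toy "layer" just multiplies its input by a constant, and a precision range is modelled as a magnitude limit; `FP32_MAX` is an approximate value and the names `run_layer` and `forward` are hypothetical. The sketch only illustrates the control flow (redo the overflow layer versus restart from layer 1), not real reduced-precision arithmetic.

```python
FP16_MAX = 65504.0   # first precision range (FP16)
FP32_MAX = 3.4e38    # approximate FP32 limit, used as the second precision range


def run_layer(scale, value, limit):
    """Toy layer: multiply by `scale`; report overflow against `limit`."""
    result = scale * value
    if abs(result) > limit:
        raise OverflowError("calculated value exceeds the precision range")
    return result


def forward(scales, x, recompute_from_first_layer=False):
    """Forward pass over toy layers, switching precision range on overflow.

    Returns the output and the precision limit finally used by each layer.
    """
    limits = [FP16_MAX] * len(scales)
    inputs = [x]                        # inputs[i] is the input of layer i
    i = 0
    while i < len(scales):
        try:
            out = run_layer(scales[i], inputs[i], limits[i])
        except OverflowError:
            if limits[i] == FP32_MAX:
                raise                   # overflow even in the second range
            if recompute_from_first_layer:
                limits = [FP32_MAX] * len(scales)
                inputs = [x]
                i = 0                   # restart from layer 1
            else:
                limits[i] = FP32_MAX    # redo only the overflow layer
            continue
        inputs.append(out)
        i += 1
    return inputs[-1], limits


scales = [10.0, 10.0, 10.0, 100.0, 0.001]   # layer 4 overflows FP16
out_layer, limits_layer = forward(scales, 1.0)
out_all, limits_all = forward(scales, 1.0, recompute_from_first_layer=True)
```

In the first call only layer 4 ends up in the second precision range; in the second call all layers are recomputed in it starting from layer 1.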
- when the parameter includes a gradient, the calculated value is the gradient divided by the scaling coefficient, where the scaling coefficient is used to reduce the probability of gradient overflow; the method also includes: using the first coefficient to update the scaling coefficient, where the updated scaling coefficient replaces the pre-update scaling coefficient in the next training iteration of the model.
- the first coefficient is a positive number less than 1.
- the parameter is recalculated using a second accuracy range that is different from the first accuracy range, and the updated parameters are used to train the model for one or more iterations.
- for the specific training process, please refer to the description of the forward propagation and back propagation processes mentioned above.
- the calculation of the loss function in forward propagation is performed by using the parameters calculated in the second precision range.
- the model is trained one or more iterations with the goal that the value of the loss function is less than a certain threshold to obtain a trained model.
- recalculation using the second precision range may be a precision range adjustment at layer granularity, or a precision range adjustment for the entire model (i.e., all layers). That is, if the calculation accuracy of all layers is initialized to the same accuracy range, the recalculation in this step applies to all layers (in other words, the calculation accuracy of all layers is adjusted). If the layers are initialized to different accuracy ranges (i.e., mixed precision calculation), the recalculation in this step applies to a specific layer (in other words, the calculation accuracy of that layer is adjusted), namely the layer whose calculated value exceeds the first accuracy range.
- steps 301 to 303 in this embodiment can be executed one or more times, and steps 301 to 303 can be executed after each update one or more times during the training process of the model. It is also possible to perform steps 301 to 303 again when the preset period or the preset number of times is met.
- the second precision range is used to recalculate the parameter. That is, automatic real-time adjustment of the accuracy range through the overflow information of parameter calculation values can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
- the embodiment of the present application can use the overflow information of the parameters to determine and adjust the applicable precision range of the parameters in real time, reducing overflow problems caused by low-precision floating point calculations.
- if the calculated value of the parameter does not exceed the first precision range, the value of the loss function obtained during the forward propagation process is multiplied by the scaling coefficient, and the first precision range is used for gradient calculation in the back propagation process.
- if the value of the gradient divided by the scaling coefficient exceeds the first precision range, the first coefficient is used to update the scaling coefficient, and the second precision range is used to perform recalculation in the forward propagation process and the back propagation process.
- the first coefficient is a positive number less than 1, and the updated scaling coefficient replaces the pre-update scaling coefficient for the next iteration of model training.
- the lower limit of the scaling factor can be limited.
- the minimum value of the scaling factor is a preset threshold greater than or equal to 1.
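The loss-scaling behaviour described above can be sketched as follows. Several details are assumptions made for illustration: the scaling coefficient is updated by multiplying it by the first coefficient, FP16 overflow is emulated by saturating to infinity, and `raw_grad` is a toy stand-in for the gradient that backpropagation would produce. The names `fp16_saturate` and `backward_step` are hypothetical.

```python
import math

FP16_MAX = 65504.0


def fp16_saturate(value):
    """Emulate FP16 overflow: values beyond the range become +/-infinity."""
    return math.copysign(math.inf, value) if abs(value) > FP16_MAX else value


def backward_step(raw_grad, scale, first_coefficient=0.5, min_scale=1.0):
    """One backward step with loss scaling.

    raw_grad: gradient of the unscaled loss (toy stand-in for backpropagation).
    Returns (gradient to apply, precision range used, scale for next iteration).
    """
    scaled_grad = fp16_saturate(raw_grad * scale)    # gradient of loss * scale
    unscaled = fp16_saturate(scaled_grad / scale)    # gradient divided by scale
    if math.isinf(unscaled):
        # Assumption: the update multiplies the scale by the first coefficient,
        # and the result is clamped to the preset lower limit.
        next_scale = max(scale * first_coefficient, min_scale)
        return raw_grad, "second", next_scale        # recomputed in the wider range
    return unscaled, "first", scale


ok = backward_step(0.001, 1024.0)      # stays within the first precision range
overflow = backward_step(100.0, 1024.0)  # 102400 exceeds the FP16 range
clamped = backward_step(1e6, 1.0)      # scale cannot drop below the lower limit
```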
- the model may have an initial accuracy range (for example, the first accuracy range) set during the first iteration. To improve the efficiency of the model in subsequent iterations, the initial accuracy range for each iteration can be adjusted based on the number of iterations and the number of overflows.
- the number of overflows for multiple network structures in the model based on the first accuracy range is obtained, and N is a positive integer greater than or equal to 1. If the number of overflows is greater than or equal to the second threshold, it is determined that the initial accuracy range in the next iterative training process is changed from the first accuracy range to the second accuracy range, and the number of overflows is cleared.
- the number of overflows includes: the number of overflows of multiple network structures in the forward propagation process, and/or the number of overflows of multiple network structures in the back propagation process.
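The overflow-count rule can be sketched as a small bookkeeping class. The class name `PrecisionScheduler`, the string labels for the two ranges, and the choice to count forward and backward overflows together are assumptions for this example; the text also allows counting them separately.

```python
class PrecisionScheduler:
    """Tracks overflows and picks the initial precision range per iteration."""

    def __init__(self, second_threshold):
        self.second_threshold = second_threshold
        self.overflow_count = 0
        self.initial_range = "first"

    def record_iteration(self, forward_overflows, backward_overflows):
        """Add this iteration's overflow counts and update the initial range."""
        self.overflow_count += forward_overflows + backward_overflows
        if self.overflow_count >= self.second_threshold:
            self.initial_range = "second"   # used from the next iteration on
            self.overflow_count = 0         # the count is cleared after the switch
        return self.initial_range


scheduler = PrecisionScheduler(second_threshold=3)
history = [
    scheduler.record_iteration(1, 0),
    scheduler.record_iteration(0, 1),
    scheduler.record_iteration(1, 0),   # third overflow reaches the threshold
]
```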
- the parameters in the embodiment shown in FIG. 3 have various situations.
- the parameters are intermediate features, and the calculation of the intermediate features overflows the first accuracy range.
- the intermediate features can be recalculated using the second precision range.
- the second precision range can also be used for subsequent loss calculations and/or weight gradient calculations.
- the accuracy range can be adjusted for the overflowing parameter, and also for the calculation of other parameters in the subsequent training process. There is no specific limit here.
- the parameter is the loss, and the calculation of the loss exceeds the first precision range.
- the second precision range can be used to recalculate the loss for this calculation only, or for multiple calculations preceding this overflowing calculation.
- the second accuracy range can also be used for subsequent weight gradient calculations.
- the accuracy range can be adjusted for the overflowing parameter, for parameters computed earlier in the training process that did not overflow, and for the calculation of other parameters in the subsequent training process, which is not limited here.
- Figure 5 is another model training method provided by an embodiment of the present application.
- the execution body of this method is similar to the execution body of the embodiment shown in Figure 3 and will not be described again here.
- the method may include steps 501 to 511. Steps 501 to 511 will be described in detail below.
- Step 501 Obtain training data.
- for step 501, reference may be made to the description of step 301 in the embodiment shown in FIG. 3, which will not be described again here.
- Step 502 Determine whether the number of overflows (num) is greater than or equal to the second threshold (N). If yes, execute step 503. If not, execute step 508.
- the number of overflows including the number of times the calculated value of each layer of the model overflows the first precision range during the forward calculation and reverse calculation processes of the model is used as an example for an exemplary description. It can be understood that in practical applications, the number of overflows may be the number of times that the intermediate features of each layer of the model overflow the first accuracy range in the forward calculation. It can also be the number of times the model overflows the first accuracy range when calculating the gradient in reverse. That is, the number of overflows can be counted in the overall forward and reverse processes, or it can be counted separately in the forward or reverse processes, etc. The specific number is not limited here.
- the forward calculation in the embodiment of this application refers to the calculation of parameters (for example, intermediate features, loss, etc.) of the model during the forward propagation process.
- Backward calculation refers to the calculation of parameters (for example, gradients, etc.) of the model during the backpropagation process.
- the training device determines whether the number of overflows is greater than or equal to the second threshold (N), where N is an integer greater than or equal to 0. If yes, execute step 503. If not, execute step 508.
- Step 503 Use the first accuracy range to perform forward calculation to obtain the loss (loss).
- if the number of overflows in the aforementioned step 502 is greater than or equal to the second threshold, execution of this step is triggered.
- the calculation of loss in this step 503 may refer to the description of step 302 in the embodiment shown in FIG. 3 , and will not be described again here.
- Step 504 Determine whether the loss overflows. If yes, execute step 509. If not, multiply the loss by the scaling coefficient and perform step 505.
- After the training device obtains the loss, it determines whether the loss has overflowed. If yes, execute step 509. If not, multiply the loss by the scaling coefficient and perform step 505.
- Step 505 Use the first accuracy range to perform reverse calculation to obtain the weight gradient.
- If the loss in the aforementioned step 504 does not exceed the first accuracy range, the loss is multiplied by the scaling coefficient and execution of this step is triggered.
- The loss is multiplied by the scaling coefficient before step 505 is performed, to prevent the calculated values from underflowing.
- the description of calculating the weight gradient in this step 505 may refer to the description of step 302 in the embodiment shown in FIG. 3 , and will not be described again here.
- Step 506 Determine whether the value of the weight gradient divided by the scaling coefficient overflows. If yes, execute step 510. If not, execute step 507.
- After the training device obtains the weight gradient, it determines whether the value of the weight gradient divided by the scaling coefficient exceeds the first accuracy range. If yes, execute step 510. If not, execute step 507.
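- The control flow of steps 503 to 507 can be sketched for a toy one-weight model. This is only an illustration under stated assumptions: the model `y = w * x`, the squared-error loss, the learning rate, and the names `train_step` and `in_first_range` are hypothetical, the first precision range is assumed to be FP16, and Python floats stand in for the low-precision arithmetic, so only the range checks are modelled.

```python
import math

FP16_MAX = 65504.0  # assumption: the first precision range is FP16


def in_first_range(value):
    """Return True if value is finite and within the assumed FP16 range."""
    return math.isfinite(value) and abs(value) <= FP16_MAX


def train_step(w, x, target, scale, lr=0.1):
    """Run one pass through steps 503 to 507 for the toy model y = w * x.

    Returns (new_w, status), where status names the branch that was taken.
    """
    # Step 503: forward calculation to obtain the loss.
    error = w * x - target
    loss = 0.5 * error * error
    # Step 504: a loss overflow leads to step 509 (overflow count + 1).
    if not in_first_range(loss):
        return w, "loss_overflow"
    # The loss is multiplied by the scaling coefficient, then step 505:
    # d(scale * loss)/dw = scale * error * x.
    scaled_grad = scale * error * x
    # Step 506: check the weight gradient divided by the scaling coefficient.
    grad = scaled_grad / scale
    if not in_first_range(grad):
        return w, "grad_overflow"  # leads to step 510
    # Step 507: update the weight.
    return w - lr * grad, "updated"
```

- For example, `train_step(1.0, 2.0, 1.0, 1024.0)` takes the update branch, while `train_step(1.0, 1000.0, 0.0, 1024.0)` stops at the loss check because the loss exceeds the assumed range.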
- Step 507 update weights.
- If the value of the weight gradient divided by the scaling coefficient in step 506 does not exceed the first accuracy range, and/or after step 508, execution of this step is triggered.
- Step 508 Use the second accuracy range to perform forward and backward recalculation.
- If the number of overflows in the aforementioned step 502 is less than the second threshold, and/or after the aforementioned step 509, execution of this step is triggered.
- For step 508, reference may be made to the description of step 303 in the embodiment shown in FIG. 3, which will not be repeated here.
- Step 509 The cumulative number of overflows is increased by 1 (that is, num+1).
- If the loss overflows the first precision range in the aforementioned step 504, and/or the value of the weight gradient divided by the scaling coefficient overflows the first precision range, execution of this step is triggered.
- the number of overflows is recorded, so that in subsequent iterations, based on the comparison between the number of overflows and the second threshold, it is determined whether to modify the initial accuracy range in each iteration process. For example, if the number of overflows is greater than the second threshold, the initial precision range is adjusted from the first precision range to the second precision range.
- The number of iterations can also be considered when adjusting the initial accuracy range. For example, if the number of overflows is greater than 800 (that is, the second threshold is 800) when the number of iterations reaches 1000, setting the initial accuracy range to the first accuracy range has affected the training of the model. To preserve the accuracy of subsequent model training, the initial accuracy range is adjusted to the second accuracy range, which reduces problems such as training stagnation caused by parameters overflowing the first accuracy range.
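- The decision described above can be sketched as a small helper. This is a hedged illustration: the function name `choose_initial_range`, the string labels, and the default values (1000 iterations, threshold 800, taken from the example above) are assumptions, and clearing the counter on a switch follows the device embodiment rather than this paragraph.

```python
def choose_initial_range(current, iteration, num_overflows,
                         min_iterations=1000, second_threshold=800):
    """Return (initial_range, num_overflows) for the following iterations.

    `current` is "first" or "second". The comparison is strict (>) to match
    the example in the text; the device embodiment states it as >=.
    """
    if (current == "first"
            and iteration >= min_iterations
            and num_overflows > second_threshold):
        # Too many overflows: start later iterations in the second range
        # and clear the overflow counter.
        return "second", 0
    return current, num_overflows
```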
- Step 510 Use the first coefficient to update the scaling coefficient.
- If the value of the weight gradient divided by the scaling coefficient in the aforementioned step 506 exceeds the first accuracy range, execution of this step is triggered, and the first coefficient is used to update the scaling coefficient.
- the first coefficient is a positive number less than 1.
- the scaling coefficient is multiplied by a positive number less than 1 for adjustment.
- Step 511 Determine that the scaling coefficient is greater than or equal to a preset threshold.
- When the training device determines that the value of the weight gradient divided by the scaling coefficient exceeds the first accuracy range and updates the scaling coefficient using the first coefficient, it can check whether the scaling coefficient is smaller than the preset threshold. If the scaling coefficient is less than the preset threshold, the scaling coefficient is set to the preset threshold. If the scaling coefficient is greater than or equal to the preset threshold, the scaling coefficient is not modified. Step 509 is then performed.
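- Steps 510 and 511 amount to shrinking the scaling coefficient and clamping it at a lower limit. A minimal sketch, assuming a first coefficient of 0.5 and a preset threshold of 1.0 (both values and the name `update_scale` are illustrative, not taken from the application):

```python
def update_scale(scale, first_coeff=0.5, preset_threshold=1.0):
    """Step 510: multiply by the first coefficient (0 < first_coeff < 1).
    Step 511: never let the result fall below the preset threshold."""
    if not 0.0 < first_coeff < 1.0:
        raise ValueError("first_coeff must be a positive number less than 1")
    return max(scale * first_coeff, preset_threshold)


# Repeated gradient overflows shrink the coefficient until it hits the floor.
scale = 8.0
history = []
for _ in range(4):
    scale = update_scale(scale)
    history.append(scale)
```

- The lower limit keeps the scaling coefficient from shrinking towards zero, which would otherwise let small gradients underflow again.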
- Steps 501 to 511 in this embodiment can be executed multiple times. During the training of the model, they can be executed again after every one or more updates, or when a preset period or a preset number of times is reached.
- Automatically adjusting the accuracy range in real time based on the overflow information generated during forward calculation and/or reverse calculation can both reduce the memory occupied by model training and improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
- The embodiment of the present application can use the overflow information of the parameters to adjust, in real time, the precision range applicable to the parameters, and reduce overflow problems caused by low-precision floating point calculations.
- By recording the number of overflows, the initial accuracy range can be adjusted when the first accuracy range has affected model training.
- To preserve the accuracy of subsequent model training, the initial accuracy range is adjusted from the first accuracy range to the second accuracy range.
- In the reverse calculation process, setting a lower limit on the scaling coefficient reduces the risk of subsequent parameter underflow.
- the forward calculation can be rerun using the second precision range, and the backward calculation can be done using the first precision range.
- Alternatively, the second precision range can be used to re-perform both the forward calculation and the subsequent reverse calculation. That is, the precision may be adjusted only for the calculation that overflowed, or for all calculations. The details are not limited here.
- Figure 6 is another model training method provided by an embodiment of the present application.
- the execution body of this method is similar to the execution body of the embodiment shown in Figure 3 and will not be described again here.
- the method may include steps 601 to 615. Steps 601 to 615 will be described in detail below.
- Step 601 Obtain training data.
- Step 602 Determine whether the number of overflows (num) is greater than or equal to the second threshold (N). If yes, execute step 603. If not, execute step 612.
- For step 601 and step 602, reference may be made to the description of step 501 and step 502 in the embodiment shown in FIG. 5, which will not be repeated here.
- Step 603 Use the first accuracy range to perform forward calculation to obtain intermediate features.
- If the number of overflows in the aforementioned step 602 is greater than or equal to the second threshold, execution of this step is triggered.
- Step 604 Determine whether the intermediate features overflow. If yes, execute step 605. If not, execute step 607.
- After the training device obtains the intermediate features, it determines whether the calculated value of the intermediate features overflows. If yes, execute step 605. If not, execute step 607.
- Step 605 The cumulative number of overflows is increased by 1 (that is, num+1).
- If the intermediate feature in step 604, the loss in step 608, and/or the value of the weight gradient divided by the scaling coefficient exceeds the first accuracy range, execution of this step is triggered.
- Step 606 Use the second accuracy range to calculate intermediate features of the overflow layer or calculate intermediate features layer by layer starting from the first layer.
- This step can be understood as: if the intermediate feature overflows the first precision range, the second precision range can be used to perform calculations on the overflow layer or all layers.
- Step 607 Use the first accuracy range to perform forward calculation to obtain the loss (loss).
- If the intermediate features in the aforementioned step 604 do not overflow, execution of this step is triggered.
- Step 608 Determine whether the loss overflows. If yes, execute step 613. If not, multiply the loss by the scaling coefficient and perform step 609.
- Step 609 Use the first accuracy range to perform reverse calculation to obtain the weight gradient.
- Step 610 Determine whether the value of the weight gradient divided by the scaling coefficient overflows. If yes, execute step 614. If not, execute step 611.
- Step 611 update the weight.
- Step 612 Use the second accuracy range to perform forward and backward recalculation.
- Step 613 The cumulative number of overflows is increased by 1 (that is, num+1).
- Step 614 Use the first coefficient to update the scaling coefficient.
- Step 615 Determine that the scaling coefficient is greater than or equal to the preset threshold.
- For steps 607 to 615, reference may be made to the description of steps 503 to 511 in the embodiment shown in FIG. 5, which will not be repeated here.
- When the calculation of intermediate features overflows the first precision range, the second precision range can be used to calculate the intermediate features of the overflow layer, or the intermediate features can be calculated layer by layer starting from the first layer network structure.
- the adjustment of the accuracy range can be an adjustment of a specific layer or an adjustment of all layers of the entire network structure.
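- The two recalculation options of step 606 can be sketched as follows. This is an illustrative sketch only: layers are modelled as plain Python functions, Python floats stand in for both precision ranges (so only the range check differs), the first range is assumed to be FP16, and the names `forward` and `fits_first_range` are hypothetical.

```python
import math

FP16_MAX = 65504.0  # assumption: the first precision range is FP16


def fits_first_range(value):
    """Return True if value is finite and within the assumed FP16 range."""
    return math.isfinite(value) and abs(value) <= FP16_MAX


def forward(layers, x, restart_from_first_layer=False):
    """Compute intermediate features layer by layer (steps 603 to 606).

    Returns (features, precisions, num_overflows), where precisions records
    which range was used for each layer.
    """
    features, precisions, num_overflows = [], [], 0
    all_second = False  # True once every layer is redone in the second range
    i = 0
    while i < len(layers):
        inp = x if i == 0 else features[i - 1]
        out = layers[i](inp)
        if not all_second and not fits_first_range(out):
            num_overflows += 1  # step 605
            if restart_from_first_layer:
                # Option 2: recompute layer by layer from the first layer.
                features, precisions, i, all_second = [], [], 0, True
                continue
            precisions.append("second")  # option 1: only the overflow layer
        else:
            precisions.append("second" if all_second else "first")
        features.append(out)
        i += 1
    return features, precisions, num_overflows


# Hypothetical three-layer model; the second layer overflows for x = 30.0.
layers = [lambda v: v * 10, lambda v: v * v, lambda v: v / 1000]
per_layer = forward(layers, 30.0)
whole_net = forward(layers, 30.0, restart_from_first_layer=True)
```

- Recomputing only the overflow layer keeps most of the work in the first range, while restarting from the first layer trades extra computation for a uniform precision across the network.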
- automatic real-time adjustment of the accuracy range through the overflow information generated during forward calculation and/or reverse calculation can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
- The embodiment of the present application can use the overflow information of the parameters to adjust, in real time, the precision range applicable to the parameters, and reduce overflow problems caused by low-precision floating point calculations.
- By recording the number of overflows, the initial accuracy range can be adjusted when the first accuracy range has affected model training. To preserve the accuracy of subsequent model training, the initial accuracy range is adjusted from the first accuracy range to the second accuracy range.
- In the reverse calculation process, setting a lower limit on the scaling coefficient reduces the risk of subsequent parameter underflow.
- Figure 7 is another model training method provided by an embodiment of the present application.
- the execution body of this method is similar to the execution body of the embodiment shown in Figure 3 and will not be described again here.
- the method may include steps 701 to 717. Steps 701 to 717 will be described in detail below.
- Step 701 Obtain training data.
- Step 702 Determine whether the number of overflows (num) is greater than or equal to the second threshold (N). If yes, execute step 703. If not, execute step 712.
- Step 703 Use the first accuracy range to perform forward calculation to obtain intermediate features.
- Step 704 Determine whether the intermediate features overflow. If yes, execute step 705. If not, execute step 707.
- Step 705 The cumulative number of overflows is increased by 1 (that is, num+1).
- Step 706 Use the second precision range to calculate intermediate features of the overflow layer or calculate intermediate features layer by layer starting from the first layer.
- Step 707 Use the first accuracy range to perform forward calculation to obtain the loss (loss).
- Step 708 Determine whether the loss overflows. If yes, execute step 716. If not, multiply the loss by the scaling coefficient and perform step 709.
- Step 709 Use the first accuracy range to perform reverse calculation to obtain the weight gradient.
- Step 710 Determine whether the value of the weight gradient divided by the scaling coefficient overflows. If yes, execute step 714. If not, execute step 711.
- Step 711 update the weight.
- Step 712 Use the second accuracy range to perform forward and backward recalculation.
- Step 713 The cumulative number of overflows is increased by 1 (that is, num+1).
- Step 714 Use the first coefficient to update the scaling coefficient.
- Step 715 Determine that the scaling coefficient is greater than or equal to the preset threshold.
- For steps 701 to 715, reference may be made to the description of steps 601 to 615 in the embodiment shown in FIG. 6, which will not be repeated here.
- Step 716 The cumulative number of overflows is increased by 1 (that is, num+1).
- This step 716 is similar to the aforementioned step 705.
- the number of overflows accumulated is the number of times that the intermediate features, loss, and weight gradients overflow the first accuracy range during the model training process.
- Step 717 Recalculate the loss using the second accuracy range, or calculate layer by layer starting from the first layer until the loss is obtained. Then multiply the loss by the scaling coefficient and execute step 709.
- This step can be understood as follows: if the loss overflows the first accuracy range, the second accuracy range can be applied either only to the final calculation that produces the loss, or to the calculations of all layers of the model.
- When the calculation of the loss exceeds the first precision range, the loss can be recalculated using the second precision range, or calculated layer by layer starting from the first layer network structure until the loss is obtained.
- the adjustment of the accuracy range can be an adjustment of a specific layer or an adjustment of all layers of the entire network structure.
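- The loss fallback of steps 707, 708, 716 and 717 can be sketched with Python's `struct` module, whose `e` format packs IEEE 754 half precision and raises `OverflowError` for finite values that are too large. Treating FP16 as the first precision range and a Python float as the second range is an assumption made for illustration, as are the squared-error loss and the names `to_fp16` and `loss_with_fallback`.

```python
import struct


def to_fp16(value):
    """Round a Python float to half precision.

    struct's "e" format raises OverflowError if the value does not fit.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


def squared_error(pred, target, use_second_range=False):
    """Toy loss; the result is rounded to FP16 unless the second range is used."""
    diff = pred - target
    loss = diff * diff
    return loss if use_second_range else to_fp16(loss)


def loss_with_fallback(pred, target, scale):
    """Return (scaled_loss, num_overflows) for one forward pass."""
    num_overflows = 0
    try:
        loss = squared_error(pred, target)  # step 707
    except OverflowError:  # step 708: the loss overflowed
        num_overflows += 1  # step 716
        loss = squared_error(pred, target, use_second_range=True)  # step 717
    # The loss is multiplied by the scaling coefficient before step 709.
    return loss * scale, num_overflows
```

- Here only the final loss calculation is repeated in the second range; the layer-by-layer alternative would rerun the whole forward pass instead.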
- automatic real-time adjustment of the accuracy range through the overflow information generated during forward calculation and/or reverse calculation can not only reduce the memory occupied by the training model, but also improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
- the type of method determines whether to use high-precision floating point numbers or low-precision floating point numbers.
- the embodiment of the present application can adjust the precision range applicable to the parameters in real time through the overflow information of the parameters, and reduce overflow problems caused by low-precision floating point number calculations.
- By recording the number of overflows, the initial accuracy range can be adjusted when the first accuracy range has affected model training.
- To preserve the accuracy of subsequent model training, the initial accuracy range is adjusted from the first accuracy range to the second accuracy range.
- In the reverse calculation process, setting a lower limit on the scaling coefficient reduces the risk of subsequent parameter underflow.
- One embodiment of the training device in the embodiment of the present application includes:
- the acquisition unit 801, configured to obtain training data;
- the calculation unit 802, configured to use the training data as input to the model, and to use the first accuracy range to calculate parameters during the model training process to obtain calculated values;
- the calculation unit 802 is also configured to recalculate the parameters using the second accuracy range if a calculated value exceeds the first accuracy range, and to use the recalculated parameters to train the model for one or more iterations, where the second accuracy range includes the first accuracy range, or the second accuracy range partially overlaps with the first accuracy range.
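- The two permitted relationships between the precision ranges can be made concrete with closed intervals. This is a sketch under assumptions: the application does not fix numeric ranges here, so the FP16 and FP32 bounds and the helper names `includes` and `partially_overlaps` are illustrative only.

```python
def includes(second, first):
    """True if interval `second` = (low, high) contains interval `first`."""
    return second[0] <= first[0] and first[1] <= second[1]


def partially_overlaps(second, first):
    """True if the intervals intersect but neither contains the other."""
    intersect = second[0] <= first[1] and first[0] <= second[1]
    return (intersect
            and not includes(second, first)
            and not includes(first, second))


# Assumed example ranges: largest finite magnitudes of FP16 and FP32.
FP16 = (-65504.0, 65504.0)
FP32 = (-3.4028235e38, 3.4028235e38)

fp32_includes_fp16 = includes(FP32, FP16)
shifted_overlap = partially_overlaps((0.0, 10.0), (5.0, 20.0))
```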
- the model includes multiple network structures, and the calculation unit 802 is specifically configured to use the second accuracy range to recalculate parameters starting from the first layer network structure of the model.
- the calculation unit 802 is specifically configured to use the second precision range to recalculate the parameters starting from the current network structure at which the calculated value overflowed.
- the model includes multiple network structures, and the parameters include one or more of the following: intermediate features calculated by the multiple network structures during the forward propagation process or the value of the loss function of the model, where an intermediate feature is the output feature of any one of the multiple network structures; and the gradients calculated by the multiple network structures during the backpropagation process.
- the gradients include: the gradient of the intermediate features and/or the weight gradient of the model.
- when the parameters include the gradients calculated during the backpropagation process, the calculated value is the gradient divided by the scaling coefficient, and the scaling coefficient is used to reduce the probability of gradient overflow; the calculation unit 802 is also configured to update the scaling coefficient using the first coefficient.
- the updated scaling coefficient is used to replace the pre-updated scaling coefficient for the next iteration of the model training.
- the first coefficient is a positive number less than 1; the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.
- the calculation unit 802 is specifically configured to use the second accuracy range to calculate the intermediate features of the overflow layer, or to calculate the intermediate features layer by layer starting from the first layer network structure, where the overflow layer is the network structure, among the multiple network structures, whose calculated intermediate features overflow the first accuracy range.
- the calculation unit 802 is specifically configured to use the second accuracy range to calculate the value of the loss function or calculate layer by layer starting from the first layer network structure until the value of the loss function is obtained.
- the calculation unit 802 is specifically configured to obtain, in the Nth iteration of the model training process, the number of overflows of the multiple network structures in the model based on the first accuracy range, where N is a positive integer greater than or equal to 1; the calculation unit 802 is specifically configured to, if the number of overflows is greater than or equal to the second threshold, determine that the initial accuracy range in the next training iteration is changed from the first accuracy range to the second accuracy range, and clear the number of overflows.
- the number of overflows includes: the number of overflows of multiple network structures in the forward propagation process, and/or the number of overflows of multiple network structures in the back propagation process.
- The operations performed by each unit in the training device are similar to those described in the aforementioned embodiments shown in FIGS. 1 to 7, and will not be repeated here.
- When the calculated value of a parameter exceeds the first precision range during the model training process, the calculation unit 802 recalculates the parameter using the second precision range. That is, automatically adjusting the accuracy range in real time based on the overflow information of the calculated parameter values can both reduce the memory occupied by model training and improve the training efficiency of the model. This reduces problems such as training stagnation caused by parameter calculations overflowing the first accuracy range.
- The embodiment of the present application can use the overflow information of the parameters to adjust, in real time, the precision range applicable to the parameters, and reduce overflow problems caused by low-precision floating point calculations.
- the training device may include a processor 901, a memory 902, and a communication port 903.
- the processor 901, memory 902 and communication port 903 are interconnected through lines.
- the memory 902 stores program instructions and data.
- the memory 902 stores program instructions and data corresponding to the steps executed by the training equipment in the corresponding embodiments shown in FIGS. 1 to 7 .
- the processor 901 is configured to perform the steps performed by the training device shown in any of the embodiments shown in FIGS. 1 to 7 .
- the communication port 903 can be used to receive and send data, and to perform steps related to acquisition and reception in any of the embodiments shown in FIGS. 1 to 7 .
- the training device may include more or fewer components than shown in Figure 9, which is merely an illustrative description in this application and is not limiting.
- the disclosed systems, devices and methods can be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the units is only a logical function division. In actual implementation, there may be other division methods.
- multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
- the above integrated units can be implemented in the form of hardware or software functional units.
- If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
- The technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
- The aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Complex Calculations (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Claims (21)
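- A model training method, characterized in that the method comprises: obtaining training data; using the training data as input to a model, and using a first precision range to calculate parameters during training of the model to obtain calculated values; and if a calculated value overflows the first precision range, recalculating the parameters using a second precision range, and using the recalculated parameters to train the model for one or more iterations, wherein the second precision range includes the first precision range, or the second precision range partially overlaps the first precision range.
- The method according to claim 1, characterized in that the model comprises multiple network structures, and recalculating the parameters using the second precision range comprises: using the second precision range to recalculate the parameters starting from the first-layer network structure of the model.
- The method according to claim 1, characterized in that the model comprises multiple network structures, and recalculating the parameters using the second precision range comprises: using the second precision range to recalculate the parameters starting from the current network structure at which the calculated value overflowed.
- The method according to any one of claims 1 to 3, characterized in that the model comprises multiple network structures, and the parameters comprise one or more of the following: intermediate features calculated by the multiple network structures during forward propagation, or the value of the loss function of the model, wherein an intermediate feature is the output feature of any one of the multiple network structures; and gradients calculated by the multiple network structures during backpropagation, wherein the gradients comprise: gradients of the intermediate features and/or weight gradients of the model.
- The method according to claim 4, characterized in that, when the parameters comprise the gradients calculated during backpropagation, the calculated value is the value of the gradient divided by a scaling coefficient, and the scaling coefficient is used to reduce the probability of the gradient overflowing; the method further comprises: updating the scaling coefficient using a first coefficient, wherein the updated scaling coefficient replaces the scaling coefficient before the update for the next training iteration of the model, and the first coefficient is a positive number less than 1; and the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.
- The method according to claim 4, characterized in that, when the parameters comprise the intermediate features calculated during forward propagation, recalculating the parameters using the second precision range comprises: using the second precision range to calculate the intermediate features of an overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure, wherein the overflow layer is the network structure, among the multiple network structures, whose calculated intermediate features overflow the first precision range.
- The method according to claim 4, characterized in that, when the parameters comprise the value of the loss function of the model, recalculating the parameters using the second precision range comprises: using the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
- The method according to any one of claims 1 to 7, characterized in that using the recalculated parameters to train the model comprises: in the Nth iteration of training the model, obtaining the number of overflows of the multiple network structures in the model based on the first precision range, where N is a positive integer greater than or equal to 1; and if the number of overflows is greater than or equal to a second threshold, determining that the initial precision range in the next training iteration is changed from the first precision range to the second precision range, and resetting the number of overflows to zero.
- The method according to claim 8, characterized in that the number of overflows comprises: the number of overflows of the multiple network structures during forward propagation, and/or the number of overflows of the multiple network structures during backpropagation.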
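- A training device, characterized in that the training device comprises: an acquisition unit, configured to obtain training data; and a calculation unit, configured to use the training data as input to a model, and to use a first precision range to calculate parameters during training of the model to obtain calculated values; wherein the calculation unit is further configured to, if a calculated value overflows the first precision range, recalculate the parameters using a second precision range, and use the recalculated parameters to train the model for one or more iterations, wherein the second precision range includes the first precision range, or the second precision range partially overlaps the first precision range.
- The training device according to claim 10, characterized in that the model comprises multiple network structures, and the calculation unit is specifically configured to use the second precision range to recalculate the parameters starting from the first-layer network structure of the model.
- The training device according to claim 10, characterized in that the calculation unit is specifically configured to use the second precision range to recalculate the parameters starting from the current network structure at which the calculated value overflowed.
- The training device according to any one of claims 10 to 12, characterized in that the model comprises multiple network structures, and the parameters comprise one or more of the following: intermediate features calculated by the multiple network structures during forward propagation, or the value of the loss function of the model, wherein an intermediate feature is the output feature of any one of the multiple network structures; and gradients calculated by the multiple network structures during backpropagation, wherein the gradients comprise: gradients of the intermediate features and/or weight gradients of the model.
- The training device according to claim 13, characterized in that, when the parameters comprise the gradients calculated during backpropagation, the calculated value is the value of the gradient divided by a scaling coefficient, and the scaling coefficient is used to reduce the probability of the gradient overflowing; the calculation unit is further configured to update the scaling coefficient using a first coefficient, wherein the updated scaling coefficient replaces the scaling coefficient before the update for the next training iteration of the model, and the first coefficient is a positive number less than 1; and the minimum value of the scaling coefficient is a preset threshold greater than or equal to 1.
- The training device according to claim 13, characterized in that, when the parameters comprise the intermediate features calculated during forward propagation, the calculation unit is specifically configured to use the second precision range to calculate the intermediate features of an overflow layer, or to calculate the intermediate features layer by layer starting from the first-layer network structure, wherein the overflow layer is the network structure, among the multiple network structures, whose calculated intermediate features overflow the first precision range.
- The training device according to claim 13, characterized in that, when the parameters comprise the value of the loss function of the model, the calculation unit is specifically configured to use the second precision range to calculate the value of the loss function, or to calculate layer by layer starting from the first-layer network structure until the value of the loss function is obtained.
- The training device according to any one of claims 10 to 16, characterized in that the calculation unit is specifically configured to, in the Nth iteration of training the model, obtain the number of overflows of the multiple network structures in the model based on the first precision range, where N is a positive integer greater than or equal to 1; and the calculation unit is specifically configured to, if the number of overflows is greater than or equal to a second threshold, determine that the initial precision range in the next training iteration is changed from the first precision range to the second precision range, and reset the number of overflows to zero.
- The training device according to claim 17, characterized in that the number of overflows comprises: the number of overflows of the multiple network structures during forward propagation, and/or the number of overflows of the multiple network structures during backpropagation.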
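- A training device, characterized by comprising: a processor, wherein the processor is coupled to a memory, the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the training device is caused to perform the method according to any one of claims 1 to 9.
- A computer storage medium, characterized by comprising computer instructions, wherein when the computer instructions are run on a training device, the training device is caused to perform the method according to any one of claims 1 to 9.
- A computer program product, characterized in that, when the computer program product runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 9.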
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020257004231A KR20250037493A (ko) | 2022-07-15 | 2023-07-12 | Model training method and related device |
| EP23838966.2A EP4550209A4 (en) | 2022-07-15 | 2023-07-12 | Model training method and associated device |
| JP2025501773A JP2025522114A (ja) | 2022-07-15 | 2023-07-12 | Model training method and related device |
| US19/019,814 US20250156712A1 (en) | 2022-07-15 | 2025-01-14 | Model training method and related device |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210832123.6 | 2022-07-15 | ||
| CN202210832123.6A CN117474045A (zh) | 2022-07-15 | 2022-07-15 | Model training method and related device |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US19/019,814 Continuation US20250156712A1 (en) | 2022-07-15 | 2025-01-14 | Model training method and related device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024012476A1 (zh) | 2024-01-18 |
Family
ID=89535600
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/106905 Ceased WO2024012476A1 (zh) | 2022-07-15 | 2023-07-12 | Model training method and related device |
Country Status (6)
| Country | Link |
|---|---|
| US (1) | US20250156712A1 (zh) |
| EP (1) | EP4550209A4 (zh) |
| JP (1) | JP2025522114A (zh) |
| KR (1) | KR20250037493A (zh) |
| CN (1) | CN117474045A (zh) |
| WO (1) | WO2024012476A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118798254A (zh) * | 2024-09-13 | 2024-10-18 | 湖北华中电力科技开发有限责任公司 | Distributed modeling method and system |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118095386B (zh) * | 2024-03-26 | 2024-07-30 | 腾讯科技(深圳)有限公司 | Model training acceleration method, apparatus, device, and storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
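| CN110097188A (zh) * | 2019-04-30 | 2019-08-06 | 科大讯飞股份有限公司 | Model training method, working node, and parameter update server |
| CN113255877A (zh) * | 2020-02-12 | 2021-08-13 | 阿里巴巴集团控股有限公司 | Quantization processing method, apparatus, device, and storage medium for a neural network model |
| CN113435520A (zh) * | 2021-06-30 | 2021-09-24 | 深圳市商汤科技有限公司 | Neural network training method, apparatus, device, and computer-readable storage medium |
| CN113888524A (zh) * | 2021-10-20 | 2022-01-04 | 深圳市信润富联数字科技有限公司 | Defect detection model training method, apparatus, device, and readable storage medium |
| CN114065902A (zh) * | 2020-08-03 | 2022-02-18 | 虫极科技(北京)有限公司 | Neural network layer training method, neural network computing system, and computer-readable medium |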
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
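| CN108364065B (zh) * | 2018-01-19 | 2020-09-11 | 上海兆芯集成电路有限公司 | Microprocessor using Booth multiplication |
| CN110210611B (zh) * | 2019-05-13 | 2021-09-07 | 西安交通大学 | Dynamic adaptive data truncation method for convolutional neural network computation |
| JP2022094508A (ja) * | 2020-12-15 | 2022-06-27 | 富士通株式会社 | Arithmetic processing device, arithmetic processing method, and arithmetic processing program |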
2022
- 2022-07-15 CN CN202210832123.6A patent/CN117474045A/zh active Pending
2023
- 2023-07-12 KR KR1020257004231A patent/KR20250037493A/ko active Pending
- 2023-07-12 EP EP23838966.2A patent/EP4550209A4/en active Pending
- 2023-07-12 WO PCT/CN2023/106905 patent/WO2024012476A1/zh not_active Ceased
- 2023-07-12 JP JP2025501773A patent/JP2025522114A/ja active Pending
2025
- 2025-01-14 US US19/019,814 patent/US20250156712A1/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4550209A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4550209A1 (en) | 2025-05-07 |
| KR20250037493A (ko) | 2025-03-17 |
| JP2025522114A (ja) | 2025-07-10 |
| EP4550209A4 (en) | 2025-10-22 |
| US20250156712A1 (en) | 2025-05-15 |
| CN117474045A (zh) | 2024-01-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
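| US12299577B2 (en) | | Tensor processing using low precision format |
| TWI744724B (zh) | | Method for processing a convolutional neural network |
| US20250156712A1 (en) | | Model training method and related device |
| WO2019238029A1 (zh) | | Convolutional neural network system and method for quantizing a convolutional neural network |
| CN113095486B (zh) | | Integer tensor network data processing method |
| JP2020009444A (ja) | | Method and apparatus for processing parameters in a neural network |
| CN113919479B (zh) | | Method for extracting data features and related apparatus |
| CN112789627A (zh) | | Neural network processor, data processing method, and related device |
| WO2022111002A1 (zh) | | Method, device, and computer-readable storage medium for training a neural network |
| CN110378470A (zh) | | Optimization method and apparatus for a neural network model, and computer storage medium |
| CN112633477A (zh) | | Quantized neural network acceleration method based on a field-programmable array |
| WO2021169160A1 (zh) | | Image normalization processing method and apparatus, and storage medium |
| WO2023109748A1 (zh) | | Neural network adjustment method and corresponding apparatus |
| CN114239799A (zh) | | Efficient object detection method, device, medium, and system |
| CN112130805A (zh) | | Chip and device including a floating-point adder, and control method for floating-point operations |
| CN114580625A (zh) | | Method, device, and computer-readable storage medium for training a neural network |
| CN116070689A (zh) | | Quantization training method and apparatus for a model, electronic device, and readable storage medium |
| WO2019076095A1 (zh) | | Processing method and apparatus |
| TW202129551A (zh) | | Computing device and computing method |
| CN114730331A (zh) | | Data processing apparatus and data processing method |
| CN116957007A (zh) | | Feature quantization method, apparatus, medium, and program product for neural network training |
| CN114692348B (zh) | | Component layout temperature field prediction method based on a multi-fidelity deep learning surrogate model |
| CN119646165A (zh) | | Training method and apparatus for a question-answering model, computer device, and readable storage medium |
| CN115861041B (zh) | | Image style transfer method, apparatus, computer device, storage medium, and product |
| CN119376686A (zh) | | Data processing method and related device |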
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23838966; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2025501773; Country of ref document: JP |
| | WWE | WIPO information: entry into national phase | Ref document number: 202527005682; Country of ref document: IN |
| | WWE | WIPO information: entry into national phase | Ref document number: 2023838966; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023838966; Country of ref document: EP; Effective date: 20250131 |
| | REG | Reference to national code | Ref country code: BR; Ref legal event code: B01A; Ref document number: 112025000709; Country of ref document: BR |
| | ENP | Entry into the national phase | Ref document number: 20257004231; Country of ref document: KR; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: 1020257004231; Country of ref document: KR |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | WIPO information: published in national office | Ref document number: 202527005682; Country of ref document: IN |
| | WWP | WIPO information: published in national office | Ref document number: 1020257004231; Country of ref document: KR |
| | WWP | WIPO information: published in national office | Ref document number: 2023838966; Country of ref document: EP |
| | REG | Reference to national code | Ref country code: BR; Ref legal event code: B01E; Ref document number: 112025000709; Country of ref document: BR; Free format text: 1) Submit a new specification adapted to Art. 26, item I, of Portaria/INPI No. 14/2024, since the content submitted in petition No. 870250003107 of 14/01/2025 does not comply with the rule regarding the title, containing text other than the title at the start of the page. 2) Submit a new abstract adapted to Art. 40, item I, of Portaria/INPI No. 14/2024, since the content submitted in petition No. 870250003107 of 14/01/2025 does not comply with the rule regarding the title, containing text other than the title at the start of the page. The requirement must be answered within 60 (sixty) days of its publication and must be made by means of the GRU petition, service code 207. |
| | ENP | Entry into the national phase | Ref document number: 112025000709; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20250114 |