WO2021098821A1 - Method for data processing in a neural network system, and neural network system - Google Patents

Method for data processing in a neural network system, and neural network system

Info

Publication number
WO2021098821A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
array
arrays
memristor
deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/130393
Other languages
English (en)
French (fr)
Inventor
高滨
姚鹏
王侃文
廖健行
王铁英
吴华强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huawei Technologies Co Ltd
Original Assignee
Tsinghua University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Huawei Technologies Co Ltd
Priority to EP20888862.8A (EP4053748B1)
Publication of WO2021098821A1
Priority to US17/750,052 (US20220277199A1)
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065 Analogue means
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/09 Supervised learning

Definitions

  • This application relates to the field of neural networks, and more specifically, to a method of data processing in a neural network system, and a neural network system.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • deep learning is a learning technology based on a deep artificial neural network (ANN) algorithm.
  • The training process of a neural network is a data-centric task that requires computing hardware with high-performance, low-power processing capabilities.
  • A neural network system based on multiple neural network arrays can integrate storage and calculation and can handle deep learning tasks.
  • At least one storage and calculation unit in a neural network array may store the weight value of the corresponding neural network layer. Because of the network structure or the system architecture design, the processing speeds of the neural network arrays are inconsistent. In this case, multiple neural network arrays can be used for parallel processing, and their joint calculation speeds up the neural network array that forms the speed bottleneck. However, the storage and calculation units in the neural network arrays participating in parallel acceleration have non-ideal characteristics, such as device fluctuation, conductance drift and array yield, which reduce the overall performance of the neural network system and lower its accuracy.
  • This application provides a data processing method for a neural network system that uses parallel acceleration, and a neural network system, which can mitigate the impact of non-ideal device characteristics when parallel acceleration is used and improve the performance and recognition accuracy of the neural network system.
  • In a first aspect, a method for data processing in a neural network system is provided. The method includes: in a neural network system using parallel acceleration, inputting training data into the neural network system to obtain first output data, wherein the neural network system includes a plurality of neural network arrays, each neural network array in the plurality of neural network arrays includes a plurality of storage and calculation units, and each storage and calculation unit is used to store the weight value of a neuron in the corresponding neural network; calculating the deviation between the first output data and the target output data; and adjusting, according to the deviation, the weight value stored in at least one storage and calculation unit in some of the neural network arrays in the plurality of neural network arrays, wherein those neural network arrays are used to realize the calculation of some of the neural network layers in the neural network system.
  • In this technical solution, the weight values stored in some of the neural network arrays in the plurality of neural network arrays can be adjusted and updated. This makes the system compatible with the non-ideal characteristics of the storage and calculation units and improves the recognition rate and performance of the system, thereby avoiding the performance degradation that those non-ideal characteristics would otherwise cause.
  • the multiple neural network arrays include a first neural network array and a second neural network array, and the input data of the first neural network array includes output data of the second neural network array.
  • the first neural network array includes a neural network array for realizing a fully connected layer calculation in a neural network.
  • the weight value stored in at least one storage and calculation unit in the first neural network array is adjusted according to the input value of the first neural network array and the deviation.
  • the plurality of neural network arrays further include a third neural network array, and the third neural network array and the second neural network array are used to implement the convolutional layer calculation in the neural network in parallel.
  • the weight value stored in at least one storage and calculation unit in the second neural network array is adjusted; and the weight value stored in at least one storage and calculation unit in the third neural network array is adjusted.
  • the weight values stored in the storage and calculation units in the multiple neural network arrays that implement the convolutional layer calculation in the neural network in parallel can also be adjusted and updated to improve the adjustment accuracy, thereby improving the accuracy of the neural network system output.
  • the deviation is divided into at least two sub-deviations, wherein the first sub-deviation of the at least two sub-deviations corresponds to the output data of the second neural network array, and the second sub-deviation of the at least two sub-deviations corresponds to the output data of the third neural network array; according to the first sub-deviation and the input data of the second neural network array, the weight value stored in at least one storage and calculation unit in the second neural network array is adjusted; and according to the second sub-deviation and the input data of the third neural network array, the weight value stored in at least one storage and calculation unit in the third neural network array is adjusted.
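  • The sub-deviation scheme can be sketched as follows. This is only an illustration under stated assumptions: the outputs of the two parallel arrays are assumed to be concatenated, the deviation is assumed to be split by output position, and the delta-rule update and all names (`split_deviation`, `update`, `input_second`, `input_third`) are hypothetical rather than taken from this application.

```python
def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def split_deviation(deviation, sizes):
    """Cut the deviation vector into consecutive sub-deviations."""
    parts, start = [], 0
    for size in sizes:
        parts.append(deviation[start:start + size])
        start += size
    return parts


def update(array, sub_deviation, array_input, lr=0.1):
    """Adjust one array from its own sub-deviation and its own input."""
    for j, row in enumerate(array):
        for i in range(len(row)):
            row[i] -= lr * sub_deviation[j] * array_input[i]


# Assumption: both arrays start with the same kernel weights and each
# one processes a different slice of the input data in parallel.
second_array = [[0.2, 0.1]]
third_array = [[0.2, 0.1]]
input_second, input_third = [1.0, 0.0], [0.0, 1.0]

# Assumption: the joint output is the concatenation of both outputs.
output = matvec(second_array, input_second) + matvec(third_array, input_third)
target = [0.5, 0.5]
deviation = [o - t for o, t in zip(output, target)]

sub_second, sub_third = split_deviation(deviation, [1, 1])
update(second_array, sub_second, input_second)
update(third_array, sub_third, input_third)
```

  • Each array only ever sees the part of the deviation that its own output produced, paired with its own input data, so the two updates can be computed independently.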
  • the number of pulses is determined according to the updated weight value in the storage and calculation unit, and the weight value stored in at least one storage and calculation unit in the neural network array is rewritten according to the number of pulses.
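  • How a number of pulses might be derived from an updated weight value can be sketched as below. The sketch assumes that every programming pulse changes the stored value by a fixed amount `step`; real memristors are generally nonlinear, and this application does not state such a model at this point, so the linear model and the function names are purely illustrative.

```python
def pulses_for_update(current, updated, step):
    """Return (number_of_pulses, direction) to move current toward updated.

    Assumption: each pulse changes the stored value by a fixed `step`.
    """
    delta = updated - current
    count = round(abs(delta) / step)
    direction = "increase" if delta > 0 else "decrease"
    return count, direction


def rewrite(current, updated, step):
    """Apply the computed pulses and return the value after rewriting."""
    count, direction = pulses_for_update(current, updated, step)
    sign = 1 if direction == "increase" else -1
    return current + sign * count * step


pulses_up = pulses_for_update(0.20, 0.26, 0.02)
pulses_down = pulses_for_update(0.50, 0.41, 0.03)
rewritten = rewrite(0.20, 0.26, 0.02)
```

  • Because the pulse count is rounded to an integer, the rewritten value can differ from the target by up to half a step under this model.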
  • In a second aspect, a neural network system is provided, including:
  • a processing module, used to input training data into the neural network system to obtain first output data, wherein the neural network system includes a plurality of neural network arrays, each neural network array of the plurality of neural network arrays includes a plurality of storage and calculation units, and each storage and calculation unit is used to store the weight value of a neuron in the corresponding neural network;
  • a calculation module, used to calculate the deviation between the first output data and the target output data; and
  • an adjustment module, configured to adjust, according to the deviation, the weight value stored in at least one storage and calculation unit in some of the neural network arrays in the plurality of neural network arrays, wherein those neural network arrays are used to implement the calculation of some of the neural network layers in the neural network system.
  • the multiple neural network arrays include a first neural network array and a second neural network array, and the input data of the first neural network array includes the output data of the second neural network array.
  • the first neural network array includes a neural network array for realizing a fully connected layer calculation in the neural network.
  • the adjustment module is specifically configured to: adjust the weight value stored in at least one storage and calculation unit in the first neural network array according to the input value of the first neural network array and the deviation.
  • the plurality of neural network arrays further include a third neural network array, and the third neural network array and the second neural network array are used to implement the convolutional layer calculation in the neural network in parallel.
  • the adjustment module is specifically configured to: adjust the weight value stored in at least one storage and calculation unit in the second neural network array; and adjust the weight value stored in at least one storage and calculation unit in the third neural network array.
  • the adjustment module is specifically configured to: divide the deviation into at least two sub-deviations, wherein the first sub-deviation of the at least two sub-deviations corresponds to the output data of the second neural network array, and the second sub-deviation of the at least two sub-deviations corresponds to the output data of the third neural network array; adjust the weight value stored in at least one storage and calculation unit in the second neural network array according to the first sub-deviation and the input data of the second neural network array; and adjust the weight value stored in at least one storage and calculation unit in the third neural network array according to the second sub-deviation and the input data of the third neural network array.
  • the adjustment module is specifically configured to: determine the number of pulses according to the updated weight value in the storage and calculation unit, and rewrite, according to the number of pulses, the weight value stored in at least one storage and calculation unit in the neural network array.
  • In a third aspect, a neural network system is provided, including a processor and a memory, wherein the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that the neural network system executes the method provided by the first aspect or any possible implementation of the first aspect.
  • the processor may be a general-purpose processor, and may be implemented by hardware or software.
  • when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor that is implemented by reading the software code stored in the memory, and the memory may be integrated in the processor or may be located outside the processor and exist independently.
  • a chip is provided, and the neural network system as in the second aspect or any one of the possible implementation manners of the second aspect is provided on the chip.
  • the chip includes a processor and a data interface, wherein the processor reads instructions stored on the memory through the data interface to execute the first aspect or any one of the possible implementation methods of the first aspect.
  • the chip can be implemented as a central processing unit (CPU), a microcontroller unit (MCU), a microprocessor unit (MPU), a digital signal processor (DSP), a system on chip (SoC), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or a programmable logic device (PLD).
  • a computer program product is provided, which includes computer program code that, when run on a computer, causes the computer to execute the method in the first aspect or any one of the possible implementations of the first aspect.
  • a computer-readable storage medium is provided, which stores computer program code; when the computer program code runs on a computer, the computer executes the method in the first aspect or any one of the possible implementations of the first aspect.
  • These computer-readable storage media include but are not limited to one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and hard drive.
  • FIG. 1 is a schematic structural diagram of a neural network system 100 provided by this application.
  • FIG. 2 is a schematic structural diagram of another neural network system 200 provided by this application.
  • FIG. 3 is a schematic diagram of the mapping relationship between a neural network and a neural network array.
  • FIG. 4 is a schematic diagram of a possible weight matrix provided by the present application.
  • FIG. 5 is a schematic diagram of a possible neural network model.
  • FIG. 6 is a schematic diagram of a neural network system provided by the present application.
  • FIG. 7 is a schematic diagram of the input data and output data of multiple groups of memristor arrays for parallel computing provided by the present application.
  • FIG. 8(a) is a schematic diagram of multiple groups of memristor arrays, provided by the present application, that compute input data in parallel for acceleration.
  • FIG. 8(b) is a schematic diagram of a specific data splitting provided by the present application.
  • FIG. 9 is a schematic diagram of another arrangement of multiple groups of memristor arrays, provided by the present application, that compute input data in parallel for acceleration.
  • FIG. 10 is a schematic flowchart of a method for data processing in a neural network system provided by the present application.
  • FIG. 11 is a schematic diagram of a forward operation and a backward operation process provided by the present application.
  • FIG. 12 is a schematic diagram of updating the weight value stored in the first memristor array for realizing the calculation of the fully connected layer among multiple memristor arrays provided by the present application.
  • FIG. 13 is another schematic diagram of updating the weight value stored in the first memristor array for realizing the calculation of the fully connected layer in the multiple memristor arrays provided by the present application.
  • FIG. 14 is a schematic diagram of updating the weight values stored in multiple groups of memristor arrays that realize the calculation of the convolutional layer.
  • FIG. 15 is a schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that realize the calculation of the convolutional layer according to the residual value.
  • FIG. 16 is another schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that realize the calculation of the convolutional layer.
  • FIG. 17 is another schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that realize the calculation of the convolutional layer according to the residual value.
  • FIG. 18 is a schematic diagram of increasing the weight value stored by at least one storage unit in a neural network array provided by the present application.
  • FIG. 19 is a schematic diagram of reducing the weight value stored by at least one storage unit in a neural network array provided by the present application.
  • FIG. 20 is a schematic diagram of increasing the weight value stored in at least one storage unit in a neural network array through a read-write method provided by the present application.
  • FIG. 21 is a schematic diagram of reducing the weight value stored in at least one storage unit in a neural network array through a read-write method provided by the present application.
  • FIG. 22 is a schematic flowchart of a neural network training process provided by an embodiment of the present application.
  • FIG. 23 is a schematic structural diagram of a neural network system 2300 provided by an embodiment of the present application.
  • ANN artificial neural network
  • NN neural network
  • CNN convolutional neural network
  • MLP multilayer perceptron
  • RNN recurrent neural network
  • The training process of the neural network is also the process of learning the parameter matrices.
  • The ultimate goal is to obtain the parameter matrix of each layer of the trained neural network (the parameter matrix of each layer includes the weight of each neuron in that layer).
  • Each parameter matrix formed by the weights obtained through training can extract pixel information from the image to be inferred that the user inputs, thereby helping the neural network to perform correct inference on that image, so that the predicted value output by the trained neural network is as close as possible to the prior knowledge of the training data.
  • The prior knowledge, also called the ground truth, generally includes the real results corresponding to the training data provided by the user.
  • the training process of the above neural network is a data-centric task, which requires computing hardware with high-performance and low-power processing capabilities.
  • Computing based on the traditional von Neumann architecture requires a large amount of data to be moved due to the separation of the storage unit and the computing unit, and cannot achieve energy-efficient processing.
  • FIG. 1 is a schematic structural diagram of a neural network system 100 provided by an embodiment of this application.
  • the neural network system 100 may include a host 105 and a neural network circuit 110.
  • the neural network circuit 110 is connected to the host 105 through a host interface.
  • the host interface may include a standard host interface and a network interface.
  • the host interface may include a peripheral component interconnect express (PCIE) interface.
  • the neural network circuit 110 may be connected to the host 105 through the PCIE bus 106. The host 105 can therefore input data to the neural network circuit 110 through the PCIE bus 106 and receive the data processed by the neural network circuit 110 through the PCIE bus 106.
  • the host 105 can also monitor the working status of the neural network circuit 110 through the host interface.
  • the host 105 may include a processor 1052 and a memory 1054. It should be noted that, in addition to the devices shown in FIG. 1, the host 105 may also include other devices such as a communication interface and a magnetic disk as an external memory, which is not limited here.
  • the processor 1052 is the computing core and control unit of the host 105.
  • the processor 1052 may include multiple processor cores.
  • the processor 1052 may be a very large-scale integrated circuit.
  • An operating system and other software programs are installed in the processor 1052, so that the processor 1052 can implement access to the memory 1054, cache, disk, and peripheral devices (such as the neural network circuit in FIG. 1).
  • the core in the processor 1052 may be, for example, a central processing unit (CPU), or may be an application-specific integrated circuit (ASIC).
  • the processor 1052 in the embodiment of the present application may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the memory 1054 is the main memory of the host 105.
  • the memory 1054 is connected to the processor 1052 through a double data rate (DDR) bus.
  • the memory 1054 is usually used to store various running software in the operating system, input and output data, and information exchanged with external memory. In order to increase the access speed of the processor 1052, the memory 1054 needs to have the advantage of fast access speed.
  • a dynamic random access memory (DRAM) is usually used as the memory 1054.
  • the processor 1052 can access the memory 1054 at a high speed through a memory controller (not shown in FIG. 1), and read and write any storage unit in the memory 1054.
  • the memory 1054 in the embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • many forms of RAM are available, for example: static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM) and direct rambus random access memory (DR RAM).
  • the neural network circuit 110 shown in FIG. 1 may be a chip array composed of a plurality of neural network chips 115 and a plurality of routers 120.
  • the neural network chip 115 is referred to as the chip 115 in this embodiment of the present application.
  • the plurality of chips 115 are connected to each other through the router 120.
  • one chip 115 may be connected to one or more routers 120.
  • Multiple routers 120 can form one or more network topologies.
  • the chips 115 can perform data transmission and information exchange through the multiple network topologies.
  • FIG. 2 is a schematic structural diagram of another neural network system 200 provided by an embodiment of the application.
  • the neural network system 200 may include a host 105 and a neural network circuit 210.
  • the neural network circuit 210 is connected to the host 105 through a host interface. As shown in FIG. 2, the neural network circuit 210 can be connected to the host 105 through the PCIE bus 106.
  • the host 105 may include a processor 1052 and a memory 1054. For a specific description of the host 105, please refer to the description of FIG. 1, which will not be repeated here.
  • the neural network circuit 210 shown in FIG. 2 may be a chip array composed of a plurality of chips 115, wherein the plurality of chips 115 are attached to the PCIE bus 106.
  • the chips 115 perform data transmission and information exchange through the PCIE bus 106.
  • the architectures of the neural network systems in FIG. 1 and FIG. 2 are only examples, and those skilled in the art can understand that in practice the above neural network system may include more or fewer units than those shown in FIG. 1 or FIG. 2.
  • the modules, units or circuits in the neural network system can also be replaced by other modules, units or circuits with similar functions, which are not limited in the embodiment of the present application.
  • the aforementioned neural network system may also be implemented by a graphics processing unit (GPU) or a field programmable gate array (FPGA) based on digital computing.
  • the above-mentioned neural network circuit may be implemented by a plurality of neural network arrays that integrate storage and calculation.
  • Each neural network array in the multiple neural network arrays can include multiple storage and calculation units, and each storage and calculation unit is used to store the weight value of a neuron in the corresponding neural network layer, so as to realize the calculation of the neural network layer.
  • the embodiment of this application does not specifically limit the storage and calculation unit, which may include but is not limited to: memristor, static RAM (SRAM), NOR flash, magnetic RAM (MRAM), ferroelectric gate field-effect transistor (FeFET) and electrochemical RAM (ECRAM).
  • the memristor may include, but is not limited to: resistive random-access memory (ReRAM), conductive-bridging RAM (CBRAM) and phase-change memory (PCM).
  • for example, the neural network array is a ReRAM crossbar composed of ReRAM cells.
  • the aforementioned neural network system may include multiple ReRAM crossbars.
  • the ReRAM crossbar may also be referred to as a memristor cross array, ReRAM device, or ReRAM.
  • a chip that includes one or more ReRAM crossbars can be called a ReRAM chip.
  • ReRAM crossbar is a new non-von Neumann computing architecture.
  • The architecture integrates storage and computing functions, is flexibly configurable, and uses analog computing. It is expected to perform matrix-vector multiplication faster and with lower energy consumption than traditional computing architectures, and therefore has broad application prospects in neural network computing.
  • FIG. 3 is a schematic diagram of the mapping relationship between the neural network and the neural network array.
  • the neural network 110 includes a plurality of neural network layers.
  • the neural network layer is a logical layer concept, and a neural network layer refers to a neural network operation to be performed once.
  • the calculation of each layer of neural network is implemented by computing nodes (also called neurons).
  • the neural network layer can include a convolutional layer, a pooling layer, a fully connected layer, and so on.
  • the calculation nodes in the neural network system can calculate the input data and the weights of the corresponding neural network layers.
  • the weight is usually represented by a real matrix, and each element in the weight matrix represents a weight value.
  • Weight is usually used to indicate how important the input data is to the output data.
  • the weight matrix with m rows and n columns shown in FIG. 4 can be the weight of one neural network layer, and each element in the weight matrix represents a weight value.
  • the weights can be configured on multiple ReRAM cells of the ReRAM crossbar before calculation. Therefore, the matrix multiplication and addition operation of the input data and the configured weight can be realized through the ReRAM crossbar.
  • the ReRAM cell in the embodiment of the present application may also be referred to as a memristor cell.
  • configuring the weight on the memristor cell before the calculation can be understood as storing the weight value of the neuron in the corresponding neural network in the memristor cell.
  • the resistance or conductance value of the memristor cell can be used to represent the weight value of the neuron in the neural network.
  • the ReRAM crossbar and the neural network layer may have a one-to-one mapping relationship, or may also be a one-to-many mapping relationship.
  • the detailed description will be given below in conjunction with the drawings, and will not be repeated here.
  • the first neural network layer in the neural network 110 is taken as an example to introduce the data processing process.
  • the first neural network layer can be any layer in the neural network system.
  • the first neural network layer can also be referred to as the "first layer" for short.
  • the ReRAM crossbar 120 shown in FIG. 3 is an m × n cross array, where the ReRAM crossbar 120 may include multiple memristor units (for example, G 1,1 , G 1,2 , etc.).
  • the bit lines (BL) of the memristor cells in each column are connected together, and the source lines (SL) of the memristor cells in each row are connected together.
  • the weight of the neuron in the neural network can be expressed by the conductance value of the memristor.
  • each element in the weight matrix shown in FIG. 4 can be represented by the conductance value of the memristor located at the intersection of BL and SL.
  • G 1,1 in Fig. 3 represents the weight element W 0,0 in Fig. 4
  • G 1,2 in Fig. 3 represents the weight element W 0,1, etc. in Fig. 4.
  • the different conductance values of the memristor unit can indicate that the weights of neurons in the neural network stored in the memristor unit are different.
  • the n input data Vi can be represented by the voltage value of the BL loaded to the memristor, for example: V1, V2, V3, ... Vn in FIG. 3.
  • the input data can be expressed by voltage, so that a dot-product (multiply-accumulate) operation can be performed between the input data loaded to the memristors and the weight values stored in the memristors, and the m output data shown in FIG. 3 are obtained.
  • the m output data can be represented by the current of SL, for example: I1, I2,... Im in FIG. 3.
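The crossbar computation described above can be sketched in a few lines of Python. This is only an idealized illustration: the function name `crossbar_mvm` and the example values are assumptions made here, and effects such as wire resistance or device non-linearity are ignored.

```python
def crossbar_mvm(conductance, voltages):
    """Return the m source-line currents of an idealized m x n crossbar.

    conductance[i][j] is the conductance of the cell on row i (SL i) and
    column j (BL j); voltages[j] is the voltage applied to BL j.
    """
    if any(len(row) != len(voltages) for row in conductance):
        raise ValueError("each row needs one conductance per input voltage")
    # Each cell contributes I = G * V; the currents of one row add up on its SL.
    return [sum(g * v for g, v in zip(row, voltages)) for row in conductance]


# Hypothetical 2 x 3 array: three input voltages, two output currents.
G = [[1.0, 2.0, 3.0],
     [0.5, 0.0, 1.0]]
V = [1.0, 2.0, 0.5]
print(crossbar_mvm(G, V))
```

Each output current is the weighted sum of all input voltages on that row, which is why one read of the array corresponds to one matrix-vector multiplication.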
  • the voltage value loaded to the memristor can be represented in several ways: for example, by the pulse amplitude of the voltage, by the pulse width of the voltage, by the number of voltage pulses, or by a combination of the number of voltage pulses and the pulse amplitude of the voltage.
  • One neural network array in the plurality of neural network arrays may correspond to one neural network layer, and the neural network array is used to realize the calculation of one neural network layer.
  • multiple neural network arrays may correspond to one neural network layer, and are used to implement the calculation of the one neural network layer.
  • one neural network array in the multiple neural network arrays may correspond to multiple neural network layers, and is used to realize the calculation of the multiple neural network layers.
  • the following description takes the case in which the neural network array is a memristor array as an example.
  • Figure 5 is a schematic diagram of a possible neural network model.
  • the neural network model may include multiple neural network layers.
  • the neural network layer is a logical layer concept, and a neural network layer refers to a neural network operation to be performed once.
  • the calculation of each layer of neural network is realized by computing nodes.
  • the neural network layer can include a convolutional layer, a pooling layer, a fully connected layer, and so on.
  • the neural network model may include n neural network layers (also referred to as n-layer neural network), where n is an integer greater than or equal to 2.
  • Figure 5 shows part of the neural network layers in the neural network model.
  • the neural network model may include a first layer 302, a second layer 304, a third layer 306, a fourth layer 308, and a fifth layer 310.
  • the first layer 302 can perform convolution operations
  • the second layer 304 can perform pooling operations or activation operations on the output data of the first layer 302
  • the third layer 306 can operate on the output data of the second layer 304.
  • the fourth layer 308 may perform a convolution operation on the output result of the third layer 306, and the fifth layer 310 may perform a summation operation on the output data of the second layer 304 and the output data of the fourth layer 308.
  • the nth layer 312 may perform operations of a fully connected layer.
  • the pooling operation or the activation operation can be implemented by an external digital circuit module.
  • the external digital circuit module may be connected to the neural network circuit 110 through the PCIE bus 106 (not shown in FIG. 1 or FIG. 2).
  • Figure 5 is only a simple example and description of the neural network layer in the neural network model, and does not limit the specific operations of each layer of neural network.
  • the fourth layer 308 can also be a pooling operation.
  • the fifth layer 310 may also perform other neural network operations such as convolution operation or pooling operation.
  • Fig. 6 is a schematic diagram of a neural network system provided by an embodiment of the present application.
  • the neural network system may include a plurality of memristor arrays.
  • the first memristor array can realize the calculation of the fully connected layer in the neural network.
  • the weight of the fully connected layer in the neural network can be stored in the first memristor array, and the conductance value of each memristor unit in the memristor array can be used to indicate the weight of the fully connected layer and to realize the multiply-accumulate calculation of the fully connected layer in the neural network.
  • the fully connected layer in the neural network can also correspond to multiple memristor arrays, and the multiple memristor arrays jointly complete the calculation of the fully connected layer. This application does not specifically limit this.
  • the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) shown in FIG. 6 can realize the calculation of the convolutional layer in the neural network.
  • the convolution kernel has a new input after each sliding-window step, so a complete calculation of the convolutional layer needs to process many different inputs. Therefore, the parallelism at the neural network system level can be increased: the weights of the same position in the network are implemented by multiple groups of memristor arrays, thereby realizing parallel acceleration over different inputs. That is to say, the convolution weights of key positions are each implemented by multiple groups of memristor arrays.
  • each group of memristor arrays processes different input data, and the groups work in parallel with each other, thereby increasing convolution calculation efficiency and system performance.
  • a convolution kernel represents a feature extraction method in the neural network calculation process.
  • each pixel in the output image is a weighted average of pixels in a small area in the input image, where the weights are defined by a function; this function is called the convolution kernel.
  • the convolution kernel sequentially traverses the input feature map (feature map) according to a certain stride to generate the output data after feature extraction (also called the output feature map). Therefore, the size of the convolution kernel (kernel size) is also used to indicate the size of the data amount for a calculation node in the neural network system to perform one calculation.
  • the convolution kernel can be represented by a matrix of real numbers.
  • Figure 8(a) shows a convolution kernel with 3 rows and 3 columns, and each element in the convolution kernel represents a weight value.
  • a neural network layer can include multiple convolution kernels. In the neural network calculation process, the input data and the convolution kernel can be multiplied and added.
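As a rough illustration of the sliding-window multiply-add described above, the following Python sketch slides one kernel over a single-channel feature map. The function name `conv2d` and the example data are assumptions made here; padding, multiple channels, and multiple kernels are left out, and the kernel is applied without flipping, as is common in neural network frameworks.

```python
def conv2d(feature_map, kernel, stride=1):
    """Slide `kernel` over `feature_map` and return the output feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(feature_map) - k_rows) // stride + 1
    out_cols = (len(feature_map[0]) - k_cols) // stride + 1
    output = []
    for r in range(out_rows):
        out_row = []
        for c in range(out_cols):
            acc = 0
            # Multiply each kernel weight by the pixel under it, then add up.
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += feature_map[r * stride + i][c * stride + j] * kernel[i][j]
            out_row.append(acc)
        output.append(out_row)
    return output


# Hypothetical 4 x 4 input and a 3 x 3 kernel of ones, stride 1.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
ones = [[1, 1, 1]] * 3
print(conv2d(image, ones))
```

Every window position is an independent multiply-add over the same weights, which is what makes it possible to hand different windows to different arrays that hold copies of the kernel.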
  • the input data of multiple groups of memristor arrays calculated in parallel may include the output data of other memristor arrays or external input data.
  • the output data of the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) can be used as the input data of the shared first memristor array. That is, the input data of the first memristor array may include output data of multiple sets of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array).
  • the input data and output data structures of the multiple groups of memristor arrays that are calculated in parallel may have various structures, which are not specifically limited in this application.
  • FIG. 7 is a schematic diagram of the input data and output data of multiple groups of memristor arrays for parallel computing provided by an embodiment of the present application.
  • the input data of the multiple groups of memristor arrays calculated in parallel (for example, the second memristor array, the third memristor array, and the fourth memristor array) are combined to form one complete input data, and the output data of the multiple groups of memristor arrays calculated in parallel are combined to form one complete output data.
  • the input data of the second memristor array is data 1
  • the input data of the third memristor array is data 2
  • the input data of the fourth memristor array is data 3.
  • one complete input data includes the combination of data 1, data 2, and data 3.
  • the output data of the second memristor array is result 1
  • the output data of the third memristor array is result 2
  • the output data of the fourth memristor array is result 3.
  • one complete output data includes the combination of result 1, result 2, and result 3.
  • an input picture can be split into different parts, which are input into multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) for parallel calculation.
  • the combination of the output results of the multiple groups of memristor arrays can be used as the complete output data corresponding to the input picture.
  • Fig. 8(b) shows a schematic diagram of a possible picture splitting.
  • an image is divided into three parts and sent to three groups of parallel accelerated arrays for calculation.
  • the first part is sent to the second memristor array shown in Fig. 8(a), and "result 1" corresponding to mode one in Fig. 7 is obtained, which corresponds to the output result of the second memristor array in the complete output.
  • Similar processing can be done for the second and third parts. The overlap between adjacent parts is determined according to the size of the convolution kernel and the sliding-window stride (for example, there are 2 lines of overlap between adjacent parts in this example), so that the output results of the three groups of arrays can form one complete output.
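One way to work out such a split is sketched below in Python. The helper `split_rows`, its parameters, and the example sizes are assumptions for illustration only: it divides the output rows evenly among the groups and derives the input rows each group needs, so adjacent parts overlap by the kernel size minus the stride.

```python
def split_rows(height, kernel_size, stride, parts):
    """Return (start, end) input-row ranges, end exclusive, one per part."""
    out_rows = (height - kernel_size) // stride + 1
    if parts < 1 or parts > out_rows:
        raise ValueError("parts must be between 1 and the number of output rows")
    base, extra = divmod(out_rows, parts)
    ranges = []
    first_out = 0
    for p in range(parts):
        count = base + (1 if p < extra else 0)
        last_out = first_out + count - 1
        # Output row k reads input rows k*stride .. k*stride + kernel_size - 1.
        ranges.append((first_out * stride, last_out * stride + kernel_size))
        first_out += count
    return ranges


# Hypothetical 11-row image, 3 x 3 kernel, stride 1, three groups of arrays.
print(split_rows(11, 3, 1, 3))
```

With a 3 x 3 kernel and a stride of 1, each pair of adjacent ranges shares 2 rows, which matches the 2 lines of overlap mentioned in the example above.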
  • in the update process, following the correspondence of the forward calculation, the residual values of the corresponding neurons and the first part of the input are used to update the second memristor array in situ.
  • the array updates of the second and third groups are similar. For the specific update process, please refer to the description below, which will not be repeated here.
  • the input data of each of the multiple groups of memristor arrays calculated in parallel (for example, the second memristor array, the third memristor array, and the fourth memristor array) is a complete input data, and the output data of each of the multiple groups of memristor arrays calculated in parallel is a complete output data.
  • the input data of the second memristor array is data 1.
  • the data 1 is a complete input data
  • the output data is a result 1
  • the result 1 is a complete output data.
  • the input data of the third memristor array is data 2.
  • the data 2 is a complete input data
  • the output data is result 2, which is a complete output data.
  • the input data of the fourth memristor array is data 3.
  • the data 3 is a complete input data, and the output data is a result 3, and the result 3 is a complete output data.
  • a plurality of different complete input data can be input into multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) for parallel calculation.
  • the output results of the multiple groups of memristor arrays respectively correspond to a complete output data.
  • because the storage and calculation units in a neural network array have some non-ideal characteristics, such as device fluctuations, conductance drift, and limited array yield, they cannot realize the weights losslessly; as a result, the overall performance of the neural network system degrades and its recognition rate decreases.
  • the technical solutions provided by the embodiments of the present application can improve the performance of the neural network system and the accuracy of recognition.
  • the technical solutions of the embodiments of the present application can be applied to various neural networks, such as convolutional neural networks (CNN), recurrent neural networks, which are widely used in processing natural language and speech, and deep neural networks that combine convolutional neural networks and recurrent neural networks.
  • the processing process of the convolutional neural network is similar to the animal's visual system, making it very suitable for the field of image recognition.
  • Convolutional neural networks can be applied to various image recognition fields such as security, computer vision, safe cities, etc., and can also be applied to speech recognition, search engines, machine translation, and so on. In practical applications, the huge amount of parameters and calculations pose great challenges to the application of neural networks in scenarios with high real-time performance and low power consumption.
  • FIG. 10 is a schematic flowchart of a data processing method in a neural network system provided by an embodiment of the present application. As shown in FIG. 10, the method may include steps 1010-1030, and steps 1010-1030 will be described in detail below.
  • Step 1010 Input the training data into the neural network system to obtain the first output data.
  • the neural network system for parallel acceleration in the embodiment of the present application may include multiple neural network arrays, each neural network array of the multiple neural network arrays may include multiple storage and calculation units, and each storage and calculation unit is used to store the weight value of the neuron in the corresponding neural network.
  • Step 1020 Calculate the deviation between the first output data and the target output data.
  • the target output data may be an ideal value of the first output data actually output.
  • the deviation in the embodiment of the present application may be the difference between the first output data and the target output data, or the residual between the first output data and the target output data, or another form of loss function between the first output data and the target output data.
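Two of the possible forms of the deviation can be written down as a short Python sketch. The function names, the sign convention (target minus output), and the factor 0.5 in the loss are assumptions chosen here for illustration; the text does not fix them.

```python
def difference(output, target):
    """Element-wise deviation, taken here as target minus actual output."""
    return [t - o for o, t in zip(output, target)]


def squared_error_loss(output, target):
    """Half the sum of squared differences, one common loss function."""
    return 0.5 * sum((t - o) ** 2 for o, t in zip(output, target))


# Hypothetical first output data and target output data.
first_output = [0.25, 0.5]
target_output = [0.0, 1.0]
print(difference(first_output, target_output))
print(squared_error_loss(first_output, target_output))
```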
  • Step 1030 Adjust, according to the deviation, the weight value stored in at least one storage and calculation unit in some of the neural network arrays among the multiple neural network arrays in the parallel-accelerated neural network system.
  • the neural network arrays that are adjusted in the embodiment of the present application may be used to realize the calculation of some of the neural network layers in the neural network system.
  • the correspondence between the neural network array and the neural network layer may be a one-to-one relationship, or a one-to-many relationship, or a many-to-one relationship.
  • the first memristor array shown in FIG. 6 corresponds to the fully connected layer in the neural network, and is used to realize the calculation of the fully connected layer.
  • the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) shown in FIG. 6 correspond to the convolutional layers in the neural network and are used to realize the calculation of the convolutional layer.
  • the neural network layer is a logical layer concept, and a neural network layer refers to a neural network operation to be performed once.
  • the resistance value or the conductance value in the storage and calculation unit can be used to indicate the weight value in the neural network layer.
  • according to the calculated deviation, the resistance value or the conductance value of at least one storage and calculation unit in some of the neural network arrays among the plurality of neural network arrays can be adjusted or rewritten.
  • the updated value of the resistance value or the conductance value of the storage and calculation unit may be determined according to the deviation, and a fixed number of programming pulses may be applied to the storage and calculation unit according to the updated value.
  • alternatively, the updated value of the resistance value or the conductance value of the storage and calculation unit is determined according to the deviation, and programming pulses are applied to the storage and calculation unit in a read-while-write manner.
  • different numbers of programming pulses can also be applied according to the characteristics of different storage and calculation units, so as to realize the adjustment or rewriting of the resistance value or the conductance value thereof. The following will be described in conjunction with specific embodiments, which will not be repeated here.
  • the embodiment of the present application can use the deviation to adjust the resistance value or conductance value of the neural network array used to realize the fully connected layer among the multiple neural network arrays, or to adjust the resistance value or conductance value of the neural network arrays used to realize the convolutional layer among the multiple neural network arrays, or to adjust both at the same time.
  • the data of a training data set, such as pixel information of an input image, is input into the neural network.
  • the actual output value is obtained from the output of the last layer of the neural network.
  • the aim is for the actual output value of the neural network to be as close as possible to the prior knowledge of the training data.
  • Prior knowledge, also called ground truth or the ideal output value, generally includes the prediction result corresponding to the training data as provided by a person. Therefore, the residual value can be calculated by comparing the current actual output value with the ideal output value, based on the deviation between the two. Specifically, this may be done by calculating the partial derivative of the objective loss function.
  • the weight size that needs to be updated is calculated according to the residual value, so that the weight value stored in at least one storage unit in the neural network array can be updated according to the weight size that needs to be updated.
  • the square of the difference between the actual output value of the neural network and the ideal output value can be calculated, and the derivative of this squared difference with respect to the weights in the weight matrix can be taken to obtain the residual value.
  • the weight size that needs to be updated is determined by formula (1).
  • ΔW indicates the weight size that needs to be updated
  • η indicates the learning rate
  • N indicates that there are N sets of input data
  • V indicates the input data value of this layer
  • δ indicates the residual value of this layer.
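Formula (1) itself is not reproduced in this text. A form consistent with the quantities listed above would be the learning rate times the sum, over the N sets of input data, of the product of the input value and the residual value. The Python sketch below assumes exactly that form; the function name `weight_update`, the row and column layout (rows follow the residuals, columns follow the inputs), and the example values are assumptions rather than the formula as filed.

```python
def weight_update(inputs, residuals, learning_rate):
    """Accumulate an assumed update: learning_rate * sum over samples of V * residual.

    inputs[k] is the input vector of sample k (one value per column);
    residuals[k] is the residual vector of sample k (one value per row).
    """
    rows, cols = len(residuals[0]), len(inputs[0])
    delta_w = [[0.0] * cols for _ in range(rows)]
    for v, d in zip(inputs, residuals):
        for r in range(rows):
            for c in range(cols):
                delta_w[r][c] += learning_rate * v[c] * d[r]
    return delta_w


# Hypothetical batch of N = 2 samples for a layer with 2 inputs and 1 output.
V = [[1.0, 2.0], [0.0, 1.0]]
residual = [[0.5], [1.0]]
print(weight_update(V, residual, learning_rate=0.5))
```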
  • in the N×M array shown in FIG. 11, SL represents a source line and BL represents a bit line.
  • for the cumulative update weight obtained for the mth row and nth column of the layer, whether to update the weight value of the mth row and nth column of the layer can also be determined according to the following formula (2).
  • the threshold represents a preset threshold.
  • the threshold update rule shown in formula (2) is adopted, that is, a weight that does not meet the threshold requirement is not updated. Specifically, if ΔW m,n is greater than or equal to the preset threshold, the weight value of the mth row and nth column of the layer can be updated. If ΔW m,n is less than the preset threshold, the weight value of the mth row and nth column of the layer is not updated.
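The threshold rule can be sketched in Python as follows. The function name `apply_threshold_update` and the example values are assumptions; the comparison is implemented literally as stated above, and whether it is meant to apply to the magnitude of the update is not specified in the text, so no absolute value is taken here.

```python
def apply_threshold_update(weights, delta_w, threshold):
    """Return a copy of `weights` where only updates >= threshold are applied."""
    updated = [row[:] for row in weights]
    for m, row in enumerate(delta_w):
        for n, dw in enumerate(row):
            if dw >= threshold:
                updated[m][n] += dw  # update meets the threshold: apply it
            # otherwise the stored weight is left unchanged
    return updated


# Hypothetical 1 x 2 layer: only the first cell receives its update.
W = [[1.0, 1.0]]
dW = [[0.5, 0.25]]
print(apply_threshold_update(W, dW, threshold=0.5))
```

Skipping small updates reduces the number of programming operations on the array, at the cost of ignoring changes below the threshold.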
  • FIG. 12 is a schematic diagram of updating the weight value stored in the first memristor array that realizes the calculation of the fully connected layer among multiple memristor arrays.
  • the weights of the neural network layer trained in advance can be rewritten into multiple memristor arrays.
  • the weights of the corresponding neural network layers are stored in multiple memristor arrays.
  • the first memristor array can realize the calculation of the fully connected layer in the neural network
  • the weight of the fully connected layer in the neural network can be stored in the first memristor array.
  • the conductance value of each memristor unit in the memristor array can be used to indicate the weight of the fully connected layer and to realize the multiply-accumulate calculation of the fully connected layer in the neural network.
  • multiple groups of memristor arrays can realize the calculation of the convolutional layer in the neural network.
  • the weight of the same position in the convolutional layer is realized by multiple groups of memristor arrays, so as to realize parallel acceleration of different inputs.
  • an input picture is divided into different parts, which are input into multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) for parallel calculation.
  • the output results of the multiple groups of memristor arrays can be used as input data and input to the first memristor array, and the first output data is obtained through the first memristor array.
  • the residual value can be calculated from the first output data and the ideal output data according to the above method for calculating the residual value. Then, according to formula (1), the weight value stored by each memristor in the first memristor array that realizes the calculation of the fully connected layer is updated in situ.
  • FIG. 13 is another schematic diagram of updating the weight value stored in the first memristor array that realizes the calculation of the fully connected layer among multiple memristor arrays.
  • a plurality of different input data are respectively input to a plurality of groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) for parallel calculation.
  • the output results of the multiple groups of memristor arrays can be used as input data and input to the first memristor array, and the first output data is obtained through the first memristor array.
  • the residual value can be calculated from the first output data and the ideal output data according to the above method for calculating the residual value. Then, according to formula (1), the weight value stored by each memristor in the first memristor array that realizes the calculation of the fully connected layer is updated in situ.
  • FIG. 14 is a schematic diagram of updating the weight values stored in multiple groups of memristor arrays that realize the calculation of the convolutional layer.
  • an input picture is divided into different parts, which are input into multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) for parallel calculation.
  • the combination of the output results of the multiple groups of memristor arrays can be used as the complete output data corresponding to the input picture.
  • the residual value can be calculated from the output data and the ideal output data according to the above method for calculating the residual value. Then, according to formula (1), the weight value stored by each memristor in the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) that realize the convolutional layer calculation in parallel is updated in situ.
  • the residual value can be calculated based on the output values of the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) and the corresponding ideal output values, and the weight value stored in each memristor in the multiple groups of memristor arrays that realize the convolutional layer calculation in parallel is updated in situ according to the residual value.
  • the residual value can also be calculated according to the first output value of the first memristor array and the corresponding ideal output value, and the weight value stored by each memristor in the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) that realize the convolutional layer calculation in parallel is updated in situ according to the residual value. A detailed description will be given below in conjunction with FIG. 15.
  • FIG. 15 is a schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that realize the calculation of the convolutional layer according to the residual value.
  • the complete residual may be the residual value of the shared first memristor array, and the residual value is determined according to the output value of the first memristor array and the corresponding ideal output value.
  • the complete residual can be divided into multiple sub-residuals, for example, residual 1, residual 2, and residual 3.
  • Each sub-residual corresponds to the output data of each memristor array in the multiple groups of memristor arrays calculated in parallel.
  • the residual 1 corresponds to the output data of the second memristor array
  • the residual 2 corresponds to the output data of the third memristor array
  • the residual 3 corresponds to the output data of the fourth memristor array.
  • the weight value stored by each memristor in the memristor array is updated in situ according to formula (2).
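Dividing the complete residual into sub-residuals can be sketched in Python as below. The helper `split_residual` and the example sizes are assumptions, and so is the layout: the complete residual is taken to be ordered in the same way as the concatenated outputs of the parallel arrays.

```python
def split_residual(complete_residual, sizes):
    """Cut the complete residual into consecutive pieces of the given sizes."""
    if sum(sizes) != len(complete_residual):
        raise ValueError("sizes must add up to the length of the residual")
    parts, start = [], 0
    for size in sizes:
        parts.append(complete_residual[start:start + size])
        start += size
    return parts


# Hypothetical residual of length 6 shared by three arrays with 2 outputs each.
residual_1, residual_2, residual_3 = split_residual([1, 2, 3, 4, 5, 6], [2, 2, 2])
print(residual_1, residual_2, residual_3)
```

Each piece is then paired with the input data of the array whose output it corresponds to when that array is updated.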
  • FIG. 16 is another schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that realize the calculation of the convolutional layer.
  • a plurality of different input data are respectively input to a plurality of groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) for parallel calculation.
  • the output results of the multiple groups of memristor arrays respectively correspond to a complete output data.
  • the residual value can be calculated from the output data and the ideal output data according to the above method for calculating the residual value. Then, according to formula (1), the weight value stored by each memristor in the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) that realize the convolutional layer calculation in parallel is rewritten. There are many specific implementation methods.
  • the residual value can be calculated based on the output values of the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) and the corresponding ideal output values, and the weight value stored in each memristor in the multiple groups of memristor arrays that realize the convolutional layer calculation in parallel is updated in situ according to the residual value.
  • the residual value can also be calculated according to the first output value of the first memristor array and the corresponding ideal output value, and the weight value stored by each memristor in the multiple groups of memristor arrays (for example, the second memristor array, the third memristor array, and the fourth memristor array) that realize the convolutional layer calculation in parallel is updated in situ according to the residual value. A detailed description will be given below in conjunction with FIG. 17.
  • FIG. 17 is another schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that realize the calculation of the convolutional layer according to the residual value.
  • the complete residual may be the residual value of the shared first memristor array, and the residual value is determined according to the output value of the first memristor array and the corresponding ideal output value. Since each group of memristor arrays participating in parallel acceleration processes a complete output result, each group of memristor arrays can be updated based on the corresponding complete residual data. Assuming that the above-mentioned complete residual is obtained from output result 1 of the second memristor array, the weight value stored by each memristor in the second memristor array can be updated in situ using formula (1), according to the complete residual and input data 1 of the second memristor array.
  • the weight values stored in the arrays of the stages preceding the multiple memristor arrays that implement the convolutional layer calculation in parallel can also be adjusted, and the residuals of the neurons in each layer can be calculated by backpropagation.
  • the input data of these arrays can be the output data of the memristor arrays of an earlier stage, or original external input data such as images, text, or voice; the output data of these arrays is used as the input data of the multiple memristor arrays that implement the convolutional layer calculation in parallel.
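How a residual might be carried back to a previous stage can be illustrated with a minimal Python sketch. This assumes a purely linear layer, so the derivative of any activation function is left out, and the function name `backpropagate_residual` and the example values are hypothetical.

```python
def backpropagate_residual(weights, residual):
    """Return the residual at a layer's inputs from the residual at its outputs.

    weights[r][c] connects input c to output r; each input collects the
    residuals of all outputs it feeds, weighted by the connecting weight.
    """
    rows, cols = len(weights), len(weights[0])
    return [sum(weights[r][c] * residual[r] for r in range(rows))
            for c in range(cols)]


# Hypothetical 2 x 2 layer with a residual of 1 at both outputs.
print(backpropagate_residual([[1, 2], [3, 4]], [1, 1]))
```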
  • the set operation is used to adjust the conductance of the memristor unit from low to high conductance
  • the reset operation is used to adjust the conductance of the memristor unit from high to low conductance
  • the target conductance range of the target memristor unit can represent the target weight W 11 .
  • the set operation can be used to increase the conductance of the target memristor unit.
  • the SL 11 can be used to apply a voltage to the gate of the transistor in the target memristor unit that needs to be adjusted to turn on the transistor, so that the target memristor unit is in the gated state.
  • the SL connected to the target memristor unit and other BLs in the cross array are also grounded, and then a set pulse is applied to the BL where the target memristor unit is located to adjust the conductance of the target memristor unit.
  • the reset operation can be used to reduce the conductance of the target memristor unit.
  • the voltage can be applied to the gate of the transistor in the target memristor unit to be adjusted through the SL, so that the target memristor unit is in the gated state.
  • the BL connected to the target memristor unit and other SLs in the cross array are grounded. Then, a reset pulse is applied to the SL where the target memristor unit is located to adjust the conductance of the target memristor unit.
  • a fixed number of programming pulses can be applied to the target memristor cell.
  • programming pulses can be applied to the target memristor cell by means of reading and writing.
  • different numbers of programming pulses can be applied to different memristor units to adjust their conductance values.
  • the embodiment of the present application may write target data to the target memristor unit based on an incremental step program pulse (ISPP) strategy.
  • the ISPP strategy usually adjusts the conductance of the target memristor unit by means of "read verification-correction", so that the conductance of the target memristor unit is finally adjusted to the target conductance corresponding to the target data.
  • device 1, device 2, ... are the target memristor cells in the selected memristor array.
  • a read pulse (V read ) can be applied to the target memristor unit to read the current conductance of the target memristor unit.
  • a set pulse (V set ) can be applied to the target memristor unit to increase the conductance of the target memristor unit.
  • the adjusted conductance is then read through the read pulse (V read ). If the current conductance is still less than the target conductance, the set pulse (V set ) is applied to the target memristor unit again, so that the conductance of the target memristor unit is adjusted toward the target conductance.
  • device 1, device 2, ... are the target memristor cells in the selected memristor array.
  • a read pulse (V read ) can be applied to the target memristor unit to read the current conductance of the target memristor unit.
  • a reset pulse (V reset ) can be applied to the target memristor unit to reduce the conductance of the target memristor unit.
  • the adjusted conductance is then read through the read pulse (V read ). If the current conductance is still greater than the target conductance, the reset pulse (V reset ) is applied to the target memristor unit again, so that the conductance of the target memristor unit is adjusted toward the target conductance.
  • V read may be a read voltage pulse less than the threshold voltage
  • V set or V reset may be a write voltage pulse greater than the threshold voltage
  • the conductance of the target memristor unit can be finally adjusted to the target conductance corresponding to the target data through the above-mentioned read-and-write manner.
  • the termination condition may be that the conductance increase of all selected devices in the row meets the requirement.
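The read-verify-correct procedure described above can be sketched in software as follows. This is only an illustrative simulation: the `SimulatedCell` class, its noisy fixed-size pulse step, and the `tolerance` and `max_pulses` parameters are assumptions and do not come from the application. A real ISPP scheme may also increase the pulse amplitude from step to step, which this sketch leaves out.

```python
import random


class SimulatedCell:
    """Toy memristor cell (hypothetical): each pulse shifts the conductance by a noisy step."""

    def __init__(self, conductance, step=1.0, noise=0.2, seed=0):
        self.g = conductance
        self.step = step
        self.noise = noise
        self.rng = random.Random(seed)

    def read(self):
        # Models a read pulse (V read): returns the conductance without changing it.
        return self.g

    def pulse(self, direction):
        # direction = +1 models a set pulse (V set), -1 models a reset pulse (V reset).
        delta = self.step * (1.0 + self.rng.uniform(-self.noise, self.noise))
        self.g += direction * delta


def program_cell(cell, target, tolerance, max_pulses=100):
    """Read, compare with the target, and correct until within tolerance.

    Returns the number of programming pulses that were applied.
    """
    pulses = 0
    while pulses < max_pulses:
        g = cell.read()
        if abs(g - target) <= tolerance:
            break  # termination condition: conductance is close enough to the target
        cell.pulse(+1 if g < target else -1)
        pulses += 1
    return pulses
```

For the loop to settle, the tolerance should be larger than the largest possible overshoot of a single pulse; with the default step and noise above, a tolerance of 0.75 satisfies this.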
  • Fig. 22 is a schematic flowchart of a neural network training process provided by an embodiment of the present application. As shown in FIG. 22, the method may include steps 2210-2255, and steps 2210-2255 will be described in detail below.
  • Step 2210 Determine the network layer that needs to be accelerated according to the neural network information.
  • the network layer that needs to be accelerated may be determined according to one or more of the following: the number of layers of the neural network, parameter information, the size of the training data set, and so on.
  • Step 2215 Perform offline training on an external personal computer (PC) to determine the initial training weight.
  • the weight parameters on the neurons of the neural network can be trained through steps such as forward calculation and reverse calculation on the external PC to determine the initial training weight.
  • Step 2220 Map the initial training weights to a neural network array that implements parallel accelerated network layer calculations and a neural network array that implements non-parallel accelerated network layer calculations in the storage-calculation integrated architecture.
  • the initial training weights can be respectively mapped to at least one storage and calculation unit of the multiple neural network arrays in the storage-computing integrated architecture according to the method shown in FIG. 3, so that the matrix multiply-add operation of the input data and the configured weights can be realized through the neural network arrays.
  • the plurality of neural network arrays may include a neural network array that implements non-parallel accelerated network layer calculations and a neural network array that implements parallel accelerated network layer calculations.
  • Step 2225 Input a set of training data into multiple neural network arrays of the storage-computing integrated architecture, and obtain the output result of the forward calculation based on the actual hardware of the storage-computing integrated architecture.
  • Step 2230 Judge whether the accuracy of the neural network system meets the requirement or whether it reaches the preset number of training times.
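  • if the accuracy of the neural network system meets the requirement or the preset number of training iterations is reached, step 2235 may be executed.
  • otherwise, step 2240 may be performed.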
  • Step 2235 The training ends.
  • Step 2240 Determine whether the training data is the last set of training data.
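  • if the training data is the last set of training data, step 2245 and step 2255 can be executed.
  • if the training data is not the last set of training data, steps 2250-2255 can be performed.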
  • Step 2245 Reload the training data.
  • Step 2250 Based on the proposed training method of the parallel training storage-calculation integrated system, the conductance weights of the parallel acceleration array or other arrays are trained and updated on-chip after calculations such as back propagation.
  • Step 2255 Load the next set of training data.
  • the loaded training data is input to multiple neural network arrays of the storage-computing integrated architecture, and the output result of the forward calculation based on the actual hardware of the storage-computing integrated architecture is obtained.
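The control flow of steps 2225 to 2255 can be sketched as a simple loop. This is a hedged reading of the flowchart description, not the application's implementation: the function name `train_on_chip` and the callbacks `forward`, `accuracy_ok`, and `update_weights` are hypothetical placeholders, and the branch taken at each decision step is an assumption based on the step titles.

```python
def train_on_chip(batches, forward, accuracy_ok, update_weights, max_iterations):
    """Run the training loop and return the number of forward passes performed.

    batches        -- list of training data sets
    forward        -- callable: hardware forward calculation (step 2225)
    accuracy_ok    -- callable: True when the output accuracy meets the requirement
    update_weights -- callable: on-chip weight update after backpropagation (step 2250)
    """
    if not batches:
        raise ValueError("at least one set of training data is required")
    iterations = 0
    index = 0
    while True:
        output = forward(batches[index])                 # step 2225
        iterations += 1
        if accuracy_ok(output) or iterations >= max_iterations:  # step 2230
            return iterations                            # step 2235: training ends
        if index == len(batches) - 1:                    # step 2240: last set?
            index = 0                                    # steps 2245 and 2255: reload, load next
        else:
            update_weights(batches[index], output)       # step 2250
            index += 1                                   # step 2255: load next set
```

Passing the hardware operations in as callbacks keeps the sketch independent of any particular array model.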
  • the size of the sequence numbers of the above-mentioned processes does not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the device embodiments consider the implementation of the solution of this application from the perspective of products and equipment. Some of the content of the device embodiments and of the aforementioned method embodiments of this application corresponds to or complements each other, and can be applied generally when it comes to the realization of the solution and the support for the scope of the claims.
  • FIG. 23 is a schematic structural diagram of a neural network system 2300 provided by an embodiment of the present application. It should be understood that the neural network system 2300 shown in FIG. 23 is only an example, and the device in the embodiment of the present application may further include other modules or units. It should be understood that the neural network system 2300 can execute each step in the methods of FIG. 10 to FIG. 22; to avoid repetition, details are not described here.
  • the neural network system 2300 may include:
  • the processing module 2310 is configured to input training data into a neural network system to obtain first output data, wherein the neural network system includes a plurality of neural network arrays, each neural network array of the plurality of neural network arrays includes multiple storage and calculation units, and each storage and calculation unit is used to store the weight value of the neuron in the corresponding neural network;
  • the calculation module 2320 is configured to calculate the deviation between the first output data and the target output data
  • the adjustment module 2330 is configured to adjust, according to the deviation, the weight value stored in at least one of the storage and calculation units in some of the plurality of neural network arrays, wherein the partial neural network arrays are used to realize the calculation of some of the neural network layers in the neural network system.
  • the multiple neural network arrays include a first neural network array and a second neural network array, and the input data of the first neural network array includes the output data of the second neural network array.
  • the first neural network array includes a neural network array for realizing a fully connected layer calculation in a neural network.
  • the adjustment module 2330 is specifically configured to:
  • adjust, according to the input value of the first neural network array and the deviation, the weight value stored in at least one of the storage and calculation units in the first neural network array.
  • the multiple neural network arrays further include a third neural network array, and the third neural network array and the second neural network array are used to implement the calculation of the convolutional layer in the neural network in parallel.
  • the adjustment module 2330 is specifically configured to:
  • adjust, according to the input data of the second neural network array and the deviation, the weight value stored in at least one of the storage and calculation units in the second neural network array; and adjust, according to the input data of the third neural network array and the deviation, the weight value stored in at least one of the storage and calculation units in the third neural network array.
  • the adjustment module 2330 is specifically configured to:
  • divide the deviation into at least two sub-deviations, wherein the first sub-deviation of the at least two sub-deviations corresponds to the output data of the second neural network array, and the second sub-deviation of the at least two sub-deviations corresponds to the output data of the third neural network array;
  • adjust, according to the first sub-deviation and the input data of the second neural network array, the weight value stored in at least one of the storage and calculation units in the second neural network array; and adjust, according to the second sub-deviation and the input data of the third neural network array, the weight value stored in at least one of the storage and calculation units in the third neural network array.
  • the adjustment module 2330 is specifically configured to: determine the number of pulses according to the updated weight value in the storage and calculation unit, and rewrite, according to the number of pulses, the weight value stored in at least one of the storage and calculation units in the neural network array.
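One way to turn an updated weight value into a pulse count is sketched below. The application does not state how the count is derived, so this is purely an assumption: the function `pulses_for_update` and the parameter `weight_per_pulse` (the weight change produced by a single programming pulse) are hypothetical, and the sketch assumes an idealized device whose stored weight changes linearly with the number of pulses.

```python
def pulses_for_update(current_weight, updated_weight, weight_per_pulse):
    """Return (operation, count): which pulse type to apply and how many times.

    Assumes every pulse changes the stored weight by the same amount,
    which real memristors only approximate.
    """
    if weight_per_pulse <= 0:
        raise ValueError("weight_per_pulse must be positive")
    delta = updated_weight - current_weight
    count = round(abs(delta) / weight_per_pulse)
    if count == 0:
        return ("none", 0)  # change is smaller than half of one pulse step
    # A positive change needs set pulses (raise conductance), a negative one reset pulses.
    return ("set", count) if delta > 0 else ("reset", count)
```

Because real devices respond non-linearly, a count computed this way would normally be combined with a read-and-verify step rather than trusted blindly.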
  • the neural network system 2300 here is embodied in the form of functional modules.
  • the term "module" herein can be implemented in the form of software and/or hardware, which is not specifically limited.
  • a “module” can be a software program, a hardware circuit, or a combination of the two that realize the above-mentioned functions.
  • the software exists in the form of computer program instructions and is stored in the memory, and the processor can be used to execute the program instructions to implement the above method flow.
  • the processor may include but is not limited to at least one of the following computing devices that run software: a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller unit (MCU), or an artificial intelligence processor.
  • Each computing device may include one or more cores for executing software instructions for calculation or processing.
  • the processor can be a single semiconductor chip, or it can be integrated with other circuits to form a semiconductor chip.
  • the processor can be combined with other circuits (such as codec circuits, hardware acceleration circuits, or various bus and interface circuits) to form a system on chip (SoC), or can be integrated in an application-specific integrated circuit (ASIC) as a built-in processor of the ASIC. The ASIC integrated with the processor can be packaged separately or packaged together with other circuits.
  • the processor may also include necessary hardware accelerators, such as a field programmable gate array (FPGA), a programmable logic device (PLD), or a logic circuit that implements dedicated logic operations.
  • the hardware circuit may be implemented by a general-purpose central processing unit (CPU), a microcontroller unit (MCU), a microprocessor unit (MPU), a digital signal processor (DSP), or a system on chip (SoC); of course, it can also be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof, and it can run the necessary software or execute the above method flow without relying on software.
  • the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes one or more computer instructions or computer programs.
  • when the computer instructions or computer programs are loaded or executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive.
  • "at least one" refers to one or more, and "multiple" refers to two or more.
  • "at least one of the following items" or a similar expression refers to any combination of these items, including a single item or any combination of a plurality of items.
  • for example, at least one of a, b, or c can mean: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c can each be single or multiple.
  • the size of the sequence numbers of the above-mentioned processes does not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, a network device, or the like) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.


Abstract

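A method for data processing in a neural network system, and a neural network system. The method includes: inputting training data into the neural network system to obtain first output data, and adjusting, according to the deviation between the first output data and target output data, the weight value stored in at least one storage and calculation unit in some of the multiple neural network arrays of the parallel-accelerated neural network system. The partial neural network arrays are used to implement the calculation of some of the neural network layers in the neural network system. The method can improve the performance and the recognition accuracy of the neural network system.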

Description

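Method for data processing in a neural network system, and neural network system
Technical Field
This application relates to the field of neural networks, and more specifically, to a method for data processing in a neural network system and to a neural network system.
Background
Artificial intelligence (AI) refers to theories, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
In the AI field, deep learning is a learning technology based on deep artificial neural network (ANN) algorithms. The training process of a neural network is a data-centric task that requires computing hardware with high-performance, low-power processing capability.
A neural network system based on multiple neural network arrays can integrate storage and computing and can process deep learning tasks. For example, at least one storage and calculation unit in a neural network array can store the weight values of the corresponding neural network layer. Because of the network structure or the system architecture design, the processing speeds of the neural network arrays may be inconsistent. In this case, multiple neural network arrays can be used for parallel processing, computing jointly to accelerate the neural network array that is the speed bottleneck. However, non-ideal characteristics of the storage and calculation units in the neural network arrays participating in parallel acceleration, such as device fluctuation, conductance drift, and array yield, reduce the overall performance of the neural network system, so that the accuracy of the neural network system is low.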
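Summary
This application provides a method for data processing in a parallel-accelerated neural network system, and a neural network system, which can mitigate the impact of non-ideal device characteristics when parallel acceleration is used, and improve the performance and the recognition accuracy of the neural network system.
In a first aspect, a method for data processing in a neural network system is provided, including: in a neural network system using parallel acceleration, inputting training data into the neural network system to obtain first output data, where the neural network system includes multiple neural network arrays, each of the multiple neural network arrays includes multiple storage and calculation units, and each storage and calculation unit is used to store the weight value of a neuron in the corresponding neural network; calculating the deviation between the first output data and target output data; and adjusting, according to the deviation, the weight value stored in at least one storage and calculation unit in some of the multiple neural network arrays, where the partial neural network arrays are used to implement the calculation of some of the neural network layers in the neural network system.
In the above technical solution, the weight values stored in the storage and calculation units of some of the multiple neural network arrays can be adjusted and updated according to the deviation between the actual output data of the neural network arrays and the target output data. This makes the system tolerant of the non-ideal characteristics of the storage and calculation units and improves the recognition rate and performance of the system, thereby avoiding the performance degradation caused by those non-ideal characteristics.
In a possible implementation of the first aspect, the multiple neural network arrays include a first neural network array and a second neural network array, and the input data of the first neural network array includes the output data of the second neural network array.
In another possible implementation of the first aspect, the first neural network array includes a neural network array used to implement the fully connected layer calculation in the neural network.
In the above technical solution, adjusting and updating only the weight values stored in the storage and calculation units of the neural network array that implements the fully connected layer calculation is sufficient to tolerate the non-ideal characteristics of the storage and calculation units and to improve the recognition rate and performance of the system. This approach is low-cost, effective, and easy to implement.
In another possible implementation of the first aspect, the weight value stored in at least one storage and calculation unit in the first neural network array is adjusted according to the input value of the first neural network array and the deviation.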
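As an illustration of adjusting fully connected weights from the array input and the deviation, the sketch below applies a standard delta-rule (outer product) update. The formula actually used by the application is not reproduced in this section, so the rule, the function name `fc_weight_update`, and the `learning_rate` parameter are assumptions; the deviation is taken here as actual output minus target output.

```python
def fc_weight_update(weights, inputs, deviation, learning_rate):
    """Return updated weights W' with W'[i][j] = W[i][j] - lr * deviation[i] * inputs[j].

    weights   -- list of rows, one row per output neuron
    inputs    -- input values of the array, one per column
    deviation -- one deviation value per output neuron (output minus target)
    """
    return [
        [w - learning_rate * d * x for w, x in zip(row, inputs)]
        for row, d in zip(weights, deviation)
    ]
```

In hardware, each updated value would then be written back to the corresponding storage and calculation unit by programming pulses instead of being stored as a floating-point number.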
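In another possible implementation of the first aspect, the multiple neural network arrays further include a third neural network array, and the third neural network array and the second neural network array are used to implement the calculation of the convolutional layer in the neural network in parallel.
In another possible implementation of the first aspect, the weight value stored in at least one storage and calculation unit in the second neural network array is adjusted according to the input data of the second neural network array and the deviation; and the weight value stored in at least one storage and calculation unit in the third neural network array is adjusted according to the input data of the third neural network array and the deviation.
In the above technical solution, the weight values stored in the storage and calculation units of the multiple neural network arrays that implement the convolutional layer calculation in parallel can also be adjusted and updated, which improves the adjustment precision and thereby the accuracy of the output of the neural network system.
In another possible implementation of the first aspect, the deviation is divided into at least two sub-deviations, where the first sub-deviation of the at least two sub-deviations corresponds to the output data of the second neural network array, and the second sub-deviation of the at least two sub-deviations corresponds to the output data of the third neural network array; the weight value stored in at least one storage and calculation unit in the second neural network array is adjusted according to the first sub-deviation and the input data of the second neural network array; and the weight value stored in at least one storage and calculation unit in the third neural network array is adjusted according to the second sub-deviation and the input data of the third neural network array.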
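Dividing the deviation into sub-deviations can be sketched as slicing a flat deviation vector by the output size of each parallel array. This assumes that the complete output is the concatenation, in order, of the outputs of the parallel arrays; the function name `split_deviation` and the flat-list representation are illustrative choices, not taken from the application.

```python
def split_deviation(deviation, output_sizes):
    """Slice a flat deviation vector into one sub-deviation per parallel array.

    deviation    -- deviation values for the complete output
    output_sizes -- number of output values produced by each parallel array, in order
    """
    if sum(output_sizes) != len(deviation):
        raise ValueError("output sizes must cover the whole deviation")
    parts, start = [], 0
    for size in output_sizes:
        parts.append(deviation[start:start + size])
        start += size
    return parts
```

Each sub-deviation is then paired with the input data of its own array when that array's weights are adjusted.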
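In another possible implementation of the first aspect, the number of pulses is determined according to the updated weight value in the storage and calculation unit, and the weight value stored in at least one storage and calculation unit in the neural network array is rewritten according to the number of pulses.
In a second aspect, a neural network system is provided, including:
a processing module, configured to input training data into the neural network system to obtain first output data, where the neural network system includes multiple neural network arrays, each of the multiple neural network arrays includes multiple storage and calculation units, and each storage and calculation unit is used to store the weight value of a neuron in the corresponding neural network;
a calculation module, configured to calculate the deviation between the first output data and target output data; and
an adjustment module, configured to adjust, according to the deviation, the weight value stored in at least one storage and calculation unit in some of the multiple neural network arrays, where the partial neural network arrays are used to implement the calculation of some of the neural network layers in the neural network system.
In a possible implementation of the second aspect, the multiple neural network arrays include a first neural network array and a second neural network array, and the input data of the first neural network array includes the output data of the second neural network array.
In another possible implementation of the second aspect, the first neural network array includes a neural network array used to implement the fully connected layer calculation in the neural network.
In another possible implementation of the second aspect, the adjustment module is specifically configured to:
adjust, according to the input value of the first neural network array and the deviation, the weight value stored in at least one storage and calculation unit in the first neural network array.
In another possible implementation of the second aspect, the multiple neural network arrays further include a third neural network array, and the third neural network array and the second neural network array are used to implement the calculation of the convolutional layer in the neural network in parallel.
In another possible implementation of the second aspect, the adjustment module is specifically configured to:
adjust, according to the input data of the second neural network array and the deviation, the weight value stored in at least one storage and calculation unit in the second neural network array; and adjust, according to the input data of the third neural network array and the deviation, the weight value stored in at least one storage and calculation unit in the third neural network array.
In another possible implementation of the second aspect, the adjustment module is specifically configured to:
divide the deviation into at least two sub-deviations, where the first sub-deviation of the at least two sub-deviations corresponds to the output data of the second neural network array, and the second sub-deviation of the at least two sub-deviations corresponds to the output data of the third neural network array;
adjust, according to the first sub-deviation and the input data of the second neural network array, the weight value stored in at least one storage and calculation unit in the second neural network array; and
adjust, according to the second sub-deviation and the input data of the third neural network array, the weight value stored in at least one storage and calculation unit in the third neural network array.
In another possible implementation of the second aspect, the adjustment module is specifically configured to: determine the number of pulses according to the updated weight value in the storage and calculation unit, and rewrite, according to the number of pulses, the weight value stored in at least one storage and calculation unit in the neural network array.
The beneficial effects of the second aspect and of any possible implementation of the second aspect correspond to those of the first aspect and of any possible implementation of the first aspect, and are not repeated here.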
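In a third aspect, a neural network system is provided, including a processor and a memory, where the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that the neural network system executes the method provided in the first aspect or any possible implementation of the first aspect.
Optionally, in a specific implementation, the number of processors is not limited. The processor may be a general-purpose processor, and may be implemented by hardware or by software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor that is implemented by reading software code stored in the memory. The memory may be integrated in the processor, or may be located outside the processor and exist independently.
In a fourth aspect, a chip is provided, on which the neural network system according to the second aspect or any possible implementation of the second aspect is disposed.
The chip includes a processor and a data interface, where the processor reads, through the data interface, instructions stored in a memory to execute the method in the first aspect or any possible implementation of the first aspect. In a specific implementation, the chip may be implemented in the form of a central processing unit (CPU), a microcontroller unit (MCU), a microprocessor unit (MPU), a digital signal processor (DSP), a system on chip (SoC), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or a programmable logic device (PLD).
In a fifth aspect, a computer program product is provided. The computer program product includes computer program code, and when the computer program code runs on a computer, the computer is caused to execute the method in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer program code, and when the computer program code runs on a computer, the computer is caused to execute the method in the first aspect or any possible implementation of the first aspect. Such computer-readable storage includes but is not limited to one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), Flash memory, electrically erasable PROM (EEPROM), and hard drive.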
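Brief Description of Drawings
FIG. 1 is a schematic structural diagram of a neural network system 100 provided by this application.
FIG. 2 is a schematic structural diagram of another neural network system 200 provided by this application.
FIG. 3 is a schematic diagram of the mapping relationship between a neural network and a neural network array.
FIG. 4 is a schematic diagram of a possible weight matrix provided by this application.
FIG. 5 is a schematic diagram of a possible neural network model.
FIG. 6 is a schematic diagram of a neural network system provided by this application.
FIG. 7 is a schematic structural diagram of the input data and output data of multiple groups of memristor arrays computing in parallel, provided by this application.
FIG. 8(a) shows multiple groups of memristor arrays computing in parallel to accelerate the processing of input data, provided by this application.
FIG. 8(b) is a schematic diagram of a specific way of splitting data, provided by this application.
FIG. 9 shows another arrangement of multiple groups of memristor arrays computing in parallel to accelerate the processing of input data, provided by this application.
FIG. 10 is a schematic flowchart of a method for data processing in a neural network system provided by this application.
FIG. 11 is a schematic diagram of a forward operation and a backward operation provided by this application.
FIG. 12 is a schematic diagram, provided by this application, of updating the weight values stored in the first memristor array, which implements the fully connected layer calculation among multiple memristor arrays.
FIG. 13 is another schematic diagram, provided by this application, of updating the weight values stored in the first memristor array, which implements the fully connected layer calculation among multiple memristor arrays.
FIG. 14 is a schematic diagram of updating the weight values stored in multiple groups of memristor arrays that implement convolutional layer calculations.
FIG. 15 is a schematic diagram of updating, according to residual values, the weight values stored in multiple groups of memristor arrays that implement convolutional layer calculations.
FIG. 16 is another schematic diagram of updating the weight values stored in multiple groups of memristor arrays that implement convolutional layer calculations.
FIG. 17 is another schematic diagram of updating, according to residual values, the weight values stored in multiple groups of memristor arrays that implement convolutional layer calculations.
FIG. 18 is a schematic diagram, provided by this application, of increasing the weight value stored in at least one storage and calculation unit in a neural network array.
FIG. 19 is a schematic diagram, provided by this application, of decreasing the weight value stored in at least one storage and calculation unit in a neural network array.
FIG. 20 is a schematic diagram, provided by this application, of increasing the weight value stored in at least one storage and calculation unit in a neural network array by reading while writing.
FIG. 21 is a schematic diagram, provided by this application, of decreasing the weight value stored in at least one storage and calculation unit in a neural network array by reading while writing.
FIG. 22 is a schematic flowchart of a neural network training process provided by an embodiment of this application.
FIG. 23 is a schematic structural diagram of a neural network system 2300 provided by an embodiment of this application.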
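Detailed Description
The technical solutions in this application are described below with reference to the accompanying drawings.
Artificial intelligence (AI) refers to theories, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
In the AI field, deep learning is a learning technology based on deep artificial neural network (ANN) algorithms. An artificial neural network, referred to as a neural network (NN) or neural-like network for short, is, in the fields of machine learning and cognitive science, a mathematical or computational model that imitates the structure and function of a biological neural network (the central nervous system of animals, especially the brain) and is used to estimate or approximate functions. Artificial neural networks may include convolutional neural networks (CNN), multilayer perceptrons (MLP), recurrent neural networks (RNN), and other neural networks.
The training process of a neural network is the process of learning the parameter matrices. Its ultimate goal is to obtain the parameter matrix of each layer of neurons of the trained neural network (the parameter matrix of each layer includes the weight corresponding to each neuron in that layer). The parameter matrices formed by the trained weights can extract pixel information from an image to be inferred that is input by the user, helping the neural network to infer correctly on that image, so that the predicted value output by the trained neural network is as close as possible to the prior knowledge of the training data.
It should be understood that prior knowledge is also called the ground truth, and generally includes the true results, provided by the user, that correspond to the training data.
The training process of the neural network described above is a data-centric task that requires computing hardware with high-performance, low-power processing capability. Computing based on the traditional von Neumann architecture separates the storage unit from the computing unit and therefore requires a large amount of data movement, so it cannot achieve energy-efficient processing.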
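The system architecture of this application is described below with reference to FIG. 1 and FIG. 2.
FIG. 1 is a schematic structural diagram of a neural network system 100 provided by an embodiment of this application. As shown in FIG. 1, the neural network system 100 may include a host 105 and a neural network circuit 110.
The neural network circuit 110 is connected to the host 105 through a host interface. The host interface may include a standard host interface and a network interface. For example, the host interface may include a peripheral component interconnect express (PCIE) interface.
As an example, as shown in FIG. 1, the neural network circuit 110 may be connected to the host 105 through a PCIE bus 106. Data is therefore input to the neural network circuit 110 through the PCIE bus 106, and the data processed by the neural network circuit 110 is received through the PCIE bus 106. The host 105 can also monitor the working state of the neural network circuit 110 through the host interface.
The host 105 may include a processor 1052 and a memory 1054. It should be noted that, in addition to the components shown in FIG. 1, the host 105 may also include other components such as a communication interface and a magnetic disk used as external storage, which is not limited here.
The processor 1052 is the computing core and control unit of the host 105. The processor 1052 may include multiple processor cores. The processor 1052 may be a very large scale integrated circuit. An operating system and other software programs are installed in the processor 1052, so that the processor 1052 can access the memory 1054, caches, magnetic disks, and peripheral devices (such as the neural network circuit in FIG. 1). It can be understood that, in the embodiments of this application, a core in the processor 1052 may be, for example, a central processing unit (CPU), or another application-specific integrated circuit (ASIC).
It should be understood that the processor 1052 in the embodiments of this application may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 1054 is the main memory of the host 105. The memory 1054 is connected to the processor 1052 through a double data rate (DDR) bus. The memory 1054 is usually used to store the software running in the operating system, input and output data, and information exchanged with external storage. To improve the access speed of the processor 1052, the memory 1054 needs to offer fast access. In a traditional computer system architecture, dynamic random access memory (DRAM) is usually used as the memory 1054. The processor 1052 can access the memory 1054 at high speed through a memory controller (not shown in FIG. 1) and perform read and write operations on any storage unit in the memory 1054.
It should also be understood that the memory 1054 in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
The neural network circuit 110 shown in FIG. 1 may be a chip array composed of multiple neural network chips and multiple routers 120. For ease of description, the neural network chip 115 is referred to as the chip 115 in the embodiments of this application. The multiple chips 115 are interconnected through the routers 120. For example, one chip 115 may be connected to one or more routers 120. The multiple routers 120 may form one or more network topologies, through which the chips 115 can transmit data and exchange information.
FIG. 2 is a schematic structural diagram of another neural network system 200 provided by an embodiment of this application. As shown in FIG. 2, the neural network system 200 may include a host 105 and a neural network circuit 210.
The neural network circuit 210 is connected to the host 105 through a host interface. As shown in FIG. 2, the neural network circuit 210 may be connected to the host 105 through the PCIE bus 106. The host 105 may include a processor 1052 and a memory 1054. For a detailed description of the host 105, refer to the description of FIG. 1; details are not repeated here.
The neural network circuit 210 shown in FIG. 2 may be a chip array composed of multiple chips 115, where the multiple chips 115 are attached to the PCIE bus 106. The chips 115 transmit data and exchange information through the PCIE bus 106.
The architectures of the neural network systems in FIG. 1 and FIG. 2 are only examples. Those skilled in the art can understand that, in practice, the neural network system may include more or fewer units than shown in FIG. 1 or FIG. 2. Alternatively, the modules, units, or circuits in the neural network system may be replaced by other modules, units, or circuits with similar functions, which is not limited in the embodiments of this application. For example, in some other examples, the neural network system may also be implemented by a graphics processing unit (GPU) or a field programmable gate array (FPGA) based on digital computing.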
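In some examples, the neural network circuit may be implemented by multiple neural network matrices that integrate storage and computing. Each of the multiple neural network matrices may include multiple storage and calculation units, and each storage and calculation unit is used to store the weight values of the neurons of a layer of the corresponding neural network, so as to implement the calculation of the neural network layer.
The storage and calculation unit is not specifically limited in the embodiments of this application, and may include but is not limited to: a memristor, static RAM (SRAM), NOR Flash, magnetic RAM (MRAM), a ferroelectric gate field-effect transistor (FeFET), and electrochemical RAM (ECRAM). The memristor may include but is not limited to: resistive random-access memory (ReRAM), conductive-bridging RAM (CBRAM), and phase-change memory (PCM).
For example, the neural network matrix is a ReRAM crossbar composed of ReRAM. The neural network system may include multiple ReRAM crossbars.
In the embodiments of this application, a ReRAM crossbar may also be called a memristor crossbar array, a ReRAM device, or ReRAM. A chip that includes one or more ReRAM crossbars may be called a ReRAM chip.
The ReRAM crossbar is a new non-von Neumann computing architecture. This architecture combines storage and computing, is flexibly configurable, and uses analog computing. It is expected to perform matrix-vector multiplication faster and with lower energy consumption than traditional computing architectures, and has broad application prospects in neural network computing.
Taking a ReRAM crossbar as an example of a neural network array, the following describes in detail, with reference to FIG. 3, how a neural network layer calculation is implemented through the ReRAM crossbar.
FIG. 3 is a schematic diagram of the mapping relationship between a neural network and a neural network array. The neural network 110 includes multiple neural network layers.
In the embodiments of this application, a neural network layer is a logical layer concept: one neural network layer means that one neural network operation is to be performed. The calculation of each neural network layer is implemented by computing nodes (which may also be called neurons). In practice, neural network layers may include convolutional layers, pooling layers, fully connected layers, and so on.
Those skilled in the art know that, when performing a neural network calculation (for example, a convolution), a computing node in the neural network system can combine the input data with the weights of the corresponding neural network layer. In a neural network system, weights are usually represented by a real-valued matrix, and each element of the weight matrix represents one weight value. A weight usually indicates how important the input data is to the output data. The weight matrix with m rows and n columns shown in FIG. 4 may be the weights of one neural network layer, and each element of the weight matrix represents one weight value.
The calculation of each neural network layer can be implemented by a ReRAM crossbar, and ReRAM has the advantage of integrating storage and computing. The weights can therefore be configured onto multiple ReRAM cells of the ReRAM crossbar before the calculation, so that the matrix multiply-add operation of the input data and the configured weights can be performed by the ReRAM crossbar.
It should be understood that a ReRAM cell in the embodiments of this application may also be called a memristor unit. Configuring a memristor unit with a weight before the calculation can be understood as storing, in the memristor unit, the weight value of a neuron in the corresponding neural network. Specifically, the weight value of the neuron may be indicated by the resistance or conductance of the memristor unit.
It should also be understood that, in practice, the mapping between ReRAM crossbars and neural network layers may be one-to-one or one-to-many. This is described in detail below with reference to the drawings and is not elaborated here.
For clarity, the following briefly describes how a ReRAM crossbar implements the matrix multiply-add operation.
It should be noted that FIG. 3 uses the first neural network layer of the neural network 110 as an example to describe the data processing. In a neural network system, the first neural network layer may be any layer of the neural network system. For ease of description, the first neural network layer may also be referred to as the "first layer".
The ReRAM crossbar 120 shown in FIG. 3 is an m×n crossbar array. The ReRAM crossbar 120 may include multiple memristor units (for example, G1,1, G1,2, and so on). The bit lines (BL) of the memristor units in each column are connected together, and the source lines (SL) of the memristor units in each row are connected together.
In the embodiments of this application, the weight of a neuron in the neural network can be represented by the conductance of a memristor. Specifically, as an example, each element of the weight matrix shown in FIG. 4 can be represented by the conductance of the memristor located at the intersection of a BL and an SL. For example, G1,1 in FIG. 3 represents the weight element W0,0 in FIG. 4, G1,2 in FIG. 3 represents the weight element W0,1 in FIG. 4, and so on.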
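A weight can be turned into a target conductance in many ways, and the application does not prescribe one. The sketch below assumes a simple linear mapping from a weight range onto a conductance range; the function name and all four range parameters are hypothetical. Signed weights are often represented in practice by a pair of devices, which this sketch does not model.

```python
def weight_to_conductance(w, w_min, w_max, g_min, g_max):
    """Linearly map a weight in [w_min, w_max] to a conductance in [g_min, g_max]."""
    if w_max <= w_min or g_max <= g_min:
        raise ValueError("ranges must be non-empty and increasing")
    if not w_min <= w <= w_max:
        raise ValueError("weight is outside the representable range")
    fraction = (w - w_min) / (w_max - w_min)   # position of w inside its range, 0..1
    return g_min + fraction * (g_max - g_min)
```

For example, with weights in [-1, 1] and conductances in [2, 20] (arbitrary units), a weight of 0 maps to the midpoint 11.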
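Different conductance values of memristor units represent different weights of the neurons in the neural network stored by those memristor units.
During a neural network calculation, the n input data Vi can be represented by the voltage values applied to the BLs of the memristors, for example V1, V2, V3, ..., Vn in FIG. 3. Because the input data is represented by voltages, the input data applied to the memristors and the weight values stored in the memristors undergo a dot-product operation, yielding the m output data shown in FIG. 3. The m output data can be represented by the currents of the SLs, for example I1, I2, ..., Im in FIG. 3.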
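The dot-product operation performed by the crossbar can be written out numerically. The sketch below models an ideal m×n array in which each cell contributes a current equal to its conductance times the applied voltage (Ohm's law) and the currents on each source line add up (Kirchhoff's current law). The function name `crossbar_mvm` is illustrative, and non-ideal effects such as wire resistance and device variation are ignored.

```python
def crossbar_mvm(conductances, voltages):
    """Ideal crossbar matrix-vector product: I[i] = sum over j of G[i][j] * V[j].

    conductances -- m rows of n conductance values (one row per source line)
    voltages     -- n input voltages (one per bit line)
    Returns the m source-line currents.
    """
    n = len(voltages)
    for row in conductances:
        if len(row) != n:
            raise ValueError("every row must have one conductance per input voltage")
    return [sum(g * v for g, v in zip(row, voltages)) for row in conductances]
```

With G = [[1.0, 2.0, 3.0], [0.5, 0.0, 1.0]] and V = [1.0, 0.5, 2.0], the two output currents are 8.0 and 2.5.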
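It should be understood that the voltage value applied to a memristor can be implemented in many ways, which is not specifically limited in the embodiments of this application. For example, it can be represented by the pulse amplitude of the voltage, by the pulse width of the voltage, by the number of voltage pulses, or by a combination of the number of voltage pulses and the pulse amplitude.
It should be noted that the above uses a single neural network array as an example to describe in detail how a neural network array performs the corresponding multiply-accumulate calculation of a neural network. In practice, multiple neural network arrays jointly perform the multiply-accumulate calculations required by a complete neural network.
One of the multiple neural network arrays may correspond to one neural network layer and be used to implement the calculation of that layer. Alternatively, multiple neural network arrays may correspond to one neural network layer and be used to implement the calculation of that layer. Alternatively, one of the multiple neural network arrays may correspond to multiple neural network layers and be used to implement the calculation of those layers.
The correspondence between neural network arrays and neural network layers is described in detail below with reference to FIG. 5 and FIG. 6.
For ease of description, a memristor array is used below as an example of a neural network array.
FIG. 5 is a schematic diagram of a possible neural network model. The neural network model may include multiple neural network layers.
In the embodiments of this application, a neural network layer is a logical layer concept: one neural network layer means that one neural network operation is to be performed. The calculation of each neural network layer is implemented by computing nodes. Neural network layers may include convolutional layers, pooling layers, fully connected layers, and so on.
As shown in FIG. 5, the neural network model may include n neural network layers (also called an n-layer neural network), where n is an integer greater than or equal to 2. FIG. 5 shows some of the neural network layers in the model. As shown in FIG. 5, the neural network model may include a first layer 302, a second layer 304, a third layer 306, a fourth layer 308, a fifth layer 310, and so on up to an n-th layer 312. The first layer 302 may perform a convolution operation; the second layer 304 may perform a pooling operation or an activation operation on the output data of the first layer 302; the third layer 306 may perform a convolution operation on the output data of the second layer 304; the fourth layer 308 may perform a convolution operation on the output of the third layer 306; and the fifth layer 310 may perform a summation on the output data of the second layer 304 and the output data of the fourth layer 308. The n-th layer 312 may perform the operation of a fully connected layer.
It should be understood that the pooling operation or the activation operation may be implemented by an external digital circuit module. Specifically, the external digital circuit module may be connected to the neural network circuit 110 through the PCIE bus 106 (not shown in FIG. 1 or FIG. 2).
It can be understood that FIG. 5 is only a simple example and illustration of the neural network layers in a neural network model and does not limit the specific operation of each layer. For example, the fourth layer 308 may instead perform a pooling operation, and the fifth layer 310 may instead perform another neural network operation such as a convolution or pooling operation.
FIG. 6 is a schematic diagram of a neural network system provided by an embodiment of this application. As shown in FIG. 6, the neural network system may include multiple memristor arrays, for example a first memristor array, a second memristor array, a third memristor array, and a fourth memristor array.
The first memristor array can implement the calculation of the fully connected layer in the neural network. Specifically, the weights of the fully connected layer can be stored in the first memristor array; the conductance of each memristor unit in the array can be used to indicate a weight of the fully connected layer and to carry out the multiply-accumulate calculation of the fully connected layer.
It should be noted that the fully connected layer in the neural network may also correspond to multiple memristor arrays, which jointly perform the calculation of the fully connected layer. This is not specifically limited in this application.
The multiple groups of memristor arrays shown in FIG. 6 (for example, the second, third, and fourth memristor arrays) can implement the calculation of the convolutional layer in the neural network. For a convolutional layer, there is a new input after every slide of the convolution kernel window, so a complete convolutional layer calculation has to process many different inputs. The parallelism at the level of the neural network system can therefore be increased by implementing the weights at the same position in the network with multiple groups of memristor arrays, so that different inputs are accelerated in parallel. In other words, the convolution weights at a critical position are implemented separately by multiple groups of memristor arrays. During the calculation, the groups of memristor arrays (for example, the second, third, and fourth memristor arrays) process different input data and work in parallel with one another, which improves the efficiency of the convolution calculation and the system performance.
It should be understood that a convolution kernel represents a way of extracting features during the neural network calculation. For example, when image processing is performed in a neural network system, given an input image, each pixel of the output image is a weighted average of the pixels in a small region of the input image, where the weights are defined by a function; this function is called the convolution kernel. During the calculation, the convolution kernel slides across the input feature map with a certain stride to produce the output data after feature extraction (also called the output feature map). The kernel size therefore also indicates the amount of data on which a computing node in the neural network system performs one calculation. Those skilled in the art know that a convolution kernel can be represented by a real-valued matrix. For example, FIG. 8(a) shows a convolution kernel with 3 rows and 3 columns, where each element of the kernel represents one weight value. In practice, one neural network layer may include multiple convolution kernels. During the neural network calculation, the input data and the convolution kernel can be combined by a multiply-add calculation.
The input data of the multiple groups of memristor arrays computing in parallel (for example, the second, third, and fourth memristor arrays) may include the output data of other memristor arrays or external input data, and the output data of these groups of memristor arrays may be used as the input data of the shared first memristor array. That is, the input data of the first memristor array may include the output data of the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays).
The input data and output data of the multiple groups of memristor arrays computing in parallel can be structured in various ways, which is not specifically limited in this application.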
下面结合图7-图9,对并行计算的多组忆阻器阵列的输入数据和输出数据的结构进行详细描述。
图7是本申请实施例提供的并行计算的多组忆阻器阵列的输入数据和输出数据的结构示意图。
一种可能的实现方式中,如图7所示的方式一,并行计算的多组忆阻器阵列(例如,第二忆阻器阵列,第三忆阻器阵列,第四忆阻器阵列)的输入数据组合在一起成为一个完整的输入数据,该并行计算的多组忆阻器阵列的输出数据组合在一起成为一个完整的输出数据。
例如,第二忆阻器阵列的输入数据为数据1,第三忆阻器阵列的输入数据为数据2,第四忆阻器阵列的输入数据为数据3,对于卷积层而言,一个完整的输入数据包括数据1、数据2以及数据3的组合。同样的,第二忆阻器阵列的输出数据为结果1,第三忆阻器阵列的输出数据为结果2,第四忆阻器阵列的输出数据为结果3,对于卷积层而言,一个完整的输出数据包括结果1、结果2、结果3的组合。
具体的,参见图8的(a),可以将一张输入图片拆分为不同的部分,并分别输入至多组忆阻器阵列(例如,第二忆阻器阵列,第三忆阻器阵列,第四忆阻器阵列)进行并行计算。该多组忆阻器阵列(例如,第二忆阻器阵列,第三忆阻器阵列,第四忆阻器阵列)的输出结果的组合可以作为该输入图片对应的完整的输出数据。
图8的(b)示出了一种可能的图片拆分的示意图。如图8的(b),将一张图像拆分成三部分分别送入三组并行加速的阵列中进行计算。第一部分送入图8的(a)所示的第二忆阻器阵列,得到图7方式一对应的“结果1”,对应于完整输出中第二忆阻器阵列的输出结果。第二部分和第三部分可以完成类似的处理。根据卷积核的大小和滑窗步长确定每部分之间的交叠部分(如本实例中每部分之间有2行交叠),实现三组阵列处理后的输出结果能共同构成完整的输出。在训练过程中,当得到该层的完整残差时,按照前向计算过程的对应关系,利用第二忆阻器阵列计算对应的神经元的残差值和第一部分输入,对第二忆阻器阵列进行原位更新。第二组和第三组的阵列更新类似。具体的更新过程请参见下文中的描述,此处不再赘述。
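How the overlap follows from the kernel size and stride can be shown with a short sketch. It assumes the picture is split along rows only and that output rows are divided as evenly as possible among the parts; the function name `split_rows_with_overlap` and these choices are illustrative assumptions, not details taken from the filing.

```python
def split_rows_with_overlap(height, kernel_size, stride, parts):
    """Return one (start, stop) input-row range per part, chosen so that the
    output rows of the parts, concatenated in order, form the full output."""
    out_rows = (height - kernel_size) // stride + 1
    if out_rows < parts:
        raise ValueError("not enough output rows for the requested parts")
    base, extra = divmod(out_rows, parts)
    ranges = []
    first_out = 0
    for p in range(parts):
        count = base + (1 if p < extra else 0)
        start = first_out * stride
        # The last output row of this part still needs a full kernel window.
        stop = (first_out + count - 1) * stride + kernel_size
        ranges.append((start, stop))
        first_out += count
    return ranges


# 11 input rows, 3x3 kernel, stride 1, three parallel arrays.
print(split_rows_with_overlap(11, 3, 1, 3))  # [(0, 5), (3, 8), (6, 11)]
```

In this configuration adjacent parts share rows 3 to 4 and rows 6 to 7, that is, 2 rows each, which equals the kernel size minus the stride and matches the 2-row overlap mentioned in the example above.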
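In another possible implementation, shown as manner 2 in FIG. 7, the input data of each of the parallel groups of memristor arrays (for example, the second, third, and fourth memristor arrays) is a complete input, and the output data of each group is a complete output.
For example, the input data of the second memristor array is data 1, which for the convolutional layer is a complete input; its output, result 1, is a complete output. Likewise, the input data of the third memristor array is data 2, which is a complete input, and its output, result 2, is a complete output. The input data of the fourth memristor array is data 3, which is a complete input, and its output, result 3, is a complete output.
Specifically, referring to FIG. 9, multiple different complete inputs may be fed to the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) for parallel computation. The output of each group corresponds to one complete output.
If the compute-in-memory units in a neural network array are affected by non-ideal characteristics, such as device variation, conductance drift, and array yield, the units cannot represent the weights losslessly. As a result, the overall performance of the neural network system degrades and its recognition rate drops.
The technical solutions provided in the embodiments of this application can improve the performance and the recognition accuracy of the neural network system.
The following describes the method embodiments of this application in detail with reference to FIG. 10 to FIG. 22.
It should be noted that the technical solutions of the embodiments of this application can be applied to various neural networks, for example, convolutional neural networks (CNNs), recurrent neural networks, which are widely used in natural language and speech processing, and deep neural networks that combine the two. The processing of a convolutional neural network resembles the visual system of animals, which makes it well suited to picture recognition. Convolutional neural networks can be applied in picture recognition fields such as security, computer vision, and safe city, as well as in speech recognition, search engines, machine translation, and so on. In practice, the huge number of parameters and the heavy computation pose great challenges to applying neural networks in scenarios that demand high real-time performance and low power consumption.
FIG. 10 is a schematic flowchart of a data processing method in a neural network system according to an embodiment of this application. As shown in FIG. 10, the method may include steps 1010 to 1030, which are described in detail below.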
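Step 1010: Input training data into the neural network system to obtain first output data.
In the embodiments of this application, the parallel-accelerated neural network system may include a plurality of neural network arrays, each of which may include a plurality of compute-in-memory units, and each compute-in-memory unit stores the weight value of a neuron in the corresponding neural network.
Step 1020: Calculate the deviation between the first output data and target output data.
The target output data may be the ideal value of the actually output first output data.
The deviation in the embodiments of this application may be the difference between the first output data and the target output data, the residual between them, or a loss function of another form computed from them.
Step 1030: According to the deviation, adjust the weight value stored in at least one compute-in-memory unit in some of the plurality of neural network arrays in the parallel-accelerated neural network system.
In the embodiments of this application, these neural network arrays may be used to implement the computation of some of the neural network layers in the neural network system. That is, the correspondence between neural network arrays and neural network layers may be one-to-one, one-to-many, or many-to-one.
For example, the first memristor array shown in FIG. 6 corresponds to the fully connected layer in the neural network and implements its computation. For another example, the multiple groups of memristor arrays shown in FIG. 6 (for example, the second, third, and fourth memristor arrays) correspond to the convolutional layer in the neural network and implement its computation.
It should be understood that a neural network layer is a logical concept: one neural network layer means that one neural network operation is to be performed. For details, refer to the description of FIG. 5; they are not repeated here.
The resistance or conductance value of a compute-in-memory unit may indicate a weight value in a neural network layer. In the embodiments of this application, the resistance or conductance value of at least one compute-in-memory unit in some of the plurality of neural network arrays may be adjusted or rewritten according to the calculated deviation.
The resistance or conductance value of a compute-in-memory unit can be adjusted or rewritten in several ways. In one possible implementation, an update value for the resistance or conductance of the unit is determined from the deviation, and a fixed number of programming pulses is applied to the unit according to that update value. In another possible implementation, the update value is determined from the deviation, and programming pulses are applied to the unit in a read-while-write manner. In yet another possible implementation, different numbers of programming pulses are applied according to the characteristics of different units to adjust or rewrite their resistance or conductance values. These are described below with reference to specific embodiments and are not detailed here.
It should be noted that, in the embodiments of this application, the deviation may be used to adjust the resistance or conductance values of the neural network arrays that implement the fully connected layer, of the neural network arrays that implement the convolutional layer, or of both at the same time. This is described in detail below with reference to FIG. 11 to FIG. 17.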
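For ease of description, the following takes the residual between the actual output value and the target output value as an example and first describes how the residual is calculated.
In forward propagation (FP), a training data set, for example the pixel information of input images, is obtained and fed into the neural network. The data propagates from the first neural network layer to the last, and the actual output value is obtained from the output of the last layer.
In back propagation (BP), the goal is for the actual output value of the neural network to be as close as possible to the prior knowledge of the training data. The prior knowledge, also called the ground truth or the ideal output value, generally includes the prediction results provided by humans for the training data. The current actual output value is therefore compared with the ideal output value, and the residual is calculated from the deviation between the two; specifically, this may be done by calculating the partial derivative of the target loss function. The required weight update is then calculated from the residual, and the weight value stored in at least one compute-in-memory unit in the neural network array is updated accordingly.
As an example, the square of the difference between the actual output value and the ideal output value of the neural network may be calculated, and the derivative of this squared value with respect to the weights in the weight matrix gives the residual.
From the residual determined above and the input data corresponding to the weight values, the required weight update is determined by formula (1).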
Figure PCTCN2020130393-appb-000001
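Here, ΔW is the required weight update, η is the learning rate, N is the number of groups of input data, V is the input data value of this layer, and δ is the residual of this layer.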
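Formula (1) survives in this text only as an image reference. One reading that is consistent with the symbol definitions above, offered here as an assumption rather than as the formula in the original filing, is a learning-rate-scaled sum, over the N groups of input data, of the product of input and residual. The group index i and the position subscripts m, n are notation introduced for this sketch, and any sign convention or normalization by N in the original cannot be recovered from the text.

```latex
\Delta W_{m,n} \;=\; \eta \sum_{i=1}^{N} V_{m}^{(i)} \, \delta_{n}^{(i)}
```

Under this reading, the update for one weight is large when its input and the residual of the neuron it feeds are both large across the N groups.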
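Specifically, refer to the N×M array shown in FIG. 11, where SL denotes a source line and BL denotes a bit line.
In the forward operation, voltages are input on the BLs and currents are output on the SLs, which completes the matrix-vector multiplication Y = XW (X corresponds to the input voltage V, and Y corresponds to the output current I). Here X is the input data, so this operation can be used for forward inference.
In the backward operation, voltages are input on the SLs and currents are output on the BLs, that is, Y = XW^T is computed (X corresponds to the input voltage V, and Y corresponds to the output current I). Here X is the residual, so this operation performs the back propagation of the residual. The memristor array update operation (also called in-situ update) makes the weights change along the gradient direction.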
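The two read directions can be mimicked in software as follows. This is only a numerical sketch: it assumes that each row of the weight matrix `w` corresponds to one BL and each column to one SL, an orientation the text does not state, and the function names are hypothetical.

```python
def forward(x, w):
    """Forward operation Y = XW: one input value per BL (row of w),
    one output value per SL (column of w)."""
    rows, cols = len(w), len(w[0])
    return [sum(x[r] * w[r][c] for r in range(rows)) for c in range(cols)]


def backward(delta, w):
    """Backward operation Y = XW^T: one residual value per SL (column of w),
    one output value per BL (row of w)."""
    rows, cols = len(w), len(w[0])
    return [sum(delta[c] * w[r][c] for c in range(cols)) for r in range(rows)]


w = [[1, 2],
     [3, 4],
     [5, 6]]                      # 3 BLs x 2 SLs
print(forward([1, 0, 2], w))      # [11, 14]
print(backward([1, 1], w))        # [3, 7, 11]
```

Both directions use the same stored matrix, which is why the same array can serve forward inference and residual back propagation without rewriting its weights.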
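Optionally, in some embodiments, for the accumulated weight update obtained for row m, column n of this layer, formula (2) below may further be used to decide whether to update the weight value at row m, column n.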
Figure PCTCN2020130393-appb-000002
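Here, threshold denotes a preset threshold.
For the accumulated weight update ΔW_{m,n} of row m, column n of this layer, the threshold update rule of formula (2) is applied; that is, a weight whose update does not reach the threshold is not updated. Specifically, if ΔW_{m,n} is greater than or equal to the preset threshold, the weight value at row m, column n may be updated; if ΔW_{m,n} is less than the preset threshold, the weight value at row m, column n is not updated.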
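The threshold rule can be sketched in a few lines. Two points in the sketch are assumptions that go beyond the text: the comparison uses the magnitude of the accumulated update so that negative updates can also be applied, and an accumulated update is cleared once it has been applied. The function name is hypothetical.

```python
def threshold_update(weights, accumulated, threshold):
    """Apply an accumulated update only where it reaches the threshold.

    Returns (new_weights, remaining_accumulated). Comparing magnitudes and
    clearing applied updates are assumptions of this sketch."""
    new_weights = [row[:] for row in weights]
    remaining = [row[:] for row in accumulated]
    for m, row in enumerate(accumulated):
        for n, dw in enumerate(row):
            if abs(dw) >= threshold:
                new_weights[m][n] += dw   # update weight at row m, column n
                remaining[m][n] = 0.0     # update has been consumed
    return new_weights, remaining


new_w, left = threshold_update([[1.0, 2.0]], [[0.5, -0.05]], 0.1)
print(new_w)  # [[1.5, 2.0]]
print(left)   # [[0.0, -0.05]]
```

Skipping small updates reduces the number of programming operations on the array, at the cost of delaying those updates until they accumulate.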
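The following describes in detail, with reference to FIG. 12 and FIG. 13 and taking different data organization structures as examples, how the weight values stored in the first memristor array, which implements the fully connected layer computation, are updated.
FIG. 12 is a schematic diagram of updating the weight values stored in the first memristor array, which implements the fully connected layer computation, among the plurality of memristor arrays.
As shown in FIG. 12, the weights of pre-trained neural network layers may be written into the plurality of memristor arrays; that is, the memristor arrays store the weights of the corresponding neural network layers. For example, the first memristor array may implement the fully connected layer: the weights of the fully connected layer are stored in it, the conductance value of each memristor cell indicates a weight of the fully connected layer, and the array performs the multiply-accumulate computation of that layer. For another example, the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) may implement the convolutional layer. Implementing the weights at the same position of the convolutional layer in multiple groups of memristor arrays allows different inputs to be accelerated in parallel.
As shown in (a) in FIG. 8, an input picture is split into different parts that are fed to the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) for parallel computation. The outputs of these groups are then fed as input data to the first memristor array, which produces the first output data.
In the embodiments of this application, the residual may be calculated from the first output data and the ideal output data using the residual calculation method described above. The weight values stored in the memristors of the first memristor array, which implements the fully connected layer computation, are then updated in situ according to formula (1).
FIG. 13 is another schematic diagram of updating the weight values stored in the first memristor array, which implements the fully connected layer computation, among the plurality of memristor arrays.
As shown in FIG. 9, multiple different inputs are fed to the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) for parallel computation. The outputs of these groups are then fed as input data to the first memristor array, which produces the first output data.
In the embodiments of this application, the residual may be calculated from the first output data and the ideal output data using the residual calculation method described above. The weight values stored in the memristors of the first memristor array, which implements the fully connected layer computation, are then updated in situ according to formula (1).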
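The following describes in detail, with reference to FIG. 14 to FIG. 17 and taking different data organization structures as examples, how the weight values stored in the multiple groups of memristor arrays that implement the convolutional layer computation in parallel are updated.
FIG. 14 is a schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that implement the convolutional layer computation.
As shown in (a) in FIG. 8, an input picture is split into different parts that are fed to the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) for parallel computation. The combination of the outputs of these groups serves as the complete output data corresponding to the input picture.
In the embodiments of this application, the residual may be calculated from the output data and the ideal output data using the residual calculation method described above. The weight values stored in the memristors of the multiple groups of memristor arrays that implement the convolutional layer computation in parallel (for example, the second, third, and fourth memristor arrays) are then updated in situ according to formula (1). This can be implemented in several ways.
In one possible implementation, the residual is calculated from the output values of the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) and the corresponding ideal output values, and the weight values stored in the memristors of these groups are updated in situ according to this residual.
In another possible implementation, the residual is calculated from the first output value of the first memristor array and the corresponding ideal output value, and the weight values stored in the memristors of the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) are updated in situ according to this residual. This is described in detail below with reference to FIG. 15.
FIG. 15 is a schematic diagram of updating, according to the residual, the weight values stored in the multiple groups of memristor arrays that implement the convolutional layer computation.
As shown in FIG. 15, the complete residual may be the residual of the shared first memristor array, determined from the output value of the first memristor array and the corresponding ideal output value. In the embodiments of this application, the complete residual may be divided into multiple sub-residuals, for example, residual 1, residual 2, and residual 3.
Each sub-residual corresponds to the output data of one of the memristor arrays computing in parallel. For example, residual 1 corresponds to the output data of the second memristor array, residual 2 corresponds to the output data of the third memristor array, and residual 3 corresponds to the output data of the fourth memristor array.
In the embodiments of this application, the weight values stored in the memristors of each memristor array are updated in situ according to that array's input data and sub-residual, in combination with formula (2).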
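The division of a complete residual into sub-residuals, one per parallel array, can be sketched as below. The sketch makes several simplifying assumptions that are not in the text: the complete residual is a flat list ordered by array, each array's share has a known length, and the per-array update is the outer product of that array's input vector and its sub-residual scaled by a learning rate. Both function names are hypothetical.

```python
def split_residual(full_residual, sizes):
    """Cut the complete residual into consecutive sub-residuals,
    one per parallel array, with the given lengths."""
    parts, start = [], 0
    for size in sizes:
        parts.append(full_residual[start:start + size])
        start += size
    return parts


def outer_update(inputs, sub_residual, learning_rate):
    """Weight update for one array: learning_rate * input * residual
    for every (input, residual) pair (an assumed update form)."""
    return [[learning_rate * v * d for d in sub_residual] for v in inputs]


residual_1, residual_2, residual_3 = split_residual([1, 2, 3, 4, 5, 6], [2, 2, 2])
print(residual_1, residual_2, residual_3)          # [1, 2] [3, 4] [5, 6]
print(outer_update([1.0, 2.0], [0.5, -1.0], 0.5))  # [[0.25, -0.5], [0.5, -1.0]]
```

Each array is then updated only from its own input data and its own sub-residual, so the three updates are independent of one another.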
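FIG. 16 is another schematic diagram of updating the weight values stored in the multiple groups of memristor arrays that implement the convolutional layer computation.
With the input data structure shown in FIG. 9, multiple different inputs are fed to the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) for parallel computation. The output of each group corresponds to one complete output.
In the embodiments of this application, the residual may be calculated from the output data and the ideal output data using the residual calculation method described above. The weight values stored in the memristors of the multiple groups of memristor arrays that implement the convolutional layer computation in parallel (for example, the second, third, and fourth memristor arrays) are then rewritten according to formula (1). This can be implemented in several ways.
In one possible implementation, the residual is calculated from the output values of the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) and the corresponding ideal output values, and the weight values stored in the memristors of these groups are updated in situ according to this residual.
In another possible implementation, the residual is calculated from the first output value of the first memristor array and the corresponding ideal output value, and the weight values stored in the memristors of the multiple groups of memristor arrays (for example, the second, third, and fourth memristor arrays) are updated in situ according to this residual. This is described in detail below with reference to FIG. 17.
FIG. 17 is another schematic diagram of updating, according to the residual, the weight values stored in the multiple groups of memristor arrays that implement the convolutional layer computation.
As shown in FIG. 17, the complete residual may be the residual of the shared first memristor array, determined from the output value of the first memristor array and the corresponding ideal output value. Because each group of memristor arrays participating in the parallel acceleration produces a complete output, each group can be updated according to its associated complete residual data. Suppose that the complete residual above was obtained from output result 1 of the second memristor array. The weight values stored in the memristors of the second memristor array can then be updated in situ by formula (1), using the complete residual and input data 1 of the second memristor array.
Optionally, in the embodiments of this application, the weight values stored in the arrays preceding the multiple memristor arrays that implement the convolutional layer computation in parallel may also be adjusted, with the residuals of the neurons in each layer calculated by back propagation. For details, refer to the method described above; they are not repeated here. It should be understood that, for these preceding neural network arrays, the input data may be the output data of still earlier memristor arrays or raw external input data such as images, text, or speech, and their output data serves as the input data of the multiple memristor arrays that implement the convolutional layer computation in parallel.
The foregoing describes, with reference to FIG. 11 to FIG. 17, adjusting the weight values stored in the neural network array that implements the fully connected layer computation and adjusting the weight values stored in the multiple neural network arrays that implement the convolutional layer computation in parallel. It should be understood that the embodiments of this application may also adjust both at the same time. The method is similar and is not described again here.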
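The following introduces the set operation and the reset operation with reference to FIG. 18 to FIG. 21, taking the writing of target data to a target memristor cell located at the intersection of a BL and an SL as an example.
The set operation adjusts the conductance of a memristor cell from low to high, and the reset operation adjusts it from high to low.
As shown in FIG. 18, suppose that the target conductance range of the target memristor cell represents the target weight W_11. When W_11 is written to the target memristor cell, if the cell's current conductance is below the lower limit of the target conductance range, the set operation can be used to increase the conductance. In this case, a voltage may be applied through SL 11 to the gate of the transistor in the target memristor cell to turn the transistor on, so that the cell is selected. At the same time, the SL connected to the target memristor cell and the other BLs in the crossbar array are grounded. A set pulse is then applied to the BL on which the target memristor cell is located to adjust its conductance.
As shown in FIG. 19, if the current conductance of the target memristor cell is above the upper limit of the target conductance range, the reset operation can be used to decrease the conductance. In this case, a voltage may be applied through the SL to the gate of the transistor in the target memristor cell, so that the cell is selected. At the same time, the BL connected to the target memristor cell and the other SLs in the crossbar array are grounded. A reset pulse is then applied to the SL on which the target memristor cell is located to adjust its conductance.
The conductance of the target memristor cell can be adjusted in several ways. For example, a fixed number of programming pulses may be applied to the cell. Alternatively, programming pulses may be applied to the cell in a read-while-write manner. Alternatively, different numbers of programming pulses may be applied to different memristor cells to adjust their conductance values.
Based on the set and reset operations, the following describes in detail, with reference to FIG. 20 and FIG. 21, how programming pulses are applied to the target memristor cell in a read-while-write manner.
The embodiments of this application may write target data to the target memristor cell based on an incremental step program pulse (ISPP) strategy. Specifically, the ISPP strategy usually adjusts the conductance of the target memristor cell in a "read-verify-correct" manner, so that the conductance is eventually adjusted to the target conductance corresponding to the target data.
Referring to FIG. 20, device 1, device 2, and so on are the target memristor cells in the selected memristor array. First, a read pulse (V_read) is applied to a target memristor cell to read out its current conductance, which is compared with the target conductance. If the current conductance is lower than the target conductance, a set pulse (V_set) is applied to the cell to increase its conductance. The adjusted conductance is then read with another read pulse (V_read); if it is still lower than the target conductance, another set pulse (V_set) is applied, so that the conductance of the cell moves toward the target conductance.
Referring to FIG. 21, device 1, device 2, and so on are the target memristor cells in the selected memristor array. First, a read pulse (V_read) is applied to a target memristor cell to read out its current conductance, which is compared with the target conductance. If the current conductance is higher than the target conductance, a reset pulse (V_reset) is applied to the cell to decrease its conductance. The adjusted conductance is then read with another read pulse (V_read); if it is still higher than the target conductance, another reset pulse (V_reset) is applied, so that the conductance of the cell moves toward the target conductance.
It should be understood that V_read may be a read voltage pulse below the threshold voltage, and V_set or V_reset may be a write voltage pulse above the threshold voltage.
In the embodiments of this application, the read-while-write procedures above may each be used to bring the conductance of the target memristor cell to the target conductance corresponding to the target data. Optionally, the termination condition may be that the conductance increase of every selected device in the row has reached the required amount.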
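The read-verify-correct loop can be simulated with a toy device model. Everything about the model is an assumption made for illustration: `ToyCell` changes its conductance by a fixed step per pulse (real memristors respond nonlinearly and with variation, and ISPP normally also steps the pulse amplitude), and `tolerance` and `max_pulses` are hypothetical parameters.

```python
class ToyCell:
    """Idealized memristor cell: each pulse moves the conductance by `step`."""

    def __init__(self, conductance, step):
        self.conductance = conductance
        self.step = step

    def read(self):            # stands in for a V_read pulse
        return self.conductance

    def set_pulse(self):       # stands in for a V_set pulse
        self.conductance += self.step

    def reset_pulse(self):     # stands in for a V_reset pulse
        self.conductance -= self.step


def program_conductance(cell, target, tolerance, max_pulses=100):
    """Read, compare with the target, and apply set or reset pulses until the
    conductance is within `tolerance`. Returns the number of pulses used."""
    for pulses in range(max_pulses + 1):
        g = cell.read()
        if abs(g - target) <= tolerance:
            return pulses
        if pulses == max_pulses:
            break
        if g < target:
            cell.set_pulse()
        else:
            cell.reset_pulse()
    raise RuntimeError("target conductance not reached within max_pulses")


cell = ToyCell(conductance=10, step=2)
print(program_conductance(cell, target=17, tolerance=1))  # 3
print(cell.conductance)                                   # 16
```

The verify step after every pulse is what lets the loop stop as soon as the cell is inside the target range, instead of applying a precomputed number of pulses.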
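FIG. 22 is a schematic flowchart of a neural network training process according to an embodiment of this application. As shown in FIG. 22, the method may include steps 2210 to 2255, which are described in detail below.
Step 2210: Determine, according to neural network information, the network layers that need to be accelerated.
In the embodiments of this application, the network layers that need to be accelerated may be determined from one or more of the following: the number of layers of the neural network, parameter information, the size of the training data set, and so on.
Step 2215: Perform offline training on an external personal computer (PC) to determine the initial training weights.
The weight parameters of the neurons of the neural network may be trained on the external PC through forward computation, backward computation, and other steps to determine the initial training weights.
Step 2220: Map the initial training weights to the neural network arrays in the compute-in-memory architecture that implement the parallel-accelerated network layers and to the neural network arrays that implement the non-parallel-accelerated network layers.
In the embodiments of this application, the initial training weights may be mapped, by the method shown in FIG. 3, to at least one compute-in-memory unit of the plurality of neural network arrays in the compute-in-memory architecture, so that the neural network arrays can perform matrix multiply-accumulate operations between the input data and the configured weights.
The plurality of neural network arrays may include neural network arrays that implement non-parallel-accelerated network layers and neural network arrays that implement parallel-accelerated network layers.
Step 2225: Input one group of training data into the plurality of neural network arrays of the compute-in-memory architecture to obtain the forward computation output of the actual hardware of the compute-in-memory architecture.
Step 2230: Determine whether the accuracy of the neural network system meets the requirement or whether the preset number of training iterations has been reached.
If the accuracy of the neural network system meets the requirement or the preset number of training iterations has been reached, step 2235 may be performed.
If the accuracy of the neural network system does not meet the requirement and the preset number of training iterations has not been reached, step 2240 may be performed.
Step 2235: End the training.
Step 2240: Determine whether the training data is the last group of training data.
If the training data is the last group of training data, step 2245 and step 2255 may be performed.
If the training data is not the last group of training data, steps 2250 to 2255 may be performed.
Step 2245: Reload the training data.
Step 2250: Based on the proposed training method for the parallel-training compute-in-memory system, perform on-chip in-situ training and updating of the conductance weights of the parallel acceleration arrays or other arrays through back propagation and other computations.
For the specific update method, refer to the description above; it is not repeated here.
Step 2255: Load the next group of training data.
After the next group of training data is loaded, the operation in step 2225 continues. That is, the loaded training data is input into the plurality of neural network arrays of the compute-in-memory architecture to obtain the forward computation output of the actual hardware of the compute-in-memory architecture.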
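The control flow of steps 2225 to 2255 can be sketched as a loop. The callables `forward`, `accuracy`, and `update_in_situ` are hypothetical placeholders for the hardware operations, and the sketch follows the step list literally, so the in-situ update of step 2250 is skipped for the last group before the data is reloaded; whether the original flowchart intends this is not clear from the text.

```python
def train_on_chip(groups, forward, accuracy, update_in_situ,
                  target_accuracy, max_passes):
    """Loop over groups of training data until the accuracy requirement or
    the preset number of forward passes is reached. Returns the pass count."""
    passes = 0
    index = 0
    while True:
        outputs = forward(groups[index])                   # step 2225
        passes += 1
        if accuracy() >= target_accuracy or passes >= max_passes:  # step 2230
            return passes                                  # step 2235
        if index == len(groups) - 1:                       # step 2240
            index = 0                                      # steps 2245 and 2255
        else:
            update_in_situ(groups[index], outputs)         # step 2250
            index += 1                                     # step 2255


# Toy stand-ins: every in-situ update raises the accuracy by 0.25.
state = {"accuracy": 0.0}
passes = train_on_chip(
    groups=[1, 2, 3],
    forward=lambda group: group,
    accuracy=lambda: state["accuracy"],
    update_in_situ=lambda group, out: state.update(accuracy=state["accuracy"] + 0.25),
    target_accuracy=0.9,
    max_passes=50,
)
print(passes)  # 6
```

Steps 2210 to 2220 (layer selection, offline training, and weight mapping) happen once before this loop and are not modeled here.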
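It should be understood that, in the embodiments of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of this application.
The foregoing describes in detail, with reference to FIG. 1 to FIG. 22, the data processing method in a neural network system provided in the embodiments of this application. The following describes the apparatus embodiments of this application in detail with reference to FIG. 23. It should be understood that the descriptions of the method embodiments and the apparatus embodiments correspond to each other, so for parts not described in detail, refer to the foregoing method embodiments.
It should be noted that the apparatus embodiments consider the implementation of the solutions of this application from the perspective of products and devices. Parts of the apparatus embodiments and of the method embodiments described above correspond to or complement each other, and they can be used interchangeably with regard to implementing the solutions and supporting the scope of the claims.
The following describes the apparatus embodiments of this application with reference to FIG. 23.
FIG. 23 is a schematic structural diagram of a neural network system 2300 according to an embodiment of this application. It should be understood that the neural network system 2300 shown in FIG. 23 is merely an example and that the apparatus in the embodiments of this application may further include other modules or units. The neural network system 2300 can perform the steps of the methods in FIG. 10 to FIG. 22; to avoid repetition, details are not described here again.
As shown in FIG. 23, the neural network system 2300 may include:
a processing module 2310, configured to input training data into the neural network system to obtain first output data, where the neural network system includes a plurality of neural network arrays, each of the plurality of neural network arrays includes a plurality of compute-in-memory units, and each compute-in-memory unit is configured to store the weight value of a neuron in the corresponding neural network;
a calculation module 2320, configured to calculate the deviation between the first output data and target output data; and
an adjustment module 2330, configured to adjust, according to the deviation, the weight value stored in at least one compute-in-memory unit in some of the plurality of neural network arrays, where these neural network arrays are used to implement the computation of some of the neural network layers in the neural network system.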
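Optionally, in a possible implementation, the plurality of neural network arrays include a first neural network array and a second neural network array, and the input data of the first neural network array includes the output data of the second neural network array.
In another possible implementation of the second aspect, the first neural network array includes a neural network array used to implement the computation of a fully connected layer in the neural network.
Optionally, in another possible implementation, the adjustment module 2330 is specifically configured to:
adjust, according to the input values of the first neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the first neural network array.
Optionally, in another possible implementation, the plurality of neural network arrays further include a third neural network array, and the third neural network array and the second neural network array are used to implement the computation of a convolutional layer in the neural network in parallel.
Optionally, in another possible implementation,
the adjustment module 2330 is specifically configured to:
adjust, according to the input data of the second neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the second neural network array; and
adjust, according to the input data of the third neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the third neural network array.
In another possible implementation of the second aspect,
the adjustment module 2330 is specifically configured to:
divide the deviation into at least two sub-deviations, where a first sub-deviation of the at least two sub-deviations corresponds to the output data of the second neural network array, and a second sub-deviation of the at least two sub-deviations corresponds to the output data of the third neural network array;
adjust, according to the first sub-deviation and the input data of the second neural network array, the weight value stored in at least one compute-in-memory unit in the second neural network array; and
adjust, according to the second sub-deviation and the input data of the third neural network array, the weight value stored in at least one compute-in-memory unit in the third neural network array.
Optionally, in another possible implementation, the adjustment module 2330 is specifically configured to: determine a number of pulses according to the updated weight value of a compute-in-memory unit, and rewrite, according to the number of pulses, the weight value stored in at least one compute-in-memory unit in the neural network array.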
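It should be understood that the neural network system 2300 here is embodied in the form of functional modules. The term "module" here may be implemented in software and/or hardware, which is not specifically limited. For example, a "module" may be a software program, a hardware circuit, or a combination of the two that implements the foregoing functions. When any of the foregoing modules is implemented in software, the software exists as computer program instructions stored in a memory, and a processor may execute the program instructions to implement the foregoing method procedures. The processor may include, but is not limited to, at least one of the following computing devices that run software: a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller unit (MCU), or an artificial intelligence processor. Each computing device may include one or more cores for executing software instructions to perform operations or processing. The processor may be a separate semiconductor chip, or it may be integrated with other circuits into one semiconductor chip. For example, it may form a system on chip (SoC) with other circuits (such as codec circuits, hardware acceleration circuits, or various bus and interface circuits), or it may be integrated into an application-specific integrated circuit (ASIC) as the built-in processor of the ASIC, and the ASIC with the integrated processor may be packaged separately or together with other circuits. In addition to the cores for executing software instructions to perform operations or processing, the processor may further include necessary hardware accelerators, such as a field programmable gate array (FPGA), a programmable logic device (PLD), or a logic circuit that implements dedicated logic operations.
When the foregoing modules are implemented as hardware circuits, the hardware circuit may be implemented by a general-purpose central processing unit (CPU), a microcontroller unit (MCU), a microprocessor unit (MPU), a digital signal processor (DSP), or a system on chip (SoC). It may also be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The hardware circuit may run necessary software, or may perform the foregoing method procedures without relying on software.
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, the foregoing embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the procedures or functions according to the embodiments of this application are wholly or partly generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that contains one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.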
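It should be understood that the term "and/or" in this document merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: only A exists, both A and B exist, and only B exists, where A and B may each be singular or plural. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship; refer to the context for the specific meaning.
In this application, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following items" or a similar expression refers to any combination of these items, including a single item or any combination of plural items. For example, at least one of a, b, or c may indicate: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.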
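The seven combinations listed for "at least one of a, b, or c" are exactly the non-empty subsets of the three items. As a small illustration (the helper name `at_least_one` is hypothetical), they can be enumerated with the standard library:

```python
from itertools import combinations


def at_least_one(items):
    """List every non-empty combination of `items`, joined with '-'."""
    return ["-".join(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


print(at_least_one(["a", "b", "c"]))
# ['a', 'b', 'c', 'a-b', 'a-c', 'b-c', 'a-b-c']
```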
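A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application essentially, or the part that contributes to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.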

Claims (17)

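  1. A data processing method in a neural network system, comprising:
    inputting training data into the neural network system to obtain first output data, wherein the neural network system comprises a plurality of neural network arrays, each of the plurality of neural network arrays comprises a plurality of compute-in-memory units, and each compute-in-memory unit is configured to store the weight value of a neuron in the corresponding neural network;
    calculating a deviation between the first output data and target output data; and
    adjusting, according to the deviation, the weight value stored in at least one compute-in-memory unit in some of the plurality of neural network arrays, wherein the some neural network arrays are used to implement the computation of some neural network layers in the neural network system.
  2. The method according to claim 1, wherein the plurality of neural network arrays comprise a first neural network array and a second neural network array, and input data of the first neural network array comprises output data of the second neural network array.
  3. The method according to claim 2, wherein the first neural network array comprises a neural network array used to implement the computation of a fully connected layer in the neural network.
  4. The method according to claim 3, wherein the adjusting, according to the deviation, the weight value stored in at least one compute-in-memory unit in some of the plurality of neural network arrays comprises:
    adjusting, according to input values of the first neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the first neural network array.
  5. The method according to any one of claims 2 to 4, wherein the plurality of neural network arrays further comprise a third neural network array, and the third neural network array and the second neural network array are used to implement the computation of a convolutional layer in the neural network in parallel.
  6. The method according to claim 5, wherein the adjusting, according to the deviation, the weight value stored in at least one compute-in-memory unit in some of the plurality of neural network arrays comprises:
    adjusting, according to input data of the second neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the second neural network array; and
    adjusting, according to input data of the third neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the third neural network array.
  7. The method according to claim 5, wherein the adjusting, according to the deviation, the weight value stored in at least one compute-in-memory unit in some of the plurality of neural network arrays comprises:
    dividing the deviation into at least two sub-deviations, wherein a first sub-deviation of the at least two sub-deviations corresponds to output data of the second neural network array, and a second sub-deviation of the at least two sub-deviations corresponds to output data of the third neural network array;
    adjusting, according to the first sub-deviation and input data of the second neural network array, the weight value stored in at least one compute-in-memory unit in the second neural network array; and
    adjusting, according to the second sub-deviation and input data of the third neural network array, the weight value stored in at least one compute-in-memory unit in the third neural network array.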
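  8. A neural network system, comprising:
    a processing module, configured to input training data into the neural network system to obtain first output data, wherein the neural network system comprises a plurality of neural network arrays, each of the plurality of neural network arrays comprises a plurality of compute-in-memory units, and each compute-in-memory unit is configured to store the weight value of a neuron in the corresponding neural network;
    a calculation module, configured to calculate a deviation between the first output data and target output data; and
    an adjustment module, configured to adjust, according to the deviation, the weight value stored in at least one compute-in-memory unit in some of the plurality of neural network arrays, wherein the some neural network arrays are used to implement the computation of some neural network layers in the neural network system.
  9. The neural network system according to claim 8, wherein the plurality of neural network arrays comprise a first neural network array and a second neural network array, and input data of the first neural network array comprises output data of the second neural network array.
  10. The neural network system according to claim 9, wherein the first neural network array comprises a neural network array used to implement the computation of a fully connected layer in the neural network.
  11. The neural network system according to claim 10, wherein the adjustment module is specifically configured to:
    adjust, according to input values of the first neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the first neural network array.
  12. The neural network system according to any one of claims 9 to 11, wherein the plurality of neural network arrays further comprise a third neural network array, and the third neural network array and the second neural network array are used to implement the computation of a convolutional layer in the neural network in parallel.
  13. The neural network system according to claim 12, wherein the adjustment module is specifically configured to:
    adjust, according to input data of the second neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the second neural network array; and
    adjust, according to input data of the third neural network array and the deviation, the weight value stored in at least one compute-in-memory unit in the third neural network array.
  14. The neural network system according to claim 12, wherein the adjustment module is specifically configured to:
    divide the deviation into at least two sub-deviations, wherein a first sub-deviation of the at least two sub-deviations corresponds to output data of the second neural network array, and a second sub-deviation of the at least two sub-deviations corresponds to output data of the third neural network array;
    adjust, according to the first sub-deviation and input data of the second neural network array, the weight value stored in at least one compute-in-memory unit in the second neural network array; and
    adjust, according to the second sub-deviation and input data of the third neural network array, the weight value stored in at least one compute-in-memory unit in the third neural network array.
  15. A neural network system, comprising:
    a memory, configured to store a program; and
    a processor, configured to call the program from the memory and run it to perform the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores program code to be executed by a device, and the program code comprises instructions for performing the method according to any one of claims 1 to 7.
  17. A chip, wherein the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1 to 7.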
PCT/CN2020/130393 2019-11-20 2020-11-20 神经网络系统中数据处理的方法、神经网络系统 Ceased WO2021098821A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20888862.8A EP4053748B1 (en) 2019-11-20 2020-11-20 Method for data processing in neural network system, and neural network system
US17/750,052 US20220277199A1 (en) 2019-11-20 2022-05-20 Method for data processing in neural network system and neural network system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911144635.8 2019-11-20
CN201911144635.8A CN112825153A (zh) 2019-11-20 2019-11-20 神经网络系统中数据处理的方法、神经网络系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/750,052 Continuation US20220277199A1 (en) 2019-11-20 2022-05-20 Method for data processing in neural network system and neural network system

Publications (1)

Publication Number Publication Date
WO2021098821A1 true WO2021098821A1 (zh) 2021-05-27

Family

ID=75906348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/130393 Ceased WO2021098821A1 (zh) 2019-11-20 2020-11-20 神经网络系统中数据处理的方法、神经网络系统

Country Status (4)

Country Link
US (1) US20220277199A1 (zh)
EP (1) EP4053748B1 (zh)
CN (1) CN112825153A (zh)
WO (1) WO2021098821A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965067A (zh) * 2023-02-01 2023-04-14 苏州亿铸智能科技有限公司 一种针对ReRAM的神经网络加速器

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220012586A1 (en) * 2020-07-13 2022-01-13 Macronix International Co., Ltd. Input mapping to reduce non-ideal effect of compute-in-memory
CN115481562B (zh) * 2021-06-15 2023-05-16 中国科学院微电子研究所 多并行度优化方法、装置、识别方法和电子设备
CN115691475B (zh) * 2021-07-23 2025-12-12 澜至电子科技(成都)有限公司 用于训练语音识别模型的方法以及语音识别方法
CN113642419B (zh) * 2021-07-23 2024-03-01 上海亘存科技有限责任公司 一种用于目标识别的卷积神经网络及其识别方法
CN113792010A (zh) * 2021-09-22 2021-12-14 清华大学 存算一体芯片及数据处理方法
CN114372567B (zh) * 2021-11-26 2026-04-28 北京大学 一种人工智能加速器及其数据处理方法
CN114330688A (zh) * 2021-12-23 2022-04-12 厦门半导体工业技术研发有限公司 基于阻变式存储器的模型在线迁移训练方法、装置及芯片
CN115056824B (zh) * 2022-05-06 2023-11-28 北京和利时系统集成有限公司 一种确定控车参数的方法、装置、计算机存储介质及终端
CN114997388B (zh) * 2022-06-30 2024-05-07 杭州知存算力科技有限公司 存算一体芯片用基于线性规划的神经网络偏置处理方法
CN117436498A (zh) * 2022-07-11 2024-01-23 中国科学院上海技术物理研究所 一种基于权重调整的阻变存储器训练装置及训练方法
CN115564036B (zh) * 2022-10-25 2023-06-30 厦门半导体工业技术研发有限公司 基于rram器件的神经网络阵列电路及其设计方法
CN115796252B (zh) * 2022-11-25 2025-11-11 清华大学 权重写入方法及装置、电子设备和存储介质
CN116151343B (zh) * 2023-04-04 2023-09-05 荣耀终端有限公司 数据处理电路和电子设备
CN116863936B (zh) * 2023-09-04 2023-12-19 之江实验室 一种基于FeFET存算一体阵列的语音识别方法
CN117973468B (zh) * 2024-01-05 2025-01-28 中科南京智能技术研究院 基于存算架构的神经网络推理方法及相关设备
US12587207B2 (en) * 2024-02-06 2026-03-24 International Business Machines Corporation Adaptive digital-to-analog converter range optimization
CN120046677B (zh) * 2025-04-22 2025-07-22 湖南长城银河科技有限公司 基于dsp与忆阻器协同的异构神经网络加速装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009640A (zh) * 2017-12-25 2018-05-08 清华大学 基于忆阻器的神经网络的训练装置及其训练方法
CN109460817A (zh) * 2018-09-11 2019-03-12 华中科技大学 一种基于非易失存储器的卷积神经网络片上学习系统
CN109886393A (zh) * 2019-02-26 2019-06-14 杭州闪亿半导体有限公司 一种存算一体化电路及神经网络的计算方法
CN110443168A (zh) * 2019-07-23 2019-11-12 华中科技大学 一种基于忆阻器的神经网络人脸识别系统

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646243B1 (en) * 2016-09-12 2017-05-09 International Business Machines Corporation Convolutional neural networks using resistive processing unit array
US11348002B2 (en) * 2017-10-24 2022-05-31 International Business Machines Corporation Training of artificial neural networks
US10643705B2 (en) * 2018-07-24 2020-05-05 Sandisk Technologies Llc Configurable precision neural network with differential binary non-volatile memory cell structure
US11061902B2 (en) * 2018-10-18 2021-07-13 Oracle International Corporation Automated configuration parameter tuning for database performance
CN110334799B (zh) * 2019-07-12 2022-05-24 电子科技大学 基于存算一体的神经网络推理与训练加速器及其运行方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4053748A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965067A (zh) * 2023-02-01 2023-04-14 苏州亿铸智能科技有限公司 一种针对ReRAM的神经网络加速器
CN115965067B (zh) * 2023-02-01 2023-08-25 苏州亿铸智能科技有限公司 一种针对ReRAM的神经网络加速器

Also Published As

Publication number Publication date
CN112825153A (zh) 2021-05-21
EP4053748B1 (en) 2025-07-30
US20220277199A1 (en) 2022-09-01
EP4053748A1 (en) 2022-09-07
EP4053748A4 (en) 2023-01-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20888862

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020888862

Country of ref document: EP

Effective date: 20220531

NENP Non-entry into the national phase

Ref country code: DE

WWG Wipo information: grant in national office

Ref document number: 2020888862

Country of ref document: EP