WO2018024093A1 - Operation unit, method and device capable of supporting operation data of different bit widths - Google Patents

Operation unit, method and device capable of supporting operation data of different bit widths

Info

Publication number
WO2018024093A1
WO2018024093A1 PCT/CN2017/093159 CN2017093159W WO2018024093A1 WO 2018024093 A1 WO2018024093 A1 WO 2018024093A1 CN 2017093159 W CN2017093159 W CN 2017093159W WO 2018024093 A1 WO2018024093 A1 WO 2018024093A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
bit width
operator
operand
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/093159
Other languages
English (en)
French (fr)
Inventor
陈天石
郭崎
杜子东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to EP17836275.2A (patent EP3496006B1, en)
Priority to KR1020187034252A (patent KR102486029B1, ko)
Publication of WO2018024093A1 (WO2018024093A1, zh)
Priority to US16/268,457 (patent US10489704B2, en)
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30014 Arithmetic instructions with variable precision
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22 Microcontrol or microprogram arrangements
    • G06F9/226 Microinstruction function, e.g. input/output microinstruction; diagnostic microinstruction; microinstruction format
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181 Instruction operation extension or modification
    • G06F9/30185 Instruction operation extension or modification according to one or more bits in the instruction, e.g. prefix, sub-opcode
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention relates to the field of computers, and in particular to an operation unit, an operation method, and an operation device that support operations on data of different bit widths.
  • Artificial neural networks (ANNs), or neural networks (NNs) for short, rely on the complexity of the system and process information by adjusting the relationships among a large number of interconnected internal nodes.
  • Neural networks have made great progress in many fields, such as intelligent control and machine learning. Because a neural network is an algorithmic mathematical model that involves a large number of mathematical operations, performing neural network operations quickly and accurately is an urgent problem to be solved.
  • The bit widths needed to represent and operate on the parameters of a neural network differ from layer to layer. Using operators of different bit widths reduces the actual amount of computation and the power consumption. Combining low-bit-width operators into a high-bit-width operator lets the low-bit-width operators be reused, which reduces the number of operators and the area of the device.
  • An object of the present invention is to provide an operation unit, an operation method, and an operation device that support operations on data of different bit widths, so as to achieve efficient neural network, matrix, and vector operations.
  • The operation unit, operation method, and operation device provided by the present invention first determine whether there is an operator with the same bit width as the operation data. If so, the operation data is passed directly to that operator. Otherwise, an operator merging strategy is generated, several operators are merged into a new operator according to that strategy so that the bit width of the new operator matches the bit width of the operation data, and the operation data is passed to the new operator. The operator that receives the operation data then performs the neural network, matrix, or vector operation.
  • The present invention operates on data of different bit widths according to instructions, which can be implemented in two ways. In the first, a single instruction contains both the operands and the bit width fields, so the operation unit can obtain the operands and an operator of the corresponding bit width directly from that instruction and perform the operation. In the second, two instructions are used: the operation unit first obtains or constructs an operator of the corresponding bit width according to a bit width configuration instruction, and then obtains the operands according to an operation instruction and performs the operation.
  • The present invention specifies the bit width of the operation data through the bit width field in the instruction, so the bit width can be configured freely as needed. For operation data of a given bit width, if an operator matching that bit width exists, it is called directly to perform the operation. If the bit width of the operation data is too large and no matching operator exists, several lower-bit-width operators are merged to construct a new operator, which then performs the operation. This supports operations on data of different bit widths and achieves efficient neural network, matrix, and vector operations, while reducing the number of operators and the hardware area.
  • The present invention employs a high-speed scratchpad memory, which can store operation data (e.g., neurons, vectors, matrices) of different lengths and different bit widths.
  • FIG. 1 is a schematic structural diagram of the operation device provided by the present invention.
  • FIG. 2 is a schematic structural diagram of the operation unit provided by the present invention.
  • FIG. 3 is a schematic diagram of the instruction format used when the present invention performs an operation with one instruction.
  • FIG. 4 is a schematic diagram of the format of a neural network operation instruction of the present invention.
  • FIG. 5 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention.
  • FIG. 6 is a schematic diagram of the format of a vector-vector operation instruction of the present invention.
  • FIG. 7 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention.
  • FIG. 8 is a schematic structural diagram of an operation device according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a decoding module according to an embodiment of the present invention.
  • FIG. 10 is a flowchart of the operation unit of the embodiment of the present invention performing an operation with one instruction.
  • FIG. 11 is a schematic diagram of the format of the bit width configuration instruction used when the present invention performs an operation with two instructions.
  • FIG. 12 is a schematic diagram of the format of the operation instruction used when the present invention performs an operation with two instructions.
  • FIG. 13 is a schematic diagram of the format of a neural network bit width configuration instruction of the present invention.
  • FIG. 14 is a schematic diagram of the format of a neural network operation instruction of the present invention.
  • FIG. 15 is a schematic diagram of the format of a matrix-matrix bit width configuration instruction of the present invention.
  • FIG. 16 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention.
  • FIG. 17 is a schematic diagram of the format of a vector-vector bit width configuration instruction of the present invention.
  • FIG. 18 is a schematic diagram of the format of a vector-vector operation instruction of the present invention.
  • FIG. 19 is a schematic diagram of the format of a matrix-vector bit width configuration instruction of the present invention.
  • FIG. 20 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention.
  • FIG. 21 is a flowchart of the operation unit of the embodiment of the present invention performing an operation with two instructions.
  • The invention discloses an operation unit, an operation method, and an operation device capable of supporting operation data of different bit widths. The bit width of the operation data taking part in an operation is set through a bit width field in the instruction. When performing an operation according to the instruction, the operation unit first determines whether there is an operator with the same bit width as the operation data. If so, the operation data is passed directly to that operator. Otherwise, an operator merging strategy is generated, and several operators are merged into a new operator according to that strategy so that the bit width of the new operator matches the bit width of the operation data, and the operation data is passed to the new operator. The operator that receives the operation data then performs the neural network, matrix, or vector operation. The invention thus supports operations on data of different bit widths, achieves efficient neural network, matrix, and vector operations, and reduces the number of operators and the hardware area.
  • FIG. 1 is a schematic structural diagram of the operation device provided by the present invention. As shown in FIG. 1, the operation device includes:
  • A storage unit, which may be a scratchpad memory capable of supporting neuron/matrix/vector data of different lengths and different bit widths. The necessary operation data is held temporarily in the scratchpad memory, so the operation device can support data of different lengths and bit widths more flexibly and effectively while performing neural network and matrix/vector operations. The scratchpad memory can be implemented with a variety of memory devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM, non-volatile memory, etc.).
  • A register unit, which may be a scalar register file that provides the scalar registers required during operation. The scalar registers store not only neuron/matrix/vector addresses but also scalar data. For an operation involving a matrix/vector and a scalar, the operation unit obtains both the matrix/vector address and the corresponding scalar from the register unit.
  • A control unit, which controls the behavior of the modules in the device. The control unit reads the prepared instruction, decodes it into several microinstructions, and sends them to the other modules in the device, which perform the corresponding operations according to the microinstructions they receive.
  • An operation unit, which obtains an instruction, obtains the neuron/matrix/vector address from the register unit according to the instruction, fetches the corresponding neuron/matrix/vector from the storage unit at that address, and then performs the operation on this operation data (neuron/matrix/vector). The operations performed by the operation unit include, but are not limited to: convolutional neural network forward operations, convolutional neural network training operations, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training operations, batch normalization operations, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product operations, the four basic arithmetic operations on vectors, vector logic operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector cyclic shift operations, and generation of random vectors that follow a given distribution.
  • During an operation, the operation unit selects one or more operators according to the bit width of the operation data indicated by the operands in the instruction. The operators have different bit widths; for example, some support 16-bit data and others support 32-bit data. The operators may essentially be vector multiplication components, accumulation components, and scalar multiplication components.
  • The operation unit includes a judgment submodule, an operator merging submodule, and an operation submodule.
  • The judgment submodule determines whether there is an operator with the same bit width as the operation data indicated by the operand. If so, it passes the operand to that operator; otherwise, it passes an operator merging strategy and the operand to the operator merging submodule.
  • The operator merging submodule merges several operators into a new operator according to the operator merging strategy, so that the bit width of the new operator matches the bit width of the operand, and passes the operand to the new operator.
  • The operator merging strategy gives priority to operators with larger bit widths when combining. When an operator with exactly the required bit width exists, it is used directly. If not, the available operators whose bit widths are smaller than the required bit width and closest to it are selected and combined. For example, suppose the operators available for combination have bit widths of 8, 16, and 32 bits. When the required bit width is 32 bits, the 32-bit operator is used directly; when it is 64 bits, two 32-bit operators are merged; when it is 48 bits, a 32-bit operator and a 16-bit operator are merged; when it is 40 bits, a 32-bit operator and an 8-bit operator are merged.
  • The operation submodule causes the operator that receives the operand to perform the operation.
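  • The merging strategy described above can be sketched as a greedy, largest-first decomposition. This is only an illustration under stated assumptions: the function name `plan_operators` is hypothetical, an unlimited number of operators of each width is assumed to be available, and the sketch rejects widths that cannot be composed exactly, a case the text does not cover.

```python
from typing import List, Sequence


def plan_operators(required_bits: int,
                   available_bits: Sequence[int] = (8, 16, 32)) -> List[int]:
    """Return the bit widths of the operators to use for `required_bits`.

    Assumption (not from the source): any number of operators of each
    available width may be used.
    """
    if required_bits <= 0:
        raise ValueError("required bit width must be positive")
    if required_bits in available_bits:
        # An operator of exactly the required width exists: use it directly.
        return [required_bits]

    plan: List[int] = []
    remaining = required_bits
    # Prefer larger operators first, as the merging strategy describes.
    for width in sorted(available_bits, reverse=True):
        while remaining >= width:
            plan.append(width)
            remaining -= width
    if remaining != 0:
        # The source does not say what happens here; the sketch refuses.
        raise ValueError(f"cannot compose {required_bits} bits exactly")
    return plan
```

  • With the default widths, this reproduces the four cases given in the text: 32 bits uses one 32-bit operator, 64 bits uses two 32-bit operators, 48 bits uses 32 + 16, and 40 bits uses 32 + 8.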
  • The instructions of the present invention can be implemented in two ways. In the first, a single instruction contains both the operands and the bit width fields, so the operation unit can obtain the operands and an operator of the corresponding bit width directly from that instruction and perform the operation. In the second, two instructions are used: the operation unit first obtains or constructs an operator of the corresponding bit width according to a bit width configuration instruction, and then obtains the operands according to an operation instruction and performs the operation.
  • The instruction set of the present invention adopts a load/store architecture, and the operation unit does not operate on data in memory.
  • The instruction set uses a very long instruction word (VLIW) architecture. By configuring instructions differently, it can carry out complex neural network operations as well as simple matrix/vector operations.
  • The instruction set also uses fixed-length instructions, so that the neural network and matrix/vector operation device of the present invention can fetch the next instruction while the previous instruction is in its decoding stage.
  • In the single-instruction form, the instruction includes at least one operation code, at least three operands, and at least two bit width fields, where the number of bit width fields equals the number of kinds of operands involved in the operation. The operation code indicates the function of the operation instruction, and the operation unit performs different operations by recognizing one or more operation codes. The operands indicate the data information of the operation instruction, and each bit width field indicates the bit width of the corresponding operand. The data information may be an immediate value or a register number. For example, to obtain a matrix, the starting address and the length of the matrix are read from the register identified by the register number, and the matrix stored at that address is then fetched from the storage unit according to that starting address and length.
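  • The operand lookup described above (register number, then starting address and length, then data in the storage unit) can be sketched as follows. All names here (`Operand`, `Instruction`, `fetch_matrix`, the opcode label `"MADD"`) are hypothetical, and the sketch assumes that one register entry holds both the starting address and the length and that the storage unit is a flat list of elements.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Operand:
    value: int
    is_register: bool  # True: register number, False: immediate value


@dataclass(frozen=True)
class Instruction:
    opcode: str
    operands: Tuple[Operand, ...]
    bit_widths: Tuple[int, ...]  # one entry per bit width field


def fetch_matrix(operand: Operand,
                 registers: Dict[int, Tuple[int, int]],
                 storage: List[int]) -> List[int]:
    """Resolve a register-number operand to the matrix elements it refers to."""
    if not operand.is_register:
        raise ValueError("a matrix operand must name a register")
    # Assumption: the register holds (starting address, length) together.
    start, length = registers[operand.value]
    if start < 0 or length < 0 or start + length > len(storage):
        raise IndexError("matrix lies outside the storage unit")
    return storage[start:start + length]


# Illustrative data only: register 3 points at 6 elements starting at address 4.
registers = {3: (4, 6)}
storage = list(range(100, 120))
instr = Instruction(
    opcode="MADD",  # hypothetical opcode label
    operands=(Operand(3, True), Operand(3, True), Operand(7, False)),
    bit_widths=(16, 16),
)
matrix = fetch_matrix(instr.operands[0], registers, storage)
```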
  • The neural network operation instruction includes at least one operation code, 16 operands, and 4 bit width fields. The operation code indicates the function of the neural network operation instruction, and the operation unit performs different neural network operations by recognizing one or more operation codes. The operands indicate the data information of the neural network operation instruction, which may be an immediate value or a register number. Each bit width field indicates the bit width of the corresponding operand in the operation. It also indicates the bit width of the corresponding operator during the operation and whether low-bit-width operators need to be merged into a high-bit-width operator.
  • The matrix-matrix operation instruction includes at least one operation code, at least 4 operands, and 2 bit width fields. The operation code indicates the function of the matrix-matrix operation instruction, and the operation unit performs different matrix operations by recognizing one or more operation codes. The operands indicate the data information of the matrix-matrix operation instruction, which may be an immediate value or a register number. Each bit width field indicates the bit width of the corresponding operand in the operation. It also indicates the bit width of the corresponding operator during the operation and whether low-bit-width operators need to be merged into a high-bit-width operator.
  • The vector-vector operation instruction includes at least one operation code, at least 3 operands, and at least 2 bit width fields. The operation code indicates the function of the vector-vector operation instruction, and the operation unit performs different vector operations by recognizing one or more operation codes. The operands indicate the data information of the vector-vector operation instruction, which may be an immediate value or a register number. Each bit width field indicates the bit width of the corresponding operand in the operation. It also indicates the bit width of the corresponding operator during the operation and whether low-bit-width operators need to be merged into a high-bit-width operator.
  • The matrix-vector operation instruction includes at least one operation code, at least 6 operands, and at least 3 bit width fields. The operation code indicates the function of the matrix-vector operation instruction, and the operation unit performs different matrix and vector operations by recognizing one or more operation codes. The operands indicate the data information of the matrix-vector operation instruction, which may be an immediate value or a register number. Each bit width field indicates the bit width of the corresponding operand in the operation. It also indicates the bit width of the corresponding operator during the operation and whether low-bit-width operators need to be merged into a high-bit-width operator.
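  • The operand and bit width field counts stated for the four instruction kinds can be collected into a small table. This is a sketch only: the names `FORMATS` and `matches_format` are hypothetical, and treating a count stated without "at least" as exact is an interpretation of the text, not something the source confirms.

```python
from typing import Dict, Tuple

# (count, is_minimum): True means "at least count", False means exactly count.
Spec = Tuple[int, bool]

# Counts taken from the instruction descriptions above; the exact-versus-minimum
# reading of each number is an assumption.
FORMATS: Dict[str, Dict[str, Spec]] = {
    "neural_network": {"operands": (16, False), "bit_widths": (4, False)},
    "matrix_matrix": {"operands": (4, True), "bit_widths": (2, False)},
    "vector_vector": {"operands": (3, True), "bit_widths": (2, True)},
    "matrix_vector": {"operands": (6, True), "bit_widths": (3, True)},
}


def _satisfies(actual: int, spec: Spec) -> bool:
    count, is_minimum = spec
    return actual >= count if is_minimum else actual == count


def matches_format(kind: str, n_operands: int, n_bit_widths: int) -> bool:
    """Check operand and bit width field counts against the stated format."""
    fmt = FORMATS[kind]
    return (_satisfies(n_operands, fmt["operands"])
            and _satisfies(n_bit_widths, fmt["bit_widths"]))
```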
  • The device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a storage queue, a reorder buffer, an operation unit, a scratchpad memory, and an IO memory access module:
  • An instruction fetch module, which fetches the next instruction to be executed from the instruction sequence and passes it to the decoding module.
  • A decoding module, which decodes the instruction and passes the decoded instruction to the instruction queue. As shown in FIG. 9, the decoding module includes an instruction accepting module, a microinstruction generating module, a microinstruction queue, and a microinstruction transmitting module. The instruction accepting module accepts the instruction obtained from the instruction fetch module. The microinstruction generating module decodes the instruction obtained by the instruction accepting module into microinstructions that control each functional component. The microinstruction queue stores the microinstructions sent by the microinstruction generating module. The microinstruction transmitting module transmits the microinstructions to the functional components.
  • An instruction queue, which buffers the decoded instructions in order and sends them to the dependency processing unit.
  • A scalar register file, which provides the scalar registers the device needs during operation.
  • A dependency processing unit, which handles storage dependencies that may exist between an instruction and the preceding instruction. Matrix operation instructions access the scratchpad memory, and successive instructions may access the same block of storage. To ensure that instructions produce correct results, if the current instruction is found to depend on the data of a preceding instruction, it must wait in the storage queue until the dependency is eliminated.
  • A storage queue, which is an ordered queue. Instructions that depend on the data of a preceding instruction are held in this queue until the dependency is eliminated, after which the instruction is submitted.
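  • One way to detect that successive instructions touch the same block of storage is an address-range overlap test. The source does not specify the detection rule, so this is a conservative sketch under assumptions: the names `ranges_overlap` and `has_dependency` are hypothetical, each instruction is described by the (start address, length) ranges it accesses, and any overlap counts as a dependency regardless of read or write direction.

```python
from typing import Iterable, Tuple

Range = Tuple[int, int]  # (start address, length)


def ranges_overlap(a: Range, b: Range) -> bool:
    """True if the half-open intervals [start, start + length) intersect."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len


def has_dependency(current: Iterable[Range],
                   in_flight: Iterable[Iterable[Range]]) -> bool:
    """True if the current instruction touches storage used by any
    earlier instruction that has not finished yet."""
    current_ranges = list(current)
    return any(
        ranges_overlap(cur, prev)
        for prev_ranges in in_flight
        for prev in prev_ranges
        for cur in current_ranges
    )
```

  • An instruction for which `has_dependency` returns true would wait in the storage queue and be re-checked as earlier instructions finish.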
  • A reorder buffer, in which instructions are also cached while they execute. When an instruction finishes executing, it is committed if it is also the oldest uncommitted instruction in the reorder buffer. Once an instruction is committed, the changes it made to the state of the device cannot be undone. Each instruction in the reorder buffer acts as a placeholder. When the first (oldest) instruction in the buffer has a data dependency, it is not committed (released). Although many instructions keep arriving, only a limited number can be accepted (bounded by the size of the reorder buffer), and the whole operation proceeds smoothly only once the first instruction has been committed.
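  • The in-order submission (commit) behavior described above can be sketched with a bounded queue. The class and method names are hypothetical, and the sketch models only ordering and capacity, not the device state that a commit makes permanent.

```python
from collections import deque
from typing import Deque, List


class ReorderBuffer:
    """Bounded buffer that releases finished instructions oldest-first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Each entry is [name, done]; the oldest entry is on the left.
        self._entries: Deque[list] = deque()

    def issue(self, name: str) -> bool:
        """Reserve a slot; returns False when the buffer is full."""
        if len(self._entries) >= self.capacity:
            return False
        self._entries.append([name, False])
        return True

    def mark_done(self, name: str) -> None:
        """Record that an instruction has finished executing."""
        for entry in self._entries:
            if entry[0] == name:
                entry[1] = True
                return
        raise KeyError(name)

    def commit_ready(self) -> List[str]:
        """Commit finished instructions, stopping at the first unfinished one."""
        committed: List[str] = []
        while self._entries and self._entries[0][1]:
            committed.append(self._entries.popleft()[0])
        return committed
```

  • In this model, a younger instruction that finishes early stays in the buffer until every older instruction has been committed, which is why a stalled oldest instruction eventually blocks new arrivals once the buffer is full.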
  • An operation unit, which is responsible for all neural network and matrix/vector operations of the device, including but not limited to: convolutional neural network forward operations, convolutional neural network training operations, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training operations, batch normalization operations, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product operations, the four basic arithmetic operations on vectors, vector logic operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector cyclic shift operations, and generation of random vectors that follow a given distribution. The operation instruction is sent to the operation unit for execution. The operation unit determines whether there is an operator whose bit width equals the one given by the bit width field corresponding to the operand in the instruction. If so, that operator is selected; if not, an operator of the required bit width is formed by merging several low-bit-width operators. The operation indicated by the operation code in the instruction is then performed on the operands with the selected operator to obtain the result.
  • A scratchpad memory, which is a temporary storage device dedicated to data and capable of supporting data of different lengths and different bit widths.
  • An IO memory access module, which accesses the scratchpad memory directly and is responsible for reading data from it or writing data to it.
  • FIG. 10 is a flowchart of the operation unit of the embodiment of the present invention performing an operation with one instruction. As shown in FIG. 10, the process includes:
  • The instruction fetch module fetches the instruction and sends it to the decoding module.
  • The decoding module decodes the instruction and sends it to the instruction queue.
  • Within the decoding module, the instruction accepting module sends the instruction to the microinstruction generating module to generate microinstructions.
  • The microinstruction generating module obtains the neural network operation code and the neural network operands of the instruction from the scalar register file, decodes the instruction into microinstructions that control each functional component, and sends them to the microinstruction queue.
  • The instruction is sent to the dependency processing unit.
  • The dependency processing unit analyzes whether the instruction has a data dependency on a preceding instruction that has not finished executing. If it does, the instruction waits in the storage queue until the dependency no longer exists.
  • The microinstructions corresponding to the neural network operation or matrix/vector instruction are then sent to functional components such as the operation unit.
  • The operation unit fetches the required data from the scratchpad memory according to the address and size of that data, and then determines whether there is an operator whose bit width matches the bit width field in the instruction. If so, it selects the matching operator to complete the operation corresponding to the instruction. If not, it merges low-bit-width operators into an operator of the required bit width and completes the operation with it.
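  • The text does not say how merged operators are wired together. For addition, one familiar arrangement is to chain narrower adders through their carry signals. The sketch below shows that idea only as an assumption, not as the patent's circuit; `add_fixed` and `add_merged` are hypothetical names and inputs are taken to be non-negative integers.

```python
from typing import Sequence, Tuple


def add_fixed(a: int, b: int, bits: int, carry_in: int = 0) -> Tuple[int, int]:
    """Model one fixed-width adder: returns (sum modulo 2**bits, carry out)."""
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> bits


def add_merged(a: int, b: int, widths: Sequence[int]) -> Tuple[int, int]:
    """Add with a chain of adders, `widths` listed from the least
    significant slice upward; each carry out feeds the next adder."""
    result = 0
    shift = 0
    carry = 0
    for bits in widths:
        part, carry = add_fixed(a >> shift, b >> shift, bits, carry)
        result |= part << shift
        shift += bits
    return result, carry
```

  • For example, two 32-bit adders chained this way behave like a single 64-bit adder, and a 16-bit adder followed by a 32-bit adder behaves like a 48-bit adder.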
  • FIG. 11 and FIG. 12 are schematic diagrams of the instruction formats used when an operation is performed with two instructions. FIG. 11 shows the format of the bit-width configuration instruction, which includes at least one opcode and at least two bit-width fields that indicate the bit width of the operators used by the next operation instruction. FIG. 12 shows the format of the operation instruction, which includes at least one opcode and at least three operands. The opcode indicates the function of the operation instruction, and the operation unit can perform different operations by identifying one or more opcodes.
  • The operands indicate the data information of the operation instruction, where the data information may be an immediate value or a register number.
  • For example, to obtain a matrix, the matrix start address and the matrix length are read from the register identified by the register number, and the matrix stored at that address is then fetched from the storage unit according to the start address and length.
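The register-indirect lookup described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `Immediate`, `RegisterRef` and `fetch_matrix` are hypothetical, the scratchpad is modeled as a flat list, and one register is assumed to hold both the start address and the length.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Immediate:
    """Operand that carries its value directly."""
    value: int


@dataclass(frozen=True)
class RegisterRef:
    """Operand that names a scalar register."""
    number: int


Operand = Union[Immediate, RegisterRef]


def fetch_matrix(operand: RegisterRef,
                 registers: Dict[int, Tuple[int, int]],
                 scratchpad: List[int]) -> List[int]:
    """Read (start address, length) from the register, then read that span."""
    start, length = registers[operand.number]
    if start < 0 or length < 0 or start + length > len(scratchpad):
        raise ValueError("matrix span lies outside the scratchpad")
    return scratchpad[start:start + length]


# Assumed toy state: register 3 points at 6 elements starting at address 4.
registers = {3: (4, 6)}
scratchpad = list(range(100, 116))
matrix = fetch_matrix(RegisterRef(3), registers, scratchpad)
```

Here `matrix` holds the six elements stored at addresses 4 through 9. An immediate operand would skip the register lookup entirely and be used as the value itself.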
  • FIGS. 13 and 14 are instantiations of FIGS. 11 and 12, showing the formats of the neural network bit-width configuration instruction and the neural network operation instruction, respectively.
  • The bit-width configuration instruction includes at least one opcode and at least four bit-width fields, which indicate the bit width of the operators used by the next neural network operation instruction.
  • The neural network operation instruction includes at least one opcode and 16 operands. The opcode indicates the function of the neural network operation instruction, and the operation unit can perform different neural network operations by identifying one or more opcodes. The operands indicate the data information of the neural network operation instruction, where the data information may be an immediate value or a register number.
  • FIGS. 15 and 16 are instantiations of FIGS. 11 and 12, showing the formats of the matrix-matrix bit-width configuration instruction and the matrix-matrix operation instruction, respectively.
  • The bit-width configuration instruction includes at least one opcode and at least two bit-width fields, which indicate the bit width of the operators used by the next matrix-matrix operation instruction.
  • The matrix-matrix operation instruction includes at least one opcode and at least four operands.
  • The opcode indicates the function of the matrix-matrix operation instruction, and the operation unit can perform different matrix operations by identifying one or more opcodes. The operands indicate the data information of the matrix-matrix operation instruction, where the data information may be an immediate value or a register number.
  • FIGS. 17 and 18 are instantiations of FIGS. 11 and 12, showing the formats of the vector-vector bit-width configuration instruction and the vector-vector operation instruction, respectively.
  • The bit-width configuration instruction includes at least one opcode and at least two bit-width fields, which indicate the bit width of the operators used by the next vector-vector operation instruction.
  • The vector-vector operation instruction includes at least one opcode and at least three operands. The opcode indicates the function of the vector-vector operation instruction, and the operation unit can perform different vector operations by identifying one or more opcodes. The operands indicate the data information of the vector-vector operation instruction, where the data information may be an immediate value or a register number.
  • FIGS. 19 and 20 are instantiations of FIGS. 11 and 12, showing the formats of the matrix-vector bit-width configuration instruction and the matrix-vector operation instruction, respectively.
  • The bit-width configuration instruction includes at least one opcode and at least three bit-width fields, which indicate the bit width of the operators used by the next matrix-vector operation instruction.
  • The matrix-vector operation instruction includes at least one opcode and at least six operands. The opcode indicates the function of the matrix-vector operation instruction, and the operation unit can perform different matrix and vector operations by identifying one or more opcodes. The operands indicate the data information of the matrix-vector operation instruction, where the data information may be an immediate value or a register number.
  • Figure 21 is a flowchart of the operation device of the embodiment executing an operation in the two-instruction mode. As shown in Figure 21, the process includes:
  • Step S1: the fetch module fetches a bit-width configuration instruction and sends it to the decoding module;
  • Step S2: the decoding module decodes the instruction and sends it to the instruction queue;
  • Step S3: in the decoding module, the instruction is sent to the instruction accepting module;
  • Step S4: the instruction accepting module sends the instruction to the micro-instruction decoding module for micro-instruction decoding;
  • Step S5: the micro-instruction decoding module decodes the instruction into a micro-instruction that controls the operation unit to select operators of the specified bit width, and sends it to the micro-instruction issue queue;
  • Step S6: the fetch module fetches a neural network operation or matrix/vector instruction and sends it to the decoding module;
  • Step S7: the decoding module decodes the instruction and sends it to the instruction queue;
  • Step S8: in the decoding module, the instruction is sent to the instruction accepting module;
  • Step S9: the instruction accepting module sends the instruction to the micro-instruction decoding module for micro-instruction decoding;
  • Step S10: the micro-instruction decoding module obtains the neural network operation opcode and the neural network operation operands of the instruction from the scalar register file, decodes the instruction into micro-instructions that control each functional component, and sends them to the micro-instruction issue queue;
  • Step S11: after the required data has been obtained, the instruction is sent to the dependency processing unit. The dependency processing unit analyzes whether the instruction has a data dependency on a previous instruction that has not finished executing; if so, the instruction waits in the store queue until the dependency no longer exists;
  • Step S12: the micro-instructions corresponding to the instruction, together with the earlier micro-instruction that specifies the operator bit width, are sent to the operation unit;
  • Step S13: the operation unit fetches the required data from the scratchpad memory according to its address and size, and then determines whether there is an operator whose bit width equals the bit-width field in the bit-width configuration instruction. If so, it selects the matching operator to complete the neural network operation and/or matrix/vector operation corresponding to the instruction; if not, it merges low-bit-width operators into an operator of the required bit width to complete that operation;
  • Step S14: after the operation is completed, the output data is written back to the specified address in the scratchpad memory, and the instruction is committed in the reorder buffer.
  • In summary, the present invention discloses a device and method with configurable operator bit width for performing neural network operations and matrix/vector operations. Together with the corresponding instructions, it addresses the neural network algorithms and large volumes of matrix/vector operations found in the computer field today. Compared with existing conventional solutions, the invention offers configurable instructions, ease of use, selectable operator bit widths and mergeable operators. The operator bit width can be configured in two ways: through a dedicated bit-width configuration instruction, or through bit-width fields specified in the operation instruction. The supported neural network scale and the matrix/vector bit width and scale are flexible, and the on-chip buffer is sufficient.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Executing Machine-Instructions (AREA)
  • Complex Calculations (AREA)
  • Communication Control (AREA)

Abstract

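An operation unit, an operation method and an operation device, in which the bit width of the operation data taking part in an operation is configured through a bit-width field in the instruction. When an operation is executed according to the instruction, it is first determined whether an operator exists whose bit width equals that of the operation data indicated by the operand in the instruction. If so, the operand is passed directly to that operator; otherwise, an operator merging strategy is generated, multiple operators are merged into a new operator according to the strategy so that the bit width of the new operator matches that of the operand, and the operand is passed to the new operator. The operator that receives the operand then performs the neural network operation, matrix operation or vector operation. The operation unit, operation method and operation device support operations on operation data of different bit widths, enabling efficient neural network, matrix and vector operations while reducing the number of operators and the hardware area.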

Description

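Operation unit, method and device capable of supporting operation data of different bit widths
Technical Field
The present invention relates to the field of computers, and in particular to an operation unit, an operation method and an operation device that support operations on operation data of different bit widths.
Background
Artificial neural networks (ANNs), or neural networks (NNs) for short, are algorithmic mathematical models that imitate the behavioral characteristics of animal neural networks and perform distributed parallel information processing. Such a network relies on the complexity of the system and processes information by adjusting the interconnections among a large number of internal nodes. Neural networks have made considerable progress in many fields, such as intelligent control and machine learning. Because a neural network is an algorithmic mathematical model that involves a large number of mathematical operations, executing neural network operations quickly and accurately is an urgent problem. The parameters of a neural network operation require different bit widths when they are represented and computed at different layers. Using operators of different bit widths reduces the actual amount of computation and lowers power consumption. Merging low-bit-width operators into high-bit-width operators allows the low-bit-width operators to be reused, which reduces the number of operators and the area of the device.
Summary of the Invention
In view of this, the object of the present invention is to provide an operation unit, an operation method and an operation device that support operations on operation data of different bit widths, so as to achieve efficient neural network operations, matrix operations and vector operations.
The operation unit, operation method and operation device provided by the present invention first determine whether an operator exists whose bit width equals that of the operation data. If so, the operation data is passed directly to that operator; otherwise, an operator merging strategy is generated, multiple operators are merged into a new operator according to the strategy so that the bit width of the new operator matches that of the operation data, and the operation data is passed to the new operator. The operator that receives the operation data then performs the neural network operation, matrix operation or vector operation.
In addition, the present invention executes operations on operation data of different bit widths according to instructions, and the instructions are implemented in two ways. In the first, a single instruction contains both the operands and the bit-width fields, so the operation unit can obtain the operands and operators of the corresponding bit width directly from that instruction and perform the operation. In the second, two instructions are used: the operation unit first obtains or constructs operators of the corresponding bit width according to a bit-width configuration instruction, and then obtains the operands according to an operation instruction and performs the operation.
The present invention has the following beneficial effects:
1. The bit width of the operation data is specified by the bit-width field in the instruction, so it can be configured as needed. For operation data of a given bit width, if an operator matching that bit width exists, it is invoked directly to perform the operation. If the bit width is too large and no matching operator exists, several operators of lower bit width are merged to construct a new operator, which then performs the operation. Operation data of different bit widths is therefore supported, enabling efficient neural network, matrix and vector operations while reducing the number of operators and the hardware area.
2. The present invention uses a scratchpad memory, which can store operation data (for example neurons, vectors and matrices) of different lengths and different bit widths.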
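Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the operation device provided by the present invention.
FIG. 2 is a schematic structural diagram of the operation unit provided by the present invention.
FIG. 3 is a schematic diagram of the instruction format used when an operation is performed with one instruction.
FIG. 4 is a schematic diagram of the format of the neural network operation instruction of the present invention.
FIG. 5 is a schematic diagram of the format of the matrix-matrix operation instruction of the present invention.
FIG. 6 is a schematic diagram of the format of the vector-vector operation instruction of the present invention.
FIG. 7 is a schematic diagram of the format of the matrix-vector operation instruction of the present invention.
FIG. 8 is a schematic structural diagram of the operation device of an embodiment of the present invention.
FIG. 9 is a schematic structural diagram of the decoding module in an embodiment of the present invention.
FIG. 10 is a flowchart of the operation device of an embodiment of the present invention performing an operation with one instruction.
FIG. 11 is a schematic diagram of the format of the bit-width configuration instruction used when an operation is performed with two instructions.
FIG. 12 is a schematic diagram of the format of the operation instruction used when an operation is performed with two instructions.
FIG. 13 is a schematic diagram of the format of the neural network bit-width configuration instruction of the present invention.
FIG. 14 is a schematic diagram of the format of the neural network operation instruction of the present invention.
FIG. 15 is a schematic diagram of the format of the matrix-matrix bit-width configuration instruction of the present invention.
FIG. 16 is a schematic diagram of the format of the matrix-matrix operation instruction of the present invention.
FIG. 17 is a schematic diagram of the format of the vector-vector bit-width configuration instruction of the present invention.
FIG. 18 is a schematic diagram of the format of the vector-vector operation instruction of the present invention.
FIG. 19 is a schematic diagram of the format of the matrix-vector bit-width configuration instruction of the present invention.
FIG. 20 is a schematic diagram of the format of the matrix-vector operation instruction of the present invention.
FIG. 21 is a flowchart of the operation device of an embodiment of the present invention performing an operation with two instructions.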
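Detailed Description
The present invention discloses an operation unit, an operation method and an operation device capable of supporting operation data of different bit widths. The bit width of the operation data taking part in an operation is configured through the bit-width field in the instruction. When an operation is executed according to the instruction, it is first determined whether an operator exists whose bit width equals that of the operation data. If so, the operation data is passed directly to that operator; otherwise, an operator merging strategy is generated, multiple operators are merged into a new operator according to the strategy so that its bit width matches that of the operation data, and the operation data is passed to the new operator. The operator that receives the operation data then performs the neural network operation, matrix operation or vector operation. The invention supports operations on operation data of different bit widths, enabling efficient neural network, matrix and vector operations while reducing the number of operators and the hardware area.
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
FIG. 1 is a schematic structural diagram of the operation device provided by the present invention. As shown in FIG. 1, the operation device includes:
A storage unit for storing neurons/matrices/vectors. In one embodiment, the storage unit may be a scratchpad memory that supports neuron/matrix/vector data of different lengths and different bit widths. The necessary operation data is held temporarily in the scratchpad memory, so that the device can support data of different lengths and bit widths more flexibly and effectively during neural network operations and matrix/vector operations. The scratchpad memory may be implemented with various storage devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM, non-volatile memory, and so on).
A register unit for storing neuron/matrix/vector addresses, where the neuron address is the address at which a neuron is stored in the storage unit, the matrix address is the address at which a matrix is stored in the storage unit, and the vector address is the address at which a vector is stored in the storage unit. In one embodiment, the register unit may be a scalar register file that provides the scalar registers needed during operation. The scalar registers hold not only neuron/matrix/vector addresses but also scalar data. When an operation involves a matrix/vector and a scalar, the operation unit obtains not only the matrix/vector address but also the corresponding scalar from the register unit.
A control unit for controlling the behavior of each module in the device. In one embodiment, the control unit reads the prepared instruction, decodes it into multiple micro-instructions and sends them to the other modules of the device, which perform the corresponding operations according to the micro-instructions they receive.
An operation unit for obtaining an instruction, obtaining the neuron/matrix/vector address from the register unit according to the instruction, then obtaining the corresponding neuron/matrix/vector from the storage unit according to that address, and performing the operation on this operation data (neuron/matrix/vector). The operations performed by the operation unit include but are not limited to: convolutional neural network forward operation, convolutional neural network training operation, neural network pooling operation, fully connected neural network forward operation, fully connected neural network training operation, batch normalization operation, RBM neural network operation, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operation, vector inner product operation, the four basic vector arithmetic operations, vector logical operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector cyclic shift operations, and generation of random vectors obeying a given distribution.
While executing an operation, the operation unit selects one or more operators according to the bit width of the operation data indicated by the operand in the instruction. The operators have different bit widths; for example, some support 16-bit data operations and some support 32-bit data operations. An operator may in essence be a vector multiplication component, an accumulation component, a scalar multiplication component, and so on. As shown in FIG. 2, the operation unit includes a judging submodule, an operator merging submodule and an operation submodule: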
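The judging submodule determines whether an operator exists whose bit width equals that of the operation data indicated by the operand. If so, it passes the operand to that operator; otherwise, it passes the operator merging strategy and the operand to the operator merging submodule.
The operator merging submodule merges multiple operators into a new operator according to the operator merging strategy, so that the bit width of the new operator matches that of the operand, and passes the operand to the new operator. Specifically, the operator merging strategy gives priority to operators of larger bit width. When an operator of the required bit width exists, it is used directly; otherwise, the available operators whose bit width is smaller than and closest to the required bit width are chosen for combination. For example, when the operators available for combination have bit widths of 8, 16 and 32 bits: if a 32-bit operator is required, the 32-bit operator is used directly; if a 64-bit operator is required, two 32-bit operators are merged; if a 48-bit operator is required, one 32-bit operator and one 16-bit operator are merged; if a 40-bit operator is required, one 32-bit operator and one 8-bit operator are merged.
The operation submodule causes the operator that received the operand to perform the operation.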
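The merging strategy described above (prefer an exact match, otherwise combine the largest operators that still fit) can be read as a greedy selection. The following Python sketch is one possible reading, not the patent's implementation: the function name `plan_operators` is hypothetical, each width is assumed to be available in unlimited copies, and covering a leftover remainder with the smallest operator is an added assumption that the text does not address.

```python
from typing import List, Sequence


def plan_operators(required: int, available: Sequence[int]) -> List[int]:
    """Return the operator widths to use for a required bit width."""
    if required <= 0:
        raise ValueError("required bit width must be positive")
    widths = sorted(set(available), reverse=True)
    if not widths:
        raise ValueError("no operators available")
    if required in widths:
        return [required]  # exact match: use that operator directly

    plan: List[int] = []
    remaining = required
    for width in widths:  # largest first, as the strategy prefers
        while remaining >= width:
            plan.append(width)
            remaining -= width
    if remaining > 0:
        # Assumption: round up with the smallest operator.
        plan.append(widths[-1])
    return plan


available = [8, 16, 32]
plans = {w: plan_operators(w, available) for w in (32, 64, 48, 40)}
```

With operators of 8, 16 and 32 bits, `plans` reproduces the four cases given in the text: 32 uses one 32-bit operator, 64 uses two 32-bit operators, 48 uses 32 + 16, and 40 uses 32 + 8.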
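The instructions of the present invention are implemented in two ways. In the first, a single instruction contains both the operands and the bit-width fields, so the operation unit can obtain the operands and operators of the corresponding bit width directly from that instruction and perform the operation. In the second, two instructions are used: the operation unit first obtains or constructs operators of the corresponding bit width according to a bit-width configuration instruction, and then obtains the operands according to an operation instruction and performs the operation.
It should be noted that the instruction set of the present invention adopts a Load/Store architecture, and the operation unit does not operate on data in memory. The instruction set adopts a very long instruction word (VLIW) architecture; by configuring instructions differently, it can complete complex neural network operations as well as simple matrix/vector operations. In addition, the instruction set uses fixed-length instructions, so that the neural network and matrix/vector operation device of the present invention can fetch the next instruction during the decoding stage of the previous instruction.
FIG. 3 is a schematic diagram of the instruction format used when an operation is performed with one instruction. As shown in FIG. 3, the instruction includes at least one opcode, at least three operands and at least two bit-width fields, where the number of bit-width fields equals the number of kinds of operands used when computing in the operator. The opcode indicates the function of the operation instruction, and the operation unit can perform different operations by identifying one or more opcodes. The operands indicate the data information of the operation instruction, and the bit-width fields specify the bit widths of the corresponding operands. The data information may be an immediate value or a register number. For example, to obtain a matrix, the matrix start address and the matrix length are read from the register identified by the register number, and the matrix stored at that address is then fetched from the storage unit according to the start address and length.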
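As a rough illustration of the single-instruction format in FIG. 3, the sketch below models an instruction as an opcode plus operand and bit-width tuples and checks the minimum counts stated in the text. The class name `SingleInstruction` and the mnemonic `VV_ADD` are hypothetical; the patent does not define a concrete encoding.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SingleInstruction:
    """One-instruction format: opcode, operands and bit-width fields together."""
    opcode: str                   # hypothetical mnemonic, e.g. "VV_ADD"
    operands: Tuple[int, ...]     # immediates or register numbers
    bit_widths: Tuple[int, ...]   # one field per kind of operand

    def __post_init__(self) -> None:
        # Minimum counts stated for the generic format of FIG. 3.
        if len(self.operands) < 3:
            raise ValueError("expected at least 3 operands")
        if len(self.bit_widths) < 2:
            raise ValueError("expected at least 2 bit-width fields")
        if any(width <= 0 for width in self.bit_widths):
            raise ValueError("bit widths must be positive")


# Assumed example: two 16-bit source vectors feeding a 32-bit operator.
vv_add = SingleInstruction("VV_ADD", operands=(0, 1, 2), bit_widths=(16, 32))
```

Carrying the widths in the instruction itself means the operation unit needs no state from earlier instructions to pick or merge operators, which is the main difference from the two-instruction mode.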
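FIG. 4 is a schematic diagram of the format of the neural network operation instruction, which is an instantiation of the instruction of FIG. 3. As shown in FIG. 4, the neural network operation instruction includes at least one opcode, 16 operands and 4 bit-width fields. The opcode indicates the function of the neural network operation instruction, and the operation unit can perform different neural network operations by identifying one or more opcodes. The operands indicate the data information of the neural network operation instruction, where the data information may be an immediate value or a register number. The bit-width fields specify the bit widths of the operands in the operation; they also specify the bit widths of the operators used during the operation and whether low-bit-width operators need to be merged into high-bit-width operators.
FIG. 5 is a schematic diagram of the format of the matrix-matrix operation instruction, which is an instantiation of the instruction of FIG. 3. As shown in FIG. 5, the matrix-matrix operation instruction includes at least one opcode, at least 4 operands and 2 bit-width fields. The opcode indicates the function of the matrix-matrix operation instruction, and the operation unit can perform different matrix operations by identifying one or more opcodes. The operands indicate the data information of the matrix-matrix operation instruction, where the data information may be an immediate value or a register number. The bit-width fields specify the bit widths of the operands in the operation; they also specify the bit widths of the operators used during the operation and whether low-bit-width operators need to be merged into high-bit-width operators.
FIG. 6 is a schematic diagram of the format of the vector-vector operation instruction, which is an instantiation of the instruction of FIG. 3. As shown in FIG. 6, the vector-vector operation instruction includes at least one opcode, at least 3 operands and at least 2 bit-width fields. The opcode indicates the function of the vector-vector operation instruction, and the operation unit can perform different vector operations by identifying one or more opcodes. The operands indicate the data information of the vector-vector operation instruction, where the data information may be an immediate value or a register number. The bit-width fields specify the bit widths of the operands in the operation; they also specify the bit widths of the operators used during the operation and whether low-bit-width operators need to be merged into high-bit-width operators.
FIG. 7 is a schematic diagram of the format of the matrix-vector operation instruction, which is an instantiation of the instruction of FIG. 3. As shown in FIG. 7, the matrix-vector operation instruction includes at least one opcode, at least 6 operands and at least 3 bit-width fields. The opcode indicates the function of the matrix-vector operation instruction, and the operation unit can perform different matrix and vector operations by identifying one or more opcodes. The operands indicate the data information of the matrix-vector operation instruction, where the data information may be an immediate value or a register number. The bit-width fields specify the bit widths of the operands in the operation; they also specify the bit widths of the operators used during the operation and whether low-bit-width operators need to be merged into high-bit-width operators.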
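FIG. 8 is a schematic structural diagram of the operation device of a preferred embodiment of the present invention. As shown in FIG. 8, the device includes a fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a store queue, a reorder buffer, an operation unit, a scratchpad memory and an IO memory access module.
Fetch module: responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module.
Decoding module: responsible for decoding the instruction and passing the decoded instruction to the instruction queue. As shown in FIG. 9, the decoding module includes an instruction accepting module, a micro-instruction generation module, a micro-instruction queue and a micro-instruction issue module. The instruction accepting module accepts the instruction obtained from the fetch module; the micro-instruction decoding module decodes the instruction obtained by the instruction accepting module into micro-instructions that control each functional component; the micro-instruction queue stores the micro-instructions sent from the micro-instruction decoding module; and the micro-instruction issue module issues the micro-instructions to each functional component.
Instruction queue: buffers the decoded instructions in order and sends them to the dependency processing unit.
Scalar register file: provides the scalar registers the device needs during operation.
Dependency processing unit: handles the storage dependencies that may exist between an instruction and the preceding instruction. Matrix operation instructions access the scratchpad memory, and consecutive instructions may access the same block of storage. To guarantee the correctness of the execution result, if the current instruction is detected to have a data dependency on a previous instruction, it must wait in the store queue until the dependency is eliminated.
Store queue: an ordered queue. Instructions that have a data dependency on previous instructions are held in this queue until the dependency is eliminated, after which they are submitted.
Reorder buffer: while an instruction is executing, it is also cached in this module. When an instruction finishes executing, it is committed if it is also the earliest of the uncommitted instructions in the reorder buffer. Once an instruction is committed, the changes it made to the device state cannot be undone. Instructions in the reorder buffer act as placeholders: when the first instruction it holds has a data dependency, that instruction is not committed (released). Although many instructions keep arriving, only some of them can be accepted (limited by the size of the reorder buffer), and the overall operation proceeds smoothly only once the first instruction has been committed.
Operation unit: responsible for all neural network operations and matrix/vector operations of the device, including but not limited to: convolutional neural network forward operation, convolutional neural network training operation, neural network pooling operation, fully connected neural network forward operation, fully connected neural network training operation, batch normalization operation, RBM neural network operation, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operation, vector inner product operation, the four basic vector arithmetic operations, vector logical operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector cyclic shift operations, and generation of random vectors obeying a given distribution. Operation instructions are sent to the operation unit for execution. The operation unit first determines whether there is an operator whose bit width equals the bit-width field corresponding to the operand in the instruction. If so, it selects that operator; if not, it forms an operator of the required bit width by merging several low-bit-width operators. It then applies the operation indicated by the opcode to the operands using the selected operator and produces the result.
Scratchpad memory: a temporary storage device dedicated to data, able to support data of different lengths and different bit widths.
IO memory access module: directly accesses the scratchpad memory and is responsible for reading data from it and writing data to it.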
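The interplay of the dependency processing unit and the store queue described in the module list above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the patent's circuit: a dependency is taken to mean overlapping scratchpad address ranges, an instruction arriving behind a waiting one also waits (to keep the queue ordered), and the names `MemInstr` and `DependencyUnit` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class MemInstr:
    """Instruction touching scratchpad addresses [start, start + length)."""
    name: str
    start: int
    length: int


def overlaps(a: MemInstr, b: MemInstr) -> bool:
    """True if the two address ranges share at least one address."""
    return a.start < b.start + b.length and b.start < a.start + a.length


class DependencyUnit:
    def __init__(self) -> None:
        self.in_flight: List[MemInstr] = []
        self.store_queue: Deque[MemInstr] = deque()

    def _conflicts(self, instr: MemInstr) -> bool:
        return any(overlaps(instr, other) for other in self.in_flight)

    def submit(self, instr: MemInstr) -> bool:
        """Issue the instruction if independent, else park it. True if issued."""
        if self.store_queue or self._conflicts(instr):
            self.store_queue.append(instr)
            return False
        self.in_flight.append(instr)
        return True

    def complete(self, instr: MemInstr) -> List[MemInstr]:
        """Retire an instruction and release waiting ones from the queue head."""
        self.in_flight.remove(instr)
        released: List[MemInstr] = []
        while self.store_queue and not self._conflicts(self.store_queue[0]):
            head = self.store_queue.popleft()
            self.in_flight.append(head)
            released.append(head)
        return released


unit = DependencyUnit()
a, b, c = MemInstr("A", 0, 16), MemInstr("B", 8, 16), MemInstr("C", 64, 8)
issued = [unit.submit(a), unit.submit(b), unit.submit(c)]
released = [i.name for i in unit.complete(a)]
```

In this run A issues immediately, B waits because its range overlaps A, and C waits behind B. Completing A releases B and then C in order.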
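FIG. 10 is a flowchart of the operation device of an embodiment of the present invention performing an operation with one instruction. As shown in FIG. 10, the process includes:
S1: the fetch module fetches the instruction and sends it to the decoding module.
S2: the decoding module decodes the instruction and sends it to the instruction queue.
S3: in the decoding module, the instruction is sent to the instruction accepting module.
S4: the instruction accepting module sends the instruction to the micro-instruction generation module to generate micro-instructions.
S5: the micro-instruction generation module obtains the neural network operation opcode and the neural network operation operands of the instruction from the scalar register file, decodes the instruction into micro-instructions that control each functional component, and sends them to the micro-instruction issue queue.
S6: after the required data has been obtained, the instruction is sent to the dependency processing unit. The dependency processing unit analyzes whether the instruction has a data dependency on a previous instruction that has not finished executing. If it does, the instruction waits in the store queue until the dependency no longer exists.
S7: once no dependency remains, the micro-instructions corresponding to the neural network operation or matrix/vector instruction are sent to functional components such as the operation unit.
S8: the operation unit fetches the required data from the scratchpad memory according to its address and size, and then determines whether there is an operator whose bit width equals the bit-width field in the instruction. If so, it selects the matching operator to complete the operation corresponding to the instruction; if not, it merges low-bit-width operators into an operator of the required bit width to complete the operation.
S9: after the operation is completed, the output data is written back to the specified address in the scratchpad memory, and the instruction is committed in the reorder buffer.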
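Step S9 ends with the instruction being committed in the reorder buffer. The in-order commit rule described for that buffer (an instruction commits only when it has finished and is the oldest uncommitted entry, and new instructions are accepted only while there is room) can be sketched as below. The class `ReorderBuffer` and its method names are hypothetical, and instruction names are assumed to be unique.

```python
from collections import OrderedDict
from typing import List


class ReorderBuffer:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Maps instruction name -> finished flag, in arrival order.
        self._entries: "OrderedDict[str, bool]" = OrderedDict()

    def accept(self, name: str) -> bool:
        """Reserve a slot for a new instruction; False if the buffer is full."""
        if len(self._entries) >= self.capacity:
            return False
        self._entries[name] = False
        return True

    def finish(self, name: str) -> List[str]:
        """Mark an instruction finished and commit from the head while possible."""
        if name not in self._entries:
            raise KeyError(name)
        self._entries[name] = True
        committed: List[str] = []
        while self._entries:
            oldest = next(iter(self._entries))
            if not self._entries[oldest]:
                break  # the oldest entry still blocks everything behind it
            self._entries.popitem(last=False)
            committed.append(oldest)
        return committed


rob = ReorderBuffer(capacity=2)
accepted = [rob.accept("I1"), rob.accept("I2"), rob.accept("I3")]
after_i2 = rob.finish("I2")
after_i1 = rob.finish("I1")
```

Here I3 is refused because the buffer is full, finishing I2 commits nothing while I1 is still pending, and finishing I1 then commits both in program order.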
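FIG. 11 and FIG. 12 are schematic diagrams of the instruction formats used when an operation is performed with two instructions. FIG. 11 shows the format of the bit-width configuration instruction, which includes at least one opcode and at least 2 bit-width fields that indicate the bit width of the operators used by the next operation instruction. FIG. 12 shows the format of the operation instruction, which includes at least one opcode and at least 3 operands. The opcode indicates the function of the operation instruction, and the operation unit can perform different operations by identifying one or more opcodes. The operands indicate the data information of the operation instruction, where the data information may be an immediate value or a register number. For example, to obtain a matrix, the matrix start address and the matrix length are read from the register identified by the register number, and the matrix stored at that address is then fetched from the storage unit according to the start address and length.
FIGS. 13 and 14 are instantiations of FIGS. 11 and 12, showing the formats of the neural network bit-width configuration instruction and the neural network operation instruction, respectively. As shown in FIGS. 13 and 14, the bit-width configuration instruction includes at least one opcode and at least 4 bit-width fields that indicate the bit width of the operators used by the next neural network operation instruction. The neural network operation instruction includes at least one opcode and 16 operands. The opcode indicates the function of the neural network operation instruction, and the operation unit can perform different neural network operations by identifying one or more opcodes. The operands indicate the data information of the neural network operation instruction, where the data information may be an immediate value or a register number.
FIGS. 15 and 16 are instantiations of FIGS. 11 and 12, showing the formats of the matrix-matrix bit-width configuration instruction and the matrix-matrix operation instruction, respectively. As shown in FIGS. 15 and 16, the bit-width configuration instruction includes at least one opcode and at least 2 bit-width fields that indicate the bit width of the operators used by the next matrix-matrix operation instruction. The matrix-matrix operation instruction includes at least one opcode and at least 4 operands. The opcode indicates the function of the matrix-matrix operation instruction, and the operation unit can perform different matrix operations by identifying one or more opcodes. The operands indicate the data information of the matrix-matrix operation instruction, where the data information may be an immediate value or a register number.
FIGS. 17 and 18 are instantiations of FIGS. 11 and 12, showing the formats of the vector-vector bit-width configuration instruction and the vector-vector operation instruction, respectively. As shown in FIGS. 17 and 18, the bit-width configuration instruction includes at least one opcode and at least 2 bit-width fields that indicate the bit width of the operators used by the next vector-vector operation instruction. The vector-vector operation instruction includes at least one opcode and at least 3 operands. The opcode indicates the function of the vector-vector operation instruction, and the operation unit can perform different vector operations by identifying one or more opcodes. The operands indicate the data information of the vector-vector operation instruction, where the data information may be an immediate value or a register number.
FIGS. 19 and 20 are instantiations of FIGS. 11 and 12, showing the formats of the matrix-vector bit-width configuration instruction and the matrix-vector operation instruction, respectively. As shown in FIGS. 19 and 20, the bit-width configuration instruction includes at least one opcode and at least 3 bit-width fields that indicate the bit width of the operators used by the next matrix-vector operation instruction. The matrix-vector operation instruction includes at least one opcode and at least 6 operands. The opcode indicates the function of the matrix-vector operation instruction, and the operation unit can perform different matrix and vector operations by identifying one or more opcodes. The operands indicate the data information of the matrix-vector operation instruction, where the data information may be an immediate value or a register number.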
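FIG. 21 is a flowchart of the operation device of an embodiment of the present invention performing an operation with two instructions. As shown in FIG. 21, the process includes:
Step S1: the fetch module fetches a bit-width configuration instruction and sends it to the decoding module;
Step S2: the decoding module decodes the instruction and sends it to the instruction queue;
Step S3: in the decoding module, the instruction is sent to the instruction accepting module;
Step S4: the instruction accepting module sends the instruction to the micro-instruction decoding module for micro-instruction decoding;
Step S5: the micro-instruction decoding module decodes the instruction into a micro-instruction that controls the operation unit to select operators of the specified bit width, and sends it to the micro-instruction issue queue;
Step S6: the fetch module fetches a neural network operation or matrix/vector instruction and sends it to the decoding module;
Step S7: the decoding module decodes the instruction and sends it to the instruction queue;
Step S8: in the decoding module, the instruction is sent to the instruction accepting module;
Step S9: the instruction accepting module sends the instruction to the micro-instruction decoding module for micro-instruction decoding;
Step S10: the micro-instruction decoding module obtains the neural network operation opcode and the neural network operation operands of the instruction from the scalar register file, decodes the instruction into micro-instructions that control each functional component, and sends them to the micro-instruction issue queue;
Step S11: after the required data has been obtained, the instruction is sent to the dependency processing unit. The dependency processing unit analyzes whether the instruction has a data dependency on a previous instruction that has not finished executing; if so, the instruction waits in the store queue until the dependency no longer exists;
Step S12: the micro-instructions corresponding to the instruction, together with the earlier micro-instruction that specifies the operator bit width, are sent to the operation unit;
Step S13: the operation unit fetches the required data from the scratchpad memory according to its address and size, and then determines whether there is an operator whose bit width equals the bit-width field in the bit-width configuration instruction. If so, it selects the matching operator to complete the neural network operation and/or matrix/vector operation corresponding to the instruction; if not, it merges low-bit-width operators into an operator of the required bit width to complete that operation;
Step S14: after the operation is completed, the output data is written back to the specified address in the scratchpad memory, and the instruction is committed in the reorder buffer.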
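In the two-instruction flow above, the bit widths arrive in a configuration instruction and are applied to the operation instruction that follows. A minimal Python sketch of that pairing is shown below. It assumes that a configuration applies to exactly one following operation instruction and is then consumed, which is one reading of "the next operation instruction"; the names `WidthConfig`, `Operation`, `pair_widths` and the mnemonic `MM_ADD` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class WidthConfig:
    """Bit-width configuration instruction: bit-width fields only."""
    bit_widths: Tuple[int, ...]


@dataclass(frozen=True)
class Operation:
    """Operation instruction: opcode and operands, no bit-width fields."""
    opcode: str
    operands: Tuple[int, ...]


Instr = Union[WidthConfig, Operation]


def pair_widths(program: Sequence[Instr]) -> List[Tuple[Operation, Tuple[int, ...]]]:
    """Attach each operation to the widths set by the preceding configuration."""
    pending: Optional[Tuple[int, ...]] = None
    issued: List[Tuple[Operation, Tuple[int, ...]]] = []
    for instr in program:
        if isinstance(instr, WidthConfig):
            pending = instr.bit_widths  # latch widths for the next operation
        else:
            if pending is None:
                raise ValueError(f"{instr.opcode} has no preceding bit-width configuration")
            issued.append((instr, pending))
            pending = None  # assumption: a configuration is used once
    return issued


program = [
    WidthConfig((16, 32)),
    Operation("MM_ADD", (0, 1, 2, 3)),
]
issued = pair_widths(program)
```

The paired widths would then drive the operator selection in step S13. Compared with the single-instruction format, this keeps the operation instruction shorter at the cost of one extra instruction per width change.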
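In summary, the present invention discloses a device and method with configurable operator bit width for performing neural network operations and matrix/vector operations. Together with the corresponding instructions, it addresses the neural network algorithms and large volumes of matrix/vector operations found in the computer field today. Compared with existing conventional solutions, the invention offers configurable instructions, ease of use, selectable operator bit widths and mergeable operators. The operator bit width can be configured in two ways: through a dedicated bit-width configuration instruction, or through bit-width fields specified in the operation instruction. The supported neural network scale and the matrix/vector bit width and scale are flexible, and the on-chip buffer is sufficient.
The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.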

Claims (14)

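  1. An operation unit for selecting one or more operators according to the bit width of operation data in order to perform an operation, wherein the one or more operators have different bit widths, characterized in that the operation unit comprises a judging submodule, an operator merging submodule and an operation submodule;
    the judging submodule is configured to determine whether an operator exists whose bit width equals that of the operation data, and if so, to pass the operation data to that operator, and otherwise to pass an operator merging strategy and the operation data to the operator merging submodule;
    the operator merging submodule is configured to merge multiple operators into a new operator according to the operator merging strategy, so that the bit width of the new operator matches the bit width of the operation data, and to pass the operation data to the new operator;
    the operation submodule is configured to cause the operator that received the operation data to perform the operation.
  2. The operation unit according to claim 1, characterized in that the operation unit performs the operation according to an instruction, wherein the instruction comprises:
    an opcode, indicating the operation type of the instruction;
    an operand, serving as the operation data or indicating the storage address of the operation data;
    a bit-width field, indicating the bit width of the operation data;
    the operation unit executes the instruction, determines the bit width of the operation data according to the bit-width field in the instruction, selects the corresponding operator, and then passes the operand in the instruction to that operator; the operator obtains the operation data according to the operand and performs the operation indicated by the opcode.
  3. The operation unit according to claim 1, characterized in that the operation unit performs the operation according to a bit-width configuration instruction and an operation instruction, the bit-width configuration instruction comprising an opcode and a bit-width field, and the operation instruction comprising an opcode and an operand, wherein
    the opcode indicates the operation type of the instruction;
    the operand serves as the operation data or indicates the storage address of the operation data;
    the bit-width field indicates the bit width of each operand in the instruction;
    the operation unit executes the bit-width configuration instruction and the operation instruction in sequence, determines the bit width of the operand in the operation instruction according to the bit-width field in the bit-width configuration instruction, selects the corresponding operator, and then passes the operand in the operation instruction to that operator; the operator obtains the operation data according to the operand and performs the operation indicated by the opcode.
  4. The operation unit according to claim 1, characterized in that the operator merging strategy is to merge one or more operators whose bit widths are closest to the bit width of the operation data.
  5. The operation unit according to claim 1, characterized in that the operand is the operation data or the storage location of the operation data, and the operator performs the operation after obtaining the corresponding operation data according to the operand.
  6. The operation unit according to claim 1, characterized in that the operation data is one of a vector, a matrix and a neuron.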
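  7. An operation method for selecting one or more operators according to the bit width of operation data in order to perform an operation, wherein the one or more operators have different bit widths, characterized in that the method comprises:
    S1: determining whether an operator exists whose bit width equals that of the operation data; if so, passing the operation data to that operator and then performing step S3; otherwise, generating an operator merging strategy and performing step S2;
    S2: merging multiple operators into a new operator according to the operator merging strategy, so that the bit width of the new operator matches the bit width of the operation data, and passing the operation data to the new operator;
    S3: causing the operator that received the operation data to perform the operation.
  8. The operation unit according to claim 7, characterized in that the operation unit performs the operation according to an instruction, wherein the instruction comprises:
    an opcode, indicating the operation type of the instruction;
    an operand, serving as the operation data or indicating the storage address of the operation data;
    a bit-width field, indicating the bit width of the operation data;
    the operation unit executes the instruction, determines the bit width of the operation data according to the bit-width field in the instruction, selects the corresponding operator, and then passes the operand in the instruction to that operator; the operator obtains the operation data according to the operand and performs the operation indicated by the opcode.
  9. The operation unit according to claim 7, characterized in that the operation unit performs the operation according to a bit-width configuration instruction and an operation instruction, the bit-width configuration instruction comprising an opcode and a bit-width field, and the operation instruction comprising an opcode and an operand, wherein
    the opcode indicates the operation type of the instruction;
    the operand serves as the operation data or indicates the storage address of the operation data;
    the bit-width field indicates the bit width of each operand in the instruction;
    the operation unit executes the bit-width configuration instruction and the operation instruction in sequence, determines the bit width of the operand in the operation instruction according to the bit-width field in the bit-width configuration instruction, selects the corresponding operator, and then passes the operand in the operation instruction to that operator; the operator obtains the operation data according to the operand and performs the operation indicated by the opcode.
  10. The operation unit according to claim 7, characterized in that the operator merging strategy is to merge one or more operators whose bit widths are closest to the bit width of the operation data.
  11. The operation method according to claim 7, characterized in that the operand is the operation data or the storage location of the operation data, and the operator performs the operation after obtaining the corresponding operation data according to the operand.
  12. The operation method according to claim 7, characterized in that the operation data is one of a vector, a matrix and a neuron.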
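  13. An operation device, characterized by comprising:
    the operation unit according to any one of claims 1 to 5;
    a storage unit for storing the operation data;
    a register unit for storing the address of the operation data;
    a control unit for controlling the operation unit, the storage unit and the register unit, so that the operation unit accesses the register unit according to the operand in the instruction to obtain the address of the operation data, accesses the storage unit according to that address to obtain the operation data, and performs the operation on the operation data.
  14. The operation device according to claim 13, characterized in that the storage unit is a scratchpad memory.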
PCT/CN2017/093159 2016-08-05 2017-07-17 一种能支持不同位宽运算数据的运算单元、方法及装置 Ceased WO2018024093A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP17836275.2A EP3496006B1 (en) 2016-08-05 2017-07-17 Operation unit, method and device capable of supporting operation data of different bit widths
KR1020187034252A KR102486029B1 (ko) 2016-08-05 2017-07-17 비트폭이 다른 연산 데이터를 지원하는 연산 유닛, 연산 방법 및 연산 장치
US16/268,457 US10489704B2 (en) 2016-08-05 2019-02-05 Operation unit, method and device capable of supporting operation data of different bit widths

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610640111.8 2016-08-05
CN201610640111.8A CN107688854B (zh) 2016-08-05 2016-08-05 一种能支持不同位宽运算数据的运算单元、方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/268,457 Continuation-In-Part US10489704B2 (en) 2016-08-05 2019-02-05 Operation unit, method and device capable of supporting operation data of different bit widths

Publications (1)

Publication Number Publication Date
WO2018024093A1 true WO2018024093A1 (zh) 2018-02-08

Family

ID=61072493

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/093159 Ceased WO2018024093A1 (zh) 2016-08-05 2017-07-17 一种能支持不同位宽运算数据的运算单元、方法及装置

Country Status (6)

Country Link
US (1) US10489704B2 (zh)
EP (1) EP3496006B1 (zh)
KR (1) KR102486029B1 (zh)
CN (2) CN114004349A (zh)
TW (1) TWI789358B (zh)
WO (1) WO2018024093A1 (zh)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10657442B2 (en) * 2018-04-19 2020-05-19 International Business Machines Corporation Deep learning accelerator architecture with chunking GEMM
US11467973B1 (en) * 2018-09-28 2022-10-11 Amazon Technologies, Inc. Fine-grained access memory controller
CN111290788B (zh) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111079912B (zh) * 2018-10-19 2021-02-12 中科寒武纪科技股份有限公司 运算方法、系统及相关产品
CN111258634B (zh) * 2018-11-30 2022-11-22 上海寒武纪信息科技有限公司 数据选择装置、数据处理方法、芯片及电子设备
CN111260045B (zh) * 2018-11-30 2022-12-02 上海寒武纪信息科技有限公司 译码器和原子指令解析方法
KR102393916B1 (ko) * 2019-06-27 2022-05-02 주식회사 사피온코리아 위노그라드 알고리즘에 기반한 행렬 곱셈 방법 및 장치
CN112286578B (zh) * 2019-07-25 2025-09-12 北京百度网讯科技有限公司 由计算设备执行的方法、装置、设备和计算机可读存储介质
CN110728365B (zh) * 2019-09-12 2022-04-01 东南大学 多位宽pe阵列计算位宽的选择方法及计算精度控制电路
US20210200549A1 (en) * 2019-12-27 2021-07-01 Intel Corporation Systems, apparatuses, and methods for 512-bit operations
CN111459546B (zh) * 2020-03-30 2023-04-18 芯来智融半导体科技(上海)有限公司 一种实现操作数位宽可变的装置及方法
US12112167B2 (en) 2020-06-27 2024-10-08 Intel Corporation Matrix data scatter and gather between rows and irregularly spaced memory locations
US12474928B2 (en) 2020-12-22 2025-11-18 Intel Corporation Processors, methods, systems, and instructions to select and store data elements from strided data element positions in a first dimension from three source two-dimensional arrays in a result two-dimensional array
CN113469349B (zh) * 2021-07-02 2022-11-08 上海酷芯微电子有限公司 多精度神经网络模型实现方法及系统
CN114528248A (zh) * 2022-04-24 2022-05-24 广州万协通信息技术有限公司 阵列重构方法、装置、设备及存储介质
CN115390925A (zh) * 2022-08-16 2022-11-25 海光信息技术股份有限公司 指令的数据处理方法、相关器件及电子设备
CN116248217B (zh) * 2022-12-14 2026-04-14 成都海光集成电路设计有限公司 时间同步运算方法、模组及数据传输设备
WO2025118193A1 (zh) * 2023-12-06 2025-06-12 声龙(新加坡)私人有限公司 运算加速装置、方法及芯片

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102012876A (zh) * 2010-11-19 2011-04-13 中兴通讯股份有限公司 大位宽数据的写入、读取方法及控制器
CN102238348A (zh) * 2010-04-20 2011-11-09 上海华虹集成电路有限责任公司 一种可变数据个数的fft/ifft处理器的基4模块
CN103188487A (zh) * 2011-12-28 2013-07-03 联芯科技有限公司 视频图像中的卷积方法及视频图像处理系统

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5471633A (en) * 1993-09-30 1995-11-28 Intel Corporation Idiom recognizer within a register alias table
US5590352A (en) * 1994-04-26 1996-12-31 Advanced Micro Devices, Inc. Dependency checking and forwarding of variable width operands
US6418527B1 (en) * 1998-10-13 2002-07-09 Motorola, Inc. Data processor instruction system for grouping instructions with or without a common prefix and data processing system that uses two or more instruction grouping methods
JP4475614B2 (ja) * 2000-04-28 2010-06-09 大正製薬株式会社 並列処理方法におけるジョブの割り当て方法および並列処理方法
US6948051B2 (en) * 2001-05-15 2005-09-20 International Business Machines Corporation Method and apparatus for reducing logic activity in a microprocessor using reduced bit width slices that are enabled or disabled depending on operation width
JP3497852B1 (ja) * 2002-06-06 2004-02-16 沖電気工業株式会社 演算方法および演算回路
US8595279B2 (en) * 2006-02-27 2013-11-26 Qualcomm Incorporated Floating-point processor with reduced power requirements for selectable subprecision
DE602006006990D1 (de) * 2006-06-28 2009-07-09 St Microelectronics Nv SIMD-Prozessorarchitektur mit gruppierten Verarbeitungseinheiten
CN100456231C (zh) * 2007-03-19 2009-01-28 中国人民解放军国防科学技术大学 灵活分配运算群资源的流处理器扩展方法
US20110320765A1 (en) * 2010-06-28 2011-12-29 International Business Machines Corporation Variable width vector instruction processor
JP5786719B2 (ja) * 2012-01-04 2015-09-30 富士通株式会社 ベクトルプロセッサ
CN102591615A (zh) * 2012-01-16 2012-07-18 中国人民解放军国防科学技术大学 结构化混合位宽乘法运算方法及装置
CN103019647B (zh) * 2012-11-28 2015-06-24 中国人民解放军国防科学技术大学 具有浮点精度保持功能的浮点累加/累减运算方法
US9389863B2 (en) * 2014-02-10 2016-07-12 Via Alliance Semiconductor Co., Ltd. Processor that performs approximate computing instructions
CN103914277B (zh) * 2014-04-14 2017-02-15 复旦大学 一种基于改进的Montgomery模乘算法的可扩展模乘器电路

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102238348A (zh) * 2010-04-20 2011-11-09 上海华虹集成电路有限责任公司 一种可变数据个数的fft/ifft处理器的基4模块
CN102012876A (zh) * 2010-11-19 2011-04-13 中兴通讯股份有限公司 大位宽数据的写入、读取方法及控制器
CN103188487A (zh) * 2011-12-28 2013-07-03 联芯科技有限公司 视频图像中的卷积方法及视频图像处理系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3496006A4 *

Also Published As

Publication number Publication date
EP3496006A4 (en) 2020-01-22
US20190236442A1 (en) 2019-08-01
TWI789358B (zh) 2023-01-11
KR102486029B1 (ko) 2023-01-06
TW201805835A (zh) 2018-02-16
CN114004349A (zh) 2022-02-01
EP3496006A1 (en) 2019-06-12
EP3496006B1 (en) 2023-12-27
KR20190029515A (ko) 2019-03-20
CN107688854A (zh) 2018-02-13
US10489704B2 (en) 2019-11-26
CN107688854B (zh) 2021-10-19

Similar Documents

Publication Publication Date Title
WO2018024093A1 (zh) 一种能支持不同位宽运算数据的运算单元、方法及装置
WO2017185418A1 (zh) 一种用于执行神经网络运算以及矩阵/向量运算的装置和方法
CN111291880B (zh) 计算装置以及计算方法
CN109240746B (zh) 一种用于执行矩阵乘运算的装置和方法
CN110135581B (zh) 用于执行人工神经网络反向运算的装置和方法
CN109993285B (zh) 用于执行人工神经网络正向运算的装置和方法
KR102123633B1 (ko) 행렬 연산 장치 및 방법
CN110689138A (zh) 运算方法、装置及相关产品
CN111580866A (zh) 一种向量运算装置及运算方法
CN111027690B (zh) 执行确定性推理的组合处理装置、芯片和方法
WO2017185393A1 (zh) 一种用于执行向量内积运算的装置和方法
KR102703934B1 (ko) 벡터 인덱스 레지스터
WO2017185392A1 (zh) 一种用于执行向量四则运算的装置和方法
CN107315563A (zh) 一种用于执行向量比较运算的装置和方法
WO2017181336A1 (zh) maxout层运算装置和方法
TWI733746B (zh) 在運行時最佳化指令的處理器、由處理器在運行時最佳化指令的方法及非暫態機器可讀媒體
CN107315565B (zh) 一种用于生成服从一定分布的随机向量装置和方法

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 20187034252

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17836275

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017836275

Country of ref document: EP

Effective date: 20190305