WO2017124648A1 - Vector computing device - Google Patents

Vector computing device

Info

Publication number
WO2017124648A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
unit
vector operation
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2016/078550
Other languages
English (en)
French (fr)
Inventor
陈天石
张潇
刘少礼
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to KR1020197017258A (patent KR102185287B1)
Priority to EP16885912.2A (patent EP3407182B1)
Priority to KR1020187015435A (patent KR20180100550A)
Priority to KR1020207013258A (patent KR102304216B1)
Publication of WO2017124648A1
Priority to US16/039,803 (patent US10762164B2)
Priority to US16/942,482 (patent US11734383B2)
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013 Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G06F9/35 Indirect addressing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing

Definitions

  • The invention relates to a vector operation device for performing vector operations according to vector operation instructions. It addresses the problem that a growing number of algorithms in the computer field contain large numbers of vector operations.
  • One known scheme for performing vector operations is to use a general-purpose processor, which executes general-purpose instructions through a general-purpose register file and general-purpose functional units.
  • A single general-purpose processor is designed mostly for scalar computation and performs poorly on vector operations.
  • When multiple general-purpose processors execute in parallel, communication between them may become a performance bottleneck.
  • In another scheme, a graphics processing unit (GPU) is used for vector computation: vector operations are performed by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units.
  • The GPU on-chip cache is too small, so large-scale vector operations require continual off-chip data transfers, and off-chip bandwidth becomes the main performance bottleneck.
  • In yet another scheme, vector computation is performed by specially customized vector operation devices that use a customized register file and customized processing units.
  • Existing dedicated vector operation devices are limited by their register files and cannot flexibly support vector operations of different lengths.
  • The present invention provides a vector operation device for performing vector operations according to vector operation instructions, including:
  • a storage unit for storing vectors;
  • a register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit;
  • a vector operation unit configured to obtain a vector operation instruction, obtain a vector address from the register unit according to the instruction, fetch the corresponding vector from the storage unit according to that address, and perform the vector operation on the fetched vector to obtain a vector operation result.
  • The vector operation device provided by the invention temporarily stores the vector data involved in the computation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during vector operations, improving the execution performance of tasks that contain a large number of vector computations.
  • The instructions adopted by the present invention have a compact format, which makes the instruction set easy to use and the supported vector length flexible.
  • FIG. 1 is a schematic structural diagram of a vector operation device provided by the present invention.
  • FIG. 2 is a schematic diagram of the format of an instruction set provided by the present invention.
  • FIG. 3 is a schematic structural diagram of a vector operation device according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a vector dot product instruction executed by a vector operation device according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a vector operation unit according to an embodiment of the present invention.
  • The present invention provides a vector computing device including a storage unit, a register unit, and a vector operation unit.
  • The storage unit stores vectors.
  • The register unit stores the addresses at which the vectors are stored.
  • The vector operation unit obtains a vector address from the register unit according to a vector operation instruction.
  • It then fetches the corresponding vector from the storage unit according to that address and performs the vector operation on the fetched vector to obtain a vector operation result.
  • The invention temporarily stores the vector data involved in the computation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during vector operations, improving the execution performance of tasks that contain a large number of vector computations.
  • The vector operation device includes:
  • a storage unit for storing vectors; in one embodiment, the storage unit may be a scratchpad memory capable of supporting vector data of different sizes.
  • The present invention temporarily stores the necessary computation data in the scratchpad memory, so that the device can support data of different widths more flexibly and effectively when performing vector operations.
  • a register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit;
  • The register unit may be a scalar register file that provides the scalar registers required during operation; the scalar registers hold not only vector addresses but also scalar data.
  • For operations involving both a vector and a scalar, the vector operation unit obtains not only the vector address but also the corresponding scalar from the register unit.
  • a vector operation unit configured to obtain a vector operation instruction, obtain a vector address from the register unit according to the instruction, fetch the corresponding vector from the storage unit according to that address, perform the vector operation on the fetched vector to obtain a vector operation result, and store the result in the storage unit.
  • The vector operation unit includes a vector addition component, a vector multiplication component, a magnitude comparison component, a nonlinear operation component, and a vector-scalar multiplication component, and it is organized as a multi-stage pipeline: the vector addition and vector multiplication components are in the first pipeline stage, the magnitude comparison component is in the second pipeline stage, and the nonlinear operation and vector-scalar multiplication components are in the third pipeline stage. Because these components sit in different pipeline stages, a series of consecutive vector operation instructions can be carried out more efficiently when their order matches the order of the pipeline stages of the corresponding components.
  • The vector operation unit is responsible for all vector operations of the device, including but not limited to vector addition, vector-plus-scalar, vector subtraction, vector-minus-scalar, vector multiplication, vector-times-scalar, vector division (element-wise), vector AND, and vector OR operations; vector operation instructions are sent to this unit for execution.
  • The vector operation device further includes an instruction cache unit for storing vector operation instructions to be executed. An instruction remains cached in the instruction cache unit while it executes. When an instruction finishes executing, if it is also the earliest uncommitted instruction in the instruction cache unit, it is committed; once committed, the changes its operations made to the device state cannot be undone.
  • The instruction cache unit may be a reorder buffer.
  • The vector operation device further includes an instruction processing unit configured to obtain vector operation instructions from the instruction cache unit, process them, and provide them to the vector operation unit.
  • The instruction processing unit includes:
  • an instruction fetch module configured to obtain vector operation instructions from the instruction cache unit;
  • a decoding module configured to decode the obtained vector operation instructions;
  • an instruction queue for storing the decoded vector operation instructions in order. Because different instructions may depend on one another through the registers they use, the queue buffers decoded instructions and issues each one once its dependencies are satisfied.
  • The vector operation device further includes a dependency processing unit that, before the vector operation unit obtains a vector operation instruction, determines whether that instruction accesses the same vector as the preceding vector operation instruction. If so, the instruction is placed in a storage queue and provided to the vector operation unit after the preceding instruction has finished executing; otherwise, it is provided to the vector operation unit directly.
  • Specifically, when vector operation instructions access the scratchpad memory, consecutive instructions may access the same region of storage. To guarantee correct results, an instruction that is found to have a data dependency on an earlier instruction must wait in the storage queue until the dependency is resolved.
  • The vector operation device further includes an input/output unit for storing vectors into the storage unit or obtaining vector operation results from the storage unit.
  • The input/output unit can directly access the storage unit and is responsible for reading vector data from memory or writing vector data to memory.
  • The instruction set used in the device of the present invention adopts a Load/Store architecture, and the vector operation unit does not operate on data in memory.
  • The instruction set follows a reduced instruction set architecture.
  • The instruction set provides only the most basic vector operations. Complex vector operations are emulated by combining these simple instructions, so that instructions can execute in a single cycle at a high clock frequency.
  • The instruction set also uses fixed-length instructions, so that the device can fetch the next instruction during the decoding stage of the previous one.
  • A vector operation instruction includes one opcode and at least one operand field. The opcode indicates the function of the instruction, and the vector operation unit performs different vector operations by identifying the opcode. An operand field indicates the data information of the instruction, which may be an immediate value or a register number. For example, to fetch a vector, the vector start address and vector length can be obtained from the corresponding registers according to the register numbers, and the vector stored at that address is then fetched from the storage unit according to the start address and length.
  • The instruction set contains vector operation instructions with different functions:
  • Vector addition instruction (VA);
  • Vector-plus-scalar instruction (VAS). According to the instruction, the device fetches vector data of a specified size from a specified address in the scratchpad memory, fetches scalar data from a specified address in the scalar register file, adds the scalar value to each element of the vector in the scalar operation unit, and writes the result back to a specified address in the scratchpad memory;
  • Vector subtraction instruction (VS). According to the instruction, the device fetches two blocks of vector data of a specified size from specified addresses in the scratchpad memory, subtracts them in the vector operation unit, and writes the result back to a specified address in the scratchpad memory;
  • Scalar-minus-vector instruction (SSV). According to the instruction, the device fetches scalar data from a specified address in the scalar register file, fetches vector data from a specified address in the scratchpad memory, subtracts the corresponding elements of the vector from the scalar in the vector computation unit, and writes the result back to a specified address in the scratchpad memory;
  • Vector multiplication instruction (VMV);
  • Vector-times-scalar instruction (VMS);
  • Vector division instruction (VD);
  • Scalar-divided-by-vector instruction (SDV). According to the instruction, the device fetches scalar data from a specified position in the scalar register file, fetches vector data of a specified size from a specified position in the scratchpad memory, divides the scalar by each corresponding element of the vector, and writes the result back to a specified position in the scratchpad memory;
  • Inter-vector AND instruction (VAV);
  • Intra-vector AND instruction (VAND);
  • Inter-vector OR instruction (VOV);
  • Intra-vector OR instruction (VOR);
  • Vector exponent instruction (VE). According to the instruction, the device fetches vector data of a specified size from a specified address in the scratchpad memory, applies the exponential operation to each element of the vector in the vector operation unit, and writes the result back to a specified address in the scalar register file;
  • Vector logarithm instruction (VL). According to the instruction, the device fetches vector data of a specified size from a specified address in the scratchpad memory, applies the logarithm operation to each element of the vector in the vector operation unit, and writes the result back to a specified address in the scalar register file;
  • Vector greater-than decision instruction (VGT). According to the instruction, the device fetches two blocks of vector data of a specified size from specified addresses in the scratchpad memory and compares them element by element in the vector operation unit; where the former is greater than the latter, the corresponding position of the output vector is set to 1, otherwise to 0, and the result is written back to a specified address in the scratchpad memory;
  • Vector equal-to decision instruction (VEQ). According to the instruction, the device fetches two blocks of vector data of a specified size from specified addresses in the scratchpad memory and compares them element by element in the vector operation unit; where the former equals the latter, the corresponding position of the output vector is set to 1, otherwise to 0, and the result is written back to a specified address in the scratchpad memory;
  • Vector NOT instruction (VNV);
  • An instruction that selects between two vectors. According to the instruction, the device fetches vector data of a specified size from specified addresses in the scratchpad memory, comprising a selection vector, selected vector one, and selected vector two. Depending on whether each element of the selection vector is 1 or 0, the vector computation unit takes the corresponding element from selected vector one or selected vector two as the element at that position in the output vector, and writes the result back to a specified position in the scratchpad memory;
  • Vector maximum instruction (VMAX);
  • Scalar extension instruction. According to the instruction, the device fetches scalar data from a specified address in the scalar register file, expands the scalar into a vector of a specified length in the vector operation unit, and writes the result back to the scalar register file;
  • Scalar-replaces-vector-element instruction (STVPN). According to the instruction, the device fetches a scalar from a specified address in the scalar register file, fetches vector data of a specified size from a specified address in the scratchpad memory, replaces the element at a specified position in the vector with the scalar value in the vector computation unit, and writes the result back to a specified address in the scratchpad memory;
  • Vector-element-replaces-scalar instruction (VPNTS). According to the instruction, the device fetches a scalar from a specified address in the scalar register file, fetches vector data of a specified size from a specified address in the scratchpad memory, replaces the scalar value with the element at a specified position in the vector in the vector computation unit, and writes the result back to a specified address in the scalar register file;
  • An instruction that retrieves a vector element. According to the instruction, the device fetches vector data of a specified size from a specified address in the scratchpad memory, takes the element at a specified position in the vector as the output, and writes the result back to a specified address in the scalar register file;
  • Vector dot product instruction (VP). According to the instruction, the device fetches two blocks of vector data of a specified size from specified addresses in the scratchpad memory, computes the dot product of the two vectors in the vector computation unit, and writes the result back to a specified address in the scalar register file;
  • Random vector instruction. According to the instruction, the device generates a random vector uniformly distributed between 0 and 1 in the vector computation unit and writes the result back to a specified address in the scratchpad memory;
  • Cyclic shift instruction (VCS);
  • Vector load instruction (VLOAD);
  • Vector store instruction (VS). According to the instruction, the device stores vector data of a specified size from a specified address in the scratchpad memory to an external destination address;
  • Vector move instruction (VMOVE). According to the instruction, the device stores vector data of a specified size from a specified address in the scratchpad memory to another specified address in the scratchpad memory.
  • FIG. 3 is a schematic structural diagram of a vector operation device according to an embodiment of the present invention.
  • The device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a storage queue, a reorder buffer, a vector operation unit, a scratchpad memory, and an IO memory access module;
  • the instruction fetch module, which is responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
  • the decoding module, which is responsible for decoding the instruction and passing the decoded instruction to the instruction queue;
  • the instruction queue, which buffers decoded instructions, since different instructions may depend on one another through the scalar registers they use, and issues each instruction once its dependencies are satisfied;
  • the scalar register file, which provides the scalar registers required by the device during operation;
  • the dependency processing unit, which handles storage dependencies that may exist between an instruction and the preceding instruction.
  • Vector operation instructions access the scratchpad memory, and consecutive instructions may access the same region of storage.
  • To guarantee correct results, an instruction that is found to have a data dependency on an earlier instruction must wait in the storage queue until the dependency is resolved.
  • the storage queue, an ordered queue in which instructions that have a data dependency on an earlier instruction are held until the dependency is resolved;
  • the reorder buffer, in which an instruction also remains cached while it executes.
  • When an instruction finishes executing, if it is also the earliest uncommitted instruction in the reorder buffer, it is committed. Once committed, the changes its operations made to the device state cannot be undone;
  • the vector operation unit, which is responsible for all vector operations of the device, including but not limited to vector addition, vector-plus-scalar, vector subtraction, vector-minus-scalar, vector multiplication, vector-times-scalar, vector division (element-wise), vector AND, and vector OR operations; vector operation instructions are sent to this unit for execution;
  • the scratchpad memory, a temporary storage device dedicated to vector data and capable of supporting vector data of different sizes;
  • the IO memory access module, which directly accesses the scratchpad memory and is responsible for reading data from it or writing data to it.
  • FIG. 4 is a flowchart of a vector dot product instruction executed by a vector operation device according to an embodiment of the present invention. As shown in FIG. 4, the process of executing a vector dot product instruction (VP) includes:
  • The instruction fetch module fetches the vector dot product instruction and sends it to the decoding module.
  • The decoding module decodes the instruction and sends it to the instruction queue.
  • In the instruction queue, the vector dot product instruction obtains from the scalar register file the data in the scalar registers named by its four operand fields: the start address of vector vin0, the length of vector vin0, the start address of vector vin1, and the length of vector vin1.
  • The instruction is sent to the dependency processing unit.
  • The dependency processing unit checks whether the instruction has a data dependency on an earlier instruction that has not finished executing. If it does, the instruction waits in the storage queue until the dependency no longer exists.
  • The vector dot product instruction is sent to the vector operation unit.
  • The vector operation unit fetches the required vectors from the data scratchpad memory according to the address and length of the required data, and then performs the dot product operation.
  • FIG. 5 is a schematic structural diagram of a vector operation unit according to an embodiment of the present invention.
  • The vector operation unit includes a vector addition component, a magnitude comparison component, a nonlinear operation component, a vector-scalar multiplication component, and the like.
  • The vector operation unit is a multi-stage pipeline in which the vector addition and vector multiplication components are in pipeline stage 1, the magnitude comparison component is in pipeline stage 2, and the nonlinear operation and vector-scalar multiplication components are in pipeline stage 3. Because these components sit in different pipeline stages, a series of consecutive vector operation instructions can be carried out more efficiently when their order matches the order of the pipeline stages of the corresponding components.
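The stage assignment described for FIG. 5 can be illustrated with a small check of whether an instruction sequence follows the pipeline order. This is only a sketch under stated assumptions: the text gives the stage of each component, but the mapping from opcodes to components below is inferred from the instruction names, and reading "matches the order" as strictly increasing stage numbers is an interpretation, not something the text spells out.

```python
# Pipeline stage of each functional component, as described for FIG. 5.
STAGE_OF_COMPONENT = {
    "vector_add": 1,
    "vector_multiply": 1,
    "compare": 2,
    "nonlinear": 3,
    "vector_scalar_multiply": 3,
}

# ASSUMPTION: which component each opcode uses is inferred from the
# instruction names; the source does not state this mapping.
COMPONENT_OF_OPCODE = {
    "VA": "vector_add",
    "VMV": "vector_multiply",
    "VGT": "compare",
    "VEQ": "compare",
    "VE": "nonlinear",
    "VL": "nonlinear",
    "VMS": "vector_scalar_multiply",
}


def stages(opcodes):
    """Return the pipeline stage used by each opcode in the sequence."""
    return [STAGE_OF_COMPONENT[COMPONENT_OF_OPCODE[op]] for op in opcodes]


def follows_pipeline_order(opcodes):
    """True if each instruction uses a later stage than the one before it.

    ASSUMPTION: "order matches the pipeline stages" is modelled as strictly
    increasing stage numbers.
    """
    s = stages(opcodes)
    return all(earlier < later for earlier, later in zip(s, s[1:]))
```

For example, `follows_pipeline_order(["VA", "VGT", "VE"])` holds because the three instructions use stages 1, 2, and 3 in turn, while `["VE", "VA"]` does not.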

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

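A vector computing device comprises a storage unit, a register unit, and a vector operation unit. The storage unit stores vectors, and the register unit stores the addresses at which the vectors are stored. The vector operation unit obtains a vector address from the register unit according to a vector operation instruction, then fetches the corresponding vector from the storage unit according to that address, and then performs the vector operation on the fetched vector to obtain a vector operation result. The device temporarily stores the vector data involved in the computation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during vector operations, improving the execution performance of tasks that contain a large number of vector computations.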

Description

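Vector computing device
Technical Field
The present invention relates to a vector operation device for performing vector operations according to vector operation instructions. It addresses the problem that a growing number of algorithms in the computer field contain large numbers of vector operations.
Background
A growing number of algorithms in the computer field involve vector operations. Taking artificial neural network algorithms as an example, many neural network algorithms contain a large number of vector operations. In a neural network, the output neurons are computed as y = f(wx + b), where w is a matrix and x and b are vectors. The output vector y is computed by multiplying the matrix w by the vector x, adding the vector b, and then applying the activation function to the resulting vector (that is, applying the activation function to each element of the vector). Vector operations have therefore become an important consideration in the initial design of all kinds of computing devices.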
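The expression y = f(wx + b) breaks down into exactly the three vector steps named above. The following is a minimal plain-Python sketch of those steps; the sigmoid used as the default activation is an assumption for illustration, since the text does not say which function f is, and all function names are hypothetical.

```python
import math


def mat_vec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def vec_add(a, b):
    """Add two vectors element by element."""
    return [a_i + b_i for a_i, b_i in zip(a, b)]


def apply_elementwise(f, v):
    """Apply the activation function f to each element of v."""
    return [f(v_i) for v_i in v]


def sigmoid(z):
    # ASSUMPTION: sigmoid stands in for the unspecified activation f.
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(w, x, b, f=sigmoid):
    """Compute y = f(wx + b): matrix-vector product, vector add, activation."""
    return apply_elementwise(f, vec_add(mat_vec(w, x), b))
```

Passing `f=lambda z: z` makes the intermediate vector wx + b visible, which is convenient for checking the first two steps by hand.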
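In the prior art, one known scheme for performing vector operations is to use a general-purpose processor, which executes general-purpose instructions through a general-purpose register file and general-purpose functional units to carry out vector operations. One drawback of this approach is that a single general-purpose processor is designed mostly for scalar computation and performs poorly on vector operations. When multiple general-purpose processors execute in parallel, communication between them may in turn become a performance bottleneck.
In another prior-art scheme, a graphics processing unit (GPU) is used for vector computation: vector operations are performed by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units. In this scheme, however, the GPU on-chip cache is too small, so large-scale vector operations require continual off-chip data transfers, and off-chip bandwidth becomes the main performance bottleneck.
In yet another prior-art scheme, a specially customized vector operation device is used, in which vector operations are performed using a customized register file and customized processing units. However, existing dedicated vector operation devices are limited by their register files and cannot flexibly support vector operations of different lengths.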
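Summary of the Invention
(1) Technical problem to be solved
The object of the present invention is to provide a vector operation device that solves problems in the prior art such as being limited by inter-chip communication, insufficient on-chip cache, and inflexible supported vector lengths.
(2) Technical solution
The present invention provides a vector operation device for performing vector operations according to vector operation instructions, including:
a storage unit for storing vectors;
a register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit;
a vector operation unit for obtaining a vector operation instruction, obtaining a vector address from the register unit according to the instruction, then fetching the corresponding vector from the storage unit according to that address, and then performing the vector operation on the fetched vector to obtain a vector operation result.
(3) Beneficial effects
The vector operation device provided by the present invention temporarily stores the vector data involved in the computation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during vector operations, improving the execution performance of tasks that contain a large number of vector computations. The instructions adopted by the present invention have a compact format, which makes the instruction set easy to use and the supported vector length flexible.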
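Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the vector operation device provided by the present invention.
FIG. 2 is a schematic diagram of the format of the instruction set provided by the present invention.
FIG. 3 is a schematic structural diagram of a vector operation device according to an embodiment of the present invention.
FIG. 4 is a flowchart of a vector dot product instruction executed by a vector operation device according to an embodiment of the present invention.
FIG. 5 is a schematic structural diagram of a vector operation unit according to an embodiment of the present invention.
Detailed Description
The present invention provides a vector computing device comprising a storage unit, a register unit, and a vector operation unit. The storage unit stores vectors, and the register unit stores the addresses at which the vectors are stored. The vector operation unit obtains a vector address from the register unit according to a vector operation instruction, then fetches the corresponding vector from the storage unit according to that address, and then performs the vector operation on the fetched vector to obtain a vector operation result. The invention temporarily stores the vector data involved in the computation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during vector operations, improving the execution performance of tasks that contain a large number of vector computations.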
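FIG. 1 is a schematic structural diagram of the vector operation device provided by the present invention. As shown in FIG. 1, the vector operation device includes:
A storage unit for storing vectors. In one embodiment, the storage unit may be a scratchpad memory capable of supporting vector data of different sizes. The present invention temporarily stores the necessary computation data in the scratchpad memory, so that the device can support data of different widths more flexibly and effectively during vector operations.
A register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit. In one embodiment, the register unit may be a scalar register file that provides the scalar registers required during operation; the scalar registers hold not only vector addresses but also scalar data. For operations involving both a vector and a scalar, the vector operation unit obtains not only the vector address but also the corresponding scalar from the register unit.
A vector operation unit for obtaining a vector operation instruction, obtaining a vector address from the register unit according to the instruction, then fetching the corresponding vector from the storage unit according to that address, then performing the vector operation on the fetched vector to obtain a vector operation result, and storing the result in the storage unit. The vector operation unit includes a vector addition component, a vector multiplication component, a magnitude comparison component, a nonlinear operation component, and a vector-scalar multiplication component, and it is organized as a multi-stage pipeline: the vector addition and vector multiplication components are in the first pipeline stage, the magnitude comparison component is in the second pipeline stage, and the nonlinear operation and vector-scalar multiplication components are in the third pipeline stage. Because these components sit in different pipeline stages, a series of consecutive vector operation instructions can be carried out more efficiently when their order matches the order of the pipeline stages of the corresponding components. The vector operation unit is responsible for all vector operations of the device, including but not limited to vector addition, vector-plus-scalar, vector subtraction, vector-minus-scalar, vector multiplication, vector-times-scalar, vector division (element-wise), vector AND, and vector OR operations; vector operation instructions are sent to this unit for execution.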
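The interplay of the three units (registers hold addresses and scalars, the scratchpad holds vectors of any length, the operation unit reads and writes the scratchpad) can be sketched in software as follows. This is an illustrative model under assumptions, not the patented hardware: the class name, the memory and register-file sizes, and the operand layout of the two methods are all hypothetical.

```python
class VectorDevice:
    """Toy model: a scratchpad memory plus a scalar register file."""

    def __init__(self, scratchpad_size=64, num_scalar_regs=16):
        # ASSUMPTION: sizes are arbitrary; the source gives none.
        self.scratchpad = [0.0] * scratchpad_size
        self.regs = [0] * num_scalar_regs  # hold addresses, lengths, scalars

    def read_vector(self, addr_reg, len_reg):
        """Fetch a vector whose start address and length sit in registers."""
        start, length = self.regs[addr_reg], self.regs[len_reg]
        if start < 0 or start + length > len(self.scratchpad):
            raise IndexError("vector lies outside the scratchpad")
        return self.scratchpad[start:start + length]

    def write_vector(self, addr_reg, values):
        """Write a result vector back to the scratchpad."""
        start = self.regs[addr_reg]
        if start < 0 or start + len(values) > len(self.scratchpad):
            raise IndexError("result lies outside the scratchpad")
        self.scratchpad[start:start + len(values)] = values

    def va(self, a_addr, a_len, b_addr, b_len, out_addr):
        """Vector addition: both inputs and the result live in the scratchpad."""
        a = self.read_vector(a_addr, a_len)
        b = self.read_vector(b_addr, b_len)
        if len(a) != len(b):
            raise ValueError("vector lengths differ")
        self.write_vector(out_addr, [x + y for x, y in zip(a, b)])

    def vas(self, v_addr, v_len, scalar_reg, out_addr):
        """Vector plus scalar: the scalar comes from the register file."""
        v = self.read_vector(v_addr, v_len)
        s = self.regs[scalar_reg]
        self.write_vector(out_addr, [x + s for x in v])
```

Because the length is read from a register at execution time rather than fixed by a vector register width, the same `va` call handles vectors of 3 elements or 30, which is the flexibility the text attributes to the scratchpad design.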
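According to one embodiment of the present invention, the vector operation device further includes an instruction cache unit for storing vector operation instructions to be executed. An instruction remains cached in the instruction cache unit while it executes. When an instruction finishes executing, if it is also the earliest uncommitted instruction in the instruction cache unit, it is committed; once committed, the changes its operations made to the device state cannot be undone. In one embodiment, the instruction cache unit may be a reorder buffer.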
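The commit rule above (an instruction is committed only when it has finished and is the earliest uncommitted one) can be sketched with a queue. The class and method names are hypothetical. The loop that also retires younger instructions which had already finished is the usual reorder-buffer behaviour and is assumed here; the text only states the rule for the instruction that has just finished.

```python
from collections import deque


class ReorderBuffer:
    """Minimal in-order commit model; entries are kept oldest first."""

    def __init__(self):
        self._entries = deque()
        self.committed = []  # names of committed instructions, in order

    def issue(self, name):
        """Cache an instruction in the buffer while it executes."""
        entry = {"name": name, "done": False}
        self._entries.append(entry)
        return entry

    def complete(self, entry):
        """Mark an instruction as finished, then commit from the head."""
        entry["done"] = True
        # Only the earliest uncommitted instruction may commit. Once it does,
        # the next-oldest becomes eligible if it has already finished.
        while self._entries and self._entries[0]["done"]:
            self.committed.append(self._entries.popleft()["name"])
```

If a younger instruction finishes first, nothing is committed until the older one also finishes, at which point both commit in program order.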
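According to one embodiment of the present invention, the vector operation device further includes an instruction processing unit for obtaining vector operation instructions from the instruction cache unit, processing them, and providing them to the vector operation unit. The instruction processing unit includes:
an instruction fetch module for obtaining vector operation instructions from the instruction cache unit;
a decoding module for decoding the obtained vector operation instructions;
an instruction queue for storing the decoded vector operation instructions in order. Because different instructions may depend on one another through the registers they use, the queue buffers decoded instructions and issues each one once its dependencies are satisfied.
According to one embodiment of the present invention, the vector operation device further includes a dependency processing unit that, before the vector operation unit obtains a vector operation instruction, determines whether that instruction accesses the same vector as the preceding vector operation instruction. If so, the instruction is placed in a storage queue and provided to the vector operation unit after the preceding instruction has finished executing; otherwise, it is provided to the vector operation unit directly. Specifically, when vector operation instructions access the scratchpad memory, consecutive instructions may access the same region of storage. To guarantee correct results, an instruction that is found to have a data dependency on an earlier instruction must wait in the storage queue until the dependency is resolved.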
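One way to picture the dependency check is as an overlap test on scratchpad address ranges. The sketch below assumes each instruction is summarized by the (start, length) regions it touches, with positive lengths; this representation and the function names are illustrative assumptions, since the text only says that the unit detects accesses to the same vector or storage region.

```python
def ranges_overlap(start_a, len_a, start_b, len_b):
    """True if [start_a, start_a+len_a) and [start_b, start_b+len_b) intersect.

    ASSUMPTION: lengths are positive.
    """
    return start_a < start_b + len_b and start_b < start_a + len_a


def conflicts(current, previous):
    """True if any region of `current` overlaps any region of `previous`."""
    return any(
        ranges_overlap(c_start, c_len, p_start, p_len)
        for c_start, c_len in current
        for p_start, p_len in previous
    )


def route(current, previous_unfinished):
    """Decide where an instruction goes before reaching the vector unit.

    `current` is a list of (start, length) regions; `previous_unfinished`
    is a list of such lists, one per earlier instruction still executing.
    """
    if any(conflicts(current, prev) for prev in previous_unfinished):
        return "store_queue"  # wait until the dependency is resolved
    return "vector_unit"      # no dependency: issue directly
```

An instruction touching addresses 0 to 3 would wait behind an unfinished instruction touching addresses 2 to 5, but would issue directly if the earlier instruction only touched addresses 8 to 11.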
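According to one embodiment of the present invention, the vector operation device further includes an input/output unit, configured to store vectors in the storage unit or to obtain vector operation results from the storage unit. The input/output unit can access the storage unit directly, and is responsible for reading vector data from memory or writing vector data to memory.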
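According to one embodiment of the present invention, the instruction set used by the device adopts a Load/Store architecture: the vector operation unit does not operate on data in memory. The instruction set follows a reduced instruction set design. It provides only the most basic vector operations, and complex vector operations are emulated by combining these simple instructions, so that instructions can execute in a single cycle at a high clock frequency. In addition, the instruction set uses fixed-length instructions, so that the vector operation device proposed by the present invention can fetch the next instruction during the decoding stage of the previous instruction.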
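As a hedged illustration of emulating a complex operation with simple ones, the sketch below builds a sigmoid from four element-wise primitives. The helper names borrow mnemonics from the instruction list that follows (SSV, VE, VAS, SDV), but the Python functions themselves and the choice of sigmoid as the example are assumptions made for illustration only.

```python
import math


def ssv(scalar, vec):
    """Scalar minus vector: scalar - x for each element."""
    return [scalar - x for x in vec]


def ve(vec):
    """Element-wise exponential."""
    return [math.exp(x) for x in vec]


def vas(vec, scalar):
    """Vector plus scalar."""
    return [x + scalar for x in vec]


def sdv(scalar, vec):
    """Scalar divided by vector: scalar / x for each element."""
    return [scalar / x for x in vec]


def sigmoid(vec):
    neg = ssv(0.0, vec)      # -x
    e = ve(neg)              # exp(-x)
    denom = vas(e, 1.0)      # 1 + exp(-x)
    return sdv(1.0, denom)   # 1 / (1 + exp(-x))


values = sigmoid([0.0, 2.0])
```

Each step maps to one simple instruction, so the composite needs no dedicated hardware beyond the basic components.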
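FIG. 2 is a schematic diagram of the format of the instruction set provided by the present invention. As shown in FIG. 2, a vector operation instruction includes an opcode and at least one operation field. The opcode indicates the function of the vector operation instruction, and the vector operation unit performs different vector operations by identifying the opcode. The operation field indicates the data information of the vector operation instruction, which may be an immediate value or a register number. For example, to obtain a vector, the vector start address and vector length are read from the registers identified by the register numbers, and the vector stored at that address is then fetched from the storage unit according to the start address and length.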
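The format can be sketched as a fixed-length word. The text does not give field widths, so the 32-bit layout below (an 8-bit opcode followed by four 6-bit operation fields), the opcode value `0x01`, and the helper names are all hypothetical.

```python
OPCODE_BITS = 8   # hypothetical widths: 8 + 4 * 6 = 32 bits
FIELD_BITS = 6
NUM_FIELDS = 4


def encode(opcode, fields):
    """Pack an opcode and four operation fields into one fixed-length word."""
    if len(fields) != NUM_FIELDS:
        raise ValueError("expected exactly four operation fields")
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode out of range")
    word = opcode
    for f in fields:
        if not 0 <= f < (1 << FIELD_BITS):
            raise ValueError("field out of range")
        word = (word << FIELD_BITS) | f
    return word


def decode(word):
    """Split a word back into (opcode, [field0, field1, field2, field3])."""
    fields = []
    for _ in range(NUM_FIELDS):
        fields.append(word & ((1 << FIELD_BITS) - 1))
        word >>= FIELD_BITS
    return word, fields[::-1]


def fetch_vector(scratchpad, registers, addr_reg, len_reg):
    """Resolve register numbers to a start address and length, then read."""
    start, length = registers[addr_reg], registers[len_reg]
    return scratchpad[start:start + length]


word = encode(0x01, [2, 3, 0, 0])      # fields 0 and 1 are register numbers
opcode, fields = decode(word)
registers = [0, 0, 4, 3]               # r2 = start address 4, r3 = length 3
scratchpad = [0, 0, 0, 0, 7, 8, 9, 0]
vector = fetch_vector(scratchpad, registers, fields[0], fields[1])
```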
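The instruction set contains vector operation instructions with different functions:
Vector Addition instruction (VA). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, adds them in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector-Add-Scalar instruction (VAS). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory and scalar data from a specified address of the scalar register file, adds the scalar value to every element of the vector in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector Subtraction instruction (VS). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, subtracts them in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Scalar-Subtract-Vector instruction (SSV). According to this instruction, the device fetches scalar data from a specified address of the scalar register file and vector data from a specified address of the scratchpad memory, subtracts each element of the vector from the scalar in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector Multiplication instruction (VMV). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, multiplies the two vectors element by element in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector-Multiply-Scalar instruction (VMS). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory and scalar data of a specified size from a specified address of the scalar register file, multiplies the vector by the scalar in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector Division instruction (VD). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, divides the two vectors element by element in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Scalar-Divide-Vector instruction (SDV). According to this instruction, the device fetches scalar data from a specified location of the scalar register file and vector data of a specified size from a specified location of the scratchpad memory, divides the scalar by each element of the vector in the vector operation unit, and writes the result back to a specified location of the scratchpad memory;
Vector-AND-Vector instruction (VAV). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, ANDs the two vectors position by position in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;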
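Vector-internal AND instruction (VAND). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, ANDs together every bit of the vector in the vector operation unit, and writes the result back to a specified address of the scalar register file;
Vector-OR-Vector instruction (VOV). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, ORs the two vectors position by position in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector-internal OR instruction (VOR). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, ORs together every bit of the vector in the vector operation unit, and writes the result back to a specified address of the scalar register file;
Vector Exponent instruction (VE). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, applies the exponential operation to every element of the vector in the vector operation unit, and writes the result back to a specified address of the scalar register file;
Vector Logarithm instruction (VL). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, applies the logarithm operation to every element of the vector in the vector operation unit, and writes the result back to a specified address of the scalar register file;
Vector Greater-Than instruction (VGT). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory and compares the two vectors position by position in the vector operation unit. Where the former is greater than the latter, the corresponding position of the output vector is set to 1, otherwise it is set to 0. The result is written back to a specified address of the scratchpad memory;
Vector Equal instruction (VEQ). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory and compares the two vectors position by position in the vector operation unit. Where the former equals the latter, the corresponding position of the output vector is set to 1, otherwise it is set to 0. The result is written back to a specified address of the scratchpad memory;
Vector NOT instruction (VINV). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, inverts the vector bit by bit, and stores the result back to a specified address of the scratchpad memory;
Vector Select-Merge instruction (VMER). According to this instruction, the device fetches vector data of specified sizes from specified addresses of the scratchpad memory, comprising a selection vector, selected vector one, and selected vector two. Depending on whether an element of the selection vector is 1 or 0, the vector operation unit takes the corresponding element from selected vector one or selected vector two as the element at that position of the output vector, and writes the result back to a specified location of the scratchpad memory;
Vector Maximum instruction (VMAX). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, selects its largest element as the result, and writes the result back to a specified address of the scalar register file;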
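Scalar Extension instruction (STV). According to this instruction, the device fetches scalar data from a specified address of the scalar register file, expands the scalar into a vector of a specified length in the vector operation unit, and writes the result back to the scalar register file;
Scalar-Replace-Vector instruction (STVPN). According to this instruction, the device fetches a scalar from a specified address of the scalar register file and vector data of a specified size from a specified address of the scratchpad memory, replaces the element at a specified position of the vector with the scalar value in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector-Replace-Scalar instruction (VPNTS). According to this instruction, the device fetches a scalar from a specified address of the scalar register file and vector data of a specified size from a specified address of the scratchpad memory, replaces the scalar value with the element at a specified position of the vector in the vector operation unit, and writes the result back to a specified address of the scalar register file;
Vector Retrieval instruction (VR). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, takes the element at a specified position of the vector as the output in the vector operation unit, and writes the result back to a specified address of the scalar register file;
Vector Dot Product instruction (VP). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, computes the dot product of the two vectors in the vector operation unit, and writes the result back to a specified address of the scalar register file;
Random Vector instruction (RV). According to this instruction, the device generates, in the vector operation unit, a random vector uniformly distributed over the range 0 to 1, and writes the result back to a specified address of the scratchpad memory;
Cyclic Shift instruction (VCS). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, cyclically shifts the vector by a specified step in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector Load instruction (VLOAD). According to this instruction, the device loads vector data of a specified size from a specified external source address into a specified address of the scratchpad memory;
Vector Store instruction (VS). According to this instruction, the device stores vector data of a specified size at a specified address of the scratchpad memory to an external destination address;
Vector Move instruction (VMOVE). According to this instruction, the device stores vector data of a specified size at a specified address of the scratchpad memory to another specified address of the scratchpad memory.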
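The data semantics of a few of these instructions can be sketched as plain Python functions on lists. This is an illustrative reading, not the device's implementation: memory addressing is left out, the sketch reads VMER as taking selected vector one where the selection element is 1 and selected vector two where it is 0, and the rotation direction in `vcs` is an assumption because the text does not state one.

```python
def vgt(a, b):
    """Vector greater-than: 1 where a[i] > b[i], otherwise 0."""
    return [1 if x > y else 0 for x, y in zip(a, b)]


def vmer(select, one, two):
    """Select-merge: take from `one` where select is 1, from `two` where 0."""
    return [x if s == 1 else y for s, x, y in zip(select, one, two)]


def vcs(vec, step):
    """Cyclic shift by `step` positions (toward higher indices, by assumption)."""
    if not vec:
        return []
    k = step % len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)


def vmax(vec):
    """Vector maximum: the largest element, a scalar result."""
    return max(vec)


a, b = [1, 5, 3], [4, 2, 3]
mask = vgt(a, b)
elementwise_max = vmer(mask, a, b)   # VGT followed by VMER
shifted = vcs([1, 2, 3, 4], 1)
largest = vmax(a)
```

Chaining VGT into VMER yields an element-wise maximum of two vectors, an example of how the simple instructions combine.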
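To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
FIG. 3 is a schematic structural diagram of the vector operation device provided by an embodiment of the present invention. As shown in FIG. 3, the device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a store queue, a reorder buffer, a vector operation unit, a scratchpad memory, and an IO memory access module;
Instruction fetch module: this module is responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
Decoding module: this module is responsible for decoding the instruction and passing the decoded instruction to the instruction queue;
Instruction queue: because different instructions may depend on one another through the scalar registers they use, this queue buffers the decoded instructions and issues an instruction once its dependencies are satisfied;
Scalar register file: provides the scalar registers the device needs during operation;
Dependency processing unit: this module handles storage dependencies that may exist between an instruction and the previous instruction. Vector operation instructions access the scratchpad memory, and consecutive instructions may access the same block of storage. To guarantee the correctness of the results, if the current instruction is found to have a data dependency on an earlier instruction, it must wait in the store queue until the dependency is resolved.
Store queue: this module is an ordered queue. Instructions that have a data dependency on earlier instructions are held in this queue until the dependency is resolved;
Reorder buffer: while an instruction is being executed it is also held in this module. When an instruction finishes executing, if it is also the earliest of the uncommitted instructions in the reorder buffer, the instruction is committed. Once committed, the changes that the instruction's operations made to the device state cannot be undone;
Vector operation unit: this module is responsible for all vector operations of the device, including but not limited to vector addition, vector-plus-scalar, vector subtraction, vector-minus-scalar, vector multiplication, vector-times-scalar, vector division (element-wise), vector AND, and vector OR operations. Vector operation instructions are sent to this unit for execution;
Scratchpad memory: this module is a temporary storage device dedicated to vector data and can support vector data of different sizes;
IO memory access module: this module accesses the scratchpad memory directly and is responsible for reading data from or writing data to the scratchpad memory.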
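FIG. 4 is a flowchart of the vector operation device provided by an embodiment of the present invention executing a vector dot product instruction. As shown in FIG. 4, the process of executing the vector dot product instruction (VP) includes:
S1: The instruction fetch module fetches the vector dot product instruction and sends it to the decoding module.
S2: The decoding module decodes the instruction and sends it to the instruction queue.
S3: In the instruction queue, the vector dot product instruction obtains from the scalar register file the data held in the scalar registers named by its four operation fields: the start address of vector vin0, the length of vector vin0, the start address of vector vin1, and the length of vector vin1.
S4: After the required scalar data has been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not finished executing. If it does, the instruction waits in the store queue until it no longer has a data dependency on any unfinished earlier instruction.
S5: Once no dependency remains, the vector dot product instruction is sent to the vector operation unit. The vector operation unit fetches the required vectors from the scratchpad memory according to their addresses and lengths, and then completes the dot product operation.
S6: After the operation is complete, the result is written back to the specified address of the scratchpad memory, and the vector dot product instruction in the reorder buffer is committed.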
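Steps S1 to S6 can be traced with a short sketch. The function `run_vp`, its arguments, and the trace strings are hypothetical. Because the four operation fields in S3 name only the inputs, the sketch returns the dot product to the caller and does not model the write-back destination.

```python
def run_vp(fields, registers, scratchpad, unfinished_ranges=()):
    """Walk one VP instruction through S1..S6 and return (result, trace).

    fields: four register numbers (vin0 address, vin0 length,
            vin1 address, vin1 length).
    unfinished_ranges: (start, length) ranges used by earlier instructions
            that have not finished executing.
    """
    trace = ["S1 fetch", "S2 decode"]

    # S3: read the four scalar registers named by the operation fields.
    a0, n0, a1, n1 = (registers[f] for f in fields)
    trace.append("S3 read scalar registers")

    # S4: check for a data dependency on unfinished earlier instructions.
    def overlap(s, l, qs, ql):
        return s < qs + ql and qs < s + l

    blocked = any(overlap(s, l, qs, ql)
                  for s, l in ((a0, n0), (a1, n1))
                  for qs, ql in unfinished_ranges)
    if blocked:
        trace.append("S4 wait in store queue")
        return None, trace
    trace.append("S4 no dependency")

    # S5: fetch both vectors from the scratchpad and compute the dot product.
    v0 = scratchpad[a0:a0 + n0]
    v1 = scratchpad[a1:a1 + n1]
    result = sum(x * y for x, y in zip(v0, v1))
    trace.append("S5 dot product")

    # S6: write back and commit (destination not modeled in this sketch).
    trace.append("S6 write back and commit")
    return result, trace


registers = [0, 3, 4, 3]                 # vin0 at 0, vin1 at 4, both length 3
scratchpad = [1, 2, 3, 0, 4, 5, 6, 0]
result, trace = run_vp([0, 1, 2, 3], registers, scratchpad)
waiting, waiting_trace = run_vp([0, 1, 2, 3], registers, scratchpad,
                                unfinished_ranges=[(4, 2)])
```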
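FIG. 5 is a schematic structural diagram of the vector operation unit provided by an embodiment of the present invention. As shown in FIG. 5, the vector operation unit contains a vector addition component, a vector multiplication component, a magnitude comparison component, a nonlinear operation component, a vector-scalar multiplication component, and so on. The vector operation unit has a multi-stage pipeline structure, in which the vector addition component and the vector multiplication component are in pipeline stage 1, the magnitude comparison component is in pipeline stage 2, and the nonlinear operation component and the vector-scalar multiplication component are in pipeline stage 3. Because these components sit in different pipeline stages, when the order of several consecutive vector operation instructions matches the order of the pipeline stages of the corresponding components, the operations required by that series of instructions can be carried out more efficiently.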
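One way to read the ordering condition is sketched below. The mapping from mnemonics to pipeline stages is an assumption inferred from which component each instruction would plausibly use, and treating "matches the stage order" as strictly increasing stage numbers is one interpretation rather than something the text defines.

```python
# Hypothetical mapping of instruction mnemonics to the stage of the
# component they would use (adder and multiplier in stage 1, comparator in
# stage 2, nonlinear and vector-scalar multiplier in stage 3).
STAGE_OF = {"VA": 1, "VMV": 1, "VGT": 2, "VE": 3, "VMS": 3}


def follows_stage_order(mnemonics):
    """True if each instruction uses a later stage than the one before it."""
    stages = [STAGE_OF[m] for m in mnemonics]
    return all(a < b for a, b in zip(stages, stages[1:]))


in_order = follows_stage_order(["VMV", "VGT", "VE"])   # stages 1, 2, 3
out_of_order = follows_stage_order(["VE", "VA"])       # stages 3, 1
```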
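The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.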

Claims (12)

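  1. A vector operation device for performing vector operations according to vector operation instructions, characterized by comprising:
    a storage unit, configured to store vectors;
    a register unit, configured to store vector addresses, wherein a vector address is the address at which a vector is stored in the storage unit;
    a vector operation unit, configured to obtain a vector operation instruction, obtain a vector address from the register unit according to the vector operation instruction, then obtain the corresponding vector from the storage unit according to that vector address, and then perform a vector operation on the obtained vector to obtain a vector operation result.
  2. The vector operation device according to claim 1, characterized by further comprising:
    an instruction cache unit, configured to store vector operation instructions to be executed.
  3. The vector operation device according to claim 2, characterized by further comprising:
    an instruction processing unit, configured to obtain a vector operation instruction from the instruction cache unit, process the vector operation instruction, and provide it to the vector operation unit.
  4. The vector operation device according to claim 3, characterized in that the instruction processing unit comprises:
    an instruction fetch module, configured to obtain vector operation instructions from the instruction cache unit;
    a decoding module, configured to decode the obtained vector operation instructions;
    an instruction queue, configured to store the decoded vector operation instructions in order.
  5. The vector operation device according to claim 1, characterized by further comprising:
    a dependency processing unit, configured to determine, before the vector operation unit obtains a vector operation instruction, whether the vector operation instruction accesses the same vector as the previous vector operation instruction; if so, to wait until the previous vector operation instruction has finished executing and then provide the vector operation instruction to the vector operation unit; otherwise, to provide the vector operation instruction to the vector operation unit directly.
  6. The vector operation device according to claim 5, characterized in that, when the vector operation instruction accesses the same vector as the previous vector operation instruction, the dependency processing unit stores the vector operation instruction in a store queue and, after the previous vector operation instruction has finished executing, provides the vector operation instruction in the store queue to the vector operation unit.
  7. The vector operation device according to claim 1, characterized in that the storage unit is further configured to store the vector operation result.
  8. The vector operation device according to claim 6, characterized by further comprising:
    an input/output unit, configured to store vectors in the storage unit or to obtain vector operation results from the storage unit.
  9. The vector operation device according to claim 6, characterized in that the storage unit is a scratchpad memory.
  10. The vector operation device according to claim 1, characterized in that the vector operation instruction comprises an opcode and at least one operation field, wherein the opcode indicates the function of the vector operation instruction and the operation field indicates the data information of the vector operation instruction.
  11. The vector operation device according to claim 1, characterized in that the vector operation unit comprises a vector addition component, a vector multiplication component, a magnitude comparison component, a nonlinear operation component, and a vector-scalar multiplication component.
  12. The vector operation device according to claim 11, characterized in that the vector operation unit has a multi-stage pipeline structure, wherein the vector addition component and the vector multiplication component are in the first pipeline stage, the magnitude comparison component is in the second pipeline stage, and the nonlinear operation component and the vector-scalar multiplication component are in the third pipeline stage.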
PCT/CN2016/078550 2016-01-20 2016-04-06 一种向量计算装置 Ceased WO2017124648A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
KR1020197017258A KR102185287B1 (ko) 2016-01-20 2016-04-06 벡터 연산 장치
EP16885912.2A EP3407182B1 (en) 2016-01-20 2016-04-06 Vector computing device
KR1020187015435A KR20180100550A (ko) 2016-01-20 2016-04-06 벡터 연산 장치 및 방법
KR1020207013258A KR102304216B1 (ko) 2016-01-20 2016-04-06 벡터 계산 장치
US16/039,803 US10762164B2 (en) 2016-01-20 2018-07-19 Vector and matrix computing device
US16/942,482 US11734383B2 (en) 2016-01-20 2020-07-29 Vector and matrix computing device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610039216.8A CN106990940B (zh) 2016-01-20 2016-01-20 一种向量计算装置及运算方法
CN201610039216.8 2016-01-20

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/078546 Continuation-In-Part WO2017124647A1 (zh) 2016-01-20 2016-04-06 一种矩阵计算装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/039,803 Continuation-In-Part US10762164B2 (en) 2016-01-20 2018-07-19 Vector and matrix computing device

Publications (1)

Publication Number Publication Date
WO2017124648A1 true WO2017124648A1 (zh) 2017-07-27

Family

ID=59361389

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/078550 Ceased WO2017124648A1 (zh) 2016-01-20 2016-04-06 一种向量计算装置

Country Status (4)

Country Link
EP (1) EP3407182B1 (zh)
KR (3) KR20180100550A (zh)
CN (5) CN111580865B (zh)
WO (1) WO2017124648A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062607A (zh) * 2017-10-30 2018-12-21 上海寒武纪信息科技有限公司 机器学习处理器及使用处理器执行向量最小值指令的方法
CN109754061A (zh) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 卷积扩展指令的执行方法以及相关产品
CN113874836A (zh) * 2019-05-20 2021-12-31 美光科技公司 真/假向量索引寄存器
WO2022134729A1 (zh) * 2020-12-24 2022-06-30 苏州浪潮智能科技有限公司 一种基于risc-v的人工智能推理方法和系统
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US12505069B2 (en) 2019-05-20 2025-12-23 Micron Technology, Inc. Registers in a vector processor that store addresses for accessing vectors

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062B (zh) * 2017-11-07 2024-05-14 上海寒武纪信息科技有限公司 卷积扩展指令的执行方法以及相关产品
CN107861757B (zh) * 2017-11-30 2020-08-25 上海寒武纪信息科技有限公司 运算装置以及相关产品
CN107957976B (zh) * 2017-12-15 2020-12-18 安徽寒武纪信息科技有限公司 一种计算方法及相关产品
CN108037908B (zh) * 2017-12-15 2021-02-09 中科寒武纪科技股份有限公司 一种计算方法及相关产品
CN108388446A (zh) * 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 运算模块以及方法
CN110941789B (zh) * 2018-09-21 2023-12-15 北京地平线机器人技术研发有限公司 张量运算方法和装置
CN111290789B (zh) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111290788B (zh) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111079913B (zh) * 2018-10-19 2021-02-05 中科寒武纪科技股份有限公司 运算方法、装置及相关产品
CN111078282B (zh) * 2018-10-19 2020-12-22 安徽寒武纪信息科技有限公司 运算方法、装置及相关产品
CN111078281B (zh) * 2018-10-19 2021-02-12 中科寒武纪科技股份有限公司 运算方法、系统及相关产品
CN111079909B (zh) * 2018-10-19 2021-01-26 安徽寒武纪信息科技有限公司 运算方法、系统及相关产品
CN111399905B (zh) * 2019-01-02 2022-08-16 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
CN110502278B (zh) * 2019-07-24 2021-07-16 瑞芯微电子股份有限公司 基于RiscV扩展指令的神经网络协处理器及其协处理方法
US11494734B2 (en) * 2019-09-11 2022-11-08 Ila Design Group Llc Automatically determining inventory items that meet selection criteria in a high-dimensionality inventory dataset
CN119045894B (zh) * 2024-07-29 2025-12-12 武汉大学 多标量乘法运算方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110138155A1 (en) * 2009-12-04 2011-06-09 Eiichiro Kawaguchi Vector computer and instruction control method therefor
US8108652B1 (en) * 2007-09-13 2012-01-31 Ronald Chi-Chun Hui Vector processing with high execution throughput
CN102495721A (zh) * 2011-12-02 2012-06-13 南京大学 一种支持fft加速的simd向量处理器
CN103699360A (zh) * 2012-09-27 2014-04-02 北京中科晶上科技有限公司 一种向量处理器及其进行向量数据存取、交互的方法
CN104699458A (zh) * 2015-03-30 2015-06-10 哈尔滨工业大学 定点向量处理器及其向量数据访存控制方法

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4541046A (en) * 1981-03-25 1985-09-10 Hitachi, Ltd. Data processing system including scalar data processor and vector data processor
JP3317985B2 (ja) * 1991-11-20 2002-08-26 喜三郎 中澤 擬似ベクトルプロセッサ
US5669013A (en) * 1993-10-05 1997-09-16 Fujitsu Limited System for transferring M elements X times and transferring N elements one time for an array that is X*M+N long responsive to vector type instructions
US5689653A (en) * 1995-02-06 1997-11-18 Hewlett-Packard Company Vector memory operations
JPH09325888A (ja) * 1996-06-04 1997-12-16 Hitachi Ltd データ処理装置
US6665790B1 (en) * 2000-02-29 2003-12-16 International Business Machines Corporation Vector register file with arbitrary vector addressing
CN1142484C (zh) * 2001-11-28 2004-03-17 中国人民解放军国防科学技术大学 微处理器向量处理方法
US20130212353A1 (en) * 2002-02-04 2013-08-15 Tibet MIMAR System for implementing vector look-up table operations in a SIMD processor
JP4349265B2 (ja) * 2004-11-22 2009-10-21 ソニー株式会社 プロセッサ
JP4282682B2 (ja) * 2006-04-12 2009-06-24 エヌイーシーコンピュータテクノ株式会社 情報処理装置及びそれに用いるベクトルレジスタアドレス制御方法
CN101833441B (zh) * 2010-04-28 2013-02-13 中国科学院自动化研究所 并行向量处理引擎结构
CN101876892B (zh) * 2010-05-20 2013-07-31 复旦大学 面向通信和多媒体应用的单指令多数据处理器电路结构
JP5658945B2 (ja) * 2010-08-24 2015-01-28 オリンパス株式会社 画像処理装置、画像処理装置の作動方法、および画像処理プログラム
CN102043723B (zh) * 2011-01-06 2012-08-22 中国人民解放军国防科学技术大学 用于通用流处理器的可变访存模式的片上缓存结构
CN102109978A (zh) * 2011-02-28 2011-06-29 孙瑞琛 一种数据的重排方法及重排装置
CN102156637A (zh) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 向量交叉多线程处理方法及向量交叉多线程微处理器
CN102200964B (zh) * 2011-06-17 2013-05-15 孙瑞琛 基于并行处理的fft装置及其方法
CN102262525B (zh) * 2011-08-29 2014-11-19 孙瑞玮 基于矢量运算的矢量浮点运算装置及方法
CN102750133B (zh) * 2012-06-20 2014-07-30 中国电子科技集团公司第五十八研究所 支持simd的32位三发射的数字信号处理器
US9098265B2 (en) * 2012-07-11 2015-08-04 Arm Limited Controlling an order for processing data elements during vector processing
US9632781B2 (en) * 2013-02-26 2017-04-25 Qualcomm Incorporated Vector register addressing and functions based on a scalar register data value
US9639503B2 (en) * 2013-03-15 2017-05-02 Qualcomm Incorporated Vector indirect element vertical addressing mode with horizontal permute
KR20150005062A (ko) * 2013-07-04 2015-01-14 삼성전자주식회사 미니-코어를 사용하는 프로세서
US10120682B2 (en) * 2014-02-28 2018-11-06 International Business Machines Corporation Virtualization in a bi-endian-mode processor architecture
CN104699465B (zh) * 2015-03-26 2017-05-24 中国人民解放军国防科学技术大学 向量处理器中支持simt的向量访存装置和控制方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3407182A4 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109062607B (zh) * 2017-10-30 2021-09-21 上海寒武纪信息科技有限公司 机器学习处理器及使用处理器执行向量最小值指令的方法
US12461711B2 (en) 2017-10-30 2025-11-04 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109062607A (zh) * 2017-10-30 2018-12-21 上海寒武纪信息科技有限公司 机器学习处理器及使用处理器执行向量最小值指令的方法
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12050887B2 (en) 2017-10-30 2024-07-30 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109754061A (zh) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 卷积扩展指令的执行方法以及相关产品
CN109754061B (zh) * 2017-11-07 2023-11-24 上海寒武纪信息科技有限公司 卷积扩展指令的执行方法以及相关产品
US12057109B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12057110B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Voice recognition based on neural networks
US12094456B2 (en) 2018-09-13 2024-09-17 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and system
CN113874836A (zh) * 2019-05-20 2021-12-31 美光科技公司 真/假向量索引寄存器
US12505069B2 (en) 2019-05-20 2025-12-23 Micron Technology, Inc. Registers in a vector processor that store addresses for accessing vectors
US11880684B2 (en) 2020-12-24 2024-01-23 Inspur Suzhou Intelligent Technology Co., Ltd. RISC-V-based artificial intelligence inference method and system
WO2022134729A1 (zh) * 2020-12-24 2022-06-30 苏州浪潮智能科技有限公司 一种基于risc-v的人工智能推理方法和系统

Also Published As

Publication number Publication date
CN111580864A (zh) 2020-08-25
CN111580866B (zh) 2024-05-07
KR20180100550A (ko) 2018-09-11
CN111580865A (zh) 2020-08-25
CN111580863A (zh) 2020-08-25
CN106990940A (zh) 2017-07-28
CN111580863B (zh) 2024-05-03
KR20200058562A (ko) 2020-05-27
EP3407182B1 (en) 2022-10-26
CN106990940B (zh) 2020-05-22
KR102185287B1 (ko) 2020-12-01
KR102304216B1 (ko) 2021-09-23
CN111580864B (zh) 2024-05-07
CN111580865B (zh) 2024-02-27
EP3407182A4 (en) 2020-04-29
KR20190073593A (ko) 2019-06-26
EP3407182A1 (en) 2018-11-28
CN111580866A (zh) 2020-08-25

Similar Documents

Publication Publication Date Title
WO2017124648A1 (zh) 一种向量计算装置
CN109522254B (zh) 运算装置及方法
KR102123633B1 (ko) 행렬 연산 장치 및 방법
CN107315717B (zh) 一种用于执行向量四则运算的装置和方法
CN107315718B (zh) 一种用于执行向量内积运算的装置和方法
WO2017185389A1 (zh) 一种用于执行矩阵乘运算的装置和方法
CN107315715A (zh) 一种用于执行矩阵加/减运算的装置和方法
CN107315716B (zh) 一种用于执行向量外积运算的装置和方法
WO2017185404A1 (zh) 一种用于执行向量逻辑运算的装置及方法
WO2017185419A1 (zh) 一种用于执行向量最大值最小值运算的装置和方法
EP3447690A1 (en) Maxout layer operation apparatus and method
CN107305486A (zh) 一种神经网络maxout层计算装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16885912

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20187015435

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE