WO2023241418A1 - Processor, and method, device, and storage medium for data processing - Google Patents

Processor, and method, device, and storage medium for data processing

Info

Publication number
WO2023241418A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
instruction
memory
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/098716
Other languages
English (en)
French (fr)
Other versions
WO2023241418A9 (zh
Inventor
曹宇辉
施云峰
王剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to JP2024573128A priority Critical patent/JP2025519635A/ja
Priority to KR1020247041480A priority patent/KR20250008533A/ko
Priority to EP23822987.6A priority patent/EP4524729A4/en
Publication of WO2023241418A1 publication Critical patent/WO2023241418A1/zh
Publication of WO2023241418A9 publication Critical patent/WO2023241418A9/zh
Priority to US18/979,402 priority patent/US12461747B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/3016Decoding the operand specifier, e.g. specifier format
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/355Indexed addressing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30018Bit or string instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30029Logical and Boolean instructions, e.g. XOR, NOT
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30038Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/345Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • Example embodiments of the present disclosure relate generally to the field of computers, and in particular to processors and methods, devices, and computer-readable storage media for data processing.
  • Processors can be used in a variety of scenarios.
  • Different instruction set architectures (ISAs) have been proposed for processors.
  • These instruction set architectures often need to be compatible with a variety of usage scenarios.
  • In a first aspect of the present disclosure, a processor is provided. The processor includes an instruction decoder configured to decode target instructions for vector operations.
  • Target instructions involve target opcodes, source operands, and destination operands.
  • the target opcode indicates the vector operation specified by the target instruction.
  • the source operand specifies the source storage location in memory from which to read the data to be processed.
  • the destination operand specifies the destination storage location in memory to which the processing results are written.
  • the processor also includes an arithmetic logic unit coupled to the instruction decoder and the memory.
  • The arithmetic logic unit is configured to: read the data to be processed from the source storage location of the memory; perform the arithmetic logic operation associated with the vector operation specified by the target instruction on the data to be processed; and write the processing result to the target storage location of the memory.
  • In a second aspect of the present disclosure, a method for data processing is provided. The method includes decoding target instructions for vector operations.
  • Target instructions involve target opcodes, source operands, and destination operands.
  • the target opcode indicates the vector operation specified by the target instruction.
  • the source operand specifies the source storage location in memory from which to read the data to be processed.
  • the destination operand specifies the destination storage location in memory to which the processing results are written.
  • the method also includes reading the data to be processed from a source storage location in the memory; performing an arithmetic and logical operation on the data to be processed associated with the vector operation specified by the target instruction; and writing the processing result to the target storage location in the memory.
  • In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes at least a processor according to the first aspect.
  • In a further aspect of the present disclosure, a computer-readable storage medium is provided.
  • a computer program is stored on the computer-readable storage medium, and the computer program can be executed by a processor to implement the method of the second aspect.
  • FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented
  • Figure 2 shows a schematic diagram of example instructions in accordance with some embodiments of the present disclosure
  • Figure 3 shows a schematic diagram of storage locations corresponding to example source operands according to some embodiments of the present disclosure
  • FIG. 4 illustrates a flow diagram of a process for data processing in accordance with some embodiments of the present disclosure.
  • FIG. 5 illustrates a block diagram of an electronic device in which a processor may be included in accordance with one or more embodiments of the present disclosure.
  • processors can be applied to a variety of scenarios.
  • different instruction set architectures that can be adopted by processors have been proposed. These instruction set architectures often need to be compatible with a variety of usage scenarios.
  • the usage scenarios of these conventional instruction set architectures are not consistent with the usage scenarios of vector computing such as neural network computing. Therefore, for some vector calculations with high instruction repetition and large data volume, a better instruction set architecture is needed to enable the processor to better handle such vector calculations.
  • a conventional solution is to use a standard processor instruction set, such as the reduced instruction set computer (RISC)-V instruction set.
  • Although these general instruction sets can complete various vector calculations, such as various neural network operators, it is difficult to ensure high execution efficiency because these general instruction sets need to be compatible with a variety of usage scenarios.
  • The calculation of neural network operators usually involves a large number of vector calculations, for which general instruction sets are not well suited.
  • a conventional solution may use a digital signal processor (DSP) architecture, such as a single instruction multiple data (SIMD) architecture, or may use a vector processor (Vector) architecture.
  • the instruction sets of the above-mentioned DSP architectures are usually not publicly available.
  • As for vector processor architectures, such as the vector instruction set under the RISC-V standard (referred to as the RISC-V vector instruction set), these instruction sets are usually highly complex and appear redundant for vector calculations such as neural network operators.
  • the processor includes an instruction decoder and an arithmetic logic unit.
  • The instruction decoder is used to receive and decode target instructions for processing vector operations.
  • The target instruction is suitable for a memory-to-memory (MEM to MEM) processor architecture.
  • the target instruction involves a target opcode, a source operand, and a destination operand.
  • the target opcode indicates the vector operation specified by the target instruction.
  • the source operand specifies at least the source storage location in the memory for reading the data to be processed.
  • The target operand specifies at least the target storage location in the memory for writing the processing result.
  • the processor's arithmetic logic unit is coupled to the instruction decoder and memory.
  • The arithmetic logic unit is configured to perform the vector operation of the target instruction based on the decoding information of the target instruction produced by the instruction decoder. For example, the arithmetic logic unit is configured to read the data to be processed from the source storage location of the memory; perform an arithmetic logic operation associated with the vector operation specified by the target instruction on the data to be processed; and write the processing result of the data to be processed to the target storage location of the memory.
  • This solution simplifies processor operation by using a processor suitable for memory-to-memory architecture.
  • the processor is able to perform a large number of vector calculations using a simple instruction set.
  • the processor can perform neural network vector calculations using a simple instruction set.
  • this solution can use a simple instruction set to improve the efficiency of the processor in performing vector calculations.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented.
  • processor 110 may represent any kind of instruction processing device.
  • processor 110 may be a general purpose processor or any other suitable processor.
  • Processor 110 is configured to receive instructions 140 and perform operations indicated by instructions 140, such as vector operations.
  • processor 110 may receive instructions 140 from other devices in environment 100 .
  • instructions 140 are SIMD instructions.
  • Processor 110 includes an instruction decoder 120 and an arithmetic logic unit 130 .
  • processor 110 may also include or be communicatively coupled to memory (not shown).
  • the memory may be data memory such as Vector Closely-coupled Memory (VCCM).
  • Instruction decoder 120, arithmetic logic unit 130, and memory are communicatively coupled. That is, the instruction decoder 120, the arithmetic logic unit 130, and the memory may communicate with each other according to appropriate data transmission protocols and/or standards. In operation, instruction decoder 120 receives instructions 140 and decodes instructions 140 .
  • instruction decoder 120 may decode instructions 140 into arithmetic and/or logical operations, etc., that may be processed by arithmetic logic unit 130 .
  • Instruction decoder 120 may be implemented using a variety of different mechanisms.
  • instruction decoder 120 may be implemented using hardware circuitry, or at least in part by means of software modules.
  • Arithmetic logic unit 130 is configured to operate based on information obtained by decoding instruction 140 by instruction decoder 120 . Arithmetic logic unit 130 may perform various arithmetic operations, logical operations, and the like. Arithmetic logic unit 130 may be implemented using a variety of different mechanisms. For example, arithmetic logic unit 130 may be implemented using hardware circuitry, or at least partially by means of software modules.
  • processor 110 may be implemented in a variety of existing or future computing platforms or computing systems.
  • The processor 110 may be implemented in various embedded applications (e.g., data processing systems of mobile network base stations) to provide services such as large-scale vector calculations.
  • the processor 110 may also be integrated or embedded into various electronic devices or computing devices to provide various computing services.
  • the application environment and application scenarios of the processor 110 are not limited here.
  • instruction decoder 120 decodes received instructions 140 .
  • Instructions 140 are sometimes referred to herein as "target instructions,” which may be used interchangeably in this context.
  • Figure 2 shows a schematic diagram of example instructions 140 in accordance with some embodiments of the present disclosure.
  • instruction 140 includes a target opcode 210 , a source operand 220 , and a target operand 230 .
  • the target opcode 210 is sometimes also referred to as an "opcode”, which may be used interchangeably in this context.
  • Target opcode 210 may indicate the vector operation specified by instruction 140 .
  • the source operand 220 is at least used to specify a source storage location in the memory for reading the data to be processed.
  • the target operand 230 is at least used to specify a target storage location in the memory for writing the processing result.
  • the instruction decoder 120 decodes the above information indicated by the target operation code 210, the source operand 220, and the target operand 230 for processing by the arithmetic logic unit 130.
  • arithmetic logic unit 130 is configured to read data to be processed from a source storage location in memory specified by source operand 220 .
  • Arithmetic logic unit 130 performs arithmetic logic operations associated with the vector operations specified by instructions 140 on the data to be processed.
  • the arithmetic logic unit 130 then writes the processing result of the data to be processed into the target storage location specified by the target operand 230 .
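  • The read, compute, and write-back sequence described above can be sketched in software as follows. This is a minimal illustrative model, not the disclosed hardware: the memory is assumed to be a list of vector words, and the operation names "vadd" and "vmul" and the function execute_mem_to_mem are hypothetical.

```python
from typing import Callable, Dict, List

# Assumption: the data memory is modeled as a list of vector words,
# and each vector word is a list of elements.
Memory = List[List[float]]

# Hypothetical operation table; these names do not come from the disclosure.
VECTOR_OPS: Dict[str, Callable[[float, float], float]] = {
    "vadd": lambda a, b: a + b,
    "vmul": lambda a, b: a * b,
}


def execute_mem_to_mem(memory: Memory, op: str,
                       a_vaddr: int, b_vaddr: int, c_vaddr: int) -> List[float]:
    """Read two source vector words, combine them, write the result back."""
    src_a = memory[a_vaddr]            # read data from the first source location
    src_b = memory[b_vaddr]            # read data from the second source location
    fn = VECTOR_OPS[op]                # operation selected by the decoded opcode
    result = [fn(x, y) for x, y in zip(src_a, src_b)]
    memory[c_vaddr] = result           # write the result to the target location
    return result


memory: Memory = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
execute_mem_to_mem(memory, "vadd", a_vaddr=0, b_vaddr=1, c_vaddr=2)
print(memory[2])
```

  • In this sketch both the operands and the result live in memory, so no explicit load or store step appears between instructions, which is the property the memory-to-memory description above relies on.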
  • Instructions 140 may be encoded using, for example, binary. In other embodiments, instructions 140 may be encoded using other encoding forms or other bases. Herein, unless otherwise specified, the encoding format and encoding representation of the instruction 140 described below are all in binary format. For example, the instruction 140 in binary form may be defined in the format in Table 1 below.
  • bits 86 to 95 are used to represent target opcode 210 of instruction 140.
  • Each operand from bit 22 to bit 85 is used to represent the source operand 220 of the instruction 140 .
  • Each parameter from bit 0 to bit 21 is used to represent the target operand 230 of the instruction 140.
  • any specific numerical values or digits appearing here and elsewhere herein are exemplary unless otherwise stated.
  • the number of bits at which each operation code and/or operand is listed above is exemplary rather than limiting.
  • the destination opcode 210, source operand 220, and destination operand 230 of instruction 140 may be located at other appropriate number of bits.
  • The source operand A_vaddr in bits 70 to 85 represents the address index of the data of the A channel (also called the first storage space of the memory) in a memory such as a data memory (for example, VCCM), that is, the address index of VCCM[A_vaddr].
  • The address index is in units of one vector word.
  • A vector word can represent a memory unit of a channel within the memory whose width equals the SIMD width. That is, the address index is in units of one SIMD width.
  • The memory depth is, for example, 1024. In this example, only 10 of bits 70 to 85 are needed to represent the address index of the A channel.
  • the memory may have other suitable depths and the address index may have other suitable number of bits.
  • the source operand A_index from bit 60 to bit 69 is used to represent the element index of the vector word of the A channel in the memory.
  • Each vector word can have e.g. 64 elements. This element index can be used to indicate a certain element in the vector word within the A channel.
  • If the vector word of the A channel is divided into, for example, 64 elements, only 6 of bits 60 to 69 are needed to represent A_index.
  • The source operand A_vm in bits 54 to 59 is used to represent the index of the vector mask (VM) register of the A channel.
  • For example, the A channel may have 16 VM registers. In such an example, A_vm may be represented using only 4 of bits 54 to 59.
  • The source operand B_vaddr in bits 38 to 53 represents the address index of the B channel (also known as the second storage space of the memory) within the memory (for example, the data memory VCCM), that is, the address index of VCCM[B_vaddr].
  • the address index is in units of one vector word. That is, the address index is in units of one SIMD width.
  • the source operand B_index in bits 28 to 37 is used to represent the element index of the vector word of the B channel.
  • the source operand B_vm in bits 22 to 27 is used to represent the index of the vector mask register of the B channel.
  • Examples of the target operand 230 in Table 1 include C_vaddr in bits 6 to 21, which can represent the address index of the C channel of the data memory VCCM, that is, the address index of VCCM[C_vaddr].
  • The address index is in units of one vector word. That is, the address index is in units of one SIMD width.
  • Examples of the target operand 230 also include C_vm in bits 0 to 5, which represents the index of the vector mask register of the C channel.
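  • The bit ranges listed above for Table 1 add up to a 96-bit instruction word. The following sketch packs and unpacks those fields, assuming that bit 0 is the least significant bit of the word; the helper names extract, decode_fields, and encode_fields are hypothetical.

```python
# Field name -> (lowest bit, highest bit), as listed in the text for Table 1.
# Assumption: bit 0 is the least significant bit of the 96-bit word.
FIELDS = {
    "opcode":  (86, 95),
    "A_vaddr": (70, 85),
    "A_index": (60, 69),
    "A_vm":    (54, 59),
    "B_vaddr": (38, 53),
    "B_index": (28, 37),
    "B_vm":    (22, 27),
    "C_vaddr": (6, 21),
    "C_vm":    (0, 5),
}


def extract(word: int, low: int, high: int) -> int:
    """Return the value stored in bits low..high (inclusive) of word."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def decode_fields(word: int) -> dict:
    """Split an instruction word into its named fields."""
    return {name: extract(word, lo, hi) for name, (lo, hi) in FIELDS.items()}


def encode_fields(values: dict) -> int:
    """Pack named field values into one instruction word; missing fields are 0."""
    word = 0
    for name, (lo, hi) in FIELDS.items():
        width = hi - lo + 1
        value = values.get(name, 0)
        if value < 0 or value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << lo
    return word


word = encode_fields({"opcode": 0b00_00110_01_0, "A_vaddr": 3, "B_vaddr": 7,
                      "B_index": 5, "C_vaddr": 9, "C_vm": 1})
print(decode_fields(word))
```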
  • FIG. 3 shows a schematic diagram of storage locations corresponding to example source operands according to some embodiments of the present disclosure.
  • the storage space of the memory is divided into multiple channels, such as channel 310-1, channel 310-2, ..., channel 310-N, etc., where N is an integer greater than 1.
  • Channels 310-1, 310-2, ..., 310-N are referred to collectively as channels 310 or individually as channel 310 below.
  • the value of N may be preset.
  • N can be set to different values such as 1024, 512, etc.
  • Each channel 310 includes, for example, 1024 bits or other suitable number of bits.
  • Address index 330 may indicate the address of channel 310-1.
  • the address of the channel 310 may be in units of vector words.
  • Address index 330 may be 16 bits. For example, if address index 330 is "0b0000_0000_0000_0000,” then address index 330 may indicate channel 310-1.
  • The address index 330 may also be 10 bits. For example, the channel 310-1 is indicated by the address "0b00_0000_0000". Note that encodings starting with "0b" herein are binary representations; this is not repeated below. The "_" appearing in a binary representation is for readability only, has no actual meaning, and does not occupy a binary bit.
  • The vector words of each channel 310 may be divided into multiple elements, such as element 320.
  • Element 320 may include, for example, 64 bits.
  • Element index 340 (e.g., source operand A_index or B_index) may indicate the index of a certain element, such as element 320.
  • Element index 340 may be 10 bits. For example, if the element index 340 is "0b00_0000_0000", the element index 340 may indicate the element 320. For another example, in the example where the number of elements in the vector word of each channel 310 is 64, the element index may be 6 bits, for example, the element index "0b00_0000" may indicate the element 320.
  • each channel may have a different number of bits, and each vector word may also have a different number of bits.
  • the address index and the element index can also have different number of bits and different encoding representations. The scope of the present disclosure is not limited in this regard.
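  • The relationship between the address index, the element index, and their bit widths can be illustrated with a small sketch. It assumes the example figures given earlier (a memory depth of 1024 vector words and 64 elements per vector word); the names index_bits and read_element are hypothetical.

```python
DEPTH = 1024             # example memory depth from the text
ELEMENTS_PER_WORD = 64   # example number of elements per vector word


def index_bits(count: int) -> int:
    """Number of bits needed to index `count` distinct positions."""
    return max(1, (count - 1).bit_length())


def read_element(memory, address_index: int, element_index: int):
    """Select a vector word by address index, then one element inside it."""
    if not 0 <= address_index < len(memory):
        raise IndexError("address index out of range")
    vector_word = memory[address_index]
    if not 0 <= element_index < len(vector_word):
        raise IndexError("element index out of range")
    return vector_word[element_index]


# Model the memory as DEPTH vector words of ELEMENTS_PER_WORD elements each.
memory = [[0] * ELEMENTS_PER_WORD for _ in range(DEPTH)]
memory[0b00_0000_0000][0b00_0000] = 42

print(index_bits(DEPTH), index_bits(ELEMENTS_PER_WORD))
print(read_element(memory, 0b00_0000_0000, 0b00_0000))
```

  • With these example sizes, 10 bits suffice for the address index and 6 bits for the element index, which matches the narrower encodings mentioned above.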
  • source operands 220 are listed above with reference to Table 1. Further examples of source operands 220 are described below with reference to Table 2.
  • source operand 220 may include A_imm located at bits 54 to 85, which represents an immediate value in instruction 140. Similarly, source operand 220 may also include B_imm located at bits 22 to 53, which represents another immediate value in instruction 140 . Similar to Table 1, the target operand 230 in Table 2 may also include C_vaddr and/or C_vm.
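  • As an illustration of the immediate fields, the sketch below extracts A_imm (bits 54 to 85) and B_imm (bits 22 to 53) from an instruction word. Interpreting each 32-bit immediate as an IEEE 754 single-precision value is an assumption made only for this example, as is treating bit 0 as the least significant bit; the function names are hypothetical.

```python
import struct


def extract_bits(word: int, low: int, high: int) -> int:
    """Return the value stored in bits low..high (inclusive) of word."""
    return (word >> low) & ((1 << (high - low + 1)) - 1)


def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float (assumption)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def immediates_as_float32(word: int):
    """Return (A_imm, B_imm) decoded from the bit ranges given for Table 2."""
    a_bits = extract_bits(word, 54, 85)
    b_bits = extract_bits(word, 22, 53)
    return bits_to_float32(a_bits), bits_to_float32(b_bits)


# 0x3FC00000 is 1.5 and 0x40000000 is 2.0 in IEEE 754 single precision.
word = (0x3FC00000 << 54) | (0x40000000 << 22)
print(immediates_as_float32(word))
```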
  • each source operand and/or each target operand described above in conjunction with Table 1 and Table 2 is only exemplary and not restrictive.
  • the source operand 220 and/or the destination operand 230 used in the present disclosure may include any one or more of the above source operands and/or destination operands.
  • source operand 220 and/or destination operand 230 may include any other suitable operand types other than the above source operands and/or destination operands.
  • Table 3 below describes example encodings of the opcodes of instruction 140. For example, bit 0 indicates the instruction type: if bit 0 is 0, instruction 140 is of the variable type; if bit 0 is 1, instruction 140 is of the immediate type.
  • Bits 1 to 2 represent the sub-function code of instruction 140.
  • Bits 3 to 7 represent the function code of instruction 140.
  • Bits 8 to 9 represent the calculation precision of instruction 140. For example, binary "00" can represent a single-precision floating-point number. Other binary values may be reserved for other calculation precisions.
  • the vector operation specified by instruction 140 may be determined based on the target opcode 210 of instruction 140 .
  • the processor 110 may pre-store the operation codes of each instruction.
  • Instruction decoder 120 may determine the vector operation specified by instruction 140 based on the target operation code 210 of received instruction 140 .
  • For example, if the target opcode 210 of instruction 140 is encoded as "0b00_00110_01_0", instruction decoder 120 may determine the instruction 140 to be a v2indexr instruction. It should be understood that the examples of opcodes and instruction types listed above are only illustrative and not restrictive. An instruction encoded as "0b00_00110_01_0" may also specify other vector operations.
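  • The opcode layout described for Table 3 can be checked against the example encoding with a short sketch. The function name decode_opcode and the dictionary keys are hypothetical; the bit positions follow the text (bit 0 for the type, bits 1 to 2 for the sub-function code, bits 3 to 7 for the function code, bits 8 to 9 for the precision).

```python
def decode_opcode(opcode: int) -> dict:
    """Split a 10-bit opcode into the sub-fields described for Table 3."""
    return {
        "is_immediate": bool(opcode & 0b1),     # bit 0: 0 = variable, 1 = immediate
        "sub_function": (opcode >> 1) & 0b11,   # bits 1 to 2
        "function": (opcode >> 3) & 0b11111,    # bits 3 to 7
        "precision": (opcode >> 8) & 0b11,      # bits 8 to 9
    }


# Example encoding given in the text for v2indexr.
print(decode_opcode(0b00_00110_01_0))
```

  • For the example encoding, the sketch yields precision 0b00, function code 0b00110, sub-function code 0b01, and the variable type, which matches the grouping of the digits in "0b00_00110_01_0".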
  • source operand 220 may include two source operands, such as A_vaddr and B_vaddr, or A_vaddr and B_imm.
  • the width of each source operand can be SIMD width.
  • source operand 220 may include only one source operand B_vaddr, etc.
  • the target operand 230, such as C_vaddr, can specify the target storage location where the processing result is written back to the memory, that is, VCCM[C_vaddr].
  • The processing result of the instruction 140 includes a processing result vector at the target storage location.
  • Target operand 230 also indicates the target VM register, such as C_vm or vm3.
  • The value at each position of the target VM register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector. For example, if the target register vm3[i] is 1, the i-th element of the processing result vector word is write-enabled and the corresponding processing result can be written to it. Conversely, if the target register vm3[i] is 0, the corresponding processing result cannot be written to the i-th element of the processing result vector word.
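  • The write-enable behavior of the target VM register can be sketched as an element-wise select. The function name masked_write is hypothetical, and the vector mask is modeled as a plain list of 0 and 1 values.

```python
def masked_write(dest, result, mask):
    """Return the new contents of dest after a mask-controlled write.

    Element i takes result[i] when mask[i] is 1 and keeps dest[i] otherwise.
    """
    if not (len(dest) == len(result) == len(mask)):
        raise ValueError("dest, result and mask must have the same length")
    return [r if m == 1 else d for d, r, m in zip(dest, result, mask)]


v3 = [0, 0, 0, 0]          # current contents of the target vector word
computed = [5, 6, 7, 8]    # processing results
vm3 = [1, 0, 1, 0]         # write-enable bits
print(masked_write(v3, computed, vm3))
```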
  • Table 4 describes several example instructions that processor 110 may support.
  • the instructions in Table 4 can be described with reference to the instruction definitions in Table 1 or Table 2, and can be encoded with reference to the example encoding method in Table 3.
  • target operands 230 both include C_vaddr (ie, &v3) and C_vm (ie, vm3).
  • Reserved in Table 4 indicates one or more bits reserved. These reserved bits can be encoded or used later.
  • instructions 140 include a first index determination instruction (eg, v2indexl or v2indexr in Table 4).
  • source operand 220 specifies the location of the first storage space of memory, ie, the address index of channel A (A_vaddr is &v1).
  • the source operand 220 also specifies a given index value of the data to be processed in the second storage space of the memory, that is, the element index within the vector word of channel B (B_vaddr is &v2, B_index is index2).
  • arithmetic logic unit 130 is configured to determine the first index.
  • the first index indicates the storage location in the first storage space of the value at the position indicated by the given index value in the data to be processed.
  • the opcode of the instruction v2indexl can be encoded as "0b00_00110_00_0".
  • the instruction v2indexl v1, v2, index2, v3, vm3 means assigning v3[i] to indext, where indext can make v1[indext] equal to v2[index2].
  • target operand 230 also indicates a target vector mask register. The value at each position of the target vector mask register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector. For example, if vm3[i] equals 1, then v3[i] is write-enabled.
  • the opcode of the instruction v2indexr can be encoded as "0b00_00110_01_0".
  • the instruction v2indexr v1, v2, index2, v3, vm3 means assigning v3[i] to indext, where indext is the index of the first element from right to left that can make v1[indext] equal to v2[index2]. If no element satisfies the above condition, indext is set to "-1" represented by two's complement. In this example, if vm3[i] equals 1, then v3[i] is write-enabled.
  • instructions 140 include a second index determination instruction (eg, v2indexli or v2indexri in Table 4).
  • source operand 220 specifies the location of the first storage space of memory, ie, the address index of channel A (A_vaddr is &v1).
  • Source operand 220 also specifies the first immediate value, namely, immediate value imm2.
  • the arithmetic logic unit 130 is configured to determine the second index.
  • the second index indicates the storage location of the first immediate value in the first storage space.
  • the opcode of instruction v2indexli is encoded as "0b00_00110_00_1".
  • the instruction v2indexli v1, imm2, v3, vm3 means assigning v3[i] to indext, where indext is the index of the first element from left to right that can make v1[indext] equal to imm2. If no element satisfies the above condition, indext is set to "-1" represented by two's complement. In this example, if vm3[i] equals 1, then v3[i] is write-enabled.
  • the opcode of the instruction v2indexri can be encoded as "0b00_00110_01_1".
  • the instruction v2indexri v1, imm2, v3, vm3 means assigning v3[i] to indext, where indext is the index of the first element from right to left that can make v1[indext] equal to imm2. If no element satisfies the above condition, indext is set to "-1" represented by two's complement. In this example, if vm3[i] equals 1, then v3[i] is write-enabled.
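  • The four index determination instructions described above can be modeled in software as shown below. This is a sketch under stated assumptions: vector words and VM registers are Python lists; the search direction of v2indexl and its "-1" result when nothing matches are assumed by analogy with v2indexli, since the text does not state them; the 32-bit width used for the two's complement encoding and the helper names `find_index`, `write_index`, and `as_twos_complement` are hypothetical.

```python
def find_index(v1, target, from_left=True):
    """Index of the first element equal to target in the scan direction, else -1."""
    lanes = range(len(v1))
    order = lanes if from_left else reversed(lanes)
    for i in order:
        if v1[i] == target:
            return i
    return -1


def write_index(indext, v3, vm3):
    """Write indext to every lane of v3 whose vm3 bit is 1."""
    return [indext if m == 1 else old for old, m in zip(v3, vm3)]


def v2indexl(v1, v2, index2, v3, vm3):
    # Assumed left-to-right, by analogy with v2indexli.
    return write_index(find_index(v1, v2[index2], True), v3, vm3)


def v2indexr(v1, v2, index2, v3, vm3):
    return write_index(find_index(v1, v2[index2], False), v3, vm3)


def v2indexli(v1, imm2, v3, vm3):
    return write_index(find_index(v1, imm2, True), v3, vm3)


def v2indexri(v1, imm2, v3, vm3):
    return write_index(find_index(v1, imm2, False), v3, vm3)


def as_twos_complement(value, bits=32):
    """Bit pattern of value in two's complement (bit width is an assumption)."""
    return value & ((1 << bits) - 1)
```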
  • instructions 140 may include a first value determination instruction, such as instruction Sindex2v in Table 4.
  • the source operand 220 specifies the location of the first storage space of the memory, that is, the address index of channel A (A_vaddr is &v1).
  • the source operand 220 also specifies a given index value of the data to be processed in the second storage space of the memory, that is, the element index within the vector word of channel B (B_vaddr is &v2, B_index is index2).
  • the arithmetic logic unit 130 is configured to: determine a given value of the data to be processed at a location indicated by a given index value, and determine a first value at the position in the first storage space indexed by the given value.
  • the instruction Sindex2v v1,v2,index2,v3,vm3 has an opcode encoded as "0b00_00110_10_0". This instruction means assigning v3[i] to v1[v2[index2]]. If vm3[i] is equal to 1, v3[i] is write enabled.
  • instructions 140 include second value determination instructions.
  • the source operand 220 specifies a given index value of the data to be processed in the second storage space of the memory, that is, B_vaddr is &v2 and B_index is index2.
  • the arithmetic logic unit 130 is configured to determine a second value in the data to be processed at the position indicated by the given index value.
  • the instruction s2v v2,index2,v3,vm3 has an opcode encoded as "0b00_00110_10_1". This instruction means assigning v3[i] to v2[index2]. If vm3[i] is equal to 1, v3[i] is write enabled.
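  • The two value determination instructions can be sketched as an indirect lookup and a scalar broadcast. This is a minimal model assuming list-based vector words; the lowercase function names are illustrative stand-ins for Sindex2v and s2v.

```python
def sindex2v(v1, v2, index2, v3, vm3):
    """Model of Sindex2v: write v1[v2[index2]] to each enabled lane of v3."""
    value = v1[v2[index2]]  # v2[index2] is used as an index into v1
    return [value if m == 1 else old for old, m in zip(v3, vm3)]


def s2v(v2, index2, v3, vm3):
    """Model of s2v: write the scalar v2[index2] to each enabled lane of v3."""
    value = v2[index2]
    return [value if m == 1 else old for old, m in zip(v3, vm3)]
```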
  • the processor 110 can better process some operations such as obtaining coordinates, for example operators such as the index-of-maximum operator (ArgMax), the index-of-minimum operator (ArgMin), or the operator for finding the K highest-ranked values (TopK). Taking ArgMax as an example, it finds the index for which the value v[index] is the maximum value in the vector v.
  • The instructions required for a 64-element ArgMax are as follows. First, v2smax v1,vm1,v2,vm2 (this instruction will be described in Table 5 and Table 6 below) finds the largest element value in v1 and writes it to v2, where all bits of the values stored in vm1 and vm2 are 1. Next, v2indexl v1,v2,0,v3,vm3 finds the index such that v1[index] is equal to v2[0] and writes the value of index to v3, where all bits of the value stored in vm3 are 1.
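  • The two-instruction ArgMax sequence can be traced with a small software model. This is a sketch only: the behavior of v2smax is taken from the one-sentence description above (its full definition is in Table 5 and Table 6), it is assumed that vm1 selects the lanes that are read and vm2 the lanes that are written, and v2indexl is assumed to search from left to right.

```python
def v2smax(v1, vm1, v2, vm2):
    """Assumed model: write the largest enabled element of v1 to enabled lanes of v2."""
    largest = max(x for x, m in zip(v1, vm1) if m == 1)
    return [largest if m == 1 else old for old, m in zip(v2, vm2)]


def v2indexl(v1, v2, index2, v3, vm3):
    """Assumed model: first index (left to right) where v1 equals v2[index2], else -1."""
    indext = -1
    for i, x in enumerate(v1):
        if x == v2[index2]:
            indext = i
            break
    return [indext if m == 1 else old for old, m in zip(v3, vm3)]


def argmax(v1):
    """ArgMax built from v2smax followed by v2indexl, with all mask bits set to 1."""
    lanes = len(v1)
    ones = [1] * lanes
    v2 = v2smax(v1, ones, [0] * lanes, ones)       # v2 holds the maximum
    v3 = v2indexl(v1, v2, 0, [0] * lanes, ones)    # v3 holds its index
    return v3[0]
```

  • With ties, this model returns the leftmost index, which follows from the assumed search direction rather than from the text.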
  • the target instructions include vector transpose instructions, such as the instructions vtranspose or vstranspose.
  • the source operand 220 specifies the first position in the first storage space in the memory, that is, A_vaddr is &v1, and A_index is index1.
  • Source operand 220 also specifies the source vector mask register vm1 and optionally vm2.
  • the arithmetic logic unit 130 is configured to vector-transpose the data to be processed at the first location in the first storage space to obtain the transposed data to be processed.
  • the vector transpose instruction vtranspose v1,index1,vm1,vm2,v3,vm3 has an opcode encoded as 0b00_00111_11_0, which is used to transpose a vector (or matrix) of, for example, 32*32.
  • the values of vm1 and vm2 enable the read lane (lane); the value of vm3 enables the write lane.
  • the number R of consecutive 1 bits in vm1 is used to represent the number of rows of the matrix
  • the number C of consecutive 1 bits in vm2 is used to represent the number of columns of the matrix, where R and C are both arbitrary natural numbers, and R and C can be the same or different.
  • the valid bits of vm1, vm2 and vm3 must be consecutive, otherwise the first 1 in the lowest bit shall prevail.
  • the above vector transpose instruction vtranspose can be used to transpose the R*C matrix.
  • the vector transpose instruction vstranspose v1, index1, vm1, v3, vm3 can be used, which is used to transpose the square matrix.
  • the value of vm1 is the read channel enable; the value of vm3 is the write channel enable.
  • the number R of consecutive 1 bits in vm1 is used to represent the number of rows (or columns) of the square matrix.
  • the vector transpose instruction vstranspose can be used to transpose the R*R square matrix.
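  • How the mask registers determine the matrix shape for the transpose instructions can be sketched as follows. This is an illustration under several assumptions not stated in the text: the matrix is held as a list of row vector words, position 0 of a mask list is its lowest bit, and the write mask vm3 is omitted for brevity. The helper name `count_low_ones` is hypothetical.

```python
def count_low_ones(mask_bits):
    """Count consecutive 1s starting from the lowest position (index 0)."""
    count = 0
    for bit in mask_bits:
        if bit != 1:
            break
        count += 1
    return count


def vtranspose(rows, vm1, vm2):
    """Transpose the R*C sub-matrix of rows; R and C come from vm1 and vm2."""
    r = count_low_ones(vm1)  # number of rows
    c = count_low_ones(vm2)  # number of columns
    return [[rows[i][j] for i in range(r)] for j in range(c)]


def vstranspose(rows, vm1):
    """Square-matrix variant: a single mask gives both dimensions."""
    return vtranspose(rows, vm1, vm1)
```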
  • the vector transpose instruction is not a standard RISC type instruction.
  • the vector transposition function must be completed through multiple consecutive transposition instructions.
  • This solution can improve the computing power of part of the network by using vector transpose instructions.
  • the neural network training process usually involves a large number of transposition operations of matrices or square matrices. Using the vector transpose instruction of this solution can improve the computational efficiency of the neural network training process.
  • the target instructions include exponent instructions, such as vexp.
  • source operand 220 specifies the source storage location, that is, A_vaddr is &v1.
  • arithmetic logic unit 130 is configured to determine an exponent value with a predetermined value (eg, the natural base e) as the base raised to the power of the data to be processed at the source storage location.
  • vexp v1, v3, vm3 has an opcode encoded as "0b00_01000_01_0", which means assigning v3[i] to exp(v1[i]). If vm3[i] is equal to 1, v3[i] is write enabled.
  • the above exponential instructions are suitable for sigmoid operators and operators such as hyperbolic functions sinh/cosh/tanh.
  • the sigmoid operator, sinh operator, cosh operator and tanh operator can be represented by the following equations (1) to (4).
  • x represents the data to be processed.
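  • To illustrate why an exponent instruction is sufficient for these operators, the sketch below builds them from a software model of vexp using the standard definitions: sigmoid(x) = 1/(1 + e^(-x)), sinh(x) = (e^x - e^(-x))/2, cosh(x) = (e^x + e^(-x))/2, and tanh(x) = sinh(x)/cosh(x). The list-based vector model and the helper `_exp_all` are assumptions for illustration.

```python
import math


def vexp(v1, v3, vm3):
    """Model of vexp: v3[i] = exp(v1[i]) for every lane with vm3[i] == 1."""
    return [math.exp(x) if m == 1 else old for x, old, m in zip(v1, v3, vm3)]


def _exp_all(v):
    """Apply vexp with every lane write-enabled (hypothetical helper)."""
    return vexp(v, [0.0] * len(v), [1] * len(v))


def sigmoid(v):
    return [1.0 / (1.0 + e) for e in _exp_all([-x for x in v])]


def sinh(v):
    pos, neg = _exp_all(v), _exp_all([-x for x in v])
    return [(p - n) / 2.0 for p, n in zip(pos, neg)]


def cosh(v):
    pos, neg = _exp_all(v), _exp_all([-x for x in v])
    return [(p + n) / 2.0 for p, n in zip(pos, neg)]


def tanh(v):
    pos, neg = _exp_all(v), _exp_all([-x for x in v])
    return [(p - n) / (p + n) for p, n in zip(pos, neg)]
```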
  • instructions 140 include VM register instructions, such as vm2index instructions.
  • Source operand 220 indicates the source VM register in memory, namely vm1.
  • the arithmetic logic unit is configured to store the index at the enabled location in the source VM register to the target storage location.
  • instructions 140 include onehot code conversion instructions, such as vindex2vm.
  • This is a VM register manipulation instruction.
  • the source operand 220 specifies a given index value of the data to be processed in the second storage space of the memory, that is, B_vaddr is &v2, and B_index is index2.
  • the behavior of reading the vector mask register involved in this instruction is not a write enable for memory writing (other instructions that write to memory require reading the vector mask register as a write enable).
  • Destination operand 230 specifies the destination VM register, which is vm3.
  • the arithmetic logic unit is configured to: convert the value of the data to be processed at a given index value into a one-hot code, and store the one-hot code into the target VM register.
  • the instruction vindex2vm v2,index2,vm3 has an opcode encoded as "0b00_10000_01_0". This instruction indicates that vm3 is assigned the value onehot(v2[index2]), where onehot() indicates the one-hot code conversion function.
  • the one-hot code conversion instruction described above is suitable for index-type instructions and can support one-hot code operators, which convert a number to one-hot encoded form.
  • the following two instructions can be used to implement such an operator: vindex2vm v1,0,vm1 and vmload vm1,v2,vm2. The first instruction converts the value of v1[0] into one-hot encoded form and writes it to vm1; the second instruction (vmload will be described in Table 7 and Table 8 below) stores the value in vm1 to v2, where all bits of the value stored in vm2 are 1.
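  • The two-instruction one-hot sequence can be modeled as below. This is a sketch with assumptions: a lane count of 8 is chosen only to keep the example short, mask registers are lists of 0/1 values, and vmload is modeled from its one-line description above (its full definition is in Table 7 and Table 8).

```python
LANE_NUM = 8  # assumed lane count, for illustration only


def onehot(value, lanes=LANE_NUM):
    """One-hot code: a single 1 at position value, 0 elsewhere."""
    if not 0 <= value < lanes:
        raise ValueError("value does not fit in the mask register")
    return [1 if i == value else 0 for i in range(lanes)]


def vindex2vm(v2, index2, lanes=LANE_NUM):
    """Model of vindex2vm: return the new vm3 contents, onehot(v2[index2])."""
    return onehot(v2[index2], lanes)


def vmload(vm1, v2, vm2):
    """Assumed model of vmload: copy mask bits of vm1 into enabled lanes of v2."""
    return [bit if m == 1 else old for bit, old, m in zip(vm1, v2, vm2)]


def onehot_operator(v1):
    """One-hot operator: vindex2vm v1,0,vm1 followed by vmload vm1,v2,vm2."""
    vm1 = vindex2vm(v1, 0)
    return vmload(vm1, [0] * LANE_NUM, [1] * LANE_NUM)
```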
  • Table 5 shows examples of more general instructions 140 supported by the processor 110 .
  • the instructions in Table 5 can be described with reference to the instruction definitions in Table 1 or Table 2, and the opcodes are encoded using the example encoding method in Table 3.
  • the MAX() and MIN() functions in Table 6 represent the functions for finding the maximum value and the minimum value respectively.
  • DW represents the width of the vector word
  • LANE_NUM represents the number of elements in a vector word
  • the mod() function represents the remainder function
  • the ceil() and floor() functions represent upward rounding and downward rounding respectively
  • the SUM() function represents the summation function.
  • instructions supported by processor 110 also include various vector mask register access and operation instructions.
  • vector mask register access and manipulation instructions are shown in Table 7.
  • The functions and definitions of each instruction in Table 7 will be shown in Table 8.
  • the functions of these instructions include reading, writing and operating on the vector mask register. These instructions involve the act of reading the vector mask register, not as a write enable for memory writes (other instructions that write to memory require reading the vector mask register as a write enable). These instructions are not described in detail here.
  • the instructions 140 supported by the processor 110 also include internal register access and manipulation instructions. This type of instruction is used to handle access to internal registers and some special operations. For example, writing to the internal control and status (CSR) register, writing a fixed value, or a certain SIMD length data in the data memory VCCM, etc. Another example is reading out the internal CSR register or reading out the data memory VCCM; and empty instructions (ie, no operation is performed, waiting for 1 cycle), etc.
  • Table 9 shows several examples of internal register access and manipulation instructions.
  • Table 10 shows the functions of each instruction in Table 9. These instructions are not described in detail in this article. Note that for the vwcsr instruction in Table 9, the source operand is in channel A, while for the vwcsri instruction, the immediate data is in channel B.
  • the various instructions supported by the processor 110 are described above in conjunction with Table 4 to Table 10. These instructions may be decoded by instruction decoder 120 of processor 110 and executed by arithmetic logic unit 130. These instructions may constitute an instruction set supported by processor 110. It should be understood that in some embodiments, the instruction set may be constructed from only some or all of the above individual instructions. Alternatively or additionally, other suitable instructions not described above may also be employed to construct the instruction set supported by processor 110.
  • although the individual instructions in Tables 4 to 10 are enumerated with reference to the example instruction definitions and example opcode encoding representations specified above in Tables 1 to 3, they are only exemplary and not limiting.
  • the instruction set supported by the processor of the present disclosure may be defined and encoded in any suitable manner.
  • individual bits of each instruction may have different meanings than those represented by each bit in Table 1 or Table 2.
  • the coded representation of the operation code of each instruction may have different digits from those in Table 3, and each digit may also have a different meaning from each bit in Table 3.
  • the encoding representation of the operation codes of each instruction in the above Table 4 to Table 10 can be changed or interchanged. Individual instructions can also be represented by other names. The scope of the present disclosure is not limited in this regard.
  • the instructions described above do not include branch type instructions, nor do they include load/store type instructions.
  • the processor of the present disclosure uses a memory-to-memory SIMD processor architecture.
  • the above instruction set defines multiple (for example, 64 or more or less) vector mask registers to represent the specific vectors that each SIMD instruction needs to process.
  • by using a SIMD processor suited to a memory-to-memory architecture, the operation of processor 110 is simplified.
  • processor 110 is able to perform a large number of vector calculations using a simple instruction set.
  • the processor 110 can use a simple instruction set to perform tasks such as vector calculation of neural network operators.
  • this solution can use a simple instruction set to improve the efficiency of the processor performing vector calculations.
  • the solution of the present disclosure can greatly improve computational efficiency.
  • a processor according to embodiments of the present disclosure may support various index determination instructions, thereby improving the efficiency of various vector calculations such as obtaining coordinates.
  • the processor of the present disclosure can process instructions such as vector transpose, thereby improving the computational efficiency of corresponding calculations in the neural network training process.
  • the processor of the present disclosure can support exponential instructions, thereby improving and optimizing the calculation efficiency of sigmoid operators and hyperbolic function operators.
  • FIG. 4 illustrates a flow diagram of a process 400 for data processing in accordance with some embodiments of the present disclosure.
  • Process 400 may be implemented at processor 110 .
  • process 400 will be described with reference to environment 100 of FIG. 1 .
  • the target instruction, such as instruction 140, is decoded by the processor 110.
  • instruction 140 may be decoded by instruction decoder 120 of processor 110 .
  • Instruction 140 involves a target opcode 210 , a source operand 220 , and a target operand 230 .
  • Target opcode 210 indicates the vector operation specified by instruction 140 .
  • the source operand 220 specifies at least a source storage location in memory from which to read the data to be processed.
  • the target operand 230 specifies at least a target storage location in memory for writing the processing results.
  • the data to be processed is read by the processor 110 from the source storage location of the memory.
  • the data to be processed may be read from the source storage location of the memory by the arithmetic logic unit 130 of the processor 110 .
  • arithmetic logical operations associated with the vector operations specified by the target instructions are performed by the processor 110 on the data to be processed.
  • the arithmetic logic operations described above may be performed by the arithmetic logic unit 130 of the processor 110 .
  • the processing result of the data to be processed is written by the processor 110 to a target storage location of the memory.
  • the processing results may be written to the target storage location by the arithmetic logic unit 130 of the processor 110 .
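  • The decode, read, execute, and write steps of process 400 can be traced with a small memory-to-memory model. This is a sketch, not the disclosed implementation: the data memory VCCM and the VM registers are modeled as dictionaries, only the vexp opcode quoted in this description is registered, and the `Instruction` fields and the `run` function are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Instruction:
    opcode: str   # target opcode, e.g. "0b00_01000_01_0" for vexp
    a_vaddr: int  # source storage location in memory
    c_vaddr: int  # target storage location in memory
    c_vm: int     # number of the target VM register


# Vector operation selected by each opcode (only vexp is modeled here).
OPERATIONS = {
    "0b00_01000_01_0": lambda word: [math.exp(x) for x in word],
}


def run(instr, vccm, vm_regs):
    """Execute one instruction against the memory model, in place."""
    operation = OPERATIONS[instr.opcode]   # decode the target opcode
    data = vccm[instr.a_vaddr]             # read the data to be processed
    results = operation(data)              # perform the vector operation
    mask = vm_regs[instr.c_vm]
    old = vccm[instr.c_vaddr]
    # Write back only the lanes enabled by the target VM register.
    vccm[instr.c_vaddr] = [r if m == 1 else o
                           for r, o, m in zip(results, old, mask)]
```

  • Because both the source and the result live in the memory model, no separate load or store step appears in the sequence.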
  • instructions 140 include index determination instructions.
  • the index determination instruction may be a first index determination instruction (v2indexl or v2indexr) or a second index determination instruction (v2indexli or v2indexri).
  • the source operand 220 specifies the location of the first storage space of the memory.
  • the source operand 220 also specifies either a given index value of the data to be processed in the second storage space of the memory, or the first immediate value.
  • the arithmetic logic operation performed by the processor 110 includes: determining the first index or determining the second index.
  • the first index indicates the storage location in the first storage space of the value at the position indicated by the given index value in the data to be processed.
  • the second index indicates the storage location of the first immediate value in the first storage space.
  • instruction 140 includes a first value determination instruction (eg, instruction Sindex2v), source operand 220 specifies a location in a first storage space of memory, and source operand 220 also specifies a given index value of the data to be processed in a second storage space of memory.
  • the arithmetic logic operations performed by the processor 110 include: determining a given value of the data to be processed at a location indicated by a given index value; and determining a first value at the position in the first storage space indexed by the given value.
  • instructions 140 include second value determination instructions, such as instructions s2v.
  • the source operand 220 specifies a given index value of the data to be processed within the second storage space of the memory.
  • the arithmetic logic operations performed by the processor 110 include determining a second value in the data to be processed at the location indicated by the given index value.
  • instructions 140 include vector transpose instructions, such as instructions vtranspose or vstranspose.
  • Source operand 220 specifies a first location in a first storage space in memory.
  • the arithmetic logic operation performed by the processor 110 includes vector transposing the data to be processed at the first location in the first storage space to obtain the transposed data to be processed.
  • instructions 140 include exponent instructions, such as the vexp instruction.
  • Source operand 220 specifies the source storage location.
  • the arithmetic logic operation performed by the processor 110 includes determining an exponent value with a predetermined numerical base raised to the power of the data to be processed at the source storage location.
  • instructions 140 include VM register instructions, such as vm2index.
  • Source operand 220 of instruction 140 indicates the source VM register in memory.
  • the arithmetic logic operation performed by the processor 110 includes storing the index at the enabled location in the source VM register to the target storage location.
  • each instruction 140 described above includes a processing result vector at the target storage location.
  • Target operand 230 also indicates the target VM register.
  • The value at each position in the target VM register indicates whether the corresponding processing result is to be written to the corresponding position in the processing result vector. For example, if the target register vm3[i] is 1, it means that the i-th element of the processing result vector word is write-enabled and can be written with the corresponding processing result. Conversely, if the target register vm3[i] is 0, the i-th element of the processing result vector word cannot be written with the corresponding processing result.
  • instructions 140 include one-hot code conversion instructions, such as the instruction vindex2vm.
  • the source operand 220 specifies a given index value of the data to be processed within the second storage space of the memory.
  • Destination operand 230 specifies the destination VM register.
  • the processor 110 converts the value of the data to be processed at the given index value into a one-hot code.
  • the processor 110 is also configured to store the one-hot code into the target VM register.
  • FIG. 5 illustrates a block diagram of an electronic device 500 in which a processor 110 may be included in accordance with one or more embodiments of the present disclosure. It should be understood that the electronic device 500 shown in FIG. 5 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein.
  • electronic device 500 is in the form of a general electronic device or computing device.
  • Components of electronic device 500 may include, but are not limited to, one or more processors 110 , memory 520 , storage devices 530 , one or more communication units 540 , one or more input devices 550 , and one or more output devices 560 .
  • processor 110 may perform various processes according to programs stored in memory 520 .
  • the processor 110 may be a multi-core processor that can execute computer-executable instructions in parallel to improve the parallel processing capability of the electronic device 500 .
  • Electronic device 500 typically includes a plurality of computer storage media. Such media may be any available media that is accessible to electronic device 500, including, but not limited to, volatile and nonvolatile media, removable and non-removable media.
  • Memory 520 may be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read only memory (ROM), electrically erasable programmable read only memory (EEPROM) , flash memory) or some combination thereof.
  • Storage device 530 may be a removable or non-removable medium and may include machine-readable media such as a flash drive, a magnetic disk, or any other medium that may be capable of storing information and/or data (e.g., using training data for training) and can be accessed within the electronic device 500.
  • Electronic device 500 may further include additional removable/non-removable, volatile/non-volatile storage media.
  • a disk drive may be provided for reading from or writing to a removable, non-volatile disk (eg, a "floppy disk"), and an optical disc drive may be provided for reading from or writing to a removable, non-volatile optical disc.
  • each drive may be connected to the bus (not shown) by one or more data media interfaces.
  • Memory 520 may include a computer program product 525 having one or more program modules configured to perform various methods or actions of various embodiments of the disclosure. For example, these program modules may be configured to implement various functions or actions of processor 110 , such as implementing the functions of instruction decoder 120 and arithmetic logic unit 130 .
  • the communication unit 540 implements communication with other electronic devices or computing devices through communication media. Additionally, the functionality of the components of electronic device 500 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over a communications connection. Accordingly, electronic device 500 may operate in a networked environment using a logical connection to one or more other servers, a network personal computer (PC), or another network node.
  • Input device 550 may be one or more input devices, such as a mouse, a keyboard, a trackball, etc.
  • Output device 560 may be one or more output devices, such as a display, speakers, printer, etc.
  • the electronic device 500 may also communicate, as needed through the communication unit 540, with one or more external devices (not shown), such as storage devices, display devices, etc., with one or more devices that enable the user to interact with the electronic device 500, or with any device (eg, network card, modem, etc.) that enables electronic device 500 to communicate with one or more other electronic devices or computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • a computer-readable storage medium is provided with computer-executable instructions stored thereon, wherein the computer-executable instructions are executed by a processor to implement the method described above.
  • a computer program product is also provided, the computer program product is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, thereby producing a machine such that the computer-readable program instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus that implements the functions/actions specified in one or more blocks in the flowchart and/or block diagram.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium. These instructions cause the computer, programmable data processing device and/or other equipment to work in a specific manner. Therefore, the computer-readable medium storing the instructions includes an article of manufacture that includes instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Example 1 describes a processor including an instruction decoder configured to decode target instructions for vector operations.
  • Target instructions involve target opcodes, source operands, and destination operands.
  • the target opcode indicates the vector operation specified by the target instruction.
  • the source operand specifies at least the source storage location in memory from which to read the data to be processed.
  • the destination operand specifies at least a destination storage location in memory to which to write the processing results.
  • the processor also includes an arithmetic logic unit coupled to the instruction decoder and the memory.
  • the arithmetic logic unit is configured to: read the data to be processed from the source storage location of the memory; perform the arithmetic logic operation associated with the vector operation specified by the target instruction on the data to be processed; and write the processing result of the data to be processed to the memory target storage location.
  • Example 2 includes the processor as described in Example 1, wherein the target instruction includes a first index determination instruction, the source operand specifies the location of the first storage space of the memory, and the source operand also specifies a given index value of the data to be processed in the second storage space of the memory.
  • the arithmetic logic unit is configured to perform an arithmetic logic operation associated with the vector operation specified by the target instruction: determine a first index indicating the storage location, in the first storage space, of the value at the position indicated by the given index value in the data to be processed.
  • Example 3 includes the processor as described in Example 1, wherein the target instruction includes a second index determination instruction, the source operand specifies a location of the first storage space of the memory, and the source operand further specifies the first immediate value.
  • the arithmetic logic unit is configured to perform an arithmetic logic operation associated with the vector operation specified by the target instruction: determine a second index indicating a storage location of the first immediate value in the first storage space.
  • Example 4 includes the processor as described in Example 1, wherein the target instruction includes a first value determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies a given index value of the data to be processed in the second storage space of the memory.
  • the arithmetic logic unit is configured to perform an arithmetic logic operation associated with the vector operation specified by the target instruction: determine a given value of the data to be processed at a location indicated by a given index value; and determine a first value at the position in the first storage space indexed by the given value.
  • Example 5 includes the processor as described in Example 1, wherein the target instruction includes a second value determination instruction, and the source operand specifies a given index value of the data to be processed in a second storage space of the memory.
  • the arithmetic logic unit is configured to perform an arithmetic logic operation associated with the vector operation specified by the target instruction: determine a second value in the data to be processed at a location indicated by a given index value.
  • Example 6 includes the processor as described in Example 1, wherein the target instruction includes a vector transpose instruction and the source operand specifies at least a first location in a first storage space of the memory.
  • The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: vector-transpose the data to be processed at the first location in the first storage space to obtain transposed data to be processed.
  • Example 7 includes the processor as described in Example 1, wherein the target instruction includes an exponent instruction and the source operand specifies the source storage location.
  • The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: determine an exponent value whose base is a predetermined value and whose power is the data to be processed at the source storage location.
  • Example 8 includes the processor as described in Example 1, wherein the target instruction includes a vector mask (VM) register instruction and the source operand indicates a source VM register in the memory.
  • The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: store the indices of the enabled positions in the source VM register to the target storage location.
  • Example 9 includes the processor as described in any one of Examples 2 to 8, wherein the target storage location holds a processing result vector, the target operand further indicates a target vector mask (VM) register, and the value at each position of the target VM register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector.
  • Example 10 includes the processor as described in Example 1, wherein the target instruction includes a one-hot code conversion instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, and the target operand specifies a target vector mask (VM) register.
  • The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: convert the value of the data to be processed at the given index value into a one-hot code; and store the one-hot code in the target VM register.
  • Example 11 describes a method of data processing.
  • The method includes decoding a target instruction for a vector operation, the target instruction involving a target opcode, a source operand, and a target operand.
  • The target opcode indicates the vector operation specified by the target instruction.
  • The source operand specifies at least a source storage location in a memory from which to read the data to be processed.
  • The target operand specifies at least a target storage location in the memory to which to write the processing result.
  • The method also includes: reading the data to be processed from the source storage location of the memory; performing, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction; and writing the processing result of the data to be processed to the target storage location of the memory.
  • Example 12 includes the method described in Example 11, wherein the target instruction includes an index determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies at least one of the following: a given index value of the data to be processed in a second storage space of the memory, and a first immediate value.
  • Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes at least one of: determining a first index, the first index indicating the storage location, in the first storage space, of the value at the position in the data to be processed indicated by the given index value; and determining a second index, the second index indicating the storage location of the first immediate value in the first storage space.
  • Example 13 includes the method described in Example 11, wherein the target instruction includes a first value determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies a given index value of the data to be processed in a second storage space of the memory.
  • Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes: determining a given value of the data to be processed at the position indicated by the given index value; and determining a first value in the first storage space at the position indexed by the given value.
  • Example 14 includes the method described in Example 11, wherein the target instruction includes a second value determination instruction, and the source operand specifies a given index value of the data to be processed in a second storage space of the memory. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes determining a second value in the data to be processed at the position indicated by the given index value.
  • Example 15 includes the method described in Example 11, wherein the target instruction includes a vector transpose instruction and the source operand specifies at least a first location in a first storage space of the memory.
  • Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes: vector-transposing the data to be processed at the first location in the first storage space to obtain transposed data to be processed.
  • Example 16 includes the method described in Example 11, wherein the target instruction includes an exponent instruction and the source operand specifies the source storage location. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes determining an exponent value whose base is a predetermined value and whose power is the data to be processed at the source storage location.
  • Example 17 includes the method described in Example 11, wherein the target instruction includes a vector mask (VM) register instruction and the source operand indicates a source VM register in the memory. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes storing the indices of the enabled positions in the source VM register to the target storage location.
  • Example 18 includes the method described in Example 11, wherein the target instruction includes a one-hot code conversion instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, and the target operand specifies a target vector mask (VM) register. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes: converting the value of the data to be processed at the given index value into a one-hot code; and storing the one-hot code in the target VM register.
  • Example 19 describes an electronic device that includes at least the processor according to any one of Examples 1 to 10.
  • Example 20 describes a computer-readable storage medium having a computer program stored thereon.
  • The computer program is executed by a processor to implement the method according to any one of Examples 11 to 18.
  • Each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical function.
  • In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Complex Calculations (AREA)
  • Executing Machine-Instructions (AREA)
  • Memory System (AREA)

Abstract

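According to embodiments of the present disclosure, a processor and a method, device, and storage medium for data processing are provided. The processor includes an instruction decoder configured to decode a target instruction for a vector operation. The target instruction involves a target opcode, a source operand, and a target operand. The target opcode indicates the vector operation specified by the target instruction. The source operand specifies a source storage location in a memory from which data to be processed is read. The target operand specifies a target storage location in the memory to which a processing result is written. The processor further includes an arithmetic logic unit coupled to the instruction decoder and the memory. The arithmetic logic unit is configured to: read the data to be processed from the source storage location of the memory; perform, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction; and write the processing result to the target storage location of the memory. In this way, the efficiency of vector computation can be improved.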

Description

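Processor, and Method, Device, and Storage Medium for Data Processing
This application claims priority to Chinese invention patent application No. 202210674857.6, filed on June 14, 2022 and entitled "Processor, and Method, Device, and Storage Medium for Data Processing", the entire contents of which are incorporated herein by reference.
Technical Field
Example embodiments of the present disclosure relate generally to the field of computers, and in particular to a processor and to a method, device, and computer-readable storage medium for data processing.
Background
With the development of information technology, processors of many kinds are used in a wide variety of scenarios. Different instruction set architectures (ISAs) that processors can adopt have been proposed for these scenarios. Such ISAs usually have to remain compatible with many different use cases. For vector computations with highly repetitive instructions and large data volumes, a better-suited ISA is needed so that the processor can handle such computations more efficiently.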
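Summary
In a first aspect of the present disclosure, a processor is provided. The processor includes an instruction decoder configured to decode a target instruction for a vector operation. The target instruction involves a target opcode, a source operand, and a target operand. The target opcode indicates the vector operation specified by the target instruction. The source operand specifies a source storage location in a memory from which data to be processed is read. The target operand specifies a target storage location in the memory to which a processing result is written. The processor further includes an arithmetic logic unit coupled to the instruction decoder and the memory. The arithmetic logic unit is configured to: read the data to be processed from the source storage location of the memory; perform, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction; and write the processing result to the target storage location of the memory.
In a second aspect of the present disclosure, a method for data processing is provided. The method includes decoding a target instruction for a vector operation. The target instruction involves a target opcode, a source operand, and a target operand. The target opcode indicates the vector operation specified by the target instruction. The source operand specifies a source storage location in a memory from which data to be processed is read. The target operand specifies a target storage location in the memory to which a processing result is written. The method further includes: reading the data to be processed from the source storage location of the memory; performing, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction; and writing the processing result to the target storage location of the memory.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes at least the processor according to the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium, and the computer program is executable by a processor to implement the method of the second aspect.
It should be understood that the content described in this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the description below.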
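Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:
FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic diagram of an example instruction according to some embodiments of the present disclosure;
FIG. 3 shows a schematic diagram of storage locations corresponding to example source operands according to some embodiments of the present disclosure;
FIG. 4 shows a flowchart of a process for data processing according to some embodiments of the present disclosure; and
FIG. 5 shows a block diagram of an electronic device that may include a processor according to one or more embodiments of the present disclosure.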
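Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments are shown in the drawings, it should be understood that the disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. The drawings and embodiments are for illustration only and are not intended to limit the scope of protection of the disclosure.
In the description of the embodiments, the term "including" and similar terms are to be understood as open-ended, that is, "including but not limited to". The term "based on" is to be understood as "based at least in part on". The terms "one embodiment" and "the embodiment" are to be understood as "at least one embodiment". The term "some embodiments" is to be understood as "at least some embodiments". Other explicit and implicit definitions may also appear below.
It will be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the applicable laws, regulations, and related requirements.
As noted above, with the development of information technology, processors of many kinds are used in a wide variety of scenarios. Different instruction set architectures that processors can adopt have been proposed for these scenarios, and they usually have to remain compatible with many different use cases. However, the use cases these conventional ISAs target do not match those of vector computation such as neural network computation. For vector computations with highly repetitive instructions and large data volumes, a better-suited ISA is therefore needed so that the processor can handle such computations more efficiently.
One conventional approach is to use a standard processor instruction set, such as the reduced instruction set computer (RISC)-V instruction set. Although such general-purpose instruction sets can carry out various vector computations, such as neural network operators, they must remain compatible with many different use cases and therefore can hardly guarantee high execution efficiency. For example, computing neural network operators usually involves a large amount of vector computation, which general-purpose instruction sets handle poorly.
Research has found that conventional ISAs are not suitable for certain heavy vector workloads. For example, a conventional approach may use a digital signal processor (DSP) architecture, such as a single instruction, multiple data (SIMD) architecture, or a vector processor architecture. However, the instruction sets of such DSP architectures are usually not public. As for vector processor architectures, such as the vector instruction set under the RISC-V standard (the RISC-V vector instruction set for short), these instruction sets are usually highly complex and are redundant for vector computations such as neural network operators.
In summary, for vector computations with highly repetitive instructions and large data volumes, an instruction set better suited to vector computation needs to be designed to improve the computational efficiency of the processor.
According to embodiments of the present disclosure, an improved scheme for a processor is proposed. In this scheme, the processor includes an instruction decoder and an arithmetic logic unit. The instruction decoder receives a target instruction for a vector operation. The target instruction is suited to a memory-to-memory (MEM to MEM) processor architecture. For example, the target instruction involves a target opcode, a source operand, and a target operand. The target opcode indicates the vector operation specified by the target instruction, the source operand specifies at least a source storage location in a memory from which data to be processed is read, and the target operand specifies at least a target storage location in the memory to which a processing result is written.
The arithmetic logic unit of the processor is coupled to the instruction decoder and the memory. It is configured to perform the vector operation of the target instruction according to the information the instruction decoder obtains by decoding that instruction. For example, the arithmetic logic unit is configured to: read the data to be processed from the source storage location of the memory; perform, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction; and write the processing result of the data to be processed to the target storage location of the memory.
This scheme simplifies the operation of the processor by adopting a processor suited to a memory-to-memory architecture. The processor can thus complete a large amount of vector computation, for example neural network vector computation, using a simple instruction set. In this way, the scheme uses a simple instruction set to improve the efficiency with which the processor performs vector computation.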
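FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the environment 100, the processor 110 may represent any kind of instruction processing apparatus. For example, the processor 110 may be a general-purpose processor or any other suitable processor. The processor 110 is configured to receive an instruction 140 and perform the operation indicated by the instruction 140, such as a vector operation. For example, the processor 110 may receive the instruction 140 from another device in the environment 100. In some embodiments, the instruction 140 is a SIMD instruction.
The processor 110 includes an instruction decoder 120 and an arithmetic logic unit 130. Alternatively or additionally, the processor 110 may also include a memory (not shown) or be communicatively coupled to a memory. For example, the memory may be a data memory, such as a vector closely-coupled memory (VCCM). The instruction decoder 120, the arithmetic logic unit 130, and the memory are communicatively coupled; that is, they can communicate with one another according to a suitable data transfer protocol and/or standard. In operation, the instruction decoder 120 receives the instruction 140 and decodes it. For example, the instruction decoder 120 may decode the instruction 140 into arithmetic and/or logic operations that the arithmetic logic unit 130 can process. The instruction decoder 120 may be implemented using various mechanisms, for example as a hardware circuit or at least partly by means of a software module.
The arithmetic logic unit 130 is configured to operate on the information obtained when the instruction decoder 120 decodes the instruction 140. The arithmetic logic unit 130 can perform various arithmetic operations, logic operations, and so on. It may be implemented using various mechanisms, for example as a hardware circuit or at least partly by means of a software module.
It should be understood that the structure and function of the environment 100 are described for illustration only and do not imply any limitation on the scope of the present disclosure. For example, the processor 110 may be used in various existing or future computing platforms or computing systems. The processor 110 may be implemented in various embedded applications (for example, data processing systems of mobile network base stations) to provide services such as heavy vector computation. The processor 110 may also be integrated or embedded in various electronic or computing devices to provide various computing services. The application environments and scenarios of the processor 110 are not limited here.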
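In some embodiments, the instruction decoder 120 decodes the received instruction 140. The instruction 140 is sometimes also called the "target instruction" in this document; the two terms are used interchangeably in this context. FIG. 2 shows a schematic diagram of an example instruction 140 according to some embodiments of the present disclosure. As shown in FIG. 2, the instruction 140 includes a target opcode 210, a source operand 220, and a target operand 230. The target opcode 210 is sometimes also called the "opcode"; the two terms are used interchangeably in this context. The target opcode 210 can indicate the vector operation specified by the instruction 140. The source operand 220 specifies at least a source storage location in the memory from which the data to be processed is read. The target operand 230 specifies at least a target storage location in the memory to which the processing result is written.
In some embodiments, the instruction decoder 120 decodes the information indicated by the target opcode 210, the source operand 220, and the target operand 230 so that the arithmetic logic unit 130 can process it. For example, the arithmetic logic unit 130 is configured to read the data to be processed from the source storage location in the memory specified by the source operand 220. The arithmetic logic unit 130 performs, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the instruction 140. The arithmetic logic unit 130 then writes the processing result to the target storage location specified by the target operand 230.
In some embodiments, the instruction 140 may be encoded in binary. In other embodiments, the instruction 140 may be encoded in another form or another base. Unless otherwise stated, the encoding formats and encoded representations of the instruction 140 described below use binary as the example. For example, an instruction 140 in binary form may be defined using the format in Table 1 below.
Table 1: Instruction format definition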

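As shown in Table 1, bits 86 to 95 represent the target opcode 210 of the instruction 140. The operands in bits 22 to 85 represent the source operand 220 of the instruction 140. The parameters in bits 0 to 21 represent the target operand 230 of the instruction 140. Unless otherwise stated, any specific values or bit counts appearing here and elsewhere in this document are exemplary. For example, the bit positions listed above for the opcode and the operands are exemplary rather than limiting; the target opcode 210, the source operand 220, and the target operand 230 of the instruction 140 may be located at other suitable bit positions.
For example, the source operand A_vaddr in bits 70 to 85 represents the address index of the data in channel A (also called the first storage space of the memory) of a memory, for example a data memory such as the VCCM; that is, it is the address index of VCCM[A_vaddr]. The address index is in units of one vector word. A vector word denotes a storage unit of a channel in the memory whose width is the SIMD width; in other words, the address index is in units of one SIMD width. In some embodiments, the depth of the memory is, for example, 1024. In that example, only 10 of bits 70 to 85 need be used to represent the address index of channel A. The memory may of course have another suitable depth, and the address index another suitable number of bits.
As another example, the source operand A_index in bits 60 to 69 represents the element index within a vector word of channel A. Each vector word may have, for example, 64 elements. The element index can indicate a particular element in a vector word of channel A. In some embodiments, if the vector word of channel A is divided into, for example, 64 elements, only 6 of bits 60 to 69 need be used to represent A_index. As a further example, the source operand A_vm in bits 54 to 59 represents the index of a vector mask (VM) register of channel A. In some embodiments, channel A has 16 VM registers. In such an example, only 4 of bits 54 to 59 need be used to represent A_vm.
Similarly, the source operand B_vaddr in bits 38 to 53 represents the address index of channel B (also called the second storage space of the memory) of the memory (for example, the data memory VCCM), that is, the address index of VCCM[B_vaddr]. The address index is in units of one vector word, that is, one SIMD width. The source operand B_index in bits 28 to 37 represents the element index within a vector word of channel B. The source operand B_vm in bits 22 to 27 represents the index of a vector mask register of channel B.
The examples of the target operand 230 in Table 1 include C_vaddr in bits 6 to 21, which can represent the address index of channel C of the data memory VCCM, that is, the address index of VCCM[C_vaddr]. The address index is in units of one vector word, that is, one SIMD width. The examples of the target operand 230 also include C_vm in bits 0 to 5, which represents the index of a vector mask register of channel C.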
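The bit ranges given above for Table 1 add up to a 96-bit instruction word, so the field layout can be sketched as a small pack/unpack helper. This is only an illustration under stated assumptions: bit 0 is taken to be the least significant bit, and the names `FIELDS`, `encode`, and `decode` are hypothetical and do not come from the disclosure.

```python
# Inclusive (low, high) bit ranges as described for Table 1.
# Assumption: bit 0 is the least significant bit of the 96-bit word.
FIELDS = {
    "opcode":  (86, 95),
    "A_vaddr": (70, 85),
    "A_index": (60, 69),
    "A_vm":    (54, 59),
    "B_vaddr": (38, 53),
    "B_index": (28, 37),
    "B_vm":    (22, 27),
    "C_vaddr": (6, 21),
    "C_vm":    (0, 5),
}


def encode(**values):
    """Pack named field values into one instruction word (hypothetical helper)."""
    word = 0
    for name, value in values.items():
        low, high = FIELDS[name]
        width = high - low + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word


def decode(word):
    """Split an instruction word back into its named fields."""
    fields = {}
    for name, (low, high) in FIELDS.items():
        width = high - low + 1
        fields[name] = (word >> low) & ((1 << width) - 1)
    return fields
```

Fields that are not passed to `encode` are left as zero, which mirrors how an instruction with a single source operand would leave the unused operand bits empty in this sketch.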
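FIG. 3 shows a schematic diagram of storage locations corresponding to example source operands according to some embodiments of the present disclosure. In the example of FIG. 3, the storage space of the memory is divided into multiple channels, such as channel 310-1, channel 310-2, ..., channel 310-N, where N is an integer greater than 1. For ease of discussion, channels 310-1 through 310-N are referred to below collectively or individually as channel 310. In some embodiments, the value of N may be preset; for example, N may be set to 1024, 512, or another value. Each channel 310 includes, for example, 1024 bits or another suitable number of bits. The address index 330 (for example, the source operand A_vaddr or B_vaddr, or the target operand C_vaddr) can indicate the address of channel 310-1. The address of a channel 310 may be in units of vector words. The address index 330 may be 16 bits wide. For example, if the address index 330 is "0b0000_0000_0000_0000", it can indicate channel 310-1. As another example, in some embodiments where the depth of the memory or the number of channels is 1024, the address index 330 may instead be 10 bits wide, with the address "0b00_0000_0000" indicating channel 310-1. Note that every encoding in this document that begins with "0b" is a binary representation; this will not be repeated below. The "_" appearing in binary representations is only for readability; it has no meaning and occupies no bit.
In some embodiments, the vector word of each channel 310 may be divided into multiple elements, such as element 320. The element 320 may include, for example, 64 bits. The element index 340 (for example, the source operand A_index or B_index) can indicate the index of a particular element, such as element 320. The element index 340 may be 10 bits wide. For example, if the element index 340 is "0b00_0000_0000", it can indicate element 320. As another example, where the vector word of each channel 310 contains 64 elements, the element index may be 6 bits wide, so that the element index "0b00_0000" can indicate element 320.
It should be understood that, unless otherwise stated, any specific values, bit counts, and binary representations appearing here and elsewhere in this document are exemplary. For example, in other embodiments each channel may have a different number of bits, and each vector word may also use a different number of bits. Accordingly, the address index and the element index may have different bit widths and different encoded representations. The scope of the present disclosure is not limited in this respect.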
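Several examples of the source operand 220 were listed above with reference to Table 1. More examples of the source operand 220 are described below with reference to Table 2.
Table 2: Instruction format definition
As shown in Table 2, the source operand 220 may include A_imm in bits 54 to 85, which represents one immediate in the instruction 140. Similarly, the source operand 220 may also include B_imm in bits 22 to 53, which represents another immediate in the instruction 140. As in Table 1, the target operand 230 in Table 2 may also include C_vaddr and/or C_vm.
It should be understood that the source operands and/or target operands described above in connection with Tables 1 and 2 are merely exemplary and not limiting. The source operand 220 and/or target operand 230 used in the present disclosure may include any one or more of the source operands and/or target operands above. In some embodiments, the source operand 220 and/or target operand 230 may include any other suitable operand type that differs from those above.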
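Table 3 below describes an example encoding of the opcode of the instruction 140. For example, if bit 0 is 0, the instruction 140 is of the variable type. If bit 0 is 1, the instruction 140 is of the immediate type. Bits 1 to 2 hold the sub-function code of the instruction 140. Bits 3 to 7 hold the function code of the instruction 140. Bits 8 to 9 hold the computational precision of the instruction 140. For example, binary "00" may indicate single-precision floating point. Other binary values may indicate other, reserved computational precisions.
Table 3: Example opcode encoding
The opcode encoding shown in Table 3 is of course merely exemplary and not limiting. For example, other encodings may be used for the instruction 140 in other embodiments.
In some embodiments, the vector operation specified by the instruction 140 can be determined from its target opcode 210. For example, the processor 110 may store the opcodes of the individual instructions in advance, and the instruction decoder 120 may determine the vector operation specified by a received instruction 140 from the target opcode 210 of that instruction. For example, if the target opcode 210 of the instruction 140 is encoded as "0b00_00110_01_0", the instruction decoder 120 may determine that the instruction 140 is a v2indexr instruction. The opcodes and instruction types listed above are merely exemplary and not limiting; an instruction encoded as "0b00_00110_01_0" may also specify another vector operation.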
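The opcode layout described for Table 3 can be sketched as a field splitter plus a lookup table. The opcode values below are the ones quoted in this description; the names `split_opcode` and `MNEMONICS` are hypothetical, and a real decoder would of course be hardware rather than a Python dictionary.

```python
def split_opcode(opcode):
    """Split a 10-bit opcode into the fields described for Table 3."""
    return {
        "is_immediate": opcode & 0b1,          # bit 0: 0 = variable, 1 = immediate
        "subfunction": (opcode >> 1) & 0b11,   # bits 1-2: sub-function code
        "function": (opcode >> 3) & 0b11111,   # bits 3-7: function code
        "precision": (opcode >> 8) & 0b11,     # bits 8-9: 0b00 = single precision
    }


# Opcode values as quoted in the text; the mapping itself is only illustrative.
MNEMONICS = {
    0b00_00110_00_0: "v2indexl",
    0b00_00110_01_0: "v2indexr",
    0b00_00110_00_1: "v2indexli",
    0b00_00110_01_1: "v2indexri",
    0b00_00110_10_0: "sindex2v",
    0b00_00110_10_1: "s2v",
    0b00_00111_11_0: "vtranspose",
    0b00_01000_01_0: "vexp",
    0b00_01100_01_0: "vm2index",
    0b00_10000_01_0: "vindex2vm",
}
```

In this encoding the immediate variant of an instruction differs from the variable variant only in bit 0, which is why v2indexl and v2indexli share the same function and sub-function codes.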
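Several examples of the instruction 140 and of how the processor 110 may execute it are described below. In some embodiments, the source operand 220 may include two source operands, such as A_vaddr and B_vaddr, or A_vaddr and B_imm. The width of each source operand may be the SIMD width. Alternatively or additionally, in some embodiments the source operand 220 may include only one source operand, such as B_vaddr. The target operand 230, for example C_vaddr, can specify the target storage location at which the processing result is written back to the memory, that is, VCCM[C_vaddr].
In some embodiments, the target storage location of the instruction 140 holds a processing result vector. The target operand 230 further indicates a target VM register, for example C_vm or vm3. The value at each position of the target VM register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector. For example, if vm3[i] of the target register is 1, the i-th element of the processing result vector word is write-enabled and the corresponding processing result can be written to it. Conversely, if vm3[i] is 0, the corresponding processing result cannot be written to the i-th element of the processing result vector word.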
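The write-enable rule above amounts to a per-lane select between the old contents of the destination and the new result. A minimal sketch, assuming vectors and masks are modelled as Python lists of equal length (the function name `masked_write` is hypothetical):

```python
def masked_write(dest, results, vm):
    """Return the destination vector word after a masked write.

    Lane i takes results[i] when vm[i] == 1 and keeps dest[i] otherwise.
    """
    return [new if enable == 1 else old
            for old, new, enable in zip(dest, results, vm)]
```

For example, `masked_write([9, 9, 9, 9], [1, 2, 3, 4], [1, 0, 1, 0])` leaves lanes 1 and 3 untouched.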
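Table 4 describes several example instructions that the processor 110 may support. The instructions in Table 4 can be described with reference to the instruction definitions of Table 1 or Table 2 and encoded with the example encoding of Table 3. In the examples of Table 4, the target operand 230 always includes C_vaddr (that is, &v3) and C_vm (that is, vm3). "Reserved" in Table 4 denotes one or more reserved bits, which may be encoded or used later.
Table 4: Example instructions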

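In one embodiment, the instruction 140 includes a first index determination instruction (for example, v2indexl or v2indexr in Table 4). In this example, the source operand 220 specifies a location of the first storage space of the memory, that is, the address index of channel A (A_vaddr is &v1). The source operand 220 further specifies a given index value of the data to be processed in the second storage space of the memory, that is, an element index within a vector word of channel B (B_vaddr is &v2 and B_index is index2). In this example, the arithmetic logic unit 130 is configured to determine a first index. The first index indicates the storage location, in the first storage space, of the value at the position in the data to be processed indicated by the given index value.
For example, the opcode of the instruction v2indexl may be encoded as "0b00_00110_00_0". The instruction v2indexl v1,v2,index2,v3,vm3 assigns indext to v3[i], where indext is the index of the first element, searching from left to right, for which v1[indext] equals v2[index2]. If no element satisfies this condition, indext is set to "-1" in two's complement. In some embodiments, the target operand 230 further indicates a target vector mask register. The value at each position of the target vector mask register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector. For example, if vm3[i] equals 1, v3[i] is write-enabled.
As another example, the opcode of the instruction v2indexr may be encoded as "0b00_00110_01_0". The instruction v2indexr v1,v2,index2,v3,vm3 assigns indext to v3[i], where indext is the index of the first element, searching from right to left, for which v1[indext] equals v2[index2]. If no element satisfies this condition, indext is set to "-1" in two's complement. In this example, if vm3[i] equals 1, v3[i] is write-enabled.
In another embodiment, the instruction 140 includes a second index determination instruction (for example, v2indexli or v2indexri in Table 4). In this example, the source operand 220 specifies a location of the first storage space of the memory, that is, the address index of channel A (A_vaddr is &v1). The source operand 220 further specifies a first immediate, that is, the immediate imm2. In this example, the arithmetic logic unit 130 is configured to determine a second index. The second index indicates the storage location of the first immediate in the first storage space.
For example, the opcode of the instruction v2indexli is encoded as "0b00_00110_00_1". The instruction v2indexli v1,imm2,v3,vm3 assigns indext to v3[i], where indext is the index of the first element, searching from left to right, for which v1[indext] equals imm2. If no element satisfies this condition, indext is set to "-1" in two's complement. In this example, if vm3[i] equals 1, v3[i] is write-enabled.
As another example, the opcode of the instruction v2indexri may be encoded as "0b00_00110_01_1". The instruction v2indexri v1,imm2,v3,vm3 assigns indext to v3[i], where indext is the index of the first element, searching from right to left, for which v1[indext] equals imm2. If no element satisfies this condition, indext is set to "-1" in two's complement. In this example, if vm3[i] equals 1, v3[i] is write-enabled.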
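The four index determination instructions above differ only in search direction and in where the searched-for value comes from (v2[index2] or imm2). A minimal behavioural sketch follows; it assumes that "left" means the lowest element index, models vectors as Python lists, and uses the hypothetical names `first_index` and `v2index`.

```python
def first_index(vec, needle, from_left=True):
    """Index of the first element equal to needle in the given direction, or -1."""
    order = range(len(vec)) if from_left else range(len(vec) - 1, -1, -1)
    for i in order:
        if vec[i] == needle:
            return i
    return -1


def v2index(v1, needle, v3, vm3, from_left=True):
    """Write the found index into every write-enabled lane of v3.

    For v2indexl / v2indexr the caller passes needle = v2[index2];
    for v2indexli / v2indexri the caller passes needle = imm2.
    """
    idx = first_index(v1, needle, from_left)
    return [idx if enable == 1 else old for old, enable in zip(v3, vm3)]
```

With `v1 = [5, 7, 5, 3]`, searching for 5 gives index 0 from the left and index 2 from the right, and a value that does not occur yields -1 in every enabled lane.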
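As another example, the instruction 140 may include a first value determination instruction, such as the instruction sindex2v in Table 4. The source operand 220 specifies a location of the first storage space of the memory, that is, the address index of channel A (A_vaddr is &v1). The source operand 220 further specifies a given index value of the data to be processed in the second storage space of the memory, that is, an element index within a vector word of channel B (B_vaddr is &v2 and B_index is index2).
In this example, the arithmetic logic unit 130 is configured to determine the given value of the data to be processed at the position indicated by the given index value, and to determine the first value in the first storage space at the position indexed by that given value. For example, the instruction sindex2v v1,v2,index2,v3,vm3 has an opcode encoded as "0b00_00110_10_0". The instruction assigns v1[v2[index2]] to v3[i]. If vm3[i] equals 1, v3[i] is write-enabled.
In some embodiments, the instruction 140 includes a second value determination instruction. In this example, the source operand 220 specifies a given index value of the data to be processed in the second storage space of the memory, that is, B_vaddr is &v2 and B_index is index2. The arithmetic logic unit 130 is configured to determine the second value in the data to be processed at the position indicated by the given index value. For example, the instruction s2v v2,index2,v3,vm3 has an opcode encoded as "0b00_00110_10_1". The instruction assigns v2[index2] to v3[i]. If vm3[i] equals 1, v3[i] is write-enabled.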
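The two value determination instructions can be read as a scalar broadcast (s2v) and an indirect lookup followed by a broadcast (sindex2v). The sketch below models that reading with Python lists; the function bodies are an interpretation of the assignments v3[i] = v2[index2] and v3[i] = v1[v2[index2]] quoted above, not the disclosed hardware behaviour.

```python
def s2v(v2, index2, v3, vm3):
    """Broadcast the scalar v2[index2] into every write-enabled lane of v3."""
    value = v2[index2]
    return [value if enable == 1 else old for old, enable in zip(v3, vm3)]


def sindex2v(v1, v2, index2, v3, vm3):
    """Use v2[index2] as an index into v1 and broadcast v1[v2[index2]]."""
    value = v1[v2[index2]]
    return [value if enable == 1 else old for old, enable in zip(v3, vm3)]
```

For instance, with `v1 = [10, 20, 30, 40]` and `v2 = [3, 1, 0, 2]`, `sindex2v(v1, v2, 1, ...)` looks up `v1[1]` because `v2[1]` is 1.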
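By using one or more of the first index determination instruction, second index determination instruction, first value determination instruction, and second value determination instruction described above, the processor 110 can better handle operators that compute coordinates, such as the index-of-maximum (ArgMax) operator, the index-of-minimum (ArgMin) operator, or the operator that finds the K highest-ranked values (TopK). Take ArgMax as an example: it finds the index for which the value v[index] in the vector v is the maximum. A 64-element ArgMax needs the following instructions. First, v2smax v1,vm1,v2,vm2 (this instruction is described in Tables 5 and 6 below) finds the largest element value in v1 and writes it to v2, where all bits stored in vm1 and vm2 are 1. Next, v2indexl v1,v2,0,v3,vm3 finds the index such that v1[index] equals v2[0] and writes the value of index to v3, where all bits stored in vm3 are 1.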
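The two-instruction ArgMax sequence can be traced with a small behavioural model. The exact semantics of v2smax are defined in tables that are not reproduced here, so the sketch assumes that it reduces the read-enabled lanes of v1 to their maximum and broadcasts that value to the write-enabled lanes of v2; that assumption, and the helper name `argmax`, are illustrative only.

```python
def v2smax(v1, vm1, v2, vm2):
    """Assumed behaviour: max over read-enabled lanes, broadcast to enabled lanes of v2."""
    maximum = max(x for x, enable in zip(v1, vm1) if enable == 1)
    return [maximum if enable == 1 else old for old, enable in zip(v2, vm2)]


def v2indexl(v1, v2, index2, v3, vm3):
    """First index from the left where v1 equals v2[index2], or -1 if absent."""
    needle = v2[index2]
    idx = next((i for i, x in enumerate(v1) if x == needle), -1)
    return [idx if enable == 1 else old for old, enable in zip(v3, vm3)]


def argmax(v1):
    """ArgMax composed from the two instructions, with all mask bits set to 1."""
    lanes = len(v1)
    ones = [1] * lanes
    v2 = v2smax(v1, ones, [0] * lanes, ones)      # step 1: maximum value into v2
    v3 = v2indexl(v1, v2, 0, [0] * lanes, ones)   # step 2: its leftmost index into v3
    return v3[0]
```

Because v2indexl searches from the left, ties resolve to the lowest index; swapping in a right-to-left search would return the highest index instead.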
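In some embodiments, the target instruction includes a vector transpose instruction, such as the instruction vtranspose or vstranspose. The source operand 220 specifies a first location in the first storage space of the memory, that is, A_vaddr is &v1 and A_index is index1. The source operand 220 further specifies the source vector mask register vm1 and, optionally, vm2. In this example, the arithmetic logic unit 130 is configured to vector-transpose the data to be processed at the first location in the first storage space to obtain transposed data to be processed.
For example, the vector transpose instruction vtranspose v1,index1,vm1,vm2,v3,vm3, whose opcode is encoded as 0b00_00111_11_0, transposes a vector (or matrix) of, for example, 32*32. In this example, the values of vm1 and vm2 are read lane enables, and the value of vm3 is the write lane enable. The number R of consecutive 1 bits in vm1 gives the number of rows of the matrix, and the number C of consecutive 1 bits in vm2 gives the number of columns, where R and C are arbitrary natural numbers and may be equal or different. The valid bits of vm1, vm2, and vm3 must be contiguous; otherwise, the first 1 at the lowest bit position prevails. The vtranspose instruction can thus be used to transpose an R*C matrix.
As another example, in some embodiments the vector transpose instruction vstranspose v1,index1,vm1,v3,vm3 may be used to transpose a square matrix. In this example, the value of vm1 is the read lane enable and the value of vm3 is the write lane enable. The number R of consecutive 1 bits in vm1 gives the number of rows (or columns) of the square matrix. The vstranspose instruction can be used to transpose an R*R square matrix.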
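How the mask registers determine the matrix shape can be illustrated as follows. The sketch makes several assumptions that the text does not settle: masks are modelled as integers, the run of 1 bits is counted starting at the lowest set bit, the matrix is stored row-major in a flat list, and the index1 operand and the vm3 write mask are left out for brevity. The names `run_length` and `vtranspose` are hypothetical.

```python
def run_length(vm):
    """Length of the run of consecutive 1 bits that starts at the lowest set bit."""
    if vm == 0:
        return 0
    while (vm & 1) == 0:      # skip trailing zeros to reach the lowest 1
        vm >>= 1
    count = 0
    while (vm & 1) == 1:      # count the contiguous ones
        count += 1
        vm >>= 1
    return count


def vtranspose(data, vm1, vm2):
    """Transpose an R x C row-major matrix, with R and C taken from vm1 and vm2."""
    rows, cols = run_length(vm1), run_length(vm2)
    if len(data) < rows * cols:
        raise ValueError("not enough elements for the requested shape")
    return [data[r * cols + c] for c in range(cols) for r in range(rows)]
```

With `vm1 = 0b11` and `vm2 = 0b111` the input `[1, 2, 3, 4, 5, 6]` is treated as a 2 x 3 matrix and comes back as the 3 x 2 matrix `[1, 4, 2, 5, 3, 6]`. The square variant vstranspose would correspond to passing the same mask for both dimensions in this model.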
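The vector transpose instruction is not a standard RISC-type instruction. In a conventional standard RISC-type instruction set, transposing a vector can only be completed by issuing multiple consecutive instructions. By providing a vector transpose instruction, this scheme can improve the computing capability for some networks. For example, neural network training usually involves a large number of transpose operations on matrices or square matrices, and the vector transpose instruction of this scheme can improve the computational efficiency of the training process.
In some embodiments, the target instruction includes an exponent instruction, such as vexp. In this example, the source operand 220 specifies the source storage location, that is, A_vaddr is &v1. The arithmetic logic unit 130 is configured to determine an exponent value whose base is a predetermined value (for example, the natural base e) and whose power is the data to be processed at the source storage location. For example, the instruction vexp v1,v3,vm3 has an opcode encoded as "0b00_01000_01_0" and assigns exp(v1[i]) to v3[i]. If vm3[i] equals 1, v3[i] is write-enabled.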
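The exponent instruction above is suitable for operators such as the sigmoid operator and the hyperbolic functions sinh, cosh, and tanh. For example, the sigmoid, sinh, cosh, and tanh operators can be expressed by equations (1) to (4) below.



In equations (1) to (4), x denotes the data to be processed.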
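The equation images did not survive in this copy of the text. The standard definitions of the four named functions, which equations (1) to (4) presumably state (the exact form used in the original figures is an assumption), are:

```latex
\begin{align}
\operatorname{sigmoid}(x) &= \frac{1}{1 + e^{-x}} \tag{1}\\
\sinh(x) &= \frac{e^{x} - e^{-x}}{2} \tag{2}\\
\cosh(x) &= \frac{e^{x} + e^{-x}}{2} \tag{3}\\
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{4}
\end{align}
```

Each right-hand side uses only exponentials of x and -x plus element-wise addition, subtraction, and division, which is why a single vector exponent instruction covers all four operators.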
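For example, sigmoid and the hyperbolic functions are common among neural network activation functions. Using the exponent instruction of this scheme can improve the efficiency of such computations.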
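As a rough illustration of how such activations reduce to the exponent instruction, the sketch below models vexp as a masked element-wise `math.exp` and builds sigmoid and tanh on top of it from their standard definitions. The helper names `sigmoid` and `tanh_from_exp` are hypothetical, and the element-wise additions and divisions stand in for the ordinary arithmetic instructions of the instruction set.

```python
import math


def vexp(v1, v3, vm3):
    """Model of vexp: v3[i] = exp(v1[i]) in every write-enabled lane."""
    return [math.exp(x) if enable == 1 else old
            for x, old, enable in zip(v1, v3, vm3)]


def sigmoid(v):
    """sigmoid(x) = 1 / (1 + exp(-x)), built on the vexp model."""
    ones = [1] * len(v)
    e_neg = vexp([-x for x in v], [0.0] * len(v), ones)
    return [1.0 / (1.0 + t) for t in e_neg]


def tanh_from_exp(v):
    """tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), built on the vexp model."""
    ones = [1] * len(v)
    e_pos = vexp(v, [0.0] * len(v), ones)
    e_neg = vexp([-x for x in v], [0.0] * len(v), ones)
    return [(p - n) / (p + n) for p, n in zip(e_pos, e_neg)]
```

This direct formula is fine for moderate inputs; for large magnitudes `math.exp` overflows, so a production kernel would need a numerically safer formulation.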
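In some embodiments, the instruction 140 includes a VM register instruction, such as the vm2index instruction. The source operand 220 indicates a source VM register in the memory, namely vm1. The arithmetic logic unit is configured to store the indices of the enabled positions in the source VM register to the target storage location. For example, the instruction vm2index vm1,v3,vm3 has an opcode encoded as "0b00_01100_01_0" and means v3[i]=vm1[i]?i:-1. That is, if the value of vm1[i] is 1, i is assigned to v3[i]; conversely, if the value of vm1[i] is 0, -1 (for example, "-1" in two's complement) is assigned to v3[i]. If vm3[i] equals 1, v3[i] is write-enabled.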
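The expression v3[i]=vm1[i]?i:-1 combined with the vm3 write enable can be modelled directly. In this sketch masks and vectors are Python lists of equal length, and the function name simply reuses the mnemonic:

```python
def vm2index(vm1, v3, vm3):
    """v3[i] = i if vm1[i] == 1 else -1, written only where vm3[i] == 1."""
    return [
        (i if source_bit == 1 else -1) if enable == 1 else old
        for i, (source_bit, old, enable) in enumerate(zip(vm1, v3, vm3))
    ]
```

For example, `vm2index([1, 0, 1, 1], [9, 9, 9, 9], [1, 1, 1, 0])` gives `[0, -1, 2, 9]`: lane 3 keeps its old value because it is not write-enabled.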
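In some embodiments, the instruction 140 includes a one-hot code conversion instruction, such as vindex2vm. This is a VM register operation instruction. In this example, the source operand 220 specifies a given index value of the data to be processed in the second storage space of the memory, that is, B_vaddr is &v2 and B_index is index2. The way this instruction reads a vector mask register is not as a write enable for a memory write (all other instructions that write to memory need to read a vector mask register as the write enable). The target operand 230 specifies the target VM register, namely vm3. The arithmetic logic unit is configured to convert the value of the data to be processed at the given index value into a one-hot code and to store that one-hot code in the target VM register. For example, the instruction vindex2vm v2,index2,vm3 has an opcode encoded as "0b00_10000_01_0". The instruction assigns onehot(v2[index2]) to vm3, where onehot() denotes the one-hot code conversion function.
The one-hot code conversion instruction described above is suitable for index-type instructions and can support the one-hot operator, which converts a number into one-hot encoded form. For example, in practice this can be done with the following two instructions: vindex2vm v1,0,vm1 and vmload vm1,v2,vm2. The first instruction converts the value of v1[0] into one-hot encoded form and writes it to vm1. The second instruction (vmload is described in Tables 7 and 8 below) stores the value in vm1 to v2, where all bits stored in vm2 are 1.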
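The two-instruction one-hot sequence can be sketched as follows. The lane count is passed explicitly because the mask width is not fixed by the text, the behaviour on an out-of-range value is an assumption (the sketch raises an error), and vmload is modelled only from the one-sentence description above.

```python
def vindex2vm(v2, index2, lanes):
    """vm3 = onehot(v2[index2]): a mask with a single 1 at position v2[index2]."""
    value = v2[index2]
    if not 0 <= value < lanes:
        # Out-of-range handling is not specified in the text; raising is an assumption.
        raise ValueError(f"value {value} does not fit a {lanes}-lane mask")
    return [1 if i == value else 0 for i in range(lanes)]


def vmload(vm1, v2, vm2):
    """Copy the bits of vm1 into the write-enabled lanes of the vector v2."""
    return [bit if enable == 1 else old for bit, old, enable in zip(vm1, v2, vm2)]
```

Running `vindex2vm([2, 0, 1], 0, 4)` gives the mask `[0, 0, 1, 0]`, and `vmload` with an all-ones write mask then materialises that mask as an ordinary vector in memory.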
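Examples of the various instructions 140 supported by the processor 110 of the present disclosure were described above in connection with Table 4. It should be understood that the processor 110 may support further instructions. Table 5 below shows more examples of conventional instructions 140 supported by the processor 110. The instructions in Table 5 can be described with reference to the instruction definitions of Table 1 or Table 2, and their opcodes are encoded using the example encoding of Table 3.
Table 5: Example conventional instructions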



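The functions and definitions of the instructions in Table 5 are shown in Table 6. The functions of these instructions include various kinds of addition, subtraction, multiplication, and division, finding the maximum, finding the minimum, taking the reciprocal (immediates only), shifting an operand by a given value, and so on. These instructions provide the basic computations of the processor 110 and are not described in detail here.
Table 6: Functions of the example conventional instructions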




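The MAX() and MIN() functions in Table 6 denote the maximum and minimum functions, respectively. DW denotes the width of a vector word, LANE_NUM denotes the number of elements in a vector word, the mod() function denotes the remainder function, the ceil() and floor() functions denote rounding up and rounding down, respectively, and the SUM() function denotes the summation function.
In some embodiments, the instructions supported by the processor 110 also include various vector mask register access and operation instructions. Table 7 shows several examples of such instructions, and Table 8 shows the functions and definitions of the instructions in Table 7. The functions of these instructions include reading, writing, and operating on vector mask registers. The way these instructions read a vector mask register is not as a write enable for a memory write (all other instructions that write to memory need to read a vector mask register as the write enable). These instructions are not described in detail here.
Table 7: Example vector mask register instruction encodings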

Table 8: Functions of the example vector mask register instructions

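In some embodiments, the instructions 140 supported by the processor 110 also include internal register access and operation instructions. Instructions of this kind handle access to internal registers and some special operations. Examples include writing an internal control and status register (CSR) with a fixed value or with a piece of SIMD-length data from the data memory VCCM; reading an internal CSR out to the data memory VCCM; and a no-operation instruction (that is, performing no operation and waiting one cycle).
Table 9 below shows several examples of internal register access and operation instructions, and Table 10 shows the functions of the instructions in Table 9. These instructions are not described in detail in this document. Note that for the vwcsr instruction in Table 9, the source operand is in channel A, whereas for the vwcsri instruction, the immediate is in channel B.
Table 9: Example internal register instructions
Table 10: Functions of the example internal register instructions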

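The instructions supported by the processor 110 were described above in connection with Tables 4 to 10. These instructions can be decoded by the instruction decoder 120 of the processor 110 and executed by the arithmetic logic unit 130. Together they can form the instruction set supported by the processor 110. It should be understood that, in some embodiments, the instruction set may be built from only some or all of the instructions above. Alternatively or additionally, other suitable instructions not described above may be used to build the instruction set supported by the processor 110.
It should also be understood that although the instructions in Tables 4 to 10 were listed with reference to the example instruction definitions and example opcode encodings of Tables 1 to 3, this is merely exemplary and not limiting. The instruction set supported by the processor of the present disclosure may be defined and encoded in any suitable way. For example, the bits of each instruction may have meanings different from those given in Table 1 or Table 2. As another example, the encoded opcode of each instruction may have a different number of bits than in Table 3, and the individual bits may have meanings different from those in Table 3. The encoded opcodes of the instructions in Tables 4 to 10 may be changed or swapped. The instructions may also be given other names. The scope of the present disclosure is not limited in this respect.
The instructions described above include neither branch-type instructions nor load/store-type instructions. Unlike a conventional SIMD processor built around, for example, a vector register file, the processor adopted in the present disclosure uses a memory-to-memory SIMD architecture. The instruction set above defines multiple (for example, 64, or more or fewer) vector mask registers, which indicate the specific vector that each SIMD instruction needs to process.
This scheme simplifies the operation of the processor 110 by adopting a SIMD processor suited to a memory-to-memory architecture. The processor 110 can thus complete a large amount of vector computation, for example the vector computation of neural network operators, using a simple instruction set. In this way, the scheme uses a simple instruction set to improve the efficiency with which the processor performs vector computation. For computations such as neural network training and/or inference, the scheme of the present disclosure can greatly improve computational efficiency. For example, a processor according to embodiments of the present disclosure can support various index determination instructions, improving the efficiency of vector computations such as finding coordinates. As another example, the processor can handle instructions such as the vector transpose instruction, improving the efficiency of the corresponding computations during neural network training. As a further example, the processor can support the exponent instruction, so that the computational efficiency of operators such as sigmoid and the hyperbolic functions is improved.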
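FIG. 4 shows a flowchart of a process 400 for data processing according to some embodiments of the present disclosure. The process 400 may be implemented at the processor 110. For ease of discussion, the process 400 is described with reference to the environment 100 of FIG. 1.
At block 410, the processor 110 decodes a target instruction for a vector operation, such as the instruction 140. For example, the instruction 140 may be decoded by the instruction decoder 120 of the processor 110. The instruction 140 involves a target opcode 210, a source operand 220, and a target operand 230. The target opcode 210 indicates the vector operation specified by the instruction 140. The source operand 220 specifies at least a source storage location in the memory from which the data to be processed is read. The target operand 230 specifies at least a target storage location in the memory to which the processing result is written.
At block 420, the processor 110 reads the data to be processed from the source storage location of the memory. For example, the data may be read by the arithmetic logic unit 130 of the processor 110. At block 430, the processor 110 performs, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction. For example, the operation may be performed by the arithmetic logic unit 130 of the processor 110. At block 440, the processor 110 writes the processing result of the data to be processed to the target storage location of the memory. For example, the result may be written by the arithmetic logic unit 130 of the processor 110.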
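The four blocks of process 400 can be summarised in a toy memory-to-memory model. Everything here is illustrative: the memory is a dictionary of vector words keyed by address index, the "decoded" instruction is a plain dictionary with hypothetical keys `opcode`, `src`, and `dst`, and the operation table is supplied by the caller.

```python
def run_instruction(memory, instruction, operations):
    """Execute one memory-to-memory vector instruction on a toy memory model."""
    operation = operations[instruction["opcode"]]   # block 410: decode the opcode
    data = memory[instruction["src"]]               # block 420: read the source vector word
    result = operation(data)                        # block 430: perform the vector operation
    memory[instruction["dst"]] = result             # block 440: write the result back
    return memory


# Usage: a single hypothetical "double" operation applied from address 0 to address 1.
operations = {"double": lambda vector: [2 * x for x in vector]}
memory = {0: [1, 2, 3], 1: [0, 0, 0]}
run_instruction(memory, {"opcode": "double", "src": 0, "dst": 1}, operations)
```

Note that the source and destination are both memory addresses and no vector register file appears anywhere, which is the point of the memory-to-memory design described in this document.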
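In some embodiments, the instruction 140 includes an index determination instruction. The index determination instruction may be a first index determination instruction (v2indexl or v2indexr) or a second index determination instruction (v2indexli or v2indexri). The source operand 220 specifies a location of the first storage space of the memory, and further specifies either a given index value of the data to be processed in the second storage space of the memory or a first immediate. At block 430, the arithmetic logic operation performed by the processor 110 includes determining a first index or determining a second index. The first index indicates the storage location, in the first storage space, of the value at the position in the data to be processed indicated by the given index value. The second index indicates the storage location of the first immediate in the first storage space.
In some embodiments, the instruction 140 includes a first value determination instruction (for example, the instruction sindex2v). The source operand 220 specifies a location of the first storage space of the memory and further specifies a given index value of the data to be processed in the second storage space of the memory. At block 430, the arithmetic logic operation performed by the processor 110 includes: determining the given value of the data to be processed at the position indicated by the given index value; and determining the first value in the first storage space at the position indexed by that given value.
In some embodiments, the instruction 140 includes a second value determination instruction, for example the instruction s2v. The source operand 220 specifies a given index value of the data to be processed in the second storage space of the memory. At block 430, the arithmetic logic operation performed by the processor 110 includes determining the second value in the data to be processed at the position indicated by the given index value.
In some embodiments, the instruction 140 includes a vector transpose instruction, for example the instruction vtranspose or vstranspose. The source operand 220 specifies a first location in the first storage space of the memory. At block 430, the arithmetic logic operation performed by the processor 110 includes vector-transposing the data to be processed at the first location in the first storage space to obtain transposed data to be processed.
In some embodiments, the instruction 140 includes an exponent instruction, for example the instruction vexp. The source operand 220 specifies the source storage location. At block 430, the arithmetic logic operation performed by the processor 110 includes determining an exponent value whose base is a predetermined value and whose power is the data to be processed at the source storage location.
In some embodiments, the instruction 140 includes a VM register instruction, for example vm2index. The source operand 220 of the instruction 140 indicates a source VM register in the memory. At block 430, the arithmetic logic operation performed by the processor 110 includes storing the indices of the enabled positions in the source VM register to the target storage location.
In some embodiments, the target storage location of each instruction 140 described above holds a processing result vector. The target operand 230 further indicates a target VM register. The value at each position of the target VM register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector. For example, if vm3[i] of the target register is 1, the i-th element of the processing result vector word is write-enabled and the corresponding processing result can be written to it. Conversely, if vm3[i] is 0, the corresponding processing result cannot be written to the i-th element of the processing result vector word.
In some embodiments, the instruction 140 includes a one-hot code conversion instruction, for example the instruction vindex2vm. The source operand 220 specifies a given index value of the data to be processed in the second storage space of the memory. The target operand 230 specifies a target VM register. At block 430, the processor 110 converts the value of the data to be processed at the given index value into a one-hot code. The processor 110 is further configured to store the one-hot code in the target VM register.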
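FIG. 5 shows a block diagram of an electronic device 500 that may include the processor 110 according to one or more embodiments of the present disclosure. The electronic device 500 shown in FIG. 5 is merely exemplary and should not limit the functionality or scope of the embodiments described herein.
As shown in FIG. 5, the electronic device 500 takes the form of a general-purpose electronic or computing device. Its components may include, but are not limited to, one or more processors 110, a memory 520, a storage device 530, one or more communication units 540, one or more input devices 550, and one or more output devices 560. In some embodiments, the processor 110 can perform various processing according to programs stored in the memory 520. The processor 110 may be a multi-core processor that executes computer-executable instructions in parallel to increase the parallel processing capability of the electronic device 500.
The electronic device 500 typically includes multiple computer storage media. Such media can be any available media accessible to the electronic device 500, including but not limited to volatile and non-volatile media and removable and non-removable media. The memory 520 may be volatile memory (for example, registers, cache, random access memory (RAM)), non-volatile memory (for example, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 530 may be a removable or non-removable medium and may include a machine-readable medium, such as a flash drive, a magnetic disk, or any other medium, that can store information and/or data (for example, training data for training) and can be accessed within the electronic device 500.
The electronic device 500 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in FIG. 5, a magnetic disk drive for reading from or writing to a removable, non-volatile magnetic disk (for example, a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 520 may include a computer program product 525 having one or more program modules configured to perform the various methods or acts of the embodiments of the present disclosure. For example, these program modules may be configured to implement functions or acts of the processor 110, such as the functions of the instruction decoder 120 and the arithmetic logic unit 130.
The communication unit 540 communicates with other electronic or computing devices through a communication medium. Additionally, the functions of the components of the electronic device 500 may be implemented in a single computing cluster or in multiple computing machines that can communicate over communication connections. The electronic device 500 can therefore operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or other network nodes.
The input device 550 may be one or more input devices, such as a mouse, keyboard, or trackball. The output device 560 may be one or more output devices, such as a display, speaker, or printer. As needed, the electronic device 500 may also communicate through the communication unit 540 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with the electronic device 500, or with any device (for example, a network card or modem) that enables the electronic device 500 to communicate with one or more other electronic or computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).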
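According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided on which computer-executable instructions are stored, the computer-executable instructions being executed by a processor to implement the method described above. According to an exemplary implementation of the present disclosure, a computer program product is also provided; it is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions that are executed by a processor to implement the method described above.
Aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, devices, and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in them, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. The computer-readable program instructions may also be stored in a computer-readable storage medium; they cause a computer, a programmable data processing apparatus, and/or other devices to operate in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may be loaded onto a computer, another programmable data processing apparatus, or another device so that a series of operational steps is performed on it to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.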
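According to one or more embodiments of the present disclosure, Example 1 describes a processor that includes an instruction decoder configured to decode a target instruction for a vector operation. The target instruction involves a target opcode, a source operand, and a target operand. The target opcode indicates the vector operation specified by the target instruction. The source operand specifies at least a source storage location in a memory from which data to be processed is read. The target operand specifies at least a target storage location in the memory to which a processing result is written. The processor further includes an arithmetic logic unit coupled to the instruction decoder and the memory. The arithmetic logic unit is configured to: read the data to be processed from the source storage location of the memory; perform, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction; and write the processing result of the data to be processed to the target storage location of the memory.
According to one or more embodiments of the present disclosure, Example 2 includes the processor described in Example 1, wherein the target instruction includes a first index determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies a given index value of the data to be processed in a second storage space of the memory. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: determine a first index, the first index indicating the storage location, in the first storage space, of the value at the position in the data to be processed indicated by the given index value.
According to one or more embodiments of the present disclosure, Example 3 includes the processor described in Example 1, wherein the target instruction includes a second index determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies a first immediate. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: determine a second index, the second index indicating the storage location of the first immediate in the first storage space.
According to one or more embodiments of the present disclosure, Example 4 includes the processor described in Example 1, wherein the target instruction includes a first value determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies a given index value of the data to be processed in a second storage space of the memory. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: determine a given value of the data to be processed at the position indicated by the given index value; and determine a first value in the first storage space at the position indexed by the given value.
According to one or more embodiments of the present disclosure, Example 5 includes the processor described in Example 1, wherein the target instruction includes a second value determination instruction and the source operand specifies a given index value of the data to be processed in a second storage space of the memory. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: determine a second value in the data to be processed at the position indicated by the given index value.
According to one or more embodiments of the present disclosure, Example 6 includes the processor described in Example 1, wherein the target instruction includes a vector transpose instruction and the source operand specifies at least a first location in a first storage space of the memory. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: vector-transpose the data to be processed at the first location in the first storage space to obtain transposed data to be processed.
According to one or more embodiments of the present disclosure, Example 7 includes the processor described in Example 1, wherein the target instruction includes an exponent instruction and the source operand specifies the source storage location. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: determine an exponent value whose base is a predetermined value and whose power is the data to be processed at the source storage location.
According to one or more embodiments of the present disclosure, Example 8 includes the processor described in Example 1, wherein the target instruction includes a vector mask (VM) register instruction and the source operand indicates a source VM register in the memory. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: store the indices of the enabled positions in the source VM register to the target storage location.
According to one or more embodiments of the present disclosure, Example 9 includes the processor described in any one of Examples 2 to 8, wherein the target storage location holds a processing result vector, the target operand further indicates a target vector mask (VM) register, and the value at each position of the target VM register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector.
According to one or more embodiments of the present disclosure, Example 10 includes the processor described in Example 1, wherein the target instruction includes a one-hot code conversion instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, and the target operand specifies a target vector mask (VM) register. The arithmetic logic unit is configured as follows to perform the arithmetic logic operation associated with the vector operation specified by the target instruction: convert the value of the data to be processed at the given index value into a one-hot code; and store the one-hot code in the target VM register.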
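According to one or more embodiments of the present disclosure, Example 11 describes a method of data processing. The method includes decoding a target instruction for a vector operation, the target instruction involving a target opcode, a source operand, and a target operand. The target opcode indicates the vector operation specified by the target instruction. The source operand specifies at least a source storage location in a memory from which data to be processed is read. The target operand specifies at least a target storage location in the memory to which a processing result is written. The method further includes: reading the data to be processed from the source storage location of the memory; performing, on the data to be processed, the arithmetic logic operation associated with the vector operation specified by the target instruction; and writing the processing result of the data to be processed to the target storage location of the memory.
According to one or more embodiments of the present disclosure, Example 12 includes the method described in Example 11, wherein the target instruction includes an index determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies at least one of the following: a given index value of the data to be processed in a second storage space of the memory, and a first immediate. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes at least one of: determining a first index, the first index indicating the storage location, in the first storage space, of the value at the position in the data to be processed indicated by the given index value; and determining a second index, the second index indicating the storage location of the first immediate in the first storage space.
According to one or more embodiments of the present disclosure, Example 13 includes the method described in Example 11, wherein the target instruction includes a first value determination instruction, the source operand specifies a location of a first storage space of the memory, and the source operand further specifies a given index value of the data to be processed in a second storage space of the memory. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes: determining a given value of the data to be processed at the position indicated by the given index value; and determining a first value in the first storage space at the position indexed by the given value.
According to one or more embodiments of the present disclosure, Example 14 includes the method described in Example 11, wherein the target instruction includes a second value determination instruction and the source operand specifies a given index value of the data to be processed in a second storage space of the memory. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes determining a second value in the data to be processed at the position indicated by the given index value.
According to one or more embodiments of the present disclosure, Example 15 includes the method described in Example 11, wherein the target instruction includes a vector transpose instruction and the source operand specifies at least a first location in a first storage space of the memory. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes vector-transposing the data to be processed at the first location in the first storage space to obtain transposed data to be processed.
According to one or more embodiments of the present disclosure, Example 16 includes the method described in Example 11, wherein the target instruction includes an exponent instruction and the source operand specifies the source storage location. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes determining an exponent value whose base is a predetermined value and whose power is the data to be processed at the source storage location.
According to one or more embodiments of the present disclosure, Example 17 includes the method described in Example 11, wherein the target instruction includes a vector mask (VM) register instruction and the source operand indicates a source VM register in the memory. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes storing the indices of the enabled positions in the source VM register to the target storage location.
According to one or more embodiments of the present disclosure, Example 18 includes the method described in Example 11, wherein the target instruction includes a one-hot code conversion instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, and the target operand specifies a target vector mask (VM) register. Performing the arithmetic logic operation associated with the vector operation specified by the target instruction includes: converting the value of the data to be processed at the given index value into a one-hot code; and storing the one-hot code in the target VM register.
According to one or more embodiments of the present disclosure, Example 19 describes an electronic device that includes at least the processor according to any one of Examples 1 to 10.
According to one or more embodiments of the present disclosure, Example 20 describes a computer-readable storage medium on which a computer program is stored. The computer program is executed by a processor to implement the method according to any one of Examples 11 to 18.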
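The flowcharts and block diagrams in the drawings show the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.
Implementations of the present disclosure have been described above. The description is exemplary, not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen to best explain the principles of the implementations, their practical application, or their improvement over technology in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.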

Claims (20)

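  1. A processor, comprising:
    an instruction decoder configured to decode a target instruction for a vector operation, the target instruction involving a target opcode, a source operand, and a target operand, the target opcode indicating the vector operation specified by the target instruction, the source operand specifying at least a source storage location in a memory for reading data to be processed, and the target operand specifying at least a target storage location in the memory for writing a processing result; and
    an arithmetic logic unit coupled to the instruction decoder and the memory, and configured to:
    read the data to be processed from the source storage location of the memory;
    perform, on the data to be processed, an arithmetic logic operation associated with the vector operation specified by the target instruction; and
    write the processing result of the data to be processed to the target storage location of the memory.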
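  2. The processor according to claim 1, wherein the target instruction comprises a first index determination instruction, the source operand specifies a location of a first storage space of the memory, the source operand further specifies a given index value of the data to be processed in a second storage space of the memory, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    determining a first index, the first index indicating the storage location, in the first storage space, of the value at the position indicated by the given index value in the data to be processed.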
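  3. The processor according to claim 1, wherein the target instruction comprises a second index determination instruction, the source operand specifies a location of a first storage space of the memory, the source operand further specifies a first immediate value, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    determining a second index, the second index indicating the storage location of the first immediate value in the first storage space.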
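  4. The processor according to claim 1, wherein the target instruction comprises a first value determination instruction, the source operand specifies a location of a first storage space of the memory, the source operand further specifies a given index value of the data to be processed in a second storage space of the memory, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    determining a given value of the data to be processed at the position indicated by the given index value; and
    determining a first value in the first storage space at the position indexed by the given value.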
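  5. The processor according to claim 1, wherein the target instruction comprises a second value determination instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    determining a second value at the position indicated by the given index value in the data to be processed.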
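  6. The processor according to claim 1, wherein the target instruction comprises a vector transpose instruction, the source operand specifies at least a first location in a first storage space in the memory, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    performing a vector transpose on the data to be processed at the first location in the first storage space to obtain transposed data to be processed.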
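  7. The processor according to claim 1, wherein the target instruction comprises an exponent instruction, the source operand specifies the source storage location, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    determining an exponential value whose base is a predetermined value and whose exponent is the data to be processed at the source storage location.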
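  8. The processor according to claim 1, wherein the target instruction comprises a vector mask (VM) register instruction, the source operand indicates a source VM register in the memory, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    storing, to the target storage location, the indices of the enabled positions in the source VM register.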
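  9. The processor according to any one of claims 2 to 8, wherein the target storage location holds a processing result vector, the target operand further indicates a target vector mask (VM) register, and the value at each position of the target VM register indicates whether the corresponding processing result is to be written at the corresponding position of the processing result vector.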
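  10. The processor according to claim 1, wherein the target instruction comprises a one-hot code conversion instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, the target operand specifies a target vector mask (VM) register, and
    the arithmetic logic unit is configured to perform the arithmetic logic operation associated with the vector operation specified by the target instruction by:
    converting the value of the data to be processed at the given index value into a one-hot code; and
    storing the one-hot code in the target VM register.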
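  11. A method for data processing, comprising:
    decoding a target instruction for a vector operation, the target instruction involving a target opcode, a source operand, and a target operand, the target opcode indicating the vector operation specified by the target instruction, the source operand specifying at least a source storage location in a memory for reading data to be processed, and the target operand specifying at least a target storage location in the memory for writing a processing result;
    reading the data to be processed from the source storage location of the memory;
    performing, on the data to be processed, an arithmetic logic operation associated with the vector operation specified by the target instruction; and
    writing the processing result of the data to be processed to the target storage location of the memory.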
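  12. The method according to claim 11, wherein the target instruction comprises an index determination instruction, the source operand specifies a location of a first storage space of the memory, the source operand further specifies at least one of the following: a given index value of the data to be processed in a second storage space of the memory, or a first immediate value, and
    wherein performing the arithmetic logic operation associated with the vector operation specified by the target instruction comprises at least one of the following:
    determining a first index, the first index indicating the storage location, in the first storage space, of the value at the position indicated by the given index value in the data to be processed,
    determining a second index, the second index indicating the storage location of the first immediate value in the first storage space.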
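  13. The method according to claim 11, wherein the target instruction comprises a first value determination instruction, the source operand specifies a location of a first storage space of the memory, the source operand further specifies a given index value of the data to be processed in a second storage space of the memory, and
    wherein performing the arithmetic logic operation associated with the vector operation specified by the target instruction comprises:
    determining a given value of the data to be processed at the position indicated by the given index value; and
    determining a first value in the first storage space at the position indexed by the given value.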
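  14. The method according to claim 11, wherein the target instruction comprises a second value determination instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, and
    wherein performing the arithmetic logic operation associated with the vector operation specified by the target instruction comprises: determining a second value at the position indicated by the given index value in the data to be processed.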
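  15. The method according to claim 11, wherein the target instruction comprises a vector transpose instruction, the source operand specifies at least a first location in a first storage space in the memory, and
    wherein performing the arithmetic logic operation associated with the vector operation specified by the target instruction comprises: performing a vector transpose on the data to be processed at the first location in the first storage space to obtain transposed data to be processed.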
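  16. The method according to claim 11, wherein the target instruction comprises an exponent instruction, the source operand specifies the source storage location, and
    wherein performing the arithmetic logic operation associated with the vector operation specified by the target instruction comprises: determining an exponential value whose base is a predetermined value and whose exponent is the data to be processed at the source storage location.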
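  17. The method according to claim 11, wherein the target instruction comprises a vector mask (VM) register instruction, the source operand indicates a source VM register in the memory, and
    wherein performing the arithmetic logic operation associated with the vector operation specified by the target instruction comprises: storing, to the target storage location, the indices of the enabled positions in the source VM register.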
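  18. The method according to claim 11, wherein the target instruction comprises a one-hot code conversion instruction, the source operand specifies a given index value of the data to be processed in a second storage space of the memory, the target operand specifies a target vector mask (VM) register, and
    wherein performing the arithmetic logic operation associated with the vector operation specified by the target instruction comprises:
    converting the value of the data to be processed at the given index value into a one-hot code; and
    storing the one-hot code in the target VM register.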
  19. An electronic device, comprising at least the processor according to any one of claims 1 to 10.
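  20. A computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the method according to any one of claims 11 to 18.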
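
For illustration only, and not as part of the claims: the masked write recited in claim 9 can be modeled as below, assuming the target VM register is a list of 0/1 flags and that positions whose flag is 0 keep their previous contents. That second assumption, and the function name, are not stated in the claim.

```python
from typing import List


def masked_write(dest: List[int], results: List[int], vm: List[int]) -> List[int]:
    """Write results[i] into position i only where vm[i] is enabled."""
    if not (len(dest) == len(results) == len(vm)):
        raise ValueError("dest, results and vm must have the same length")
    # Assumption: disabled positions keep the value already stored in dest.
    return [r if m else d for d, r, m in zip(dest, results, vm)]


print(masked_write([9, 9, 9, 9], [1, 2, 3, 4], [1, 0, 0, 1]))  # prints [1, 9, 9, 4]
```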
PCT/CN2023/098716 2022-06-14 2023-06-06 Processor and method, device, and storage medium for data processing Ceased WO2023241418A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2024573128A JP2025519635A (ja) 2022-06-14 2023-06-06 Processor, method for data processing, device, and storage medium
KR1020247041480A KR20250008533A (ko) 2022-06-14 2023-06-06 Processor, method for data processing, device, and storage medium
EP23822987.6A EP4524729A4 (en) 2022-06-14 2023-06-06 PROCESSOR, DATA PROCESSING METHOD, DEVICE AND STORAGE MEDIUM
US18/979,402 US12461747B2 (en) 2022-06-14 2024-12-12 Processor, method, device and storage medium for data processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210674857.6A CN117289991B (zh) 2022-06-14 2022-06-14 Processor and method, device, and storage medium for data processing
CN202210674857.6 2022-06-14

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/979,402 Continuation US12461747B2 (en) 2022-06-14 2024-12-12 Processor, method, device and storage medium for data processing

Publications (2)

Publication Number Publication Date
WO2023241418A1 true WO2023241418A1 (zh) 2023-12-21
WO2023241418A9 WO2023241418A9 (zh) 2024-07-25

Family

ID=89192127

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098716 Ceased WO2023241418A1 (zh) 2022-06-14 2023-06-06 Processor and method, device, and storage medium for data processing

Country Status (6)

Country Link
US (1) US12461747B2 (zh)
EP (1) EP4524729A4 (zh)
JP (1) JP2025519635A (zh)
KR (1) KR20250008533A (zh)
CN (1) CN117289991B (zh)
WO (1) WO2023241418A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121277714B (zh) * 2025-12-05 2026-03-31 Shanghai Biren Technology Co., Ltd. Execution unit and computing device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
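CN109791488A (zh) * 2016-10-01 2019-05-21 Intel Corporation System and method for executing a fused multiply-add instruction for complex numbers
CN109992240A (zh) * 2017-12-29 2019-07-09 Intel Corporation Method and apparatus for multi-load and multi-store vector instructions
CN113849769A (zh) * 2020-06-27 2021-12-28 Intel Corporation Matrix transpose and multiply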

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6907443B2 (en) * 2001-09-19 2005-06-14 Broadcom Corporation Magnitude comparator
JP4788177B2 (ja) * 2005-03-31 2011-10-05 日本電気株式会社 情報処理装置、演算処理装置、メモリアクセス制御方法およびプログラム
US20070118832A1 (en) * 2005-11-18 2007-05-24 Huelsbergen Lorenz F Method and apparatus for evolution of custom machine representations
CN104137054A (zh) * 2011-12-23 2014-11-05 英特尔公司 用于执行从索引值列表向掩码值的转换的系统、装置和方法
CN104185838B (zh) 2011-12-30 2017-12-22 英特尔公司 使用精减指令集核
US8874933B2 (en) 2012-09-28 2014-10-28 Intel Corporation Instruction set for SHA1 round processing on 128-bit data paths
US9552205B2 (en) * 2013-09-27 2017-01-24 Intel Corporation Vector indexed memory access plus arithmetic and/or logical operation processors, methods, systems, and instructions
US9851970B2 (en) * 2014-12-23 2017-12-26 Intel Corporation Method and apparatus for performing reduction operations on a set of vector elements
US9830151B2 (en) * 2014-12-23 2017-11-28 Intel Corporation Method and apparatus for vector index load and store
GB2543302B (en) * 2015-10-14 2018-03-21 Advanced Risc Mach Ltd Vector load instruction
US10509726B2 (en) * 2015-12-20 2019-12-17 Intel Corporation Instructions and logic for load-indices-and-prefetch-scatters operations
US20170177360A1 (en) * 2015-12-21 2017-06-22 Intel Corporation Instructions and Logic for Load-Indices-and-Scatter Operations
US20170315812A1 (en) * 2016-04-28 2017-11-02 Microsoft Technology Licensing, Llc Parallel instruction scheduler for block isa processor
WO2018158603A1 (en) * 2017-02-28 2018-09-07 Intel Corporation Strideshift instruction for transposing bits inside vector register
WO2018189728A1 (en) * 2017-04-14 2018-10-18 Cerebras Systems Inc. Floating-point unit stochastic rounding for accelerated deep learning
US11481218B2 (en) * 2017-08-02 2022-10-25 Intel Corporation System and method enabling one-hot neural networks on a machine learning compute platform
US10380063B2 (en) * 2017-09-30 2019-08-13 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator having a sequencer dataflow operator
GB2568230B (en) * 2017-10-20 2020-06-03 Graphcore Ltd Processing in neural networks
US11294670B2 (en) * 2019-03-27 2022-04-05 Intel Corporation Method and apparatus for performing reduction operations on a plurality of associated data element values
US10997116B2 (en) * 2019-08-06 2021-05-04 Microsoft Technology Licensing, Llc Tensor-based hardware accelerator including a scalar-processing unit
US12086080B2 (en) * 2020-09-26 2024-09-10 Intel Corporation Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits
US12373206B2 (en) * 2020-12-24 2025-07-29 Intel Corporation Methods, systems, and apparatuses to optimize cross-lane packed data instruction implementation on a partial width processor with a minimal number of micro-operations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4524729A4

Also Published As

Publication number Publication date
US20250110742A1 (en) 2025-04-03
US12461747B2 (en) 2025-11-04
CN117289991A (zh) 2023-12-26
EP4524729A4 (en) 2025-12-17
KR20250008533A (ko) 2025-01-14
JP2025519635A (ja) 2025-06-26
CN117289991B (zh) 2025-09-12
EP4524729A1 (en) 2025-03-19
WO2023241418A9 (zh) 2024-07-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 23822987; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2024573128; Country of ref document: JP)
WWE Wipo information: entry into national phase (Ref document number: 2023822987; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 20247041480; Country of ref document: KR; Kind code of ref document: A)
WWE Wipo information: entry into national phase (Ref document number: 1020247041480; Country of ref document: KR)
ENP Entry into the national phase (Ref document number: 2023822987; Country of ref document: EP; Effective date: 20241212)
NENP Non-entry into the national phase (Ref country code: DE)