EP3814923A1 - Architektur eines asynchronen prozessors - Google Patents

Architektur eines asynchronen prozessors

Info

Publication number
EP3814923A1
EP3814923A1 EP19737810.2A EP19737810A EP3814923A1 EP 3814923 A1 EP3814923 A1 EP 3814923A1 EP 19737810 A EP19737810 A EP 19737810A EP 3814923 A1 EP3814923 A1 EP 3814923A1
Authority
EP
European Patent Office
Prior art keywords
data
memory
registers
arithmetic
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP19737810.2A
Other languages
English (en)
French (fr)
Inventor
Khaled Maalej
Trung-Dung NGUYEN
Julien Schmitt
Pierre-Emmanuel BERNARD
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vsora
Original Assignee
Vsora
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vsora filed Critical Vsora
Publication of EP3814923A1 publication Critical patent/EP3814923A1/de
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8015One dimensional arrays, e.g. rings, linear arrays, buses
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units

Definitions

  • the invention relates to the field of processors and their functional architecture.
  • a computing device comprises a set of one or more processors.
  • Each processor includes one or more processing units, or PU for "Processing Units".
  • Each PU includes one or more calculation units called arithmetic and logical units, or ALU for "Arithmetic-Logic Unit".
  • ALU arithmetic-Logic Unit
  • ALUs can process operations in parallel, i.e. at the same time. The unit of time is then a cycle of calculation. It is therefore common to quantify the computing power of the IT device in terms of the number of operations it is capable of performing per computing cycle.
  • Such a device comprises a memory assembly, itself comprising one or more memory organs, each having a fixed number of memory locations on which computer data can be permanently stored.
  • the ALUs receive input data from the memory organs and output data which, in turn, is stored on the memory organs. It is therefore understood that, in addition to the number of ALUs, the number of memory units is another determining criterion of the computing power of the device.
  • bus is used here in its general sense of system (or interface) of data transfer including hardware (interface circuit) and the protocols governing the exchanges.
  • the bus transmits the data itself, addresses and control signals.
  • Each bus also has hardware limits and software so that data routing is limited.
  • the bus has a limited number of ports on the memory unit side and a limited number of ports on the ALUs side.
  • a memory location is accessible via the bus in a single direction (in “read” or in “write”).
  • a memory location is accessible for a single ALU.
  • a computing device generally comprises a set of registers and local memory organs, which can be seen as memories distinct from the aforementioned memory organs.
  • registers intended for storing data as such
  • local memory organs intended for storing memory addresses.
  • the ALUs of a PU are assigned to each register.
  • a PU is assigned several registers.
  • the storage capacity of the registers is very limited in comparison with the memory devices, but their content is directly accessible to the ALUs.
  • each ALU must generally first obtain the calculation input data, typically the two operands of an elementary calculation. An operation of "reading" the corresponding memory location via the bus to import each of the two operands into a register is therefore implemented. Then, the ALU performs the calculation operation by itself from data in a register and by exporting the result as data to a register. Finally, a "write" operation is implemented to save, in a memory location, the result of the calculation. During such a write operation, the result stored in the register is saved in a memory location via the bus. Each of the operations a priori consumes one or more calculation cycles.
  • the invention improves the situation.
  • a data processing method is proposed, which can be broken down into a set of elementary operations to be carried out, implemented by a computer device, said device comprising:
  • a set of registers capable of supplying data forming the operand of the inputs of said first arithmetic and logic unit and capable of being supplied with data originating from the outputs of said arithmetic and logic unit;
  • the process includes:
  • Such a method makes it possible, by dissociating in time the tasks relating to the processing of memory addresses and the calculation tasks, to exempt the ALU performing the calculations from carrying out further addressing operations which would necessitate stopping the calculation operations .
  • the processing as a whole becomes both asynchronous and self-adapting: elementary calculations are initiated (by an instruction transmitted to an ALU) only once the memory addresses have been updated in the local memory organs.
  • the sum of the time necessary to update the memory addresses in the local memory organs during a first process, then to perform the calculations during a second process is less than the time required for the same amount of resources to perform all of the processing during a single process (with access address updates memory in local memory organs on the fly).
  • the saving of time is particularly important in the case of iterative processing which can typically be carried out by means of computer loops.
  • a computerized data processing device is proposed, said processing being decomposable into a set of elementary operations to be carried out.
  • the device includes:
  • a set of registers capable of supplying data forming the operand of the inputs of said arithmetic and logic units and capable of being supplied with data from the outputs of said arithmetic and logic units;
  • the IT device is configured to:
  • a set of machine instructions is proposed for the implementation of a method as defined herein when this program is executed by a processor.
  • a computer program in particular a compilation program, is proposed, comprising instructions for the implementation of all or part of a process as defined herein when this program is executed by a processor.
  • a non-transient recording medium readable by a computer, is proposed on which such a program is recorded.
  • the first arithmetic and logic unit performs all the elementary calculations of the processing during consecutive calculation cycles, no memory access being carried out by said first arithmetic and logic unit during said calculation cycles. This allows the first arithmetic and logical unit to be dispensed from any memory access operation during elementary calculations and therefore to speed up the implementation of said calculations.
  • At least one of the following steps includes an iterative loop:
  • the device further comprises at least one additional arithmetic and logic unit and distinct from the first arithmetic and logic unit executing all of said elementary operations.
  • the additional arithmetic and logic unit implements:
  • the applicant also describes an approach in which, at each read operation, the number of data read is greater than the number of data strictly necessary for the implementation of the next calculation.
  • Such an approach could, in contrast, be called “predictive memory access”. It is then possible that a data item among the data read is used for a future calculation, other than the calculation implemented immediately after reading. In such cases, the necessary data was obtained during a single memory access operation (with increased memory bandwidth) while the usual approach would have required at least two separate memory accesses.
  • Such an approach therefore has the effect, at least in certain cases, of reducing the consumption of calculation cycles for memory accesses and therefore makes it possible to improve the efficiency of the device. In the long term (several consecutive calculation cycles) the number of memory accesses (in read and / or write) is reduced.
  • FIG. 1 shows an architecture of a computer device according to the invention
  • FIG. 2 is a partial representation of an architecture of a computer device according to the invention.
  • FIG. 3 shows an example of memory access
  • Figure 4 is a variant of the example of Figure 3;
  • FIG. 6 shows a chronological breakdown of an example of operation according to the invention.
  • FIG. 1 shows an example of a computing device 1.
  • the device 1 comprises a set of one or more processors 3, sometimes called central processing units or CPU for "Central Processing Units".
  • the processor assembly (s) 3 comprises at least one control unit 5 and at least one processing unit 7, or PU 7 for "Processing Unit".
  • Each PU 7 comprises one or more calculation units called arithmetic and logical units 9, or ALU 9 for "Arithmetic-Logic Unit".
  • each PU 7 further comprises a set of registers 11.
  • the device 1 comprises at least one memory 13 capable of interacting with the processor assembly (s) 3.
  • the device 1 comprises in addition a memory interface 15, or "Bus".
  • the memory organs are single-port, that is to say that the read and write operations are implemented during different cycles, as opposed to the memories called "double -port ”(more expensive in terms of surface area and requiring larger split control buses for writing and reading).
  • the proposed technical solutions can be implemented with so-called “dual-port” memories. In such embodiments, reads and writes can be implemented during the same calculation cycle.
  • three PU 7 are represented: PU 1, PU X and PU N. Only the structure of PU X is shown in detail in order to simplify FIG. 1. Nevertheless, the structures of the PUs are similar to each other. In variants, the number of PUs is different.
  • the device 1 can comprise a single PU, two PUs or more than three PUs.
  • the PU X comprises four ALUs: ALU X.0, ALU X.1, ALU X.2 and ALU X.3.
  • the PUs can comprise a number of ALUs which are different from each other and / or different from four, including a single ALU.
  • Each PU includes a set of registers 11, here at least one register 11 assigned to each ALU.
  • the PU X comprises a single register 11 per ALU, that is to say four registers referenced REG X.0, REG X.1, REG X.2 and REG X.3 and allocated respectively to ALU X.0, ALU X.1, ALU X.2 and ALU X.3.
  • each ALU is assigned a plurality of registers 11.
  • Each register 11 is capable of supplying operand type data with the inputs of said ALUs 9 and is capable of being supplied with data from the outputs of said ALUs.
  • Each register 11 is furthermore capable of storing data from memory 13 obtained via bus 15 by a so-called "read” operation.
  • Each register 11 is, furthermore, capable of transmitting stored data, destined for the memory 13 and via the bus 15, by a so-called "write” operation. The read and write operations are managed by controlling the memory accesses from the control unit 5.
  • the control unit 5 imposes on each ALU 9 the manner of carrying out elementary calculations, in particular their order, and attributes to each ALU 9 the operations to be executed.
  • the control unit 5 is configured to drive the ALUs 9 according to a micro-architecture in the processing chain so that the ALUs 9 perform calculations in parallel with each other. others.
  • the device 1 has an architecture with a single instruction stream and multiple data streams, called SIMD for “Single Instructions Multiple Data”, and / or an architecture with multiple instruction streams and multiple data streams, called MIMD for "Multiple Instructions Multiple Data”.
  • the control unit 5 is further arranged to control the memory accesses via the memory interface 15 and in particular, here, the read and write operations.
  • the two types of control (calculation and memory access) are represented, in FIG. 1, by arrows in broken lines.
  • FIG. 2 in which a single ALU Y is represented.
  • Data transmissions are represented by arrows with solid lines.
  • the data transmission being carried out step by step, it is understood that FIG. 2 does not necessarily represent an instant t with simultaneous data transmissions.
  • FIG. 2 does not necessarily represent an instant t with simultaneous data transmissions.
  • for a datum to be transmitted from a register 11 to an ALU 9 it is for example necessary that said datum is previously transmitted to said register 11 from the memory 13, here via the memory interface 15 (or Bus) .
  • each ALU 9 has at least three ports, namely two inputs and an output. For each operation, at least two operands are received, respectively by the first and the second input. The result of the calculation is issued via the output.
  • the operands received as input come from the REG Y.0 register and the REG Y.2 register respectively. The result of the calculation is entered in the REG register Y.1. Once registered in the REG Y.1 register, the result (in the form of data) is written in memory 13, via the memory interface 15.
  • at least one ALU can have more than two entries and receive more of two operands for a calculation.
  • Each ALU 9 can perform:
  • ALUs 9 do not directly exchange data with each other. For example, if the result of a first calculation performed by a first ALU constitutes an operand for a second calculation to be performed by a second ALU, then the result of the first calculation must at least be entered in a register 11 before being usable by an ALU 9.
  • the data written to a register 11 is also systematically written to memory 13 (via the memory interface 15), even if said data is obtained only to serve as an operand and not as a result of a whole treatment process.
  • the data obtained to serve as an operand and having a short relevance are not systematically written in memory 13 and can be stored only temporarily on a register 11.
  • the result of a first calculation carried out by a first ALU constitutes an operand for a second calculation to be carried out by a second ALU
  • the result of the first calculation must be entered in a register 11.
  • said data is transmitted to the second ALU as an operand directly from the register 11.
  • This allocation can in particular take the form of addressing data which makes it possible to locate, at all times, the location of a data item, whether on a register 11 or at a location in the memory 15.
  • the operation of the device 1 is described for a processing applied to computer data, the processing being composed of a set of operations, including calculations performed in parallel by a plurality of ALUs 9 over a period of time consisting of a sequence of calculation cycles. It is then said that the ALUs 9 operate according to a microarchitecture in the processing chain.
  • the processing implemented by the device 1 and which we are talking about here can itself constitute part (or a subset) of a more global IT process.
  • Such a more global process can include, in other parts or sub-assemblies, calculations carried out in a non-parallel manner by a plurality of ALUs, for example according to a series or cascade operation.
  • the operating architectures can be constant or dynamic, for example imposed (controlled) by the control unit 5.
  • the architectural variations can for example be a function of the data to be processed and the current instructions received in input of the device 1.
  • Such dynamic adaptation of the architectures can be implemented at the compilation stage, by adapting the machine instructions generated by the compiler according to the type of data to be processed and the instructions when the type of data to be processed and the instructions can be derived from the source code.
  • Such an adaptation can also be implemented only at the level of the device 1, or of a processor, when it executes a conventional machine code and when it is programmed to implement a set of configuration instructions according to the data. to be processed and current instructions received.
  • the memory interface 15, or "bus" transmits and routes the data between the ALUs 9 and the memory 15, in both directions.
  • the memory interface 15 is controlled by the control unit 5.
  • the control unit 5 controls access to the memory 13 of the device 1 via the memory interface 15.
  • the command 5 coordinates operations
  • the control of the control unit 5 includes the implementation of a succession of operations broken down into calculation cycles. Piloting includes the generation of a first cycle i and a second cycle ii. Chronologically, the first cycle i predates the second cycle ii. As will be described more in detail in the examples below, the second cycle ii can be immediately subsequent to the first cycle i, or else the first cycle i and the second cycle ii can be chronologically spaced from each other, for example with cycles intermediate.
  • the first cycle i includes:
  • the second cycle ii includes the implementation of a second calculation by at least one ALU 9.
  • the second calculation can be implemented by the same ALU 9 as the first calculation or by a separate ALU 9. At least part of the first data set downloaded during the first cycle i forms an operand for the second calculation.
  • Data, or blocks of data, are respectively referenced A0 to A15 and are stored in memory 13.
  • the data A0 to A15 are grouped by four as follows :
  • AA0_3 a data set referenced AA0_3 consisting of data A0, A1, A2 and A3;
  • AA4_7 a data set referenced AA4_7 consisting of data A4, A5, A6 and A 7;
  • AA8_11 a data set referenced AA8_11 consisting of data A8, A9, A10 and A11;
  • AA12_15 a data set referenced AA12_15 consisting of data A12, A13, A14 and A15.
  • the data can be grouped differently, in particular by group (or “block”, or “slot”) of two, three or more than four.
  • a data set can be seen as a group of data accessible on the memory 13 via a single port of the memory interface 15 during a single read operation.
  • the data of a data set can be written into memory 13 via a single port of the memory interface 15 during a single write operation.
  • at least one data set AA0_3, AA4_7, AA8_11 and / or AA12_15 is downloaded to at least one register 11.
  • each of the data sets AA0_3, AA4_7, AA8_11 and / or AA12_15 is downloaded to a respective register 11, that is to say four registers 1 1 distinct from each other.
  • Each of the registers 11 is assigned at least temporarily to a respective ALU 9, here referenced respectively ALU 0, ALU 1 ALU 2 and ALU 3.
  • the ALUs 9 may have implemented a calculation.
  • each ALU 9 implements a calculation for which at least one of the data stored in the corresponding register 1 1 forms an operand.
  • ALU 0 implements a calculation, one of the operands of which is A0.
  • A1, A2 and A3 may be unused during the second cycle ii.
  • downloading data from memory 13 to a register 11 consumes less computation time than implementing calculations by ALUs 9.
  • a memory access operation here a reading
  • the implementation of a calculation by an ALU 9 consumes a calculation cycle or a succession of several calculation cycles, for example four.
  • FIG. 3 there are a plurality of registers 11 allocated to each ALU 9, represented by groups of registers 11 referenced REG A, REG B and REG C.
  • the data downloaded from the memory 13 on the registers 11 correspond to the REG A and REG B groups.
  • the REG C group is here intended to store data obtained by calculations implemented by the ALUs 9 (during a write operation).
  • the registers 11 of the REG B and REG C groups can thus contain data sets referenced in a similar way to those of REG A:
  • the REG B group includes four registers 11 on which are respectively stored a data set BB0_3 consisting of data B0 to B3, a data set BB4_7 consisting of data B4 to B7, a data set BB8_11 consisting of data B8 to B11 and a data set BB12_15 consisting of data B12 to B15;
  • the group REG C comprises four registers 11 on which are respectively stored a data set CC0_3 consisting of data C0 to C3, a data set CC4_7 consisting of data C4 to C7, a data set CC8_11 consisting of data C8 to C11 and a CC12_15 data set consisting of data C12 to C15.
  • the data AN and BN constitute the operands of a calculation implemented by an ALU 9 while the data CN constitutes the result, with “N” an integer between 0 and 15.
  • CN AN + BN.
  • the data processing implemented by the device 1 corresponds to 16 operations. The 16 operations are independent of each other in the sense that none of the results of the 16 operations is necessary to implement one of the other 15 operations.
  • the implementation of the treatment (the 16 operations) can therefore, for example, be broken down as follows, into 18 cycles.
  • cycle # 2 calculation of C0 (of game CC0_3) and reading of AA4_7 (forming for example a cycle i);
  • cycle # 7 calculation of C5 (from game CC4_7) and reading of BB8_11 (for example forming a cycle ii);
  • cycle # 8 calculation of C6 (from game CC4_7) (for example forming a cycle ii);
  • example 1 (18 cycles), we notice that the first two cycles # 0 and # 1 constitute initialization cycles.
  • the number I of initialization cycles corresponds to the number of operands per calculation.
  • a pattern of four successive cycles is repeated four times. For example, cycles # 2 to # 5 together form a pattern.
  • the number of cycles per pattern corresponds to the number D of data per data set while the number of patterns corresponds to the number E of data set to be processed.
  • the total number of cycles can therefore be expressed as follows: I + D * E.
  • the number of data accessible (in reading or writing) in a single cycle is equal to three (and no longer four), for example because of material limitations.
  • the succession of cycles can, for example, be broken down as follows:
  • each cycle includes a memory access operation (read or write). It is therefore understood that, if the number D of data accessible in a single cycle is strictly less than three, then additional cycles will be necessary to perform memory accesses. The optimum of 18 cycles for 16 elementary operations will therefore no longer be reached. However, even if the optimum is not reached, the number of cycles remains significantly less than the number of cycles required in Example 0.
  • An embodiment in which the data sets comprise two data presents an improvement compared to to the existing.
  • the small total number of cycles is achieved in particular because a maximum of memory access operations is implemented per set of (several) data rather than individually and in parallel with calculation operations.
  • the reading of all the necessary operands can be completed even before the preceding elementary calculation operation is completed.
  • it is preserved from the computing power to perform a calculation and record (write operation) the result of said calculation in a common calculation cycle (cycle # 5 of example 1 for example).
  • reading the operand data in advance is implemented throughout the process (repeated from one pattern to another).
  • the operands necessary for the calculations carried out during a pattern are systematically obtained (read) during the chronologically previous pattern.
  • the reading in advance is implemented only partially (for two successive reasons only). Such degraded mode compared to the examples above presents better results than the existing methods.
  • the data is read before serving as operands.
  • the data read in advance are read randomly, or at least independently of the calculations to be performed in the future.
  • at least some of the data read in advance from the data sets effectively correspond to operands for subsequent calculations while other data read are not operands for subsequent calculations.
  • at least some of the data read can be subsequently erased from the registers 11 without having been used by the ALUs 9, typically overwritten by other data subsequently recorded in the registers 11. Certain data are therefore read unnecessarily (and recorded unnecessarily on registers 11).
  • the data read in advance are preselected, and depend on the calculations to be performed. This improves the relevance of the pre-read data. Indeed, in the examples with 16 elementary calculations above, each of the 16 elementary calculations requires as input a pair of operands, respectively A0 and B0; A1 and B1; ...; A15 and B15. If the data is read randomly, then the first two cycles could correspond to the reading of AA0_3 and BB4_7. In such a case, no complete pair of operands is available on the registers 11 at the end of the first two cycles. Consequently, ALUs 9 cannot carry out any elementary calculation in the following cycle.
  • Such an algorithm receives as input parameters information data relating to the calculations to be carried out subsequently by the ALUs 9, and in particular relating to the necessary operands. Such an algorithm makes it possible, at the output, to select the data read (per set) in anticipation of the future calculations to be performed.
  • Such an algorithm is, for example, implemented by the control unit 5 when controlling the memory accesses. According to a first approach, the algorithm imposes an organization of the data as soon as they are recorded in the memory 13. For example, the data which one wishes to see forming together a dataset are juxtaposed and / or scheduled so that the whole of the dataset can be called by a single query.
  • the memory interface 15 can be configured for, in response to a read request on @ A0 , read the data automatically at the following three addresses @ A1, @ A2 and @ A3.
  • the prefetch algorithm provided at the output of the memory access requests adapted as a function of the calculations to be performed subsequently by the ALUs 9, and in particular relating to the necessary operands.
  • the algorithm identifies for example that the data to be read in priority are those of AA0_3 and BB0_3 to make possible, from the next cycle, the elementary calculations resulting in CC0_3, that is to say the calculation of CO with the operands AO and BO, the calculation of C1 with the operands A1 and B1, the calculation of C2 with the operands A2 and B2 and the calculation of C3 with the operands A3 and B3.
  • the algorithm therefore provides, at the output, memory access requests constructed to generate the reading of AA0_3 and BB0 3.
  • the algorithm identifies the data to be read and the control unit 5 deduces therefrom requests for memory access to the memory interface 15 to obtain said data , the requests being adapted as a function of the characteristics (structure and protocol) of the memory interface 15.
  • the number of ALUs assigned to elementary calculations is not defined.
  • a single ALU 9 can perform all of the elementary calculations, cycle by cycle.
  • the elementary calculations to be performed can also be distributed over a plurality of ALUs 9 of a PU, for example four. In such cases, coordinating the distribution of the calculations on the ALUs with the manner of grouping the data to be read in each read operation can make it possible to further improve efficiency. Two approaches stand out.
  • the data read in one operation form operands in calculations implemented by a single ALU 9.
  • the groups AA0_3 and BB0_3 of data A0, A1, A2, A3, B0, B1, B2 and B3 are read first and a first ALU is responsible for calculating CC0 3 (C0, C1, C2 and C3).
  • the groups AA4 _7 (A4, A5, A6, A7) and BB4_7 (B4, B5, B6 and B7) are then read and a second ALU is responsible for calculating CC4_7 (C4, C5, C6 and C7).
  • the first ALU will be able to start implementing the calculations before the second ALU can do the same because the operands necessary for the calculations of the first ALU will be available on the registers 11 before the operands necessary for the calculations of the second ALU are.
  • the ALUs 9 of a PU then operate in parallel and asynchronous fashion.
  • the data read in one operation form operands in calculations each implemented by different ALUs 9, for example four. For example, two groups of data including A0, A4, A8 and A12 respectively; B0, B4, B8 and B12 are read first.
  • a first ALU is responsible for calculating C0
  • a second ALU is responsible for calculating C4
  • a third ALU is responsible for calculating C8
  • a fourth ALU is responsible for calculating C12. It will then be understood that the four ALUs will be able to start implementing their respective calculation in a substantially simultaneous manner, because the necessary operands will be available on the registers 11 at the same time as downloaded in a common operation.
  • the ALUs 9 of a PU operate in parallel and synchronized. Depending on the types of calculations to be performed, the accessibility of the data in memory and the resources available, one or the other of the two approaches may be preferred.
  • the two approaches can also be combined: the ALUs can be organized into subgroups, the ALUs of a subgroup operating in synchronized fashion and the subgroups operating asynchronously with respect to each other.
  • the grouping of the data to be read by read operation must be selected in correspondence with the distribution of the assignments of the calculation operations to various ALUs.
  • the elementary calculations are independent of each other. The order in which they are carried out therefore does not a priori matter.
  • the scheduling of the calculations can be specific. Such a situation typically arises in the context of recursive computations.
  • the algorithm can be configured to identify the data to be acquired (read) in priority. For example, if:
  • the result C9 is obtained by a calculation of which one of the operands is C8, C8 being itself obtained from the operands A8 and B8, and
  • the result C13 is obtained by a calculation of which one of the operands is C12, C12 itself being obtained from the operands A12 and B12,
  • the algorithm can be configured to read, during the first two cycles # 0 and # 1 of initialization, the data sets defined as follows:
  • the dataset thus defined is represented in FIG. 4.
  • the data are grouped “online” in the embodiment represented in FIG. 3 and grouped “in column” in the embodiment represented in FIG. 4.
  • the implementation of the algorithm makes it possible to read and make available on the registers 11, the operands useful for the priority elementary computations.
  • the implementation of the algorithm makes it possible to increase the short-term relevance of the data read compared to a random reading.
  • processors - a processor or a set of processors
  • FIG. 5 There is shown an example of the operating architecture of a device 1 in which the memory access and address processing operations are treated separately. Elementary calculation operations.
  • Such an architecture can take the form of a computer process. It can optionally be combined with the embodiments described above.
  • the numerical references common with those of the preceding figures designate similar elements, in particular a control unit 5, an ALU 9, registers 11, a memory 13 and a memory interface 15, or "bus".
  • a set of instructions can be implemented by the computer device 1.
  • An example of such a set of instructions is given at the end of the description in the form of a computer pseudocode. .
  • Such instructions are applied one after the other during a common process implemented by an ALU 9.
  • the instructions relating to memory accesses and the instructions relating to operations elementary computations are processed by separate processes from each other.
  • the method can be broken down into steps referenced respectively 101 to 109.
  • the memory addresses @ A0 to @AN, respectively @ B0 to @BN, of each of the data forming operands for at least one of the elementary operations to be performed are obtained.
  • “obtained” is meant that at the end of operations 101 and 102, one or more local memory organs store the addresses of all the data forming operand.
  • Such memory accesses are for example triggered by the reception of instructions from the control unit 5.
  • at least some of said addresses are already stored in the local memory organs. No memory access is therefore necessary at this stage to obtain said addresses previously installed on the local memory organs.
  • step 101 concerning the first operands "A" of the addition and from step 102 concerning the second operands "B” of the addition. Distinguishing the two operands then makes it possible to implement iterative loops (in the computer sense) specific to each of the two operands, and possibly different from each other.
  • steps 101 and 102 can be implemented at least partially in parallel with one another, independently of one another.
  • step 103 and 104 each of said data obtained, respectively A0 to AN and B0 to BN, is read from memory 13, for loading into registers 11, via the memory interface 15. Such readings are made possible by addresses obtained in steps 101 and 102.
  • step 103 concerns the first operands "A” while step 104 concerns the second operands "B".
  • a calculation execution instruction is transmitted from the control unit 5 to an ALU 9.
  • the execution instruction is constructed so as to trigger the implementation of the calculations basic treatment for ALU 9.
  • the instruction here is devoid of addressing instruction.
  • devoid of addressing instruction it is meant here that, contrary to what is usually done, the instruction for carrying out calculations transmitted by the control unit 5 is not included in a general instruction set combining both addressing instructions and instructions for performing calculations.
  • the ALU 9 is able to immediately apply the instructions by carrying out the elementary calculations without the need to apply instructions for configuring the memory interface 15 in advance and therefore without it being necessary, either, to verify any mutual dependence on the various instructions received.
  • the ALU 9 then behaves like a computer resource implementing the calculations (step 106 described below) independently of a possible complexity of interdependence between the different instructions.
  • the registers 11 behave like first-in-first-out (or FIFO for “First In, First Out”) buffers.
  • the registers 11 filling and emptying respecting the order of arrival of the data, here the operands AN (step 103) and BN (step 104).
  • Step 106 is executed if the registers 11 are not empty: the registers 11 are unstacked from the operands AN and BN. As a variant, the registers 11 do not operate in FIFO mode. In this case, data can be stored there more permanently without the risk of being erased and can be reused later if necessary.
  • step 106 upon receipt of the instruction for executing calculations, the ALU 9 executes all of the corresponding elementary operations, as soon as the operands are available in the registers 11. Step 106 therefore includes the reception of the ALU 9 of each of the operands from the registers 11 as inputs. Provided that the addressing operations 103, 104 have been previously and correctly carried out, step 106 can be devoid of memory access (read).
  • step 107 the data forming results of the processing are stored on the registers 11 at the outputs of the ALU 9.
  • the results of the processing and not the results of each of the elementary calculations. Indeed, in the case where some of the results of the elementary calculations are used as operands for other elementary calculations during step 107, such (intermediate) results may become useless at the end of the processing. In these cases, the intermediate results can be deleted from the registers 11 at the end of step 107 (for example overwritten by other data, FIFO mode). As a variant, all of the data forming results of elementary operations are stored on registers 11 at the end of step 107 (mode different from FIFO).
  • step 108 a memory address @CX for each of the CX data forming the result of the processing is obtained.
  • Such an addressing operation makes it possible to determine the memory location on which each of the data forming the result will be stored.
  • step 108 is implemented after step 107 of recording the results in the registers 11 and before writing the results to memory 13 (step 109 described below).
  • step 107 can be implemented earlier during the process, in particular before step 106. In fact, especially when the shape (for example the size) of the result data is known in advance, it is possible to address the result data even before their calculation.
  • Obtaining the @CX memory addresses may include the transmission of addressing instructions from the control unit 5.
  • each of the data forming the result of the processing is written into memory 13 from registers 11, for storage and via the memory interface 15, by means of the memory addresses obtained in step 108.
  • steps 101, 102, 106 and 108 are given in the form of a computer pseudocode.
  • Such non-limiting examples represent operations in the form of computer loops.
  • the use of loops is particularly advantageous for limiting the number of calculation cycles necessary, and therefore for improving efficiency, when the operations to be implemented are substantially analogous to each other (for example all additions) and that only the input data varies.
  • the calculation instructions transmitted in step 105 may take the form of a repetitive loop for each operation.
  • FIG. 6 illustrates that, unlike the use in the technical field, the addressing of the input data (operands) is implemented distinct from the calculations themselves. In other words, addressing and calculation are treated as two separate processes from each other rather than being treated indiscriminately, on the fly, upon receipt of general instructions.
  • the step 106 for executing the elementary operations can start even though all the operands have not yet been downloaded from the memory 13 to the registers 11.
  • the first cycles of the step 106 can start as soon as that the first corresponding operands are available on the registers 11 for the calculations. This cascading implementation of operations gives the system its asynchronous nature.
  • the ALU 9 performs (step 106) all of the elementary calculations of the processing during consecutive calculation cycles. No memory access is made by the ALU 9 during these calculation cycles. Thus, the implementation of the calculations can be particularly rapid. The ALU 9, during such cycles, is exempt from memory access operation. In addition, from the point of view of the ALU 9 performing the calculations, obtaining the operands is similar to calling memory 13 but obtaining the operands is faster and independent of the memory interface 15 because the operands are in practice read directly from the registers 11. The memory accesses are implemented by another ALU (separate from the one performing the calculations). At least during a process, each ALU 9 has a fixed function: either the implementation of calculations, or the implementation of memory accesses.
  • each ALU 9 can be modified at the end of the implementation of the process to give flexibility to the IT device. However, this often involves adapting the addressing path accordingly. In the preferred embodiments, the function of each ALU 9 is therefore frozen from one process to another: the ALUs are specialized.
  • the methods and variants described above can take the form of a computer device, including a processor or a set of processors, arranged to implement such a method.
  • the invention is not limited to the examples of methods and devices described above, only by way of example, but it encompasses all the variants that a person skilled in the art may envisage within the framework of the protection sought.
  • the invention also relates to a set of machine instructions which can be implemented in a processor for obtaining such a computing device, such as a processor or a set of processors, the implementation of such a set of machine instructions on a processor, the processor architecture management method implemented by the processor, the computer program comprising the corresponding set of machine instructions, as well as the recording medium on which such a set of machine instructions is recorded by computer.
  • reg2 regO + governed
  • GOTO LOOP (10x) Example in the form of a computer pseudocode of instructions for carrying out a processing consisting of ten additions according to an embodiment in which the addressing instructions and the calculation instructions are distinguished from each other: // step 101 //

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Advance Control (AREA)
  • Executing Machine-Instructions (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
EP19737810.2A 2018-06-29 2019-05-21 Architektur eines asynchronen prozessors Pending EP3814923A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR1856000A FR3083351B1 (fr) 2018-06-29 2018-06-29 Architecture de processeur asynchrone
PCT/FR2019/051156 WO2020002783A1 (fr) 2018-06-29 2019-05-21 Architecture de processeur asynchrone

Publications (1)

Publication Number Publication Date
EP3814923A1 true EP3814923A1 (de) 2021-05-05

Family

ID=65031328

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19737810.2A Pending EP3814923A1 (de) 2018-06-29 2019-05-21 Architektur eines asynchronen prozessors

Country Status (6)

Country Link
US (1) US20210141644A1 (de)
EP (1) EP3814923A1 (de)
KR (1) KR20210021588A (de)
CN (1) CN112639760B (de)
FR (1) FR3083351B1 (de)
WO (1) WO2020002783A1 (de)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5513366A (en) * 1994-09-28 1996-04-30 International Business Machines Corporation Method and system for dynamically reconfiguring a register file in a vector processor
US5694565A (en) * 1995-09-11 1997-12-02 International Business Machines Corporation Method and device for early deallocation of resources during load/store multiple operations to allow simultaneous dispatch/execution of subsequent instructions
US20140281397A1 (en) * 2013-03-15 2014-09-18 Maxim Loktyukhin Fusible instructions and logic to provide or-test and and-test functionality using multiple test sources

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0762823B2 (ja) * 1985-05-22 1995-07-05 株式会社日立製作所 デ−タ処理装置
JP3068138B2 (ja) * 1989-04-06 2000-07-24 甲府日本電気株式会社 ベクトル演算処理装置
US5333284A (en) * 1990-09-10 1994-07-26 Honeywell, Inc. Repeated ALU in pipelined processor design
CN1180427A (zh) * 1996-02-28 1998-04-29 爱特梅尔股份有限公司 以单精度或双精度进行算术运算的系统
US6009505A (en) * 1996-12-02 1999-12-28 Compaq Computer Corp. System and method for routing one operand to arithmetic logic units from fixed register slots and another operand from any register slot
GB0323950D0 (en) * 2003-10-13 2003-11-12 Clearspeed Technology Ltd Unified simid processor
JP4232838B2 (ja) * 2007-03-29 2009-03-04 日本電気株式会社 再構成可能なsimd型プロセッサ
GB0706411D0 (en) * 2007-04-02 2007-05-09 Aspex Semiconductor Ltd Improvements relating to SIMD parallel processors
US20080263115A1 (en) * 2007-04-17 2008-10-23 Horizon Semiconductors Ltd. Very long arithmetic logic unit for security processor
US8880856B1 (en) * 2009-06-17 2014-11-04 Juniper Networks, Inc. Efficient arithmetic logic units
US8639882B2 (en) * 2011-12-14 2014-01-28 Nvidia Corporation Methods and apparatus for source operand collector caching
US9235414B2 (en) * 2011-12-19 2016-01-12 Intel Corporation SIMD integer multiply-accumulate instruction for multi-precision arithmetic
JP6300796B2 (ja) * 2012-07-06 2018-03-28 コーニンクレッカ フィリップス エヌ ヴェKoninklijke Philips N.V. 算術及び論理ユニットを伴わないコンピュータプロセッサ及びシステム

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5513366A (en) * 1994-09-28 1996-04-30 International Business Machines Corporation Method and system for dynamically reconfiguring a register file in a vector processor
US5694565A (en) * 1995-09-11 1997-12-02 International Business Machines Corporation Method and device for early deallocation of resources during load/store multiple operations to allow simultaneous dispatch/execution of subsequent instructions
US20140281397A1 (en) * 2013-03-15 2014-09-18 Maxim Loktyukhin Fusible instructions and logic to provide or-test and and-test functionality using multiple test sources

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
See also references of WO2020002783A1 *
XIN FU ET AL: "ORBIT: Effective Issue Queue Soft-Error Vulnerability Mitigation on Simultaneous Multithreaded Architectures Using Operand Readiness-Based Instruction Dispatch", COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING, 2008. SBAC-PAD '08. 20TH INTERNATIONAL SYMPOSIUM ON, IEEE, PISCATAWAY, NJ, USA, 29 October 2008 (2008-10-29), pages 71 - 78, XP031365358, ISBN: 978-0-7695-3423-7 *

Also Published As

Publication number Publication date
CN112639760A (zh) 2021-04-09
WO2020002783A1 (fr) 2020-01-02
FR3083351A1 (fr) 2020-01-03
CN112639760B (zh) 2025-01-10
KR20210021588A (ko) 2021-02-26
US20210141644A1 (en) 2021-05-13
FR3083351B1 (fr) 2021-01-01

Similar Documents

Publication Publication Date Title
Kim et al. Serverless data analytics with flint
EP1805611B1 (de) Task-verarbeitungs-einteilungsverfahren und einrichtung zu seiner implementierung
EP2232368B1 (de) System mit mehreren verarbeitungseinheiten, die es unmöglich machen, aufgaben gleichzeitig auszuführen durch mischung der ausführungsart des steuertyps und der ausführungsart des datenstromtyps
EP2366147B1 (de) Physischer manager der synchronisationsbarriere zwischen mehreren prozessen
FR2996037A1 (fr) Moteur hybride pour processeur central et processeur graphique
US12443891B2 (en) Vector processing of decision trees to form inferences
EP2585931B1 (de) Vorrichtung, string und verfahren zur datenverarbeitung sowie zugehöriges computerprogramm
EP2947563B1 (de) Prozessor mit bedingten anweisungen
EP3814893B1 (de) Prozessorspeicherzugriff
EP2956874A2 (de) Vorrichtung und verfahren zur beschleunigung der aktualisierungsphase eines simulationskerns
FR2678400A1 (fr) Processeur de protocole destine a l'execution d'un ensemble d'instructions en un nombre reduit d'operation.
WO2020002783A1 (fr) Architecture de processeur asynchrone
EP3005107A1 (de) Materialbeschleuniger für rote und schwarze bäume
CN119130772A (zh) 图元分发方法、装置、设备及存储介质
WO2019115902A1 (fr) Architectures de processeur
EP2556435B1 (de) Segmentierter zwischenspeicher
EP2499570A1 (de) Verfahren und vorrichtung zur optimierung der ausführung von softwareanwendungen in einer multiprozessorarchitektur einschliesslich einer vielzahl von eingabe-/ausgabesteuerungen und sekundärprozessoreinheiten
EP3716086A1 (de) Direkter speicherzugang
FR2957699A1 (fr) Systeme de traitement de donnees
FR2867874A1 (fr) Dispositif et procede de gestion d'un jeu d'instructions d'un microprocesseur
FR2858074A1 (fr) Procede de gestion de l'execution d'au moins un programme sur plusieurs calculateurs

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20201223

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230629