WO2024077312A2 - Unité crossbar de multiplication de système de nombre résiduel plié - Google Patents
Unité crossbar de multiplication de système de nombre résiduel plié Download PDFInfo
- Publication number
- WO2024077312A2 WO2024077312A2 PCT/US2024/012403 US2024012403W WO2024077312A2 WO 2024077312 A2 WO2024077312 A2 WO 2024077312A2 US 2024012403 W US2024012403 W US 2024012403W WO 2024077312 A2 WO2024077312 A2 WO 2024077312A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- modulo
- input number
- input
- sign
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/60—Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
- G06F7/72—Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
- G06F7/722—Modular multiplication
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03K—PULSE TECHNIQUE
- H03K19/00—Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
- H03K19/003—Modifications for increasing the reliability for protection
- H03K19/00346—Modifications for eliminating interference or parasitic voltages or currents
- H03K19/00361—Modifications for eliminating interference or parasitic voltages or currents in field effect transistor circuits
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03K—PULSE TECHNIQUE
- H03K19/00—Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
- H03K19/02—Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components
- H03K19/173—Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using elementary logic circuits as components
- H03K19/177—Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using elementary logic circuits as components arranged in matrix form
- H03K19/17724—Structural details of logic blocks
- H03K19/17732—Macroblocks
Definitions
- the following is related generally to the field of electronic circuits and, more specifically, to electronic circuits for performing arithmetic functions such as multiplication.
- Computer systems frequently include one or more circuits for performing arithmetic functions such as addition, subtraction, multiplication, and division.
- one or more circuits for performing arithmetic operations such as multiplication may be integrated as execution units within a central processing unit (CPU).
- CPU central processing unit
- computer systems frequently include combinatorial logic circuits (e.g., in a microprocessor) such as a Vector Unit, or VU, an arithmetic logic unit, or ALU, and/or a floating-point unit, or FPU, often referred to as a math coprocessor.
- One or more such units may include multiplication circuits (multiplication units) to perform multiplication operations.
- one or more VUs, ALUs and/or FPUs may be integrated as execution units within the central processing unit.
- One way to improve the efficiency of multiplication operations is through use of a Residual, or Residue, Number System (RNS) multiplication operations in which the operands are represented by a set of remainders for a set of moduli.
- RMS Residual, or Residue, Number System
- this approach can provide advantages, as usually implemented it can require m x m lookup tables for a modulus value of m, need to use division for performing modular reduction for general values of m, or be restricted to specific values such as 2'-1 or 2'+1 , where i is an integer.
- a multiplication circuit includes a decoder circuit and a crossbar multiplier circuit comprising an array of switches.
- the decoder circuit is configured to: receive a first modulo p input number, where p is a prime number; receive a second modulo p input number; convert the first modulo p input number into a first binary valued vector, wherein, for non-zero values, one of the values of the first vector is one and other values of the first vector being zero and the values of the first vector are ordered in a modulo power-of-two order; and convert the second modulo p input number into a second binary valued vector, wherein, for non-zero values, one of the values of the second vector is one and other values of the second vector being zero and the values of the second vector are ordered in the modulo power-of-two order.
- the crossbar multiplier circuit comprises an array of switches configured to: receive the first modulo p input number ordered in the modulo power-of-two order; receive the second modulo p input number ordered in the modulo power-of-two order; and determine a product of the first modulo p input number and the second modulo p input number by applying the first modulo p input number ordered in the modulo power-of-two order and applying the second modulo p input number ordered in the modulo power-of-two order to the array of switches.
- the crossbar multiplier circuit is further configured to: receive the first modulo p input number in a sign-magnitude format; receive the second modulo p input number in the sign-magnitude format; and determine a magnitude of the product of the first modulo p input number and the second modulo p input number from the magnitude of the of the first modulo p input number and the magnitude of the of the second modulo p input number.
- the multiplication circuit further comprises a sign determination circuit configured to: receive a sign value of the first modulo p input number in the sign-magnitude format; receive a sign value of a second modulo p input number in the sign-magnitude format; and determine a sign value of the product of the first modulo p input number and the second modulo p input number from the sign value of the of the first modulo p input number and the sign value of the of the second modulo p input number.
- a sign determination circuit configured to: receive a sign value of the first modulo p input number in the sign-magnitude format; receive a sign value of a second modulo p input number in the sign-magnitude format; and determine a sign value of the product of the first modulo p input number and the second modulo p input number from the sign value of the of the first modulo p input number and the sign value of the of the second modulo p input number.
- the crossbar multiplier circuit is a (p-1 )/2 by (p+1 )/2 crossbar array of switches.
- the crossbar multiplier circuit is a p/2 by /2 crossbar array of switches.
- the switches are NMOS transistors.
- the switches are implemented as one or more of FinFET, Gate-AII-Around (GAA), or nano-wire or nano-sheet transistors.
- the magnitude of the first modulo p input number is received as a first binary valued vector, wherein, for nonzero values, one of the values of the vector being one and other values of the vector being zero; and the magnitude of the second modulo p input number is received as a second binary valued vector, wherein, for non-zero values, one of the values of the vector being one and other values of the vector being zero.
- the first modulo p input number is received as a binary encoded value and the second modulo p input number is received as a binary encoded value
- crossbar multiplier circuit is further configured to: convert the magnitude of the first modulo p input number to a first binary valued vector, wherein, for non-zero values, one of the values of the vector being one and other values of the vector being zero; and convert the magnitude of the second modulo p input number to a second binary valued vector, wherein, for non-zero values, one of the values of the vector being one and other values of the vector being zero.
- the crossbar multiplier circuit is further configured to: determine from the sign and magnitude of the first modulo p input number a first indication of whether the first modulo p sign-magnitude input number represents an unsigned binary number that is either in a first (p-1 )/2 non-zero values of the first unsigned binary modulo p input number or is in a second (p-1 )/2 non-zero values of the first unsigned binary modulo p input number; and determine from the sign and magnitude of the second modulo p input number a first indication of whether the second modulo p sign-magnitude input number represents an unsigned binary number that is either in the first (p-1 )/2 non-zero values of the second unsigned binary modulo p input number or is in the second (p-1 )/2 non-zero values of the second unsigned binary modulo p input number.
- the multiplication circuit is one of a plurality of modular arithmetic units, each for a different value of a prime number and each configured to receive a corresponding component of input values in a residual number format system.
- the multiplication circuit further comprises a plurality of level restore circuits, each configured to receive two outputs of a row of the crossbar multiplier circuit, one of outputs active and the other outputs inactive, and provide the bit of the output of the crossbar multiplier circuit as a strong logic value.
- a method of performing modulo p multiplication comprising: receiving a first modulo p input number in a sign-magnitude format, where p is a prime number; receiving a second modulo p input number in the sign-magnitude format; determining a magnitude of a product of the first modulo p input number and the second modulo p input number by providing the magnitude of the first modulo p input number and the second modulo p input number as respective first and second inputs of a crossbar multiplier circuit; and determining a sign value of the product of the first modulo p input number and the second modulo p input number from the sign value of the of the first modulo p input number and the sign value of the of the second modulo p input number.
- the second modulo p input number in the sign-magnitude format is a shift amount.
- the crossbar multiplier circuit is a (p-1 )/2 by (p+1 )/2 crossbar array of switches.
- receiving the first modulo p input number in the sign-magnitude format comprises: receiving the first modulo p input number in a binary format; and converting the first modulo p input number from the binary format into the sign-magnitude format; and receiving the second modulo p input number in the sign-magnitude format comprises: receiving the second modulo p input number in the binary format; and converting the second modulo p input number from the binary format into the sign-magnitude format.
- the sign-magnitude format includes the magnitude expressed as a 1 -hot vector.
- an arithmetical logic circuit includes a plurality of mod-p arithmetic units, each for a different value of a prime number p and each configured to receive mod-p components of input values in a residual number format system.
- Each mod-p arithmetic unit includes a multiplication circuit comprising a decoder circuit and a crossbar multiplier circuit.
- the decoder circuit is configured to: receive a first mod-p input number; receive a second mod-p input number; convert the first mod-p input number into a first binary valued vector, wherein, for non-zero values, one of the values of the first vector is one and other values of the first vector being zero and the values of the first vector are ordered in a modulo power-of-two order; and convert the second mod-p input number into a second binary valued vector, wherein, for non-zero values, one of the values of the second vector is one and other values of the second vector being zero and the values of the second vector are ordered in the modulo power-of-two order.
- the crossbar multiplier circuit comprises an array of switches configured to: receive the first mod-p input number ordered in the modulo power-of-two order; receive the second mod-p input number ordered in the modulo power-of-two order; and determine a product of the first mod-p input number and the second mod-p input number by applying the first mod-p input number ordered in the modulo power-of-two order and applying the second mod-p input number ordered in the modulo power-of-two order to the array of switches.
- the crossbar multiplier circuit is further configured to: receive the first modulo p input number in a sign-magnitude format; receive the second modulo p input number in the sign-magnitude format; and determine a magnitude of the product of the first mod-p input number and the second mod-p input number from the magnitude of the of the first mod-p input number and the magnitude of the of the second mod-p input number.
- the multiplication circuit further comprises a sign determination circuit configured to: receive a sign value of the first mod-p input number in the sign-magnitude format; receive a sign value of a second mod-p input number in the sign-magnitude format; and determine a sign value of the product of the first mod-p input number and the second mod-p input number from the sign value of the of the first mod-p input number and the sign value of the of the second mod-p input number.
- the arithmetical logic circuit further includes: a first buffer configured store the mod-p components of the first input value and provide the mod-p components of the first input value to the corresponding mod- p arithmetic unit; and a second buffer configured store the mod-p components of the second input value and provide the mod-p components of the first input value to the corresponding mod-p arithmetic unit.
- each mod-p arithmetic unit further comprising: a first decoding circuit configured to: receive the mod-p component of the first input value; and convert the mod-p component of the first input value into the sign-magnitude format; and a second decoding circuit configured to: receive the mod-p component of the second input value; and convert the mod-p component of the second input value into the sign-magnitude format.
- first decoding circuit is further configured to convert the magnitude of the mod-p component of the first input value into a 1-hot vector format; and to convert the mod-p component of the second input value into the sign-magnitude format, first decoding circuit is further configured to convert the magnitude of the mod-p component of the second input value into the 1 - hot vector format.
- crossbar multiplier circuit is a (p-1 )/2 by (p+1 )/2 crossbar array of switches.
- a multiplication circuit comprising: a first operand decoder block configured to receive a first modulo p input number and, for non-zero values, convert the first modulo p input number into a corresponding sign value and a corresponding one of >2(p-1 ) magnitude value vectors, where p is a prime number; and a second operand decoder block configured to receive a second modulo p input number and, for non-zero values, convert the second modulo p input number into a corresponding sign value and a corresponding one of the >2(p-1 ) magnitude value vectors.
- a >2(p-1 ) by >2(p-1 ) crossbar array of switches is configured to receive the magnitude value vector of the first modulo p input number as a first input and the magnitude value vector of the second modulo p input number as a second input and provide as output a magnitude value vector of the modulo p product of the first modulo p input and the second modulo p input.
- Logic circuitry is configured to receive the sign value of the first modulo p input number and the sign value of the second modulo p input number and determine therefrom a sign value of the modulo p product of the first modulo p input and the second modulo p input.
- the multiplication circuit can further comprise zero determination circuitry configured to provide a zero output value in respect to either of the first modulo p input number or the second modulo p input number having a zero value.
- the crossbar array comprises >2(p-1 ) columns of >2(p-1 ) switches, the switches of each column connected between a corresponding one of the magnitude value vectors of the first modulo p input number and one of either a first >2(p-1 ) output values or a second >2(p-1 ) output values, and a control gate of each of the switches of each column are connect to a different one of the magnitude value vectors of the second modulo p input numbers
- the logic circuitry is further configured to receive the first >2(p-1 ) output values and the second Vz(p-1 ) output values and determine whether the magnitude value vector of the output corresponds to either one of the first Vz(p-1 ) output values or the second >2(p-1 ) output values based on the sign value of the first modulo p input number and the sign value of the second modulo p input number.
- Figures 1A and 1 B are respectively block diagrams of a computer system and a microprocessor that can be incorporated into such a computer system.
- Figure 2A illustrates a Residual, or Residue, Number System (RNS) number RNS(8
- RNS Number System
- Figure 2B illustrates an embodiment for the structure of an adder, subtractor, or multiplier for numbers in the RNS format using the RNS(8
- Figure 3 illustrates components for implementing modular multiplication for moduli not of the form 2 k or 2 k -1 .
- Figure 5 is a table illustrating the modulo 11 value of 2 k for the k values 0 to
- Figure 6 lists the primitive roots of the prime numbers up to 100 with the prime numbers having a primitive root of 2 highlighted.
- Figure 7 is a table illustrating the modulo 11 value of 2 k for the k values 0 to 9 for both “wrapped” and “unwrapped” remainder values.
- Figures 8 and 9 are tables respectively illustrating the modulo 7 and 23 value of 2 k for both wrapped and unwrapped remainder values.
- Figure 10 lists the primitive roots of the prime numbers up to 100 with the additional available primes without a primitive root of 2 for which the methods presented here can be applied.
- Figures 11A and 11 B provide the value of 2 k for both wrapped and unwrapped remainder values for selected small primes which have primitive roots of 2 that can be used in the process, specifically for modulus values of 5, 11 , 13, 19, and 29.
- Figure 12 illustrates an example of a barrel shifter efficiently implemented with a crossbar switch.
- Figures 14 and 15 schematically illustrate implementation of a signed modulo 13 multiplication circuit as full crossbar circuit and as a folded and split crossbar circuit, respectively.
- Figures17A and17B present an embodiment for a split and folded crossbar multiplier circuit.
- Figurel 8 illustrates an NMOS only crossbar shifter circuit implementation than can be used in some embodiments.
- Figurel 9 illustrates an example of a level restorer circuit.
- Figure 20 illustrates an embodiment for a cross-connected level restoration circuit for use with the outputs of a split crossbar multiplier.
- Figures 21 and 22 illustrate the operation of the cross-connected level restoration circuit embodiment of Figure 20 for a “0” input and a weak “1” input, respectively.
- Figure 23 is a flowchart to illustrate the operation of the cross-connected level restoration circuit embodiment of Figure 20.
- Figure 25 is a remainder table for modulo 23 multiplication in wrapmagnitude format.
- Figures 26 and 27 respectively illustrate embodiments for sign bit to wrap bit conversion in an NMOS with level restoration and a CMOS embodiment for the modulo 13 example.
- Figure 28 shows an embodiment of a wrap select circuit for product wrap- to-sign conversion for a modulo 13 example.
- Figures 29A and 29B present an embodiment for a folded and split NMOS crossbar multiplier for inputs in a sign-magnitude format.
- Figures 30A and 30B present an embodiment for a folded NMOS crossbar multiplier for inputs in a sign-magnitude format for a non-primitive root of 2.
- Figure 31 is a flowchart for an embodiment of performing modular multiplication based on the embodiments of 17A, 17B, 24A, 24B, 29A, 29B, 30A, and 30B.
- Figures 32-35 provide more detail on the ordering of inputs to the crossbar arrays and their relationship to a modulus having a primitive root of 2.
- Figure 36 is a block diagram to illustrate the use of power-of-2 ordering for both folded and unfolded crossbar embodiments for RNS multiplication.
- Figure 37 illustrates an example of a computing system configured to implement embodiments of the present technology.
- RNS Residual, or Residue, Number System
- RNS Residue, Number System
- This allows for the multiplication to be broken down into a set of channels, one for each of the moduli, that can perform modular arithmetic within each of the channels.
- RMS Residual, or Residue, Number System
- This approach has advantages, as traditionally implemented, it has a number of inefficiencies.
- One is that it uses two dimensional lookup tables for modular multiplication, where such tables scale as the square of the size of the modulus.
- Another inefficiency is that for moduli not of the form 2 k or 2 k -1 , modular multiplication can be slow as it uses division in the modular reduction of the product.
- Another inefficiency is that for moduli of the form 2 k -1 for which somewhat efficient modular multiplications are possible, only a limited number of ‘small’ moduli exist (e.g., 3, 7, 15, 31 for k ⁇ 5). A limited number of moduli will limit the range of the number which can be represented by the plurality of moduli channels. Additionally, since the moduli must be pairwise co-prime, only a single moduli of the form 2 k can be used. Larger moduli of the form 2 k -1 are not desirable due to the inherent delay of carry propagation.
- the multiplication can be implemented by use of a crossbar multiplication circuit.
- a crossbar array of size ((p-1 )/2) x ((p+1 )/2) can be used to compute the magnitude of the product with logic circuitry used to compute the sign of the product.
- a cross-connected level restoration circuit is presented. For each row of the folded and split crossbar multiplication circuit, two outputs result, one active and at either a low logic level or a degraded high logic level and the other inactive and floating. The cross-connected level restoration circuit restores the logic levels for the active input and returns a low logic level for the floating input.
- Figures 1 A and 1 B are block diagrams of a computer system and a microprocessor that can be incorporated into such a computer system that implements one or more embodiments of the disclosure.
- the computer system 100 includes a computer 105, one or more input devices 101 and one or more output devices 103.
- input devices 101 include a keyboard or mouse or even a voice recognition system to support voice commands.
- output devices 103 include monitors, speakers or printers.
- the computer 105 includes a memory 107 and a microprocessor 120, where in this simplified representation the memory 107 is represented as a single block.
- the memory 107 can include ROM memory, RAM memory and/or non-volatile memory and, depending on the embodiment, can include separate memory for data and instructions.
- Figure 1 B illustrates one embodiment for the microprocessor 120 of Figure 1A and also includes the memory 107.
- the microprocessor 120 includes control logic 125, a processing block 140, an input interface 121 , and an output interface 123.
- the dashed lines represent control signals exchanged between the control logic 125 and the other elements of the microprocessor 120 and the memory 107.
- the solid lines represent the flow of data and instructions within the microprocessor 120 and between the microprocessor 120 and the memory 107.
- the processing block 140 includes combinatorial logic 143 that is configured to execute instructions and registers 141 in which the combinatorial logic stores instructions and data while executing these instructions.
- combinatorial logic 143 is connected to the memory 107 to receive and execute instruction and supply back the results.
- the combinatorial logic 143 is also connected to the input interface 121 to receive input from input device(s) 101 and/or other sources.
- the combinatorial logic 143 is also connected to the output interface 123 to provide output to output device(s) 103 and/or other destinations.
- a microprocessor such as the microprocessor 120, may perform a range of different functions including arithmetic operations. For example, a microprocessor may perform addition, subtraction, multiplication, multiply-accumulate, and/or division. Aspects of the present technology may enable efficient multiplication operations to be performed by a microprocessor. More generally, aspects of the present technology can more generally be applied to embodiments for one or more central processing units (CPUs), graphic processing units (GPUs), artificial intelligence (Al) accelerators, Tensor Processing Units (TPUs), and/or any other digital logic that performs multiplication or multiplication as part of a multiply-accumulate operation.
- CPUs central processing units
- GPUs graphic processing units
- Al artificial intelligence
- TPUs Tensor Processing Units
- Multiplication operations may be performed by one or more multiplication circuits in the microprocessor 120.
- one or more of the VU 146, the ALU 147 and the FPU 149 may include one or more multiplication circuits.
- Other elements of combinatorial logic 143 may alternatively or additionally include one or more multiplication circuits.
- the present technology is not limited to multiplication circuits or multiply-accumulate units in any particular location (in a microprocessor or elsewhere) or used for any particular purpose and can be applied wherever efficient multiplication or multiply-accumulate operations are desired.
- a number of different approaches may be used to perform multiplication. The embodiments presented in the following discussion are based on the use of the Residual, or Residue, Number System (RNS). Before discussing these embodiments, some discussion is given of the Residual Number System.
- the residual number system is based on the use of modular arithmetic and a property known as the Chinese Remainder Theorem. If mi, ..., n ⁇ are integers greater than 1 (often called moduli or divisors), then the Chinese remainder theorem asserts that if the ny are pairwise coprime, and if ai , ... , ak are integers such that 0 ⁇ a z ⁇ mi for every /, then there is one and only one integer x, such that 0 ⁇ x ⁇ N and the remainder of the Euclidean division of x by ny is a z for every /.
- M denotes the product of k pairwise relatively prime moduli
- M nik-i X • • • X mi x mo , then M is the number of different representable values in the RNS and is known as its dynamic range. This allows any number within this dynamic range to be expressed as a vector of its values for each of the moduli.
- a number in this format is made up of four components, one for each modulus, and referred to an RNS(8
- 3) format Represents 0 or 840 or ⁇ ⁇ ⁇ Represents 1 or 841 or ⁇ ⁇ ⁇ Represents 2 or 842 or ⁇ ⁇ ⁇ Represents 8 or 848 or ⁇ ⁇ ⁇ Represents 21 or 861 or ⁇ ⁇ ⁇ Represents 64 or 904 or - - Represents -70 or 770 or - - Represents -1 or 839 or ⁇ ⁇ ⁇ ⁇
- modulo 840 can be similarly expressed in the RNS(8
- the operations can be on a component by component basis, using the corresponding modulus, Performing arithmetic for integers in this can have a number of advantages.
- 3) is illustrated in Figure 2A.
- operations can individually be performed pair-wise on each component. This allows arithmetic operations for numbers in such an RNS format to be separated out into independent RNS channels, as illustrated in Figure 2B.
- Figure 2B illustrates an embodiment for the structure of an adder, subtractor, or multiplier for numbers in the RNS format using the RNS(8
- At top are shown two operands, Operand 1 and Operand 2, in the RNS(8
- the modulus 8, 7, 5, and 3 components of each operand go to the corresponding mod-8 unit 201 , mod-7 unit 203, mod-5 unit 205, and mod-3 unit 207 to undergo pair-wise operations in each RNS channel using the corresponding modular arithmetic.
- the components of the results can then be stored in the result register 221 , which can be used in subsequent RNS computations or converted out of the RNS format when needed.
- the RNS approach illustrated with respect to Figure 2B has a number of advantages. It solves the carry-propagation problem, as carry-propagation is limited to a few bits. It is faster, as for example, 4x4 multipliers are considerably more than 4 times simpler than a 16x16 multiplier. It is also smaller as a 4x4 multiplier is ⁇ 1/16 the size of a 16x16 multiplier (power and area scaled quadratically). Additionally, small residue arithmetic operations can be implemented with direct table lookup. This approach does require conversion between the RNS format and binary, and some arithmetic operations (e.g., division, sign detection, overflow detection) are more difficult, but the following is concerned with multiplication, where it can have a number of advantages.
- RNS moduli where two of the considerations are channel width and the number of channels. If the objective is to reduce the addition/multiplication execution time, then a large number of small moduli is desirable, as the execution time is determined by the largest modulus and Power/Area of multipliers proportional to n 2 .
- moduli One approach to selecting moduli is to use moduli of the form 2 k or 2 k -1 for ease of implementation.
- the total number of bits can be minimized by using an efficient binary representation — i.e., maximize the dynamic range by selecting a modulus, mi, that equals 2 k or is very close to it (2 k -1 ), for example. Only a single mi of the form 2 k because the moduli must be relatively prime.
- Another approach is to use the smallest prime moduli for which a symmetric complete residue system exists.
- Figure 3 illustrates components for implementing modular multiplication for moduli not of the form 2 k or 2 k -1 .
- Multiplier block 301 receives two n-bit operands A and B and multiplies these together to generate the 2n-bit result of AB. To turn this into the n-bit mod-m product of these numbers,
- implementing modular reduction requires division, which is a time consuming multistep operation that slows down the process.
- the embodiments below present a simple and efficient method to multiply moduli not of the form 2 k or 2 k -1 .
- cp(p) p-1.
- the table of Figure 4 illustrates an example.
- Figure 5 is a table illustrating the modulo 11 value of 2 k for the k values 0 to 9.
- the top row is the values of k, with the corresponding value of 2 k below.
- the mod 11 reductions of 2 k , or remainders, are in the bottom row.
- the prime number 11 has a primitive root of 2 2
- An example calculation can be the multiplication of 3 (mod 11 ) x 6 (mod 11 ).
- the first operand is represented as a 1 -hot vector of (p-1 ) elements in which the location of the 1 is based on the location of the operand in the bottom row, so that for the example of 3, this is represented by 1 -hot vector 0000000010.
- the second operand is represented by a rotate amount, where the amount of rotate is, for an operand value of k, the modular reduction of 2 k , so that for the example of 6, the amount of rotate is amount of 9.
- the 1 -hot vector of the first operand is right rotated by the rotate amount of the second operand, so that to compute 3 (mod 11 ) x 6 (mod 11 ), 0000000010 is rotated right by 9 bits.
- the result is 0000000100 which corresponds to 7, which is the correct result.
- this approach allows for modular multiplication with respect to prime numbers that have a primitive root of 2 to be multiplied using a simple rotation.
- Figure 6 lists the primitive roots of the prime numbers up to 100.
- the primes (up to 100) which have a primitive root of 2 are: 3,5,11 , 13,19,29,37,53,59,51 ,67,83.
- both unwrapped and wrapped remainders can be used, as can be illustrated with respect to Figure 7.
- Figure 7 is a table illustrating the modulo 11 value of 2 k for the k values 0 to 9 for both “unwrapped” and “wrapped” remainder values.
- Figure 7 repeats the elements of Figure 5, but also adds a bottom line for wrapped remainder values.
- For each unwrapped remainder x, there exists a wrapped remainder, y, such that y x - p, where p is the moduli, y is called the additive inverse of x.
- Figures 8 and 9 are tables respectively illustrating the modulo 7 and 23 value of 2 k for both unwrapped and wrapped remainder values. These figures are laid out similarly to Figure 7, but also include lower row “Check” listing the sum of the unwrapped and wrapped remainder values directly above to check that they add up to 7 or 23, respectively.
- modulo 7 has primitive roots 3
- modulo 23 has primitive roots 5, 7, 10, 11 , 14, 15, 18, 19, 21 , 26, 27.
- Figures 11A and 11 B provide the value of 2 k for both unwrapped and wrapped remainder values for additional smaller primes that can be used in the process, specifically for modulus values of 5, 11, 13, 19, and 29.
- the following discussion considers embodiments for the circuitry to implement the residual number system multiplication techniques described above. Specific embodiments include a crossbar based multiplication circuit that is both scalable and efficient. Relative to prior art modular multiplication circuits, such as combinational multipliers or Wallace tree multipliers, the crossbar based approach presents smaller and more power efficient circuitry.
- Figure 12 illustrates an example of a barrel shifter efficiently implemented with a crossbar switch.
- a barrel shifter is a digital circuit that can shift a data word by a specified number of bits without the use of any sequential logic and can be used to present a relatively simple implementation of a crossbar circuit as an introduction to residual number system crossbar multiplication units presented further below.
- the example of Figure 12 is a 4X4 example, with an input (xo, xi, X2, xs) that is a 1 -hot vector (i.e., one entry “1”, the rest “0”), a shift count of 0-3, and an output (yo, yi, y2, ys) corresponding to the input shifted (mod 4) by the shift amount.
- the shift count is received at a decoder 1201 that converts the shift count to another 1 -hot vector specifying a shift amount of either shift 0, shift 1 , shift 2, or shift 3.
- the barrel shifter is a square array of switches (4x4 in this example) connected between the input (xo, xi , X2, xs) vector and the output vector (yo, yi , y2, ys) and controlled by the shift value.
- the shift values run diagonally upward across the array, looping back from the bottom.
- switch 1203 connects input xo to output ys and is controlled by the shift 3 input, that an input of (1 , 0, 0, 0) becomes an output of (0, 0, 0, 1 ) in response to a shift of 3.
- this connects the input of X3 to the output of yi and is controlled by shift 2 such that an input of (0, 0, 0, 1 ) becomes an output of (0, 1 , 0, 0) for a shift of 2.
- the inputs run from xo to X3 left to right, with the shift and output indices increase from bottom to top.
- the array structure of the crossbar switch can be made quite dense, similarly to a DRAM device.
- multiplier optimization based on the unwrapped/wrapped approach can reduce this to an (n-1 )/2 x (n-1 )/2 array for a 75% area savings.
- the circuit can be fast and, such as by using a single NMOS pulldown device, small.
- the size of the array grows as n 2 , this can be mitigated by use of a “folded” crossbar arrangement based on wrapping, as described below with respect to Figures 14 and 15, for example.
- the crossbar switch can also result in degraded logic “1” output levels when only NMOS pass gates are used, but level restoring circuitry as presented below can overcome this problem.
- Figure 13 illustrates both wrapped and unwrapped remainder values (wrapped values are the additive inverses of the unwrapped values) for the example of modulus value of 13 when using negative remainders.
- Figure 13 repeats the mod 13 table of Figure, but with the remainders greater than 6 replaced with negative values: for example, -1 replaces 12, -2 replaces 11 , and so on.
- Figures 14 and 15 schematically illustrate implementation of a modulo 13 multiplication circuit as full crossbar circuit and as a folded and split crossbar circuit, respectively.
- the input vector is the (n-1 ) component 1 -hot input vector
- the rotate amount the rotate amount as an (n-1 ) component 1 -hot vector
- resulting in a crossbar multiplier of size (n-1 ) 2 avoiding the larger (n-1 ) component 1 -hot output vector.
- the crossbar array is folded, in that the product is between only one of +/- 1 , 2, 3, 4, 5, 6 is used, and split, in that array is split into a wrapped and unwrapped halves with the outputs divided between the wrapped and unwrapped values.
- the values used are then 1 , 2, 4, -5, 3, and 6, but where the sign needs to be adjusted as described in more detail below.
- each non-zero value for both inputs has a corresponding input line after decoding.
- one or more of the input lines for each input is shared by a pair of input values, where an additional piece of information (e.g., a sign or wrap value) is used to distinguish between the pair of values sharing an input line.
- folded is used to refer to when the inputs have been fully folded, or folded in half, so that for each input line for both inputs is shared by two of the non-zero inputs.
- a folded and split crossbar implementation reduces the size of the needed crossbar array from (n-1 ) 2 to ((n-1 )/2) 2 , or a quarter in size, requiring less space on the logic die.
- the size of the crossbar array is reduced by a quarter, as described with respect to Figures 29A and 29B, for example, additional circuitry is needed to sort out the sign values of the inputs. Consequently, for smaller moduli, a full crossbar implementation may take up less area than a folded and split crossbar implementation.
- the optimal implementation may use a full crossbar for, say, modulo 5 and below, with the folded and split crossbar used for modulo 7 and above.
- the recued number of values are those used in the folded and split crossbar embodiment of Figure 15 for the inputs, where the wrap or sign value is accounted for by additional logic circuitry, an example of which is shown in Figure 17B.
- the circuitry includes the crossbar barrel shifter that calculates a product magnitude and the circuitry to calculate product wrap value, where the magnitude and wrap value together give the final product value.
- Approximately half of the crossbar cells (those above lower left-to-upper right diagonal with black fill, such as 1711 and 1717) calculate a wrap condition, which indicate that Operand A has been wrapped off the end of the barrel shifter by rotation by Operand B.
- the other crossbar cells (those on and below the diagonal with no fill, such as 1713 and 1715) are the products that have not wrapped off the end of the barrel shifter.
- the inputs to the crossbar array of operand A and operand B are received as magnitude values, where if not already in the appropriate 1-hot vector format (such as from a previous operation in this format) these are respectively converted in Decoder A 1703 and Decoder B 1701. These inputs can then be applied to the crossbar array as described above with respect to Figure 15.
- the outputs from the wrapped switches and the unwrapped switches in each row are kept separate in each row to use in determining the product wrap value. For example, in the top row, the output of wrapped element 1711 goes to the lower output line of the first row while the output of unwrapped element 1713 goes to the upper output line of the row. Similarly, in the third row wrapped switch 1717 goes to the lower line and unwrapped switch 1715 goes to the upper line. (As the bottom row has only unwrapped values, it only has a single output.)
- the pairs of outputs from each row of the barrel shifter go a corresponding one of the OR gates 1721 , so that if the output on either of the lines is a “1”, the combined result is a “1”.
- the different components correspond to the shown values of 1 , 2, 4, 5, 3, 6, where, since all elements of the bottom row are unfolded, an OR gate is not needed for the bottom row.
- the product magnitude is in the 1-hot vector format and, if the magnitude is not to be kept in this format (such as for a subsequent operation), a corresponding one of the set of AND gates 1725 received the values.
- the second input of each of the AND gates is the corresponding value in a 4-bit decimal value, i.e. , 4’d1 is 1 in the format (0001 ).
- the resultant outputs of the AND gates 1725 are then the product magnitude.
- the series of AND gates 1723 are part of the product wrap determination circuitry. These each receive the output of the corresponding OR gate 1721 with its 1 -hot vector value and the other input corresponding to the lower line of the row. The outputs of the AND gates 1723 are then inputs to the OR gate 1733, so that if any of the AND gates 1723 output a “1 ”, the OR gate provides a “1 ” for the wrap select values.
- the operands A and B also have, in addition to the magnitude inputs, a wrap input in the folded embodiment.
- the operand A wrap bit and operand B wrap bit are inputs to the XOR gate 1731 , whose output then is combined with the wrap select value in XOR gate 1735 to generate the product wrap output value.
- the circuitry of Figure 17B can be replaced with the circuitry of Figure 29B as presented below.
- Figure18 illustrates an NMOS only crossbar shifter circuit implementation that can be used in some embodiments.
- a single NMOS for each element of the crossbar shifter array, such as NMOS1801 connected between operand A[0] input and output Out[0] and with its control gate connected to operand B[0]
- the size the array can use a smaller area than an NMOS/PMOS transmission gate implementation.
- the crossbar circuit array devices can be implemented in any semiconductor technology, including but not limited to planar transistors, finFET transistors, and ribbonFET, gate-all-around (GAA) or nano-wire or nano-sheet transistors.
- Figure19 illustrates an example of a level restorer circuit.
- the circuit acts as an inverter, but with well-defined output logic values of either ground for “0” or VDD for “1”, even for a weak “1” input.
- PMOS1903 When receiving a “0” value or ground, PMOS1903 will be on and NMOS1901 off, so that the Out value will be VDD and the full “1” logic level and, since Out is high, PMOS1905 is off and the input stays at ground.
- NMOS1901 may not fully be on and PMOS1903 may not fully be off, resulting in the Out value being pulled down somewhat, but not fully to a well-defined “0” value of ground.
- Figure 20 illustrates an embodiment for a cross-connected level restoration circuit for use with the outputs of a split crossbar multiplier.
- FIG left of Figure 20 is a portion 2001 of a split crossbar showing four elements of one row.
- the right most input /SOo here is used to represent NOT, which is shown in the figures as a bar on top of the name
- /SOo is unwrapped and connected to the output of /xbar and the other three inputs are wrapped and connected to the output of inverse of /(xbar_wrap).
- the two outputs were independent, they could respectively use the level restoration circuit of 2020 or 2025 to restore weak “1” values of (VDD - Vth) to the full VDD logic level; however, the two values are related in that the gate selects on one output will be always inactive, causing high impedance output on that crossbar output section.
- the cross-connected level restoration circuit is introduced, with an embodiment shown at 2021 .
- the embodiment of Figure 20 includes cross-connected pull-ups to restore low-z state on both outputs.
- the pull-ups of PMOSs 2031 and 2033 respectively have their gates connected to the input and the output of level restoration circuit 2025.
- the pull-ups of PMOSs 2035 and 2037 respectively have their gates connected to the input and the output of level restoration circuit 2023.
- PMOS 2033 and PMOS 2037 and the other pull-up PMOSs can be weak devices relative to the transistors forming the inverters of 2023 and 2025 and relative to other elements in the multiplication circuit.
- the circuit 2021 is needed since the split crossbar circuit is being used to compute two separate results with the same crossbar circuit: the crossbar to calculate a product magnitude; and the same crossbar also generates the wrap indication.
- the cross-connected level restoration circuit 2021 is symmetric between its inputs, if S1o is low, /xbar is floating and /(xbar_wrap) will either be a “0” or a weak “1” based on the value of /SO3, /SO2, or/SOi respectively depending on which of S11, SI2, or SI3 is on.
- the cross-connected level restoration circuit 2021 will output a “0” for either a ”0” or weak “1” at the non-floating input; and for the floating input, a ”0” or weak “1” will respectively result in a (full VDD value) “1” or a “0”. This operation is illustrated with respect to Figures 21 and 22.
- Figures 21 and 22 illustrate the operation of the cross-connected level restoration circuit embodiment of Figure 20 for a “0” input and a weak “1” input, respectively.
- the value would be swapped between the upper path and the lower path.
- PMOS 2033 As xbar_wrap is “0”, it will also be ON. However, PMOS 2033 is formed to be a weak device relative to the other transistors setting the /xbar level so that, even though it is ON, its output to the /xbar input to level restorer 2023 is overridden by the “0” value provided at the input by /SOo. For example, PMOS 2033 can be sized to provide low current when ON, as represented by the broken line arrow. Similarly, in the symmetric embodiment of Figures 20, PMOS 2037 will be a similarly weak device.
- the other pull-up PMOSs can also be relatively weak devices.
- Figure 22 considers when /xbar is a weak “1” and /(xbar_wrap) is again floating.
- /xbar is now a strong “1”
- Figure 23 is a flowchart to illustrate the operation of the cross-connected level restoration circuit embodiment of Figure 20.
- the cross-connected level restoration circuit 2021 receives first and second inputs (e.g., /xbar and /xbar_wrap), where either one of the inputs can be active and the other one of the inputs is inactive.
- the inactive input is floating and the active input either at a low logic value (“0”) or a degraded high logic value (weak “1”).
- the corresponding outputs for the two inputs are generated.
- the corresponding output is generated by the level restore circuit of the active input, either 2023 or 2025, resulting full high logic level (“1”) for a low logic level input or a low logic level (“0”) for the weak “1” of the degraded high logic input.
- the levels on the active input’s path then bias the cross-connected pull-up transistors (e.g., PMOSs 2031 , 2033, 2035, 2037) to set the output corresponding to the inactive input to a low logic level.
- PMOSs 2031 , 2033, 2035, 2037 bias the cross-connected pull-up transistors
- Figure 24A has NMOSs for the switches, adds in the cross-connected level restoration circuits, and adds in 0 detection.
- each of the switches of Figure17A is now implemented as a single NMOS, such as a finFET, ribbonFET, gate-all-around (GAA), or nano-wire or nano-sheet transistors, for example.
- the operand B magnitude decoder now includes a 0 decode value 2451 .
- the zero decode in crossbar ensures low-impedance on one of the two sets of crossbar outputs. Consequently, when operand B has zero magnitude, each of the unwrapped outputs will be pulled high (e.g., VDD-Vth), which will then be turned into a “0” by the subsequent level restores 2443.
- the cross-connected level restorers 2443 are connected to receive the pair of outputs of each row to provide well-defined logic outputs of either ground or VDD, described above with respect to Figure 23. As the cross-connected level restorers 2443 also invert their input values, each operand A input now receives an inverter 2441 to cancel out the inversion of the level restorers 2443. As discussed above with respect to Figure17A, the bottom row of the shifter array has all unwrapped switches, so that a more conventional level restore circuit 2445, as in Figure 19.
- a “1” value (1 ’b1 ) can supplied to as the second input, as this value will be inactive when the first input is active.
- the outputs of the level restore circuits 2443 can then pass on to the rest of the circuit, which can be implemented as in Figure 24B.
- Figure 24B repeats the elements of Figure 17B, which are similarly numbered (e.g., XOR 1741 is now XOR 2431 ), but adds the three input AND gate 2439.
- AND block 2439 insures that the output product sign is zero when either of the input sources are zero. Put another way, this is a three input AND gate.
- One input is the output of XOR gate 2435, which represents the output sign when both sources are non-zero.
- the other two inputs to the AND gate 2439 have inverted inputs, which mean that both of these inputs (Operand A Decode 0, Operand B Decode 0) must be inactive (logic 0) for the output sign bit from the XOR gate 2435 to pass through to the output. If either of these two inverted inputs are logic T (indicating that one or both of the sources equal to zero), then the product sign will be forced to zero.
- the circuitry of Figure 24B can be replaced with the circuitry of Figure 29B as presented below.
- crossbar multipliers discussed so far have used a wrap-magnitude encoding.
- the following embodiments instead consider a signmagnitude encoding.
- the sign-magnitude encoding format can be implemented with less decoding than binary encoding in the folded multiply crossbar, since instead of needing to decode (for example) “1” and “12” separately from binary and then OR-ing them, there is only a single decode (for magnitude).
- Figures 16 and 25 provide examples of remainder tables that can be used to illustrate converting input sign bits to wrap bits.
- Figures 16 and 25 are respectively remainder tables for modulo 13 and 23 multiplication in wrap-magnitude format.
- the “-5”/“5” values are bolded to indicate that the sign bit does not equal the wrap bit.
- the columns with the lighter background indicate remainders for which the sign bit does not equal the wrap bit. Therefore, when decoding the inputs, if the inputs match an entry for which the sign and wrap bits differ, a “flip” bit can be set to be XOR-ed with the sign bit.
- the multiplier can be implemented as an NMOS network (with level restoration) to perform this logic OR function or as a full CMOS circuit using with 1 -hot inputs.
- Figures 26 and 27 illustrate embodiments for these two implementations.
- Figures 26 and 27 respectively illustrate embodiments for sign bit to wrap bit conversion in an NMOS with level restoration and a CMOS embodiment for the modulo 13 example, where the MOSFETs can again be implemented in a FinFET embodiment.
- the only negative value is for 5, while the other values (and the unshown 0 value) are all nonnegative.
- the 5 input pulls down the output (/flip) to ground, while the other input values pull up the output up to VDD - Vth.
- a level restorer 2601 converts the output /flip to flip, so that in the NMOS (with level restoration) implementation the inverses for the input can be used to undo the inversion.
- the inverse of the 5 value is used to supply the gate voltage of a PMOS pull up, this the other (non-negative) values are all applied to a corresponding gate of an NMOS pull down.
- a level restorer is not needed to provide the final output flip value as the logic values at the output are well-defined and level restoration is not needed.
- Figure 28 shows an embodiment of a wrap select circuit for product wrap- to-sign conversion for a modulo 13 example.
- the wrap select circuit includes a determination circuit 2801 in an NMOS embodiment that then is followed by a level restoration circuit 2803 similar to those described above to provide well-defined logical values for the wrap select output.
- Other embodiments can be CMOS based, although these embodiments would typically require greater area and complexity.
- the circuit of Figure 28 selects the final wrap bit from the crossbar product outputs that the multiplier circuit uses to create the product sign bit and is scalable with the folded crossbar.
- xbar or xbar wrap When the product row that matches the result, either xbar or xbar wrap will be a logic “1”, while for all other product rows, both of xbar and xbar wrap will be a logic “0”.
- Xbar 5 is the unwrapped output for output value 5 and is used (rather than xbar wrap 5) because the wrap bit for output 5 is different from the sign bit for output 5 (i.e., +5 is wrapped, while -5 is unwrapped for modulo 13).
- the determination circuit 2801 also includes the “0” decode to ensure that all product outputs are covered.
- Figure 29A repeats the elements of Figure 24A, with a decoder 2901 to provide Operand B in a 1 -hot format if received as a magnitude and with a decoder 2903 to provide Operand A in a 1 -hot format if received as a magnitude.
- the 1 -hot format version is optimal for chained (i.e., sequentially dependent) multiplications as it eliminates 1-hot encoding/decoding to/from binary and if an operand is received the 1 -hot format, the corresponding decoder (2901 , 2903) is not needed.
- the embodiment of Figures 29A includes elements related to the operand sign inputs.
- the input values in the 1 -hot vector format are inputs into the Sign to Wrap circuit 2913, such as illustrated by the embodiments of Figure 26 or 27.
- the result sign flip value for operand B and sign value for operand B are then inputs to the XOR gate 2915.
- operand A the input values in the 1-hot vector format and the 0 value from operand B are inputs into the Sign to Wrap circuit 2911 , that can also be as illustrated by the embodiments of Figure 26 or 27.
- the XOR gate corresponding to 2915 is in Figure 29B.
- the operand A sign flip value, the outputs of the lever restorers, and the XOR 2915 continue on to the circuitry of Figure 29B.
- XOR logic circuit 2927 also receives the output of XOR 2915 for the B operand and output of XOR 2917 for the A operand, where XOR has inputs of the operand A sign flip value from sign to wrap circuit 2911 and the operand A sign value.
- the output of XOR 2927 is then one of the inputs to AND gate 2929, with additional inputs the inverted values of when Operand A decodes as 0 and when operand B decodes as 0.
- the resultant output of the circuit of Figures 29A and 29B is then a sign value for the product and the product either in 1-hot vector format (if AND gates 2923 are not included) or a magnitude format (if AND gates 2923 are used).
- Figures 29A and 29B present a folded and split crossbar multiplier for the case that the modulus has a primitive root of 2, specifically for a modulo 13 example.
- the situation is different for moduli that are one of the added primes in Figure 10, but that do not have a primitive root of 2.
- the multiplication circuit only uses the first half of the remainders (i.e., for RNS-13, instead of remainder range of 0-12, the circuit of Figures 29A and 29B uses a remainder range of -6 to +6).
- RNS-13 as shown in Figure 11 A a remainder of +8 is equivalent to a remainder of -5, so that for any remainder in which a negative appears in the top row (i.e., 8 or 5), that the sign-to-wrap indicator is set to a ‘T, else it is set to a ‘O’. Consequently, for RNS-13, an input of 5 has a sign-to-wrap indicator of “1”, with all other inputs have sign-to-wrap indicator of “0”.
- Figures 30A and 30B illustrate an embodiment for implementing [00133]
- Figures 30A and 30B are laid out similarly to Figures 29A and 29B, although the equivalents of decoders A 2901 and B 2903 for the inputs and 2923 for the output are not shown.
- the decoded operand A is supplied to both the crossbar array and to the sign to wrap circuitry 3011 (along with the 0 value for operand B) to provide an operand A sign flip.
- the decoded operand B is supplied to both the crossbar array and to the sign to wrap circuitry 3013 to provide an operand B sign flip, which is an input to XOR 3015 along with the sign input of operand B.
- the crossbar array is folded, but not wrapped, so that there is only one output per row and a conventional level restorer, illustrated in Figure 19, can be used.
- the non-primitive root- of-2 embodiment of Figures 30A and 30B has no product wrap, so split crossbar array (as no product wrap) and no cross-connected level restore (because there in no split crossbar array), so that a more conventional level restore circuitry can be used. Consequently, in this embodiment the non-primitive root of 2 circuit can be simpler than the primitive root of 2 circuit.
- the non-primitive root of 2 must meet the following criteria: The first half of the moduli table must have unique values which represent exactly half of the remainders; and the second half of the table must exactly repeat the first half of the table in the same order.
- Figure 31 is a flowchart for an embodiment of performing modular multiplication based on the embodiments of17A,17B, 24A, 24B, 29A, 29B, 30A, and 30B, where Figures 29A and 29B will be used as the exemplary embodiment for presenting the steps of Figure 31.
- a first modulo p input number is received in a sign-magnitude format, where p is a prime number.
- step 3103 which can be concurrent with step 3101 , a second modulo p input number is received in a sign-magnitude format.
- the sign value is received at XOR gate 2915 and the magnitude is received at the decoder 2901 , although again if the magnitude is already in a 1 -hot vector format, the input can go directly to the switch array and the sign to wrap circuit 2913.
- a magnitude of a product of the first modulo p input number and the second modulo p input number is determined by providing the magnitude of the first modulo p input number and the second modulo p input number as respective first and second inputs of the folded and split crossbar multiplier circuit of Figure 29A, along with the OR gates 2921 of Figure 29B.
- the sign value of the product of the first modulo p input number and the second modulo p input number is determined from the sign value of the of the first modulo p input number and the sign value of the of the second modulo p input number using the XOR gates 2915, 2917, 2927 and the wrap select circuit 2925 in step 3107.
- a dense NMOS folded and split crossbar circuit replaces the conventional modular multiplier.
- a regular NMOS crossbar structure far less area is needed that for a conventional modular multiplier for a wide variety of RNS moduli.
- such embodiments require no combinatorial multipliers or adders and no division to obtain remainder values; expand available moduli beyond conventional 2 n , 2 n -1 , or 2 n +1 moduli and beyond extremely small moduli using lookup tables; and require minimal extra logic- only decoders and a small number of basic gates.
- Use of the folded and split architecture uses 75% less area in the crossbar than a “naive” (n-1 )x(n-1 ) implementation of an (un-folded, un-split) crossbar.
- Use of 1 -hot encoding in crossbar can reduce switching power over conventional modular multiplier. Additionally, this approach widens the dynamic range of multipliers by allowing use of certain non-“primitive root of 2” primes (e.g., 7, 23, 47, 79, 97 under 100).
- Figures 29A and 29B uses a regular 1 -dimensional array of NMOS devices with 1 -hot inputs for wide logic OR’s or selection of multiple decodes; that is, it can help to insure that any remaining ‘glue’ logic scales along with the main crossbar logic.
- the use of sign-magnitude encoding of RNS numbers improves RNS multiplication (as opposed to prior-art unsigned binary encoding).
- Use of a “cross-connected levelrestoration circuit” enables correct electrical operation of the split crossbar.
- Figures 32-35 provide more detail on the ordering of inputs to the crossbar arrays and their relationship to a modulus having a primitive root of 2. More specifically, Figures 32-34 consider the case of modulo 13 multiplication, with Figures 32 and 33 considering a full crossbar arrangement (as represented schematically in Figure 13) and Figure 34 considering a folded crossbar arrangement (as represented schematically in Figure 16).
- Figure 32 illustrates an RNS-13 full nxn crossbar multiplier with the inputs values in sequential order, running from 1 to 12 for both the inputs of both source 0 and source 1. The array shows the resultant RNS-13 products that, as illustrated by the highlighted “1” output, have no regular pattern and cannot be extracted along rows.
- Figure 33 illustrates an RNS-13 full nxn crossbar multiplier with the inputs values in a modulo power-of-two ordering (i.e., 1 , 2, 4, ... , in mod-p arithmetic), which is used in the embodiments presented here, both folded and unfolded.
- a modulo power-of-two ordering i.e., 1 , 2, 4, ... , in mod-p arithmetic
- all of the values can be positive and there is not the need for use of a sign or wrap value to distinguish between inputs and output, as is needed in the folded crossbar embodiments.
- the modulo power-of-two source ordering of Figure 33 the products are aligned row-by-row, making it very easy to extract the product values.
- Figure 34 illustrates an RNS-13 full folded crossbar multiplier with the inputs values in a modulo power-of-two ordering.
- the crossbar array is “fully” folded in that each of the inputs for both Source 0 and Source 1 represent to two inputs and each output represents two outputs, where the pairs of values for both inputs and outputs are differentiated by an additional piece of information, such as the wrap or sign values described above with respect to Figures 8, 9, and 11A-16.
- the folded crossbar of Figure 34 is an optimized size array version of the full crossbar implementation of Figure 33, with the crossbar array being of the size, although it does require the additional circuit to keep track of the sign values.
- the full modulo power-of-two ordered crossbar array may require less area, while the folded and split crossbar array will be preferred for the other, larger moduli. Because of this, in some embodiments for arithmetic logic units, the first one or several moduli may use an unfolded implementation, with the other moduli will use a folded and split implementation.
- Figure 35 illustrates what happens when a modulus does not have a primitive root of 2 and is attempted to be used in an implementation of a power-of-2 full crossbar.
- Figure 35 considers the example of RNS-17 power-of-2 source ordering. Considering the Source 0 input, staring from left the values are 1 , 2, ... , 9, and then start over again at 1 . The is also seen for the Source 1 input values and for the products, such that there are missing and repeated source and product values when applying a modulo power- of-two ordering to full array when the modulus does not have a primitive root of 2.
- Figure 36 is a block diagram to illustrate the use of power-of-2 ordering for both folded and unfolded crossbar embodiments for RNS multiplication.
- the inputs Source 0 and Source 1 are initially received and, if in sequential or other, non-power- of-2, ordering at respective source decoders 3501 and 3603.
- the decoders 3601 and 3603 also arrange the inputs in power-of-2 ordering, where these can be in folded or unfolded format, which are then applied to the crossbar array 3605, which will correspondingly be a full (p-1 )x(p-1 ) array (for non-zero values), a folded and split ((p-1 )/2)x((p-1 )/2) array (for non-zero values), depending on the embodiment.
- the output will be unfolded, while for a folded embodiment the output will include a sign (or wrap) value, as well as a magnitude.
- unfolding circuitry 3607 is included for the final output.
- Some embodiment can also be “partially” folded, where some, but less than all, of the inputs are folded and shared by two unfolded inputs that are folded to a signal magnitude differentiated by a sign or other distinguishing value.
- the (non-zero) portions of the crossbar array is of size (p-1 )x(p-1 ) and inputs each of size (p-1 ).
- the crossconnected level restorers 2443 can be replaced with a standard level restorer as in Figure 19.
- Figure 29B as the operands and output are no longer folded, these elements are not needed in a full crossbar embodiment.
- the connections of the crossbar array represent the topology of the connections of the switches, but the actual layout on a die may be arranged in different ways, with corresponding changes in routing.
- Figure 37 is a high-level block diagram of a computing system 3701 that can be used to implement various embodiments of the microprocessors described above.
- the computing system 3701 is a network system 3701.
- Specific devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device.
- a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc.
- the network system may comprise a computing system 3701 equipped with one or more input/output devices, such as network interfaces, storage interfaces, and the like.
- the computing system 3701 may include a central processing unit (CPU) 3710, a memory 3720, a mass storage device 3730, and an I/O interface 3760 connected to a bus 3770, where the CPU can include a microprocessor such as described above with respect to Figures 1A-B.
- the computing system 3701 is configured to connect to various input and output devices (keyboards, displays, etc.) through the I/O interface 3760.
- the bus 3770 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus or the like.
- the CPU 3710 may comprise any type of electronic data processor, including the microprocessor 120 of Figure 1 B, which includes divide circuits (e.g., any of the divide circuits described above).
- the CPU 3710 may be configured to implement any of the schemes described herein with respect to division, using any one or combination of steps described in the embodiments.
- the memory 3720 may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like.
- the memory 3720 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
- the mass storage device 3730 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 3770.
- the mass storage device 3730 may comprise, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
- the computing system 3701 also includes one or more network interfaces 3750, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or one or more networks 3780.
- the network interface 3750 allows the computing system 3701 to communicate with remote units via the network 3780.
- the network interface 3750 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas.
- the computing system 3701 is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like.
- the network interface 3750 may be used to receive and/or transmit interest packets and/or data packets in an ICN.
- the term “network interface” will be understood to include a port.
- the technology described herein can be implemented using hardware, firmware, software, or a combination of these.
- these elements of the embodiments described above can include hardware only or a combination of hardware and software (including firmware).
- logic elements programmed by firmware to perform the functions described herein is one example of elements of the described multiplication circuit.
- a multiplication circuit can include a processor, FGA, ASIC, integrated circuit or other type of circuit.
- the software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein.
- the processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
- Computer readable media may comprise computer readable storage media and communication media.
- Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer.
- a computer readable medium or media does (do) not include propagated, modulated or transitory signals.
- Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
- modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
- some or all of the software can be replaced by dedicated hardware logic components.
- illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Applicationspecific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc.
- FPGAs Field-programmable Gate Arrays
- ASICs Application-specific Integrated Circuits
- ASSPs Applicationspecific Standard Products
- SOCs System-on-a-chip systems
- CPLDs Complex Programmable Logic Devices
- special purpose computers etc.
- software stored on a storage device
- the one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces.
- each process associated with the disclosed technology may be performed continuously and by one or more computing devices.
- Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Computer Hardware Design (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Complex Calculations (AREA)
- Logic Circuits (AREA)
- Advance Control (AREA)
- Detection And Correction Of Errors (AREA)
Abstract
Un circuit de multiplication qui multiplie deux nombres sélectionnés, modulo p de manière rapide et efficace, p étant un nombre premier. La multiplication peut être mise en œuvre au moyen d'un circuit de multiplication crossbar. Dans un mode de réalisation crossbar plié et divisé, pour des opérandes modulo p dans un format de signe-magnétude, un réseau crossbar de taille ( (p-1)/2) x (p +1)/2 peut être utilisé pour calculer la magnitude du produit avec des circuits logiques utilisés pour calculer le signe du produit.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363608653P | 2023-12-11 | 2023-12-11 | |
| US63/608,653 | 2023-12-11 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2024077312A2 true WO2024077312A2 (fr) | 2024-04-11 |
| WO2024077312A3 WO2024077312A3 (fr) | 2024-10-17 |
Family
ID=90059595
Family Applications (3)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/012404 Pending WO2024077313A2 (fr) | 2023-12-11 | 2024-01-22 | Circuit de restauration de niveau interconnecté |
| PCT/US2024/012403 Pending WO2024077312A2 (fr) | 2023-12-11 | 2024-01-22 | Unité crossbar de multiplication de système de nombre résiduel plié |
| PCT/US2024/057160 Pending WO2025024868A2 (fr) | 2023-12-11 | 2024-11-22 | Multiplication de calcul en mémoire (cim) de système de nombres résiduels |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/012404 Pending WO2024077313A2 (fr) | 2023-12-11 | 2024-01-22 | Circuit de restauration de niveau interconnecté |
Family Applications After (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/057160 Pending WO2025024868A2 (fr) | 2023-12-11 | 2024-11-22 | Multiplication de calcul en mémoire (cim) de système de nombres résiduels |
Country Status (1)
| Country | Link |
|---|---|
| WO (3) | WO2024077313A2 (fr) |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5987487A (en) * | 1996-03-11 | 1999-11-16 | Cirrus Logic, Inc. | Methods and apparatus for the processing of digital signals |
| KR100224278B1 (ko) * | 1996-12-18 | 1999-10-15 | 윤종용 | 패스 트랜지스터 로직을 사용하는 조건 합 가산기 및 그것을 구비한 집적 회로 |
| US6898613B1 (en) * | 1999-08-26 | 2005-05-24 | Stmicroelectronics, Inc. | Arithmetic circuits for use with the residue number system |
-
2024
- 2024-01-22 WO PCT/US2024/012404 patent/WO2024077313A2/fr active Pending
- 2024-01-22 WO PCT/US2024/012403 patent/WO2024077312A2/fr active Pending
- 2024-11-22 WO PCT/US2024/057160 patent/WO2025024868A2/fr active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| WO2024077313A9 (fr) | 2024-05-16 |
| WO2024077313A3 (fr) | 2024-10-10 |
| WO2024077312A3 (fr) | 2024-10-17 |
| WO2025024868A2 (fr) | 2025-01-30 |
| WO2024077313A2 (fr) | 2024-04-11 |
| WO2025024868A3 (fr) | 2025-03-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Mohan et al. | Residue number systems | |
| Kuang et al. | Low-cost high-performance VLSI architecture for Montgomery modular multiplication | |
| Hirner et al. | Proteus: A pipelined NTT architecture generator | |
| US11301212B1 (en) | Multimodal digital multiplication circuits and methods | |
| Tian et al. | Ultra-fast modular multiplication implementation for isogeny-based post-quantum cryptography | |
| CN110858137A (zh) | 除以整数常数的浮点除法 | |
| Zhang et al. | High performance and energy efficient single‐precision and double‐precision merged floating‐point adder on FPGA | |
| US11533054B2 (en) | Ternary logic circuit device | |
| Amberg et al. | Parallel high-radix Montgomery multipliers | |
| US8069200B2 (en) | Apparatus and method for implementing floating point additive and shift operations | |
| WO2024077312A2 (fr) | Unité crossbar de multiplication de système de nombre résiduel plié | |
| Shirakol et al. | An Improved VLSI Architectural Design of Discrete Cosine Transform Based on the Loeffler-DCT Algorithm. | |
| WO2024119200A2 (fr) | Unité d'addition crossbar de système de nombre résiduel plié | |
| Cardarilli et al. | RNS-to-binary conversion for efficient VLSI implementation | |
| US7620677B2 (en) | 4:2 Carry save adder and 4:2 carry save adding method | |
| US7693925B2 (en) | Multiplicand shifting in a linear systolic array modular multiplier | |
| US7167885B2 (en) | Emod a fast modulus calculation for computer systems | |
| Parashar et al. | Fast combinational architecture for a vedic divider | |
| US7836417B1 (en) | Method and apparatus for parallel carry chains | |
| De et al. | Fast parallel algorithm for ternary multiplication using multivalued I/sup 2/L technology | |
| WO2025018991A1 (fr) | Multiplication de système de nombres résiduels efficace pour modulos premiers sélectionnés | |
| Azarmehr et al. | High-speed and low-power reconfigurable architectures of 2-digit two-dimensional logarithmic number system-based recursive multipliers | |
| US12335368B2 (en) | Cryptosystem with utilizing split-radix discrete galois transformation | |
| US7797364B2 (en) | Booth decoder apparatus and method | |
| Gaalswyk et al. | A Low-Power Recurrence-Based Radix 4 Divider Using Signed-Digit Addition |