WO2024082674A1 - Floating-point data precision conversion method and device - Google Patents

Floating-point data precision conversion method and device

Info

Publication number
WO2024082674A1
Authority
WO
WIPO (PCT)
Prior art keywords
field
bit width
code value
value
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/102089
Other languages
English (en)
French (fr)
Inventor
罗元勇
陈敏琪
张忠星
伍玮翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP23878671.9A (EP4597299A4)
Publication of WO2024082674A1
Priority to US19/181,941 (US20250278241A1)
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/499 Denomination or exception handling, e.g. rounding or overflow
    • G06F7/49942 Significance control
    • G06F7/49947 Rounding
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/14 Conversion to or from non-weighted codes
    • H03M7/24 Conversion to or from floating-point codes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059 Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3082 Vector coding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/3808 Details concerning the type of numbers or the way they are handled
    • G06F2207/3812 Devices capable of handling different types of numbers

Definitions

  • the present application relates to the field of chip technology, and in particular to a floating point data precision conversion method and device.
  • low-precision computing power such as FP16/BF16
  • the iterative algorithm and the high-precision data formats FP32/FP64 are used to obtain high-precision calculation results.
  • the embodiment of the present application provides a floating-point data precision conversion method and device, which realizes the conversion of high-precision data to low-precision data.
  • the second floating-point data uses the prefix code field to indicate the bit width of the second exponent field, which effectively balances the relationship between the bit width, range and precision of the second floating-point data.
  • the retained code value is rounded according to the discarded code value in the first mantissa field without the support of other devices, which improves the efficiency of converting high-precision data to low-precision data and reduces hardware overhead.
  • an embodiment of the present application provides a method for converting the precision of floating-point data, wherein the first floating-point data includes a sign field, a first exponent field and a first mantissa field, and the second floating-point data includes a sign field, a prefix code field, a second exponent field and a second mantissa field, the prefix code field is used to indicate the bit width of the second exponent field, and the precision of the first floating-point data is higher than the precision of the second floating-point data, and the method includes: determining the first bit width of the prefix code field, the first coding value of the prefix code field, the first bit width of the second exponent field, the first coding value of the second exponent field and the first bit width of the second mantissa field according to the first coding value of the first exponent field; determining the retained coding values and the discarded coding values in the first mantissa field, the retained coding values including the coding values starting from the highest
  • the floating-point data precision conversion method realizes the conversion of high-precision data into low-precision data.
  • the sign field of the second floating-point data can be obtained based on the sign field of the first floating-point data
  • the prefix code field and the second exponent field of the second floating-point data can be obtained based on the first exponent field of the first floating-point data
  • the second mantissa field of the second floating-point data can be obtained based on the first mantissa field of the first floating-point data.
  • the bit width of the second exponent field is indicated by a shorter prefix code field in the second floating-point data, which can effectively improve the precision or bit width of the mantissa of the second floating-point data; at the same time, second floating-point data with only 1 bit of mantissa precision can represent a larger numerical range, which effectively balances the relationship between the bit width, range and precision of the second floating-point data.
  • the prefix code field can adopt a prefix code encoding method, which occupies less bit width and makes it convenient to parse the second exponent field and the second mantissa field.
  • the retained code value is rounded according to the discarded code value in the first mantissa field without the support of other devices, which improves the efficiency of converting high-precision data to low-precision data and reduces hardware overhead.
  • the rounding operation includes a carry operation or a discard operation, and rounding the retained code value according to the discarded code value to obtain the first code value of the second mantissa field includes: when the code value that starts from the highest bit of the discarded code value and has the preset bit width is greater than or equal to the second preset threshold, a carry operation is performed on the lowest bit of the retained code value and the discarded code value is discarded, and the retained code value after the carry is the first code value of the second mantissa field; when the code value that starts from the highest bit of the discarded code value and has the preset bit width is less than the second preset threshold, the discarded code value is discarded, and the retained code value is the first code value of the second mantissa field.
  • the second preset threshold used for comparison is the code value that starts from the lowest bit of the discarded code value and has the preset bit width.
  • the generation of the second preset threshold does not require an additional random number generator, and there is no performance bottleneck for random number generation, which improves the conversion efficiency of high-precision data to low-precision data, and at the same time reduces hardware overhead.
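The threshold comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `round_with_self_threshold`, its parameters, and the example widths (a 23-bit mantissa, 3 retained bits, a preset bit width of 8) are all hypothetical. The comparison follows the text literally (greater than or equal), so an all-zero discarded value also carries. The returned value may equal `1 << keep_bits`, which is the overflow case that the method handles in a separate step.

```python
def round_with_self_threshold(mantissa: int, mant_bits: int,
                              keep_bits: int, preset_bits: int) -> int:
    """Round a mant_bits-wide mantissa to keep_bits, using its own low bits as threshold.

    Returns the retained code value after the optional carry. The result can be
    1 << keep_bits, meaning the carry overflowed the retained width.
    """
    drop_bits = mant_bits - keep_bits
    if not 0 < preset_bits <= drop_bits:
        raise ValueError("preset_bits must fit inside the discarded bits")

    retained = mantissa >> drop_bits                # highest keep_bits bits
    discarded = mantissa & ((1 << drop_bits) - 1)   # remaining low bits

    # Code value taken from the highest bit of the discarded code value.
    compared = discarded >> (drop_bits - preset_bits)
    # Threshold taken from the lowest bit of the discarded code value.
    threshold = discarded & ((1 << preset_bits) - 1)

    if compared >= threshold:
        retained += 1   # carry into the lowest retained bit
    return retained     # the discarded bits are dropped in both cases


# Hypothetical inputs: 23-bit mantissa, keep 3 bits, preset bit width 8.
up = round_with_self_threshold((0b101 << 20) | 0x80010, 23, 3, 8)
down = round_with_self_threshold((0b101 << 20) | 0x010FF, 23, 3, 8)
print(bin(up), bin(down))
```

Because both operands of the comparison come from the discarded bits, the sketch needs no external random source, which is the property the text attributes to this scheme.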
  • the rounding operation includes a carry operation or a discard operation
  • the retained code value is rounded according to the discarded code value to obtain the first code value of the second mantissa field, including: when the highest bit of the discarded code value is greater than or equal to a first preset threshold, a carry operation is performed on the lowest bit of the retained code value, and the discarded code value is discarded, and the code value obtained after the carry operation on the retained code value is the first code value of the second mantissa field; when the highest bit of the discarded code value is less than the first preset threshold, a discard operation is performed on the discarded code value, and the retained code value is the first code value of the second mantissa field.
  • the first preset threshold can be 0 or 1; comparing the highest bit of the discarded code value with the first preset threshold is a form of rounding away from 0.
  • rounding away from even numbers and rounding away from odd numbers can also be used.
  • rounding away from 0 has a smaller hardware implementation area and lower power consumption than other rounding methods, and has a higher data resolution.
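The comparison on the highest discarded bit can be sketched as below. This is an illustrative sketch only: the function name `round_by_discarded_msb` and its parameters are hypothetical, and a first preset threshold of 1 is assumed as the default, so a carry happens exactly when the highest discarded bit is 1. The mantissa is treated as an unsigned magnitude with the sign kept separately, which is why rounding the magnitude up moves the value away from 0.

```python
def round_by_discarded_msb(mantissa: int, mant_bits: int,
                           keep_bits: int, threshold: int = 1) -> int:
    """Round a mant_bits-wide mantissa to keep_bits by testing the top discarded bit.

    Returns the retained code value after the optional carry; a result of
    1 << keep_bits means the carry overflowed the retained width.
    """
    drop_bits = mant_bits - keep_bits
    if drop_bits < 1:
        raise ValueError("at least one bit must be discarded")

    retained = mantissa >> drop_bits
    msb = (mantissa >> (drop_bits - 1)) & 1   # highest bit of the discarded value

    if msb >= threshold:
        retained += 1   # carry operation on the lowest retained bit
    return retained     # otherwise the discarded bits are simply dropped


# Hypothetical 7-bit mantissas rounded to 3 bits.
print(round_by_discarded_msb(0b1011000, 7, 3))  # discarded bits 1000
print(round_by_discarded_msb(0b1010111, 7, 3))  # discarded bits 0111
```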
  • the floating-point data precision conversion method provided by the embodiment of the present application also includes: determining whether the retained code value after the carry operation overflows; if the retained code value after the carry operation overflows, adding 1 to the lowest bit of the first code value of the first exponent field to obtain the second code value of the first exponent field; if the second bit width of the prefix code field is different from the first bit width of the prefix code field, determining the second code value of the prefix code field, the second code value of the second exponent field, the second bit width of the second mantissa field, and the second code value of the second mantissa field according to the second code value of the first exponent field; if the second bit width of the prefix code field is the same as the first bit width of the prefix code field, determining whether the first bit width of the second exponent field and the second bit width of the second exponent field are the same; if the second bit width of the second exponent field is less than the first bit width of the second exponent field
  • adding 1 to the lowest bit of the first code value of the first exponent field yields the second bit width of the prefix code field and the second bit width of the second exponent field. When the second bit width of the prefix code field is the same as the first bit width of the prefix code field and the bit width of the second exponent field changes, the second bit width of the second mantissa field can be obtained, which solves the overflow caused by the carry operation on the retained code value.
  • determining the first bit width of the prefix code domain, the first coding value of the prefix code domain, the first bit width of the second exponent domain and the first coding value of the second exponent domain according to the first coding value of the first exponent domain includes: determining an indication value according to the first coding value of the first exponent domain, determining the first bit width of the prefix code domain and the first coding value of the prefix code domain corresponding to the indication value by looking up a table, the indication value being also used to indicate the first bit width of the second exponent domain; determining the first coding value corresponding to the first bit width of the second exponent domain according to the first coding value of the first exponent domain.
  • determining the first bit width of the second mantissa field according to the first encoded value of the first exponent field includes: determining the first bit width of the second mantissa field according to the total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
  • the bit width of the sign field is 1, the bit width of the prefix code field is 2 or 3, the first bit width of the second exponent field is an integer from 0 to 4, and the first bit width of the second mantissa field is an integer from 1 to 4.
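The bit-width relationship stated above amounts to a subtraction, sketched here in Python. The function name `mantissa_width` is hypothetical, and the total bit width of 8 is an assumption suggested by the HiFloat8 name rather than a value stated in this passage. The sketch deliberately does not encode which prefix code corresponds to which exponent width, since that mapping comes from a lookup table not reproduced here.

```python
def mantissa_width(prefix_bits: int, exponent_bits: int,
                   total_bits: int = 8, sign_bits: int = 1) -> int:
    """Bit width left for the mantissa after sign, prefix code and exponent.

    total_bits=8 is an assumed default; the ranges checked below are the ones
    quoted in the text (prefix 2 or 3, exponent 0 to 4, mantissa 1 to 4).
    """
    if prefix_bits not in (2, 3):
        raise ValueError("prefix code field is 2 or 3 bits wide")
    if not 0 <= exponent_bits <= 4:
        raise ValueError("second exponent field is 0 to 4 bits wide")

    width = total_bits - sign_bits - prefix_bits - exponent_bits
    if not 1 <= width <= 4:
        raise ValueError("combination leaves an invalid mantissa width")
    return width


print(mantissa_width(2, 4))  # widest exponent, narrowest mantissa
print(mantissa_width(3, 0))  # no stored exponent bits, widest mantissa
```

The two calls show the trade-off the text describes: bits not spent on the exponent are available to the mantissa, and vice versa.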
  • a shorter prefix code field is used in the second floating-point data to indicate the first bit width of the second exponent field, so that the second floating-point data can provide up to 4 bits of mantissa precision; at the same time, second floating-point data that provides only 1 bit of mantissa precision can represent a larger numerical range, effectively balancing the relationship between the bit width, range and precision of the second floating-point data.
  • the highest bit is hidden when the second exponent field is stored, reducing the first bit width that needs to be stored in the second exponent field, effectively avoiding the problem of numerical overlap of the first code value of the second exponent field corresponding to the indication value of different prefix code fields, so that there is no redundant coding in the HiFloat8 data format.
  • when the first floating-point data exceeds the upper limit of the data range of the second floating-point data, the second floating-point data is determined based on a saturation method or an infinity method; when the first floating-point data exceeds the lower limit of the data range of the second floating-point data, the second floating-point data is zero; when the first floating-point data is a non-numeric value, the second floating-point data is a non-numeric value.
  • when the first floating-point data exceeds the upper limit or the lower limit of the data range of the second floating-point data, the second floating-point data can represent the first floating-point data by a special value, such as a saturated value, an infinite value, or a zero value.
  • when the first floating-point data is a non-numeric value, the second floating-point data is also represented by a non-numeric value.
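The special-value rules above can be illustrated with a small classifier. This is a sketch under assumptions: the function name `classify_for_conversion`, the string labels it returns, and the example limits are hypothetical, and the actual range limits of the second floating-point format are not given in this passage. A non-numeric value is modeled here as an IEEE 754 NaN.

```python
import math


def classify_for_conversion(x: float, max_magnitude: float, min_magnitude: float,
                            overflow_mode: str = "saturate") -> str:
    """Decide which special value, if any, the converted result should take.

    max_magnitude and min_magnitude stand in for the upper and lower limits of
    the target format's data range; both are caller-supplied assumptions.
    """
    if math.isnan(x):
        return "nan"                       # non-numeric in, non-numeric out
    magnitude = abs(x)
    if magnitude > max_magnitude:
        # Saturation method clamps to the largest value; infinity method does not.
        return "max" if overflow_mode == "saturate" else "inf"
    if magnitude < min_magnitude:
        return "zero"                      # below the lower limit becomes zero
    return "in_range"                      # normal conversion path applies


# Hypothetical limits chosen only for the demonstration.
for sample in (1e6, 1e-9, float("nan"), 1.5):
    print(sample, classify_for_conversion(sample, 100.0, 0.01))
```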
  • an embodiment of the present application provides a floating-point data precision conversion device, wherein the first floating-point data includes a sign field, a first exponent field, and a first mantissa field, and the second floating-point data includes a sign field, a prefix code field, a second exponent field, and a second mantissa field, and the prefix code field is used to indicate the bit width of the second exponent field.
  • the precision of the first floating-point data is higher than the precision of the second floating-point data.
  • the device includes: a bit width calculation unit, used to determine the first bit width of the prefix code field, the first code value of the prefix code field, the first bit width of the second exponent field, the first code value of the second exponent field and the first bit width of the second mantissa field according to the first code value of the first exponent field; a mantissa field calculation unit, used to determine the retained code value and the discarded code value in the first mantissa field, where the retained code value includes the code value that starts from the highest bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field; and a rounding operation unit, used to round the retained code value according to the discarded code value to obtain the first code value of the second mantissa field.
  • the beneficial effects of the second aspect can refer to the description of the first aspect.
  • the rounding operation includes a carry operation or a discard operation
  • the rounding operation unit is also used to: when the code value that starts from the highest bit of the discarded code value and has the preset bit width is greater than or equal to a second preset threshold, perform a carry operation on the lowest bit of the retained code value and discard the discarded code value, where the retained code value after the carry is the first code value of the second mantissa field; when the code value that starts from the highest bit of the discarded code value and has the preset bit width is less than the second preset threshold, perform a discard operation on the discarded code value, where the retained code value is the first code value of the second mantissa field; wherein the second preset threshold is the code value that starts from the lowest bit of the discarded code value and has the preset bit width.
  • the rounding operation includes a carry operation or a discard operation
  • the rounding operation unit is also used to: when the highest bit of the discarded code value is greater than or equal to a first preset threshold, perform a carry operation on the lowest bit of the retained code value, and perform a discard operation on the discarded code value, and the code value obtained after the carry operation on the retained code value is the first code value of the second mantissa field; when the highest bit of the discarded code value is less than the first preset threshold, perform a discard operation on the discarded code value, and the retained code value is the first code value of the second mantissa field.
  • the device also includes: an overflow unit, which is used to determine whether the retained code value after the carry operation overflows; the bit width calculation unit is also used to, if the retained code value after the carry operation overflows, add 1 to the first code value of the first exponent field to obtain the second code value of the first exponent field; determine the second bit width of the second exponent field and the second bit width of the prefix code field according to the second code value of the first exponent field; if the second bit width of the prefix code field is different from the first bit width of the prefix code field, determine the second code value of the prefix code field, the second code value of the second exponent field, and the second bit width of the second mantissa field according to the second code value of the first exponent field.
  • if the second bit width of the prefix code field is the same as the first bit width of the prefix code field, determine whether the first bit width of the second exponent field is the same as the second bit width of the second exponent field; if the second bit width of the second exponent field is less than the first bit width of the second exponent field, add 1 to the bit width of the retained code value to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field; if the second bit width of the second exponent field is greater than or equal to the first bit width of the second exponent field, discard the lowest bit of the retained code value to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field.
  • the bit width calculation unit is also used to: determine an indication value according to a first coding value of the first exponent field, determine the first bit width of the prefix code field and the first coding value of the prefix code field corresponding to the indication value by looking up a table, and the indication value is also used to indicate the first bit width of the second exponent field; determine the first coding value corresponding to the first bit width of the second exponent field according to the first coding value of the first exponent field.
  • bit width calculation unit is further used to determine the first bit width of the second mantissa field according to the total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
  • the bit width calculation unit is also used to: when the first floating-point data exceeds the upper limit of the conversion range of the second floating-point data, determine the second floating-point data based on a saturation method or an infinity method; when the first floating-point data exceeds the lower limit of the conversion range of the second floating-point data, the second floating-point data is zero; when the first floating-point data is a non-numeric value, the second floating-point data is a non-numeric value.
  • a communication device comprising at least one processor, wherein the at least one processor is connected to a memory, and the at least one processor is used to read and execute a program stored in the memory so that the device executes the method described in the first aspect or any implementation of the first aspect.
  • a chip is provided, wherein the chip is coupled to a memory and is used to read and execute program instructions stored in the memory to implement the method described in the first aspect or any implementation of the first aspect.
  • the present application provides a chip system, which is applied to a cloud center.
  • the chip system includes one or more interface circuits and one or more processors.
  • the interface circuit and the processor are interconnected by a line; the interface circuit is used to receive a signal from a memory of the cloud center and send the signal to the processor, and the signal includes a computer instruction stored in the memory.
  • the cloud center executes the floating-point data precision conversion method provided by the first aspect or its corresponding possible design.
  • an embodiment of the present application provides a computer-readable storage medium, including computer instructions.
  • when the computer instructions are executed on an electronic device, the electronic device executes the floating-point data precision conversion method in any of the above aspects and any possible implementations.
  • an embodiment of the present application provides a computer program product, which, when executed on a computer or a processor, enables the computer or the processor to execute the floating-point data precision conversion method in any of the above aspects and any possible implementations.
  • any of the floating-point data precision conversion devices, chip systems, computer-readable storage media or computer program products provided above can be applied to the corresponding methods provided above. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding methods and will not be repeated here.
  • FIG1 is a diagram of an IEEE754 floating-point data format provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of a system or device for applying a floating-point data precision conversion method or device provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the structure of an SLC provided in an embodiment of the present application.
  • FIG4 is a flow chart of a floating point data precision conversion method provided by an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of a random rounding method provided in an embodiment of the present application.
  • FIG6 is a flow chart of a rounding method away from zero provided in an embodiment of the present application.
  • FIG7 is a flow chart of another floating point data precision conversion method provided by an embodiment of the present application.
  • FIG8 is a flow chart of another floating point data precision conversion method provided by an embodiment of the present application.
  • FIG9 is a flow chart of converting FP32 data into HiFloat8 data provided by an embodiment of the present application.
  • FIG10 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • Scalar calculation unit: a circuit for scalar calculation, where a scalar is a quantity that has only magnitude and no direction. Scalar calculation is mostly used for general-purpose calculation.
  • execution unit (EXU): part of the multi-stage pipeline of the central processing unit (CPU)
  • ALU: arithmetic logic unit
  • Vector computing unit: a computing unit specially designed for vector computing with a certain degree of parallelism, such as a single instruction multiple data (SIMD) processor, where a vector usually refers to a one-dimensional array with a length greater than 1.
  • SIMD: single instruction multiple data
  • Vector computing units are mostly used in fields such as HPC and AI machine learning, including solving mathematical problems such as linear programming, Fourier transforms, filtering, linear algebra, partial differential equations, and integration.
  • an arithmetic execution unit (Vector Unit) based on the HiFloat data format can be embedded.
  • Matrix computing unit: a computing unit specially designed for matrix computing with corresponding parallelism, such as a systolic array processor, where a matrix is a two-dimensional array of elements arranged in a rectangle.
  • Matrix computing units are mostly used for matrix computing in fields such as HPC and AI machine learning, including matrix multiplication, matrix inversion, and matrix decomposition.
  • in the matrix computing acceleration unit, a matrix unit based on the HiFloat data format can be embedded.
  • Tensor computing unit: a computing unit specially designed for tensor computing with corresponding parallelism, such as a cube computing unit, where a tensor is a multidimensional array with more than 2 dimensions; 3-dimensional arrays are common. Tensor computing units are mostly used in the field of AI machine learning, for example for convolution operations. In tensor computing acceleration units, tensor units based on the HiFloat data format can be embedded.
  • first and second are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features.
  • a feature defined as “first” or “second” may explicitly or implicitly include one or more of the features.
  • plural means two or more.
  • IEEE 754 The Institute of Electrical and Electronics Engineers (IEEE) has established IEEE 754 as a binary floating-point arithmetic standard, which defines floating-point data representation methods such as double-precision FP64, single-precision FP32, and half-precision FP16.
  • double-precision FP64 data and single-precision FP32 data are suitable for CPU and floating-point arithmetic environments
  • half-precision FP16 data are suitable for computer graphics environments.
  • Figure 1 is a diagram of the IEEE754 floating-point data format provided in an embodiment of the present application.
  • IEEE754 floating-point data includes a sign field (bit sign, S), an exponent field (bits exponent, E), and a mantissa field (bits mantissa, M).
  • in FP64 data, the sign field is 1 bit, the exponent field is 11 bits, and the mantissa field is 52 bits.
  • in FP32 data, the sign field is 1 bit, the exponent field is 8 bits, and the mantissa field is 23 bits.
  • in FP16 data, the sign field is 1 bit, the exponent field is 5 bits, and the mantissa field is 10 bits.
  • the first floating-point data provided in the embodiment of the present application is FP32 data.
  • the data format of FP32 data is introduced first, as shown in Table 1.
  • FP32 data includes a sign field S, an exponent field E, and a mantissa field M.
  • the sign field of FP32 data determines whether the FP32 data is a positive number or a negative number, where 0 represents a positive number and 1 represents a negative number.
  • the exponent field of FP32 data encodes a power of 2 that weights (scales) the FP32 data.
  • the mantissa field of FP32 data is a binary fraction. The steps for converting FP32 data to decimal data are as follows: (1) The sign field of the FP32 data is 0, so the FP32 data is a positive number. (2) The code value of the exponent field of the FP32 data is 01111100, which represents 124 in decimal.
  • subtracting the FP32 exponent bias of 127 gives 124 - 127 = -3, so the exponent field of the FP32 data represents 2 to the power of -3.
  • the mantissa field of the FP32 data is 01000000000000000000000 (23 bits). Since the exponent field is neither all "0" nor all "1", an implicit leading 1 applies and the decimal data represented by the mantissa field is 1.25. Based on the conversion formula between floating-point data and decimal values, the decimal data corresponding to the FP32 data is 1.25 × 2^-3 = 0.15625.
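The worked FP32 example above can be checked with a short Python sketch. The helper name `decode_fp32` is hypothetical, and only normal numbers are handled, matching the case in the text where the exponent field is neither all "0" nor all "1". The standard-library `struct` module is used as an independent cross-check of the hand computation.

```python
import struct


def decode_fp32(bits: int) -> tuple[int, int, int, float]:
    """Split a 32-bit pattern into sign, exponent and mantissa and compute its value.

    Only normal numbers are covered (exponent field neither all 0 nor all 1).
    """
    sign = (bits >> 31) & 0x1          # 1-bit sign field
    exponent = (bits >> 23) & 0xFF     # 8-bit exponent field
    mantissa = bits & 0x7FFFFF         # 23-bit mantissa field
    if exponent in (0, 0xFF):
        raise ValueError("zero, subnormal, infinity and NaN are not covered here")

    significand = 1 + mantissa / 2**23                 # implicit leading 1
    value = (-1) ** sign * significand * 2.0 ** (exponent - 127)
    return sign, exponent, mantissa, value


# Sign 0, exponent 01111100, mantissa 01 followed by zeros, as in the text.
bits = (0 << 31) | (0b01111100 << 23) | (0b01 << 21)
sign, exponent, mantissa, value = decode_fp32(bits)
reference = struct.unpack(">f", struct.pack(">I", bits))[0]
print(sign, exponent, exponent - 127, value, reference)
```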
  • the prior art provides a simplified random rounding method for converting FP32 to FP16 or BF16 data format, and the threshold of random rounding is calculated using specific bits of the data itself.
  • the 11th to 18th bits and the 16th to 23rd bits in the mantissa field of FP32 data (23 bits in total, numbered 1 to 23 from the most significant bit (MSB) to the least significant bit (LSB)) are added, and the overflowed bits are used as the threshold of the random input.
  • this method only describes a single random rounding conversion method, which only involves the conversion of FP32 to FP16 or BF16 data format, and cannot meet the conversion of high-precision data in other formats to low-precision data.
  • the threshold generation method of this method involves more bits in the mantissa field, the calculation is more complicated, and the hardware overhead is large.
  • the floating-point data precision conversion method realizes the conversion of high-precision data into low-precision data.
  • the sign domain of the second floating-point data can be obtained based on the sign domain of the first floating-point data
  • the prefix code domain and the second exponent domain of the second floating-point data can be obtained based on the first exponent domain of the first floating-point data
  • the second mantissa domain of the second floating-point data can be obtained based on the first mantissa domain of the first floating-point data.
  • the bit width of the second exponent domain is indicated by a shorter prefix code domain in the second floating-point data, which can effectively improve the precision or bit width of the mantissa of the second floating-point data; at the same time, second floating-point data with only 1 bit of mantissa precision can represent a larger numerical range, which effectively balances the bit width, range and precision of the second floating-point data.
  • the prefix code domain can adopt the prefix code encoding method, occupying less bit width, and parsing the second exponent domain and the second mantissa domain is convenient.
  • the retained code value is rounded according to the discarded code value in the first mantissa domain without the support of other devices, which improves the efficiency of converting high-precision data to low-precision data and reduces the hardware overhead.
  • the second floating-point data provided in the embodiment of the present application is HiFloat8 data, as shown in Table 2, which is the encoding method of HiFloat8 data.
  • 8 is the total bit width of HiFloat8 data, and the total bit width can vary.
  • the sign field occupies one bit, with 0 for positive and 1 for negative, or alternatively 1 for positive and 0 for negative.
  • the prefix code field occupies 2 or 3 bits and can express 5 different values, denoted D.
  • the value of D can be 0, 1, 2, 3, 4.
  • the bit width of the exponent field varies according to the value of D, and the mantissa field occupies the remaining bit width.
  • the prefix code domain can be encoded using integers, in which case D is a fixed value.
  • the prefix code domain can also be encoded using prefix codes, in which case D is a finite value set.
  • with prefix code encoding, 2 bits are used to encode the values 2, 3, and 4, and 3 bits are used to encode the values 0 and 1.
  • the prefix code encoding method of the prefix code field is shown in Table 3.
  • Table 3 is the encoding method of the prefix code field.
  • Ec is the center of symmetry of the exponent code, which corresponds to the bias in FP32 data.
  • HiFloat(N,5,Ec) can be configured as HiFloat(8,5,0), abbreviated as HiF8, or other configurations.
  • the distribution of HiFloat8 encoding values is shown in Table 4.
  • the floating-point data precision conversion method and device of the present application can be applied to different systems or devices, such as the execution device 20 shown in Figure 2, which is a schematic diagram of a system or device for the application of a floating-point data precision conversion method and device provided in an embodiment of the present application.
  • the execution device can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an AR device (not shown in Figure 2), a VR device (not shown in Figure 2), a vehicle terminal (not shown in Figure 2), etc., and can also be a server, etc.
  • the floating-point data precision conversion method provided in the present application can be applied to scenarios involving mixed precision calculations such as CPU, HPC and AI in the execution device 20, such as scalar calculation units, vector calculation units, matrix calculation units and tensor calculation units.
  • the device for floating-point data precision conversion proposed in the present application can be a chip, for example, a system-on-a-chip (SoC).
  • Figure 3 is a structural diagram of a SoC provided in an embodiment of the present application.
  • the SoC includes a processor, which can be a single-core processor or a multi-core processor, a memory, and an I/O interface, etc. After the processor loads the data and application in the memory, it processes the data, such as performing the calculation processing in the present application.
  • the sign domain, prefix code domain, second exponent domain, and second mantissa domain of the second floating-point data can be determined by reading the sign domain, first exponent domain, and first mantissa domain in the FP32 data.
  • the embodiment of the present application provides a floating-point data precision conversion method, which is applied to converting first floating-point data into second floating-point data, wherein the first floating-point data includes a sign field, a first exponent field, and a first mantissa field, and the second floating-point data includes a sign field, a prefix code field, a second exponent field, and a second mantissa field, wherein the prefix code field is used to indicate the bit width of the second exponent field, and the precision of the first floating-point data is higher than the precision of the second floating-point data.
  • FIG. 4 is a flow chart of a floating-point data precision conversion method provided by an embodiment of the present application, and the method includes:
  • Step 401 The execution device determines the first bit width of the prefix code field, the first code value of the prefix code field, the first bit width of the second exponent field, the first code value of the second exponent field, and the first bit width of the second mantissa field according to the first code value of the first exponent field.
  • the precision of the first floating-point data is higher than the precision of the second floating-point data, wherein the first floating-point data may be FP32 data and the second floating-point data may be HiFloat8 data.
  • the first floating-point data or the second floating-point data includes a binary integer part and a binary decimal part, wherein the first exponent field and the second exponent field respectively determine the binary integer part of the first floating-point data and the second floating-point data, the first mantissa field determines the binary decimal part of the first floating-point data, and the prefix code field and the second mantissa field determine the binary decimal part of the second floating-point data.
  • a calculation can be performed on the first code value of the first exponent field to obtain the first bit width of the prefix code field and the first code value of the prefix code field.
  • the first bit width of the second exponent field and the first code value of the second exponent field can also be obtained. After the first bit width of the prefix code field and the first bit width of the second exponent field are obtained, the first bit width of the second mantissa field can be determined.
  • Step 402 The execution device determines the retained code values and discarded code values in the first mantissa field, where the retained code values include code values in the first mantissa field starting from the highest bit and having the same bit width as the first bit width of the second mantissa field.
  • the bit width of the first mantissa field of the first floating-point data is greater than the bit width of the second mantissa field of the second floating-point data.
  • the first floating-point data is converted into the second floating-point data. Since the bit width of the second mantissa field of the second floating-point data is limited, it is necessary to round off the coded values in the first mantissa field.
  • the code values in the first mantissa field that start from the highest bit and whose bit width equals the first bit width of the second mantissa field are determined as the retained code values, and the remaining code values in the first mantissa field are determined as the discarded code values.
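The split in Step 402 amounts to a shift and a mask. A minimal Python sketch follows; the function name `split_mantissa` and its parameter names are hypothetical and are not taken from the application.

```python
def split_mantissa(mantissa, src_bits, keep_bits):
    """Split a src_bits-wide mantissa code into retained and discarded parts.

    The retained part is the top keep_bits bits (starting from the highest bit);
    the discarded part is everything below it.
    Returns (retained, discarded, drop_bits).
    """
    assert 0 < keep_bits <= src_bits
    drop_bits = src_bits - keep_bits
    retained = mantissa >> drop_bits
    discarded = mantissa & ((1 << drop_bits) - 1)
    return retained, discarded, drop_bits

# FP32 mantissa of the Table 1 example: 01 followed by 21 zeros.
M = int("01" + "0" * 21, 2)

print(split_mantissa(M, 23, 2))  # (1, 0, 21): retained 2'b01, discarded all zeros
print(split_mantissa(M, 23, 1))  # (0, 2097152, 22): retained 1'b0, discarded 1 followed by 21 zeros
```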
  • Step 403 The execution device rounds the retained code value according to the discarded code value to obtain the first code value of the second mantissa field.
  • the rounding operation may be a carry operation or a discard operation; whether to perform a carry operation on the retained code value or only a discard operation on the discarded code value is determined according to the discarded code value.
  • One determination method may be to compare the discarded code value with a threshold value, and if the discarded code value is greater than the threshold value, perform a carry operation on the retained code value, and if the discarded code value is less than the threshold value, perform a discard operation on the discarded code value.
  • the floating-point data precision conversion method realizes the conversion of high-precision data into low-precision data.
  • the sign domain of the second floating-point data can be obtained based on the sign domain of the first floating-point data
  • the prefix code domain and the second exponent domain of the second floating-point data can be obtained based on the first exponent domain of the first floating-point data
  • the second mantissa domain of the second floating-point data can be obtained based on the first mantissa domain of the first floating-point data.
  • the bit width of the second exponent domain is indicated by a shorter prefix code domain in the second floating-point data, so that the second floating-point data can provide up to 4 bits of mantissa precision, while second floating-point data that provides only 1 bit of mantissa precision can represent a larger numerical range, which effectively balances the bit width, range and precision of the second floating-point data.
  • the prefix code domain can adopt a prefix code encoding method, occupying less bit width, and parsing the second exponent domain and the second mantissa domain is convenient.
  • the retained code value is rounded according to the discarded code value in the first mantissa domain without the support of other devices, which improves the efficiency of converting high-precision data to low-precision data and reduces the hardware overhead.
  • the embodiment of the present application further provides a stochastic rounding (SR) method.
  • FIG5 is a schematic diagram of the structure of a stochastic rounding method provided by the embodiment of the present application.
  • Step 403 may also include:
  • Step 4033 When the code value whose bit width is the preset bit width and starts from the highest bit in the discarded code value is greater than or equal to the second preset threshold, the execution device performs a carry operation on the lowest bit of the retained code value and discards the discarded code value.
  • the code value after the carry operation of the retained code value is the first code value of the second mantissa field.
  • Step 4034 When the code value with a bit width of the preset bit width and starting from the highest bit in the discarded code value is smaller than the second preset threshold, the execution device discards the discarded code value, and the retained code value is the first code value of the second mantissa field.
  • the second preset threshold is the code value that starts from the lowest bit of the discarded code value and has a bit width equal to the preset bit width.
  • the preset bit width may be an integer between 10 and 14.
  • the second preset threshold may be a coding value starting from the lowest bit in the discarded coding value and having a bit width of 14, and the part of the discarded coding value used for comparison with the second preset threshold is a coding value starting from the highest bit and having a bit width of 14.
  • for the first mantissa field 23'b01000000000000000000000 in Table 1, if the first bit width of the second mantissa field is 2, the retained code value in the first mantissa field is 2'b01, the discarded code value is 21'b000000000000000000000, the part of the discarded code value used for comparison is 14'b00000000000000, and the second preset threshold is 14'b00000000000000. Since the compared part of the discarded code value is equal to the second preset threshold, the discarded code value is discarded, and the retained code value is the first code value of the second mantissa field, that is, the first code value of the second mantissa field is 2'b01.
  • for the first mantissa field 23'b01000000000000000000000 in Table 1, if the first bit width of the second mantissa field is 1, the retained code value in the first mantissa field is 1'b0, the discarded code value is 22'b1000000000000000000000, the part of the discarded code value used for comparison is 14'b10000000000000, and the second preset threshold is 14'b00000000000000. Since the compared part of the discarded code value is greater than the second preset threshold, a carry operation is performed on the lowest bit of the retained code value, and a discard operation is performed on the discarded code value.
  • the code value obtained after the carry operation on the retained code value is the first code value of the second mantissa field, that is, the first code value of the second mantissa field is 1'b1.
  • the second preset threshold used for comparison is the code value that starts from the lowest bit of the discarded code value and has a bit width equal to the preset bit width.
  • the generation of the second preset threshold does not require an additional random number generator, and there is no performance bottleneck for random number generation, which improves the conversion efficiency of high-precision data to low-precision data, and at the same time reduces hardware overhead.
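The data-derived threshold described above can be sketched in Python as follows. The function name `sr_round` and its parameters are hypothetical. One assumption is made explicit: equality between the compared part and the threshold is treated as a discard, which matches both worked examples above but differs from the "greater than or equal to" wording of Step 4033. The returned value can be one bit wider than `keep_bits` when the carry overflows; that case is the subject of Step 404 onward.

```python
def sr_round(mantissa, src_bits, keep_bits, preset_bits=14):
    """Round a mantissa code using a threshold taken from its own discarded bits.

    compared  = top preset_bits bits of the discarded code value
    threshold = bottom preset_bits bits of the discarded code value
    Assumption of this sketch: carry only when compared > threshold.
    """
    drop_bits = src_bits - keep_bits
    assert drop_bits >= preset_bits, "discarded part must be at least preset_bits wide"
    retained = mantissa >> drop_bits
    discarded = mantissa & ((1 << drop_bits) - 1)
    compared = discarded >> (drop_bits - preset_bits)
    threshold = discarded & ((1 << preset_bits) - 1)
    if compared > threshold:
        return retained + 1   # carry into the lowest retained bit
    return retained           # discard only

M = int("01" + "0" * 21, 2)   # first mantissa field of the Table 1 example

print(format(sr_round(M, 23, 2), "02b"))  # 01
print(format(sr_round(M, 23, 1), "01b"))  # 1
```

No random number generator appears anywhere in the sketch, which is the point made in the text: the comparison uses only bits that are already present in the discarded code value.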
  • the rounding operation includes a carry operation or a discard operation.
  • the embodiment of the present application provides a round half away from zero (TA) carry operation.
  • FIG. 6 is a flowchart of a round half away from zero rounding method provided in an embodiment of the present application.
  • Step 403 may include:
  • Step 4031 When the highest bit of the discarded code value is greater than or equal to the first preset threshold, the execution device performs a carry operation on the lowest bit of the retained code value and discards the discarded code value.
  • the code value obtained after the carry operation on the retained code value is the first code value of the second mantissa field.
  • Step 4032 When the highest bit of the discarded code value is less than the first preset threshold, the execution device discards the discarded code value, and the retained code value is the first code value of the second mantissa field.
  • the first preset threshold value may be 1.
  • when the highest bit of the discarded code value is greater than or equal to the first preset threshold, that is, when the highest bit of the discarded code value is 1, a carry operation is performed on the lowest bit of the retained code value, and a discard operation is performed on the discarded code value.
  • when the highest bit of the discarded code value is less than the first preset threshold, that is, when the highest bit of the discarded code value is 0, a discard operation is performed on the discarded code value.
  • for the first mantissa field 23'b01000000000000000000000 in Table 1, if the first bit width of the second mantissa field is 2, the retained code value in the first mantissa field is 2'b01, the discarded code value is 21'b000000000000000000000, and the highest bit of the discarded code value is 0. Since the highest bit of the discarded code value is less than the first preset threshold, the discarded code value is discarded, and the retained code value is the first code value of the second mantissa field, that is, the first code value of the second mantissa field is 2'b01.
  • for the first mantissa field 23'b01000000000000000000000 in Table 1, if the first bit width of the second mantissa field is 1, the retained code value in the first mantissa field is 1'b0, the discarded code value is 22'b1000000000000000000000, and the highest bit of the discarded code value is 1. Since the highest bit of the discarded code value is greater than or equal to the first preset threshold, a carry operation is performed on the lowest bit of the retained code value, and a discard operation is performed on the discarded code value; the code value obtained after the carry operation on the retained code value is the first code value of the second mantissa field, that is, the first code value of the second mantissa field is 1'b1.
  • the preset threshold value may also be 0.
  • a carry operation is performed on the lowest bit of the retained code value, and the discarded code value is discarded, and the code value obtained after the carry operation on the retained code value is the first code value of the second mantissa domain.
  • a discard operation is performed on the discarded code value, and the retained code value is the first code value of the second mantissa domain.
  • the round half to even rounding method and the round half to odd rounding method may also be included.
  • the TA rounding method provided in the embodiment of the present application has a smaller hardware implementation area and lower power consumption than other rounding methods, and has a higher data resolution.
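Steps 4031 and 4032 with a first preset threshold of 1 reduce to inspecting a single bit. A minimal Python sketch, in which the name `ta_round` and its parameters are hypothetical:

```python
def ta_round(mantissa, src_bits, keep_bits, threshold=1):
    """Round half away from zero on a mantissa code (sign is handled elsewhere).

    Only the highest bit of the discarded code value is examined:
    bit >= threshold -> carry into the lowest retained bit (Step 4031)
    bit <  threshold -> keep the retained code value as is (Step 4032)
    """
    drop_bits = src_bits - keep_bits
    assert drop_bits >= 1, "at least one bit must be discarded"
    retained = mantissa >> drop_bits
    highest_discarded_bit = (mantissa >> (drop_bits - 1)) & 1
    if highest_discarded_bit >= threshold:
        return retained + 1
    return retained

M = int("01" + "0" * 21, 2)   # first mantissa field of the Table 1 example

print(format(ta_round(M, 23, 2), "02b"))  # 01
print(format(ta_round(M, 23, 1), "01b"))  # 1
```

Because the decision depends on one bit only, no comparator over the full discarded value is needed, which is consistent with the small hardware area claimed for the TA method.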
  • FIG7 is a flow chart of another floating point data precision conversion method provided in an embodiment of the present application.
  • the floating point data precision conversion method provided in an embodiment of the present application may also include:
  • Step 404 The execution device determines whether the retained code value after the carry operation overflows.
  • the reserved code value may overflow.
  • an overflow may occur after a carry operation is performed on the lowest bit of the reserved code value.
  • Step 405 If the retained code value after the carry operation overflows, the execution device adds 1 to the least significant bit of the first code value in the first exponent field to obtain the second code value in the first exponent field.
  • for the first exponent field in Table 1, the first encoding value is 8'b01111100. If the retained encoding value after the carry operation overflows, 1 is added to the lowest bit of the first encoding value of the first exponent field to obtain the second encoding value of the first exponent field, that is, the second encoding value of the first exponent field is 8'b01111101.
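Steps 404 and 405 can be sketched as a carry followed by an overflow check. The function name `carry_with_overflow` is hypothetical, and the sketch does not cover the case in which the incremented exponent code itself leaves the valid FP32 range.

```python
def carry_with_overflow(retained, keep_bits, exponent_code):
    """Add 1 to the lowest bit of the retained code value and detect overflow.

    Returns (mantissa_code, exponent_code, overflowed). Overflow can only
    happen when the retained value was all ones, so the low keep_bits bits
    become zero and the first exponent field's code value is incremented.
    """
    carried = retained + 1
    overflowed = (carried >> keep_bits) != 0
    if overflowed:
        return carried & ((1 << keep_bits) - 1), exponent_code + 1, True
    return carried, exponent_code, False

print(carry_with_overflow(0b11, 2, 0b01111100))  # (0, 125, True): exponent becomes 8'b01111101
print(carry_with_overflow(0b01, 2, 0b01111100))  # (2, 124, False)
```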
  • Step 406 The execution device determines a second bit width of the second exponent field and a second bit width of the prefix code field according to the second encoding value of the first exponent field.
  • the second bit width of the second exponent field is 1 and the second bit width of the prefix code field is 3.
  • Step 407 If the second bit width of the prefix code field is different from the first bit width of the prefix code field, the execution device determines the second code value of the prefix code field, the second code value of the second exponent field, the second bit width of the second mantissa field, and the second code value of the second mantissa field according to the second code value of the first exponent field.
  • if the second bit width of the prefix code field is different from the first bit width of the prefix code field, then, since the prefix code field is used to indicate the bit width of the second exponent field, the first bit width of the second exponent field is different from the second bit width of the second exponent field. If the second bit width of the prefix code field is greater than the first bit width of the prefix code field, the second bit width of the second exponent field is less than the first bit width of the second exponent field. In this case, the number of bits gained by the prefix code field equals the number of bits lost by the second exponent field, so the bit width of the second mantissa field remains unchanged.
  • if the second bit width of the prefix code field is less than the first bit width of the prefix code field, the second bit width of the second exponent field is greater than the first bit width of the second exponent field. In this case, the number of bits lost by the prefix code field equals the number of bits gained by the second exponent field, so the bit width of the second mantissa field remains unchanged.
  • the execution device determines the second code value of the prefix code field and the second code value of the second exponent field according to the second code value of the first exponent field.
  • the second code value of the second mantissa field is all 0, for example, if the second bit width of the second mantissa field is 3, the second code value of the second mantissa field is 3'b000.
  • the first bit width of the prefix code field determined based on the first encoding value 8'b01111100 of the first exponent field is 2, and the second bit width of the second exponent field is 2.
  • the number of bit widths reduced by the prefix code field is the same as the number of bit widths increased by the second exponent field, so the first bit width of the second mantissa field remains unchanged.
  • Step 408 If the second bit width of the prefix code field is the same as the first bit width of the prefix code field, the execution device determines whether the first bit width of the second exponent field is the same as the second bit width of the second exponent field.
  • Step 409 If the second bit width of the second exponent field is smaller than the first bit width of the second exponent field, the execution device adds 1 to the bit width of the retained code value to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field.
  • the second bit width of the prefix code field is the same as the first bit width of the prefix code field. If the second bit width of the second exponent field is less than the first bit width of the second exponent field, the first bit width of the second mantissa field will increase.
  • the bit width of the retained code value is increased by 1 to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field. In one example, if the retained code value is 2'b01, its bit width is 2; after the bit width is increased by 1, the bit width of the retained code value is 3, and the retained code value is 3'b010.
  • Step 4010 If the second bit width of the second exponent field is greater than the first bit width of the second exponent field, the execution device discards the lowest bit of the retained encoding value to obtain the second bit width of the second mantissa field and the second encoding value of the second mantissa field.
  • the second bit width of the prefix code field is the same as the first bit width of the prefix code field. If the second bit width of the second exponent field is greater than the first bit width of the second exponent field, the first bit width of the second mantissa field will be reduced. The lowest bit of the retained code value is discarded to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field. In one example, if the retained code value is 2'b01, the bit width of the retained code value is 2, and after the lowest bit of the retained code value is discarded, the retained code value is 1'b0, and the bit width of the retained code value is 1.
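The two adjustments in Steps 409 and 4010 are a one-bit left shift (append a zero) and a one-bit right shift (drop the lowest bit). A minimal Python sketch of just these bit manipulations, with the hypothetical helper name `resize_retained`:

```python
def resize_retained(retained, width, new_width):
    """Adapt the retained code value to a new mantissa bit width.

    Widening appends zero bits at the low end (Step 409);
    narrowing discards bits from the low end (Step 4010).
    Returns (new_code_value, new_width).
    """
    if new_width >= width:
        return retained << (new_width - width), new_width
    return retained >> (width - new_width), new_width

print(resize_retained(0b01, 2, 3))  # (2, 3): 2'b01 becomes 3'b010
print(resize_retained(0b01, 2, 1))  # (0, 1): 2'b01 becomes 1'b0
```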
  • FIG8 is a flowchart of another floating-point data precision conversion method provided by an embodiment of the present application.
  • Step 401 includes:
  • Step 4011 the execution device determines an indication value according to the first coding value of the first exponent field, determines the first bit width of the prefix code field and the first coding value of the prefix code field corresponding to the indication value by looking up a table, and the indication value is also used to indicate the first bit width of the second exponent field.
  • the exponent value N of the first exponent field can be determined based on the first encoding value of the first exponent field, and the indication value can be determined based on the exponent value of the first exponent field.
  • the table lookup here is to look up Table 3, the indication value is the value of D, and the indication value can be 0, 1, 2, 3, 4.
  • for the first exponent field in Table 1, the first encoding value is 8'b01111100, which represents 124 in decimal. After the offset of 127 for FP32 data is removed, a decimal value of -3 is obtained, where -3 is the exponent value N of the first exponent field.
  • the indication value D is then computed from the exponent value N as D = INT[log2 …]; for N = -3, D is 2.
  • the indication value is also used to indicate the first bit width of the second exponent field, that is, the first bit width of the second exponent field is 2.
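The mapping from the FP32 exponent code to the indication value can be sketched as follows. The exact formula is an assumption of this sketch: D is taken as INT[log2|N|] + 1 for N ≠ 0 and as 0 for N = 0, which reproduces D = 2 for N = -3 and keeps D within 0 to 4 for |N| ≤ 15. The name `indication_value` is hypothetical, and exponents outside that range (handled in the text by the saturation or infinity modes) simply raise an error here.

```python
FP32_BIAS = 127

def indication_value(exponent_code):
    """Return (N, D) for an FP32 exponent code value.

    N is the unbiased exponent. D is assumed to be the number of bits needed
    to write |N| in binary, i.e. floor(log2|N|) + 1 for N != 0 and 0 for N == 0.
    """
    n = exponent_code - FP32_BIAS
    d = abs(n).bit_length()
    if d > 4:
        raise ValueError("exponent outside the range covered by D = 0..4")
    return n, d

print(indication_value(0b01111100))  # (-3, 2)
print(indication_value(0b01111101))  # (-2, 2)
print(indication_value(0b01111111))  # (0, 0)
```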
  • Step 4012 The execution device determines a first encoding value corresponding to the first bit width of the second exponent field according to the first encoding value of the first exponent field.
  • for the first exponent field in Table 1, refer to Table 4: the indication value D is 2; the exponent value of the first exponent field is -3, that is, the exponent sign bit Se is 1; the first bit width of the second exponent field determined by the indication value is 2; and the first encoding value of the second exponent field is therefore determined to be 11.
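One way to reproduce the encoding value 11 for the exponent value -3 is sketched below. The layout is an assumption inferred from the example and from the statement that the highest bit is hidden when the second exponent field is stored: the exponent sign bit Se comes first, followed by the binary magnitude of the exponent with its leading 1 removed. The function name `encode_second_exponent` is hypothetical.

```python
def encode_second_exponent(n):
    """Encode exponent value n as a bit string: Se + magnitude without its leading 1.

    Assumed layout (not confirmed by Table 4, which is not reproduced here).
    The result has exactly D bits, where D is the bit length of |n|;
    n == 0 yields an empty (zero-width) field.
    """
    if n == 0:
        return ""
    sign_bit = "1" if n < 0 else "0"
    magnitude = format(abs(n), "b")   # always starts with '1' for n != 0
    return sign_bit + magnitude[1:]   # hide the leading 1

print(encode_second_exponent(-3))  # 11
print(encode_second_exponent(-2))  # 10
print(encode_second_exponent(15))  # 0111
```

Under this assumed layout, exponent values belonging to different D never share a stored pattern of the same width, which is the non-overlap property the text attributes to hiding the highest bit.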
  • Step 4013 The execution device determines the first bit width of the second mantissa field according to the total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
  • if the total bit width of the second floating-point data is Nb, the first bit width of the prefix code field is Db, the first bit width of the second exponent field is Eb, and the bit width of the sign field is 1, then the first bit width of the second mantissa field is Mb = Nb - 1 - Db - Eb.
  • the bit width of the sign field is 1, the bit width of the prefix code field is 2 or 3, the first bit width of the second exponent field is an integer from 0 to 4, and the first bit width of the second mantissa field is an integer from 1 to 4.
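The bit-width bookkeeping of Step 4013 can be tabulated with a short Python sketch. It relies on statements made earlier in the text (2-bit prefix codes for D = 2, 3, 4; 3-bit prefix codes for D = 0, 1; exponent bit width equal to D); the function name `hif8_widths` is hypothetical.

```python
def hif8_widths(d, total_bits=8):
    """Return (Db, Eb, Mb) for indication value d.

    Db: prefix code width (2 bits for d in {2, 3, 4}, otherwise 3 bits)
    Eb: second exponent field width, equal to d
    Mb: remaining mantissa width, total_bits - 1 (sign) - Db - Eb
    """
    if d not in (0, 1, 2, 3, 4):
        raise ValueError("d must be in 0..4")
    db = 2 if d >= 2 else 3
    eb = d
    mb = total_bits - 1 - db - eb
    return db, eb, mb

for d in range(5):
    print(d, hif8_widths(d))
# 0 (3, 0, 4)
# 1 (3, 1, 3)
# 2 (2, 2, 3)
# 3 (2, 3, 2)
# 4 (2, 4, 1)
```

The resulting mantissa widths range from 1 to 4 bits, in line with the ranges stated above.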
  • a shorter prefix code field is used in the second floating-point data to indicate the first bit width of the second exponent field, so that the second floating-point data can provide up to 4 bits of mantissa precision, and at the same time, the second floating-point data that only provides 1 bit of mantissa precision can represent a larger range of values, effectively balancing the relationship between the bit width, range and precision of the second floating-point data.
  • the highest bit is hidden when the second exponent field is stored, which reduces the bit width that needs to be stored for the second exponent field and prevents the first encoding values of the second exponent field that correspond to different indication values of the prefix code field from overlapping numerically, so there is no redundant encoding in the HiFloat8 data format.
  • the second floating-point data is determined based on a saturation method or an infinity method.
  • the saturation mode can be to use the maximum floating-point data that can be represented in the low-precision floating-point format as the second floating-point data.
  • the infinity mode can be to use the infinity of the low-precision floating-point format as the second floating-point data.
  • the second floating point data after the precision conversion of the first floating point data can be represented as 8'b01101111.
  • the second floating-point data is zero.
  • the second floating-point data is zero, and the second floating-point data can be represented as 8’b01111110.
  • the second floating-point data is a non-numeric value.
  • the second floating-point data can be represented as 8’b11111110.
  • Figure 9 is a flowchart of converting FP32 data to HiFloat8 data provided by an embodiment of the present application. Taking the conversion of FP32 data to HiFloat8 data as an example, applied to the conversion module, the conversion process includes the following process.
  • the conversion module receives FP32 data, which includes a sign field S, an exponent field E[0:7], and a mantissa field M[0:22];
  • bit width of the new prefix code field and the bit width of the exponent field of the HiFloat8 data are calculated based on the exponent field of the FP32 after the addition operation, and are represented by db1 and eb1 respectively;
  • the electronic device includes hardware and/or software modules corresponding to the execution of each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed in the form of hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application in combination with the embodiments, but such implementation should not be considered to be beyond the scope of the present application.
  • This embodiment can divide the electronic device into functional modules according to the above method example.
  • each functional module can be divided according to each function, or two or more functions can be integrated into one processing module.
  • the above integrated module can be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and is only a logical function division. There may be other division methods in actual implementation.
  • FIG. 10 shows a possible composition diagram of the electronic device 100 involved in the above embodiment; FIG. 10 is a structural diagram of an electronic device provided in an embodiment of the present application.
  • the electronic device 100 may include: a bit width calculation unit 101, a mantissa domain calculation unit 102, and a rounding operation unit 103.
  • the bit width calculation unit 101 may be used to support the electronic device 100 in executing the above-mentioned steps 401, 4011, 4012, 4013, etc., and/or other processes for the technology described herein.
  • the mantissa domain calculation unit 102 may be used to support the electronic device 100 in executing the above step 402 and/or other processes of the technology described herein.
  • the rounding operation unit 103 may be used to support the electronic device 100 in executing the above-mentioned steps 403, 4031, 4032, etc., and/or other processes for the technology described herein.
  • the electronic device 100 provided in this embodiment is used to execute the above floating-point data precision conversion method, and thus can achieve the same effect as the above implementation method.
  • the electronic device 100 may include a processing module, a storage module and a communication module.
  • the processing module can be used to control and manage the actions of the electronic device 100, for example, it can be used to support the electronic device 100 to perform the steps performed by the above-mentioned bit width calculation unit 101, the mantissa domain calculation unit 102 and the rounding operation unit 103.
  • the storage module can be used to support the electronic device 100 to store program codes and data, etc.
  • the communication module can be used to support the communication between the electronic device 100 and other devices, such as communication with a wireless access device.
  • the processing module can be a processor or a controller. It can implement or execute various exemplary logic boxes, modules and circuits described in conjunction with the disclosure of this application.
  • the processor can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of digital signal processing (DSP) and a microprocessor, etc.
  • the storage module can be a memory.
  • the communication module can specifically be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or other devices that interact with other electronic devices.
  • In one embodiment, when the processing module is a processor and the storage module is a memory, the electronic device involved in this embodiment may be a server, a computer, or the like.
  • the embodiment of the present application also provides an electronic device, including one or more processors and one or more memories.
  • the one or more memories are coupled to the one or more processors, and the one or more memories are used to store computer program codes, and the computer program codes include computer instructions.
  • When the one or more processors execute the computer instructions, the electronic device executes the above-mentioned related method steps to implement the floating-point data precision conversion method in the above-mentioned embodiment.
  • An embodiment of the present application also provides a computer storage medium, in which computer instructions are stored.
  • When the computer instructions are executed on an electronic device, the electronic device executes the above-mentioned related method steps to implement the floating-point data precision conversion method in the above-mentioned embodiment.
  • the embodiments of the present application also provide a computer program product.
  • When the computer program product is run on a computer, the computer is caused to execute the above-mentioned related steps to implement the floating-point data precision conversion method executed by the electronic device in the above-mentioned embodiment.
  • an embodiment of the present application also provides a device, which can specifically be a chip, component or module, and the device may include a connected processor and memory; wherein the memory is used to store computer execution instructions, and when the device is running, the processor can execute the computer execution instructions stored in the memory so that the chip executes the floating-point data precision conversion method executed by the electronic device in the above-mentioned method embodiments.
  • the electronic device, computer storage medium, computer program product or chip provided in this embodiment is used to execute the corresponding method provided above. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding method provided above and will not be repeated here.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the modules or units is only a logical function division. There may be other division methods in actual implementation.
  • multiple units or components can be combined or integrated into another device, or some features can be ignored or not executed.
  • In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed in multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the present embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • the technical solution of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions to enable a device (which can be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Nonlinear Science (AREA)
  • Complex Calculations (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

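Embodiments of this application provide a floating-point data precision conversion method and apparatus, relating to the field of chip technologies, which improve preservation of the overall mean when high-precision data is converted into low-precision data. The specific solution is: determining, based on a first code value of a first exponent field, a first bit width of a prefix code field, a first code value of the prefix code field, a first bit width of a second exponent field, a first code value of the second exponent field, and a first bit width of a second mantissa field; determining a retained code value and a discarded code value in a first mantissa field, where the retained code value includes the code value that starts from the most significant bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field; and performing a rounding operation on the retained code value based on the discarded code value to obtain a first code value of the second mantissa field. The embodiments of this application are used in the process of converting high-precision data into low-precision data.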

Description

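Floating-point data precision conversion method and apparatus
This application claims priority to Chinese Patent Application No. 202211281416.6, filed with the China National Intellectual Property Administration on October 19, 2022 and entitled "Floating-point data precision conversion method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of chip technologies, and in particular, to a floating-point data precision conversion method and apparatus.
Background
With the rapid development of mixed-precision computing, computing resources for low-precision floating-point data formats are being deployed on a large scale. For example, the artificial intelligence (AI) field uses a mixed training mode of floating point (FP) 16 and FP32 and a mixed training mode of brain floating point (BF) 16 and FP32, and the high performance computing (HPC) field uses FP32 and FP64 mixed precision. Academia and industry have proposed several 8-bit floating-point data formats, such as shared exponent bias (SEB), hybrid FP8 (HFP8), and configurable FP8 (CFP8). The HPC field, which has higher precision requirements, also wants to deploy low-precision computing power on a large scale, so a variety of mixed-precision solver algorithms have been proposed. These algorithms first use low-precision computing power, such as FP16/BF16, to compute a low-precision initial result, and then use an iterative algorithm and a high-precision data format, FP32/FP64, to obtain a high-precision result. Mixed-precision computing therefore involves conversion between data of different precisions. Converting low-precision data to high-precision data can be done without error. Converting high-precision data to low-precision data requires a rounding operation on the high-precision data, which introduces conversion error and degrades preservation of the overall mean. Different rounding modes produce different conversion errors. In AI training scenarios in particular, rounding errors accumulate and affect the training accuracy of the AI model.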
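Summary
Embodiments of this application provide a floating-point data precision conversion method and apparatus that convert high-precision data into low-precision data. The second floating-point data uses a prefix code field to indicate the bit width of the second exponent field, which effectively balances the bit width, range, and precision of the second floating-point data. In addition, a simple rounding mode is provided in which the retained code value is rounded based on the discarded code value in the first mantissa field, without support from other devices, which improves the efficiency of converting high-precision data into low-precision data and reduces hardware overhead.
To achieve the foregoing objective, the embodiments of this application use the following technical solutions.
According to a first aspect, an embodiment of this application provides a floating-point data precision conversion method. First floating-point data includes a sign field, a first exponent field, and a first mantissa field. Second floating-point data includes the sign field, a prefix code field, a second exponent field, and a second mantissa field. The prefix code field indicates the bit width of the second exponent field, and the precision of the first floating-point data is higher than that of the second floating-point data. The method includes: determining, based on a first code value of the first exponent field, a first bit width of the prefix code field, a first code value of the prefix code field, a first bit width of the second exponent field, a first code value of the second exponent field, and a first bit width of the second mantissa field; determining a retained code value and a discarded code value in the first mantissa field, where the retained code value includes the code value that starts from the most significant bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field; and performing a rounding operation on the retained code value based on the discarded code value to obtain a first code value of the second mantissa field.
The floating-point data precision conversion method provided in the embodiments of this application converts high-precision data into low-precision data. In the format conversion, the sign field of the second floating-point data is obtained from the sign field of the first floating-point data, the prefix code field and the second exponent field are obtained from the first exponent field, and the second mantissa field is obtained from the first mantissa field. Using a short prefix code field to indicate the bit width of the second exponent field effectively increases the precision, or bit width, of the mantissa of the second floating-point data, while second floating-point data that provides only 1 bit of mantissa precision can represent a large numerical range; this effectively balances bit width, range, and precision. The prefix code field may use prefix code encoding, which occupies few bits and makes the second exponent field and the second mantissa field easy to parse. In addition, a simple rounding mode is provided in which the retained code value is rounded based on the discarded code value in the first mantissa field, without support from other devices, which improves conversion efficiency and reduces hardware overhead.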
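In a possible design, the rounding operation includes a carry operation and a discard operation, and performing the rounding operation on the retained code value based on the discarded code value to obtain the first code value of the second mantissa field includes: when the code value that starts from the most significant bit of the discarded code value and has a preset bit width is greater than or equal to a second preset threshold, performing the carry operation on the least significant bit of the retained code value and the discard operation on the discarded code value, where the retained code value after the carry is the first code value of the second mantissa field; and when the code value that starts from the most significant bit of the discarded code value and has the preset bit width is less than the second preset threshold, performing the discard operation on the discarded code value, where the retained code value is the first code value of the second mantissa field. The second preset threshold is the code value that starts from the least significant bit of the discarded code value and has the preset bit width.
In this design, for the stochastic rounding mode, the second preset threshold used for comparison is the code value that starts from the least significant bit of the discarded code value and has the preset bit width. Generating the threshold needs no additional random number generator, so random number generation is not a performance bottleneck; this improves conversion efficiency and lowers hardware overhead.
In a possible design, the rounding operation includes a carry operation or a discard operation, and performing the rounding operation on the retained code value based on the discarded code value to obtain the first code value of the second mantissa field includes: when the most significant bit of the discarded code value is greater than or equal to a first preset threshold, performing the carry operation on the least significant bit of the retained code value and the discard operation on the discarded code value, where the code value obtained after the carry is the first code value of the second mantissa field; and when the most significant bit of the discarded code value is less than the first preset threshold, performing the discard operation on the discarded code value, where the retained code value is the first code value of the second mantissa field.
In this design, the first preset threshold may be 0 or 1, and comparing the most significant bit of the discarded code value with the first preset threshold is the round-half-away-from-zero mode. Other modes, such as round half to even and round half to odd, may also be used. Compared with these other modes, however, round half away from zero has a smaller hardware area, lower power overhead, and higher data resolution.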
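In a possible design, the floating-point data precision conversion method provided in the embodiments of this application further includes: determining whether the retained code value overflows after the carry operation; if it overflows, adding 1 to the least significant bit of the first code value of the first exponent field to obtain a second code value of the first exponent field; if a second bit width of the prefix code field differs from the first bit width of the prefix code field, determining, based on the second code value of the first exponent field, a second code value of the prefix code field, a second code value of the second exponent field, a second bit width of the second mantissa field, and a second code value of the second mantissa field; if the second bit width of the prefix code field is the same as the first bit width of the prefix code field, determining whether the first bit width and a second bit width of the second exponent field are the same; if the second bit width of the second exponent field is less than its first bit width, adding 1 to the bit width of the retained code value to obtain the second bit width and the second code value of the second mantissa field; and if the second bit width of the second exponent field is greater than or equal to its first bit width, discarding the least significant bit of the retained code value to obtain the second bit width and the second code value of the second mantissa field.
In this design, when overflow occurs, the second bit width of the prefix code field and the bit width of the second exponent field are obtained after 1 is added to the least significant bit of the first code value of the first exponent field. If the second bit width of the prefix code field equals its first bit width and the bit width of the second exponent field changes, the second bit width of the second mantissa field can be obtained, which resolves the overflow produced by the carry operation on the retained code value.
In a possible design, determining, based on the first code value of the first exponent field, the first bit width of the prefix code field, the first code value of the prefix code field, the first bit width of the second exponent field, and the first code value of the second exponent field includes: determining an indicator value based on the first code value of the first exponent field, and determining, by table lookup, the first bit width and the first code value of the prefix code field that correspond to the indicator value, where the indicator value also indicates the first bit width of the second exponent field; and determining, based on the first code value of the first exponent field, the first code value corresponding to the first bit width of the second exponent field.
In a possible design, determining the first bit width of the second mantissa field based on the first code value of the first exponent field includes: determining the first bit width of the second mantissa field based on the total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
In this design, for example, for second floating-point data in the HiFloat8 data format, the bit width of the sign field is 1, the bit width of the prefix code field is 2 or 3, the first bit width of the second exponent field is an integer from 0 to 4, and the first bit width of the second mantissa field is an integer from 1 to 4. Using a short prefix code field to indicate the first bit width of the second exponent field therefore lets the second floating-point data provide up to 4 bits of mantissa precision, while second floating-point data that provides only 1 bit of mantissa precision can represent a large numerical range; this effectively balances bit width, range, and precision. In addition, the most significant bit of the second exponent field is hidden when stored, which reduces the bit width the second exponent field needs to store and prevents the first code values of the second exponent field that correspond to different indicator values from overlapping numerically, so the HiFloat8 data format has no redundant encodings.
In a possible design, when the first floating-point data exceeds the upper limit of the data range of the second floating-point data, the second floating-point data is determined based on a saturation mode or an infinity mode; when the first floating-point data exceeds the lower limit of the data range of the second floating-point data, the second floating-point data is zero; and when the first floating-point data is a not-a-number value, the second floating-point data is a not-a-number value.
In this design, when the first floating-point data exceeds the upper or lower limit of the data range of the second floating-point data, the second floating-point data can represent it with a special value, such as a saturation value, an infinity value, or a zero value. When the first floating-point data is a not-a-number value, the second floating-point data is also represented as a not-a-number value.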
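According to a second aspect, an embodiment of this application provides a floating-point data precision conversion apparatus. First floating-point data includes a sign field, a first exponent field, and a first mantissa field. Second floating-point data includes the sign field, a prefix code field, a second exponent field, and a second mantissa field. The prefix code field indicates the bit width of the second exponent field, and the precision of the first floating-point data is higher than that of the second floating-point data. The apparatus includes: a bit width calculation unit, configured to determine, based on a first code value of the first exponent field, a first bit width of the prefix code field, a first code value of the prefix code field, a first bit width of the second exponent field, a first code value of the second exponent field, and a first bit width of the second mantissa field; a mantissa field calculation unit, configured to determine a retained code value and a discarded code value in the first mantissa field, where the retained code value includes the code value that starts from the most significant bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field; and a rounding operation unit, configured to perform a rounding operation on the retained code value based on the discarded code value to obtain a first code value of the second mantissa field.
For the beneficial effects of the second aspect, refer to the description of the first aspect.
In a possible design, the rounding operation includes a carry operation or a discard operation, and the rounding operation unit is further configured to: when the code value that starts from the most significant bit of the discarded code value and has a preset bit width is greater than or equal to a second preset threshold, perform the carry operation on the least significant bit of the retained code value and the discard operation on the discarded code value, where the retained code value after the carry is the first code value of the second mantissa field; and when the code value that starts from the most significant bit of the discarded code value and has the preset bit width is less than the second preset threshold, perform the discard operation on the discarded code value, where the retained code value is the first code value of the second mantissa field. The second preset threshold is the code value that starts from the least significant bit of the discarded code value and has the preset bit width.
In a possible design, the rounding operation includes a carry operation or a discard operation, and the rounding operation unit is further configured to: when the most significant bit of the discarded code value is greater than or equal to a first preset threshold, perform the carry operation on the least significant bit of the retained code value and the discard operation on the discarded code value, where the code value obtained after the carry is the first code value of the second mantissa field; and when the most significant bit of the discarded code value is less than the first preset threshold, perform the discard operation on the discarded code value, where the retained code value is the first code value of the second mantissa field.
In a possible design, the apparatus further includes an overflow unit, configured to determine whether the retained code value overflows after the carry operation. The bit width calculation unit is further configured to: if the retained code value overflows after the carry operation, add 1 to the first code value of the first exponent field to obtain a second code value of the first exponent field; determine, based on the second code value of the first exponent field, a second bit width of the second exponent field and a second bit width of the prefix code field; if the second bit width of the prefix code field differs from the first bit width of the prefix code field, determine, based on the second code value of the first exponent field, a second code value of the prefix code field, a second code value of the second exponent field, a second bit width of the second mantissa field, and a second code value of the second mantissa field; if the second bit width of the prefix code field is the same as the first bit width of the prefix code field, determine whether the first bit width and the second bit width of the second exponent field are the same; if the second bit width of the second exponent field is less than its first bit width, add 1 to the bit width of the retained code value to obtain the second bit width and the second code value of the second mantissa field; and if the second bit width of the second exponent field is greater than or equal to its first bit width, discard the least significant bit of the retained code value to obtain the second bit width and the second code value of the second mantissa field.
In a possible design, the bit width calculation unit is further configured to: determine an indicator value based on the first code value of the first exponent field, and determine, by table lookup, the first bit width and the first code value of the prefix code field that correspond to the indicator value, where the indicator value also indicates the first bit width of the second exponent field; and determine, based on the first code value of the first exponent field, the first code value corresponding to the first bit width of the second exponent field.
In a possible design, the bit width calculation unit is further configured to determine the first bit width of the second mantissa field based on the total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
In a possible design, the bit width calculation unit is further configured to: when the first floating-point data exceeds the upper limit of the conversion range of the second floating-point data, determine the second floating-point data based on a saturation mode or an infinity mode; when the first floating-point data exceeds the lower limit of the conversion range of the second floating-point data, set the second floating-point data to zero; and when the first floating-point data is a not-a-number value, set the second floating-point data to a not-a-number value.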
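According to a third aspect, a communication apparatus is provided, including at least one processor connected to a memory. The at least one processor is configured to read and execute a program stored in the memory, so that the apparatus performs the method according to the first aspect or any design of the first aspect.
According to a fourth aspect, a chip is provided. The chip is coupled to a memory and is configured to read and execute program instructions stored in the memory, to implement the method according to the first aspect or any design of the first aspect.
According to a fifth aspect, this application provides a chip system applied to a cloud center. The chip system includes one or more interface circuits and one or more processors, interconnected by lines. The interface circuit is configured to receive a signal from a memory of the cloud center and send the signal to the processor, where the signal includes computer instructions stored in the memory. When the processor executes the computer instructions, the cloud center performs the floating-point data precision conversion method provided in the first aspect or its corresponding possible designs.
According to a sixth aspect, an embodiment of this application provides a computer-readable storage medium including computer instructions. When the computer instructions are run on an electronic device, the electronic device performs the floating-point data precision conversion method in any of the foregoing aspects and possible implementations.
According to a seventh aspect, an embodiment of this application provides a computer program product. When the computer program product is run on a computer or a processor, the computer or the processor performs the floating-point data precision conversion method in any of the foregoing aspects and possible implementations.
It can be understood that any of the floating-point data precision conversion apparatus, chip system, computer-readable storage medium, or computer program product provided above can be applied to the corresponding method provided above. For the beneficial effects they achieve, refer to the beneficial effects of the corresponding method; details are not repeated here.
These and other aspects of this application will be clearer and easier to understand from the following description.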
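Brief Description of Drawings
FIG. 1 is a diagram of an IEEE 754 floating-point data format according to an embodiment of this application;
FIG. 2 is a schematic diagram of a system or device to which a floating-point data precision conversion method or apparatus is applied according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of an SoC according to an embodiment of this application;
FIG. 4 is a flowchart of a floating-point data precision conversion method according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a stochastic rounding mode according to an embodiment of this application;
FIG. 6 is a flowchart of a round-half-away-from-zero rounding mode according to an embodiment of this application;
FIG. 7 is a flowchart of another floating-point data precision conversion method according to an embodiment of this application;
FIG. 8 is a flowchart of another floating-point data precision conversion method according to an embodiment of this application;
FIG. 9 is a flowchart of converting FP32 data into HiFloat8 data according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of this application.
Description of Embodiments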
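For ease of understanding, some concepts related to the embodiments of this application are explained below for reference.
Scalar computing unit: a circuit for scalar computation. A scalar has magnitude only and no direction. Scalar computation is mostly used for general-purpose computing. An arithmetic logic unit (ALU) based on the HiFloat data format can be embedded in the execution unit (EXU) part of the multi-stage pipeline of a central processing unit (CPU) and in the scalar computing part of other processors with similar functions.
Vector computing unit: a computing unit with a certain degree of parallelism that is specially designed for vector computation, such as a single instruction multiple data (SIMD) processor. A vector usually refers to a one-dimensional array whose length is greater than 1. Vector computing units are mostly used in fields such as HPC and AI machine learning, for problems such as linear programming, Fourier transforms, filtering, linear algebra, partial differential equations, and integration. An arithmetic execution unit (Vector Unit) based on the HiFloat data format can be embedded in a vector computing acceleration unit or a vector processor.
Matrix computing unit: a computing unit with a corresponding degree of parallelism that is specially designed for matrix computation, such as a systolic array processor. A matrix is a two-dimensional array arranged as a rectangular array. Matrix computing units are mostly used for matrix computation in fields such as HPC and AI machine learning, including matrix multiplication, matrix inversion, and matrix decomposition. A matrix unit (Matrix Unit) based on the HiFloat data format can be embedded in a matrix computing acceleration unit.
Tensor computing unit: a computing unit with a corresponding degree of parallelism that is specially designed for tensor computation, such as a cube (Cube) computing unit. A tensor is a multi-dimensional array with more than 2 dimensions, commonly a 3-dimensional array. Tensor computing units are mostly used in AI machine learning, for example in convolution operations. A tensor unit (Tensor Unit) based on the HiFloat data format can be embedded in a tensor computing acceleration unit.
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings. In the description of the embodiments, unless otherwise specified, "/" means "or"; for example, A/B may represent A or B. "And/or" describes only an association between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, or B alone. In addition, in the description of the embodiments, "multiple" means two or more.
In the following, the terms "first" and "second" are used for description only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of this embodiment, unless otherwise specified, "multiple" means two or more.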
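The Institute of Electrical and Electronics Engineers (IEEE) established IEEE 754 as the standard for binary floating-point arithmetic, which defines floating-point representations such as double-precision FP64, single-precision FP32, and half-precision FP16. Double-precision FP64 data and single-precision FP32 data are suited to CPU and floating-point unit environments, and half-precision FP16 data is suited to computer graphics environments. FIG. 1 shows the IEEE 754 floating-point data format provided in an embodiment of this application. IEEE 754 floating-point data includes a sign field (bit sign, S), an exponent field (bits exponent, E), and a mantissa field (bits mantissa, M). For FP64 data, the sign field is 1 bit, the exponent field is 11 bits, and the mantissa field is 52 bits; for FP32 data, the sign field is 1 bit, the exponent field is 8 bits, and the mantissa field is 23 bits; for FP16 data, the sign field is 1 bit, the exponent field is 5 bits, and the mantissa field is 10 bits. The formula for converting floating-point data represented in IEEE 754 to its decimal value (Value) is $\mathrm{Value}=(-1)^{\mathrm{sign}}\times 2^{\mathrm{exponent}-\mathrm{bias}}\times(1+\mathrm{mantissa})$, where bias is a different constant for each of FP64, FP32, and FP16.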
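The first floating-point data provided in the embodiments of this application is FP32 data. The data format of FP32 data is introduced first; Table 1 shows the data format of FP32 data.
Table 1 Data format of FP32 data
FP32 data includes a sign field S, an exponent field E, and a mantissa field M. The sign field determines whether the FP32 data is positive or negative: 0 indicates a positive number and 1 indicates a negative number. The exponent field is a power of 2 that scales the FP32 data. The mantissa field is a binary fraction. FP32 data is converted to decimal as follows: (1) The sign field of the FP32 data is 0, so the data is positive. (2) The code value of the exponent field is 01111100, which is 124 in decimal; after the bias is removed (bias=127 for FP32 data), the exponent field represents 2 to the power of -3. (3) The mantissa field is 01000000000000000000000; because the exponent field is neither all 0s nor all 1s, the mantissa field represents the decimal value 1.25. By the conversion formula between floating-point data and its decimal value, this FP32 data corresponds to the decimal value 0.15625.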
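The three decoding steps above can be checked with a short Python sketch. The function name `decode_fp32_fields` is illustrative and not part of the application; the sketch handles only normal numbers (exponent field neither all 0s nor all 1s) and cross-checks the result against Python's `struct` interpretation of the same 32 bits.

```python
import struct

FP32_BIAS = 127  # exponent bias for FP32, as stated in the text


def decode_fp32_fields(sign: int, exponent: int, mantissa: int) -> float:
    """Evaluate (-1)^S * 2^(E - bias) * (1 + M / 2^23) for a normal FP32 value."""
    if exponent in (0, 0xFF):
        raise ValueError("subnormal and special encodings are outside this sketch")
    return (-1.0) ** sign * 2.0 ** (exponent - FP32_BIAS) * (1.0 + mantissa / 2 ** 23)


# Worked example: S = 0, E = 01111100, M = 01 followed by 21 zeros.
sign, exponent, mantissa = 0, 0b01111100, 0b01 << 21
value = decode_fp32_fields(sign, exponent, mantissa)

# Cross-check: pack the same fields into 32 bits and let struct decode them.
bits = (sign << 31) | (exponent << 23) | mantissa
reference = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(value, reference)
```

Both values agree with the 0.15625 obtained in the worked example.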
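For AI mixed training modes and the HPC field, existing data formats such as FP32 have a large bit width, which makes data storage and transfer expensive. In addition, a growing number of applications do not need high-precision data formats such as FP64. High-precision data therefore needs to be converted into low-precision data, and the conversion requires a rounding operation on the high-precision data, which introduces conversion error.
The prior art provides a simplified stochastic rounding scheme for converting FP32 to the FP16 or BF16 data format, in which the stochastic rounding threshold is computed from specific bits of the data itself. In this scheme, bits 11 to 18 and bits 16 to 23 of the mantissa field of the FP32 data (23 bits in total, numbered 1 to 23 from the most significant bit (MSB) to the least significant bit (LSB)) are added, and the overflow bit is used as the stochastic rounding threshold. However, this scheme describes only a single stochastic rounding conversion, covers only conversion from FP32 to FP16 or BF16, and cannot serve conversions from other high-precision formats to low-precision data. Its threshold generation also involves many mantissa bits, so the computation is relatively complex and the hardware overhead is high.
The floating-point data precision conversion method provided in the embodiments of this application therefore converts high-precision data into low-precision data. In the format conversion, the sign field of the second floating-point data is obtained from the sign field of the first floating-point data, the prefix code field and the second exponent field are obtained from the first exponent field, and the second mantissa field is obtained from the first mantissa field. Using a short prefix code field to indicate the bit width of the second exponent field effectively increases the precision, or bit width, of the mantissa of the second floating-point data, while second floating-point data that provides only 1 bit of mantissa precision can represent a large numerical range; this effectively balances bit width, range, and precision. The prefix code field may use prefix code encoding, which occupies few bits and makes the second exponent field and the second mantissa field easy to parse. In addition, a simple rounding mode is provided in which the retained code value is rounded based on the discarded code value in the first mantissa field, without support from other devices, which improves conversion efficiency and reduces hardware overhead.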
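The second floating-point data provided in the embodiments of this application is HiFloat8 data. Table 2 shows the encoding of HiFloat8 data.
Table 2 Encoding of HiFloat8 data
Here, 8 is the total bit width of HiFloat8 data; the total bit width may vary. The sign field occupies one bit, where 0 indicates a positive number and 1 a negative number, or alternatively 1 indicates a positive number and 0 a negative number. The prefix code field occupies 2 or 3 bits and can express 5 different pieces of information: the value of D may be 0, 1, 2, 3, or 4. The bit width of the exponent field varies with the value of D, and the mantissa field occupies the remaining bits.
The prefix code field may use integer encoding, in which case D is a fixed value. It may also use prefix code encoding, in which case D is a finite set of values. In prefix code encoding, 2 bits encode the values 2, 3, and 4, and 3 bits encode the values 0 and 1. Table 3 shows the prefix code encoding of the prefix code field.
Table 3 Prefix code encoding of the prefix code field
The conversion formula between HiFloat8 data and the decimal value (X) provided in the embodiments of this application is:
where Ec is the symmetric center of the exponent, which is also the bias in FP32 data.
When D is 0, the value of the exponent field is 0. When D is not 0, the exponent field uses signed magnitude encoding, that is, a sign bit followed by the true form (TF): the exponent field encoding is Ei={Se,1'b1,TF[2:end]}, where Se is the sign bit of the exponent. The most significant bit 1'b1 of TF is hidden and not stored, so the stored code value of the exponent field is Es={Se,TF[2:end]}. The decimal value corresponding to the exponent field is Ev=Ei+Ec.
HiFloat(N,5,Ec) may be configured as HiFloat(8,5,0), abbreviated as HiF8, or in other ways. Table 4 shows the distribution of HiFloat8 encoded values.
Table 4 Distribution of HiFloat8 encoded values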
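In the foregoing scenarios, the floating-point data precision conversion method and apparatus of this application can be applied to different systems or devices, such as the execution device 20 shown in FIG. 2, which is a schematic diagram of a system or device to which the method and apparatus are applied. The execution device may be a terminal, such as a mobile phone, a tablet computer, a notebook computer, an AR device (not shown in FIG. 2), a VR device (not shown in FIG. 2), or a vehicle-mounted terminal (not shown in FIG. 2), or may be a server or the like. The method can be applied to mixed-precision computing scenarios in the execution device 20 that involve CPUs, HPC, and AI, for example in scalar, vector, matrix, and tensor computing units.
In some embodiments, the floating-point data precision conversion apparatus proposed in this application may be a chip, for example a system-on-a-chip (SoC). FIG. 3 is a schematic structural diagram of an SoC according to an embodiment of this application. The SoC includes a processor, which may be a single-core or multi-core processor, a memory, an I/O interface, and the like. After loading data and an application program from the memory, the processor processes the data, for example by performing the computation described in this application. For example, when the data is FP32 data, the sign field, prefix code field, second exponent field, and second mantissa field of the second floating-point data can be determined by reading the sign field, first exponent field, and first mantissa field of the FP32 data.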
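An embodiment of this application provides a floating-point data precision conversion method for converting first floating-point data into second floating-point data. The first floating-point data includes a sign field, a first exponent field, and a first mantissa field. The second floating-point data includes the sign field, a prefix code field, a second exponent field, and a second mantissa field. The prefix code field indicates the bit width of the second exponent field, and the precision of the first floating-point data is higher than that of the second floating-point data. FIG. 4 is a flowchart of the method, which includes the following steps.
Step 401: The execution device determines, based on the first code value of the first exponent field, the first bit width of the prefix code field, the first code value of the prefix code field, the first bit width of the second exponent field, the first code value of the second exponent field, and the first bit width of the second mantissa field.
For example, the precision of the first floating-point data is higher than that of the second floating-point data; the first floating-point data may be FP32 data, and the second floating-point data may be HiFloat8 data.
Because format conversion does not change the sign of the floating-point data, the sign field of the first floating-point data is the same as that of the second floating-point data. The first or second floating-point data includes a binary integer part and a binary fractional part. The first exponent field and the second exponent field determine the binary integer parts of the first and second floating-point data respectively; the first mantissa field determines the binary fractional part of the first floating-point data; and the prefix code field and the second mantissa field determine the binary fractional part of the second floating-point data. Computation on the first code value of the first exponent field yields the first bit width and the first code value of the prefix code field. Based on the data format of the second floating-point data, the first bit width and the first code value of the second exponent field can also be obtained. Once the first bit widths of the prefix code field and the second exponent field are known, the first bit width of the second mantissa field can be determined.
Step 402: The execution device determines the retained code value and the discarded code value in the first mantissa field, where the retained code value includes the code value that starts from the most significant bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field.
For example, because the precision of the first floating-point data is higher than that of the second floating-point data, the bit width of the first mantissa field is greater than that of the second mantissa field. Because the bit width of the second mantissa field is limited, some of the code value in the first mantissa field must be given up during conversion. The code value that starts from the most significant bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field is determined as the retained code value, and the remaining code value in the first mantissa field is determined as the discarded code value.
Step 403: The execution device performs a rounding operation on the retained code value based on the discarded code value to obtain the first code value of the second mantissa field.
For example, the rounding operation may be a carry operation or a discard operation; the discarded code value determines whether a carry operation is performed on the retained code value or the discarded code value is simply dropped. One way to decide is to compare the discarded code value with a threshold: if the discarded code value is greater than the threshold, a carry operation is performed on the retained code value; if it is less than the threshold, the discard operation is performed on the discarded code value.
For example, the floating-point data precision conversion method provided in the embodiments of this application converts high-precision data into low-precision data. In the format conversion, the sign field of the second floating-point data is obtained from the sign field of the first floating-point data, the prefix code field and the second exponent field are obtained from the first exponent field, and the second mantissa field is obtained from the first mantissa field. Using a short prefix code field to indicate the bit width of the second exponent field lets the second floating-point data provide up to 4 bits of mantissa precision, while second floating-point data that provides only 1 bit of mantissa precision can represent a large numerical range; this effectively balances bit width, range, and precision. The prefix code field may use prefix code encoding, which occupies few bits and makes the second exponent field and the second mantissa field easy to parse. In addition, a simple rounding mode is provided in which the retained code value is rounded based on the discarded code value in the first mantissa field, without support from other devices, which improves conversion efficiency and reduces hardware overhead.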
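Optionally, an embodiment of this application further provides a stochastic rounding (Stochastic Round, SR) mode. FIG. 5 is a schematic structural diagram of the stochastic rounding mode. Step 403 may further include the following steps.
Step 4033: When the code value that starts from the most significant bit of the discarded code value and has a preset bit width is greater than or equal to the second preset threshold, the execution device performs the carry operation on the least significant bit of the retained code value and the discard operation on the discarded code value; the retained code value after the carry is the first code value of the second mantissa field.
Step 4034: When the code value that starts from the most significant bit of the discarded code value and has the preset bit width is less than the second preset threshold, the execution device performs the discard operation on the discarded code value; the retained code value is the first code value of the second mantissa field.
The second preset threshold is the code value that starts from the least significant bit of the discarded code value and has the preset bit width.
For example, for the SR mode, the preset bit width may be an integer from 10 to 14. Taking a preset bit width of 14 as an example, the second preset threshold is the 14-bit code value that starts from the least significant bit of the discarded code value, and the part of the discarded code value that is compared with the threshold (the partial discarded code value) is the 14-bit code value that starts from its most significant bit.
In one example, for the first mantissa field 23'b01000000000000000000000 in Table 1, if the first bit width of the second mantissa field is 2, the retained code value is 2'b01 and the discarded code value is 21'b000000000000000000000, so the partial discarded code value is 14'b00000000000000 and the second preset threshold is 14'b00000000000000. Because the partial discarded code value equals the second preset threshold, the discard operation is performed on the discarded code value in this example, and the retained code value is the first code value of the second mantissa field, that is, 2'b01.
In another example, for the same first mantissa field 23'b01000000000000000000000, if the first bit width of the second mantissa field is 1, the retained code value is 1'b0 and the discarded code value is 22'b1000000000000000000000, so the partial discarded code value is 14'b10000000000000 and the second preset threshold is 14'b00000000000000. Because the partial discarded code value is greater than the second preset threshold, the carry operation is performed on the least significant bit of the retained code value and the discard operation is performed on the discarded code value; the code value obtained after the carry is the first code value of the second mantissa field, that is, 1'b1.
For example, for the SR mode, the second preset threshold used for comparison is the code value that starts from the least significant bit of the discarded code value and has the preset bit width. Generating the threshold needs no additional random number generator, so random number generation is not a performance bottleneck; this improves conversion efficiency and lowers hardware overhead.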
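The comparison used by the SR mode can be sketched in Python as follows. This is a minimal illustration rather than the application's hardware design: the names `split_mantissa` and `stochastic_round` are hypothetical, and the `carry_on_equal` flag is an assumption introduced because step 4033 carries on "greater than or equal to" while the first worked example discards when the two 14-bit values are equal. The default follows the worked examples.

```python
def split_mantissa(mantissa: int, src_width: int, kept_width: int):
    """Split a mantissa into (retained, discarded, discarded_width), MSB first."""
    dropped_width = src_width - kept_width
    kept = mantissa >> dropped_width
    dropped = mantissa & ((1 << dropped_width) - 1)
    return kept, dropped, dropped_width


def stochastic_round(mantissa: int, kept_width: int, src_width: int = 23,
                     preset_width: int = 14, carry_on_equal: bool = False) -> int:
    """Round using the low bits of the discarded value as the threshold."""
    kept, dropped, dropped_width = split_mantissa(mantissa, src_width, kept_width)
    if dropped_width < preset_width:
        raise ValueError("discarded field is narrower than the preset bit width")
    # Partial discarded code value: the top preset_width bits of the discarded bits.
    partial = dropped >> (dropped_width - preset_width)
    # Second preset threshold: the bottom preset_width bits of the discarded bits.
    threshold = dropped & ((1 << preset_width) - 1)
    carry = partial > threshold or (carry_on_equal and partial == threshold)
    # The result may need kept_width + 1 bits; overflow is handled separately.
    return kept + 1 if carry else kept


m = 0b01 << 21                       # 23'b01 followed by 21 zeros
print(stochastic_round(m, 2))        # retained 2'b01, discarded bits all zero
print(stochastic_round(m, 1))        # retained 1'b0, discarded MSB is 1
```

With the default tie handling the two calls reproduce the results of the two worked examples, 2'b01 and 1'b1.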
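Optionally, the rounding operation includes a carry operation or a discard operation, and an embodiment of this application provides a round-half-away-from-zero (Round Half To Away, TA) rounding mode. FIG. 6 is a flowchart of this mode. Step 403 may include the following steps.
Step 4031: When the most significant bit of the discarded code value is greater than or equal to the first preset threshold, the execution device performs the carry operation on the least significant bit of the retained code value and the discard operation on the discarded code value; the code value obtained after the carry is the first code value of the second mantissa field.
Step 4032: When the most significant bit of the discarded code value is less than the first preset threshold, the execution device performs the discard operation on the discarded code value; the retained code value is the first code value of the second mantissa field.
For example, for the TA mode, the first preset threshold may be 1. When the most significant bit of the discarded code value is greater than or equal to the first preset threshold, that is, when it is 1, the carry operation is performed on the least significant bit of the retained code value and the discard operation is performed on the discarded code value. When the most significant bit of the discarded code value is less than the first preset threshold, that is, when it is 0, the discard operation is performed on the discarded code value.
In one example, for the first mantissa field 23'b01000000000000000000000 in Table 1, if the first bit width of the second mantissa field is 2, the retained code value is 2'b01 and the discarded code value is 21'b000000000000000000000, whose most significant bit is 0. Because the most significant bit of the discarded code value is less than the first preset threshold, the discard operation is performed on the discarded code value, and the retained code value is the first code value of the second mantissa field, that is, 2'b01.
In another example, for the same first mantissa field 23'b01000000000000000000000, if the first bit width of the second mantissa field is 1, the retained code value is 1'b0 and the discarded code value is 22'b1000000000000000000000, whose most significant bit is 1. Because the most significant bit of the discarded code value is greater than or equal to the first preset threshold, the carry operation is performed on the least significant bit of the retained code value and the discard operation is performed on the discarded code value; the code value obtained after the carry is the first code value of the second mantissa field, that is, 1'b1.
For example, for the TA mode, the first preset threshold may also be 0. In that case, when the most significant bit of the discarded code value is greater than the first preset threshold, the carry operation is performed on the least significant bit of the retained code value and the discard operation is performed on the discarded code value; the code value obtained after the carry is the first code value of the second mantissa field. When the most significant bit of the discarded code value is less than or equal to the first preset threshold, the discard operation is performed on the discarded code value, and the retained code value is the first code value of the second mantissa field.
For example, in addition to the TA mode, rounding modes such as round half to even and round half to odd may also be used. Compared with these other modes, the TA mode provided in the embodiments of this application has a smaller hardware area, lower power overhead, and higher data resolution.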
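Because the TA decision depends only on the most significant bit of the discarded code value, it reduces to adding that bit to the retained code value. The sketch below illustrates this with a first preset threshold of 1; the function name `round_half_away` is hypothetical, and the sketch operates on the mantissa magnitude only, since the sign field is carried over unchanged.

```python
def round_half_away(mantissa: int, kept_width: int, src_width: int = 23) -> int:
    """Carry into the retained bits when the MSB of the discarded bits is 1."""
    dropped_width = src_width - kept_width
    if dropped_width < 1:
        raise ValueError("nothing is discarded, so no rounding is needed")
    kept = mantissa >> dropped_width
    # The most significant discarded bit decides between carry (1) and discard (0).
    msb_of_dropped = (mantissa >> (dropped_width - 1)) & 1
    # The result may need kept_width + 1 bits; overflow is handled separately.
    return kept + msb_of_dropped


m = 0b01 << 21                      # 23'b01 followed by 21 zeros
print(round_half_away(m, 2))        # discarded MSB is 0: retained value unchanged
print(round_half_away(m, 1))        # discarded MSB is 1: carry into retained value
```

The two calls correspond to the two TA worked examples and return 2'b01 and 1'b1 respectively.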
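Optionally, FIG. 7 is a flowchart of another floating-point data precision conversion method according to an embodiment of this application. The method may further include the following steps.
Step 404: The execution device determines whether the retained code value overflows after the carry operation.
For example, a carry operation on the retained code value may cause an overflow. In one example, if the retained code value is 3'b111, a carry into its least significant bit overflows.
Step 405: If the retained code value overflows after the carry operation, the execution device adds 1 to the least significant bit of the first code value of the first exponent field to obtain the second code value of the first exponent field.
For example, the first code value of the first exponent field in Table 1 is 8'b01111100. If the retained code value overflows after the carry operation, 1 is added to the least significant bit of this first code value to obtain the second code value of the first exponent field, 8'b01111101.
Step 406: The execution device determines, based on the second code value of the first exponent field, the second bit width of the second exponent field and the second bit width of the prefix code field.
For example, based on the second code value 8'b01111101 of the first exponent field, the second bit width of the second exponent field is 1 and the second bit width of the prefix code field is 3.
Step 407: If the second bit width of the prefix code field differs from the first bit width of the prefix code field, the execution device determines, based on the second code value of the first exponent field, the second code value of the prefix code field, the second code value of the second exponent field, the second bit width of the second mantissa field, and the second code value of the second mantissa field.
For example, referring to Table 4, if the second bit width of the prefix code field differs from its first bit width, then, because the prefix code field indicates the bit width of the second exponent field, the first and second bit widths of the second exponent field also differ. If the second bit width of the prefix code field is greater than its first bit width, the second bit width of the second exponent field is less than its first bit width; the number of bits added to the prefix code field equals the number of bits removed from the second exponent field, so the bit width of the second mantissa field is unchanged. If the second bit width of the prefix code field is less than its first bit width, the second bit width of the second exponent field is greater than its first bit width; the number of bits removed from the prefix code field equals the number of bits added to the second exponent field, so the bit width of the second mantissa field is again unchanged. Therefore, when the two bit widths of the prefix code field differ, the bit widths of the second exponent field and the prefix code field change while the second bit width of the second mantissa field does not, and the execution device determines the second code values of the prefix code field and the second exponent field based on the second code value of the first exponent field. The second code value of the second mantissa field is all 0s; for example, if the second bit width of the second mantissa field is 3, the second code value of the second mantissa field is 3'b000.
In one example, the first bit width of the prefix code field determined from the first code value 8'b01111100 of the first exponent field is 2, and the second bit width of the second exponent field is 2; in this case the number of bits removed from the prefix code field equals the number of bits added to the second exponent field, so the first bit width of the second mantissa field is unchanged.
Step 408: If the second bit width of the prefix code field is the same as the first bit width of the prefix code field, the execution device determines whether the first bit width and the second bit width of the second exponent field are the same.
Step 409: If the second bit width of the second exponent field is less than its first bit width, the execution device adds 1 to the bit width of the retained code value to obtain the second bit width and the second code value of the second mantissa field.
For example, when the two bit widths of the prefix code field are the same and the second bit width of the second exponent field is less than its first bit width, the bit width of the second mantissa field increases. Adding 1 to the bit width of the retained code value gives the second bit width and the second code value of the second mantissa field. In one example, if the retained code value is 2'b01 with a bit width of 2, then after 1 is added to its bit width the bit width is 3 and the retained code value is 3'b010.
Step 4010: If the second bit width of the second exponent field is greater than its first bit width, the execution device discards the least significant bit of the retained code value to obtain the second bit width and the second code value of the second mantissa field.
For example, when the two bit widths of the prefix code field are the same and the second bit width of the second exponent field is greater than its first bit width, the bit width of the second mantissa field decreases. Discarding the least significant bit of the retained code value gives the second bit width and the second code value of the second mantissa field. In one example, if the retained code value is 2'b01 with a bit width of 2, then after its least significant bit is discarded the retained code value is 1'b0 with a bit width of 1.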
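Steps 404 and 405 can be illustrated with a small Python sketch. It covers only overflow detection and the exponent increment; re-deriving the prefix code and exponent widths (steps 406 to 4010) is left out. The function name `carry_with_overflow` is hypothetical, and the sketch assumes the incremented exponent code value stays within the normal FP32 range.

```python
def carry_with_overflow(kept: int, kept_width: int, exponent_code: int):
    """Apply a carry to the retained bits and propagate overflow to the exponent.

    Returns (mantissa_bits, exponent_code, overflowed).
    """
    rounded = kept + 1
    if rounded >> kept_width:
        # The carry left the retained field: the mantissa bits become all zero
        # and the first exponent field's code value is incremented by 1.
        return 0, exponent_code + 1, True
    return rounded, exponent_code, False


# Retained value 3'b111 with exponent code 8'b01111100 (124):
print(carry_with_overflow(0b111, 3, 0b01111100))
# Retained value 2'b01 does not overflow:
print(carry_with_overflow(0b01, 2, 0b01111100))
```

The first call reports an overflow and returns the exponent code 8'b01111101 from the example in step 405; the second leaves the exponent code unchanged.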
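The floating-point data precision conversion method provided in this application is further described below. FIG. 8 is a flowchart of another floating-point data precision conversion method according to an embodiment of this application. Step 401 includes the following steps.
Step 4011: The execution device determines an indicator value based on the first code value of the first exponent field, and determines, by table lookup, the first bit width and the first code value of the prefix code field that correspond to the indicator value, where the indicator value also indicates the first bit width of the second exponent field.
For example, the exponent value N of the first exponent field can be determined from its first code value, and the indicator value can be determined from N. The table lookup here refers to Table 3; the indicator value is the value of D, which may be 0, 1, 2, 3, or 4. In one example, the first code value of the first exponent field in Table 1 is 8'b01111100, which is 124 in decimal; removing the FP32 bias of 127 gives -3, which is the exponent value N. Using the formula $D=\mathrm{INT}[\log_2|N|]$, D is obtained as 2. Looking up Table 3 shows that, for the value 2, the first bit width of the prefix code field is 2 and its first code value is 01. The indicator value also indicates the first bit width of the second exponent field, that is, the first bit width of the second exponent field is 2.
Step 4012: The execution device determines, based on the first code value of the first exponent field, the first code value corresponding to the first bit width of the second exponent field.
For example, for the first exponent field in Table 1, referring to Table 4, D is 2 and the exponent value of the first exponent field is -3, so the exponent sign bit Se is 1. Because the first bit width of the second exponent field determined by the indicator value is 2, the first code value of the second exponent field is 11.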
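The stored exponent code value in this example follows from the signed magnitude encoding with a hidden leading 1, Es={Se,TF[2:end]}. The Python sketch below builds that bit string for a given exponent value N and indicator value D. It is illustrative only: the function name `encode_exponent` is hypothetical, D is taken as an input rather than computed, and the check that |N| is a D-bit value with a leading 1 is an assumption derived from the hidden-bit description.

```python
def encode_exponent(n: int, d: int) -> str:
    """Return the stored exponent bits Es = {Se, TF[2:end]} as a string of width d."""
    if d == 0:
        if n != 0:
            raise ValueError("D = 0 means the exponent value is 0")
        return ""
    magnitude = abs(n)
    if magnitude.bit_length() != d:
        raise ValueError("|N| must be a D-bit value whose top bit is 1")
    sign_bit = "1" if n < 0 else "0"       # Se: sign of the exponent
    true_form = format(magnitude, f"0{d}b")  # TF, with a leading 1
    return sign_bit + true_form[1:]        # drop the hidden leading 1


print(encode_exponent(-3, 2))   # Se = 1, TF = 11, stored bits "11"
```

For N = -3 and D = 2 the stored bits are 11, matching the first code value of the second exponent field in the example above, and the stored width equals D because the sign bit replaces the hidden leading 1.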
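Step 4013: The execution device determines the first bit width of the second mantissa field based on the total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
For example, if the total bit width of the second floating-point data is Nb, the first bit width of the prefix code field is Db, the first bit width of the second exponent field is Eb, the bit width of the sign field is 1, and the first bit width of the second mantissa field is Mb, then Mb=Nb-Db-Eb-1. In one example, for HiFloat8 data, Nb=8 and Db=2 or 3, so the first bit width of the second mantissa field is Mb=Nb-3-Eb (when Db=2) or Mb=Nb-Eb-4 (when Db=3).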
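The width relation Mb=Nb-Db-Eb-1 can be tabulated for every indicator value with a few lines of Python. The sketch assumes, as the description states, that the prefix code field is 2 bits wide for D in {2, 3, 4} and 3 bits wide for D in {0, 1}, and that the exponent field width equals D; the names `PREFIX_WIDTH` and `mantissa_width` are illustrative.

```python
# Prefix code width per indicator value D: 3 bits for 0 and 1, 2 bits for 2, 3, 4.
PREFIX_WIDTH = {0: 3, 1: 3, 2: 2, 3: 2, 4: 2}


def mantissa_width(d: int, total_width: int = 8) -> int:
    """Mb = Nb - Db - Eb - 1, with Eb taken to be equal to D."""
    sign_width = 1
    prefix_width = PREFIX_WIDTH[d]
    exponent_width = d
    return total_width - prefix_width - exponent_width - sign_width


widths = [mantissa_width(d) for d in range(5)]
print(widths)
```

Under these assumptions the mantissa width ranges from 4 bits (D = 0) down to 1 bit (D = 4), consistent with the stated range of 1 to 4 for the second mantissa field.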
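For example, for second floating-point data in the HiFloat8 data format, the bit width of the sign field is 1, the bit width of the prefix code field is 2 or 3, the first bit width of the second exponent field is an integer from 0 to 4, and the first bit width of the second mantissa field is an integer from 1 to 4. Using a short prefix code field to indicate the first bit width of the second exponent field therefore lets the second floating-point data provide up to 4 bits of mantissa precision, while second floating-point data that provides only 1 bit of mantissa precision can represent a large numerical range; this effectively balances bit width, range, and precision. In addition, the most significant bit of the second exponent field is hidden when stored, which reduces the bit width the second exponent field needs to store and prevents the first code values of the second exponent field that correspond to different indicator values from overlapping numerically, so the HiFloat8 data format has no redundant encodings.
Optionally, when the first floating-point data exceeds the upper limit of the data range of the second floating-point data, the second floating-point data is determined based on a saturation mode or an infinity mode.
For example, in the saturation mode, the largest floating-point value that the low-precision format can represent is used as the second floating-point data. In the infinity mode, the infinity value of the low-precision format is used as the second floating-point data. In one example, for HiFloat8 data, if the first floating-point data exceeds the upper limit of the HiFloat8 data range, the second floating-point data obtained after precision conversion may be represented as 8'b01101111.
Optionally, when the first floating-point data exceeds the lower limit of the data range of the second floating-point data, the second floating-point data is zero.
For example, for HiFloat8 data, if the first floating-point data exceeds the lower limit of the HiFloat8 data range, the second floating-point data is zero and may be represented as 8'b01111110.
Optionally, when the first floating-point data is a not-a-number value, the second floating-point data is a not-a-number value.
For example, for HiFloat8 data, if the first floating-point data is a not-a-number (NaN) value, the second floating-point data may be represented as 8'b11111110.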
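For example, FIG. 9 is a flowchart of converting FP32 data into HiFloat8 data according to an embodiment of this application. Taking conversion of FP32 data into HiFloat8 data in a conversion module as an example, the conversion includes the following steps.
(1) The conversion module receives FP32 data, which includes a sign field S, an exponent field E[0:7], and a mantissa field M[0:22];
(2) Determine whether the FP32 data is a special value (zero, not-a-number (NaN), positive infinity, or negative infinity); if it is, go to (21);
(3) If the FP32 data is not a special value, obtain the sign field of the FP32 data and go to (21);
(4) Remove the bias from the exponent field of the FP32 data, that is, E=E-bias;
(5) Compute the bit widths of the prefix code field, exponent field, and mantissa field of the HiFloat8 data, denoted db, eb, and mb respectively;
(6) Based on the exponent field of the FP32 data and the bit width of the exponent field of the HiFloat8 data, determine the code value of the exponent field of the HiFloat8 data, e=E[0:k], where k corresponds to the bit width of the exponent field of the HiFloat8 data;
(7) Configure the rounding mode;
(8) Generate the threshold;
(9) Obtain the most significant bit of the discarded code value of the mantissa field of the FP32 data as the discarded bit;
(10) Compare the discarded bit with the threshold;
(11) Make the rounding decision: if the discarded bit is greater than the threshold, perform the carry operation; if the discarded bit is less than the threshold, perform the discard operation;
(12) If the discard operation is performed, the retained code value of the mantissa field of the FP32 data is the mantissa field of the HiFloat8 data, m=M[0:N-db-eb-1]; go to (21);
(13) If the carry operation is performed, add 1 to the retained code value of the mantissa field of the FP32 data, m=M[0:N-db-eb-1]+1;
(14) Determine whether the retained code value of the mantissa field of the FP32 data overflows; if it does not, go to (21);
(15) If the retained code value overflows, add 1 to the exponent field of the FP32 data, E=E+1;
(16) Based on the incremented exponent field of the FP32 data, compute the new bit widths of the prefix code field and the exponent field of the HiFloat8 data, denoted db1 and eb1 respectively;
(17) Determine whether db1 equals db; if not, go to (21);
(18) If db1 equals db, determine whether eb1 is greater than eb;
(19) If eb1 is less than eb, add 1 to the bit width of the mantissa field of the HiFloat8 data, that is, mb=mb+1;
(20) If eb1 is greater than eb, subtract 1 from the bit width of the mantissa field of the HiFloat8 data, that is, mb=mb-1;
(21) Encode the HiFloat8 data;
(22) Obtain the HiFloat8 data.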
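It can be understood that, to implement the foregoing functions, the electronic device includes hardware and/or software modules corresponding to each function. With reference to the algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application in combination with the embodiments, but such implementation should not be considered to go beyond the scope of this application.
In this embodiment, the electronic device may be divided into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and is only a logical function division; other divisions are possible in actual implementation.
When each functional module corresponds to one function, FIG. 10 shows a possible composition of the electronic device 100 in the foregoing embodiments; FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of this application. The electronic device 100 may include a bit width calculation unit 101, a mantissa field calculation unit 102, and a rounding operation unit 103.
The bit width calculation unit 101 may be configured to support the electronic device 100 in performing steps 401, 4011, 4012, and 4013, and/or other processes of the technology described herein.
The mantissa field calculation unit 102 may be configured to support the electronic device 100 in performing step 402, and/or other processes of the technology described herein.
The rounding operation unit 103 may be configured to support the electronic device 100 in performing steps 403, 4031, and 4032, and/or other processes of the technology described herein.
It should be noted that all related content of the steps in the foregoing method embodiments can be cited in the functional descriptions of the corresponding functional modules; details are not repeated here.
The electronic device 100 provided in this embodiment is configured to perform the foregoing floating-point data precision conversion method and can therefore achieve the same effects as the foregoing implementation.
When an integrated unit is used, the electronic device 100 may include a processing module, a storage module, and a communication module. The processing module may be configured to control and manage the actions of the electronic device 100; for example, it may support the electronic device 100 in performing the steps performed by the bit width calculation unit 101, the mantissa field calculation unit 102, and the rounding operation unit 103. The storage module may be configured to support the electronic device 100 in storing program code, data, and the like. The communication module may be configured to support communication between the electronic device 100 and other devices, for example communication with a wireless access device.
The processing module may be a processor or a controller, which can implement or execute the various exemplary logical blocks, modules, and circuits described with reference to the disclosure of this application. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor. The storage module may be a memory. The communication module may specifically be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or another device that interacts with other electronic devices.
In one embodiment, when the processing module is a processor and the storage module is a memory, the electronic device in this embodiment may be a server, a computer, or the like.
An embodiment of this application further provides an electronic device including one or more processors and one or more memories. The one or more memories are coupled to the one or more processors and are configured to store computer program code that includes computer instructions. When the one or more processors execute the computer instructions, the electronic device performs the foregoing related method steps to implement the floating-point data precision conversion method in the foregoing embodiments.
An embodiment of this application further provides a computer storage medium that stores computer instructions. When the computer instructions are run on an electronic device, the electronic device performs the foregoing related method steps to implement the floating-point data precision conversion method in the foregoing embodiments.
An embodiment of this application further provides a computer program product. When the computer program product is run on a computer, the computer performs the foregoing related steps to implement the floating-point data precision conversion method performed by the electronic device in the foregoing embodiments.
In addition, an embodiment of this application further provides an apparatus, which may specifically be a chip, a component, or a module, and which may include a processor and a memory that are connected. The memory is configured to store computer-executable instructions. When the apparatus runs, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the floating-point data precision conversion method performed by the electronic device in the foregoing method embodiments.
The electronic device, computer storage medium, computer program product, and chip provided in this embodiment are all configured to perform the corresponding method provided above. For the beneficial effects they achieve, refer to the beneficial effects of the corresponding method; details are not repeated here.
From the description of the foregoing implementations, a person skilled in the art can understand that, for convenience and brevity of description, only the foregoing division of functional modules is used as an example. In practice, the functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or some of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely schematic. The division of modules or units is only a logical function division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed across multiple places. Some or all of the units may be selected as needed to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely a specific implementation of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.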

Claims (16)

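  1. A floating-point data precision conversion method, wherein first floating-point data comprises a sign field, a first exponent field, and a first mantissa field; second floating-point data comprises the sign field, a prefix code field, a second exponent field, and a second mantissa field; the prefix code field indicates a bit width of the second exponent field; precision of the first floating-point data is higher than precision of the second floating-point data; and the method comprises:
    determining, based on a first code value of the first exponent field, a first bit width of the prefix code field, a first code value of the prefix code field, a first bit width of the second exponent field, a first code value of the second exponent field, and a first bit width of the second mantissa field;
    determining a retained code value and a discarded code value in the first mantissa field, wherein the retained code value comprises a code value that starts from a most significant bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field; and
    performing a rounding operation on the retained code value based on the discarded code value to obtain a first code value of the second mantissa field.
  2. The method according to claim 1, wherein the rounding operation comprises a carry operation and a discard operation, and the performing a rounding operation on the retained code value based on the discarded code value to obtain a first code value of the second mantissa field comprises:
    when a code value that starts from a most significant bit of the discarded code value and has a preset bit width is greater than or equal to a second preset threshold, performing the carry operation on a least significant bit of the retained code value and the discard operation on the discarded code value, wherein the retained code value after the carry is the first code value of the second mantissa field; and
    when the code value that starts from the most significant bit of the discarded code value and has the preset bit width is less than the second preset threshold, performing the discard operation on the discarded code value, wherein the retained code value is the first code value of the second mantissa field;
    wherein the second preset threshold is a code value that starts from a least significant bit of the discarded code value and has the preset bit width.
  3. The method according to claim 1, wherein the rounding operation comprises a carry operation or a discard operation, and the performing a rounding operation on the retained code value based on the discarded code value to obtain a first code value of the second mantissa field comprises:
    when a most significant bit of the discarded code value is greater than or equal to a first preset threshold, performing the carry operation on a least significant bit of the retained code value and the discard operation on the discarded code value, wherein the code value obtained after the carry operation is the first code value of the second mantissa field; and
    when the most significant bit of the discarded code value is less than the first preset threshold, performing the discard operation on the discarded code value, wherein the retained code value is the first code value of the second mantissa field.
  4. The method according to claim 2 or 3, wherein the method further comprises:
    determining whether the retained code value overflows after the carry operation;
    if the retained code value overflows after the carry operation, adding 1 to a least significant bit of the first code value of the first exponent field to obtain a second code value of the first exponent field;
    determining, based on the second code value of the first exponent field, a second bit width of the second exponent field and a second bit width of the prefix code field;
    if the second bit width of the prefix code field differs from the first bit width of the prefix code field, determining, based on the second code value of the first exponent field, a second code value of the prefix code field, a second code value of the second exponent field, a second bit width of the second mantissa field, and a second code value of the second mantissa field;
    if the second bit width of the prefix code field is the same as the first bit width of the prefix code field, determining whether the first bit width of the second exponent field and the second bit width of the second exponent field are the same;
    if the second bit width of the second exponent field is less than the first bit width of the second exponent field, adding 1 to the bit width of the retained code value to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field; and
    if the second bit width of the second exponent field is greater than or equal to the first bit width of the second exponent field, discarding a least significant bit of the retained code value to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field.
  5. The method according to claim 1, wherein the determining, based on a first code value of the first exponent field, a first bit width of the prefix code field, a first code value of the prefix code field, a first bit width of the second exponent field, and a first code value of the second exponent field comprises:
    determining an indicator value based on the first code value of the first exponent field, and determining, by table lookup, the first bit width of the prefix code field and the first code value of the prefix code field that correspond to the indicator value, wherein the indicator value further indicates the first bit width of the second exponent field; and
    determining, based on the first code value of the first exponent field, the first code value corresponding to the first bit width of the second exponent field.
  6. The method according to claim 1 or 5, wherein the determining, based on the first code value of the first exponent field, the first bit width of the second mantissa field comprises:
    determining the first bit width of the second mantissa field based on a total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
  7. The method according to claim 1, wherein the method further comprises:
    when the first floating-point data exceeds an upper limit of a data range of the second floating-point data, determining the second floating-point data based on a saturation mode or an infinity mode;
    when the first floating-point data exceeds a lower limit of the data range of the second floating-point data, the second floating-point data is zero; and
    when the first floating-point data is a not-a-number value, the second floating-point data is a not-a-number value.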
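  8. A floating-point data precision conversion apparatus, wherein first floating-point data comprises a sign field, a first exponent field, and a first mantissa field; second floating-point data comprises the sign field, a prefix code field, a second exponent field, and a second mantissa field; the prefix code field indicates a bit width of the second exponent field; precision of the first floating-point data is higher than precision of the second floating-point data; and the apparatus comprises:
    a bit width calculation unit, configured to determine, based on a first code value of the first exponent field, a first bit width of the prefix code field, a first code value of the prefix code field, a first bit width of the second exponent field, a first code value of the second exponent field, and a first bit width of the second mantissa field;
    a mantissa field calculation unit, configured to determine a retained code value and a discarded code value in the first mantissa field, wherein the retained code value comprises a code value that starts from a most significant bit of the first mantissa field and has the same bit width as the first bit width of the second mantissa field; and
    a rounding operation unit, configured to perform a rounding operation on the retained code value based on the discarded code value to obtain a first code value of the second mantissa field.
  9. The apparatus according to claim 8, wherein the rounding operation comprises a carry operation or a discard operation, and the rounding operation unit is further configured to:
    when a code value that starts from a most significant bit of the discarded code value and has a preset bit width is greater than or equal to a second preset threshold, perform the carry operation on a least significant bit of the retained code value and the discard operation on the discarded code value, wherein the retained code value after the carry is the first code value of the second mantissa field; and
    when the code value that starts from the most significant bit of the discarded code value and has the preset bit width is less than the second preset threshold, perform the discard operation on the discarded code value, wherein the retained code value is the first code value of the second mantissa field;
    wherein the second preset threshold is a code value that starts from a least significant bit of the discarded code value and has the preset bit width.
  10. The apparatus according to claim 8, wherein the rounding operation comprises a carry operation or a discard operation, and the rounding operation unit is further configured to:
    when a most significant bit of the discarded code value is greater than or equal to a first preset threshold, perform the carry operation on a least significant bit of the retained code value and the discard operation on the discarded code value, wherein the code value obtained after the carry operation is the first code value of the second mantissa field; and
    when the most significant bit of the discarded code value is less than the first preset threshold, perform the discard operation on the discarded code value, wherein the retained code value is the first code value of the second mantissa field.
  11. The apparatus according to claim 9 or 10, wherein the apparatus further comprises:
    an overflow unit, configured to determine whether the retained code value overflows after the carry operation;
    wherein the bit width calculation unit is further configured to: if the retained code value overflows after the carry operation, add 1 to the first code value of the first exponent field to obtain a second code value of the first exponent field;
    determine, based on the second code value of the first exponent field, a second bit width of the second exponent field and a second bit width of the prefix code field;
    if the second bit width of the prefix code field differs from the first bit width of the prefix code field, determine, based on the second code value of the first exponent field, a second code value of the prefix code field, a second code value of the second exponent field, a second bit width of the second mantissa field, and a second code value of the second mantissa field;
    if the second bit width of the prefix code field is the same as the first bit width of the prefix code field, determine whether the first bit width of the second exponent field and the second bit width of the second exponent field are the same;
    if the second bit width of the second exponent field is less than the first bit width of the second exponent field, add 1 to the bit width of the retained code value to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field; and
    if the second bit width of the second exponent field is greater than or equal to the first bit width of the second exponent field, discard a least significant bit of the retained code value to obtain the second bit width of the second mantissa field and the second code value of the second mantissa field.
  12. The apparatus according to claim 8, wherein the bit width calculation unit is further configured to:
    determine an indicator value based on the first code value of the first exponent field, and determine, by table lookup, the first bit width of the prefix code field and the first code value of the prefix code field that correspond to the indicator value, wherein the indicator value further indicates the first bit width of the second exponent field; and
    determine, based on the first code value of the first exponent field, the first code value corresponding to the first bit width of the second exponent field.
  13. The apparatus according to any one of claims 8 to 12, wherein the bit width calculation unit is further configured to:
    determine the first bit width of the second mantissa field based on a total bit width of the second floating-point data, the first bit width of the prefix code field, and the first bit width of the second exponent field.
  14. The apparatus according to claim 8, wherein the bit width calculation unit is further configured to:
    when the first floating-point data exceeds an upper limit of a conversion range of the second floating-point data, determine the second floating-point data based on a saturation mode or an infinity mode;
    when the first floating-point data exceeds a lower limit of the conversion range of the second floating-point data, set the second floating-point data to zero; and
    when the first floating-point data is a not-a-number value, set the second floating-point data to a not-a-number value.
  15. A computer-readable storage medium, comprising computer instructions, wherein when the computer instructions are run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1 to 7.
  16. A computer program product, wherein when the computer program product is run on a computer or a processor, the computer or the processor is caused to perform the method according to any one of claims 1 to 7.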
PCT/CN2023/102089 2022-10-19 2023-06-25 Floating-point data precision conversion method and apparatus Ceased WO2024082674A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP23878671.9A EP4597299A4 (en) 2022-10-19 2023-06-25 METHOD AND APPARATUS FOR CONVERTING PRECISION DATA INTO FLOATING POINTS
US19/181,941 US20250278241A1 (en) 2022-10-19 2025-04-17 Floating-point data precision conversion method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211281416.6 2022-10-19
CN202211281416.6A CN117908827A (zh) 2022-10-19 2022-10-19 Floating-point data precision conversion method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/181,941 Continuation US20250278241A1 (en) 2022-10-19 2025-04-17 Floating-point data precision conversion method and apparatus

Publications (1)

Publication Number Publication Date
WO2024082674A1 true WO2024082674A1 (zh) 2024-04-25

Family

ID=90695281

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/102089 Ceased WO2024082674A1 (zh) 2022-10-19 2023-06-25 Floating-point data precision conversion method and apparatus

Country Status (4)

Country Link
US (1) US20250278241A1 (zh)
EP (1) EP4597299A4 (zh)
CN (1) CN117908827A (zh)
WO (1) WO2024082674A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
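CN118170347B (zh) * 2024-05-11 2024-11-26 Beijing Biren Technology Development Co., Ltd. Precision conversion apparatus, data processing method, processor, and electronic device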

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282780A1 (en) * 2012-04-23 2013-10-24 Lsi Corporation Method and Apparatus to Perform Floating Point Operations
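CN104778026A (zh) * 2015-04-28 2015-07-15 Inspur Electronic Information Industry Co., Ltd. High-speed data format conversion component with SIMD and conversion method
CN111340207A (zh) * 2020-03-03 2020-06-26 Nanjing University Floating-point number conversion method and apparatus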

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3742198A (en) * 1971-03-19 1973-06-26 Bell Telephone Labor Inc Apparatus for utilizing a three-field word to represent a floating point number
US8106914B2 (en) * 2007-12-07 2012-01-31 Nvidia Corporation Fused multiply-add functional unit
US9582248B2 (en) * 2014-09-26 2017-02-28 Arm Limited Standalone floating-point conversion unit

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282780A1 (en) * 2012-04-23 2013-10-24 Lsi Corporation Method and Apparatus to Perform Floating Point Operations
CN104778026A (zh) * 2015-04-28 2015-07-15 浪潮电子信息产业股份有限公司 一种带simd的高速数据格式转换部件及转换方法
CN111340207A (zh) * 2020-03-03 2020-06-26 南京大学 浮点数转换方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4597299A4

Also Published As

Publication number Publication date
CN117908827A (zh) 2024-04-19
US20250278241A1 (en) 2025-09-04
EP4597299A4 (en) 2025-12-17
EP4597299A1 (en) 2025-08-06

Similar Documents

Publication Publication Date Title
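CN115934030B (zh) Arithmetic logic unit, and method and device for floating-point multiplication
CN112230881B (zh) Floating-point number processor
WO2022143432A1 (zh) Matrix computing apparatus, method, system, circuit, chip, and device
CN106951211B (zh) Reconfigurable fixed-point and floating-point general-purpose multiplier
CN110515589B (zh) Multiplier, data processing method, chip, and electronic device
US20230305803A1 (en) Method for Processing Floating Point Number and Related Device
WO2023029464A1 (zh) Data processing apparatus and method, chip, computer device, and storage medium
CN111381808B (zh) Multiplier, data processing method, chip, and electronic device
CN118915995A (zh) Operation unit, and floating-point operation method and apparatus
WO2023124235A1 (zh) Multi-input floating-point number processing method and apparatus, processor, and computer device
CN115586922A (zh) SpMV mixed-precision optimization method with decoupled storage and computation formats
CN116974517A (zh) Floating-point number processing method and apparatus, computer device, and processor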
US20250278241A1 (en) Floating-point data precision conversion method and apparatus
US20260003571A1 (en) Floating-Point Data Precision Conversion Method and Apparatus
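CN113791756B (zh) Number conversion method, storage medium, apparatus, and board card
CN117910537A (zh) Neural network training method and apparatus
CN116882475A (zh) Training method and apparatus applied to a neural network, and related products
CN111310909A (zh) Floating-point number conversion circuit
CN209895329U (zh) Multiplier
WO2019205064A1 (zh) Neural network acceleration apparatus and method
CN116502028B (zh) Method and apparatus for implementing large-scale FFT based on floating-point compression
CN111313906A (zh) Conversion circuit for floating-point numbers
CN121411824A (zh) ALU, processor, chip product, and device
WO2025107602A1 (zh) Data processing method and data processing apparatus
CN121235129A (zh) Large-model optimization method and apparatus based on data quantization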

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23878671; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2023878671; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2023878671; Country of ref document: EP; Effective date: 20250428)
NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 11202502549V; Country of ref document: SG)
WWP WIPO information: published in national office (Ref document number: 11202502549V; Country of ref document: SG)
WWP WIPO information: published in national office (Ref document number: 2023878671; Country of ref document: EP)