WO2024082665A1 - 预条件子处理方法、装置、设备及系统 - Google Patents

预条件子处理方法、装置、设备及系统 Download PDF

Info

Publication number
WO2024082665A1
WO2024082665A1 PCT/CN2023/101175 CN2023101175W WO2024082665A1 WO 2024082665 A1 WO2024082665 A1 WO 2024082665A1 CN 2023101175 W CN2023101175 W CN 2023101175W WO 2024082665 A1 WO2024082665 A1 WO 2024082665A1
Authority
WO
WIPO (PCT)
Prior art keywords
thread
preconditioner
computing
executed
computing task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/101175
Other languages
English (en)
French (fr)
Other versions
WO2024082665A9 (zh
Inventor
王欣亮
王芹
丁肇辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to EP23878662.8A priority Critical patent/EP4589434A4/en
Publication of WO2024082665A1 publication Critical patent/WO2024082665A1/zh
Publication of WO2024082665A9 publication Critical patent/WO2024082665A9/zh
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/13Differential equations
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores

Definitions

  • the present application relates to the field of computers, and in particular to a preconditioner processing method, device, equipment and system.
  • the problem of solving partial differential equations in high-performance computing of computers can be converted into the problem of solving large-scale sparse linear equations.
  • the improvement of the convergence and convergence speed of the iteration matrix depends not only on the iteration method and the selection of parameters in the iteration matrix, but also on some changes of the equation system itself.
  • the introduction of preconditioners greatly speeds up the convergence and convergence speed of iterations.
  • Preconditioning large-scale sparse linear equations can improve the convergence speed of large-scale sparse linear equations.
  • the data used for preconditioning is correlated, that is, the subsequent calculation tasks of some preconditioners often need to rely on the results of the previous calculation tasks, the parallel processing of these preconditioners needs to waste a lot of time waiting for the previous calculation tasks, resulting in low parallel execution efficiency of preconditioners.
  • the present application provides a preconditioner processing method, device, equipment and system, which can solve the problem of low parallel execution efficiency of preconditioners.
  • a preconditioner processing method is provided.
  • the preconditioner processing method is executed by a preconditioner processing device, the preconditioner processing device includes multiple threads, each thread is used to execute at least one thread computing task contained in the preconditioner in the process of solving a sparse matrix linear equation system.
  • the method includes: when the preconditioner processing device uses the preconditioner to perform multi-threaded parallel processing on the sparse linear equation system, the preconditioner processing device first determines the thread dependency between at least one thread computing task to be executed by the preconditioner, and then each of the multiple threads of the preconditioner processing device sequentially executes the to-be-executed thread computing task whose dependent thread computing tasks are all in the executed state, and then outputs the preconditioner computing result according to the execution result of at least one thread computing task.
  • the thread dependency is used to indicate the dependency between at least one thread computing task, that is, for each thread computing task, the thread computing task can be executed when at least one dependent thread computing task associated with the thread computing task is in the executed state.
  • the preconditioner processing device determines the dependency relationship between threads in the computing tasks of the preconditioner, and the preconditioner processing device performs multi-threaded parallelism on the thread computing tasks based on the thread dependency, thereby converting the full thread synchronization between layers into point-to-point synchronization between threads, that is, the thread computing tasks of the lower layer can be executed after the thread computing tasks of the upper layer on which they depend are in the executed state, thereby reducing the time consumption of synchronization between layers and improving the parallel execution efficiency of the preconditioner.
  • the preconditioner processing device converts the element dependency between at least one element computing task contained in the original sparse matrix into a thread dependency based on the original sparse matrix of the sparse matrix linear equation system, that is, each thread needs to rely on which threads have completed which levels of computing tasks before executing each level of computing tasks.
  • the element dependency is used to indicate that executing an element computing task requires at least one dependent element computing task to be in an executed state.
  • the preconditioner sub-processor divides the multiple element calculation tasks contained in the original sparse matrix into multiple calculation levels according to the element dependency, and then merges the element calculation tasks executed by the same thread in the same calculation level into thread calculation tasks according to the task allocation result, and then determines the thread dependency between at least one thread calculation task according to the element dependency.
  • the element calculation tasks of the same calculation level can be executed at the same time, and the task allocation result is used to indicate which thread among the multiple threads executes the element calculation task.
  • the preconditioner sub-processor merges the element calculation tasks belonging to the same thread in each calculation level into thread calculation tasks.
  • the computing tasks are integrated into one thread computing task. Each thread is responsible for only one thread computing task at each computing level. The thread computing task of a thread at the next level only depends on one thread computing task at the previous level, so that the thread dependency can be concretely represented.
  • the preconditioning sub-processing device after determining the thread dependency, also deletes the redundant dependency in the thread dependency between at least one thread computing task, and the dependency between the thread computing tasks represented by the redundant dependency can be indirectly represented by a set of thread dependencies outside itself.
  • the number of thread dependencies is reduced, and the thread is prevented from consuming computing resources and time to confirm the redundant dependencies before executing the thread computing task, further reducing the inter-layer synchronization overhead and improving the parallel execution efficiency.
  • the preconditioner processing device rearranges the original sparse matrix and then executes at least one thread computing task in parallel using multiple threads.
  • the preconditioner processing device rearranges the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, and each of the multiple threads executes the thread computing tasks to be executed in sequence according to the rearranged matrix and the memory access order when executing the thread computing tasks in parallel in multiple threads.
  • the memory access order is the memory access order of the multiple threads for at least one thread computing task, and at least one thread computing task executed by each thread in the rearranged matrix is arranged together.
  • each thread can access the memory continuously when performing its own calculations, which is conducive to fully utilizing the memory bandwidth.
  • the matrix is rearranged to optimize the continuity of memory access and will not change any dependencies and dependency order, so it will not affect the solution of the preconditioner.
  • each thread in the preconditioner processing device determines whether the thread dependency has been satisfied based on a flag. For example, the preconditioner processing device determines whether at least one dependent thread computing task associated with the thread computing task is in an executed state based on the flag of at least one thread computing task, and if so, the thread computing task is used as a thread computing task to be executed. Thus, each thread in the preconditioner processing device uses the flag to quickly and accurately determine whether the dependent thread computing task of the thread computing task has been executed, thereby improving the overall efficiency of the preconditioner multi-thread parallelism.
  • a preconditioner processing device which executes multiple threads, each thread is used to execute at least one thread computing task included in the preconditioner, and the device includes various modules for executing the preconditioner processing method in the first aspect or any possible implementation of the first aspect.
  • the preconditioner processing device includes an analysis module and an execution module.
  • the analysis module is used to determine the thread dependency of at least one thread computing task, where the thread dependency is used to indicate that each thread in the multiple threads can execute the thread computing task when at least one dependent thread computing task associated with the thread computing task is in an executed state.
  • the execution module is used to instruct multiple threads to execute the to-be-executed thread computing tasks according to the thread dependency relationship, and at least one dependent thread computing task associated with the to-be-executed thread computing task is in an executed state.
  • the execution module is further used to output a preconditioner calculation result according to the execution result of at least one thread calculation task.
  • the analysis module is specifically used to: convert the element dependency between at least one element calculation task contained in the original sparse matrix into a thread dependency, the element dependency is used to indicate that executing the element calculation task requires at least one dependent element calculation task to be in an executed state.
  • the analysis module is specifically used to: divide multiple element calculation tasks contained in the original sparse matrix into multiple calculation levels according to element dependencies, and the element calculation tasks of the same calculation level can be executed simultaneously; merge the element calculation tasks executed by the same thread in the same calculation level into thread calculation tasks according to the task allocation result, and the task allocation result is used to indicate which thread among the multiple threads executes the element calculation task; determine the thread dependency between at least one thread calculation task according to the element dependency.
  • the analysis module is further used to: delete a redundant dependency in the thread dependency between at least one thread computing task, where the dependency between the thread computing tasks represented by the redundant dependency can be indirectly represented by a group of thread dependencies outside the thread computing tasks.
  • the preconditioner processing device also includes a preparation module, which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, wherein the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
  • a preparation module which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, wherein the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
  • the execution module is specifically used to: instruct each thread in the multiple threads to execute in sequence according to the reordering matrix and the memory access order Thread computing tasks to be executed.
  • the execution module is also used to: determine, based on a flag bit of at least one thread computing task, a thread computing task whose associated at least one dependent thread computing task is in an executed state as a thread computing task to be executed, wherein the flag bit is used to indicate whether the thread computing task is in an executed state.
  • the pre-conditioning sub-processing device described in the second aspect can be a terminal device or a network device, or it can be a chip (system) or other parts or components that can be set in the terminal device or the network device, or it can be a device that includes a terminal device or a network device, and this application does not limit this.
  • a preconditioner processing device comprising a memory and a processor, wherein the memory is used to store a set of computer instructions, and when the processor executes the set of computer instructions, it is used to execute the operating steps of the preconditioner processing method in any possible design in the first aspect.
  • a computer system comprising an application server and a preconditioner processing device, wherein the application server is used to send a request for solving a sparse linear equation group to the preconditioner processing device when calculating a sparse linear equation group, and the preconditioner processing device is used to execute the operating steps of the preconditioner processing method in any possible design in the first aspect.
  • the technical effects of the preconditioner processing device described in the second aspect, the technical effects of the preconditioner processing equipment described in the third aspect, and the technical effects of the computer system described in the fourth aspect can refer to the technical effects of the preconditioner processing method described in the first aspect, and will not be repeated here.
  • a computer-readable storage medium comprising: computer software instructions; when the computer software instructions are executed in a data processing system, the computing device executes the operating steps of the method described in any possible implementation of the first aspect.
  • a computer program product is provided.
  • the computing device executes the operation steps of the method described in any possible implementation manner of the first aspect.
  • FIG1 is a schematic diagram of a coloring method provided by the present application.
  • FIG2 is a schematic diagram of a layering method provided by the present application.
  • FIG3 is a schematic diagram of the structure of a computer system provided by the present application.
  • FIG4 is a schematic diagram of a flow chart of a preconditioner processing method provided by the present application.
  • FIG5 is a schematic diagram of a task allocation result provided by the present application.
  • FIG6 is a schematic diagram of a principle of matrix rearrangement provided by the present application.
  • FIG7 is a schematic diagram of the principle of pruning redundant dependencies provided by the present application.
  • FIG8 is a schematic diagram of a process flow of a thread executing a computing task provided by the present application.
  • FIG9 is a schematic diagram of the structure of a preconditioner processing device provided by the present application.
  • FIG10 is a schematic diagram of the structure of a computing device provided by the present application.
  • the process of solving a linear system is the process of obtaining the solution vector x through A and b.
  • the dimensions of x and b are n
  • the dimension of A is n*n, that is, there are n ⁇ 2 elements.
  • most of the n ⁇ 2 elements of A are The numbers are all 0. The proportion of 0 varies depending on different fields and problems, and is generally less than 0.1% or even lower.
  • the direct method is to obtain the exact solution of the equations through a finite number of operations including matrix decomposition and triangular equations solution without considering the round-off error, also known as the exact method.
  • the iterative method is to construct a vector sequence through certain calculations given an initial solution vector (a series of approximate solutions close to the exact value are obtained through successive iterations), and the limit of the vector sequence is used as the theoretical exact solution of the sparse linear equations.
  • the iterative method requires less storage space and has advantages in solving high-order non-ill-conditioned sparse linear equations.
  • the iterative method has problems such as slow convergence or no convergence when solving ill-conditioned problems, and difficulty in precision control.
  • problems such as slow convergence or no convergence when solving ill-conditioned problems, and difficulty in precision control.
  • the result When solving the system of equations, if the data is slightly disturbed, the result will have large fluctuations.
  • Such a matrix is called an ill-conditioned matrix.
  • preconditioners are introduced into the iterative method to solve the above problems of the iterative method.
  • Preconditioners are used to improve the properties of the coefficient matrix of the linear equation system, and transform the given linear equation system into a transformation (matrix) of the equation system that is easier to solve. They are mainly used to accelerate the convergence of the iterative method.
  • the preconditioner is one of the most important modules.
  • the existence of the preconditioner can help improve the convergence speed of the iterative method on the one hand, and help solve the ill-conditioned matrix on the other hand.
  • the preconditioner supports nesting. For example, method A and method B are executed in sequence to jointly complete the function of the preconditioner, and the bottom-level preconditioner often chooses sor or ilu, which is mainly due to its relatively simple algorithm flow and good convergence.
  • Sor algorithms include Gauss-Seidel, symmetric Gauss-Seidel, symmetric successive over-relaxation, etc.
  • Ilu algorithms include ilu(0), ilu(1), ilu(k), etc. Different from sor algorithms, ilu algorithms are divided into two steps in terms of process, ilu decomposition process and ilu re-banding process. However, decomposition and re-banding are consistent in calculation logic, so they can also be considered together here.
  • the LU algorithm is the core method for solving sparse matrices using the direct method.
  • the ILU algorithm an incomplete LU algorithm, can control the number of non-zero fill-ins according to the fill-in level to achieve higher efficiency.
  • 0 means the fill-in level is 0, that is, there will be no fill-in, which means that if the original position is non-zero, it will be continuously updated during the decomposition process, and if the original position is 0, it will not be updated and will always be 0.
  • the solution accuracy of ILU is not as good as that of the LU algorithm, but as a preconditioner, it meets the accuracy requirements, because the final solution correctness of the sparse equation is guaranteed by the iterative method, and the preconditioner is only used as a means to accelerate this process.
  • ilu(0) is divided into two processes: decomposition and back-banding.
  • the former is to numerically decompose the original matrix to obtain a lower triangular matrix L and an upper triangular matrix U.
  • Figure 1 is the pseudo code of the ilu(0) decomposition process. It is worth noting that since ilu(0) does not fill in new non-zero elements, the decomposed lower triangular matrix and upper triangular matrix can completely use the space of the original matrix A, that is, a local decomposition is performed.
  • the above dependency makes multi-threaded parallel acceleration very difficult, because the premise of parallelism is to have enough unrelated computing tasks.
  • the existing technology mainly focuses on how to mine parallel computing tasks, which can be roughly divided into mining parallel tasks and creating parallel tasks. The former is to analyze which computing tasks can be executed in parallel without dependencies by analyzing the dependency graph, and the latter is to forcibly break some dependencies to create more unrelated parallel computing tasks.
  • the coloring method is to color each element in the dependency graph obtained from the original sparse matrix A with a different color, and the elements of the same color are not
  • the elements of different colors are executed in series.
  • coloring methods such as coloring after determining the threshold range, that is, the points within the threshold range are colored with different colors.
  • Figure 1 is a coloring sample diagram when the threshold is 1.
  • 0-12 represent different elements contained in the original sparse matrix A.
  • the threshold is only 1, which means that adjacent elements need to be colored in different colors, and non-adjacent elements can be colored in the same color.
  • 0, 3, 4, 10, 11, and 12 are colored in the same color and can be executed in parallel; 1, 5, and 7 are colored in the same color and can be executed in parallel; 2, 6, 8, and 9 are colored in the same color and can be executed in parallel.
  • elements 0, 3, 4, 10, 11, and 12, elements 1, 5, and 7, and elements 2, 6, 8, and 9 are colored in three different colors respectively.
  • Different grayscales are used to represent different colors in Figure 1. Different colors need to be executed serially.
  • the nested multi-level main diagonal block method is to create parallel tasks by making nested symmetric changes to the original matrix A.
  • Any sparse matrix can be transformed into A11 and A22 are independent of each other and can be executed in parallel by two threads, and the rest are executed in series. If the above symmetric transformation is nested, more threads can be executed in parallel, as shown below:
  • the layered method divides the calculations for different elements into layers based on the dependencies between the elements. Elements at the same layer can be calculated simultaneously, and elements at different layers are executed in sequence.
  • the 16 elements are divided into 4 layers and executed in parallel by 3 threads.
  • Different grayscales represent different colors, and different colors represent the computing tasks that different threads are responsible for.
  • the existing technology uses the coloring method or the nested multi-level main diagonal block method to perform multi-threaded parallel processing on the preconditioner, which has the problem of reducing the convergence of the algorithm.
  • the Krylov subspace iteration method is combined with ILU (0) as the preconditioner.
  • the serial ILU (0) is used, the number of iterations is increased to 70 steps after switching to the above two parallel methods. This is mainly because the coloring method or the nested multi-level main diagonal block method breaks the original dependency relationship.
  • the former is through the graph method, and the latter is through the mathematical transformation method.
  • the dependency relationship that should have been considered is ignored and forced parallelization is performed, resulting in a decrease in convergence; through the mathematical transformation method, the order of decomposition is changed.
  • the result of ILU is strongly related to the order, so the mathematical transformation will still change (most likely reduce) the convergence.
  • the remaining parts of the nested multi-level main diagonal block method are usually difficult to parallelize, so the parallel efficiency is low in the later stage of the algorithm.
  • the layered method although the element calculation tasks in different layers can be multi-threaded and parallelized, the layers need to be synchronized to ensure correctness. When the number of cores increases, the synchronization overhead will increase significantly, resulting in a decrease in the overall speed of the algorithm. Furthermore, the jumpy memory access of threads will also lead to a decrease in memory access efficiency, and the memory bandwidth resources cannot be fully utilized.
  • the present application provides a preconditioner processing method, in particular, a preconditioner processing method for multi-threaded parallel processing of preconditioners based on thread dependencies, that is, a preconditioner processing device determines the thread dependency between at least one thread computing task contained in the preconditioner, the thread dependency is used to indicate that the execution of the thread computing task requires at least one dependent thread computing task to be in an executed state, and each thread of the preconditioner processing device sequentially executes the to-be-executed thread computing task of which at least one associated dependent thread computing task is in an executed state according to the memory access order, and outputs the preconditioner computing result according to the execution result of at least one thread computing task.
  • the preconditioner processing device converts the dependency between elements in the computing task of the preconditioner into the dependency between threads, and the preconditioner processing device performs multi-threaded parallel processing on the thread computing tasks based on the thread dependencies, and each thread in the lower layer of the thread computing task executes the lower layer of the thread computing task after the upper layer of the thread computing task on which it depends is in an executed state, thereby converting the full thread synchronization between layers into point-to-point synchronization between threads.
  • FIG3 is a schematic diagram of the structure of a computer system provided by the present application.
  • the computer system 300 includes an application server 301 and a preconditioning sub-processing device 302 .
  • the application server 301 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, a virtual reality (VR) device, an augmented reality (AR) device, a mixed reality (MR) device, an extended reality (ER) device, a camera or a vehicle-mounted terminal, etc. It can also be an edge device (for example, a box carrying a chip with processing capabilities), etc.
  • the preconditioner processing device 302 may be a terminal, or other computing device that supports the calculation of sparse linear equations, such as a server or a cloud device.
  • the application server 301 and the preconditioning sub-processing device 302 are different processors deployed on different physical devices (such as: a server or a server in a cluster).
  • the execution device 410 can be a graphics processing unit (GPU), a central processing unit (CPU), other general-purpose processors, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or any conventional processor, etc.
  • the preconditioning sub-processing device 302 can be a graphics processing unit (GPU), a neural network processor (NPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.
  • GPU graphics processing unit
  • NPU neural network processor
  • ASIC application-specific integrated circuit
  • the application server 301 and the preconditioning sub-processing device 302 are deployed on the same physical device, or the application server 301 and the preconditioning sub-processing device 302 are the same physical device.
  • the application server 301 is used to determine the sparse linear equation group that needs to be solved in response to the user operation, and send a sparse linear equation group solution request to the preconditioner processing device 302 based on the sparse linear equation group.
  • the preconditioning sub-processing device 302 is used to receive a sparse linear equation solution request sent by the application server 301 , calculate the sparse linear equation system, and return the solution result of the sparse linear equation system to the application server 301 .
  • the preconditioner processing device 302 is provided with software in the field of high-performance computing including core computing functions such as partial differential equation solvers.
  • the core module of the differential equation solver is a linear solver, which supports the partial differential equation solver upward and indirectly supports software in the field of high-performance computing.
  • the linear solver can include different algorithms, such as the conjugate gradient method, the generalized minimum residual method, the generalized conjugate residual method, the Chebyshev method, etc.
  • the key module of the linear solver is the preconditioner, which can include different algorithms, such as the additive Schwarz algorithm, the block Jacobi algorithm, the Jacobi algorithm, the ILU algorithm, the SOR algorithm, the multigrid algorithm, etc.
  • These preconditioners are a nested process, and different algorithms can be nested together. As shown by the arrows in Figure 3, multiple preconditioners often end up nesting ILU algorithms or SOR algorithms.
  • the user inputs the problem to be solved into the software in the field of high-performance computing through the operation interface of the preconditioner processing device 302, and the software calls the partial differential equation solver to solve the partial differential equation contained in the problem to be solved.
  • the partial differential equation solver discretizes the equation and converts the partial differential equation solving problem into a large-scale sparse linear equation system solving problem.
  • the partial differential equation solver calls the linear solver to obtain the solution result of the large-scale sparse linear equation system.
  • the linear solver also uses the preconditioner to execute the preconditioner processing method to optimize the original sparse matrix of the large-scale sparse linear equation system.
  • FIG3 is merely a schematic diagram of a system architecture provided in the present application, and the positional relationship between the devices, components, modules, etc. shown in FIG3 does not constitute any limitation.
  • the preconditioner in the preconditioner processing device 302 in Figure 4 is an ILU type algorithm or a SOR type algorithm as an example for description.
  • Step 410 The preconditioning sub-processing device 302 determines the thread dependency of at least one thread computing task.
  • the preconditioner processing device 302 analyzes the thread dependency relationship between all the thread computing tasks in at least one thread computing task according to the structural information of the original sparse matrix.
  • the structural information refers to the elements and element arrangement order contained in the original sparse matrix.
  • the original sparse matrix input in the form of an array contains the structural information of the original sparse matrix.
  • the preconditioner processing device 302 performs task allocation according to the structural information of the original sparse matrix, obtains the task allocation result, and then determines the thread dependency based on the task allocation result.
  • the task allocation result is used to indicate which element computing tasks of different computing levels each thread needs to process, and the thread dependency is used to indicate which threads have completed which thread computing tasks in order to execute the thread computing tasks.
  • the preconditioner processing device 302 divides the multiple element calculation tasks included in the original sparse matrix into multiple calculation levels according to the element dependencies in the original sparse matrix.
  • the element dependency in the original sparse matrix refers to the dependency between the calculation tasks of each element in the original sparse matrix, that is, an element calculation task in the original sparse matrix may have an associated element dependency or may not have an associated element dependency.
  • the preconditioning sub-processing device 302 combines the element computing tasks executed by the same thread in the same computing level into thread computing tasks according to the task allocation result.
  • the thread dependency between thread computing tasks is similar to the element dependency between element computing tasks, and there is not necessarily a thread dependency between every two thread computing tasks.
  • the preconditioner processing device 302 divides the original sparse matrix into layers, it evenly distributes the element calculation tasks of each calculation layer to three threads.
  • the first layer as an example, there are six element calculation tasks, which are evenly distributed to three threads, and each thread is used to execute two element calculation tasks.
  • the element calculation tasks belonging to the same thread in each layer are integrated as a thread calculation task.
  • Each thread is responsible for one thread calculation task in each layer. For example, thread 0 is responsible for executing one thread calculation task in the 0th layer, the 1st layer, the 2nd layer, and the 3rd layer respectively.
  • the preconditioning sub-processing device 302 determines the thread dependency relationship between at least one thread computing task according to the element dependency relationship.
  • the method of analyzing thread dependencies is explained with the task allocation result shown in FIG5.
  • the preconditioning sub-processing device 302 determines that the thread computing task of thread A at the i-th computing level depends on the thread computing task of thread B at the j-th computing level if and only if any element computing task that thread A needs to execute at the i-th computing level depends on any element computing task that thread B needs to execute at the j-th computing level, where i and j are integers.
  • the preconditioning sub-processing device 302 visualizes the thread dependency as thread A needs to rely on thread B to complete the execution of computing level j when executing computing level i.
  • the preconditioner processing device 302 after the preconditioner processing device 302 hierarchically divides the original sparse matrix, before executing step 420, it can also perform matrix rearrangement and redundant dependency pruning on the matrix after the hierarchical division to further improve the overall computational efficiency of the preconditioner.
  • matrix rearrangement please refer to FIG6 and related steps
  • redundant dependency pruning please refer to FIG7 and related steps, which will not be repeated here.
  • Step 420 Each of the multiple threads of the preconditioning sub-processing device 302 executes the to-be-executed thread computing task according to the thread dependency relationship.
  • Each thread of the preconditioning sub-processing device 302 sequentially executes the to-be-executed thread computing task whose associated at least one dependent thread computing task is in the executed state according to the thread dependency relationship.
  • each thread of the preconditioning sub-processing device 302 sequentially calculates the thread computing tasks for which it is responsible according to the computing level, it is also necessary to determine whether the thread computing task is a to-be-executed thread computing task for which at least one associated dependent thread computing task is in an executed state. For example, in FIG5 , the thread computing task of thread 1 at the first computing level depends on the thread computing task of thread 0 at the 0th computing level, and the thread computing task of thread 2 at the 0th computing level.
  • each thread of the preconditioning sub-processing device 302 uses a flag bit to indicate whether the thread computing task is in an executed state, and each thread determines whether the thread computing task is a to-be-executed thread computing task whose dependent thread computing tasks are all in an executed state based on the flag bit of the dependent thread computing task associated with the thread computing task.
  • the flag bit is a mask flag bit
  • the flag bit corresponding to the thread computing task in the mask is 1, indicating that the thread computing task is in an executed state
  • the flag bit corresponding to the thread computing task in the mask is 0, indicating that the thread computing task is in an unexecuted state
  • the flag bit corresponding to the thread computing task in the mask is 0, indicating that the thread computing task is in an executed state
  • the flag bit corresponding to the thread computing task in the mask is 1, indicating that the thread computing task is in an unexecuted state.
  • Step 430 The preconditioner processing device 302 outputs a preconditioner calculation result according to the execution result of at least one thread computing task.
  • each thread of the preconditioner processing device 302 After each thread of the preconditioner processing device 302 has completed the thread computing tasks of all computing levels, it outputs the preconditioner computing results according to the execution results of all the thread computing tasks.
  • the thread computing tasks included in the preconditioner include SOR algorithm rewinding, ILU algorithm rewinding, ILU algorithm decomposition, etc. If the thread computing task is SOR algorithm rewinding, the input data obtained by the thread is the right-hand side vector B and the initial value vector X of the sparse linear equations, and the output preconditioner calculation result is the solution vector X. If the thread computing task is ILU algorithm rewinding, the input data obtained by the thread is the right-hand side vector B of the sparse linear equations, and the output preconditioner calculation result is the solution vector X. If the thread computing task is ILU algorithm decomposition, the input data obtained by the thread is the input matrix itself, and the output preconditioner calculation result is the decomposed matrix.
  • the preconditioner processing method can form a third-party mathematical library, which can be used by linear solver frameworks (such as Portable, Extensible Toolkit for Scientific Computation (PETSc), High Performance Preconditioners (Hypre), etc.) in the form of a mathematical library, and thus indirectly used by upper-level applications such as software in the field of high-performance computing.
  • the preconditioner processing method can also be used directly by the solver framework or upper-level applications in the form of code.
  • the above-mentioned thread computing tasks such as SOR algorithm rewinding, ILU algorithm rewinding, and ILU algorithm decomposition will be executed in parallel on different threads.
  • the preconditioner processing device converts the dependency between elements in the calculation task of the preconditioner into the dependency between threads, and the preconditioner processing device performs multi-threaded parallel processing on the thread calculation task based on the thread dependency, and each thread in the lower layer executes the thread calculation task of the lower layer after the upper layer thread calculation task on which it depends is in the executed state, thereby converting the full thread synchronization between layers into point-to-point synchronization between threads.
  • the preconditioning sub-processing device 302 arranges the element calculation tasks that each thread is responsible for together, that is, for each thread, arranges the thread calculation tasks that the thread is responsible for at each calculation level (the thread calculation tasks at each calculation level include one or more element calculation tasks, which are shown as element calculation tasks in FIG6 ) together, so that each thread can continuously access the memory when performing its own calculation, thereby giving full play to the memory bandwidth.
  • the rearrangement in this embodiment is only for optimizing the continuity of memory access, and will not change any dependency relationship and dependency order, so it will not affect the solution result.
  • the preconditioning sub-processing device 302 deletes a redundant dependency in the thread dependency between at least one thread computing task, where the redundant dependency means that the dependency between the thread computing tasks can be indirectly represented by a group of thread dependencies other than the thread computing tasks.
  • a group of thread dependencies refers to two or more thread dependencies.
  • the computation of thread 0 at the second level depends on the computation of thread 1 at the 0th computation level and the computation of thread 2 at the 1st computation level.
  • the computation of thread 2 at the 1st computation level depends on the computation of thread 1 at the 0th computation level.
  • the dependency of thread 0 at the 2nd computation level on thread 1 at the 0th computation level is redundant. It is indirectly guaranteed by thread 2's dependency on thread 1. Therefore, when similar dependencies are issued, the redundant parts will be removed, thereby further reducing the overhead of inter-layer synchronization and improving the overall computational efficiency of the preconditioner.
  • each thread of the preconditioning sub-processing device 302 sequentially executing thread computing tasks of each computing level in conjunction with FIG. 8 .
  • the process includes steps 810 to 850 .
  • Step 810 before executing the thread computing task of the i-th computing level, thread 1 determines whether there is an undetected flag bit, if so, executes step 820 , if not, executes step 830 .
  • Thread 1 detects the flag bits of the dependent thread tasks associated with the thread computing task at the i-th computing level to determine whether there are undetected flag bits in the flag bits of the associated dependent thread tasks.
  • Step 820 thread 1 obtains the undetected flag bit and determines whether the undetected flag bit is a value indicating that the thread computing task has been executed. If so, execute step 810. If not, determine again whether the undetected flag bit is a value indicating that the thread computing task has been executed.
  • Thread 1 uses an atomic operation to fetch the undetected flag bit. If the undetected flag bit is true (for example, the value of the flag bit is 1), step 810 is executed. If the undetected flag bit is not true (for example, the value of the flag bit is 0), it is determined again whether the undetected flag bit is true.
  • thread 1 periodically determines whether the undetected flag is true, that is, thread 1 determines whether the undetected flag is true twice with a preset time interval between each two times. If the undetected flag is not true, thread 1 continuously detects whether the undetected flag is true.
  • Step 830 Thread 1 executes the thread computing task of the i-th computing level.
  • Thread 1 executes the thread computing task of the i-th computing level according to the input data, and outputs the computing result of the thread computing task of the i-th computing level.
  • the thread computing task at the i-th computing level is the sor loop computing task
  • the input data received by thread 1 is the right-hand vector b and the initial value vector x
  • the calculation result is the solution vector x.
  • the thread computing task at the i-th computing level is the ilu back-carrying computing task
  • the input data received by thread 1 is the right-hand vector b
  • the calculation result is the solution vector x.
  • the thread computing task at the i-th computing level is the ilu decomposition computing task
  • the input data received by thread 1 is the rearranged matrix of the original sparse matrix
  • the computing result is the decomposed matrix
  • Step 840 Thread 1 modifies the flag bit of the thread computing task at the i-th computing level.
  • thread 1 modifies the flag bit of the thread computing task at the i-th computing level to a value indicating that the thread computing task has been executed, such as true.
  • thread 1 may also notify threads that have thread dependencies with the thread computing task at the i-th computing level.
  • Thread 1 executes step 810 to step 850 in a loop until thread 1 completes all the thread computing tasks of all computing levels for which it is responsible.
  • This embodiment illustrates the process of a thread in the preconditioner processing device 302 processing the thread computing task of the preconditioner through the above steps 810 to 850.
  • the multiple threads included in the preconditioner processing device 302 also process the thread computing tasks for which they are responsible in parallel. After all the threads of the preconditioner processing device 302 have completed the thread computing tasks for which they are responsible, the optimization processing of the preconditioner on the original sparse matrix is completed, and the preconditioner calculation result is output.
  • FIG9 is a schematic diagram of a possible preconditioning subprocessing device provided in this embodiment.
  • the preconditioning subprocessing device can be used to implement the functions of the execution device in the above method embodiment, and thus can also achieve the beneficial effects possessed by the above method embodiment.
  • the preconditioning subprocessing device can be the application server 301 and the preconditioning subprocessing device 302 as shown in FIG3 , or can be a module (such as a chip) applied to the server.
  • the preconditioner processing device 900 includes an analysis module 910 and an execution module 920.
  • the preconditioner processing device 900 is used to implement the function of the preconditioner processing device 302 in the method embodiment shown in FIG.
  • the analysis module 910 is used to determine the thread dependency of at least one thread computing task, where the thread dependency is used to indicate the thread dependency of multiple threads. Each thread in the threads can execute the thread computing task when at least one dependent thread computing task associated with the thread computing task is in the executed state. For example, the analysis module 910 is used to execute step 410 in FIG. 4 above.
  • the execution module 920 is used to instruct multiple threads to execute the pending thread computing tasks according to the thread dependency relationship, and at least one dependent thread computing task associated with the pending thread computing task is in the executed state. For example, the execution module 920 is used to execute step 420 in FIG. 4 above.
  • the execution module 920 is further configured to output a preconditioner calculation result according to the execution result of at least one thread calculation task.
  • the execution module 920 is configured to execute step 430 in FIG. 4 .
  • the analysis module 910 is specifically used to: convert the element dependency between at least one element calculation task contained in the original sparse matrix into a thread dependency, and the element dependency is used to indicate that executing the element calculation task requires at least one dependent element calculation task to be in an executed state.
  • the analysis module 910 is specifically used to: divide the multiple element calculation tasks contained in the original sparse matrix into multiple calculation levels according to the element dependency, and the element calculation tasks of the same calculation level can be executed simultaneously; merge the element calculation tasks executed by the same thread in the same calculation level into thread calculation tasks according to the task allocation result, and the task allocation result is used to indicate which thread among the multiple threads executes the element calculation task; determine the thread dependency between at least one thread calculation task according to the element dependency.
  • the analysis module 910 is further used to: delete redundant dependencies in the thread dependencies between at least one thread computing task, where the dependencies between the thread computing tasks represented by the redundant dependencies can be indirectly represented by a set of thread dependencies outside themselves.
  • the preconditioner processing device 900 also includes a preparation module, which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, where the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
  • a preparation module which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, where the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
  • the execution module 920 is specifically used to instruct each thread among the multiple threads to sequentially execute the to-be-executed thread computing tasks according to the reordered matrix and the memory access order.
  • the execution module 920 is also used to: determine, based on a flag bit of at least one thread computing task, a thread computing task in which at least one dependent thread computing task associated with it is in an executed state as a thread computing task to be executed, and the flag bit is used to indicate whether the thread computing task is in an executed state.
  • the preconditioner processing device 900 of the embodiment of the present application can be implemented by a GPU, an NPU, an ASIC, or a programmable logic device (PLD), and the PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • CPLD complex programmable logical device
  • FPGA field-programmable gate array
  • GAL generic array logic
  • the preconditioner processing device 900 and its various modules can also be software modules.
  • the preconditioning sub-processing device 900 of the embodiment of the present application may correspond to executing the method described in the embodiment of the present application, and the above and other operations and/or functions of each unit in the preconditioning sub-processing device 900 are respectively for implementing the corresponding processes of each method in Figure 4, which will not be repeated here for the sake of brevity.
  • the present application embodiment also provides a computing device, please refer to Figure 10,
  • Figure 10 is a schematic diagram of the structure of a computing device provided by the present application embodiment.
  • the computing device 1000 includes a memory 1001, a processor 1002, a communication interface 1003 and a bus 1004. Among them, the memory 1001, the processor 1002, and the communication interface 1003 are connected to each other through the bus 1004.
  • the computing device 1000 can be the application server 301 or the preconditioning sub-processing device 302 in Figure 3.
  • the memory 1001 may be a read-only memory, a static storage device, a dynamic storage device or a random access memory.
  • the memory 1001 may store computer instructions. When the computer instructions stored in the memory 1001 are executed by the processor 1002, the processor 1002 and the communication interface 1003 are used to execute the steps in the preconditioner processing method of the software system. For example, the processor 1002 is used to execute steps 410 to 430 in the preconditioner processing method shown in FIG. 4 above, as well as the functions of each module of the preconditioner processing device 900.
  • the memory may also store a data set, for example: a portion of the storage resources in the memory 1001 is divided into an area for storing a program for implementing the function of the preconditioner of the embodiment of the present application.
  • the processor 1002 may be a general-purpose CPU, an application specific integrated circuit (ASIC), a GPU, or any combination thereof.
  • the processor 1002 may include one or more chips.
  • the processor 1002 may include an AI accelerator, such as an NPU.
  • the communication interface 1003 uses a transceiver module such as, but not limited to, a transceiver to implement communication between the computing device 1000 and other devices or communication networks. For example, data such as the original sparse matrix can be obtained through the communication interface 1003 .
  • the bus 1004 may include a path for transmitting information between various components of the computing device 1000 (eg, the memory 1001 , the processor 1002 , and the communication interface 1003 ).
  • the computing device 1000 may be a computer (eg, a server) in a cloud data center, or a computer in an edge data center, or a terminal.
  • a computer eg, a server
  • a computer in an edge data center, or a terminal.
  • An application server 301 and/or a pre-conditioning sub-processing device 302 may be deployed on each computing device 1000 .
  • the method steps in this embodiment can be implemented by hardware or by a processor executing software instructions.
  • the software instructions can be composed of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, mobile hard disks, CD-ROMs, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor so that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be a component of the processor.
  • the processor and the storage medium can be located in an ASIC.
  • the ASIC can be located in a terminal device.
  • the processor and the storage medium can also exist as discrete components in a network device or a terminal device.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device or other programmable device.
  • the computer program or instruction may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer program or instruction may be transmitted from one website site, computer, server or data center to another website site, computer, server or data center by wired or wireless means.
  • the computer-readable storage medium may be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media.
  • the available medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a tape; it may also be an optical medium, for example, a digital video disc (DVD); it may also be a semiconductor medium, for example, a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Operations Research (AREA)
  • Computing Systems (AREA)
  • Multi Processors (AREA)
  • Complex Calculations (AREA)

Abstract

本申请提供一种预条件子处理方法、装置、设备及系统,涉及计算机领域,应用于偏微分方程求解的计算设备。该方法包括:预条件子处理设备首先确定预条件子要执行的至少一个线程计算任务之间的线程依赖关系,然后预条件子处理设备的多个线程中每个线程根据访存顺序依次执行依赖线程计算任务均处于已执行状态的待执行线程计算任务,从而实现预条件子的多线程并行。由此,确定预条件子的计算任务中线程与线程之间的依赖关系,从而将层与层之间的全线程同步转换为线程与线程之间点到点的同步,降低了层间同步的耗时,提高了预条件子的并行执行效率。

Description

预条件子处理方法、装置、设备及系统
本申请要求于2022年10月18日提交国家知识产权局、申请号为202211275678.1、申请名称为“预条件子处理方法、装置、设备及系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机领域,尤其涉及一种预条件子处理方法、装置、设备及系统。
背景技术
计算机的高性能运算中偏微分方程的求解问题,可以转换为求解大规模稀疏线性方程组问题。在对大规模稀疏线性方程组的迭代求解中,迭代矩阵的收敛性和收敛速度的改善不仅取决于迭代方法和迭代矩阵中参数的选取,而且和方程组自身的某些变化密切相关,特别是预条件子(preconditioner)的引入,大大加快了迭代的收敛性和收敛速度。
通过预条件子对大规模稀疏线性方程组进行预处理,能够提高大规模稀疏线性方程组的收敛速度。但由于预条件子进行预处理使用数据存在相关性,即部分预条件子的在后的计算任务往往需要依赖在先计算任务的结果,此部分预条件子的并行处理需要浪费大量的等待在先计算任务的时间开销,导致预条件子的并行执行效率较低。
发明内容
本申请提供一种预条件子处理方法、装置、设备及系统,能够解决预条件子的并行执行效率较低的问题。
第一方面,提供一种预条件子处理方法。该预条件子处理方法由预条件子处理设备执行,所述预条件子处理设备包括多个线程,每个线程用于在进行稀疏矩阵线性方程组的求解过程中执行预条件子包含的至少一个线程计算任务。该方法包括:预条件子处理设备在利用预条件子对稀疏线性方程组进行多线程并行处理时,预条件子处理设备首先确定预条件子要执行的至少一个线程计算任务之间的线程依赖关系,然后预条件子处理设备的多个线程中每个线程依次执行依赖线程计算任务均处于已执行状态的待执行线程计算任务,再根据至少一个线程计算任务的执行结果输出预条件子计算结果。其中,线程依赖关系用于指示至少一个线程计算任务之间的依赖关系,即针对每个线程计算任务,该线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态时能够被执行。
基于上述预条件子处理方法,相对于传统的预条件子基于分层后元素与元素之间的依赖关系多线程并行执行计算任务,层与层之间所有线程需要同步以保证正确性,即在上层的所有元素计算任务处于已执行状态后才能执行下一层的元素计算任务,在并行量较大时同步需要占用较大的开销,本申请提供的方案中预条件子处理设备确定预条件子的计算任务中线程与线程之间的依赖关系,预条件子处理设备基于线程依赖关系对线程计算任务进行多线程并行,从而将层与层之间的全线程同步转换为线程与线程之间点到点的同步,即下层的线程计算任务在所依赖的上层线程计算任务处于已执行状态后能够被执行,从而降低了层间同步的耗时,提高了预条件子的并行执行效率。
作为一种可能的实现方式,预条件子处理设备基于稀疏矩阵线性方程组原稀疏矩阵,将原稀疏矩阵包含的至少一个元素计算任务之间存在的元素依赖关系转换为线程依赖关系,即每个线程在执行每个层次的计算任务之前,需要依赖哪些线程已经执行完哪些层次的计算任务。其中,元素依赖关系用于指示执行元素计算任务需要至少一个依赖元素计算任务处于已执行状态。
可选地,预条件子处理设备根据元素依赖关系将原稀疏矩阵包含的多个元素计算任务划分为多个计算层级,然后根据任务分配结果将所述同一计算层级中由同一线程执行的元素计算任务合并为线程计算任务,再根据元素依赖关系确定至少一个线程计算任务之间的线程依赖关系。其中,同一计算层级的元素计算任务能够同时被执行,任务分配结果用于指示元素计算任务由多个线程中的哪一个线程执行。由此,预条件子处理设备将每一计算层级内属于同一个线程的元素计 算任务整合为一个线程计算任务,每个线程在每个计算层级只负责一个线程计算任务,针对一个线程在下一层的线程计算任务仅依赖上一层的一个线程计算任务,从而使线程依赖关系能够被具象化表示。
作为一种可能的实现方式,预条件子处理设备在确定线程依赖关系之后,还删除至少一个线程计算任务之间的线程依赖关系中的冗余依赖关系,冗余依赖关系表示的线程计算任务之间的依赖关系能够被自身之外的一组线程依赖关系间接表示。由此,减少了线程依赖关系的数量,避免线程在执行线程计算任务前耗费计算资源和时间对冗余依赖关系进行确认,进一步降低了层间同步开销,提高了并行执行效率。
作为一种可能的实现方式,预条件子处理设备在对原稀疏矩阵进行重排后再多线程并行执行至少一个线程计算任务。
可选地,预条件子处理设备根据多个计算层级和访存顺序对原稀疏矩阵进行矩阵重排,得到重排矩阵,多个线程中每个线程在多线程并行执行线程计算任务时,根据重排矩阵和访存顺序,依次执行待执行线程计算任务。其中,访存顺序为多个线程对至少一个线程计算任务的访存顺序,重排矩阵中每个线程执行的至少一个线程计算任务排列在一起。
由此,每个线程在做各自的计算时,可以对内存做连续访存,有利于充分发挥内存带宽,且重排矩阵是为了优化访存连续性,不会改变任何的依赖关系和依赖顺序,所以不会对预条件子的求解结果产生影响。
作为一种可能的实现方式,预条件子处理设备中各个线程基于标志位确定线程依赖关系是否已被满足。例如,预条件子处理设备根据至少一个线程计算任务的标志位确定线程计算任务关联的至少一个依赖线程计算任务是否处于已执行状态,若是,将该线程计算任务作为待执行线程计算任务。从而预条件子处理设备中各个线程利用标志位来快速、准确地确定线程计算任务的依赖线程计算任务是否已执行,提高了预条件子多线程并行的整体效率。
第二方面,提供一种预条件子处理装置,该装置执行多个线程,每个线程用于执行预条件子包含的至少一个线程计算任务,所述装置包括用于执行第一方面或第一方面任一种可能实现方式中的预条件子处理方法的各个模块。
作为一种可能的实现方式,预条件子处理装置包括分析模块和执行模块。
分析模块,用于确定至少一个线程计算任务的线程依赖关系,线程依赖关系用于指示多个线程中每个线程当线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态能够执行线程计算任务。
执行模块,用于指示多个线程根据线程依赖关系执行待执行线程计算任务,待执行线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态。
执行模块,还用于根据至少一个线程计算任务的执行结果输出预条件子计算结果。
可选地,分析模块具体用于:将原稀疏矩阵包含的至少一个元素计算任务之间存在的元素依赖关系转换为线程依赖关系,元素依赖关系用于指示执行元素计算任务需要至少一个依赖元素计算任务处于已执行状态。
可选地,分析模块具体用于:根据元素依赖关系将原稀疏矩阵包含的多个元素计算任务划分为多个计算层级,同一计算层级的元素计算任务能够同时被执行;根据任务分配结果将同一计算层级中由同一线程执行的元素计算任务合并为线程计算任务,任务分配结果用于指示元素计算任务由多个线程中的哪一个线程执行;根据元素依赖关系确定至少一个线程计算任务之间的线程依赖关系。
可选地,分析模块还用于:删除至少一个线程计算任务之间的线程依赖关系中的冗余依赖关系,冗余依赖关系表示的线程计算任务之间的依赖关系能够被自身之外的一组线程依赖关系间接表示。
可选地,预条件子处理装置还包括准备模块,准备模块用于:根据多个计算层级和访存顺序对原稀疏矩阵进行矩阵重排,得到重排矩阵,访存顺序为多个线程对至少一个线程计算任务的访存顺序,重排矩阵将每个线程执行的至少一个线程计算任务排列在一起。
可选地,执行模块具体用于:指示多个线程中每个线程根据重排矩阵和访存顺序,依次执行 待执行线程计算任务。
可选地,执行模块还用于:根据至少一个线程计算任务的标志位,确定自身关联的至少一个依赖线程计算任务均处于已执行状态的线程计算任务为待执行线程计算任务,标志位用于指示线程计算任务是否处于已执行状态。
需要说明的是,第二方面所述的预条件子处理装置可以是终端设备或网络设备,也可以是可设置于终端设备或网络设备中的芯片(系统)或其他部件或组件,还可以是包含终端设备或网络设备的装置,本申请对此不做限定。
第三方面,提供了一种预条件子处理设备,包括存储器和处理器,所述存储器用于存储一组计算机指令,当所述处理器执行所述一组计算机指令时,用于执行第一方面中任一种可能设计中的预条件子处理方法的操作步骤。
第四方面,提供了一种计算机系统,包括应用服务器和预条件子处理设备,所述应用服务器用于在计算稀疏线程方程组时向预条件子处理设备发送稀疏线性方程组求解请求,所述预条件子处理设备用于执行第一方面中任一种可能设计中的预条件子处理方法的操作步骤。
此外,第二方面所述的预条件子处理装置的技术效果、第三方面所述的预条件子处理设备的技术效果以及第四方面所述的计算机系统的技术效果,可以参考第一方面所述的预条件子处理方法的技术效果,此处不再赘述。
第五方面,提供一种计算机可读存储介质,包括:计算机软件指令;当计算机软件指令在数据处理系统中运行时,使得计算设备执行如第一方面中任意一种可能的实现方式中所述方法的操作步骤。
第六方面,提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算设备执行如第一方面中任意一种可能的实现方式中所述方法的操作步骤。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
图1为本申请提供的一种着色法的示意图;
图2为本申请提供的一种分层法的示意图;
图3为本申请提供的一种计算机系统的结构示意图;
图4为本申请提供的一种预条件子处理方法的流程示意图;
图5为本申请提供的一种任务分配结果的示意图;
图6为本申请提供的一种矩阵重排的原理示意图;
图7为本申请提供的一种冗余依赖关系剪除的原理示意图;
图8为本申请提供的一种线程执行计算任务的流程示意图;
图9为本申请提供的一种预条件子处理装置的结构示意图;
图10为本申请提供的一种计算设备的结构示意图。
具体实施方式
为了便于理解,下面先对本申请实施例涉及的相关术语进行介绍。
(1)稀疏线性方程组
高性能计算的最重要任务之一是解决领域关键问题,如半导体、流体力学、工程力学、电磁学、生物、材料、电路仿真等。每个关键领域问题的解决方式不尽相同,但是通常包含一个相似的关键模块——求解偏微分方程,如泊松方程、热传导方程、波动方程、纳维-斯托克斯方程等。
计算机在求解偏微分方程之前,需要对方程做离散化。依照时间维度离散方式的不同,大致可以分为显式方法和隐式方法。显式方法和隐式方法各有优缺点,均在工业界被广泛使用。其中,隐式离散方法会将求解偏微分方程问题转换成求解大规模稀疏线性方程组问题。
通常线性方程组可写成Ax=b的形式,其中A是系数矩阵,b是右端向量,x是解向量。求解线性方程组的过程就是通过A和b得到解向量x的过程。一般对于可求解的问题,x和b的维度是n,A的维度是n*n,即有n^2个元素。对于稀疏线性方程组,A的n^2个元素中有绝大多 数都是0,0的占比依不同领域不同问题而异,一般会低于0.1%甚至更低。
(2)预条件子
求解大规模稀疏线性方程组主要有两种方法:直接法(direct method)和迭代法(iterative method)。直接法是指在不考虑计算舍入误差的情况下,通过包括矩阵分解和三角方程组求解等有限步的操作求得方程组的精确解,又称精确法。迭代法是指给定一个初始解向量,通过一定的计算构造一个向量列(通过逐次迭代得到一系列逼近精确值的近似解),将向量列的极限作为稀疏线性方程组理论上的精确解。
迭代法对存储空间的需求较低,在求解高阶非病态稀疏线性方程组方面具有优势,但迭代法在求解病态问题时存在收敛速度慢或不收敛,以及精度控制难度较大的问题。求解方程组时如果对数据进行较小的扰动,得出的结果具有很大的波动,这样的矩阵称为病态矩阵。
由此,迭代法中引入预条件子来解决迭代法存在的上述问题。预条件子用于改善线性方程组系数矩阵的性质,将给定线性方程组转换为更易求解的方程组的变换(矩阵),主要用于迭代方法的加速收敛。例如,在基于Krylov子空间的迭代法,如共轭梯度法、广义最小残差法等,预条件子是最重要的模块之一。预条件子的存在,一方面可以帮助提高迭代法的收敛速度,另一方面也可以帮助求解病态矩阵。预条件子是支持嵌套的,比如说方法A和方法B依次执行来共同完成预条件子的功能,而最底层的预条件子往往会选择sor或者ilu,这主要得益于其相对简单的算法流程和较好的收敛性。
sor和ilu均可认为属于一类算法,由于其相似的计算流程,所以在这里被统一考虑。sor类算法包括高斯-赛德尔算法(Gauss-Seidel),对称高斯-赛德尔算法(symmetric Gauss-Seidel),对称超松弛算法(symmetric successive over-relaxation)等。ilu类算法包括ilu(0),ilu(1),ilu(k)等。与sor类算法不同,ilu类算法流程上分为2步,ilu分解过程和ilu回带过程,不过分解和回带在计算逻辑上是一致的,所以也可在这里被统一考虑。
lu算法是用直接法求解稀疏矩阵的核心方法,但是在分解过程中会有大量的填充元(fillin),导致lu算法的效率极低,不适合作为预条件子放在迭代法中。与之对应,ilu算法——不完全的lu算法,可以根据填充等级,控制非零元填充数量,从而获得更高的效率。下面会以ilu(0)为例介绍算法流程,其中0表示填充等级为0,即不会有任何填充,意味着原位置如果是非零,则会在分解过程中被不断更新,而如果原位置是0,则不会被更新而永远是0。毫无疑问,ilu的求解准确性不及lu算法,但是作为预条件子是满足准确性需求的,因为稀疏方程的最终求解正确性由迭代法保证,预条件子只是作为加速这个过程的手段。
如上所述,ilu(0)分为分解和回带2个过程,前者是对原矩阵做数值分解得到一个下三角阵L和一个上三角阵U,回带过程是求解LUx=b的过程,即求解Ly=b和Ux=y,一次下三角阵求解和一次上三角阵求解。Figure 1是ilu(0)分解过程的伪代码,值得注意的是,由于ilu(0)不会做新的非零元填充,所以分解后的下三角矩阵和上三角矩阵可以完全使用原矩阵A的空间,即做一次本地分解。
由于Ux=y正好是Ly=b的反向过程,即Ly=b是从前到后的过程,而Ux=y是从后到前的过程,逻辑上完全一致,这里对ilu(0)回带过程不再赘述。
高斯-赛德尔算法是sor算法在松弛因子为0时的特殊情况,其算法流程与ilu(0)回带中Ly=b的过程相似。
将上述三个算法流程横向对比,可以发现这几种算法的共性是前项依赖。若抽象地将第i行(ilu分解)和第i个解(ilu回带和高斯-赛德尔算法)统一认为是元素i,则元素i依赖“前项”元素j,当且仅当a_ij不为0并且j<i。sor类算法和ilu类算法均满足这一抽象,即存在前项依赖。
前项依赖使得多线程并行加速变得十分困难,因为并行的前提是有足够多的毫无关系的计算任务。基于这样的背景,现有技术主要是从如何挖掘并行计算任务上展开的,大体可以分为挖掘并行任务和创造并行任务,前者是通过分析依赖图,分析哪些计算任务可以无依赖地并行执行,后者是强行打破一些依赖关系,从而创造更多的无依赖并行计算任务。
(3)着色法
着色法是将由原稀疏矩阵A得到的依赖图中每个元素着以不同的颜色,相同颜色的元素并 行执行,不同颜色的元素依次串行执行。着色的方法包含多种,例如确定阈值范围后着色,即在阈值范围内的点着以不同的颜色。
如图1所示,图1为阈值为1时的着色样例图,0-12分别表示原稀疏矩阵A包含的不同元素,阈值仅为1意味着相邻的元素需要着以不同的颜色,不相邻的元素可以着以同样的颜色。图中0、3、4、10、11、12被着以相同的颜色,可以并行执行;1、5、7被着以相同的颜色,可以并行执行;2、6、8、9被着以相同的颜色,可以并行执行。其中,元素0、3、4、10、11、12,元素1、5、7,以及元素2、6、8、9分别着三种不同的颜色,图1中以不同灰度来表示不同颜色。不同颜色之前需要串行执行。
(4)嵌套多层次主对角块方法
嵌套多层次主对角块方法是通过对原矩阵A做嵌套的对称变化,以创造出可以并行的任务。任何稀疏矩阵均可以通过对称变换转成A11和A22彼此无关,可以被2个线程并行执行,剩余部分串行执行。若嵌套第调用上述对称变换,就可以支持更多数量的线程并行执行,如下所示:
(5)分层法
分层法是依据元素之间的前后依赖关系,将对于不同元素的计算分层,同一层次的元素可以被同时计算,不同层次的元素之间依次执行。
如图2所示,16个元素被分层4层,交由3个线程并行执行,不同灰度表示不同颜色,不同颜色表示不同线程需要负责的计算任务。
现有技术采用着色法或嵌套多层次主对角块方法对预条件子进行多线程并行,存在降低算法收敛性的问题,例如Krylov子空间迭代法配合ilu(0)做预条件子,当使用串行ilu(0)时迭代50步,换成上述两种并行方法后,迭代次数升至70步。这主要是因为着色法或嵌套多层次主对角块方法都打破了原有的依赖关系,前者是通过图的方式,后者是通过数学变换的方式。通过图的方式,忽略了原本应该考虑的依赖关系而强行并行,因此造成收敛性下降;通过数学变换的方式,改变了分解的顺序,而与lu不同,ilu的结果与顺序强相关,所以数学变换依旧会改变(大概率是降低)收敛性。此外,嵌套多层次主对角块方法在除去无关联的对角块部分,其余部分通常难以并行,因此在算法后期并行效率低。若采用分层法,虽然不同层内的元素计算任务可以多线程并行,但是层与层之间需要同步以保证正确性,而当核数增多时,同步的开销会明显提升,以导致算法整体速度的下降。进一步,线程的跳跃式访存也会导致访存效率的下降,无法充分利用内存带宽资源。
本申请提供了一种预条件子处理方法,尤其是一种基于线程依赖关系对预条件子进行多线程并行的预条件子处理方法,即预条件子处理设备确定预条件子包含的至少一个线程计算任务之间的线程依赖关系,线程依赖关系用于指示执行线程计算任务需要至少一个依赖线程计算任务处于已执行状态,预条件子处理设备的各个线程根据访存顺序依次执行关联的至少一个依赖线程计算任务均处于已执行状态的待执行线程计算任务,并根据至少一个线程计算任务的执行结果输出预条件子计算结果。由此,预条件子处理设备将预条件子的计算任务中元素与元素之间的依赖关系转换为线程与线程之间的依赖关系,预条件子处理设备基于线程依赖关系对线程计算任务进行多线程并行处理,各个线程在下层的线程计算任务在所依赖的上层线程计算任务处于已执行状态后执行该下层的线程计算任务,从而将层与层之间的全线程同步转换为线程与线程之间点到点的同步。相对于传统的预条件子基于分层后元素与元素之间的依赖关系多线程并行执行计算任务,层与层之间所有线程需要同步以保证正确性,即在上层的所有元素计算任务处于已执行状态后才能 执行下一层的元素计算任务导致并行量较大且同步需要占用较大的开销,本申请提供的方法仅需基于线程依赖关系进行线程与线程之间点到点的同步,降低了层间同步的耗时,提高了预条件子的并行执行效率。
下面将结合附图对本申请实施例的实施方式进行详细描述。
图3为本申请提供的一种计算机系统的结构示意图,如图3所示,计算机系统300包括应用服务器301和预条件子处理设备302。
应用服务器301可以是终端,如手机终端、平板电脑、笔记本电脑、虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、混合现实(Mixed Reality,MR)设备、扩展现实(Extended Reality,ER)设备、摄像头或车载终端等,还可以是边缘设备(例如,携带具有处理能力芯片的盒子)等。
预条件子处理设备302可以是终端,还可以是其他支持稀疏线性方程组计算的计算设备,如服务器或者云端设备等。
作为一种可能的实施例,应用服务器301和预条件子处理设备302是部署在不同物理设备(如:服务器或集群中的服务器)上的不同处理器。例如,执行设备410可以是图形处理单元(graphic processing unit,GPU)、中央处理器(central processing unit,CPU)、其他通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。预条件子处理设备302可以是图形处理器(graphics processing unit,GPU)、神经网络处理器(neural network processing unit,NPU)、微处理器、特定应用集成电路(application-specific integrated circuit,ASIC)、或一个或多个用于控制本申请方案程序执行的集成电路。
在另一可能的实施例中,应用服务器301和预条件子处理设备302部署在同一物理设备,或应用服务器301和预条件子处理设备302为同一物理设备。
应用服务器301用于响应用户操作确定需要求解的稀疏线性方程组,基于稀疏线性方程组向预条件子处理设备302发送稀疏线性方程组求解请求。
预条件子处理设备302用于接收应用服务器301发送的稀疏线性方程组求解请求,计算稀疏线性方程组,并向应用服务器301返回稀疏线性方程组的求解结果。
在软件架构层面上,预条件子处理设备302中设置有包含偏微分方程求解器等核心计算功能的高性能计算领域的软件。微分方程求解器的核心模块是线性解法器,该线性解法器会向上支撑偏微分方程求解器,并间接支撑高性能计算领域的软件。线性解法器可以包含不同的算法,例如共轭梯度法、广义最小残差法、广义共轭残差法、切比雪夫法等。
线性解法器包含的关键模块是预条件子,预条件子可以包含不同算法,例如加性施瓦兹算法、块雅克比算法、雅克比算法、ilu类算法、sor类算法、多重网格算法等。这些预条件子是个嵌套的过程,不同的算法可以嵌套在一起。如图3中箭头所示,多种预条件子往往最终嵌套ilu类算法或者sor类算法。
以多线程并行的ilu类算法和sor类算法为例,用户通过预条件子处理设备302的操作界面将要解决的问题输入高性能计算领域的软件,软件调用偏微分方程求解器对要解决的问题包含的偏微分方程进行求解,偏微分方程求解器在求解偏微分方程之前,对方程进行离散化,将偏微分方程求解问题转换为大规模稀疏线性方程组求解问题。偏微分方程求解器调用线性解法器获得大规模稀疏线性方程组的求解结果,在此过程中,线性解法器还使用预条件子执行预条件子处理方法对大规模稀疏线性方程组的原稀疏矩阵进行优化。
图3仅是本申请提供的一种系统架构的示意图,图3中所示设备、器件、模块等之间的位置关系不构成任何限制。
接下来请参考图4,对预条件子处理方法进行详细阐述。在这里以图4中的预条件子处理设备302中预条件子为ilu类算法或sor类算法为例进行说明。
步骤410、预条件子处理设备302确定至少一个线程计算任务的线程依赖关系。
预条件子处理设备302根据原稀疏矩阵的结构信息,分析出至少一个线程计算任务中所有线程计算任务之间的线程依赖关系。
上述原稀疏矩阵是指稀疏线性方程组的系数矩阵,例如稀疏线性方程组Ax=b中的A是系数矩阵。结构信息是指原稀疏矩阵包含的元素及元素排列顺序,以数组形式输入的原稀疏矩阵包含该原稀疏矩阵的结构信息。
作为一种可能的实现方式,预条件子处理设备302根据原稀疏矩阵的结构信息进行任务分配,得到任务分配结果,然后基于任务分配结果确定线程依赖关系。其中,任务分配结果用于指示每个线程需要处理不同计算层级的哪些元素计算任务,线程依赖关系用于指示执行线程计算任务需要哪些线程已经执行完哪些线程计算任务。
首先,预条件子处理设备302根据原稀疏矩阵中的元素依赖关系将原稀疏矩阵包含的多个元素计算任务划分为多个计算层级。
其中,原稀疏矩阵中的元素依赖关系是指原稀疏矩阵中各元素计算任务之间存在的依赖关系,即针对原稀疏矩阵中的一个元素计算任务可以存在关联的元素依赖关系,也可以不存在关联的元素依赖关系。
然后,预条件子处理设备302根据任务分配结果将同一计算层级中由同一线程执行的元素计算任务合并为线程计算任务。其中,线程计算任务之间的线程依赖关系与元素计算任务之间的元素依赖关系类似,每两个线程计算任务不是必然存在线程依赖关系。
以图5所示的任务分配结果对任务分配进行举例说明,预条件子处理设备302在对原稀疏矩阵进行层次划分后,将每个计算层级的元素计算任务平均分配给3个线程,以第1层为例,共有6个元素计算任务,平均分配给3个线程,每个线程用于执行2个元素计算任务。接下来,将每一层内属于同一个线程的元素计算任务整合起来,作为一个线程计算任务,每个线程在每一层负责一个线程计算任务,例如线程0在第0层、第1层、第2层和第3层分别负责执行一个线程计算任务。
最后,预条件子处理设备302根据元素依赖关系确定至少一个线程计算任务之间的线程依赖关系。
以图5所示的任务分配结果对线程依赖关系的分析方式进行说明。预条件子处理设备302当且仅当线程A在第i计算层级需要执行的任一元素计算任务依赖线程B在第j计算层级需要执行的任一元素计算任务时,确定线程A在第i计算层级的线程计算任务依赖线程B在第j计算层级的线程计算任务,其中,i和j为整数。例如,线程1在第1计算层级需要执行的元素计算任务6依赖线程0在第0计算层级需要执行的元素计算任务0时,确定线程1在第1计算层级的线程计算任务依赖线程0在第0计算层级的线程计算任务。由此,预条件子处理设备302将线程依赖关系具象化表示为线程A在执行计算层级i时,需要依赖线程B执行完计算层级j。
在另外的实现方式中,预条件子处理设备302对原稀疏矩阵进行层次划分后,在执行步骤420之前,还可以对层次换分后的矩阵进行矩阵重排和冗余依赖关系剪除,来进一步提高预条件子的整体计算效率。矩阵重排的具体步骤请参考图6及相关步骤,冗余依赖关系剪除的具体步骤请参考图7及相关步骤,在此不再赘述。
步骤420、预条件子处理设备302的多个线程中每个线程根据线程依赖关系执行待执行线程计算任务。
预条件子处理设备302的各个线程根据线程依赖关系依次执行关联的至少一个依赖线程计算任务均处于已执行状态的待执行线程计算任务。
作为一种可能的实现方式,预条件子处理设备302的各个线程在根据计算层级依次计算各自负责的线程计算任务之前,还需要确定该线程计算任务是否为关联的至少一个依赖线程计算任务均处于已执行状态的待执行线程计算任务。例如图5中线程1在第1计算层级的线程计算任务依赖线程0在第0计算层级的线程计算任务,以及线程2在第0计算层级的线程计算任务。
可选地,预条件子处理设备302的各个线程采用标志位表示线程计算任务是否处于已执行状态,各个线程根据线程计算任务所关联的依赖线程计算任务的标志位确定该线程计算任务是否为依赖线程计算任务均处于已执行状态的待执行线程计算任务。
在本申请实施例中,标志位是掩码(mask)标志位,mask中与线程计算任务对应的标志位为1表示该线程计算任务处于已执行状态,mask中与线程计算任务对应的标志位为0表示该线程计算任务处于未执行状态,或者,mask中与线程计算任务对应的标志位为0表示该线程计算任务处于已执行状态,mask中与线程计算任务对应的标志位为1表示该线程计算任务处于未执行状态。
步骤430、预条件子处理设备302根据至少一个线程计算任务的执行结果输出预条件子计算结果。
预条件子处理设备302的各个线程分别执行完全部计算层级的线程计算任务后,根据所有线程计算任务的执行结果输出预条件子计算结果。
可选地,预条件子包含的线程计算任务包括sor算法回带、ilu算法回带、ilu算法分解等。若线程计算任务为sor算法回带,则线程获取的输入数据为稀疏线性方程组的右端向量b、初值向量x,输出的预条件子计算结果为解向量x。若线程计算任务为ilu算法回带,则线程获取的输入数据为稀疏线性方程组的右端向量b,输出的预条件子计算结果为解向量x。若线程计算任务为ilu算法分解,则线程获取的输入数据为输入矩阵本身,输出的预条件子计算结果为分解后的矩阵。
该预条件子处理方法可以形成第三方数学库,以数学库的方式被线性求解器框架(例如科学计算可移植扩展工具包(Portable,ExtensibleToolkit for Scientific Computation,PETSc)、高性能预条件子(High Performance Preconditioners,Hypre)等)使用,从而间接被高性能计算领域的软件等上层应用使用。此外,该预条件子处理方法还可以直接以代码的形式被求解器框架或者上层应用使用。在多线程框架下,上述sor算法回带、ilu算法回带、ilu算法分解等线程计算任务会在不同的线程上并行执行。
在本实施例中,针对各个线程执行上述步骤420-步骤430的具体方式请参考图8及相关描述,在此不再赘述。
基于上述预条件子处理方法,预条件子处理设备将预条件子的计算任务中元素与元素之间的依赖关系转换为线程与线程之间的依赖关系,预条件子处理设备基于线程依赖关系对线程计算任务进行多线程并行处理,各个线程在下层的线程计算任务在所依赖的上层线程计算任务处于已执行状态后执行该下层的线程计算任务,从而将层与层之间的全线程同步转换为线程与线程之间点到点的同步。相对于传统的预条件子基于分层后元素与元素之间的依赖关系多线程并行执行计算任务,层与层之间所有线程需要同步以保证正确性,即在上层的所有元素计算任务处于已执行状态后才能执行下一层的元素计算任务导致并行量较大且同步需要占用较大的开销,本申请提供的方法仅需基于线程依赖关系进行线程与线程之间点到点的同步,降低了层间同步的耗时,提高了预条件子的并行执行效率。
下面结合图6详细说明预条件子处理设备302进行矩阵重排的原理。
如图6所示,预条件子处理设备302将每个线程负责的元素计算任务排列在一起,即针对每个线程,将该线程在各个计算层级需负责的线程计算任务(每个计算层级的线程计算任务包含一个或多个元素计算任务,在图6中以元素计算任务进行展示)排列在一起,从而使每个线程在做各自的计算时,可以对内存做连续访存,从而充分发挥内存带宽。此外,与相关工作的重排不同,本实施例中的重排仅仅是为了优化访存连续性,不会改变任何的依赖关系和依赖顺序,所以不会对求解结果产生影响。
下面结合图7详细说明预条件子处理设备302进行冗余依赖关系剪除的原理。
预条件子处理设备302在确定至少一个线程计算任务之间的线程依赖关系后,删除至少一个线程计算任务之间的线程依赖关系中的冗余依赖关系,冗余依赖关系是指线程计算任务之间的依赖关系能够被自身之外的一组线程依赖关系间接表示。可选地,一组线程依赖关系间是指两个或两个以上的线程依赖关系。
例如图7的示例中,线程0在第2层的计算依赖线程1在第0计算层级的计算,以及线程2在第1计算层级的计算,同时线程2在第1计算层级的计算又依赖线程1在第0计算层级的计算,则线程0在第2计算层级对于线程1在第0计算层级的依赖是冗余的,由于该依赖关系已经 被线程2对于线程1的依赖间接保证,因此当发下类似的依赖关系时,会移除其中冗余的部分,从而进一步降低层间同步的开销,来提高预条件子的整体计算效率。
下面结合图8详细说明预条件子处理设备302的各个线程依次执行各计算层级的线程计算任务的流程,以线程1为例,该流程包括步骤810-步骤850。
步骤810、线程1在执行第i计算层级的线程计算任务前,确定是否存在未检测的标志位,若是,执行步骤820,若否,执行步骤830。
线程1对第i计算层级的线程计算任务所关联的依赖线程任务的标志位进行检测,确定关联的依赖线程任务的标志位中是否存在未检测的标志位。
步骤820、线程1获取未检测的标志位,确定未检测的标志位是否为表示线程计算任务已执行的值,若是,执行步骤810,若否,再次确定未检测的标志位是否为表示线程计算任务已执行的值。
线程1利用原子操作获取(fetch)未检测的标志位,若未检测的标志位为true(例如标志位的值为1),执行步骤810。若未检测的标志位不为true(例如标志位的值为0),再次确定未检测的标志位是否为true。
可选地,线程1周期性地判断未检测的标志位是否为true,即线程1每两次判断未检测的标志位是否为true之间间隔预设时长,若未检测的标志位不为true则持续性地检测未检测的标志位是否为true。
步骤830、线程1执行第i计算层级的线程计算任务。
线程1根据输入数据执行第i计算层级的线程计算任务,并输出第i计算层级的线程计算任务的计算结果。
例如,第i计算层级的线程计算任务为sor回带计算任务,线程1接收的输入数据为右端向量b和初值向量x,计算结果为解向量x。
又如,第i计算层级的线程计算任务为ilu回带计算任务,线程1接收的输入数据为右端向量b,计算结果为解向量x。
又如,第i计算层级的线程计算任务为ilu分解计算任务,线程1接收的输入数据为原稀疏矩阵的重排矩阵,计算结果为分解后矩阵。
步骤840、线程1修改第i计算层级的线程计算任务的标志位。
线程1在第i计算层级的线程计算任务输出计算结果,即处于已执行状态后,将第i计算层级的线程计算任务的标志位修改为表示线程计算任务已执行的值,例如true。
可选地,线程1修改第i计算层级的线程计算任务的标志位后,还可以通知与第i计算层级的线程计算任务存在线程依赖关系的线程。
步骤850、线程1令i=i+1,执行步骤810-步骤840。
线程1循环执行步骤810-步骤850,直至线程1将自身负责的所有计算层级的线程计算任务均执行完毕。
本实施例通过上述步骤810-步骤850对预条件子处理设备302中的一个线程处理预条件子的线程计算任务的流程进行了说明,预条件子处理设备302包含的多个线程还对各自负责的线程计算任务并行处理,在预条件子处理设备302的所有线程均执行完自身负责的线程计算任务后,完成预条件子对原稀疏矩阵的优化处理,输出预条件子计算结果。
上文结合图3-图8详细描述了根据本实施例所提供的预条件子处理方法,下面将结合图9,描述本实施例所提供的预条件子处理装置。
图9为本实施例提供的可能的预条件子处理装置的示意图。预条件子处理装置可以用于实现上述方法实施例中执行设备的功能,因此也能实现上述方法实施例所具备的有益效果。在本实施例中,该预条件子处理装置可以是如图3所示的应用服务器301和预条件子处理设备302,还可以是应用于服务器的模块(如芯片)。
预条件子处理装置900包括分析模块910和执行模块920。预条件子处理装置900用于实现上述图4所示方法实施例中预条件子处理设备302的功能。
分析模块910,用于确定至少一个线程计算任务的线程依赖关系,线程依赖关系用于指示多 个线程中每个线程当线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态能够执行线程计算任务。例如,分析模块910用于执行上述图4中步骤410。
执行模块920,用于指示多个线程根据线程依赖关系执行待执行线程计算任务,待执行线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态。例如,执行模块920用于执行上述图4中步骤420。
执行模块920,还用于根据至少一个线程计算任务的执行结果输出预条件子计算结果。例如,执行模块920用于执行上述图4中步骤430。
作为一种可能的实现方式,分析模块910具体用于:将原稀疏矩阵包含的至少一个元素计算任务之间存在的元素依赖关系转换为线程依赖关系,元素依赖关系用于指示执行元素计算任务需要至少一个依赖元素计算任务处于已执行状态。
作为一种可能的实现方式,分析模块910具体用于:根据元素依赖关系将原稀疏矩阵包含的多个元素计算任务划分为多个计算层级,同一计算层级的元素计算任务能够同时被执行;根据任务分配结果将同一计算层级中由同一线程执行的元素计算任务合并为线程计算任务,任务分配结果用于指示元素计算任务由多个线程中的哪一个线程执行;根据元素依赖关系确定至少一个线程计算任务之间的线程依赖关系。
作为一种可能的实现方式,分析模块910还用于:删除至少一个线程计算任务之间的线程依赖关系中的冗余依赖关系,冗余依赖关系表示的线程计算任务之间的依赖关系能够被自身之外的一组线程依赖关系间接表示。
作为一种可能的实现方式,预条件子处理装置900还包括准备模块,准备模块用于:根据多个计算层级和访存顺序对原稀疏矩阵进行矩阵重排,得到重排矩阵,访存顺序为多个线程对至少一个线程计算任务的访存顺序,重排矩阵将每个线程执行的至少一个线程计算任务排列在一起。
作为一种可能的实现方式,执行模块920具体用于:指示多个线程中每个线程根据重排矩阵和访存顺序,依次执行待执行线程计算任务。
作为一种可能的实现方式,执行模块920还用于:根据至少一个线程计算任务的标志位,确定自身关联的至少一个依赖线程计算任务均处于已执行状态的线程计算任务为待执行线程计算任务,标志位用于指示线程计算任务是否处于已执行状态。
应理解的是,本申请实施例的预条件子处理装置900可以通过GPU、NPU、ASIC实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。此外,在通过软件实现图4所示的方法时,预条件子处理装置900及其各个模块也可以为软件模块。
本申请实施例的预条件子处理装置900可对应于执行本申请实施例中描述的方法,并且预条件子处理装置900中的各个单元的上述和其它操作和/或功能分别为了实现图4中的各个方法的相应流程,为了简洁,在此不再赘述。
本申请实施例还提供了一种计算设备,请参考图10,图10为本申请实施例提供的一种计算设备的结构示意图。计算设备1000包括存储器1001、处理器1002、通信接口1003以及总线1004。其中,存储器1001、处理器1002、通信接口1003通过总线1004实现彼此之间的通信连接。在本实施例中,计算设备1000可以是图3中的应用服务器301或预条件子处理设备302。
存储器1001可以是只读存储器,静态存储设备,动态存储设备或者随机存取存储器。存储器1001可以存储计算机指令,当存储器1001中存储的计算机指令被处理器1002执行时,处理器1002和通信接口1003用于执行软件系统的预条件子处理方法中的步骤。例如,处理器1002用于执行上述图4所示的预条件子处理方法中的步骤410-步骤430,以及上述预条件子处理装置900各模块的功能。存储器还可以存储数据集合,例如:存储器1001中的一部分存储资源被划分成一个区域,用于存储实现本申请实施例的预条件子的功能的程序。
处理器1002可以采用通用的CPU,应用专用集成电路(application specific integrated circuit,ASIC),GPU或其任意组合。处理器1002可以包括一个或多个芯片。处理器1002可以包括AI加速器,例如NPU。
通信接口1003使用例如但不限于收发器一类的收发模块,来实现计算设备1000与其他设备或通信网络之间的通信。例如,可以通过通信接口1003获取原稀疏矩阵等数据。
总线1004可包括在计算设备1000各个部件(例如,存储器1001、处理器1002、通信接口1003)之间传送信息的通路。
计算设备1000可以为云数据中心中的计算机(例如:服务器),或边缘数据中心中的计算机,或终端。
每个计算设备1000上都可以部署应用服务器301和/或预条件子处理设备302。
本实施例中的方法步骤可以通过硬件的方式来实现,也可以由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(random access memory,RAM)、闪存、只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于终端设备中。当然,处理器和存储介质也可以作为分立组件存在于网络设备或终端设备中。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机程序或指令。在计算机上加载和执行所述计算机程序或指令时,全部或部分地执行本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、网络设备、用户设备或者其它可编程装置。所述计算机程序或指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机程序或指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是集成一个或多个可用介质的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,例如,软盘、硬盘、磁带;也可以是光介质,例如,数字视频光盘(digital video disc,DVD);还可以是半导体介质,例如,固态硬盘(solid state drive,SSD)。以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (10)

  1. 一种预条件子处理方法,其特征在于,由预条件子处理设备执行,所述预条件子处理设备包括多个线程,每个线程用于执行预条件子包含的至少一个线程计算任务,包括:
    确定所述至少一个线程计算任务的线程依赖关系,所述线程依赖关系用于指示所述多个线程中每个线程当线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态能够执行所述线程计算任务;
    所述多个线程根据所述线程依赖关系执行待执行线程计算任务,所述待执行线程计算任务关联的所述至少一个依赖线程计算任务均处于已执行状态;
    根据所述至少一个线程计算任务的执行结果输出预条件子计算结果。
  2. 根据权利要求1所述的方法,其特征在于,所述确定所述至少一个线程计算任务的线程依赖关系,包括:
    将原稀疏矩阵包含的至少一个元素计算任务之间存在的元素依赖关系转换为所述线程依赖关系,所述元素依赖关系用于指示执行元素计算任务需要至少一个依赖元素计算任务处于已执行状态。
  3. 根据权利要求2所述的方法,其特征在于,所述将原稀疏矩阵包含的至少一个元素计算任务之间存在的元素依赖关系转换为线程依赖关系,包括:
    根据元素依赖关系将所述原稀疏矩阵包含的多个元素计算任务划分为多个计算层级,同一计算层级的元素计算任务能够同时被执行;
    根据任务分配结果将所述同一计算层级中由同一线程执行的元素计算任务合并为线程计算任务,所述任务分配结果用于指示元素计算任务由所述多个线程中的指定线程执行;
    根据所述元素依赖关系确定所述至少一个线程计算任务之间的线程依赖关系。
  4. 根据权利要求3所述的方法,其特征在于,在所述根据所述元素依赖关系确定所述至少一个线程计算任务之间的线程依赖关系之后,所述方法还包括:
    删除所述至少一个线程计算任务之间的线程依赖关系中的冗余依赖关系,所述冗余依赖关系表示的线程计算任务之间的依赖关系能够被自身之外的一组线程依赖关系间接表示。
  5. 根据权利要求3或4所述的方法,其特征在于,在所述多个线程根据所述线程依赖关系执行待执行线程计算任务之前,所述方法还包括:
    根据所述多个计算层级和访存顺序对所述原稀疏矩阵进行矩阵重排,得到重排矩阵,所述访存顺序为所述多个线程对所述至少一个线程计算任务的访存顺序,所述重排矩阵将每个线程执行的至少一个线程计算任务排列在一起。
  6. 根据权利要求5所述的方法,其特征在于,所述多个线程根据所述线程依赖关系执行待执行线程计算任务,包括:
    所述多个线程中每个线程根据所述重排矩阵和所述访存顺序,依次执行所述待执行线程计算任务。
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,在所述多个线程根据所述线程依赖关系执行待执行线程计算任务之前,所述方法还包括:
    根据所述至少一个线程计算任务的标志位,确定自身关联的所述至少一个依赖线程计算任务均处于已执行状态的线程计算任务为待执行线程计算任务,所述标志位用于指示线程计算任务是否处于已执行状态。
  8. 一种预条件子处理装置,其特征在于,所述装置执行多个线程,每个线程用于执行预条件子包含的至少一个线程计算任务,所述装置包括:
    分析模块,用于确定所述至少一个线程计算任务的线程依赖关系,所述线程依赖关系用于指示所述多个线程中每个线程当线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态能够执行所述线程计算任务;
    执行模块,用于所述多个线程根据所述线程依赖关系执行待执行线程计算任务,所述待执行线程计算任务关联的所述至少一个依赖线程计算任务均处于已执行状态;
    所述执行模块,还用于根据所述至少一个线程计算任务的执行结果输出预条件子计算结果。
  9. 一种计算设备,其特征在于,所述计算设备包括处理器和存储器;
    所述存储器用于存储指令,所述处理器用于执行所述存储器存储的指令,当所述处理器执行所述存储器存储的指令时,所述处理器用于执行上述权利要求1-7中任一项所述的预条件子处理方法中的操作步骤。
  10. 一种计算机系统,其特征在于,所述系统包括应用服务器和预条件子处理设备,所述应用服务器用于向所述预条件子处理设备发送稀疏线性方程组求解请求,所述预条件子处理设备用于在接收到所述稀疏线性方程组求解请求时,执行如权利要求1-7中任一项所述的预条件子处理方法。
PCT/CN2023/101175 2022-10-18 2023-06-19 预条件子处理方法、装置、设备及系统 Ceased WO2024082665A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP23878662.8A EP4589434A4 (en) 2022-10-18 2023-06-19 METHOD, APPARATUS, DEVICE AND PRECONDITIONER TREATMENT SYSTEM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211275678.1A CN117950842A (zh) 2022-10-18 2022-10-18 预条件子处理方法、装置、设备及系统
CN202211275678.1 2022-10-18

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/182,055 Continuation US20260119237A1 (en) 2022-10-18 2025-04-17 Preconditioner Processing Method and Apparatus, Device, and System

Publications (2)

Publication Number Publication Date
WO2024082665A1 true WO2024082665A1 (zh) 2024-04-25
WO2024082665A9 WO2024082665A9 (zh) 2024-05-30

Family

ID=90736804

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101175 Ceased WO2024082665A1 (zh) 2022-10-18 2023-06-19 预条件子处理方法、装置、设备及系统

Country Status (3)

Country Link
EP (1) EP4589434A4 (zh)
CN (1) CN117950842A (zh)
WO (1) WO2024082665A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119057318A (zh) * 2024-09-10 2024-12-03 吉林大学 平面短焊缝识别方法、焊接方法和装置
CN119397149A (zh) * 2024-09-03 2025-02-07 湖南大学 面向分布式平台的迭代求解法和预条件子自动选择的方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120198274B (zh) * 2025-05-27 2025-07-22 中国空气动力研究与发展中心计算空气动力研究所 无同步ilu预条件子的cfd高效gpu计算方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446004A (zh) * 2019-08-28 2021-03-05 无锡江南计算技术研究所 非结构网格dilu预条件子众核并行优化算法
CN113553288A (zh) * 2021-09-18 2021-10-26 北京大学 针对hpcg基准测试的两层分块多色并行优化方法
CN114385972A (zh) * 2021-12-20 2022-04-22 北京科技大学 一种直接求解结构化三角稀疏线性方程组的并行计算方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446004A (zh) * 2019-08-28 2021-03-05 无锡江南计算技术研究所 非结构网格dilu预条件子众核并行优化算法
CN113553288A (zh) * 2021-09-18 2021-10-26 北京大学 针对hpcg基准测试的两层分块多色并行优化方法
CN114385972A (zh) * 2021-12-20 2022-04-22 北京科技大学 一种直接求解结构化三角稀疏线性方程组的并行计算方法

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
See also references of EP4589434A4
vol. 8, 1 January 1900, SPRINGER INTERNATIONAL PUBLISHING, article PARK JONGSOO; SMELYANSKIY MIKHAIL; SUNDARAM NARAYANAN; DUBEY PRADEEP: "Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver", pages: 124 - 140, XP047616827, DOI: 10.1007/978-3-319-07518-1_8 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119397149A (zh) * 2024-09-03 2025-02-07 湖南大学 面向分布式平台的迭代求解法和预条件子自动选择的方法
CN119057318A (zh) * 2024-09-10 2024-12-03 吉林大学 平面短焊缝识别方法、焊接方法和装置

Also Published As

Publication number Publication date
CN117950842A (zh) 2024-04-30
WO2024082665A9 (zh) 2024-05-30
EP4589434A1 (en) 2025-07-23
EP4589434A4 (en) 2025-11-12

Similar Documents

Publication Publication Date Title
WO2024082665A1 (zh) 预条件子处理方法、装置、设备及系统
Guo et al. Gpu-accelerated static timing analysis
US10853544B2 (en) Selective execution for partitioned parallel simulations
Kakay et al. Speedup of FEM micromagnetic simulations with graphical processing units
Dennl et al. On-the-fly composition of FPGA-based SQL query accelerators using a partially reconfigurable module library
US7840931B2 (en) Loop manipulation if a behavioral synthesis tool
KR102882460B1 (ko) 그래프 작업 부하들을 위한 스토리지에서의 최적의 동적 샤드 생성
WO2000065492A1 (en) Method for storing multiple levels of design data in a common database
US9164969B1 (en) Method and system for implementing a stream reader for EDA tools
US8990739B2 (en) Model-based retiming with functional equivalence constraints
Muttillo et al. SystemC-based co-simulation/analysis for system-level hardware/software co-design
US20150379172A1 (en) Device and method for accelerating the update phase of a simulation kernel
Miura et al. Accelerating Regular Path Queries using FPGA.
Soldavini et al. Iris: Automatic generation of efficient data layouts for high bandwidth utilization
You et al. Scalable and efficient spatial data management on multi-core CPU and GPU clusters: A preliminary implementation based on Impala
WO2025118892A1 (zh) 一种任务处理方法及装置
US20260119237A1 (en) Preconditioner Processing Method and Apparatus, Device, and System
Chen et al. Sparsity-oriented sparse solver design for circuit simulation
US12367334B2 (en) Runtime and memory efficient attribute query handling for distributed engine
Klein et al. Tridigpu: a GPU library for block tridiagonal and banded linear equation systems
Skliarova et al. Hardware/software co-design
Gao et al. SSA: A uniformly recursive bidirection-sequence systolic sorter array
Zhan et al. FPGA based co-design of storage-side query filter for big data systems
US20240354477A1 (en) Constant, equal, or opposite registers or ports detection during logic synthesis
Chu et al. A New Design Methodology for Composing Complex Digital Systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23878662

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023878662

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2023878662

Country of ref document: EP

Effective date: 20250414

NENP Non-entry into the national phase

Ref country code: DE

WWP Wipo information: published in national office

Ref document number: 2023878662

Country of ref document: EP