WO2024082665A1 - 预条件子处理方法、装置、设备及系统 - Google Patents
预条件子处理方法、装置、设备及系统 Download PDFInfo
- Publication number
- WO2024082665A1 WO2024082665A1 PCT/CN2023/101175 CN2023101175W WO2024082665A1 WO 2024082665 A1 WO2024082665 A1 WO 2024082665A1 CN 2023101175 W CN2023101175 W CN 2023101175W WO 2024082665 A1 WO2024082665 A1 WO 2024082665A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- thread
- preconditioner
- computing
- executed
- computing task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G06F17/13—Differential equations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
Definitions
- the present application relates to the field of computers, and in particular to a preconditioner processing method, device, equipment and system.
- the problem of solving partial differential equations in high-performance computing of computers can be converted into the problem of solving large-scale sparse linear equations.
- the improvement of the convergence and convergence speed of the iteration matrix depends not only on the iteration method and the selection of parameters in the iteration matrix, but also on some changes of the equation system itself.
- the introduction of preconditioners greatly speeds up the convergence and convergence speed of iterations.
- Preconditioning large-scale sparse linear equations can improve the convergence speed of large-scale sparse linear equations.
- the data used for preconditioning is correlated, that is, the subsequent calculation tasks of some preconditioners often need to rely on the results of the previous calculation tasks, the parallel processing of these preconditioners needs to waste a lot of time waiting for the previous calculation tasks, resulting in low parallel execution efficiency of preconditioners.
- the present application provides a preconditioner processing method, device, equipment and system, which can solve the problem of low parallel execution efficiency of preconditioners.
- a preconditioner processing method is provided.
- the preconditioner processing method is executed by a preconditioner processing device, the preconditioner processing device includes multiple threads, each thread is used to execute at least one thread computing task contained in the preconditioner in the process of solving a sparse matrix linear equation system.
- the method includes: when the preconditioner processing device uses the preconditioner to perform multi-threaded parallel processing on the sparse linear equation system, the preconditioner processing device first determines the thread dependency between at least one thread computing task to be executed by the preconditioner, and then each of the multiple threads of the preconditioner processing device sequentially executes the to-be-executed thread computing task whose dependent thread computing tasks are all in the executed state, and then outputs the preconditioner computing result according to the execution result of at least one thread computing task.
- the thread dependency is used to indicate the dependency between at least one thread computing task, that is, for each thread computing task, the thread computing task can be executed when at least one dependent thread computing task associated with the thread computing task is in the executed state.
- the preconditioner processing device determines the dependency relationship between threads in the computing tasks of the preconditioner, and the preconditioner processing device performs multi-threaded parallelism on the thread computing tasks based on the thread dependency, thereby converting the full thread synchronization between layers into point-to-point synchronization between threads, that is, the thread computing tasks of the lower layer can be executed after the thread computing tasks of the upper layer on which they depend are in the executed state, thereby reducing the time consumption of synchronization between layers and improving the parallel execution efficiency of the preconditioner.
- the preconditioner processing device converts the element dependency between at least one element computing task contained in the original sparse matrix into a thread dependency based on the original sparse matrix of the sparse matrix linear equation system, that is, each thread needs to rely on which threads have completed which levels of computing tasks before executing each level of computing tasks.
- the element dependency is used to indicate that executing an element computing task requires at least one dependent element computing task to be in an executed state.
- the preconditioner sub-processor divides the multiple element calculation tasks contained in the original sparse matrix into multiple calculation levels according to the element dependency, and then merges the element calculation tasks executed by the same thread in the same calculation level into thread calculation tasks according to the task allocation result, and then determines the thread dependency between at least one thread calculation task according to the element dependency.
- the element calculation tasks of the same calculation level can be executed at the same time, and the task allocation result is used to indicate which thread among the multiple threads executes the element calculation task.
- the preconditioner sub-processor merges the element calculation tasks belonging to the same thread in each calculation level into thread calculation tasks.
- the computing tasks are integrated into one thread computing task. Each thread is responsible for only one thread computing task at each computing level. The thread computing task of a thread at the next level only depends on one thread computing task at the previous level, so that the thread dependency can be concretely represented.
- the preconditioning sub-processing device after determining the thread dependency, also deletes the redundant dependency in the thread dependency between at least one thread computing task, and the dependency between the thread computing tasks represented by the redundant dependency can be indirectly represented by a set of thread dependencies outside itself.
- the number of thread dependencies is reduced, and the thread is prevented from consuming computing resources and time to confirm the redundant dependencies before executing the thread computing task, further reducing the inter-layer synchronization overhead and improving the parallel execution efficiency.
- the preconditioner processing device rearranges the original sparse matrix and then executes at least one thread computing task in parallel using multiple threads.
- the preconditioner processing device rearranges the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, and each of the multiple threads executes the thread computing tasks to be executed in sequence according to the rearranged matrix and the memory access order when executing the thread computing tasks in parallel in multiple threads.
- the memory access order is the memory access order of the multiple threads for at least one thread computing task, and at least one thread computing task executed by each thread in the rearranged matrix is arranged together.
- each thread can access the memory continuously when performing its own calculations, which is conducive to fully utilizing the memory bandwidth.
- the matrix is rearranged to optimize the continuity of memory access and will not change any dependencies and dependency order, so it will not affect the solution of the preconditioner.
- each thread in the preconditioner processing device determines whether the thread dependency has been satisfied based on a flag. For example, the preconditioner processing device determines whether at least one dependent thread computing task associated with the thread computing task is in an executed state based on the flag of at least one thread computing task, and if so, the thread computing task is used as a thread computing task to be executed. Thus, each thread in the preconditioner processing device uses the flag to quickly and accurately determine whether the dependent thread computing task of the thread computing task has been executed, thereby improving the overall efficiency of the preconditioner multi-thread parallelism.
- a preconditioner processing device which executes multiple threads, each thread is used to execute at least one thread computing task included in the preconditioner, and the device includes various modules for executing the preconditioner processing method in the first aspect or any possible implementation of the first aspect.
- the preconditioner processing device includes an analysis module and an execution module.
- the analysis module is used to determine the thread dependency of at least one thread computing task, where the thread dependency is used to indicate that each thread in the multiple threads can execute the thread computing task when at least one dependent thread computing task associated with the thread computing task is in an executed state.
- the execution module is used to instruct multiple threads to execute the to-be-executed thread computing tasks according to the thread dependency relationship, and at least one dependent thread computing task associated with the to-be-executed thread computing task is in an executed state.
- the execution module is further used to output a preconditioner calculation result according to the execution result of at least one thread calculation task.
- the analysis module is specifically used to: convert the element dependency between at least one element calculation task contained in the original sparse matrix into a thread dependency, the element dependency is used to indicate that executing the element calculation task requires at least one dependent element calculation task to be in an executed state.
- the analysis module is specifically used to: divide multiple element calculation tasks contained in the original sparse matrix into multiple calculation levels according to element dependencies, and the element calculation tasks of the same calculation level can be executed simultaneously; merge the element calculation tasks executed by the same thread in the same calculation level into thread calculation tasks according to the task allocation result, and the task allocation result is used to indicate which thread among the multiple threads executes the element calculation task; determine the thread dependency between at least one thread calculation task according to the element dependency.
- the analysis module is further used to: delete a redundant dependency in the thread dependency between at least one thread computing task, where the dependency between the thread computing tasks represented by the redundant dependency can be indirectly represented by a group of thread dependencies outside the thread computing tasks.
- the preconditioner processing device also includes a preparation module, which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, wherein the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
- a preparation module which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, wherein the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
- the execution module is specifically used to: instruct each thread in the multiple threads to execute in sequence according to the reordering matrix and the memory access order Thread computing tasks to be executed.
- the execution module is also used to: determine, based on a flag bit of at least one thread computing task, a thread computing task whose associated at least one dependent thread computing task is in an executed state as a thread computing task to be executed, wherein the flag bit is used to indicate whether the thread computing task is in an executed state.
- the pre-conditioning sub-processing device described in the second aspect can be a terminal device or a network device, or it can be a chip (system) or other parts or components that can be set in the terminal device or the network device, or it can be a device that includes a terminal device or a network device, and this application does not limit this.
- a preconditioner processing device comprising a memory and a processor, wherein the memory is used to store a set of computer instructions, and when the processor executes the set of computer instructions, it is used to execute the operating steps of the preconditioner processing method in any possible design in the first aspect.
- a computer system comprising an application server and a preconditioner processing device, wherein the application server is used to send a request for solving a sparse linear equation group to the preconditioner processing device when calculating a sparse linear equation group, and the preconditioner processing device is used to execute the operating steps of the preconditioner processing method in any possible design in the first aspect.
- the technical effects of the preconditioner processing device described in the second aspect, the technical effects of the preconditioner processing equipment described in the third aspect, and the technical effects of the computer system described in the fourth aspect can refer to the technical effects of the preconditioner processing method described in the first aspect, and will not be repeated here.
- a computer-readable storage medium comprising: computer software instructions; when the computer software instructions are executed in a data processing system, the computing device executes the operating steps of the method described in any possible implementation of the first aspect.
- a computer program product is provided.
- the computing device executes the operation steps of the method described in any possible implementation manner of the first aspect.
- FIG1 is a schematic diagram of a coloring method provided by the present application.
- FIG2 is a schematic diagram of a layering method provided by the present application.
- FIG3 is a schematic diagram of the structure of a computer system provided by the present application.
- FIG4 is a schematic diagram of a flow chart of a preconditioner processing method provided by the present application.
- FIG5 is a schematic diagram of a task allocation result provided by the present application.
- FIG6 is a schematic diagram of a principle of matrix rearrangement provided by the present application.
- FIG7 is a schematic diagram of the principle of pruning redundant dependencies provided by the present application.
- FIG8 is a schematic diagram of a process flow of a thread executing a computing task provided by the present application.
- FIG9 is a schematic diagram of the structure of a preconditioner processing device provided by the present application.
- FIG10 is a schematic diagram of the structure of a computing device provided by the present application.
- the process of solving a linear system is the process of obtaining the solution vector x through A and b.
- the dimensions of x and b are n
- the dimension of A is n*n, that is, there are n ⁇ 2 elements.
- most of the n ⁇ 2 elements of A are The numbers are all 0. The proportion of 0 varies depending on different fields and problems, and is generally less than 0.1% or even lower.
- the direct method is to obtain the exact solution of the equations through a finite number of operations including matrix decomposition and triangular equations solution without considering the round-off error, also known as the exact method.
- the iterative method is to construct a vector sequence through certain calculations given an initial solution vector (a series of approximate solutions close to the exact value are obtained through successive iterations), and the limit of the vector sequence is used as the theoretical exact solution of the sparse linear equations.
- the iterative method requires less storage space and has advantages in solving high-order non-ill-conditioned sparse linear equations.
- the iterative method has problems such as slow convergence or no convergence when solving ill-conditioned problems, and difficulty in precision control.
- problems such as slow convergence or no convergence when solving ill-conditioned problems, and difficulty in precision control.
- the result When solving the system of equations, if the data is slightly disturbed, the result will have large fluctuations.
- Such a matrix is called an ill-conditioned matrix.
- preconditioners are introduced into the iterative method to solve the above problems of the iterative method.
- Preconditioners are used to improve the properties of the coefficient matrix of the linear equation system, and transform the given linear equation system into a transformation (matrix) of the equation system that is easier to solve. They are mainly used to accelerate the convergence of the iterative method.
- the preconditioner is one of the most important modules.
- the existence of the preconditioner can help improve the convergence speed of the iterative method on the one hand, and help solve the ill-conditioned matrix on the other hand.
- the preconditioner supports nesting. For example, method A and method B are executed in sequence to jointly complete the function of the preconditioner, and the bottom-level preconditioner often chooses sor or ilu, which is mainly due to its relatively simple algorithm flow and good convergence.
- Sor algorithms include Gauss-Seidel, symmetric Gauss-Seidel, symmetric successive over-relaxation, etc.
- Ilu algorithms include ilu(0), ilu(1), ilu(k), etc. Different from sor algorithms, ilu algorithms are divided into two steps in terms of process, ilu decomposition process and ilu re-banding process. However, decomposition and re-banding are consistent in calculation logic, so they can also be considered together here.
- the LU algorithm is the core method for solving sparse matrices using the direct method.
- the ILU algorithm an incomplete LU algorithm, can control the number of non-zero fill-ins according to the fill-in level to achieve higher efficiency.
- 0 means the fill-in level is 0, that is, there will be no fill-in, which means that if the original position is non-zero, it will be continuously updated during the decomposition process, and if the original position is 0, it will not be updated and will always be 0.
- the solution accuracy of ILU is not as good as that of the LU algorithm, but as a preconditioner, it meets the accuracy requirements, because the final solution correctness of the sparse equation is guaranteed by the iterative method, and the preconditioner is only used as a means to accelerate this process.
- ilu(0) is divided into two processes: decomposition and back-banding.
- the former is to numerically decompose the original matrix to obtain a lower triangular matrix L and an upper triangular matrix U.
- Figure 1 is the pseudo code of the ilu(0) decomposition process. It is worth noting that since ilu(0) does not fill in new non-zero elements, the decomposed lower triangular matrix and upper triangular matrix can completely use the space of the original matrix A, that is, a local decomposition is performed.
- the above dependency makes multi-threaded parallel acceleration very difficult, because the premise of parallelism is to have enough unrelated computing tasks.
- the existing technology mainly focuses on how to mine parallel computing tasks, which can be roughly divided into mining parallel tasks and creating parallel tasks. The former is to analyze which computing tasks can be executed in parallel without dependencies by analyzing the dependency graph, and the latter is to forcibly break some dependencies to create more unrelated parallel computing tasks.
- the coloring method is to color each element in the dependency graph obtained from the original sparse matrix A with a different color, and the elements of the same color are not
- the elements of different colors are executed in series.
- coloring methods such as coloring after determining the threshold range, that is, the points within the threshold range are colored with different colors.
- Figure 1 is a coloring sample diagram when the threshold is 1.
- 0-12 represent different elements contained in the original sparse matrix A.
- the threshold is only 1, which means that adjacent elements need to be colored in different colors, and non-adjacent elements can be colored in the same color.
- 0, 3, 4, 10, 11, and 12 are colored in the same color and can be executed in parallel; 1, 5, and 7 are colored in the same color and can be executed in parallel; 2, 6, 8, and 9 are colored in the same color and can be executed in parallel.
- elements 0, 3, 4, 10, 11, and 12, elements 1, 5, and 7, and elements 2, 6, 8, and 9 are colored in three different colors respectively.
- Different grayscales are used to represent different colors in Figure 1. Different colors need to be executed serially.
- the nested multi-level main diagonal block method is to create parallel tasks by making nested symmetric changes to the original matrix A.
- Any sparse matrix can be transformed into A11 and A22 are independent of each other and can be executed in parallel by two threads, and the rest are executed in series. If the above symmetric transformation is nested, more threads can be executed in parallel, as shown below:
- the layered method divides the calculations for different elements into layers based on the dependencies between the elements. Elements at the same layer can be calculated simultaneously, and elements at different layers are executed in sequence.
- the 16 elements are divided into 4 layers and executed in parallel by 3 threads.
- Different grayscales represent different colors, and different colors represent the computing tasks that different threads are responsible for.
- the existing technology uses the coloring method or the nested multi-level main diagonal block method to perform multi-threaded parallel processing on the preconditioner, which has the problem of reducing the convergence of the algorithm.
- the Krylov subspace iteration method is combined with ILU (0) as the preconditioner.
- the serial ILU (0) is used, the number of iterations is increased to 70 steps after switching to the above two parallel methods. This is mainly because the coloring method or the nested multi-level main diagonal block method breaks the original dependency relationship.
- the former is through the graph method, and the latter is through the mathematical transformation method.
- the dependency relationship that should have been considered is ignored and forced parallelization is performed, resulting in a decrease in convergence; through the mathematical transformation method, the order of decomposition is changed.
- the result of ILU is strongly related to the order, so the mathematical transformation will still change (most likely reduce) the convergence.
- the remaining parts of the nested multi-level main diagonal block method are usually difficult to parallelize, so the parallel efficiency is low in the later stage of the algorithm.
- the layered method although the element calculation tasks in different layers can be multi-threaded and parallelized, the layers need to be synchronized to ensure correctness. When the number of cores increases, the synchronization overhead will increase significantly, resulting in a decrease in the overall speed of the algorithm. Furthermore, the jumpy memory access of threads will also lead to a decrease in memory access efficiency, and the memory bandwidth resources cannot be fully utilized.
- the present application provides a preconditioner processing method, in particular, a preconditioner processing method for multi-threaded parallel processing of preconditioners based on thread dependencies, that is, a preconditioner processing device determines the thread dependency between at least one thread computing task contained in the preconditioner, the thread dependency is used to indicate that the execution of the thread computing task requires at least one dependent thread computing task to be in an executed state, and each thread of the preconditioner processing device sequentially executes the to-be-executed thread computing task of which at least one associated dependent thread computing task is in an executed state according to the memory access order, and outputs the preconditioner computing result according to the execution result of at least one thread computing task.
- the preconditioner processing device converts the dependency between elements in the computing task of the preconditioner into the dependency between threads, and the preconditioner processing device performs multi-threaded parallel processing on the thread computing tasks based on the thread dependencies, and each thread in the lower layer of the thread computing task executes the lower layer of the thread computing task after the upper layer of the thread computing task on which it depends is in an executed state, thereby converting the full thread synchronization between layers into point-to-point synchronization between threads.
- FIG3 is a schematic diagram of the structure of a computer system provided by the present application.
- the computer system 300 includes an application server 301 and a preconditioning sub-processing device 302 .
- the application server 301 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, a virtual reality (VR) device, an augmented reality (AR) device, a mixed reality (MR) device, an extended reality (ER) device, a camera or a vehicle-mounted terminal, etc. It can also be an edge device (for example, a box carrying a chip with processing capabilities), etc.
- the preconditioner processing device 302 may be a terminal, or other computing device that supports the calculation of sparse linear equations, such as a server or a cloud device.
- the application server 301 and the preconditioning sub-processing device 302 are different processors deployed on different physical devices (such as: a server or a server in a cluster).
- the execution device 410 can be a graphics processing unit (GPU), a central processing unit (CPU), other general-purpose processors, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
- the general-purpose processor can be a microprocessor or any conventional processor, etc.
- the preconditioning sub-processing device 302 can be a graphics processing unit (GPU), a neural network processor (NPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.
- GPU graphics processing unit
- NPU neural network processor
- ASIC application-specific integrated circuit
- the application server 301 and the preconditioning sub-processing device 302 are deployed on the same physical device, or the application server 301 and the preconditioning sub-processing device 302 are the same physical device.
- the application server 301 is used to determine the sparse linear equation group that needs to be solved in response to the user operation, and send a sparse linear equation group solution request to the preconditioner processing device 302 based on the sparse linear equation group.
- the preconditioning sub-processing device 302 is used to receive a sparse linear equation solution request sent by the application server 301 , calculate the sparse linear equation system, and return the solution result of the sparse linear equation system to the application server 301 .
- the preconditioner processing device 302 is provided with software in the field of high-performance computing including core computing functions such as partial differential equation solvers.
- the core module of the differential equation solver is a linear solver, which supports the partial differential equation solver upward and indirectly supports software in the field of high-performance computing.
- the linear solver can include different algorithms, such as the conjugate gradient method, the generalized minimum residual method, the generalized conjugate residual method, the Chebyshev method, etc.
- the key module of the linear solver is the preconditioner, which can include different algorithms, such as the additive Schwarz algorithm, the block Jacobi algorithm, the Jacobi algorithm, the ILU algorithm, the SOR algorithm, the multigrid algorithm, etc.
- These preconditioners are a nested process, and different algorithms can be nested together. As shown by the arrows in Figure 3, multiple preconditioners often end up nesting ILU algorithms or SOR algorithms.
- the user inputs the problem to be solved into the software in the field of high-performance computing through the operation interface of the preconditioner processing device 302, and the software calls the partial differential equation solver to solve the partial differential equation contained in the problem to be solved.
- the partial differential equation solver discretizes the equation and converts the partial differential equation solving problem into a large-scale sparse linear equation system solving problem.
- the partial differential equation solver calls the linear solver to obtain the solution result of the large-scale sparse linear equation system.
- the linear solver also uses the preconditioner to execute the preconditioner processing method to optimize the original sparse matrix of the large-scale sparse linear equation system.
- FIG3 is merely a schematic diagram of a system architecture provided in the present application, and the positional relationship between the devices, components, modules, etc. shown in FIG3 does not constitute any limitation.
- the preconditioner in the preconditioner processing device 302 in Figure 4 is an ILU type algorithm or a SOR type algorithm as an example for description.
- Step 410 The preconditioning sub-processing device 302 determines the thread dependency of at least one thread computing task.
- the preconditioner processing device 302 analyzes the thread dependency relationship between all the thread computing tasks in at least one thread computing task according to the structural information of the original sparse matrix.
- the structural information refers to the elements and element arrangement order contained in the original sparse matrix.
- the original sparse matrix input in the form of an array contains the structural information of the original sparse matrix.
- the preconditioner processing device 302 performs task allocation according to the structural information of the original sparse matrix, obtains the task allocation result, and then determines the thread dependency based on the task allocation result.
- the task allocation result is used to indicate which element computing tasks of different computing levels each thread needs to process, and the thread dependency is used to indicate which threads have completed which thread computing tasks in order to execute the thread computing tasks.
- the preconditioner processing device 302 divides the multiple element calculation tasks included in the original sparse matrix into multiple calculation levels according to the element dependencies in the original sparse matrix.
- the element dependency in the original sparse matrix refers to the dependency between the calculation tasks of each element in the original sparse matrix, that is, an element calculation task in the original sparse matrix may have an associated element dependency or may not have an associated element dependency.
- the preconditioning sub-processing device 302 combines the element computing tasks executed by the same thread in the same computing level into thread computing tasks according to the task allocation result.
- the thread dependency between thread computing tasks is similar to the element dependency between element computing tasks, and there is not necessarily a thread dependency between every two thread computing tasks.
- the preconditioner processing device 302 divides the original sparse matrix into layers, it evenly distributes the element calculation tasks of each calculation layer to three threads.
- the first layer as an example, there are six element calculation tasks, which are evenly distributed to three threads, and each thread is used to execute two element calculation tasks.
- the element calculation tasks belonging to the same thread in each layer are integrated as a thread calculation task.
- Each thread is responsible for one thread calculation task in each layer. For example, thread 0 is responsible for executing one thread calculation task in the 0th layer, the 1st layer, the 2nd layer, and the 3rd layer respectively.
- the preconditioning sub-processing device 302 determines the thread dependency relationship between at least one thread computing task according to the element dependency relationship.
- the method of analyzing thread dependencies is explained with the task allocation result shown in FIG5.
- the preconditioning sub-processing device 302 determines that the thread computing task of thread A at the i-th computing level depends on the thread computing task of thread B at the j-th computing level if and only if any element computing task that thread A needs to execute at the i-th computing level depends on any element computing task that thread B needs to execute at the j-th computing level, where i and j are integers.
- the preconditioning sub-processing device 302 visualizes the thread dependency as thread A needs to rely on thread B to complete the execution of computing level j when executing computing level i.
- the preconditioner processing device 302 after the preconditioner processing device 302 hierarchically divides the original sparse matrix, before executing step 420, it can also perform matrix rearrangement and redundant dependency pruning on the matrix after the hierarchical division to further improve the overall computational efficiency of the preconditioner.
- matrix rearrangement please refer to FIG6 and related steps
- redundant dependency pruning please refer to FIG7 and related steps, which will not be repeated here.
- Step 420 Each of the multiple threads of the preconditioning sub-processing device 302 executes the to-be-executed thread computing task according to the thread dependency relationship.
- Each thread of the preconditioning sub-processing device 302 sequentially executes the to-be-executed thread computing task whose associated at least one dependent thread computing task is in the executed state according to the thread dependency relationship.
- each thread of the preconditioning sub-processing device 302 sequentially calculates the thread computing tasks for which it is responsible according to the computing level, it is also necessary to determine whether the thread computing task is a to-be-executed thread computing task for which at least one associated dependent thread computing task is in an executed state. For example, in FIG5 , the thread computing task of thread 1 at the first computing level depends on the thread computing task of thread 0 at the 0th computing level, and the thread computing task of thread 2 at the 0th computing level.
- each thread of the preconditioning sub-processing device 302 uses a flag bit to indicate whether the thread computing task is in an executed state, and each thread determines whether the thread computing task is a to-be-executed thread computing task whose dependent thread computing tasks are all in an executed state based on the flag bit of the dependent thread computing task associated with the thread computing task.
- the flag bit is a mask flag bit
- the flag bit corresponding to the thread computing task in the mask is 1, indicating that the thread computing task is in an executed state
- the flag bit corresponding to the thread computing task in the mask is 0, indicating that the thread computing task is in an unexecuted state
- the flag bit corresponding to the thread computing task in the mask is 0, indicating that the thread computing task is in an executed state
- the flag bit corresponding to the thread computing task in the mask is 1, indicating that the thread computing task is in an unexecuted state.
- Step 430 The preconditioner processing device 302 outputs a preconditioner calculation result according to the execution result of at least one thread computing task.
- each thread of the preconditioner processing device 302 After each thread of the preconditioner processing device 302 has completed the thread computing tasks of all computing levels, it outputs the preconditioner computing results according to the execution results of all the thread computing tasks.
- the thread computing tasks included in the preconditioner include SOR algorithm rewinding, ILU algorithm rewinding, ILU algorithm decomposition, etc. If the thread computing task is SOR algorithm rewinding, the input data obtained by the thread is the right-hand side vector B and the initial value vector X of the sparse linear equations, and the output preconditioner calculation result is the solution vector X. If the thread computing task is ILU algorithm rewinding, the input data obtained by the thread is the right-hand side vector B of the sparse linear equations, and the output preconditioner calculation result is the solution vector X. If the thread computing task is ILU algorithm decomposition, the input data obtained by the thread is the input matrix itself, and the output preconditioner calculation result is the decomposed matrix.
- the preconditioner processing method can form a third-party mathematical library, which can be used by linear solver frameworks (such as Portable, Extensible Toolkit for Scientific Computation (PETSc), High Performance Preconditioners (Hypre), etc.) in the form of a mathematical library, and thus indirectly used by upper-level applications such as software in the field of high-performance computing.
- the preconditioner processing method can also be used directly by the solver framework or upper-level applications in the form of code.
- the above-mentioned thread computing tasks such as SOR algorithm rewinding, ILU algorithm rewinding, and ILU algorithm decomposition will be executed in parallel on different threads.
- the preconditioner processing device converts the dependency between elements in the calculation task of the preconditioner into the dependency between threads, and the preconditioner processing device performs multi-threaded parallel processing on the thread calculation task based on the thread dependency, and each thread in the lower layer executes the thread calculation task of the lower layer after the upper layer thread calculation task on which it depends is in the executed state, thereby converting the full thread synchronization between layers into point-to-point synchronization between threads.
- the preconditioning sub-processing device 302 arranges the element calculation tasks that each thread is responsible for together, that is, for each thread, arranges the thread calculation tasks that the thread is responsible for at each calculation level (the thread calculation tasks at each calculation level include one or more element calculation tasks, which are shown as element calculation tasks in FIG6 ) together, so that each thread can continuously access the memory when performing its own calculation, thereby giving full play to the memory bandwidth.
- the rearrangement in this embodiment is only for optimizing the continuity of memory access, and will not change any dependency relationship and dependency order, so it will not affect the solution result.
- the preconditioning sub-processing device 302 deletes a redundant dependency in the thread dependency between at least one thread computing task, where the redundant dependency means that the dependency between the thread computing tasks can be indirectly represented by a group of thread dependencies other than the thread computing tasks.
- a group of thread dependencies refers to two or more thread dependencies.
- the computation of thread 0 at the second level depends on the computation of thread 1 at the 0th computation level and the computation of thread 2 at the 1st computation level.
- the computation of thread 2 at the 1st computation level depends on the computation of thread 1 at the 0th computation level.
- the dependency of thread 0 at the 2nd computation level on thread 1 at the 0th computation level is redundant. It is indirectly guaranteed by thread 2's dependency on thread 1. Therefore, when similar dependencies are issued, the redundant parts will be removed, thereby further reducing the overhead of inter-layer synchronization and improving the overall computational efficiency of the preconditioner.
- each thread of the preconditioning sub-processing device 302 sequentially executing thread computing tasks of each computing level in conjunction with FIG. 8 .
- the process includes steps 810 to 850 .
- Step 810 before executing the thread computing task of the i-th computing level, thread 1 determines whether there is an undetected flag bit, if so, executes step 820 , if not, executes step 830 .
- Thread 1 detects the flag bits of the dependent thread tasks associated with the thread computing task at the i-th computing level to determine whether there are undetected flag bits in the flag bits of the associated dependent thread tasks.
- Step 820 thread 1 obtains the undetected flag bit and determines whether the undetected flag bit is a value indicating that the thread computing task has been executed. If so, execute step 810. If not, determine again whether the undetected flag bit is a value indicating that the thread computing task has been executed.
- Thread 1 uses an atomic operation to fetch the undetected flag bit. If the undetected flag bit is true (for example, the value of the flag bit is 1), step 810 is executed. If the undetected flag bit is not true (for example, the value of the flag bit is 0), it is determined again whether the undetected flag bit is true.
- thread 1 periodically determines whether the undetected flag is true, that is, thread 1 determines whether the undetected flag is true twice with a preset time interval between each two times. If the undetected flag is not true, thread 1 continuously detects whether the undetected flag is true.
- Step 830 Thread 1 executes the thread computing task of the i-th computing level.
- Thread 1 executes the thread computing task of the i-th computing level according to the input data, and outputs the computing result of the thread computing task of the i-th computing level.
- the thread computing task at the i-th computing level is the sor loop computing task
- the input data received by thread 1 is the right-hand vector b and the initial value vector x
- the calculation result is the solution vector x.
- the thread computing task at the i-th computing level is the ilu back-carrying computing task
- the input data received by thread 1 is the right-hand vector b
- the calculation result is the solution vector x.
- the thread computing task at the i-th computing level is the ilu decomposition computing task
- the input data received by thread 1 is the rearranged matrix of the original sparse matrix
- the computing result is the decomposed matrix
- Step 840 Thread 1 modifies the flag bit of the thread computing task at the i-th computing level.
- thread 1 modifies the flag bit of the thread computing task at the i-th computing level to a value indicating that the thread computing task has been executed, such as true.
- thread 1 may also notify threads that have thread dependencies with the thread computing task at the i-th computing level.
- Thread 1 executes step 810 to step 850 in a loop until thread 1 completes all the thread computing tasks of all computing levels for which it is responsible.
- This embodiment illustrates the process of a thread in the preconditioner processing device 302 processing the thread computing task of the preconditioner through the above steps 810 to 850.
- the multiple threads included in the preconditioner processing device 302 also process the thread computing tasks for which they are responsible in parallel. After all the threads of the preconditioner processing device 302 have completed the thread computing tasks for which they are responsible, the optimization processing of the preconditioner on the original sparse matrix is completed, and the preconditioner calculation result is output.
- FIG9 is a schematic diagram of a possible preconditioning subprocessing device provided in this embodiment.
- the preconditioning subprocessing device can be used to implement the functions of the execution device in the above method embodiment, and thus can also achieve the beneficial effects possessed by the above method embodiment.
- the preconditioning subprocessing device can be the application server 301 and the preconditioning subprocessing device 302 as shown in FIG3 , or can be a module (such as a chip) applied to the server.
- the preconditioner processing device 900 includes an analysis module 910 and an execution module 920.
- the preconditioner processing device 900 is used to implement the function of the preconditioner processing device 302 in the method embodiment shown in FIG.
- the analysis module 910 is used to determine the thread dependency of at least one thread computing task, where the thread dependency is used to indicate the thread dependency of multiple threads. Each thread in the threads can execute the thread computing task when at least one dependent thread computing task associated with the thread computing task is in the executed state. For example, the analysis module 910 is used to execute step 410 in FIG. 4 above.
- the execution module 920 is used to instruct multiple threads to execute the pending thread computing tasks according to the thread dependency relationship, and at least one dependent thread computing task associated with the pending thread computing task is in the executed state. For example, the execution module 920 is used to execute step 420 in FIG. 4 above.
- the execution module 920 is further configured to output a preconditioner calculation result according to the execution result of at least one thread calculation task.
- the execution module 920 is configured to execute step 430 in FIG. 4 .
- the analysis module 910 is specifically used to: convert the element dependency between at least one element calculation task contained in the original sparse matrix into a thread dependency, and the element dependency is used to indicate that executing the element calculation task requires at least one dependent element calculation task to be in an executed state.
- the analysis module 910 is specifically used to: divide the multiple element calculation tasks contained in the original sparse matrix into multiple calculation levels according to the element dependency, and the element calculation tasks of the same calculation level can be executed simultaneously; merge the element calculation tasks executed by the same thread in the same calculation level into thread calculation tasks according to the task allocation result, and the task allocation result is used to indicate which thread among the multiple threads executes the element calculation task; determine the thread dependency between at least one thread calculation task according to the element dependency.
- the analysis module 910 is further used to: delete redundant dependencies in the thread dependencies between at least one thread computing task, where the dependencies between the thread computing tasks represented by the redundant dependencies can be indirectly represented by a set of thread dependencies outside themselves.
- the preconditioner processing device 900 also includes a preparation module, which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, where the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
- a preparation module which is used to: rearrange the original sparse matrix according to multiple computing levels and memory access orders to obtain a rearranged matrix, where the memory access order is the memory access order of multiple threads for at least one thread computing task, and the rearranged matrix arranges at least one thread computing task executed by each thread together.
- the execution module 920 is specifically used to instruct each thread among the multiple threads to sequentially execute the to-be-executed thread computing tasks according to the reordered matrix and the memory access order.
- the execution module 920 is also used to: determine, based on a flag bit of at least one thread computing task, a thread computing task in which at least one dependent thread computing task associated with it is in an executed state as a thread computing task to be executed, and the flag bit is used to indicate whether the thread computing task is in an executed state.
- the preconditioner processing device 900 of the embodiment of the present application can be implemented by a GPU, an NPU, an ASIC, or a programmable logic device (PLD), and the PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
- CPLD complex programmable logical device
- FPGA field-programmable gate array
- GAL generic array logic
- the preconditioner processing device 900 and its various modules can also be software modules.
- the preconditioning sub-processing device 900 of the embodiment of the present application may correspond to executing the method described in the embodiment of the present application, and the above and other operations and/or functions of each unit in the preconditioning sub-processing device 900 are respectively for implementing the corresponding processes of each method in Figure 4, which will not be repeated here for the sake of brevity.
- the present application embodiment also provides a computing device, please refer to Figure 10,
- Figure 10 is a schematic diagram of the structure of a computing device provided by the present application embodiment.
- the computing device 1000 includes a memory 1001, a processor 1002, a communication interface 1003 and a bus 1004. Among them, the memory 1001, the processor 1002, and the communication interface 1003 are connected to each other through the bus 1004.
- the computing device 1000 can be the application server 301 or the preconditioning sub-processing device 302 in Figure 3.
- the memory 1001 may be a read-only memory, a static storage device, a dynamic storage device or a random access memory.
- the memory 1001 may store computer instructions. When the computer instructions stored in the memory 1001 are executed by the processor 1002, the processor 1002 and the communication interface 1003 are used to execute the steps in the preconditioner processing method of the software system. For example, the processor 1002 is used to execute steps 410 to 430 in the preconditioner processing method shown in FIG. 4 above, as well as the functions of each module of the preconditioner processing device 900.
- the memory may also store a data set, for example: a portion of the storage resources in the memory 1001 is divided into an area for storing a program for implementing the function of the preconditioner of the embodiment of the present application.
- the processor 1002 may be a general-purpose CPU, an application specific integrated circuit (ASIC), a GPU, or any combination thereof.
- the processor 1002 may include one or more chips.
- the processor 1002 may include an AI accelerator, such as an NPU.
- the communication interface 1003 uses a transceiver module such as, but not limited to, a transceiver to implement communication between the computing device 1000 and other devices or communication networks. For example, data such as the original sparse matrix can be obtained through the communication interface 1003 .
- the bus 1004 may include a path for transmitting information between various components of the computing device 1000 (eg, the memory 1001 , the processor 1002 , and the communication interface 1003 ).
- the computing device 1000 may be a computer (eg, a server) in a cloud data center, or a computer in an edge data center, or a terminal.
- a computer eg, a server
- a computer in an edge data center, or a terminal.
- An application server 301 and/or a pre-conditioning sub-processing device 302 may be deployed on each computing device 1000 .
- the method steps in this embodiment can be implemented by hardware or by a processor executing software instructions.
- the software instructions can be composed of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, mobile hard disks, CD-ROMs, or any other form of storage medium known in the art.
- An exemplary storage medium is coupled to the processor so that the processor can read information from the storage medium and write information to the storage medium.
- the storage medium can also be a component of the processor.
- the processor and the storage medium can be located in an ASIC.
- the ASIC can be located in a terminal device.
- the processor and the storage medium can also exist as discrete components in a network device or a terminal device.
- the computer program product includes one or more computer programs or instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device or other programmable device.
- the computer program or instruction may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer program or instruction may be transmitted from one website site, computer, server or data center to another website site, computer, server or data center by wired or wireless means.
- the computer-readable storage medium may be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media.
- the available medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a tape; it may also be an optical medium, for example, a digital video disc (DVD); it may also be a semiconductor medium, for example, a solid state drive (SSD).
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Analysis (AREA)
- Data Mining & Analysis (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Operations Research (AREA)
- Computing Systems (AREA)
- Multi Processors (AREA)
- Complex Calculations (AREA)
Abstract
Description
Claims (10)
- 一种预条件子处理方法,其特征在于,由预条件子处理设备执行,所述预条件子处理设备包括多个线程,每个线程用于执行预条件子包含的至少一个线程计算任务,包括:确定所述至少一个线程计算任务的线程依赖关系,所述线程依赖关系用于指示所述多个线程中每个线程当线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态能够执行所述线程计算任务;所述多个线程根据所述线程依赖关系执行待执行线程计算任务,所述待执行线程计算任务关联的所述至少一个依赖线程计算任务均处于已执行状态;根据所述至少一个线程计算任务的执行结果输出预条件子计算结果。
- 根据权利要求1所述的方法,其特征在于,所述确定所述至少一个线程计算任务的线程依赖关系,包括:将原稀疏矩阵包含的至少一个元素计算任务之间存在的元素依赖关系转换为所述线程依赖关系,所述元素依赖关系用于指示执行元素计算任务需要至少一个依赖元素计算任务处于已执行状态。
- 根据权利要求2所述的方法,其特征在于,所述将原稀疏矩阵包含的至少一个元素计算任务之间存在的元素依赖关系转换为线程依赖关系,包括:根据元素依赖关系将所述原稀疏矩阵包含的多个元素计算任务划分为多个计算层级,同一计算层级的元素计算任务能够同时被执行;根据任务分配结果将所述同一计算层级中由同一线程执行的元素计算任务合并为线程计算任务,所述任务分配结果用于指示元素计算任务由所述多个线程中的指定线程执行;根据所述元素依赖关系确定所述至少一个线程计算任务之间的线程依赖关系。
- 根据权利要求3所述的方法,其特征在于,在所述根据所述元素依赖关系确定所述至少一个线程计算任务之间的线程依赖关系之后,所述方法还包括:删除所述至少一个线程计算任务之间的线程依赖关系中的冗余依赖关系,所述冗余依赖关系表示的线程计算任务之间的依赖关系能够被自身之外的一组线程依赖关系间接表示。
- 根据权利要求3或4所述的方法,其特征在于,在所述多个线程根据所述线程依赖关系执行待执行线程计算任务之前,所述方法还包括:根据所述多个计算层级和访存顺序对所述原稀疏矩阵进行矩阵重排,得到重排矩阵,所述访存顺序为所述多个线程对所述至少一个线程计算任务的访存顺序,所述重排矩阵将每个线程执行的至少一个线程计算任务排列在一起。
- 根据权利要求5所述的方法,其特征在于,所述多个线程根据所述线程依赖关系执行待执行线程计算任务,包括:所述多个线程中每个线程根据所述重排矩阵和所述访存顺序,依次执行所述待执行线程计算任务。
- 根据权利要求1-6中任一项所述的方法,其特征在于,在所述多个线程根据所述线程依赖关系执行待执行线程计算任务之前,所述方法还包括:根据所述至少一个线程计算任务的标志位,确定自身关联的所述至少一个依赖线程计算任务均处于已执行状态的线程计算任务为待执行线程计算任务,所述标志位用于指示线程计算任务是否处于已执行状态。
- 一种预条件子处理装置,其特征在于,所述装置执行多个线程,每个线程用于执行预条件子包含的至少一个线程计算任务,所述装置包括:分析模块,用于确定所述至少一个线程计算任务的线程依赖关系,所述线程依赖关系用于指示所述多个线程中每个线程当线程计算任务关联的至少一个依赖线程计算任务均处于已执行状态能够执行所述线程计算任务;执行模块,用于所述多个线程根据所述线程依赖关系执行待执行线程计算任务,所述待执行线程计算任务关联的所述至少一个依赖线程计算任务均处于已执行状态;所述执行模块,还用于根据所述至少一个线程计算任务的执行结果输出预条件子计算结果。
- 一种计算设备,其特征在于,所述计算设备包括处理器和存储器;所述存储器用于存储指令,所述处理器用于执行所述存储器存储的指令,当所述处理器执行所述存储器存储的指令时,所述处理器用于执行上述权利要求1-7中任一项所述的预条件子处理方法中的操作步骤。
- 一种计算机系统,其特征在于,所述系统包括应用服务器和预条件子处理设备,所述应用服务器用于向所述预条件子处理设备发送稀疏线性方程组求解请求,所述预条件子处理设备用于在接收到所述稀疏线性方程组求解请求时,执行如权利要求1-7中任一项所述的预条件子处理方法。
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23878662.8A EP4589434A4 (en) | 2022-10-18 | 2023-06-19 | METHOD, APPARATUS, DEVICE AND PRECONDITIONER TREATMENT SYSTEM |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211275678.1A CN117950842A (zh) | 2022-10-18 | 2022-10-18 | 预条件子处理方法、装置、设备及系统 |
| CN202211275678.1 | 2022-10-18 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US19/182,055 Continuation US20260119237A1 (en) | 2022-10-18 | 2025-04-17 | Preconditioner Processing Method and Apparatus, Device, and System |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2024082665A1 true WO2024082665A1 (zh) | 2024-04-25 |
| WO2024082665A9 WO2024082665A9 (zh) | 2024-05-30 |
Family
ID=90736804
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/101175 Ceased WO2024082665A1 (zh) | 2022-10-18 | 2023-06-19 | 预条件子处理方法、装置、设备及系统 |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4589434A4 (zh) |
| CN (1) | CN117950842A (zh) |
| WO (1) | WO2024082665A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119057318A (zh) * | 2024-09-10 | 2024-12-03 | 吉林大学 | 平面短焊缝识别方法、焊接方法和装置 |
| CN119397149A (zh) * | 2024-09-03 | 2025-02-07 | 湖南大学 | 面向分布式平台的迭代求解法和预条件子自动选择的方法 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120198274B (zh) * | 2025-05-27 | 2025-07-22 | 中国空气动力研究与发展中心计算空气动力研究所 | 无同步ilu预条件子的cfd高效gpu计算方法 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112446004A (zh) * | 2019-08-28 | 2021-03-05 | 无锡江南计算技术研究所 | 非结构网格dilu预条件子众核并行优化算法 |
| CN113553288A (zh) * | 2021-09-18 | 2021-10-26 | 北京大学 | 针对hpcg基准测试的两层分块多色并行优化方法 |
| CN114385972A (zh) * | 2021-12-20 | 2022-04-22 | 北京科技大学 | 一种直接求解结构化三角稀疏线性方程组的并行计算方法 |
-
2022
- 2022-10-18 CN CN202211275678.1A patent/CN117950842A/zh active Pending
-
2023
- 2023-06-19 WO PCT/CN2023/101175 patent/WO2024082665A1/zh not_active Ceased
- 2023-06-19 EP EP23878662.8A patent/EP4589434A4/en active Pending
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112446004A (zh) * | 2019-08-28 | 2021-03-05 | 无锡江南计算技术研究所 | 非结构网格dilu预条件子众核并行优化算法 |
| CN113553288A (zh) * | 2021-09-18 | 2021-10-26 | 北京大学 | 针对hpcg基准测试的两层分块多色并行优化方法 |
| CN114385972A (zh) * | 2021-12-20 | 2022-04-22 | 北京科技大学 | 一种直接求解结构化三角稀疏线性方程组的并行计算方法 |
Non-Patent Citations (2)
| Title |
|---|
| See also references of EP4589434A4 |
| vol. 8, 1 January 1900, SPRINGER INTERNATIONAL PUBLISHING, article PARK JONGSOO; SMELYANSKIY MIKHAIL; SUNDARAM NARAYANAN; DUBEY PRADEEP: "Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver", pages: 124 - 140, XP047616827, DOI: 10.1007/978-3-319-07518-1_8 * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119397149A (zh) * | 2024-09-03 | 2025-02-07 | 湖南大学 | 面向分布式平台的迭代求解法和预条件子自动选择的方法 |
| CN119057318A (zh) * | 2024-09-10 | 2024-12-03 | 吉林大学 | 平面短焊缝识别方法、焊接方法和装置 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117950842A (zh) | 2024-04-30 |
| WO2024082665A9 (zh) | 2024-05-30 |
| EP4589434A1 (en) | 2025-07-23 |
| EP4589434A4 (en) | 2025-11-12 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2024082665A1 (zh) | 预条件子处理方法、装置、设备及系统 | |
| Guo et al. | Gpu-accelerated static timing analysis | |
| US10853544B2 (en) | Selective execution for partitioned parallel simulations | |
| Kakay et al. | Speedup of FEM micromagnetic simulations with graphical processing units | |
| Dennl et al. | On-the-fly composition of FPGA-based SQL query accelerators using a partially reconfigurable module library | |
| US7840931B2 (en) | Loop manipulation if a behavioral synthesis tool | |
| KR102882460B1 (ko) | 그래프 작업 부하들을 위한 스토리지에서의 최적의 동적 샤드 생성 | |
| WO2000065492A1 (en) | Method for storing multiple levels of design data in a common database | |
| US9164969B1 (en) | Method and system for implementing a stream reader for EDA tools | |
| US8990739B2 (en) | Model-based retiming with functional equivalence constraints | |
| Muttillo et al. | SystemC-based co-simulation/analysis for system-level hardware/software co-design | |
| US20150379172A1 (en) | Device and method for accelerating the update phase of a simulation kernel | |
| Miura et al. | Accelerating Regular Path Queries using FPGA. | |
| Soldavini et al. | Iris: Automatic generation of efficient data layouts for high bandwidth utilization | |
| You et al. | Scalable and efficient spatial data management on multi-core CPU and GPU clusters: A preliminary implementation based on Impala | |
| WO2025118892A1 (zh) | 一种任务处理方法及装置 | |
| US20260119237A1 (en) | Preconditioner Processing Method and Apparatus, Device, and System | |
| Chen et al. | Sparsity-oriented sparse solver design for circuit simulation | |
| US12367334B2 (en) | Runtime and memory efficient attribute query handling for distributed engine | |
| Klein et al. | Tridigpu: a GPU library for block tridiagonal and banded linear equation systems | |
| Skliarova et al. | Hardware/software co-design | |
| Gao et al. | SSA: A uniformly recursive bidirection-sequence systolic sorter array | |
| Zhan et al. | FPGA based co-design of storage-side query filter for big data systems | |
| US20240354477A1 (en) | Constant, equal, or opposite registers or ports detection during logic synthesis | |
| Chu et al. | A New Design Methodology for Composing Complex Digital Systems |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23878662 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023878662 Country of ref document: EP |
|
| ENP | Entry into the national phase |
Ref document number: 2023878662 Country of ref document: EP Effective date: 20250414 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| WWP | Wipo information: published in national office |
Ref document number: 2023878662 Country of ref document: EP |