CN118819869A

CN118819869A - A heterogeneous hybrid acceleration method, device and medium based on multiple accelerator cards

Info

Publication number: CN118819869A
Application number: CN202411303417.5A
Authority: CN
Inventors: 郝运凯; 姜凯; 赵鑫鑫; 薛海军
Original assignee: Shandong Inspur Science Research Institute Co Ltd
Current assignee: Yuanqixin Shandong Semiconductor Technology Co ltd
Priority date: 2024-09-19
Filing date: 2024-09-19
Publication date: 2024-10-22
Anticipated expiration: 2044-09-19
Also published as: CN118819869B

Abstract

The present application discloses a heterogeneous hybrid acceleration method, device and medium based on multiple accelerator cards, and relates to the field of parallel computing technology. The method includes: taking the combination of each operator in the deep learning model with the Kernel of the corresponding target platform as the initial condition, and offloading the Kernel combination to the hardware for execution; taking the score of the Kernel combination as the initial cost of a cost model, and optimizing the cost model by the gradient descent method to determine whether an extreme value has been reached; comparing the extreme value corresponding to the Kernel combination with the historical record to decide whether to update the tuning record, thereby determining the corresponding target tuning record; partitioning the computational subgraphs parsed from the deep learning model according to the target tuning record, and offloading the partitioned deep learning model to the corresponding hardware devices for inference, so as to realize heterogeneous hybrid acceleration. The present application realizes efficient coordination and performance maximization of model inference, and improves the computing efficiency and response speed of the overall system.

Description

Heterogeneous hybrid acceleration method, equipment and medium based on multiple acceleration cards

Technical Field

The application relates to the technical field of parallel computing, and in particular to a heterogeneous hybrid acceleration method, device and medium based on multiple acceleration cards.

Background

With the rapid development and wide application of Artificial Intelligence (AI) technology, from deep learning and machine learning to big data processing, the demand for high performance computing in various fields has increased dramatically, which presents unprecedented challenges to the computing industry. The explosive growth in the complexity and data volume of AI algorithms has driven a need for more efficient and more powerful computing resources. However, faced with this surging demand, major chip manufacturers worldwide, despite continuous efforts to increase output and optimize design, struggle to supply all types of acceleration hardware to the market promptly and in full.

Meanwhile, in practical application scenarios, users often face diverse computing tasks. These tasks may involve image processing, natural language processing, scientific computing and other fields, and each task places different demands on computing resources. As a result, acceleration cards of many different forms and architectures often coexist in the user environment, including but not limited to GPUs (graphics processing units), FPGAs (field programmable gate arrays), TPUs (tensor processing units), and various types of specialized AI acceleration chips. Each of these heterogeneous accelerator cards has its own advantages and application scenarios, but together they also make resource integration and collaborative work more complex.

Against this background, how to strike a balance between limited hardware resources and diverse user demands, and to realize efficient and flexible computing deployment, has become a problem to be solved urgently. Conventional schemes based on a single type of accelerator card often fail to fully utilize existing hardware resources, resulting in wasted computing power or performance bottlenecks.

Disclosure of Invention

The embodiments of the application provide a heterogeneous hybrid acceleration method, device and medium based on multiple acceleration cards, which are used to solve the above technical problems.

In one aspect, an embodiment of the present application provides a heterogeneous hybrid acceleration method based on multiple acceleration cards, including:

taking the combination of each operator in the deep learning model with the Kernel of the corresponding target platform as an initial condition, and offloading the Kernel combination to hardware for execution;

taking the score of the Kernel combination as the initial cost of a cost model, and optimizing the cost model by a gradient descent method to determine whether an extremum has been reached;

comparing the extremum corresponding to the Kernel combination with the history record, and determining whether to update the tuning record, so as to determine a corresponding target tuning record;

partitioning the computational subgraphs parsed from the deep learning model according to the target tuning record, and offloading the partitioned deep learning model to the corresponding hardware devices for inference, so as to realize heterogeneous hybrid acceleration.

In one implementation of the present application, before taking the combination of each operator in the deep learning model with the Kernel of the corresponding target platform as the initial condition, the method further includes:

receiving an input deep learning model, and parsing the computational graph corresponding to the model structure of the deep learning model;

determining the operator calling relations and the weight coefficient corresponding to each operator, and storing the operator calling relations and the weight coefficients in a hash table;

partitioning the computational graph according to the operator calling relations in the hash table to obtain partitioned computational subgraphs;

and determining the operators corresponding to each computational subgraph and the attribute marking information corresponding to each operator, and determining the combination of each operator with the Kernel of the corresponding target platform according to the attribute marking information.

In one implementation of the present application, taking the score of the Kernel combination as the initial cost of the cost model specifically includes:

establishing a cost model according to the computation time, the hardware computing power condition and the data transmission time, and determining the weight coefficients corresponding to the computation time, the hardware computing power condition and the data transmission time respectively;

recording the execution time and the score of each Kernel in the Kernel combination, and taking the score of the Kernel combination as the initial cost of the cost model.

In one implementation of the present application, the cost model is calculated as:

$$f(x) = w_1 x_1^{\alpha} + w_2 x_2^{\beta} + w_3 x_3^{\gamma}$$

wherein $x_1$ represents the computation time and $\alpha$ the exponent applied to the computation time; $x_2$ represents the hardware computing power condition and $\beta$ the exponent applied to the hardware computing power condition; $x_3$ represents the data transmission time and $\gamma$ the exponent applied to the data transmission time; $w_1$ represents the weight coefficient corresponding to the computation time, $w_2$ the weight coefficient corresponding to the hardware computing power condition, and $w_3$ the weight coefficient corresponding to the data transmission time.

In one implementation of the present application, optimizing the cost model by a gradient descent method specifically includes:

calculating a model score corresponding to the cost model based on the initial cost of the cost model, and calculating a gradient function corresponding to the cost model;

when the gradient is larger than a preset minimum threshold, selecting an initial Kernel combination from a pre-constructed operator library, and iterating on the initial Kernel combination in the direction opposite to the gradient;

and stopping the iteration when the gradient is equal to the preset minimum threshold, and determining the local minimum of the cost model score.

In one implementation of the present application, comparing the extremum corresponding to the Kernel combination with the history record and determining whether to update the tuning record, so as to determine the corresponding target tuning record, specifically includes:

acquiring a history cache tuning record, and comparing the local minimum of the cost model score with the history cache tuning record;

when the local minimum is smaller than the history cache tuning record, updating the history cache tuning record and determining the corresponding target tuning record;

when the local minimum is larger than the history cache tuning record, further determining whether the local minimum meets the current threshold;

if not, adjusting the parameters of the cost model so that the search moves beyond the local minimum; if so, determining the history cache tuning record to be the target tuning record.

In one implementation of the application, partitioning the computational subgraphs parsed from the deep learning model according to the target tuning record, and offloading the partitioned deep learning model to the corresponding hardware devices for inference so as to realize heterogeneous hybrid acceleration, specifically includes:

partitioning the computational subgraphs parsed from the deep learning model according to the target strategy in the target tuning record, and determining the hardware device corresponding to each operator after partitioning;

mapping each partitioned operator to its corresponding hardware device, and storing the hardware information of each partitioned operator in the deep learning model, wherein the hardware information represents the operator calling relations between operators and hardware devices;

receiving an operator to be processed, and determining the target hardware device corresponding to the operator to be processed according to the operator calling relation given by the hardware information in the deep learning model;

offloading the deep learning model to the target hardware device corresponding to the operator to be processed, so as to perform inference on the operator to be processed and realize heterogeneous hybrid acceleration.
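The dispatch step described above can be illustrated with a minimal sketch. It assumes the hardware information is a plain mapping from operator identifier to device name stored alongside the model; the names `hardware_info`, `device_backends` and `dispatch` are hypothetical and do not come from the application, and the back ends are stand-in functions rather than real driver calls.

```python
from typing import Callable, Dict

# Hypothetical hardware information saved with the model: operator id -> device name.
hardware_info: Dict[str, str] = {"conv1": "GPU", "relu1": "GPU", "fc1": "FPGA"}

# Stand-in back ends; a real system would call the unified accelerator interface here.
device_backends: Dict[str, Callable[[str], str]] = {
    "GPU": lambda op: f"{op} ran on GPU",
    "FPGA": lambda op: f"{op} ran on FPGA",
}


def dispatch(op_name: str,
             info: Dict[str, str],
             backends: Dict[str, Callable[[str], str]]) -> str:
    """Look up the target device of an operator and run the operator on it."""
    if op_name not in info:
        raise KeyError(f"no hardware information recorded for operator {op_name!r}")
    device = info[op_name]
    return backends[device](op_name)


results = [dispatch(op, hardware_info, device_backends) for op in ("conv1", "fc1")]
print(results)
```

Keeping the operator-to-device mapping inside the model means the lookup at inference time is a single dictionary access, with no need to repeat the tuning step.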

In one implementation of the present application, before taking the combination of each operator in the input model with the Kernel of the corresponding target platform as the initial condition, the method further includes:

unifying the interfaces corresponding to the acceleration cards of different forms, and packaging all of the unified interfaces;

and acquiring the attribute marking information corresponding to each target platform, and adding the attribute marking information to the corresponding target platform.

In another aspect, the embodiments of the application also provide a heterogeneous hybrid acceleration device based on multiple acceleration cards, which comprises:

at least one processor;

and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the heterogeneous hybrid acceleration method based on multiple acceleration cards described above.

In another aspect, the embodiments of the application also provide a non-volatile computer storage medium storing computer-executable instructions which, when executed, implement the heterogeneous hybrid acceleration method based on multiple acceleration cards described above.

The heterogeneous hybrid acceleration method, device and medium based on multiple acceleration cards provided by the embodiments of the application have at least the following beneficial effects:

By determining the binding between each operator and a hardware Kernel, the actual performance data of the operator on specific hardware can be obtained directly, which provides an accurate reference for subsequent optimization; by optimizing the cost model with the gradient descent method, the optimal combination strategy of operators and hardware can be searched efficiently, so that the performance of the deep learning model in a heterogeneous hardware environment is improved as far as possible; by continuously comparing and updating the tuning records, each optimization is carried out on the basis of the best historical result, which avoids repeated work and possible performance regression and ensures the continuity and effectiveness of the optimization process; through subgraph partitioning and heterogeneous deployment based on the tuning records, the advantages of different hardware devices can be fully utilized, efficient coordination and performance maximization of model inference are realized, and the computing efficiency and response speed of the whole system are improved, so that faster response and a better user experience are provided in practical applications.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

Fig. 1 is a schematic flow chart of a heterogeneous hybrid acceleration method based on multiple acceleration cards according to an embodiment of the present application;

Fig. 2 is a schematic diagram of an internal structure of a heterogeneous hybrid acceleration device based on multiple acceleration cards according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

Fig. 1 is a flow chart of a heterogeneous hybrid acceleration method based on multiple acceleration cards according to an embodiment of the present application.

The method according to the embodiment of the present application may be executed by a terminal device or a server, which is not particularly limited in the present application. For ease of understanding and description, the following embodiments are described in detail taking a server as an example.

It should be noted that the server may be a single device, or may be a system formed by a plurality of devices, that is, a distributed server, which is not particularly limited in the present application.

As shown in fig. 1, the heterogeneous hybrid acceleration method based on multiple acceleration cards provided in the embodiment of the present application includes:

101. Take the combination of each operator in the deep learning model with the Kernel of the corresponding target platform as an initial condition, and offload the Kernel combination to hardware for execution.

The application discloses a heterogeneous hybrid acceleration method based on multiple acceleration cards, which is mainly implemented on existing FPGA (Xilinx U280), GPGPU (A100, Ascend 910) and NPU (MLU 220) computing acceleration cards. An operator library for the whole platform system is constructed from the operators that each hardware platform implements for running tasks.

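In one embodiment of the present application, before taking the combination of each operator in the deep learning model with the Kernel of the corresponding target platform as the initial condition, the method includes:

receiving an input deep learning model, and parsing the computational graph corresponding to the model structure of the deep learning model;

determining the operator calling relations and the weight coefficient corresponding to each operator, and storing the operator calling relations and the weight coefficients in a hash table;

partitioning the computational graph according to the operator calling relations in the hash table to obtain partitioned computational subgraphs;

and determining the operators corresponding to each computational subgraph and the attribute marking information corresponding to each operator, and determining the combination of each operator with the Kernel of the corresponding target platform according to the attribute marking information.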

In one embodiment, it is assumed that a deep learning model based on the TensorFlow framework is received, which is used for an image classification task. The structure of the model is defined as a computational graph containing multiple layers and nodes, each node representing an operation or operator, such as a convolution, a fully connected layer or an activation function.

The model file is loaded and the model is read using the API provided by the TensorFlow framework, such as tf.keras.models.load_model. A computational graph representation of the model is obtained using TensorFlow graph tools, such as tf.Graph and tf.Session, or, in TensorFlow 2.x, tf.function and tf.Graph in compatibility mode. The computational graph is then parsed to identify all nodes and the connections between them.

For each node, its operator type, such as Conv2D, MatMul or ReLU, is recorded. The input and output tensors of each operator and its calling relations with other operators are determined, and the weight coefficients of each operator, such as the convolution kernel of a convolutional layer or the weight matrix of a fully connected layer, are extracted. This information is stored in a hash table whose keys are unique identifiers of the operators, which may be the names or IDs of the nodes, and whose values are structures containing the operator type, calling relations and weight coefficients.
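The hash table described above can be sketched in a few lines. The example below is an assumption-laden illustration rather than the applicant's implementation: it uses a plain Python dictionary, a hypothetical `OperatorRecord` structure, and a hand-written toy graph instead of a real TensorFlow model, and it stores only the weight shape in place of the weight values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class OperatorRecord:
    """Value stored in the hash table for one operator (hypothetical layout)."""
    op_type: str                                    # e.g. "Conv2D", "MatMul", "ReLU"
    inputs: List[str]                               # ids of the operators feeding this one
    weight_shape: Optional[Tuple[int, ...]] = None  # shape of the weights, if any


def build_operator_table(nodes) -> Dict[str, OperatorRecord]:
    """Build {operator id -> OperatorRecord} from (name, type, inputs, weight_shape) tuples."""
    table: Dict[str, OperatorRecord] = {}
    for name, op_type, inputs, weight_shape in nodes:
        table[name] = OperatorRecord(op_type, list(inputs), weight_shape)
    return table


def consumers(table: Dict[str, OperatorRecord], name: str) -> List[str]:
    """Return the operators that consume the output of `name`."""
    return [op for op, rec in table.items() if name in rec.inputs]


# Toy image-classification graph; node names and shapes are illustrative only.
toy_nodes = [
    ("conv1", "Conv2D", [], (3, 3, 3, 16)),
    ("relu1", "ReLU", ["conv1"], None),
    ("fc1", "MatMul", ["relu1"], (16, 10)),
]
op_table = build_operator_table(toy_nodes)
print(op_table["relu1"].op_type, consumers(op_table, "conv1"))
```

Keying the table by operator name gives constant-time lookup of an operator's type and inputs, which is what the later partitioning step relies on.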

To optimize computation or to perform parallel processing, the computation graph is partitioned into multiple computation subgraphs. The split strategy may be based on specific requirements, such as minimizing data transfer between subgraphs, balancing computational loads, etc. For example, the computational graph may be sliced in terms of convolutional layers and fully-connected layers, each convolutional layer and the activation function thereafter acting as a computational subgraph. Using the calling relationships in the hash table, the boundaries of these nodes are identified and a new sub-graph object is created.
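As a rough illustration of the partitioning rule in the example above (each layer together with the activation function that follows it forms one subgraph), the sketch below works on a topologically ordered list of operators. The function name `partition_linear_graph` and the set of activation types are assumptions made for the example; a real computational graph may branch and would need a more general traversal.

```python
from typing import List, Tuple

# Assumed set of operator types treated as activation functions.
ACTIVATIONS = {"ReLU", "Sigmoid", "Tanh"}


def partition_linear_graph(ops: List[Tuple[str, str]]) -> List[List[str]]:
    """Split a topologically ordered (name, type) list into subgraphs.

    An activation is appended to the subgraph that precedes it; every other
    operator starts a new subgraph.
    """
    subgraphs: List[List[str]] = []
    for name, op_type in ops:
        if op_type in ACTIVATIONS and subgraphs:
            subgraphs[-1].append(name)
        else:
            subgraphs.append([name])
    return subgraphs


ops = [
    ("conv1", "Conv2D"), ("relu1", "ReLU"),
    ("conv2", "Conv2D"), ("relu2", "ReLU"),
    ("fc1", "MatMul"),
]
parts = partition_linear_graph(ops)
print(parts)
```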

For each computational subgraph, the operators it contains are further analyzed, and the attribute marking information of each operator is collected, such as data types, tensor shapes, strides and padding modes. Assuming that the target platform is a GPU, an optimal GPU Kernel needs to be selected for each operator. This typically involves looking up Kernel implementations that match the operator attributes and the target platform. For example, for a Conv2D operator, it may be necessary to select the most appropriate implementation in the Kernel library of the GPU based on its stride, convolution kernel size, the data types of its input and output tensors, and so on. Finally, these Kernel combinations, together with the computational subgraphs, form an optimized model representation that can be executed efficiently on the target platform.
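The Kernel lookup described above can be sketched as a filter over a small library. Everything in the example is hypothetical: the `KernelEntry` fields, the library contents and the matching rule (same platform, same operator type, same data type, convolution kernel size within a supported maximum) are assumptions chosen to keep the illustration short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KernelEntry:
    """One Kernel implementation in the operator library (hypothetical fields)."""
    name: str
    platform: str         # e.g. "GPU", "FPGA"
    op_type: str          # e.g. "Conv2D"
    dtype: str            # data type the Kernel supports
    max_kernel_size: int  # largest convolution kernel size the Kernel supports


KERNEL_LIBRARY: List[KernelEntry] = [
    KernelEntry("gpu_conv_fp16_small", "GPU", "Conv2D", "float16", 3),
    KernelEntry("gpu_conv_fp32_generic", "GPU", "Conv2D", "float32", 11),
    KernelEntry("fpga_conv_int8", "FPGA", "Conv2D", "int8", 5),
]


def select_kernel(op_type: str,
                  attrs: Dict[str, object],
                  platform: str,
                  library: List[KernelEntry] = KERNEL_LIBRARY) -> Optional[KernelEntry]:
    """Return the first Kernel matching the operator attributes and platform, or None."""
    for kernel in library:
        if (kernel.platform == platform
                and kernel.op_type == op_type
                and kernel.dtype == attrs["dtype"]
                and attrs["kernel_size"] <= kernel.max_kernel_size):
            return kernel
    return None


conv_attrs = {"dtype": "float32", "kernel_size": 5, "stride": 1, "padding": "same"}
chosen = select_kernel("Conv2D", conv_attrs, "GPU")
print(chosen.name if chosen else None)
```

Returning `None` when nothing matches lets the caller fall back to another platform instead of failing outright.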

In one embodiment of the application, different heterogeneous accelerator cards use different interfaces, which makes integration and unification at the upper layer very challenging. Therefore, before the combination of each operator in the input model with the Kernel of the corresponding target platform is taken as the initial condition, the interfaces corresponding to the accelerator cards of different forms are unified, and all of the unified interfaces are packaged; the attribute marking information corresponding to each target platform is acquired and added to the corresponding target platform.

In one embodiment, the interfaces of the different forms of accelerator cards are unified and packaged, and the attribute marking information of each target platform is acquired and added to that platform, as follows. First, the system automatically detects and identifies the accelerator cards connected to the current system through a preset identification module, including but not limited to different types of GPU, FPGA and ASIC accelerator cards.

Second, a set of unified interface specifications is designed according to the identified accelerator card types. This set of specifications covers the basic functions of accelerator cards, such as data transfer, computing task submission and status queries. Then, for each type of accelerator card, a corresponding interface adapter is developed. These adapters are responsible for converting the native interfaces of the accelerator card into interfaces under the unified interface specification, thereby unifying the interfaces. Finally, the unified interfaces are packaged to form an easy-to-use API library. This API library masks the differences between the underlying accelerator cards, so that upper-layer applications do not need to care which accelerator card is actually used underneath.
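The adapter arrangement described above follows the familiar adapter pattern, and a minimal sketch may make it concrete. The class and method names below (`AcceleratorInterface`, `transfer`, `submit`, `status`, `GpuAdapter`, `FpgaAdapter`) are hypothetical, and the adapter bodies are stubs standing in for native driver calls that the application does not describe.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AcceleratorInterface(ABC):
    """Unified interface covering data transfer, task submission and status query."""

    @abstractmethod
    def transfer(self, data: List[float]) -> int:
        """Copy data to the card and return the number of elements transferred."""

    @abstractmethod
    def submit(self, task: str) -> str:
        """Submit a computing task and return a task handle."""

    @abstractmethod
    def status(self) -> str:
        """Return a short status string."""


class GpuAdapter(AcceleratorInterface):
    """Adapter for a GPU; the bodies are stubs in place of native driver calls."""

    def __init__(self) -> None:
        self._submitted: List[str] = []

    def transfer(self, data: List[float]) -> int:
        return len(data)

    def submit(self, task: str) -> str:
        self._submitted.append(task)
        return f"gpu-task-{len(self._submitted)}"

    def status(self) -> str:
        return f"GPU: {len(self._submitted)} task(s) submitted"


class FpgaAdapter(AcceleratorInterface):
    """Adapter for an FPGA; again only a stub."""

    def __init__(self) -> None:
        self._submitted: List[str] = []

    def transfer(self, data: List[float]) -> int:
        return len(data)

    def submit(self, task: str) -> str:
        self._submitted.append(task)
        return f"fpga-task-{len(self._submitted)}"

    def status(self) -> str:
        return f"FPGA: {len(self._submitted)} task(s) submitted"


def run_everywhere(cards: Dict[str, AcceleratorInterface],
                   task: str, data: List[float]) -> Dict[str, str]:
    """Upper-layer code: uses only the unified interface, never the card type."""
    handles: Dict[str, str] = {}
    for name, card in cards.items():
        card.transfer(data)
        handles[name] = card.submit(task)
    return handles


cards = {"gpu": GpuAdapter(), "fpga": FpgaAdapter()}
handles = run_everywhere(cards, "conv2d", [1.0, 2.0, 3.0])
print(handles)
```

Because `run_everywhere` depends only on the abstract base class, supporting a new card type means adding one adapter class and leaving the upper layer unchanged.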

A set of attribute marking information is defined for each target platform, such as different servers and workstations, including but not limited to the hardware configuration of the platform, the operating system version, and the type and number of accelerator cards installed. The attribute marking information of each target platform is acquired automatically by means of system queries, configuration file reading and the like. For example, the hardware information may be obtained by querying a hardware configuration file of the system, and the operating system version may be obtained by reading the version information of the operating system.

The acquired attribute marking information is then added to the corresponding target platform. This is accomplished by creating or updating a configuration file in the system that records each target platform and its corresponding attribute marking information. When an upper-layer application needs to call an accelerator card, it can first query the attribute marking information of the target platform to determine which types of accelerator cards the platform supports and how many of each. A suitable accelerator card is then selected according to this information, and its functions are called through the unified interface.

For example, assume a server containing two accelerator cards, a GPU and an FPGA. The system first detects the two accelerator cards through the identification module and develops an interface adapter for each of them. The interfaces of these adapters are then packaged uniformly as an API library. Meanwhile, the system acquires the attribute marking information of the server, such as the hardware configuration (including the GPU and the FPGA) and the operating system version, and adds this information to the configuration file of the server. When an upper-layer application needs to call an accelerator card to perform computation, it can first query the attribute marking information of the server to determine which accelerator cards are available, and then call the functions of those accelerator cards through the unified interface.
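The query described in this example can be sketched with a small JSON configuration. The layout of the configuration, the platform names `server-01` and `workstation-02`, and the helper functions are assumptions for illustration only; the application does not specify a file format.

```python
import json
from typing import List, Optional

# Hypothetical configuration file content recording attribute marking information.
platform_config_text = json.dumps({
    "server-01": {"os": "Linux", "accelerators": {"GPU": 1, "FPGA": 1}},
    "workstation-02": {"os": "Linux", "accelerators": {"GPU": 2}},
})


def available_accelerators(config_text: str, platform: str) -> List[str]:
    """Return the sorted accelerator types installed on a platform."""
    config = json.loads(config_text)
    cards = config[platform]["accelerators"]
    return sorted(kind for kind, count in cards.items() if count > 0)


def pick_accelerator(config_text: str, platform: str,
                     preferred: List[str]) -> Optional[str]:
    """Return the first preferred accelerator type that the platform provides."""
    available = available_accelerators(config_text, platform)
    for kind in preferred:
        if kind in available:
            return kind
    return None


print(available_accelerators(platform_config_text, "server-01"))
print(pick_accelerator(platform_config_text, "workstation-02", ["FPGA", "GPU"]))
```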

102. Take the score of the Kernel combination as the initial cost of the cost model, and optimize the cost model by a gradient descent method to determine whether an extremum has been reached.

Specifically, in one embodiment of the present application, taking the score of the Kernel combination as the initial cost of the cost model specifically includes:

establishing a cost model according to the computation time, the hardware computing power condition and the data transmission time, and determining the weight coefficients corresponding to the computation time, the hardware computing power condition and the data transmission time respectively;

recording the execution time and score of each Kernel in the Kernel combination, and taking the score of the Kernel combination as the initial cost of the cost model.

In one embodiment, multiple factors need to be considered together to evaluate the performance of different Kernel combinations on specific hardware during deep learning model optimization and deployment. To this end, a cost model is built, which comprises three main components: computation time, hardware computing power condition and data transmission time.

It should be noted that the computation time in the embodiment of the present application refers to the time required to execute a specific Kernel, and is generally related to the complexity of the Kernel, the size of the input data, and the floating point computing capability of the hardware. The hardware computing power condition reflects the efficiency with which the hardware performs a particular type of computation, such as floating point operations, integer operations or matrix multiplications, and may be measured in operations per second (OPS). The data transmission time refers to the time required to transfer data between hardware devices or between hardware and memory, and depends on the amount of data, the transfer bandwidth and the transfer protocol.

To integrate these three factors, they are assigned weight coefficients, respectively, which reflect the importance of each factor in a particular application scenario. For example, in a scenario where real-time requirements are high, the weight of computation time may be high; in data-intensive applications, the weight of the data transmission time may be more critical.

Assume that the following weight coefficients are determined by expert evaluation, experimental measurement or historical data analysis: computation time weight 0.5, hardware computing power condition weight 0.3, data transmission time weight 0.2. The cost model may then be expressed as a weighted sum of these factors: cost = 0.5 × computation time + 0.3 × hardware computing power condition + 0.2 × data transmission time.
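The weighted sum above is easy to check numerically. The sketch below uses the weights 0.5, 0.3 and 0.2 from the text; the input values are made up, and the example assumes each factor has already been normalized to a common scale on which larger means more costly, which the application does not specify.

```python
from typing import Sequence


def weighted_cost(factors: Sequence[float],
                  weights: Sequence[float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of (computation time, hardware computing power condition,
    data transmission time), all assumed to be normalized beforehand."""
    if len(factors) != len(weights):
        raise ValueError("factors and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return sum(w * f for w, f in zip(weights, factors))


# Illustrative normalized measurements for one Kernel combination.
example_cost = weighted_cost((0.8, 0.4, 0.5))
print(round(example_cost, 2))
```

With these inputs the three terms are 0.40, 0.12 and 0.10, so the computation time dominates the cost, as its weight suggests.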

After the cost model is determined, different Kernel combinations need to be evaluated. For this purpose, the execution time and score of each Kernel on the target hardware are first recorded. All Kernel groups are offloaded to the hardware devices for execution and the execution time of each Kernel is recorded; a corresponding Kernel group is then selected from the operator library by stochastic gradient descent and offloaded to the hardware devices for execution, and the corresponding score is recorded.
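Recording the execution time of each Kernel can be sketched with the standard-library timer. In the example below the "Kernels" are ordinary Python callables standing in for work offloaded to an accelerator, and the helper name `time_kernels` is hypothetical; on real hardware the timing would have to wait for the device to finish before reading the clock.

```python
import time
from typing import Callable, Dict


def time_kernels(kernels: Dict[str, Callable[[], object]],
                 repeats: int = 3) -> Dict[str, float]:
    """Return the best wall-clock time in seconds of each kernel over `repeats` runs."""
    timings: Dict[str, float] = {}
    for name, run in kernels.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Stand-in workloads; real Kernels would run on the accelerator cards.
toy_kernels = {
    "kernel_small": lambda: sum(i * i for i in range(1_000)),
    "kernel_large": lambda: sum(i * i for i in range(100_000)),
}
timings = time_kernels(toy_kernels)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the minimum over several runs is one common way to reduce noise from other activity on the machine; the mean or median would be an equally reasonable choice.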

In one embodiment of the present application, the cost model is calculated as:

$$f(x) = w_1 x_1^{\alpha} + w_2 x_2^{\beta} + w_3 x_3^{\gamma}$$

In the embodiment of the application, $x_1$ represents the computation time and $\alpha$ the exponent applied to the computation time; $x_2$ represents the hardware computing power condition and $\beta$ the exponent applied to the hardware computing power condition; $x_3$ represents the data transmission time and $\gamma$ the exponent applied to the data transmission time; $w_1$ represents the weight coefficient corresponding to the computation time, $w_2$ the weight coefficient corresponding to the hardware computing power condition, and $w_3$ the weight coefficient corresponding to the data transmission time. The weight coefficients in the application can be set according to the actual situation, and the sum of the weight coefficient corresponding to the computation time, the weight coefficient corresponding to the hardware computing power condition and the weight coefficient corresponding to the data transmission time is 1, namely $w_1 + w_2 + w_3 = 1$.

In one embodiment, the exponent $\alpha$ of the computation time, the exponent $\beta$ of the hardware computing power condition and the exponent $\gamma$ of the data transmission time in the cost model may be selected according to the actual situation, which is not particularly limited in the present application. For example, once specific values of $\alpha$, $\beta$ and $\gamma$ are chosen, the cost model takes the corresponding concrete form.
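Because the optimization step relies on the gradient of the cost model, it may help to write the partial derivatives out. The derivation below is a sketch that assumes the cost model is a weighted sum of powers, with the illustrative symbols $x_1$, $x_2$, $x_3$ for the computation time, hardware computing power condition and data transmission time, $w_1$, $w_2$, $w_3$ for the weights, and $\alpha$, $\beta$, $\gamma$ for the exponents. These symbols are a notational assumption, not a quotation of the original formula.

```latex
% Assumed form of the cost model (weights sum to 1):
f(x) = w_1 x_1^{\alpha} + w_2 x_2^{\beta} + w_3 x_3^{\gamma},
\qquad w_1 + w_2 + w_3 = 1

% Partial derivatives, term by term:
\frac{\partial f}{\partial x_1} = w_1 \alpha \, x_1^{\alpha - 1}, \qquad
\frac{\partial f}{\partial x_2} = w_2 \beta \, x_2^{\beta - 1}, \qquad
\frac{\partial f}{\partial x_3} = w_3 \gamma \, x_3^{\gamma - 1}

% Special case \alpha = \beta = \gamma = 1 with weights 0.5, 0.3, 0.2:
\nabla f(x) = (0.5,\; 0.3,\; 0.2)
```

In the linear special case the gradient is constant and does not depend on $x$, so a stopping rule based on the gradient magnitude only becomes informative when at least one exponent differs from 1.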

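In one embodiment of the present application, optimizing the cost model by a gradient descent method specifically includes:

calculating a model score corresponding to the cost model based on the initial cost of the cost model, and calculating a gradient function corresponding to the cost model;

when the gradient is larger than a preset minimum threshold, selecting an initial Kernel combination from a pre-constructed operator library, and iterating on the initial Kernel combination in the direction opposite to the gradient;

and stopping the iteration when the gradient is equal to the preset minimum threshold, and determining the local minimum of the cost model score.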

In one embodiment, a gradient descent algorithm is used to optimize the cost model, and the specific steps are as follows: a group of initial Kernels is selected and actually run to obtain the parameters $x_1$, $x_2$, $x_3$; the cost model score $f(x)$ is calculated, and the gradient function $\nabla f(x)$ is calculated using this set of parameters as initial values. Iteration stops when the gradient is small enough that $\lVert \nabla f(x^{(t)}) \rVert \le \varepsilon$, where $\varepsilon$ is the preset minimum threshold; at this point, a set of local minima is found. In the embodiment of the application, $x_i$ represents an intermediate value, where $i$ may take the values 1, 2, 3; $x_1$, $x_2$, $x_3$ represent the temporarily recorded values of the computation time, hardware computing power and transmission time respectively; and $x_1^{(t)}$, $x_2^{(t)}$, $x_3^{(t)}$ respectively represent the current group of data records of the computation time, hardware computing power and transmission time at time $t$.

Otherwise, $x^{(t+1)}$ is obtained according to the formula $x^{(t+1)} = x^{(t)} - \eta \nabla f(x^{(t)})$, where $\eta$ is the step size. Specifically, $x$ is iterated in the direction opposite to the gradient, i.e. the value at time $t+1$ is determined from $x$ at time $t$ and the gradient.
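The iteration just described can be sketched as plain gradient descent on a toy cost model. The example treats the three factors as freely adjustable continuous variables, uses the weights 0.5, 0.3 and 0.2 from the earlier example, and assumes exponents of 2, a step size of 0.1 and a threshold of 1e-6; all of these choices, and the function names, are illustrative assumptions rather than values taken from the application. In the method itself each step would correspond to selecting and running a different Kernel combination rather than moving a continuous variable.

```python
from typing import List, Sequence, Tuple


def cost(x: Sequence[float], w: Sequence[float], p: Sequence[float]) -> float:
    """Weighted sum of powers: sum_i w_i * x_i ** p_i."""
    return sum(wi * xi ** pi for wi, xi, pi in zip(w, x, p))


def gradient(x: Sequence[float], w: Sequence[float], p: Sequence[float]) -> List[float]:
    """Partial derivatives: w_i * p_i * x_i ** (p_i - 1)."""
    return [wi * pi * xi ** (pi - 1) for wi, xi, pi in zip(w, x, p)]


def gradient_descent(x0: Sequence[float], w: Sequence[float], p: Sequence[float],
                     step: float = 0.1, eps: float = 1e-6,
                     max_iter: int = 10_000) -> Tuple[List[float], float, int]:
    """Move against the gradient until every component is at most `eps` in magnitude."""
    x = list(x0)
    for iteration in range(max_iter):
        g = gradient(x, w, p)
        if max(abs(gi) for gi in g) <= eps:
            return x, cost(x, w, p), iteration
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, cost(x, w, p), max_iter


weights = (0.5, 0.3, 0.2)
exponents = (2, 2, 2)          # assumed exponents so that the toy model has a minimum
start = (0.8, 0.4, 0.5)        # illustrative starting measurements
x_min, f_min, iterations = gradient_descent(start, weights, exponents)
print(f"stopped after {iterations} iterations, cost {f_min:.3e}")
```

The stopping test uses the largest absolute gradient component, which matches the element-wise threshold check described in the text; a Euclidean norm would work equally well here.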

In one embodiment, it is assumed that a machine learning model optimization system is being developed that aims to minimize a specific cost function, such as mean square error or cross entropy loss, by adjusting model parameters. First, a cost model is defined that calculates an initial cost based on the current model parameters. For example, in a linear regression problem, the cost model may be the sum of squares of the differences between the predicted and actual values. Then, the model score corresponding to the cost model is calculated according to a predefined scoring mechanism, such as accuracy, F1 score or the negative value of the cost. It is assumed here that the scoring mechanism simply takes the negative value of the cost as the score, i.e., score = -initial cost. The gradient of the cost model with respect to each model parameter is then calculated by partial differentiation to form a gradient vector.

Next, each element of the gradient vector is checked to determine whether its absolute value is greater than a preset minimum threshold. This threshold is used to decide when to stop iterating, which avoids wasting computational resources on gradients that are numerically close to zero but not exactly zero. If the gradient is greater than the preset minimum threshold, a set of initial Kernel combinations is selected from a pre-constructed operator library. These Kernels are pre-optimized basic computational units, such as convolution kernels and activation functions, used to accelerate parameter updates during gradient descent.

Thereafter, iterations are performed in the direction opposite to the gradient (i.e., the negative gradient direction) using the selected initial Kernel combination, updating the model parameters so that the cost function value is expected to decrease in the next iteration. During the iterative process, the gradient is recalculated at each step and the Kernel combination is dynamically adjusted to suit the current gradient state, which keeps the optimization process efficient and stable.

When the absolute values of all gradients are equal to or less than the preset minimum threshold, the iteration is stopped. At this point, the cost model score is considered to have reached a local minimum, and further parameter adjustment would not bring a significant cost reduction. The model parameter configuration and the corresponding cost model score at this point are recorded as the result of the optimization process.
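
The linear regression example above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the application's implementation: the function names, the learning rate `lr`, the default `threshold`, and the use of the mean (rather than the sum) of squared errors are all hypothetical choices, and the selection and dynamic adjustment of Kernel combinations is omitted.

```python
def cost_and_gradient(w, b, xs, ys):
    """Mean squared error of y ~ w*x + b and its partial derivatives."""
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    cost = sum(r * r for r in residuals) / n
    grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
    grad_b = 2.0 * sum(residuals) / n
    return cost, (grad_w, grad_b)


def gradient_descent(xs, ys, lr=0.05, threshold=1e-6, max_iters=100_000):
    """Iterate against the gradient until every component is within threshold."""
    w, b = 0.0, 0.0  # assumed starting point
    for _ in range(max_iters):
        _, (gw, gb) = cost_and_gradient(w, b, xs, ys)
        # Stop once the absolute value of every gradient element is small enough.
        if abs(gw) <= threshold and abs(gb) <= threshold:
            break
        # Step in the negative gradient direction.
        w -= lr * gw
        b -= lr * gb
    cost, _ = cost_and_gradient(w, b, xs, ys)
    # Scoring mechanism from the text: score = -cost.
    return {"w": w, "b": b, "cost": cost, "score": -cost}
```

Calling `gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])` fits data generated by y = 2x + 1, so the returned `w` and `b` should approach 2 and 1 and the cost should approach zero.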

103. Compare the extremum corresponding to the Kernel combination with the history record to determine whether to update the tuning record, so as to determine the corresponding target tuning record.

Specifically, in one embodiment of the present application, comparing the extremum corresponding to the Kernel combination with the history record to determine whether to update the tuning record, so as to determine the corresponding target tuning record, which specifically includes:

Under the condition that the local minimum value is smaller than the history cache tuning record, updating the history cache tuning record, and determining a corresponding target tuning record;

under the condition that the local minimum value is larger than the history cache tuning record, continuously determining whether the local minimum value meets the current threshold value or not;

If not, the parameters of the cost model are adjusted so as to move away from the local minimum value; if so, the history cache tuning record is determined to be the target tuning record.

In one embodiment, in a scenario intended to optimize deep learning model performance, an automated tuning flow is implemented that includes calculation of cost model scores, comparison with historical cache tuning records, and decision making based on the comparison results.

The history tuning record is first retrieved from a specialized database or cache system. The records contain information such as the minimum cost model score, the corresponding model parameter configuration, and the tuning timestamp from previous tuning processes. In the current tuning iteration, a new local minimum has been calculated by the cost model. Next, this local minimum is compared with the minimum in the history cache tuning record.

If the current local minimum is less than the minimum in the history cache tuning record, a better solution has been found. In this case, the history cache tuning record is updated, and information such as the current local minimum, the corresponding model parameter configuration, and the current timestamp is stored in the database. At the same time, the current record is determined to be the target tuning record.

If the current local minimum is greater than the minimum in the history cache tuning record, the current solution is not immediately discarded, but is further evaluated for meeting the currently set threshold. This threshold may be dynamically determined based on traffic demand, computational resource limitations, or tuning policies.

If the current local minimum does not meet the current threshold (i.e., is greater than the threshold), the current solution is considered not good enough and tuning needs to continue. In this case, a strategy may be adopted such as adjusting the parameters of the cost model to jump out of the current local minimum and explore other regions of the solution space. If the current local minimum meets the current threshold (i.e., is less than or equal to the threshold), it is not the historical optimum, but it is already good enough under the current conditions. In this case, the history cache tuning record (i.e., the previous optimal solution) is determined to be the target tuning record, and the current tuning iteration may be ended, or the next round of tuning may be performed according to the service requirement.
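
The three-way decision described above can be sketched as a small function. The names `TuningRecord` and `decide`, the action labels, and the handling of two cases the text does not specify (an empty history, and a local minimum exactly equal to the historical one) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class TuningRecord:
    score: float                    # local minimum of the cost model (lower is better)
    config: dict = field(default_factory=dict)  # Kernel combination / parameters
    timestamp: float = 0.0          # when the record was produced


def decide(current: TuningRecord,
           history: Optional[TuningRecord],
           threshold: float) -> Tuple[str, Optional[TuningRecord]]:
    """Return (action, target_record) for one tuning iteration."""
    # Assumption: with no history yet, the current result becomes the record.
    if history is None or current.score < history.score:
        return "update", current
    # Not better than history (equality is treated like "larger" here, an
    # assumption): check whether the result is still good enough.
    if current.score <= threshold:
        return "keep_history", history
    # Above the threshold: adjust the cost model and continue tuning.
    return "retune", None
```

For example, with a historical score of 10.0 and a threshold of 12.0, a new local minimum of 8.0 updates the record, 11.0 keeps the historical record as the target, and 15.0 triggers further tuning.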

104. Segment the computational subgraph parsed from the deep learning model according to the target tuning record, and offload the segmented deep learning model to the corresponding hardware devices for reasoning, so as to realize heterogeneous hybrid acceleration.

Specifically, in one embodiment of the present application, a computational subgraph parsed by a deep learning model is segmented according to a target tuning record, and the segmented deep learning model is offloaded to a corresponding hardware device for reasoning, so as to implement heterogeneous hybrid acceleration, which specifically includes:

splitting a computational subgraph analyzed by the deep learning model according to a target strategy in a target tuning record, and determining hardware equipment corresponding to each operator after splitting;

mapping each segmented operator to its corresponding hardware device, and storing the hardware information of each segmented operator in the deep learning model; the hardware information is used to represent the operator calling relation between operators and hardware devices.

In one embodiment, the server adjusts the operator-to-device correspondence according to the optimal solution found by the gradient descent method in the target tuning record, and segments the computation subgraph accordingly, so as to ensure that the correspondence between each segmented operator and its hardware device is the optimal solution.

In one embodiment, in a deep learning model optimization and deployment scenario, the aim is to achieve efficient model reasoning by leveraging heterogeneous hardware resources (e.g., CPU, GPU, FPGA). First, the computational subgraph parsed from the deep learning model is segmented according to the target strategy in the target tuning record. The segmentation is based on the computational structure and dependencies of the model, and splits the model into multiple operators that can be executed independently. During segmentation, the computational load of each operator, the data dependencies, and the characteristics of the hardware devices are taken into account, so that the segmented operators can be mapped onto the hardware devices efficiently.

For each segmented operator, the corresponding target hardware device is determined according to the operator's computational characteristics and the performance characteristics of the hardware devices. For example, computationally intensive operators may be allocated to the GPU, while data preprocessing or simple logic operations may be allocated to the CPU. The data transmission overhead between hardware devices is also considered, so that data movement between devices is reduced as far as possible and the overall reasoning efficiency is improved.

Each segmented operator and its corresponding hardware device information are stored in the deep learning model. The hardware information includes not only the mapping between operators and hardware devices, but may also include details such as the specific execution unit on the device (for example, a particular core of a GPU) and memory addresses. The hardware information represents the operator calling relation between operators and hardware devices, and is an important basis for scheduling operators during subsequent model reasoning.
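
The segmentation and mapping described above can be sketched as follows. The sketch assumes the operators are already in a valid linear execution order and that the target strategy is a simple table from operator kind to device name; the function names, the `"kind"` field, and device names such as `"gpu0"` are hypothetical. Grouping consecutive operators that share a device is one simple way to limit data movement between devices, not necessarily the strategy used in the application.

```python
def assign_devices(operators, strategy):
    """Build the hardware information table: operator name -> device name.

    operators: list of dicts with "name" and "kind" keys, in execution order.
    strategy:  dict mapping operator kind -> device, with a "default" entry.
    """
    hardware_info = {}
    for op in operators:
        hardware_info[op["name"]] = strategy.get(op["kind"], strategy["default"])
    return hardware_info


def split_subgraphs(operators, hardware_info):
    """Group consecutive operators on the same device into one subgraph."""
    subgraphs = []
    for op in operators:
        device = hardware_info[op["name"]]
        if subgraphs and subgraphs[-1]["device"] == device:
            # Same device as the previous operator: no cross-device transfer.
            subgraphs[-1]["ops"].append(op["name"])
        else:
            subgraphs.append({"device": device, "ops": [op["name"]]})
    return subgraphs
```

With a strategy such as `{"conv": "gpu0", "default": "cpu"}`, a preprocessing operator followed by two convolutions and a final element-wise operator would be split into three subgraphs: CPU, GPU, CPU.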

In the model reasoning stage, an operator to be processed is received. According to the hardware information stored in the deep learning model, the target hardware equipment corresponding to the operator to be processed can be rapidly determined. This process is implemented by looking up a hardware information table or calling a corresponding API, ensuring that operators can be accurately scheduled on the correct hardware devices.

The deep learning model (or part of the model) is then offloaded to the target hardware device corresponding to the operator to be processed. This typically involves transmitting the model weights, the computational graph, and the necessary runtime environment to the target device. On the target device, the operator to be processed is executed using a corresponding reasoning engine or library, realizing the reasoning of the deep learning model. Because operators are optimized and scheduled according to the hardware devices, heterogeneous hybrid acceleration can be realized and reasoning efficiency improved. Different types of operators are distributed to different hardware devices for execution, so that the strengths of each hardware device are fully utilized and the overall reasoning performance of the model in a heterogeneous hardware environment is improved.
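
The table lookup and dispatch at the reasoning stage can be sketched as follows. The executors here are plain Python callables standing in for device runtimes; `run_model`, `hardware_info`, `executors`, and the device names are hypothetical, and no real accelerator API is implied.

```python
def run_model(op_sequence, hardware_info, executors, value):
    """Run operators in order, dispatching each one to its recorded device.

    op_sequence:   list of (operator name, callable) pairs in execution order.
    hardware_info: dict mapping operator name -> device name.
    executors:     dict mapping device name -> callable(fn, value) that runs
                   fn on that device (a stand-in for a device runtime).
    """
    trace = []
    for op_name, fn in op_sequence:
        device = hardware_info[op_name]        # look up the target device
        value = executors[device](fn, value)   # execute the operator there
        trace.append((op_name, device))
    return value, trace
```

The returned trace records which device handled each operator, which is useful for checking that the scheduling follows the stored hardware information.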

The above is a method embodiment of the present application. Based on the same inventive concept, the embodiment of the application also provides heterogeneous hybrid acceleration equipment based on multiple acceleration cards, and the structure of the heterogeneous hybrid acceleration equipment is shown in fig. 2.

Fig. 2 is a schematic diagram of an internal structure of heterogeneous hybrid acceleration device based on multiple acceleration cards according to an embodiment of the present application. As shown in fig. 2, the apparatus includes:

At least one processor;

and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:

taking the scoring condition of the Kernel combination as the initial cost of the cost model, and optimizing the cost model by a gradient descent method to determine whether the cost model is an extremum;

comparing the extremum corresponding to the Kernel combination with the history record, and determining whether to update the tuning record so as to determine a corresponding target tuning record;

segmenting the computational subgraph parsed from the deep learning model according to the target tuning record, and offloading the segmented deep learning model to the corresponding hardware devices for reasoning, so as to realize heterogeneous hybrid acceleration.

The embodiment of the application also provides a nonvolatile computer storage medium, which stores computer executable instructions, and the computer executable instructions can be executed:

The embodiments of the present application are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device and medium embodiments are described relatively briefly because they are substantially similar to the method embodiments, and reference may be made to the relevant parts of the description of the method embodiments.

The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims

1. A heterogeneous hybrid acceleration method based on multiple acceleration cards, characterized in that the method comprises the following steps:

2. The heterogeneous hybrid acceleration method of claim 1, wherein before combining each operator in the deep learning model with a Kernel of the corresponding target platform as an initial condition, the method further comprises:

3. The heterogeneous hybrid acceleration method based on multiple acceleration cards according to claim 1, wherein the scoring condition of the Kernel combination is taken as an initial cost of a cost model, and specifically comprises:

4. The heterogeneous hybrid acceleration method based on multiple acceleration cards according to claim 1, wherein the specific calculation mode of the cost model is as follows:

5. The heterogeneous hybrid acceleration method based on multiple acceleration cards according to claim 1, wherein the cost model is optimized by a gradient descent method, and specifically comprises:

6. The heterogeneous hybrid acceleration method based on multiple acceleration cards according to claim 5, wherein comparing the extremum corresponding to the Kernel combination with the history record to determine whether to update the tuning record, and further comprising:

7. The heterogeneous hybrid acceleration method based on multiple acceleration cards according to claim 1, wherein the method is characterized by dividing a computational subgraph analyzed by the deep learning model according to the target tuning record, and unloading the divided deep learning model to corresponding hardware equipment for reasoning, so as to realize heterogeneous hybrid acceleration, and specifically comprises the following steps:

8. The heterogeneous hybrid acceleration method based on multiple acceleration cards according to claim 1, wherein before taking Kernel combinations of target platforms corresponding to each operator in the input model as initial conditions, the method further comprises:

9. A heterogeneous hybrid acceleration device based on multiple acceleration cards, characterized in that it comprises:

At least one processor;

and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a heterogeneous hybrid acceleration method based on multiple acceleration cards as set forth in any one of claims 1-8.

10. A non-transitory computer storage medium storing computer executable instructions which, when executed, implement a heterogeneous hybrid acceleration method based on multiple acceleration cards according to any one of claims 1-8.