WO2022052973A1 - Model processing method, apparatus, device, and computer-readable storage medium - Google Patents
- Publication number
- WO2022052973A1 PCT/CN2021/117359
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- parallelization
- graph
- trained
- subgraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/098—Distributed learning, e.g. federated learning
Definitions
- the present disclosure relates to the field of information technology, and in particular, to a model processing method, apparatus, device, and computer-readable storage medium.
- the prior art adopts a distributed training method to train the model.
- Common distributed training methods include parallelization strategies such as data parallelism, model parallelism, pipeline parallelism, operator splitting, and hybrid parallelism.
- the hybrid parallelism may be a combination of two or more of data parallelism, model parallelism, pipeline parallelism, and operator splitting.
- the distributed training frameworks in the prior art cannot support various parallelization strategies and their combinations.
- an embodiment of the present disclosure provides a model processing method, including:
- the parallelization strategy of the model to be trained includes at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting;
- the to-be-trained model is trained according to the distributed computing graph.
- an embodiment of the present disclosure provides a model processing apparatus, including:
- an adding module configured to add parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain the second computation graph;
- a determining module configured to determine a distributed computation graph according to the second computation graph and computing resources;
- a training module configured to train the to-be-trained model according to the distributed computing graph.
- an embodiment of the present disclosure provides a model processing device, including:
- the computer program is stored in the memory and configured to be executed by the processor to implement the method of the first aspect.
- an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the method described in the first aspect.
- In the embodiments of the present disclosure, the parallelization strategy of the model to be trained includes at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting. Parallelization information is added to the first computation graph according to the parallelization strategy of the model to be trained to obtain a second computation graph; a distributed computation graph is determined according to the second computation graph and computing resources; and the model to be trained is trained according to the distributed computation graph. Multiple parallelization strategies are thus supported by editing the computation graph, so that they can be integrated into one system, realizing a distributed training framework that supports multiple parallelization strategies.
- FIG. 1 is a schematic diagram of data parallelism provided by an embodiment of the present disclosure
- FIG. 2 is a schematic diagram of model parallelism provided by an embodiment of the present disclosure
- FIG. 3 is a schematic diagram of pipeline parallelism provided by an embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of operator splitting provided by an embodiment of the present disclosure.
- FIG. 6 is a flowchart of a model processing method provided by an embodiment of the present disclosure.
- FIG. 7 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
- FIG. 8 is a schematic diagram of another application scenario provided by an embodiment of the present disclosure.
- FIG. 9 is a flowchart of a model processing method provided by another embodiment of the present disclosure.
- FIG. 11 is a schematic diagram of another application scenario provided by an embodiment of the present disclosure.
- FIG. 12 is a schematic diagram of a distributed training framework provided by an embodiment of the present disclosure.
- FIG. 13 is a schematic diagram of a model parameter dimension provided by an embodiment of the present disclosure.
- FIG. 14 is a schematic diagram of a method for dividing a first calculation graph according to an embodiment of the present disclosure
- FIG. 15 is a schematic diagram of a method for dividing a virtual device according to an embodiment of the present disclosure.
- FIG. 16 is a schematic diagram of another application scenario provided by an embodiment of the present disclosure.
- FIG. 17 is a schematic structural diagram of a model processing apparatus provided by an embodiment of the present disclosure.
- FIG. 18 is a schematic structural diagram of a model processing device according to an embodiment of the present disclosure.
- a distributed training method is used to train the model.
- Common distributed training methods include parallelization strategies such as data parallelism, model parallelism, pipeline parallelism, and operator splitting.
- the distributed training frameworks in the prior art cannot support various parallelization strategies and their combinations.
- an embodiment of the present disclosure provides a model processing method, which is described below with reference to specific embodiments.
- Data parallelism is embodied as: each device in multiple devices is loaded with the same copy of the model, that is, the model trained by each device in multiple devices is the same. However, the sample data used to train the model is different in each device. For example, the sample data used to train the model in different devices are different subsets of data. The set of subsets of data in each device is the complete set for training the model. Additionally, each device can synchronize model parameters across replicas at the end of an iteration.
- Figure 1 is a schematic diagram of data parallelism. For example, model 10 and model 11 are the same model. Device 0 is used to train model 10 and device 1 is used to train model 11.
- the sample data used by device 0 to train model 10 is recorded as input1, and the sample data used by device 1 to train model 11 is recorded as input2.
- input1 and input2 can be different.
- Device 0 may output a training result according to input1, and the training result may be, for example, a gradient.
- Device 1 can output a training result according to input2, and the training result can be, for example, a gradient. Since the training result output by device 0 and the training result output by device 1 may differ, the two training results may be aggregated to obtain an aggregated result. The parameters of model 10 and the parameters of model 11 can then each be updated according to the aggregated result, so that the parameters of model 10 and model 11 remain the same before the next training iteration is performed.
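The data-parallel iteration described above can be illustrated with a minimal sketch. It assumes a one-parameter linear model, a squared-error loss, and simple averaging as the aggregation rule; none of these choices, nor the function names, come from the disclosure.

```python
# Hypothetical sketch of data parallelism: two replicas of the model y = w * x,
# each computing a gradient on a different subset of the sample data.

def local_gradient(w, samples):
    """Mean gradient of 0.5 * (w*x - y)**2 over one device's subset."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)

def aggregate(gradients):
    """Average the per-device gradients (one possible aggregation rule)."""
    return sum(gradients) / len(gradients)

def data_parallel_step(w, subsets, lr=0.1):
    """One iteration: every device computes a gradient on its own subset,
    then all replicas apply the same aggregated update and stay identical."""
    grads = [local_gradient(w, subset) for subset in subsets]
    return w - lr * aggregate(grads)

input1 = [(1.0, 2.0), (2.0, 4.0)]  # subset used by device 0
input2 = [(3.0, 6.0), (4.0, 8.0)]  # subset used by device 1

w = 0.0  # shared parameter value held by both replicas
for _ in range(200):
    w = data_parallel_step(w, [input1, input2])
```

Because both replicas apply the same aggregated update, a single value `w` is enough to represent the parameters of model 10 and model 11 in this sketch.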
- Model parallelism is embodied in the fact that different devices in multiple devices are used to train different layers of the model.
- a model may include multiple layers (e.g., network layers), and different devices are responsible for the computation of different layers, that is, different layers of the model are assigned to different devices.
- one or more layers of the model can be assigned to the same device.
- Figure 2 is a schematic diagram of model parallelism.
- For example, the model includes layer 1, layer 2, layer 3, and layer 4, where layer 1 and layer 2 can be assigned to device 0, and layer 3 and layer 4 can be assigned to device 1. The input of device 0 is sample data, and the output of device 0 can be used as the input of device 1.
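The layer-to-device assignment in this example can be sketched as follows. The four layer functions are arbitrary placeholders chosen only so the example runs; the `placement` table and helper names are assumptions, not part of the disclosure.

```python
# Hypothetical sketch of model parallelism: layers 1-2 are placed on device 0
# and layers 3-4 on device 1; device 0's output is device 1's input.

def layer1(x): return x + 1
def layer2(x): return x * 2
def layer3(x): return x - 3
def layer4(x): return x * x

placement = {
    "device0": [layer1, layer2],
    "device1": [layer3, layer4],
}

def run_on_device(device, x):
    """Apply, in order, every layer assigned to the given device."""
    for layer in placement[device]:
        x = layer(x)
    return x

def forward(sample):
    hidden = run_on_device("device0", sample)  # output of device 0
    return run_on_device("device1", hidden)    # used as input of device 1
```

In a real system the hand-off between `device0` and `device1` would be a transfer between hardware devices; here it is an ordinary function call.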
- Pipeline parallelism is an auxiliary parallel strategy. Pipeline parallelism can be used alone or mixed with model parallelism.
- Figure 3 shows a schematic diagram of pipeline parallelism used alone.
- In Figure 3, 30 represents a model that is trained by a certain device, and 31 represents a data set used to train the model 30.
- The data set 31 may include a large amount of sample data. Because a large amount of sample data occupies considerable storage space and computing resources, when pipeline parallelism is used alone the data set 31 can be divided into multiple smaller shards; for example, 32 represents any one of the shards.
- The device can input the shards into the model 30 one after another to train the model 30. Each shard input into the model 30 yields one training result, which may be, for example, a gradient.
- For example, if the data set 31 is divided into 10 shards and the 10 shards are input into the model 30 in sequence, 10 training results are obtained, and the model parameters can be obtained by further processing the 10 training results.
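A minimal sketch of this sharded training loop is given below. The linear model, the squared-error gradient, and the choice of averaging as the "further processing" of the 10 training results are all assumptions made for illustration.

```python
# Hypothetical sketch of pipeline parallelism used alone: one device, one
# model, and a data set that is fed to the model shard by shard.

def split_into_shards(dataset, num_shards):
    """Split dataset into num_shards contiguous shards of near-equal size."""
    base, extra = divmod(len(dataset), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)
        shards.append(dataset[start:end])
        start = end
    return shards

def shard_gradient(w, shard):
    """Training result for one shard: mean gradient of 0.5 * (w*x - y)**2."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

dataset = [(float(x), 3.0 * x) for x in range(1, 21)]  # 20 samples
shards = split_into_shards(dataset, 10)

w = 0.0
results = [shard_gradient(w, shard) for shard in shards]  # 10 training results
combined = sum(results) / len(results)                    # assumed post-processing
w_updated = w - 0.001 * combined
```

With equally sized shards, averaging the per-shard gradients gives the same value as the gradient over the whole data set, which is why the shards can be processed one at a time.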
- Figure 4 shows a schematic diagram of the mixed use of pipeline parallelism and model parallelism.
- the model 40 includes layer 1, layer 2, layer 3, and layer 4.
- layer 1, layer 2, layer 3, and layer 4 can be allocated to different devices for computation respectively.
- For example, device 1 is responsible for the computation of layer 1, device 2 for layer 2, device 3 for layer 3, and device 4 for layer 4.
- the mixed use of pipeline parallelism and model parallelism as shown in FIG. 4 is only a schematic illustration, and does not make a specific limitation.
- The input data 41 of layer 1 can be divided into multiple shards; for example, F0,1 and F1,1 each represent one shard.
- Device 1 may process the shards in sequence. For example, device 1 may process shard F0,1 first, and the result obtained by device 1 after processing F0,1 may be recorded as F0,2.
- F0,2 can be used as the input to layer 2, i.e., device 1 can send F0,2 to device 2.
- The result obtained by device 2 after processing F0,2 can be recorded as F0,3; device 2 can send F0,3 to device 3, where F0,3 is used as the input of layer 3.
- The result obtained by device 3 after processing F0,3 can be recorded as F0,4; device 3 can send F0,4 to device 4, where F0,4 is used as the input of layer 4.
- While device 2 is processing F0,2, device 1 can process the shard that follows F0,1, namely F1,1. The flow of the processing result of F1,1 and its subsequent processing are similar to those of F0,1 and are not repeated here. That is to say, through the mixed use of pipeline parallelism and model parallelism, the devices corresponding to different layers in the model 40 can compute in parallel. For example, at time t, device 1, device 2, device 3, and device 4 can compute at the same time, thereby improving computing efficiency.
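The overlap described here can be made concrete with a small schedule generator. It assumes that every layer takes one time step per shard and that shard m reaches device s at step m + s - 1; this timing model and the function name are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical sketch of a forward-only pipeline schedule: device s handles
# layer s, and shard m enters the pipeline at time step m.

def pipeline_schedule(num_stages, num_shards):
    """Return, for each time step, the (device, shard) pairs running concurrently."""
    steps = []
    for t in range(num_stages + num_shards - 1):
        active = [(stage + 1, t - stage)          # device ids start at 1
                  for stage in range(num_stages)
                  if 0 <= t - stage < num_shards]
        steps.append(active)
    return steps

schedule = pipeline_schedule(num_stages=4, num_shards=4)
for t, active in enumerate(schedule):
    print(t, active)
```

At step 3 the schedule contains one entry per device, which corresponds to the moment at which device 1 through device 4 all compute at the same time; before and after that, the pipeline is filling or draining.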
- each layer of the model includes one or more operators.
- the operators in each layer are used to train some of the parameters of the model.
- the number of parameters corresponding to different layers may be different or the same.
- Figure 5 shows a schematic diagram of operator splitting.
- the storage part and the calculation part of the operator in each layer are split between device 0 and device 1 for storage and calculation.
- any layer or layers in the model can also be split.
- the storage part of the operator may be the training parameter of the operator.
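As an illustration of operator splitting, the sketch below splits the weight matrix of a single matrix-vector operator by columns, so that each of two devices stores half of the parameters and computes half of the outputs. Column-wise splitting is only one possible scheme and is an assumption here, as are the helper names.

```python
# Hypothetical sketch of operator splitting: the operator's parameters
# (storage part) and its computation are both divided between two devices.

def matvec(x, weight):
    """Compute x @ weight, where weight has len(x) rows."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]

def split_columns(weight, parts):
    """Split weight column-wise into equal parts (assumes divisibility)."""
    step = len(weight[0]) // parts
    return [[row[k * step:(k + 1) * step] for row in weight]
            for k in range(parts)]

weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
x = [1, 1]

shards = split_columns(weight, 2)                   # shards[0] -> device 0, shards[1] -> device 1
partial = [matvec(x, shard) for shard in shards]    # each device computes its part
combined = partial[0] + partial[1]                  # concatenate the partial outputs
```

The concatenated result equals the output of the unsplit operator, which is what allows the split to be applied without changing the model.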
- The above-mentioned device may specifically be a computing device such as a graphics processing unit (GPU) or a central processing unit (CPU).
- the above model may be a neural network model, or may also be a deep learning model, and may also be other types of models, that is, the embodiments of the present disclosure do not specifically limit the model.
- deep learning is a branch of machine learning, which is an algorithm that uses artificial neural network as the architecture to perform representation learning of data.
- the deep learning model may also be referred to as a deep neural network model.
- a neural network model with three or more layers may be a deep learning model.
- FIG. 6 is a flowchart of a model processing method provided by an embodiment of the present disclosure. The specific steps of this method are as follows:
- the model to be trained may be, for example, a neural network model or a deep learning model to be trained.
- the to-be-trained model may be a user's single-machine and single-card model.
- the model in this embodiment can be applied in the fields of computer vision, natural language processing, knowledge graph, and the like.
- the model processing method described in this embodiment may be specifically executed by a distributed training framework.
- the distributed training framework may be deployed on one or more machines, and the machines may specifically include a computer, a computer cluster, a server, and the like, where the server may specifically include a cloud server, a remote server, and the like.
- Each machine may include multiple devices, which may be computing devices such as GPUs or CPUs, for example.
- the distributed training framework is deployed on a cloud server.
- this embodiment does not limit the number of cloud servers, for example, it may be one or more.
- Take the cloud server 70 shown in FIG. 7 as an example: the model to be trained is stored in the user terminal 71.
- This embodiment does not limit the specific product form of the terminal 71; for example, it may be a notebook computer, a desktop computer, a tablet computer, a personal computer (PC), and the like.
- the terminal 71 may send the model to be trained to the cloud server 70, and the cloud server 70 generates a calculation graph corresponding to the model to be trained after receiving the model to be trained, and the calculation graph is denoted as the first calculation graph.
- the first calculation graph may be an original single-machine and single-card calculation graph of the user.
- the terminal 71 may generate the first computation graph according to the local model to be trained, and send the first computation graph to the cloud server 70 .
- the cloud server 70 can use the distributed training framework to train the model to be trained, and feed back the training result to the terminal 71 .
- the cloud server 70 may send the distributed training framework to the terminal 71, and the terminal 71 may deploy the distributed training framework on the terminal 71 or other devices.
- The terminal 71 can generate a first computation graph according to the local model to be trained and use the first computation graph as the input of the distributed training framework; the distributed training framework then trains the model to be trained and outputs the training result.
- the distributed training framework described in this embodiment can support multiple parallelization strategies.
- The distributed training framework can support not only pipeline parallelism, model parallelism, data parallelism, or operator splitting individually, but also a combination of two or more of pipeline parallelism, model parallelism, data parallelism, and operator splitting.
- the distributed training framework in this embodiment can not only support a single parallelization strategy among pipeline parallelism, model parallelism, data parallelism, and operator splitting, but also support hybrid parallelism. Therefore, the distributed training framework can not only use a single parallelization strategy to train the to-be-trained model, but also can use a hybrid parallel method to train the to-be-trained model.
- When the distributed training framework obtains the first computation graph of the model to be trained, it can further obtain the parallelization strategy of the model to be trained.
- The parallelization strategy of the model to be trained can be at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting. That is to say, the parallelization strategy of the model to be trained may be a single parallelization strategy or a hybrid parallelization strategy.
- acquiring the parallelization strategy of the model to be trained includes: determining the parallelization strategy of the model to be trained according to a first computation graph corresponding to the model to be trained.
- When the distributed training framework obtains the first computation graph of the model to be trained, it can determine the parallelization strategy of the model to be trained according to the first computation graph.
- the distributed training framework may analyze the first computation graph according to a machine learning method such as reinforcement learning to determine feature information of the model to be trained, and determine a parallelization strategy for the model to be trained according to the feature information.
- acquiring the parallelization strategy of the model to be trained includes: acquiring the parallelization strategy of the model to be trained selected by the user.
- the distributed training framework provides a user interface through which the user can select the parallelization strategy of the model to be trained.
- The user interface offers pipeline parallelism, model parallelism, data parallelism, and operator splitting, and the user can select one or more of them as the parallelization strategy of the model to be trained.
- When the distributed training framework obtains the parallelization strategy selected by the user, it can also calculate indicator information, such as the computing resources consumed, cost, time, and performance, for the selected parallelization strategy or the selected combination of parallelization strategies. The distributed training framework can further feed this indicator information back to the user, so that the user can adjust or confirm the selected parallelization strategy.
- the distributed training framework can add parallelization information on the basis of the first calculation graph 81 according to the parallelization strategy of the model to be trained to obtain the second calculation graph 82 .
- the parallelization information is information related to the parallelization strategy of the model to be trained.
- the parallelization strategy of the model to be trained is: model parallelism is used between the first part and the second part of the model to be trained, data parallelism is used in the first part, and operator splitting is used in the second part.
- the parallelization information may include the parallelization strategy of the model to be trained.
- the parallelization information may also include parameter information of the parallelization strategy, for example, the number of devices required for data parallelism, the number of split objects split by the operator, or the number of shards.
- Adding parallelization information to the first computation graph according to the parallelization strategy of the model to be trained to obtain a second computation graph includes: dividing the first computation graph into a plurality of first subgraphs according to the parallelization strategy of the model to be trained; and adding parallelization information to each first subgraph of the plurality of first subgraphs according to the parallelization strategy of the model to be trained, to obtain the second computation graph.
- the distributed training framework may divide the first computation graph into a plurality of subgraphs according to the parallelization strategy of the model to be trained, and each subgraph may include one or more layers of the model to be trained.
- For example, the distributed training framework divides the first computation graph into subgraph 1 and subgraph 2 shown in FIG. 8, and subgraph 1 and subgraph 2 are each referred to as a first subgraph.
- sub-graph 1 corresponds to the second part of the model to be trained
- sub-graph 2 corresponds to the first part of the model to be trained.
- parallelization information is added to each subgraph in subgraph 1 and subgraph 2 according to the parallelization strategy of the model to be trained, to obtain a second calculation graph 82 .
- the parallelization information includes parallelization information between different first subgraphs and parallelization information within each first subgraph.
- the parallelization information added in subgraph 1 by the distributed training framework may include parallelization information within subgraph 1, and may also include parallelization information between subgraph 1 and subgraph 2.
- the parallelization information added in the sub-graph 2 by the distributed training framework may include the parallelization information in the sub-graph 2, and may also include the parallelization information between the sub-graph 1 and the sub-graph 2.
- the parallelization information between different first subgraphs includes: parallelization strategies adopted between different first subgraphs.
- Parallelization strategies adopted between different first subgraphs include pipeline parallelism and/or model parallelism.
- the parallelization information between subgraph 1 and subgraph 2 includes a parallelization strategy adopted between subgraph 1 and subgraph 2, and the parallelization strategy is model parallelism.
- the parallelization information between different first subgraphs further includes: parameter information of parallelization strategies adopted between different first subgraphs.
- For example, the parallelization strategy between subgraph 1 and subgraph 2 adds pipeline parallelism on top of model parallelism.
- Pipeline parallelism can divide the sample data set of the model into multiple smaller shards.
- Accordingly, the input data of subgraph 1 can be divided into multiple smaller shards, and the input data of subgraph 2 can be divided into multiple smaller shards.
- In this case, the parallelization information between subgraph 1 and subgraph 2 can include not only the parallelization strategy adopted between subgraph 1 and subgraph 2, but also the parameter information of that parallelization strategy.
- The parameter information may specifically be the number of shards into which the input data of subgraph 1 is divided and the number of shards into which the input data of subgraph 2 is divided.
- The number of shards into which the input data of subgraph 1 is divided and the number of shards into which the input data of subgraph 2 is divided may be the same or different.
- The parameter information of the parallelization strategy adopted between subgraph 1 and subgraph 2 may be configured by a user or by the distributed training framework.
- the parallelization information in each first subgraph includes: a parallelization strategy in each first subgraph.
- Parallelization strategies within each first subgraph include: data parallelism and/or operator splitting.
- For example, operator splitting is used in subgraph 1 shown in FIG. 8, and data parallelism is used in subgraph 2.
- Alternatively, operator splitting may be used in subgraph 2.
- That is, the parallelization strategy adopted in subgraph 1 may also be the same as the parallelization strategy adopted in subgraph 2.
- The parallelization strategy adopted within each subgraph may also be a hybrid one; for example, a combination of data parallelism and operator splitting is adopted in subgraph 1, and/or a combination of data parallelism and operator splitting is adopted in subgraph 2.
- the parallelization information in each first subgraph further includes: parameter information of the parallelization strategy in each first subgraph.
- The parallelization information in subgraph 1 may also include parameter information of operator splitting, for example, the number of parts into which the operators are split.
- The parallelization information in subgraph 2 may also include parameter information of data parallelism, for example, the number of devices required to execute data parallelism.
- The parameter information of the parallelization strategy in subgraph 1 or in subgraph 2 may be configured by the user or by the distributed training framework.
- The dashed boxes around subgraph 1 and subgraph 2, together with the circled numbers 1 and 2, represent the parallelization information between subgraph 1 and subgraph 2, that is, subgraph 1 and subgraph 2 are assigned to different devices for calculation.
- The parallelization information in subgraph 1 may be the dashed boxes in 811, indicating that operator splitting is used in subgraph 1.
- The number of dashed boxes in 811 indicates the number of parts into which subgraph 1 is split.
- For example, the two dashed boxes in 811 indicate that subgraph 1 is split across two devices for execution.
- Parallelization information may also be represented in subgraph 2.
- The parallelization information in subgraph 2 identifies data parallelism and the number of devices required for data parallelism, for example, three.
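One way to picture the second computation graph is as the first computation graph divided into first subgraphs, each annotated with information that applies within it, plus information that applies between subgraphs. The sketch below follows the FIG. 8 example (operator splitting into two parts in subgraph 1, data parallelism on three devices in subgraph 2, model parallelism between them). The data classes, the dictionary format of `strategy`, and the operator names `op_a` to `op_d` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FirstSubgraph:
    name: str
    ops: list                                   # operators (or layers) in this subgraph
    intra: dict = field(default_factory=dict)   # parallelization info within the subgraph

@dataclass
class SecondGraph:
    subgraphs: list
    inter: dict = field(default_factory=dict)   # parallelization info between subgraphs

def add_parallelization_info(first_graph, strategy):
    """Divide the first graph into first subgraphs and attach the info to each."""
    subgraphs = []
    for part in strategy["parts"]:
        ops = [op for op in first_graph if op in part["ops"]]  # keep graph order
        subgraphs.append(FirstSubgraph(part["name"], ops, dict(part["intra"])))
    return SecondGraph(subgraphs, dict(strategy["inter"]))

first_graph = ["op_a", "op_b", "op_c", "op_d"]   # placeholder operators
strategy = {
    "inter": {"strategy": "model_parallel"},
    "parts": [
        {"name": "subgraph1", "ops": {"op_c", "op_d"},
         "intra": {"strategy": "operator_split", "num_splits": 2}},
        {"name": "subgraph2", "ops": {"op_a", "op_b"},
         "intra": {"strategy": "data_parallel", "num_devices": 3}},
    ],
}
second_graph = add_parallelization_info(first_graph, strategy)
```

Keeping the annotations separate from the operators means the original single-device graph is left unchanged and later stages can read the annotations to decide how to divide devices.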
- physical device 0 , physical device 1 , physical device 2 , physical device 3 , and physical device 4 shown in FIG. 8 are specific hardware computing resources, respectively.
- the physical device 0, physical device 1, physical device 2, physical device 3, and physical device 4 may be physical devices from the same machine, or may be physical devices from different machines.
- the physical device may be a computing device such as a GPU or a CPU as described above, and in addition, the physical device may also be a virtual machine.
- the distributed training framework may divide the hardware computing resources into multiple virtual devices, and this embodiment does not specifically limit the division method here.
- physical device 0 and physical device 1 may be divided into virtual device 1
- physical device 2 , physical device 3 and physical device 4 may be divided into virtual device 2 .
- the distributed computation graph 83 is obtained according to the second computation graph 82 and the physical devices included in each virtual device.
- The distributed computation graph 83 indicates that physical device 0 and physical device 1 are used to perform operator-split computation on subgraph 1, and physical device 2, physical device 3, and physical device 4 are used to perform data-parallel computation on subgraph 2.
- The distributed computation graph 83 is input to a training engine such as TensorFlow or PyTorch, and the training process is performed by the training engine.
- TensorFlow is an open source machine learning platform for machine learning tasks such as image, speech, and language understanding.
- PyTorch is an open source Python machine learning library based on Torch, which is used in artificial intelligence fields such as natural language processing.
- The terminal 71 may send the computing resource information, together with the to-be-trained model or the first computation graph corresponding to the to-be-trained model, to the cloud server 70, and the cloud server 70 may determine the distributed computation graph according to the computing resource information and the to-be-trained model or the first computation graph corresponding to the to-be-trained model.
- the process of training the model to be trained according to the distributed computing graph can be performed in other servers or in training engines provided by other servers.
- The cloud server 70 can send the distributed computation graph to the terminal 71. After receiving the distributed computation graph through the terminal 71, the user can train the model to be trained according to the distributed computation graph in other servers or in training engines provided by other servers.
- the first computation graph corresponding to the model to be trained and the parallelization strategy of the model to be trained are obtained.
- the parallelization strategy of the model to be trained includes at least one of pipeline parallelism, model parallelism, data parallelism and operator splitting.
- A second computation graph is obtained by adding parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, a distributed computation graph is determined according to the second computation graph and computing resources, and the model to be trained is trained according to the distributed computation graph.
- Multiple parallelization strategies are thus supported by editing the computation graph, so that they can be integrated into one system, realizing a distributed training framework that supports multiple parallelization strategies.
- determining the distributed computing graph includes the following steps as shown in FIG. 9 :
- For example, physical device 0 and physical device 1 are divided into virtual device 1, and physical device 2, physical device 3, and physical device 4 are divided into virtual device 2.
- dividing the computing resources to obtain one or more virtual devices includes: dividing the computing resources according to the parallelization information to obtain one or more virtual devices.
- The division may specifically be performed according to the parallelization information in the second computation graph 82.
- For example, the parallelization information in the second computation graph 82 indicates that the first computation graph 81 is divided into two subgraphs; therefore, physical device 0, physical device 1, physical device 2, physical device 3, and physical device 4 can be divided into two virtual devices, for example, virtual device 1 and virtual device 2.
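A possible division of the physical devices can be sketched as follows. The disclosure does not fix a division method, so the rule used here (give each first subgraph a contiguous run of physical devices, sized by the device count in its parallelization information) is an assumption, as are the names.

```python
# Hypothetical sketch: divide physical devices into one virtual device per
# first subgraph, sized by the device count each subgraph requires.

def divide_into_virtual_devices(physical_devices, devices_needed):
    """devices_needed maps each first subgraph to its required device count."""
    if sum(devices_needed.values()) > len(physical_devices):
        raise ValueError("not enough physical devices")
    virtual, start = {}, 0
    for index, (subgraph, count) in enumerate(devices_needed.items(), start=1):
        virtual[f"virtual_device_{index}"] = {
            "subgraph": subgraph,
            "physical": physical_devices[start:start + count],
        }
        start += count
    return virtual

physical_devices = [f"physical_device_{i}" for i in range(5)]
devices_needed = {"subgraph1": 2,   # operator splitting into two parts
                  "subgraph2": 3}   # data parallelism on three devices
virtual_devices = divide_into_virtual_devices(physical_devices, devices_needed)
```

With these inputs, virtual device 1 contains physical devices 0 and 1, and virtual device 2 contains physical devices 2, 3, and 4, matching the example in the text.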
- the second computation graph 82 may also be converted into a third computation graph 84 according to the parallelization information in the second computation graph 82 .
- converting the second computation graph into a third computation graph according to the parallelization information includes: converting each first subgraph into a distributed second subgraph according to the parallelization information of each first subgraph in the plurality of first subgraphs; and connecting the distributed second subgraphs corresponding to each first subgraph according to the connection relationship between the plurality of first subgraphs, to obtain the third computation graph.
- sub-graph 1 and sub-graph 2 are respectively denoted as the first sub-graph, and each first sub-graph can be converted into a distributed second sub-graph according to the parallelization information of each first sub-graph.
- subgraph 11 and subgraph 12 are distributed second subgraphs obtained by converting subgraph 1 .
- Subgraph 21, subgraph 22, and subgraph 23 are distributed second subgraphs obtained by converting subgraph 2.
- subgraph 11 is connected to each of subgraph 21, subgraph 22, and subgraph 23
- subgraph 12 is connected to each of subgraph 21, subgraph 22, and subgraph 23, resulting in the third computation graph 84.
- mapping the third computation graph 84 to physical devices results in a distributed computation graph 83 .
- mapping the third computational graph to a physical device includes: mapping each second subgraph in the third computational graph to a physical device.
- mapping each second subgraph in the third computation graph to a physical device includes: mapping each first subgraph to a virtual device; and mapping each second subgraph corresponding to the first subgraph to a physical device included in the virtual device corresponding to the first subgraph.
- subgraph 1 is mapped to virtual device 1, and subgraph 11 and subgraph 12, which correspond to subgraph 1, are mapped to physical devices included in virtual device 1; for example, subgraph 11 is mapped to physical device 0 in virtual device 1 and subgraph 12 is mapped to physical device 1 in virtual device 1.
- subgraph 2 is mapped to virtual device 2, and subgraph 21, subgraph 22, and subgraph 23, which correspond to subgraph 2, are mapped to physical devices included in virtual device 2; for example, subgraph 21 is mapped to physical device 2, subgraph 22 to physical device 3, and subgraph 23 to physical device 4.
- subgraph 1 may also be split across three devices, as shown in FIG. 11.
- physical device 0 and physical device 3 may be the same device, or may be different devices.
- physical device 1 and physical device 4 may be the same device or different devices.
- Physical device 2 and physical device 5 may be the same device or different devices.
- one or more virtual devices, each including one or more physical devices, are obtained by dividing the computing resources; the second computation graph is converted into a third computation graph according to the parallelization information; and the third computation graph is mapped to the physical devices to obtain a distributed computation graph, so that the computing resources can be fully used and their utilization is improved.
- reference numeral 120 in FIG. 12 denotes a schematic structural diagram of a distributed training framework.
- the input to the distributed training framework 120 may be the first computational graph as described above.
- the output of the distributed training framework 120 may be training results.
- the distributed training framework 120 includes an interface layer, and the interface layer includes a user interface, and the user interface includes scopes and clusters. Users can configure parallelization strategies for the model to be trained through scopes and clusters.
- scopes are used to identify parallelization strategies for different parts of the model to be trained.
- scopes can be at least one of replica (data parallelism), split (operator splitting), pipeline (pipeline parallelism), and stage (model parallelism), that is, scopes can be replica (data parallelism), split (operator splitting), pipeline (pipeline parallelism), stage (model parallelism), or a combination of two or more.
- replica: data parallelism
- split: operator splitting
- pipeline: pipeline parallelism
- stage: model parallelism
- Different scopes are used to specify different parallelization strategies.
- the scopes interface supports nested use, so that different parallelization strategies can be nested to implement various hybrid parallel strategies to accelerate distributed training. Users can divide the model to be trained into multiple subgraphs through the scopes interface, and configure a scope for each subgraph.
- the cluster shown in FIG. 12 is used to divide computing resources, where computing resources may also be referred to as hardware resources.
- the computing resource may specifically be a GPU, a CPU, or the like.
- a cluster is used to divide computing resources into multiple virtual computing devices. Further, according to the parallelization strategy of the model to be trained, the subgraphs divided by the user through the scopes are mapped to the virtual computing device, and the mapping process can be completely transparent to the user.
- USER_MODEL_DEFINATION() represents the user's original code, that is, the code corresponding to the model to be trained.
- with whale.replica(): indicates the data parallel strategy configured by the user for the model to be trained.
- with whale.cluster(): indicates a call to the cluster interface. That is, users do not need to modify the original code; they only need to wrap it in a replica scope and a cluster for the distributed training framework to perform data-parallel distributed training of the model to be trained.
- USER_MODEL_DEFINATION_PART_1() represents the first part of the model to be trained
- USER_MODEL_DEFINATION_PART_2() represents the second part of the model to be trained.
- the first part and the second part may be divided by the user.
- the two with whale.stage(): blocks represent the model parallel strategy configured by the user for the first part and the second part respectively.
- with whale.replica(): indicates the data parallel strategy configured by the user for the first part and the second part.
- the first part may correspond to one subgraph
- the second part may correspond to another subgraph.
- adding a pipeline scope around the first part and the second part enables the distributed training framework to perform pipeline-parallel training of the model to be trained.
- if data-parallel training is also needed, a replica scope can be added outside the pipeline scope.
- with whale.replica(): represents the data parallel strategy configured by the user for the first part of the model to be trained.
- the distributed training framework 120 may automatically add a parallelization strategy to the first computational graph through scopes in the interface layer.
- This embodiment enables the user to construct various parallelization strategies, thus improving the flexibility of the parallelization strategy.
- the user's original code that is, the code of the user's model definition part
- native interfaces such as Tensorflow interface and PyTorch interface
- API Application Programming Interface
- the distributed training framework 120 further includes a model and parallelization intermediate representation layer, which can parse the parallelization strategy of the model to be trained to obtain the corresponding abstractions.
- this embodiment provides three types of abstractions, namely Multi-Dimensional Resource, Subgraph Group, and Virtual Device. These 3 classes of abstractions can be used to unify and express different parallelization strategies. After unified abstraction of the parallelization strategy, the parallelization strategy can be realized based on the computational graph editing technology.
- model parameters have multiple dimensions, such as the data sample dimension, channel dimension, height dimension, width dimension, and length dimension, where the data sample dimension is denoted as N, the channel dimension as C, the height dimension as H, the width dimension as W, and the length dimension as L.
- data parallelism can split the data sample dimension N.
- operator splitting can split dimensions other than the data sample dimension N. For example, operator splitting can split one of the channel dimension C, height dimension H, width dimension W, and length dimension L, or split several of these dimensions at once.
- Multi-Dimensional Resource supports arbitrary splitting or slicing in different dimensions.
- Batch Sample represents the data sample dimension
- Channel represents the channel dimension
- Length represents the length dimension.
- Multi-Dimensional Resource can represent data parallelism when splitting data sample dimensions.
- the Multi-Dimensional Resource can represent operator splitting.
- Multi-Dimensional Resource can represent the combination of data parallelism and operator splitting.
- Subgraph Group supports dividing a complete computational graph (Graph) of a model, such as the first computational graph described in the above embodiment, into multiple subgraphs (Subgraphs), and each subgraph can implement the same or different parallelization strategies. Communication between subgraphs is possible.
- Subgraph Groups can be used to represent model parallelism and/or pipeline parallelism.
- model parallelism and/or pipeline parallelism may be parallelization strategies between subgraphs
- data parallelism and/or operator splitting may be parallelization strategies within subgraphs.
- the first calculation graph 81 may be divided in various manners.
- the first calculation graph 81 may be divided into the second calculation graph 82 described in the foregoing embodiment.
- the first computation graph 81 can also be divided as shown at 140 in FIG. 14, that is, into 4 subgraphs, each of which includes one layer of the model to be trained.
- the Virtual Device abstraction supports abstracting multiple physical devices into a virtual device.
- multiple physical devices may come from the same machine, that is, a single machine with multiple cards, or multiple physical devices may come from multiple different machines, that is, multiple machines and multiple cards.
- the physical device is specifically a GPU, and the multiple physical devices are GPU0-GPU5 as shown in FIG. 15 .
- GPU3-GPU5 comes from machine B
- GPU0-GPU3 can be divided into virtual device
- GPU4 and GPU5 can be divided into virtual device 1.
- FIG. 15 is only a schematic manner, and is not specifically limited in this embodiment. Specifically, the user only needs to perceive the virtual device and assign the corresponding virtual device to the sub-graph.
- the distributed training framework 120 can associate virtual devices to specific physical devices according to the network topology of hardware computing resources.
- the execution layer in the distributed training framework 120 shown in FIG. 12 can be used to rewrite the second computation graph to construct a third computation graph that can be parallelized, and to convert the third computation graph into a distributed computation graph. Further, the execution layer can send the distributed computation graph to the training engine, for example, Tensorflow or PyTorch.
- This embodiment expresses the various parallelization strategies uniformly through three types of abstractions: Multi-Dimensional Resource, Subgraph Group, and Virtual Device, so that the distributed training framework can support arbitrary parallelization strategies and various hybrid parallelization strategies, thus solving the problem of supporting only a single parallelization strategy.
- this embodiment also implements various parallelization strategies based on the computational graph editing technology, so that multiple parallelization strategies can be integrated into one system, thereby improving the flexibility and diversity of the parallelization strategies.
- training the model to be trained according to the distributed computing graph includes the following steps as shown in FIG. 16 :
- the distributed computing graph 83 can be converted into a distributed computing graph that can be recognized by a training engine such as Tensorflow or PyTorch.
- the process of converting the distributed computing graph 83 into a distributed computing graph identifiable by Tensorflow or PyTorch can be performed by the parallelized computing graph converting component shown in FIG. 12 .
- the distributed training framework 120 also includes a training engine.
- after the parallelized computation graph conversion component converts the distributed computation graph 83 into a distributed computation graph recognizable by Tensorflow or PyTorch, it can also input that recognizable graph to the training engine, and the training engine trains the model to be trained.
- by converting the distributed computation graph into one recognizable by the training engine, different training engines such as Tensorflow or PyTorch can be supported, thereby improving the compatibility of the distributed training framework.
- the coupling between the training engine and the parallelization strategy is also reduced, so that existing training engines remain usable and the compatibility of user models is improved.
- FIG. 17 is a schematic structural diagram of a model processing apparatus provided by an embodiment of the present disclosure.
- the model processing apparatus provided by the embodiment of the present disclosure may execute the processing flow provided by the model processing method embodiment.
- the model processing apparatus 170 includes:
- the acquisition module 171 is used to acquire the first computation graph corresponding to the model to be trained and the parallelization strategy of the model to be trained, where the parallelization strategy of the model to be trained includes at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting;
- An adding module 172 configured to add parallelization information in the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph;
- a determining module 173, configured to determine a distributed computing graph according to the second computing graph and computing resources
- a training module 174 configured to train the to-be-trained model according to the distributed computing graph.
- the adding module 172 is specifically configured to: divide the first computation graph into a plurality of first subgraphs according to the parallelization strategy of the model to be trained; and add parallelization information to each first subgraph of the plurality of first subgraphs according to the parallelization strategy of the model to be trained, to obtain a second computation graph.
- the parallelization information includes parallelization information between different first subgraphs and parallelization information within each first subgraph.
- the parallelization information between different first subgraphs includes: a parallelization strategy adopted between different first subgraphs.
- the parallelization information between different first subgraphs further includes: parameter information of parallelization strategies adopted between different first subgraphs.
- the parallelization strategy adopted between different first subgraphs includes: pipeline parallelism and/or model parallelism.
- the parallelization information in each first subgraph includes: a parallelization strategy in each first subgraph.
- the parallelization information in each first subgraph further includes: parameter information of the parallelization strategy in each first subgraph.
- the parallelization strategy in each first subgraph includes: data parallelism and/or operator splitting.
- the determining module 173 includes:
- a dividing unit 1731 configured to divide the computing resources to obtain one or more virtual devices, where the virtual devices include one or more physical devices;
- a conversion unit 1732 configured to convert the second computation graph into a third computation graph according to the parallelization information
- the mapping unit 1733 is configured to map the third computation graph to a physical device to obtain a distributed computation graph.
- the dividing unit 1731 is specifically configured to: divide the computing resources according to the parallelization information to obtain one or more virtual devices.
- the conversion unit 1732 is specifically configured to: convert each first subgraph into a distributed second subgraph according to the parallelization information of each first subgraph in the plurality of first subgraphs; and connect the distributed second subgraphs corresponding to each first subgraph according to the connection relationship between the plurality of first subgraphs, to obtain a third computation graph.
- the mapping unit 1733 is specifically configured to: map each second subgraph in the third computation graph to a physical device.
- the mapping unit 1733 is specifically configured to: map each first subgraph to a virtual device; and map each second subgraph corresponding to the first subgraph to a physical device included in the virtual device corresponding to the first subgraph.
- the obtaining module 171 is specifically configured to: determine a parallelization strategy of the model to be trained according to the first computation graph corresponding to the model to be trained.
- the acquiring module 171 is specifically configured to: acquire the parallelization strategy of the to-be-trained model selected by the user.
- the training module 174 is specifically configured to: convert the distributed computing graph into a distributed computing graph identifiable by the training engine; input the distributed computing graph identifiable by the training engine into the training engine, through The training engine trains the to-be-trained model.
- the model processing apparatus in the embodiment shown in FIG. 17 can be used to execute the technical solutions of the foregoing method embodiments, and the implementation principles and technical effects thereof are similar, and are not repeated here.
- FIG. 18 is a schematic structural diagram of a model processing device according to an embodiment of the present disclosure.
- the model processing device provided by the embodiment of the present disclosure can execute the processing flow provided by the model processing method embodiment.
- the model processing device 180 includes: a memory 181, a processor 182, a computer program, and a communication interface 183;
- the computer program is stored in the memory 181 and is configured to be executed by the processor 182 to implement the model processing method described above.
- an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the model processing method described in the foregoing embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Image Processing (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
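A model processing method, apparatus, device, and computer-readable storage medium. A first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained are obtained, the parallelization strategy including at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting (S601); parallelization information is added to the first computation graph according to the parallelization strategy to obtain a second computation graph (S602); a distributed computation graph is determined according to the second computation graph and computing resources (S603); and the model to be trained is trained according to the distributed computation graph (S604). Multiple parallelization strategies are thereby supported through computation-graph editing, so that they can be integrated into a single system, realizing a distributed training framework capable of supporting multiple parallelization strategies.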
Description
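The present disclosure claims priority to Chinese patent application No. 202010947896.X, filed on September 10, 2020 and entitled "Model processing method, apparatus, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
The present disclosure relates to the field of information technology, and in particular to a model processing method, apparatus, device, and computer-readable storage medium.
With the development of deep neural networks, the number of parameters in deep learning models, neural network models, and similar models keeps growing, but the hardware used to train these models is not updated as fast as the models iterate.
The prior art trains models using distributed training methods. Common distributed training methods include parallelization strategies such as data parallelism, model parallelism, pipeline parallelism, operator splitting, and hybrid parallelism, where hybrid parallelism may be a combination of two or more of data parallelism, model parallelism, pipeline parallelism, and operator splitting. However, distributed training frameworks in the prior art cannot support all of these parallelization strategies and their combinations.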
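Summary of the Invention
To solve the above technical problem, or at least partially solve it, the present disclosure provides a model processing method, apparatus, device, and computer-readable storage medium, so as to realize a distributed training framework capable of supporting multiple parallelization strategies.
In a first aspect, an embodiment of the present disclosure provides a model processing method, including:
obtaining a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained, the parallelization strategy of the model to be trained including at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting;
adding parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph;
determining a distributed computation graph according to the second computation graph and computing resources;
training the model to be trained according to the distributed computation graph.
In a second aspect, an embodiment of the present disclosure provides a model processing apparatus, including:
an acquisition module, configured to obtain a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained, the parallelization strategy of the model to be trained including at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting;
an adding module, configured to add parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph;
a determining module, configured to determine a distributed computation graph according to the second computation graph and computing resources;
a training module, configured to train the model to be trained according to the distributed computation graph.
In a third aspect, an embodiment of the present disclosure provides a model processing device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method of the first aspect.
In the model processing method, apparatus, device, and computer-readable storage medium provided by the embodiments of the present disclosure, a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained are obtained, the parallelization strategy including at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting; parallelization information is added to the first computation graph according to the parallelization strategy to obtain a second computation graph; a distributed computation graph is determined according to the second computation graph and computing resources; and the model to be trained is trained according to the distributed computation graph. Multiple parallelization strategies are thereby supported through computation-graph editing, so that they can be integrated into a single system, realizing a distributed training framework capable of supporting multiple parallelization strategies.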
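The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of data parallelism provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of model parallelism provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of pipeline parallelism provided by an embodiment of the present disclosure;
FIG. 4 is another schematic diagram of pipeline parallelism provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of operator splitting provided by an embodiment of the present disclosure;
FIG. 6 is a flowchart of a model processing method provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of another application scenario provided by an embodiment of the present disclosure;
FIG. 9 is a flowchart of a model processing method provided by another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of another application scenario provided by an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of another application scenario provided by an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a distributed training framework provided by an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of model parameter dimensions provided by an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of a method for dividing the first computation graph provided by an embodiment of the present disclosure;
FIG. 15 is a schematic diagram of a method for dividing virtual devices provided by an embodiment of the present disclosure;
FIG. 16 is a schematic diagram of another application scenario provided by an embodiment of the present disclosure;
FIG. 17 is a schematic structural diagram of a model processing apparatus provided by an embodiment of the present disclosure;
FIG. 18 is a schematic structural diagram of a model processing device provided by an embodiment of the present disclosure.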
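To make the above objects, features, and advantages of the present disclosure easier to understand, the solutions of the present disclosure are further described below. It should be noted that the embodiments of the present disclosure and the features in the embodiments may be combined with one another where no conflict arises.
Many specific details are set forth in the following description to provide a full understanding of the present disclosure, but the present disclosure may also be implemented in ways other than those described here. The embodiments in the specification are clearly only some, not all, of the embodiments of the present disclosure.
Models are usually trained using distributed training methods. Common distributed training methods include parallelization strategies such as data parallelism, model parallelism, pipeline parallelism, and operator splitting. However, distributed training frameworks in the prior art cannot support all of these parallelization strategies and their combinations. To address this problem, the embodiments of the present disclosure provide a model processing method, which is introduced below with reference to specific embodiments.
In the embodiments of the present disclosure, a parallelization strategy may also be called a parallel strategy, and may specifically be a set of distributed parallel methods. For example, parallelization strategies include data parallelism, model parallelism, pipeline parallelism, operator splitting, and hybrid parallelism, where hybrid parallelism may be a combination of two or more of data parallelism, model parallelism, pipeline parallelism, and operator splitting. Data parallelism, model parallelism, pipeline parallelism, and operator splitting are each described in detail below.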
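Data parallelism works as follows: each of multiple devices loads an identical copy of the model, that is, the devices all train the same model. However, the sample data used to train the model differs from device to device. For example, different devices train on different subsets of the data, and the subsets on all devices together make up the complete training set. In addition, each device may synchronize the model parameters across replicas at the end of an iteration. FIG. 1 is a schematic diagram of data parallelism. For example, model 10 and model 11 are the same model; device 0 trains model 10 and device 1 trains model 11. In one training iteration, the sample data used by device 0 to train model 10 is denoted input1, and the sample data used by device 1 to train model 11 is denoted input2, where input1 and input2 may differ. Device 0 outputs a training result from input1, which may be, for example, a gradient. Likewise, device 1 outputs a training result from input2, which may also be a gradient. Because the training results output by device 0 and device 1 may differ, they can be aggregated to obtain an aggregated result. The parameters of model 10 and of model 11 are then each updated according to the aggregated result, so that the two models have the same parameters, and the next training iteration proceeds.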
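The aggregation step described above can be illustrated with a minimal sketch. It assumes a one-parameter model y = w * x, a mean-squared-error loss, and gradient averaging as the aggregation; none of these choices come from the disclosure, and the function names are hypothetical.

```python
# Toy data-parallel training: two replicas of the same model y = w * x
# each see a different data shard, and their gradients are averaged.
# The model, loss, and averaging rule are illustrative assumptions.

def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    """One iteration: per-device gradients, aggregation, identical update."""
    grads = [gradient(w, shard) for shard in shards]  # one result per device
    aggregated = sum(grads) / len(grads)              # aggregate the results
    return w - lr * aggregated                        # same update on every replica

input1 = [(1.0, 2.0), (2.0, 4.0)]  # shard used by device 0
input2 = [(3.0, 6.0), (4.0, 8.0)]  # shard used by device 1

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, [input1, input2])
print(w)
```

Because every replica applies the same aggregated update, the replicas stay identical after each iteration, which is the property the paragraph above relies on.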
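Model parallelism works as follows: different devices among multiple devices train different layers of the model. For example, a model may include multiple layers (for example, network layers), and different devices are responsible for computing different layers; that is, different layers of the model are assigned to different devices. Specifically, one or more layers of the model may be assigned to the same device. FIG. 2 is a schematic diagram of model parallelism. For example, the model includes layer 1, layer 2, layer 3, and layer 4, where layers 1 and 2 may be assigned to device 0 and layers 3 and 4 to device 1. The input of device 0 is the sample data, and the output of device 0 may serve as the input of device 1.
Pipeline parallelism is an auxiliary parallelization strategy. It can be used on its own or combined with model parallelism.
FIG. 3 is a schematic diagram of pipeline parallelism used on its own. For example, 30 denotes a model trained by some device, and 31 denotes the data set used to train the model, which may include a large amount of sample data. Because a large amount of sample data occupies considerable storage space and computing resources, when pipeline parallelism is used alone the data set 31 can be divided into multiple smaller shards; for example, 32 denotes any one of the shards. The device then feeds the shards into model 30 one after another to train it. Feeding one shard into model 30 yields one training result, which may be, for example, a gradient. For example, if data set 31 is divided into 10 shards, feeding the 10 shards into model 30 in turn yields 10 training results, and further processing these 10 training results yields the model parameters of the model.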
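FIG. 4 is a schematic diagram of pipeline parallelism combined with model parallelism. As shown in FIG. 4, model 40 includes layer 1, layer 2, layer 3, and layer 4. As described above for model parallelism, layers 1 to 4 can each be assigned to a different device for computation. For example, device 1 computes layer 1, device 2 computes layer 2, device 3 computes layer 3, and device 4 computes layer 4. It should be understood that the combination shown in FIG. 4 is only illustrative and not limiting. When pipeline parallelism is combined with model parallelism, the input data 41 of layer 1 can be divided into multiple shards; for example, $F_{0,1}$ and $F_{1,1}$ each denote one shard. Device 1 processes the shards in turn. For example, device 1 first processes shard $F_{0,1}$, and the result is denoted $F_{0,2}$. $F_{0,2}$ serves as the input of layer 2, that is, device 1 sends $F_{0,2}$ to device 2. Likewise, the result of device 2 processing $F_{0,2}$ is denoted $F_{0,3}$, which device 2 sends to device 3 as the input of layer 3. The result of device 3 processing $F_{0,3}$ is denoted $F_{0,4}$, which device 3 sends to device 4 as the input of layer 4. While device 2 is processing $F_{0,2}$, device 1 can process the shard that follows $F_{0,1}$, namely $F_{1,1}$; the flow and subsequent processing of the result of $F_{1,1}$ are similar to those of $F_{0,1}$ and are not repeated here. In other words, by combining pipeline parallelism with model parallelism, the devices corresponding to different layers of model 40 can compute in parallel. For example, at time t, device 1, device 2, device 3, and device 4 can all compute simultaneously, which improves computational efficiency.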
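The overlap between devices can be made concrete with a small scheduling sketch. It assumes a forward-only pipeline in which every stage takes one time step, and it numbers devices and shards from 0 (the text numbers devices from 1); the function name is hypothetical.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """Return {time_step: [(device, micro_batch), ...]} for a forward-only pipeline.

    Device d handles layer d + 1; micro-batch m reaches device d at time
    step m + d, so different devices work on different micro-batches at
    the same time. One time step per stage is an illustrative assumption.
    """
    schedule = {}
    for m in range(num_micro_batches):
        for d in range(num_stages):
            schedule.setdefault(m + d, []).append((d, m))
    return schedule

sched = pipeline_schedule(num_stages=4, num_micro_batches=4)
for t in sorted(sched):
    print(t, sched[t])
```

At time step 3 in this sketch all four devices are busy, each with a different shard, which corresponds to the "time t" situation described above.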
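Operator splitting works as follows: each layer of the model includes one or more operators, and the operators in each layer are used to train part of the model's parameters. The number of parameters may differ between layers or be the same. FIG. 5 is a schematic diagram of operator splitting. For example, the storage part and the computation part of the operators in each layer are split across device 0 and device 1 for storage and computation. In some embodiments, any one or more layers of the model may be split. The storage part of an operator may be the parameters trained by that operator.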
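As a rough illustration of splitting both the storage and the computation of an operator, the sketch below splits the weight matrix of a matrix multiplication by columns, lets each "device" compute its slice of the output, and concatenates the slices. Column-wise splitting of a matmul is an assumed example, not a scheme stated in the disclosure, and all names are hypothetical.

```python
def matmul(x, w):
    """Plain matrix product of x (n x k) and w (k x m), as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split the columns of w into `parts` contiguous shards (the stored part)."""
    m = len(w[0])
    bounds = [m * i // parts for i in range(parts + 1)]
    return [[row[bounds[i]:bounds[i + 1]] for row in w] for i in range(parts)]

def split_matmul(x, w, parts=2):
    """Each 'device' holds one column shard of w and computes its output slice."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenate the partial outputs column-wise to rebuild the full result.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(split_matmul(x, w))
```

Each shard of `w` is only stored and used on one device, so both memory and compute per device shrink as the number of parts grows.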
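The devices described above may be computing devices such as a graphics processing unit (GPU) or a central processing unit (CPU).
It should be understood that the model described above may be a neural network model, a deep learning model, or another type of model; that is, the embodiments of the present disclosure do not specifically limit the model. Deep learning is a branch of machine learning: a class of algorithms that use artificial neural networks as the architecture to perform representation learning on data. A deep learning model may also be called a deep neural network model; specifically, a neural network model with three or more layers may be a deep learning model.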
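FIG. 6 is a flowchart of the model processing method provided by an embodiment of the present disclosure. The method includes the following steps:
S601. Obtain a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained, the parallelization strategy including at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting.
In this embodiment the model to be trained may be, for example, a neural network model or deep learning model to be trained. It may be the user's single-machine, single-card model. The model in this embodiment can be applied in fields such as computer vision, natural language processing, and knowledge graphs.
The model processing method of this embodiment may be executed by a distributed training framework. The framework may be deployed on one or more machines, which may include computers, computer clusters, servers, and so on, where servers may include cloud servers, remote servers, and the like. Each machine may include multiple devices, such as GPUs or CPUs.
In one possible application scenario, the distributed training framework is deployed on a cloud server. This embodiment does not limit the number of cloud servers; there may be one or several. One is used here for illustration, for example the cloud server 70 shown in FIG. 7. The model to be trained is stored on the user terminal 71. This embodiment does not limit the product form of terminal 71, which may be, for example, a laptop, a desktop computer, a tablet, or a personal computer (PC). Specifically, terminal 71 may send the model to be trained to cloud server 70; after receiving it, cloud server 70 generates the computation graph corresponding to the model, denoted the first computation graph. The first computation graph may be the user's original single-machine, single-card computation graph. Alternatively, terminal 71 may generate the first computation graph from the local model to be trained and send it to cloud server 70. Cloud server 70 may then train the model with the distributed training framework and return the training result to terminal 71.
In another possible application scenario, cloud server 70 may send the distributed training framework to terminal 71, and terminal 71 may deploy the framework on terminal 71 or on another device. For example, once the framework is deployed on terminal 71, terminal 71 may generate the first computation graph from the local model to be trained, feed it to the framework as input, train the model through the framework, and output the training result. The distributed training framework of this embodiment supports multiple parallelization strategies: it supports pipeline parallelism, model parallelism, data parallelism, or operator splitting individually, and also combinations of two or more of them. In other words, the framework supports both a single parallelization strategy and hybrid parallelism, so it can train the model to be trained with either.
In addition, when the distributed training framework obtains the first computation graph of the model to be trained, it may further obtain the model's parallelization strategy, which may be at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting. That is, the parallelization strategy may be a single strategy or a hybrid strategy.
In one possible implementation, obtaining the parallelization strategy of the model to be trained includes: determining the parallelization strategy according to the first computation graph corresponding to the model to be trained.
For example, when the distributed training framework obtains the first computation graph, it can determine the parallelization strategy from that graph. For instance, the framework may analyze the first computation graph using a machine learning method such as reinforcement learning to determine feature information of the model to be trained, and determine the parallelization strategy from that feature information.
In another possible implementation, obtaining the parallelization strategy of the model to be trained includes: obtaining the parallelization strategy selected by the user for the model to be trained.
For example, the distributed training framework provides a user interface through which the user can select the parallelization strategy. Specifically, the user interface offers pipeline parallelism, model parallelism, data parallelism, and operator splitting, and the user can select one or more of them as the strategy for the model to be trained. As one possible implementation, when the framework obtains the user's selection, it may also compute metrics such as the computing resources, cost, time, and performance required by the selected strategy or combination of strategies. The framework may then feed these metrics back to the user so that the user can adjust or confirm the selection.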
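S602. Add parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph.
For example, 81 in FIG. 8 denotes the first computation graph corresponding to the model to be trained, and 811, 812, 813, and 814 denote different layers of the model. According to the parallelization strategy, the distributed training framework adds parallelization information to the first computation graph 81 to obtain the second computation graph 82. The parallelization information is information related to the parallelization strategy. For example, the strategy may be: model parallelism between a first part and a second part of the model, data parallelism within the first part, and operator splitting within the second part. The parallelization information may include the parallelization strategy itself. It may also include parameter information of the strategy, such as the number of devices required for data parallelism, or the number of pieces or shards into which the object of operator splitting is divided.
Optionally, adding parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph, includes: dividing the first computation graph into a plurality of first subgraphs according to the parallelization strategy of the model to be trained; and adding parallelization information to each of the first subgraphs according to the parallelization strategy, to obtain the second computation graph.
For example, the distributed training framework may divide the first computation graph into multiple subgraphs according to the parallelization strategy, each of which may include one or more layers of the model to be trained. For example, the framework divides the first computation graph into subgraph 1 and subgraph 2 shown in FIG. 8, each denoted a first subgraph. Subgraph 1 corresponds to the second part of the model and subgraph 2 to the first part. Parallelization information is then added to each of subgraph 1 and subgraph 2 according to the strategy, yielding the second computation graph 82.
Optionally, the parallelization information includes parallelization information between different first subgraphs and parallelization information within each first subgraph.
For example, the parallelization information that the framework adds to subgraph 1 may include the parallelization information within subgraph 1 and also the parallelization information between subgraph 1 and subgraph 2. Likewise, the information added to subgraph 2 may include the parallelization information within subgraph 2 and that between subgraph 1 and subgraph 2.
Optionally, the parallelization information between different first subgraphs includes the parallelization strategy used between them, which includes pipeline parallelism and/or model parallelism.
For example, the parallelization information between subgraph 1 and subgraph 2 includes the strategy used between them, which is model parallelism.
Optionally, the parallelization information between different first subgraphs further includes parameter information of the parallelization strategy used between them.
For example, in some other embodiments the strategy between subgraph 1 and subgraph 2 adds pipeline parallelism on top of model parallelism. As described above, pipeline parallelism can divide the model's sample data set into multiple smaller shards. When pipeline parallelism is combined with model parallelism, the input data of subgraph 1 and the input data of subgraph 2 can each be divided into multiple smaller shards. In this case, the parallelization information between subgraph 1 and subgraph 2 includes not only the strategy used between them but also its parameter information, for example the number of shards into which the input data of subgraph 1 is divided and the number of shards into which the input data of subgraph 2 is divided. These two numbers may be the same or different. The parameter information may be configured by the user or by the distributed training framework.
Optionally, the parallelization information within each first subgraph includes the parallelization strategy within that subgraph, which includes data parallelism and/or operator splitting.
For example, in FIG. 8 operator splitting is used within subgraph 1 and data parallelism within subgraph 2. In other embodiments, data parallelism may be used within subgraph 1 and operator splitting within subgraph 2. The strategy used within subgraph 1 may also be the same as that used within subgraph 2. Furthermore, the strategy within a subgraph may itself be hybrid; for example, a combination of data parallelism and operator splitting may be used within subgraph 1 and/or within subgraph 2.
Optionally, the parallelization information within each first subgraph further includes parameter information of the parallelization strategy within that subgraph.
For example, when operator splitting is used within subgraph 1 and data parallelism within subgraph 2, the parallelization information within subgraph 1 may further include parameter information of the operator splitting, such as the number of pieces into which subgraph 1 is split. Likewise, the information within subgraph 2 may further include parameter information of the data parallelism, such as how many devices are needed to execute it. The parameter information of the strategy within subgraph 1 or subgraph 2 may be configured by the user or by the distributed training framework.
As shown in FIG. 8, the dashed border of each of subgraph 1 and subgraph 2, together with circle 1 and circle 2, represents the parallelization information between subgraph 1 and subgraph 2, namely that the two subgraphs are assigned to different devices for computation. The parallelization information within subgraph 1 may be the dashed boxes inside 811, indicating that operator splitting is used within subgraph 1. The number of dashed boxes inside 811 identifies the number of pieces into which subgraph 1 is split; for example, two dashed boxes inside 811 indicate that subgraph 1 is split across two devices. Parallelization information can similarly be represented inside subgraph 2, for example identifying data parallelism and the number of devices it requires, such as 3.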
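One way to picture the second computation graph is as the first graph's layers grouped into subgraphs, each carrying an intra-subgraph annotation and an inter-subgraph annotation. The sketch below is only a hypothetical data structure: the class, field names, strategy labels, and the assignment of layer 811 alone to subgraph 1 are assumptions made for illustration, not the disclosed representation.

```python
from dataclasses import dataclass, field

@dataclass
class Subgraph:
    """A first subgraph annotated with parallelization information (hypothetical)."""
    name: str
    layers: list
    intra: dict = field(default_factory=dict)  # strategy inside the subgraph
    inter: dict = field(default_factory=dict)  # strategy between subgraphs

def add_parallelization_info(layers, plan):
    """Split a layer list into annotated subgraphs (a stand-in for the second graph).

    `plan` is a list of (name, layer_count, intra_strategy, params) tuples.
    The inter-subgraph strategy is fixed to model parallelism in this sketch.
    """
    subgraphs, start = [], 0
    for name, count, strategy, params in plan:
        subgraphs.append(Subgraph(
            name=name,
            layers=layers[start:start + count],
            intra={"strategy": strategy, **params},
            inter={"strategy": "model_parallel"},
        ))
        start += count
    return subgraphs

first_graph = ["layer811", "layer812", "layer813", "layer814"]
second_graph = add_parallelization_info(first_graph, [
    ("subgraph1", 1, "split", {"num_shards": 2}),        # operator splitting, 2 pieces
    ("subgraph2", 3, "replica", {"num_replicas": 3}),    # data parallelism, 3 devices
])
for sg in second_graph:
    print(sg)
```

The original layers are left untouched; only annotations are attached, which matches the idea of adding information to the graph rather than rewriting the model.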
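S603. Determine a distributed computation graph according to the second computation graph and computing resources.
For example, physical devices 0 to 4 shown in FIG. 8 are specific hardware computing resources. They may come from the same machine or from different machines. In this embodiment a physical device may be a computing device such as the GPU or CPU described above, and may also be a virtual machine. The distributed training framework can divide the hardware computing resources into multiple virtual devices; this embodiment does not limit the division method. For example, physical device 0 and physical device 1 may be grouped into virtual device 1, and physical devices 2, 3, and 4 into virtual device 2. The distributed computation graph 83 is then obtained from the second computation graph 82 and the physical devices included in each virtual device. The distributed computation graph 83 indicates that physical devices 0 and 1 perform the operator-split computation of subgraph 1, and physical devices 2, 3, and 4 perform the data-parallel computation of subgraph 2.
S604. Train the model to be trained according to the distributed computation graph.
For example, the distributed computation graph 83 is input to a training engine, Tensorflow or PyTorch, which executes the training process. Tensorflow is an open-source machine learning platform for machine learning tasks such as image, speech, and language understanding. PyTorch is an open-source Python machine learning library based on Torch, used in artificial intelligence fields such as natural language processing.
In some other possible application scenarios, terminal 71 may send computing resource information, together with the model to be trained or its first computation graph, to cloud server 70, and cloud server 70 may determine the distributed computation graph from them. Training the model according to that distributed computation graph may then be carried out on other servers or in training engines provided by other servers. For example, cloud server 70 may send the distributed computation graph to terminal 71; after receiving it through terminal 71, the user can train the model to be trained according to it on other servers or in training engines provided by other servers. In the embodiments of the present disclosure, a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained are obtained, the parallelization strategy including at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting; parallelization information is added to the first computation graph according to the parallelization strategy to obtain a second computation graph; a distributed computation graph is determined according to the second computation graph and computing resources; and the model to be trained is trained according to the distributed computation graph. Multiple parallelization strategies are thereby supported through computation-graph editing, so that they can be integrated into a single system, realizing a distributed training framework capable of supporting multiple parallelization strategies.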
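On the basis of the above embodiment, determining the distributed computation graph according to the second computation graph and computing resources includes the following steps, as shown in FIG. 9:
S901. Divide the computing resources to obtain one or more virtual devices, each virtual device including one or more physical devices.
For example, physical device 0 and physical device 1 are grouped into virtual device 1, and physical device 2, physical device 3, and physical device 4 are grouped into virtual device 2.
Optionally, dividing the computing resources to obtain one or more virtual devices includes: dividing the computing resources according to the parallelization information to obtain one or more virtual devices.
For example, physical devices 0 to 4 may be divided according to the parallelization information in the second computation graph 82. The parallelization information in the second computation graph 82 indicates that the first computation graph 81 is divided into two subgraphs, so physical devices 0 to 4 can be divided into two virtual devices, for example virtual device 1 and virtual device 2.
S902. Convert the second computation graph into a third computation graph according to the parallelization information.
As shown in FIG. 10, once the second computation graph 82 is obtained, it can be converted into the third computation graph 84 according to the parallelization information it contains.
Optionally, converting the second computation graph into a third computation graph according to the parallelization information includes: converting each first subgraph into distributed second subgraphs according to the parallelization information of each of the first subgraphs; and connecting the distributed second subgraphs corresponding to each first subgraph according to the connection relationship between the first subgraphs, to obtain the third computation graph.
For example, subgraph 1 and subgraph 2 are each denoted a first subgraph, and each first subgraph can be converted into distributed second subgraphs according to its parallelization information. For example, subgraph 11 and subgraph 12 are the distributed second subgraphs obtained by converting subgraph 1, and subgraph 21, subgraph 22, and subgraph 23 are the distributed second subgraphs obtained by converting subgraph 2. Then, according to the connection relationship between subgraph 1 and subgraph 2, subgraph 11 is connected to each of subgraph 21, subgraph 22, and subgraph 23, and subgraph 12 is connected to each of subgraph 21, subgraph 22, and subgraph 23, yielding the third computation graph 84.
S903. Map the third computation graph to physical devices to obtain the distributed computation graph.
For example, the third computation graph 84 is mapped to physical devices to obtain the distributed computation graph 83.
Optionally, mapping the third computation graph to physical devices includes: mapping each second subgraph in the third computation graph to a physical device.
For example, each second subgraph in the third computation graph 84 is mapped to one physical device.
Optionally, mapping each second subgraph in the third computation graph to a physical device includes: mapping each first subgraph to a virtual device; and mapping each second subgraph corresponding to the first subgraph to a physical device included in the virtual device corresponding to the first subgraph.
For example, subgraph 1 is mapped to virtual device 1, and subgraph 11 and subgraph 12, which correspond to subgraph 1, are each mapped to a physical device included in virtual device 1: subgraph 11 to physical device 0 in virtual device 1, and subgraph 12 to physical device 1 in virtual device 1. Likewise, subgraph 2 is mapped to virtual device 2, and subgraph 21, subgraph 22, and subgraph 23, which correspond to subgraph 2, are each mapped to a physical device included in virtual device 2: subgraph 21 to physical device 2, subgraph 22 to physical device 3, and subgraph 23 to physical device 4.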
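Steps S901 to S903 can be sketched end to end on the example above. The sketch assumes that each first subgraph is described only by a name and a degree of parallelism, that virtual devices are carved out of the device list in order, and that consecutive first subgraphs are fully connected; the function and variable names are hypothetical.

```python
def build_distributed_graph(subgraphs, physical_devices):
    """Sketch of S901 to S903 for first subgraphs given as (name, degree) pairs."""
    # S901: carve one virtual device per first subgraph, sized by its degree.
    virtual, cursor = {}, 0
    for name, degree in subgraphs:
        virtual[name] = physical_devices[cursor:cursor + degree]
        cursor += degree
    # S902: expand each first subgraph into `degree` distributed second subgraphs.
    second = {name: [f"{name}{i + 1}" for i in range(degree)]
              for name, degree in subgraphs}
    # Connect every second subgraph of one stage to every one of the next stage.
    edges = [(a, b)
             for (n1, _), (n2, _) in zip(subgraphs, subgraphs[1:])
             for a in second[n1] for b in second[n2]]
    # S903: place each second subgraph on a physical device of its virtual device.
    placement = {s: dev
                 for name, _ in subgraphs
                 for s, dev in zip(second[name], virtual[name])}
    return placement, edges

devices = ["dev0", "dev1", "dev2", "dev3", "dev4"]
placement, edges = build_distributed_graph([("1", 2), ("2", 3)], devices)
print(placement)
print(edges)
```

With subgraph 1 at degree 2 and subgraph 2 at degree 3, the sketch reproduces the placement in the example: subgraphs 11 and 12 land on devices 0 and 1, and subgraphs 21, 22, and 23 on devices 2, 3, and 4. The sketch does not check that enough physical devices are available.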
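In other embodiments, subgraph 1 may instead be split across three devices, as shown in FIG. 11. Optionally, physical device 0 and physical device 3 may be the same device or different devices. Likewise, physical device 1 and physical device 4 may be the same device or different devices, and physical device 2 and physical device 5 may be the same device or different devices.
In this embodiment, the computing resources are divided to obtain one or more virtual devices, each including one or more physical devices; the second computation graph is converted into a third computation graph according to the parallelization information; and the third computation graph is mapped to physical devices to obtain the distributed computation graph. The computing resources can thus be fully used, improving their utilization.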
如图12所示的120表示分布式训练框架的结构示意图。分布式训练框架120的输入可以是如上所述的第一计算图。分布式训练框架120的输出可以是训练结果。
如图12所示,该分布式训练框架120包括接口层,接口层包括用户接口,该用户接口包括scopes和cluster。用户通过scopes和cluster可以给待训练模型配置并行化策略。
其中,scopes用于标识待训练模型不同部分的并行化策略。例如,scopes具体可以是replica(数据并行)、split(算子拆分)、pipeline(流水并行)、stage(模型并行)中的至少一个,也就是说,scopes可以是replica(数据并行)、split(算子拆分)、pipeline(流水并行)、stage(模型并行)中的任意一个,或者也可以是两个或两个以上的组合。不同的scopes用于指定不同的并行化策略。另外,scopes接口支持嵌套使用,从而可以将不同的并行化策略进行嵌套使用来实现各种混合并行策略加速分布式训练。用户可以通过scopes接口将待训练模型划分为多个子图,并且给每个子图配置一个scopes。
如图12所示的cluster用于对计算资源进行划分,其中计算资源也可以称为硬件资源。该计算资源具体可以是GPU或CPU等。例如,cluster用于将计算资源划分为多个虚拟计算设备。进一步,根据待训练模型的并行化策略将用户通过scopes划分出的子图映射到虚拟计算设备上,该映射过程对于用户可以是完全透明的。
下面通过几个具体示例来介绍一下通过用户接口如何构建各种不同的并行化策略。
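1) Data parallelism is constructed as follows:
with whale.cluster():
    with whale.replica():
        USER_MODEL_DEFINATION()
Here, USER_MODEL_DEFINATION() denotes the user's original code, that is, the code corresponding to the model to be trained. with whale.replica(): denotes the data parallelism strategy that the user configures for the model to be trained. with whale.cluster(): denotes a call to the cluster interface. In other words, the user does not need to modify the original code; wrapping the original code in a replica scope and a cluster is enough for the distributed training framework to perform data-parallel distributed training of the model to be trained.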
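2) A hybrid parallel strategy in which data parallelism nests pipeline parallelism and model parallelism is constructed as follows:
with whale.cluster():
    with whale.replica():
        with whale.pipeline(num_micro_batch=4):
            with whale.stage():
                USER_MODEL_DEFINATION_PART_1()
            with whale.stage():
                USER_MODEL_DEFINATION_PART_2()
Here, USER_MODEL_DEFINATION_PART_1() denotes the first part of the model to be trained, and USER_MODEL_DEFINATION_PART_2() denotes the second part. The division into a first part and a second part may be made by the user. The two with whale.stage(): statements denote the model parallelism strategy that the user configures for the first part and the second part, respectively. with whale.pipeline(num_micro_batch=4): denotes the pipeline parallelism strategy configured for the first part and the second part. with whale.replica(): denotes the data parallelism strategy configured for the first part and the second part. In other words, the user does not need to modify the original code and only needs to add stage scopes to the original code in order to divide the model to be trained, for example, into a first part and a second part, where the first part may correspond to one subgraph and the second part to another subgraph. Adding a pipeline scope around the first part and the second part makes the distributed training framework train the model with pipeline parallelism. On this basis, if data-parallel training is also required, a replica scope can be added around the pipeline scope.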
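3) A hybrid parallel strategy combining operator splitting and data parallelism is constructed as follows:
with whale.cluster():
    with whale.replica():
        USER_MODEL_DEFINATION_PART_1()
    with whale.split(split_dim="length"):
        USER_MODEL_DEFINATION_PART_2()
Here, with whale.replica(): denotes the data parallelism strategy that the user configures for the first part of the model to be trained, and with whale.split(split_dim="length"): denotes the operator splitting strategy that the user configures for the second part. In other words, for operator splitting, the user adds a split scope around the part of the model that needs to be split; for data parallelism, the user adds a replica scope around the part of the model that needs data parallelism.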
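It can be understood that the construction methods described above are only illustrative and are not limiting. For example, other parallelization strategies can be constructed in other embodiments. That is, replica (data parallelism), split (operator splitting), pipeline (pipeline parallelism), and stage (model parallelism) can be used individually or in combination, and the specific combination is not limited. They can also be nested, and the specific nesting manner or nesting order is likewise not limited.
In some other embodiments, when the distributed training framework 120 receives the first computation graph, it may automatically add a parallelization strategy to the first computation graph through the scopes in the interface layer.
In this embodiment, replica (data parallelism), split (operator splitting), pipeline (pipeline parallelism), and stage (model parallelism) can be used individually, in combination, or nested, so that the user can construct a wide variety of parallelization strategies, which improves the flexibility of the parallelization strategy. In addition, as the above examples show, the user's original code, that is, the model definition code, can be based on native interfaces such as the Tensorflow interface and the PyTorch interface, and the programming interface used for model definition does not need to be replaced. The user therefore does not need to modify the original code and only needs to add a few application programming interface (API) calls to it in order to compose the desired parallelization strategy.
As shown in FIG. 12, the distributed training framework 120 further includes a model and parallelization intermediate representation layer. The parallelization representation layer within it can parse the parallelization strategy of the model to be trained into corresponding abstractions. For example, this embodiment provides three types of abstraction: Multi-Dimensional Resource, Subgraph Group, and Virtual Device. These three abstractions can be used to unify and express different parallelization strategies. Once the parallelization strategies have been unified under these abstractions, they can be implemented through computation graph editing.
For example, data parallelism and operator splitting can be expressed through Multi-Dimensional Resource. In general, model parameters have multiple dimensions, for example, a data sample dimension, a channel dimension, a height dimension, a width dimension, and a length dimension, denoted N, C, H, W, and L respectively. Data parallelism splits the data sample dimension N. Operator splitting splits dimensions other than N: it may split one of the channel dimension C, the height dimension H, the width dimension W, and the length dimension L, or several of them. The Multi-Dimensional Resource abstraction supports arbitrary splitting or partitioning along different dimensions. As shown in FIG. 13, Batch Sample denotes the data sample dimension, Channel denotes the channel dimension, and Length denotes the length dimension. When the data sample dimension is split, Multi-Dimensional Resource expresses data parallelism. When one or more of the dimensions C, H, W, and L are split, Multi-Dimensional Resource expresses operator splitting. When the data sample dimension and other dimensions are split at the same time, Multi-Dimensional Resource expresses a combination of data parallelism and operator splitting.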
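The rule that distinguishes data parallelism from operator splitting under the Multi-Dimensional Resource abstraction can be sketched in a few lines of Python. The representation is an assumption made for illustration only: a tensor shape is a dictionary keyed by dimension name (N, C, H, W, L), and a split specification gives the number of parts per dimension. The helper names `shard_shape` and `classify` are hypothetical.

```python
from typing import Dict


def shard_shape(shape: Dict[str, int], splits: Dict[str, int]) -> Dict[str, int]:
    """Return the per-device shape after splitting the given dimensions evenly."""
    sharded = dict(shape)
    for dim, parts in splits.items():
        if shape[dim] % parts != 0:
            raise ValueError(f"dimension {dim}={shape[dim]} is not divisible by {parts}")
        sharded[dim] = shape[dim] // parts
    return sharded


def classify(splits: Dict[str, int]) -> str:
    """Name the strategy expressed by a split specification."""
    split_dims = {dim for dim, parts in splits.items() if parts > 1}
    if not split_dims:
        return "no parallelism"
    if split_dims == {"N"}:
        return "data parallelism"          # only the data sample dimension is split
    if "N" not in split_dims:
        return "operator splitting"        # only non-sample dimensions are split
    return "data parallelism + operator splitting"


shape = {"N": 32, "C": 64, "L": 128}
print(classify({"N": 4}), shard_shape(shape, {"N": 4}))
print(classify({"L": 2}), shard_shape(shape, {"L": 2}))
print(classify({"N": 4, "L": 2}), shard_shape(shape, {"N": 4, "L": 2}))
```

Splitting only N gives each device a smaller batch of the same tensor layout, splitting only L gives each device a slice of every sample, and splitting both yields the combined strategy described above.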
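The Subgraph Group abstraction supports dividing the complete computation graph (Graph) of a model, for example the first computation graph described in the foregoing embodiments, into multiple subgraphs (Subgraph). The same or different parallelization strategies can be applied within each subgraph, and subgraphs can communicate with one another. For example, Subgraph Group can be used to express model parallelism and/or pipeline parallelism. Specifically, model parallelism and/or pipeline parallelism may be the parallelization strategy between subgraphs, while data parallelism and/or operator splitting may be the parallelization strategy within a subgraph. As shown in FIG. 14, the first computation graph 81 can be divided in several different ways. For example, it can be divided into the second computation graph 82 described in the foregoing embodiments, or it can be divided as shown at 140 in FIG. 14, that is, into four subgraphs, each of which contains one layer of the model to be trained.
The Virtual Device abstraction supports abstracting multiple physical devices into one virtual device. The physical devices may come from the same machine (single machine, multiple cards) or from several different machines (multiple machines, multiple cards). In some embodiments, the physical devices are GPUs, namely GPU0 to GPU5 as shown in FIG. 15, and they can be partitioned in various ways. For example, when GPU0 to GPU2 come from machine A and GPU3 to GPU5 come from machine B, GPU0 to GPU3 may be grouped into virtual device 0, and GPU4 and GPU5 into virtual device 1. It can be understood that the partitioning shown in FIG. 15 is only illustrative and is not limiting in this embodiment. The user only needs to be aware of the virtual devices and to assign a virtual device to each subgraph. The distributed training framework 120 can associate the virtual devices with specific physical devices according to the network topology of the hardware computing resources.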
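As a small illustration of the Virtual Device abstraction, the sketch below models a virtual device as a named group of (machine, GPU) pairs and reports whether the group spans more than one machine, using the GPU0 to GPU5 example from the text. The `VirtualDevice` class and its methods are hypothetical and only show one way such a grouping could be represented.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class VirtualDevice:
    """A named group of physical GPUs, each identified by (machine, gpu)."""
    name: str
    gpus: Tuple[Tuple[str, str], ...]

    def machines(self) -> Set[str]:
        return {machine for machine, _ in self.gpus}

    def spans_machines(self) -> bool:
        # True when the grouped GPUs come from more than one machine.
        return len(self.machines()) > 1


# GPU0-GPU2 are on machine A and GPU3-GPU5 on machine B, as in the example.
gpus = [("A", "GPU0"), ("A", "GPU1"), ("A", "GPU2"),
        ("B", "GPU3"), ("B", "GPU4"), ("B", "GPU5")]

virtual_device_0 = VirtualDevice("virtual_device_0", tuple(gpus[:4]))  # GPU0-GPU3
virtual_device_1 = VirtualDevice("virtual_device_1", tuple(gpus[4:]))  # GPU4, GPU5

print(virtual_device_0.spans_machines(), virtual_device_1.spans_machines())
```

In this grouping, virtual device 0 crosses the machine boundary while virtual device 1 stays on machine B, which is the kind of topology information a framework could take into account when associating virtual devices with physical ones.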
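In addition, the execution layer of the distributed training framework 120 shown in FIG. 12 can rewrite the second computation graph to construct a third computation graph that can be parallelized, and then convert the third computation graph into the distributed computation graph. Further, the execution layer can send the distributed computation graph to a training engine such as Tensorflow or PyTorch.
In this embodiment, the three abstractions Multi-Dimensional Resource, Subgraph Group, and Virtual Device unify and express the various parallelization strategies, so that the distributed training framework can support arbitrary parallelization strategies and various hybrid parallelization strategies, which solves the problem of being limited to a single parallelization strategy. In addition, this embodiment implements the various parallelization strategies through computation graph editing, so that multiple parallelization strategies can be integrated into a single system, which improves the flexibility and diversity of the parallelization strategies.
On the basis of the foregoing embodiments, training the model to be trained according to the distributed computation graph includes the following steps, as shown in FIG. 16:
S1601. Convert the distributed computation graph into a distributed computation graph that the training engine can recognize.
For example, on the basis of FIG. 8, FIG. 10, and FIG. 11, the distributed computation graph 83 can be converted into a distributed computation graph that a training engine such as Tensorflow or PyTorch can recognize. Specifically, this conversion can be performed by the parallelized computation graph conversion component shown in FIG. 12.
S1602. Input the distributed computation graph that the training engine can recognize into the training engine, and train the model to be trained through the training engine.
As shown in FIG. 12, the distributed training framework 120 further includes a training engine. After the parallelized computation graph conversion component converts the distributed computation graph 83 into a distributed computation graph that Tensorflow or PyTorch can recognize, that graph can be input into the training engine, and the training engine can train the model to be trained.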
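The decoupling that S1601 provides can be illustrated with a small adapter sketch. It deliberately reduces the distributed computation graph to a placement table (second subgraph to GPU index) and converts only the device names into engine-style strings; the `/device:GPU:<n>` and `cuda:<n>` formats follow the device naming conventions commonly used with Tensorflow and PyTorch, but the `CONVERTERS` registry and `convert_for_engine` function are hypothetical and are not part of either engine or of the disclosure.

```python
from typing import Callable, Dict

Placement = Dict[str, int]  # second subgraph name -> GPU index

# One converter per supported engine; adding an engine means adding an entry.
CONVERTERS: Dict[str, Callable[[int], str]] = {
    "tensorflow": lambda gpu: f"/device:GPU:{gpu}",
    "pytorch": lambda gpu: f"cuda:{gpu}",
}


def convert_for_engine(placement: Placement, engine: str) -> Dict[str, str]:
    """Rewrite an engine-neutral placement into engine-specific device strings."""
    try:
        to_device = CONVERTERS[engine]
    except KeyError:
        raise ValueError(f"unsupported engine: {engine}") from None
    return {node: to_device(gpu) for node, gpu in placement.items()}


placement = {"subgraph11": 0, "subgraph12": 1, "subgraph21": 2}
print(convert_for_engine(placement, "tensorflow"))
print(convert_for_engine(placement, "pytorch"))
```

Because the parallelization logic only produces the engine-neutral placement, supporting another training engine in this sketch requires a new converter entry rather than changes to the parallelization strategy itself.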
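In this embodiment, converting the distributed computation graph into a distributed computation graph that the training engine can recognize provides cross-platform compatibility with different training engines such as Tensorflow or PyTorch, which improves the compatibility of the distributed training framework. The conversion also reduces the coupling between the training engine and the parallelization strategy, so that existing training engines remain usable, which improves compatibility with user models.
FIG. 17 is a schematic structural diagram of a model processing apparatus provided by an embodiment of the present disclosure. The model processing apparatus can execute the processing flow provided by the model processing method embodiments. As shown in FIG. 17, the model processing apparatus 170 includes:
an obtaining module 171, configured to obtain a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained, where the parallelization strategy of the model to be trained includes at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting;
an adding module 172, configured to add parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph;
a determining module 173, configured to determine a distributed computation graph according to the second computation graph and computing resources;
a training module 174, configured to train the model to be trained according to the distributed computation graph.
Optionally, the adding module 172 is specifically configured to: divide the first computation graph into a plurality of first subgraphs according to the parallelization strategy of the model to be trained; and add parallelization information to each first subgraph of the plurality of first subgraphs according to the parallelization strategy of the model to be trained, to obtain the second computation graph.
Optionally, the parallelization information includes parallelization information between different first subgraphs and parallelization information within each first subgraph.
Optionally, the parallelization information between different first subgraphs includes: the parallelization strategy adopted between different first subgraphs.
Optionally, the parallelization information between different first subgraphs further includes: parameter information of the parallelization strategy adopted between different first subgraphs.
Optionally, the parallelization strategy adopted between different first subgraphs includes: pipeline parallelism and/or model parallelism.
Optionally, the parallelization information within each first subgraph includes: the parallelization strategy within each first subgraph.
Optionally, the parallelization information within each first subgraph further includes: parameter information of the parallelization strategy within each first subgraph.
Optionally, the parallelization strategy within each first subgraph includes: data parallelism and/or operator splitting.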
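Optionally, the determining module 173 includes:
a partitioning unit 1731, configured to partition the computing resources to obtain one or more virtual devices, where each virtual device includes one or more physical devices;
a conversion unit 1732, configured to convert the second computation graph into a third computation graph according to the parallelization information;
a mapping unit 1733, configured to map the third computation graph to physical devices to obtain the distributed computation graph.
Optionally, the partitioning unit 1731 is specifically configured to: partition the computing resources according to the parallelization information to obtain one or more virtual devices.
Optionally, the conversion unit 1732 is specifically configured to: convert each first subgraph of the plurality of first subgraphs into distributed second subgraphs according to the parallelization information of that first subgraph; and connect the distributed second subgraphs corresponding to each first subgraph according to the connection relationships among the plurality of first subgraphs, to obtain the third computation graph.
Optionally, the mapping unit 1733 is specifically configured to: map each second subgraph in the third computation graph to a physical device.
Optionally, the mapping unit 1733 is specifically configured to: map each first subgraph to one virtual device; and map each second subgraph corresponding to the first subgraph to a physical device included in the virtual device corresponding to that first subgraph.
Optionally, the obtaining module 171 is specifically configured to: determine the parallelization strategy of the model to be trained according to the first computation graph corresponding to the model to be trained.
Optionally, the obtaining module 171 is specifically configured to: obtain the parallelization strategy of the model to be trained selected by the user.
Optionally, the training module 174 is specifically configured to: convert the distributed computation graph into a distributed computation graph that the training engine can recognize; and input the distributed computation graph that the training engine can recognize into the training engine, and train the model to be trained through the training engine.
The model processing apparatus of the embodiment shown in FIG. 17 can be used to execute the technical solutions of the foregoing method embodiments. Its implementation principles and technical effects are similar and are not repeated here.
FIG. 18 is a schematic structural diagram of a model processing device provided by an embodiment of the present disclosure. The model processing device can execute the processing flow provided by the model processing method embodiments. As shown in FIG. 18, the model processing device 180 includes: a memory 181, a processor 182, a computer program, and a communication interface 183, where the computer program is stored in the memory 181 and is configured to be executed by the processor 182 to perform the model processing method described above.
In addition, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the model processing method described in the foregoing embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The foregoing describes only specific implementations of the present disclosure, which enable those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.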
Claims (20)
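- A model processing method, characterized in that the method comprises: obtaining a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained, the parallelization strategy of the model to be trained comprising at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting; adding parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph; determining a distributed computation graph according to the second computation graph and computing resources; and training the model to be trained according to the distributed computation graph.
- The method according to claim 1, characterized in that adding parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph, comprises: dividing the first computation graph into a plurality of first subgraphs according to the parallelization strategy of the model to be trained; and adding parallelization information to each first subgraph of the plurality of first subgraphs according to the parallelization strategy of the model to be trained, to obtain the second computation graph.
- The method according to claim 2, characterized in that the parallelization information comprises parallelization information between different first subgraphs and parallelization information within each first subgraph.
- The method according to claim 3, characterized in that the parallelization information between different first subgraphs comprises: a parallelization strategy adopted between different first subgraphs.
- The method according to claim 4, characterized in that the parallelization information between different first subgraphs further comprises: parameter information of the parallelization strategy adopted between different first subgraphs.
- The method according to claim 4 or 5, characterized in that the parallelization strategy adopted between different first subgraphs comprises: pipeline parallelism and/or model parallelism.
- The method according to claim 3, characterized in that the parallelization information within each first subgraph comprises: a parallelization strategy within each first subgraph.
- The method according to claim 7, characterized in that the parallelization information within each first subgraph further comprises: parameter information of the parallelization strategy within each first subgraph.
- The method according to claim 7 or 8, characterized in that the parallelization strategy within each first subgraph comprises: data parallelism and/or operator splitting.
- The method according to claim 2, characterized in that determining a distributed computation graph according to the second computation graph and computing resources comprises: partitioning the computing resources to obtain one or more virtual devices, each virtual device comprising one or more physical devices; converting the second computation graph into a third computation graph according to the parallelization information; and mapping the third computation graph to physical devices to obtain the distributed computation graph.
- The method according to claim 10, characterized in that partitioning the computing resources to obtain one or more virtual devices comprises: partitioning the computing resources according to the parallelization information to obtain one or more virtual devices.
- The method according to claim 10 or 11, characterized in that converting the second computation graph into a third computation graph according to the parallelization information comprises: converting each first subgraph of the plurality of first subgraphs into distributed second subgraphs according to the parallelization information of that first subgraph; and connecting the distributed second subgraphs corresponding to each first subgraph according to the connection relationships among the plurality of first subgraphs, to obtain the third computation graph.
- The method according to claim 12, characterized in that mapping the third computation graph to physical devices comprises: mapping each second subgraph in the third computation graph to a physical device.
- The method according to claim 13, characterized in that mapping each second subgraph in the third computation graph to a physical device comprises: mapping each first subgraph to one virtual device; and mapping each second subgraph corresponding to the first subgraph to a physical device included in the virtual device corresponding to that first subgraph.
- The method according to claim 1, characterized in that obtaining the parallelization strategy of the model to be trained comprises: determining the parallelization strategy of the model to be trained according to the first computation graph corresponding to the model to be trained.
- The method according to claim 1, characterized in that obtaining the parallelization strategy of the model to be trained comprises: obtaining the parallelization strategy of the model to be trained selected by a user.
- The method according to claim 1, characterized in that training the model to be trained according to the distributed computation graph comprises: converting the distributed computation graph into a distributed computation graph that a training engine can recognize; and inputting the distributed computation graph that the training engine can recognize into the training engine, and training the model to be trained through the training engine.
- A model processing apparatus, characterized by comprising: an obtaining module, configured to obtain a first computation graph corresponding to a model to be trained and a parallelization strategy of the model to be trained, the parallelization strategy of the model to be trained comprising at least one of pipeline parallelism, model parallelism, data parallelism, and operator splitting; an adding module, configured to add parallelization information to the first computation graph according to the parallelization strategy of the model to be trained, to obtain a second computation graph; a determining module, configured to determine a distributed computation graph according to the second computation graph and computing resources; and a training module, configured to train the model to be trained according to the distributed computation graph.
- A model processing device, characterized by comprising: a memory; a processor; and a computer program; wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1 to 17.
- A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 17.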
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/024,901 US20230316450A1 (en) | 2020-09-10 | 2021-09-09 | Model processing method and apparatus, device, and computer-readable storage medium |
| EP21866025.6A EP4213071A4 (en) | 2020-09-10 | 2021-09-09 | MODEL PROCESSING METHOD AND APPARATUS, APPARATUS AND COMPUTER READABLE STORAGE MEDIUM |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010947896.X | 2020-09-10 | ||
| CN202010947896.XA CN114169491B (zh) | 2020-09-10 | 2020-09-10 | 一种模型处理方法、装置、设备及计算机可读存储介质 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022052973A1 true WO2022052973A1 (zh) | 2022-03-17 |
Family
ID=80475640
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/117359 Ceased WO2022052973A1 (zh) | 2020-09-10 | 2021-09-09 | 一种模型处理方法、装置、设备及计算机可读存储介质 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20230316450A1 (zh) |
| EP (1) | EP4213071A4 (zh) |
| CN (1) | CN114169491B (zh) |
| WO (1) | WO2022052973A1 (zh) |
2020-09-10: CN application CN202010947896.XA, publication CN114169491B (zh), status: Active
2021-09-09: US application US18/024,901, publication US20230316450A1 (en), status: Pending
2021-09-09: EP application EP21866025.6A, publication EP4213071A4 (en), status: Pending
2021-09-09: WO application PCT/CN2021/117359, publication WO2022052973A1 (zh), status: Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| CN114169491A (zh) | 2022-03-11 |
| EP4213071A1 (en) | 2023-07-19 |
| US20230316450A1 (en) | 2023-10-05 |
| EP4213071A4 (en) | 2024-01-24 |
| CN114169491B (zh) | 2025-12-16 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21866025; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021866025; Country of ref document: EP; Effective date: 20230411 |