WO2017084020A1 - Model parameter fusion method and device - Google Patents
Model parameter fusion method and device
- Publication number
- WO2017084020A1 (application PCT/CN2015/094746)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- fusion
- node
- nodes
- model parameter
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q3/00—Selecting arrangements
- H04Q3/42—Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker
- H04Q3/54—Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker in which the logic circuitry controlling the exchange is centralised
- H04Q3/545—Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker in which the logic circuitry controlling the exchange is centralised using a stored program
- H04Q3/54541—Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker in which the logic circuitry controlling the exchange is centralised using a stored program using multi-processor systems
- H04Q3/5455—Multi-processor, parallelism, distributed systems
Definitions
- The present invention relates to the field of machine learning, and in particular to a method and device for fusing model parameters.
- A model parameter is a parameter, composed of multiple constraint parameters, that describes a model.
- Model parameters can be used to select data with common features. For example, when the model parameters are image-class model parameters, different model parameters can be used to select image data of persons, animals, or faces from a set of image data. With the rapid growth of data volume and data types, more and more model parameters are used for data screening, and these model parameters are obtained through multiple rounds of calculation and fusion over large amounts of data with common features.
- In model parameter fusion, the data is divided into multiple data subsets that are assigned to different nodes, and each node trains on its assigned subset using an iterative calculation method. After every one or more iterations, the model parameters that the nodes obtained by training on different data subsets are fused once, and the fused model parameters are used as the initial model parameters for the next iteration. After multiple fusions, the final total model parameters are obtained.
- There are mainly two methods for model parameter fusion. In the first, after multiple nodes have performed multiple iterations on multiple data subsets, a parameter server collects and fuses the model parameters that each node obtained by training, producing new model parameters; each node then performs the next iteration on its data subsets according to the new model parameters. In the second, after a node has performed multiple iterations on its assigned data subset, it sends the model parameters obtained by training on that subset to designated other nodes, to be fused with the model parameters of those nodes; the node then begins its next iterative calculation according to the model parameters it has received from other nodes, which those nodes obtained by training on other data subsets.
- In the first method, the parameter server that performs the model parameter fusion must meet high performance requirements and is prone to downtime.
- The second method requires more data to be stored and involves a large volume of data transmission.
- Embodiments of the present invention provide a model parameter fusion method and device, to address the high performance requirements on the parameter server and the large data transmission volume in model parameter fusion.
- A first aspect provides a model parameter fusion method applied to a machine learning system comprising M nodes, the method comprising:
- the i-th node divides its model parameters into N blocks, where the i-th node is any one of the N nodes participating in the fusion among the M nodes, 1 ≤ i ≤ N ≤ M, and the i-th of the N blocks is the i-th block model parameter;
- the i-th node receives the i-th block model parameters sent by each of the other nodes among the N nodes;
- the i-th node fuses its own i-th block model parameter with the i-th block model parameters sent by the other nodes to obtain the total model parameter of the i-th block;
- the i-th node distributes the total model parameter of the i-th block to the other nodes among the N nodes.
- When the i-th node receives the i-th block model parameters sent by the other nodes among the N nodes and distributes the total model parameter of the i-th block to them, a full-duplex data transmission mode can be adopted; that is, the i-th node can receive data from other nodes while sending data to other nodes, for example by using a full-duplex network card. The present invention does not limit this.
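- The four steps above can be sketched as a single-process simulation. This is only an illustrative sketch: the function names `split_blocks` and `fuse_round` are hypothetical, the parameter vector length is assumed to be divisible by N, and element-wise averaging is assumed as the fusion operation, which the text above does not specify.

```python
from typing import List


def split_blocks(params: List[float], n: int) -> List[List[float]]:
    """Divide a flat parameter vector into n contiguous blocks (assumes len % n == 0)."""
    size = len(params) // n
    return [params[b * size:(b + 1) * size] for b in range(n)]


def fuse_round(node_params: List[List[float]]) -> List[List[float]]:
    """Simulate one fusion round among N nodes and return each node's new parameters."""
    n = len(node_params)
    blocks = [split_blocks(p, n) for p in node_params]  # blocks[node][block]

    # Node i collects the i-th block from every node (its own included) and fuses them.
    # Averaging is an assumption; the source does not fix the fusion operator.
    fused = []
    for i in range(n):
        received = [blocks[src][i] for src in range(n)]
        fused.append([sum(values) / n for values in zip(*received)])

    # Node i distributes fused block i, so every node ends up with all fused blocks.
    total = [x for block in fused for x in block]
    return [list(total) for _ in range(n)]


nodes = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
result = fuse_round(nodes)
print(result)
```

With two nodes, node 1 fuses the first halves and node 2 fuses the second halves, so no single node has to receive the complete parameters of every other node.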
- The N nodes participating in the fusion are determined from the M nodes by using a preset fusion condition. The fusion condition may be that the number of nodes that have completed the iterative calculation reaches a preset value, where the preset value may be constant or may change at each fusion; or that the number of times the specified calculation has been completed reaches a preset number, where the preset number may be constant or may change at each fusion; or that the iterative calculation has run for a preset duration, where the preset duration may be constant or may differ at each fusion.
- The fusion condition may also be another condition, which is not specifically limited in the present invention.
- When determining the N nodes, the fusion controller may select from among the nodes that have already completed a fusion and the nodes that have not yet fused but have completed the specified calculation.
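- The three alternative fusion conditions can be expressed as a small predicate. The sketch below is hypothetical: the class `FusionCondition`, its field names, and the choice to treat any configured threshold being reached as sufficient are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FusionCondition:
    """Thresholds (when used as presets) or current readings (when used as observations)."""
    ready_nodes: Optional[int] = None        # nodes that finished the iterative calculation
    completed_calcs: Optional[int] = None    # times the specified calculation was completed
    elapsed_seconds: Optional[float] = None  # how long the iterative calculation has run


def condition_met(preset: FusionCondition, observed: FusionCondition) -> bool:
    """Return True if any configured preset has been reached by the observed value."""
    pairs = [
        (preset.ready_nodes, observed.ready_nodes),
        (preset.completed_calcs, observed.completed_calcs),
        (preset.elapsed_seconds, observed.elapsed_seconds),
    ]
    return any(p is not None and o is not None and o >= p for p, o in pairs)


# Fuse once three nodes are ready; the preset could be changed before the next fusion.
preset = FusionCondition(ready_nodes=3)
print(condition_met(preset, FusionCondition(ready_nodes=2)))
print(condition_met(preset, FusionCondition(ready_nodes=3)))
```

Because the presets are plain fields, a controller could replace them between fusions, which matches the statement that each preset may be constant or may change.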
- Before the i-th node divides its model parameters into N blocks, the method further includes:
- the k-th node sends its address and fusion state information to the fusion controller, where the fusion state information includes the calculation state and/or the iteration count of the node, and the k-th node is a node among the M nodes that has completed the specified iterative task, 1 ≤ k ≤ M;
- the i-th node receives fusion indication information sent by the fusion controller, where the fusion indication information is sent after the fusion controller determines, according to the received address and fusion state information of the k-th node, the N nodes that satisfy the fusion condition; the fusion indication information includes the addresses and/or numbers of the N nodes, and the number N of nodes that the fusion controller determines to satisfy the fusion condition may be the same or different each time.
- After the k-th node completes the specified calculation, it sends its own address and its currently recorded fusion state information to the fusion controller. The role of fusion controller can be served by a fixed node, and the fusion controller determines the N nodes participating in the fusion according to the fusion condition described above.
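- A minimal sketch of this reporting flow follows. The class `FusionController`, its `report` method, the example addresses, and the use of a node-count threshold as the fusion condition are all hypothetical; the indication is modelled as a list of (number, address) pairs.

```python
from typing import Dict, List, Optional, Tuple


class FusionController:
    """Collects reports from nodes and issues fusion indications (illustrative only)."""

    def __init__(self, min_ready: int) -> None:
        self.min_ready = min_ready       # assumed fusion condition: enough nodes are ready
        self.ready: Dict[str, int] = {}  # address -> reported iteration count

    def report(self, address: str, iterations: int) -> Optional[List[Tuple[int, str]]]:
        """Record one node's report; return the fusion indication once the condition holds."""
        self.ready[address] = iterations
        if len(self.ready) < self.min_ready:
            return None
        group = sorted(self.ready)       # the N nodes chosen for this fusion
        self.ready.clear()               # the next fusion starts from a fresh set of reports
        return list(enumerate(group, start=1))


controller = FusionController(min_ready=2)
first = controller.report("10.0.0.1", iterations=5)
second = controller.report("10.0.0.2", iterations=5)
print(first, second)
```

Each node that receives the indication learns both N (the length of the list) and its own block number, which is what it needs to divide and route its parameters.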
- The fusion controller is a first node, where the first node is, in rotation, any one of T nodes among the M nodes, T ≤ M.
- Before the k-th node sends its address and fusion state information to the fusion controller, the method further includes:
- the k-th node receives the address of the first node, sent by the first node; that is, the node currently serving as the fusion controller sends its own address to the k-th node.
- The k-th node sending its address and fusion state information to the fusion controller includes:
- the k-th node sends its address and fusion state information to the first node according to the address of the first node; that is, when the role of fusion controller rotates to any other node, the k-th node sends its own address and fusion state information to the node currently serving as the fusion controller.
- the fusion state information may be the calculation state and/or the number of iterations of the node.
- The first node may be any one of the T nodes among the M nodes, and the role may rotate: the current fusion controller can designate any one of the T nodes among the M nodes as the next fusion controller, that node can in turn designate the node that will serve as the fusion controller after it, and so on.
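- The rotation can be illustrated with a simple policy. The text allows the current controller to designate any of the T eligible nodes; the round-robin order used here, the function name `next_controller`, and the node names are assumptions chosen only to keep the sketch short.

```python
from typing import List


def next_controller(candidates: List[str], current: str) -> str:
    """Return the candidate that follows `current`, wrapping around at the end."""
    position = candidates.index(current)
    return candidates[(position + 1) % len(candidates)]


candidates = ["node-1", "node-2", "node-3"]  # the T eligible nodes (hypothetical names)
controller = "node-1"
history = [controller]
for _ in range(4):
    controller = next_controller(candidates, controller)
    history.append(controller)
print(history)
```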
- Before the i-th node divides its model parameters into N blocks, the method further includes:
- the k-th node broadcasts its address and fusion state information to each of the M nodes; that is, each of the M nodes can record the address and fusion state information of the k-th node at the same time;
- the k-th node receives fusion indication information sent by a second node, where the second node is any one of K nodes among the M nodes, and the fusion indication information is sent by the second node after it determines, according to the received address and fusion state information of the k-th node, the N nodes that satisfy the fusion condition; the fusion indication information includes the addresses and/or numbers of the N nodes. That is, any one of the nodes that simultaneously record the address and fusion state information of the k-th node can serve as the second node, and the second node acts as the fusion controller.
- the method further includes:
- the i-th node sends its j-th block model parameter to the j-th node among the N nodes, where 1 ≤ j ≤ N and j ≠ i.
- That is, the i-th node sends the blocks of its divided model parameters other than the i-th block to the other nodes among the N nodes, sending the j-th block to the node with the same number, the j-th node, which is responsible for the model parameter fusion of the j-th block.
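- The routing rule (block j goes to node j, for every j other than i) can be sketched as follows. The helper names `block_bounds` and `outgoing` are hypothetical, and giving the first blocks one extra element when the vector length is not divisible by N is an assumption; the text does not say how uneven lengths are handled.

```python
from typing import Dict, List


def block_bounds(length: int, n: int) -> List[range]:
    """Index ranges of n contiguous blocks; the first `length % n` blocks get one extra item."""
    base, extra = divmod(length, n)
    bounds, start = [], 0
    for b in range(n):
        size = base + (1 if b < extra else 0)
        bounds.append(range(start, start + size))
        start += size
    return bounds


def outgoing(i: int, params: List[float], n: int) -> Dict[int, List[float]]:
    """Blocks that node i (1-based) sends out: block j goes to node j for every j != i."""
    bounds = block_bounds(len(params), n)
    return {j: [params[k] for k in bounds[j - 1]] for j in range(1, n + 1) if j != i}


messages = outgoing(2, [0.1, 0.2, 0.3, 0.4, 0.5], n=3)
print(messages)
```

Node 2 keeps its own block 2 locally and sends only blocks 1 and 3, so each node transmits N - 1 blocks in this step.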
- the method further includes:
- the i-th node performs an iterative calculation according to the new total model parameter.
- A second aspect provides a model parameter fusion method applied to a machine learning system that includes M nodes, the method including:
- the fusion controller receives the address and fusion state information sent by each node among the M nodes that has completed the specified calculation, where the fusion state information includes the calculation state and/or the iteration count of the node;
- the fusion controller determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, where the number N of nodes that the fusion controller determines to satisfy the fusion condition may be the same or different each time;
- the fusion controller sends fusion indication information to each of the N nodes, the fusion indication information including the addresses and/or numbers of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends the i-th block of its divided model parameters to the i-th node, where 1 ≤ i ≤ N; each of the N nodes fuses the model parameters it receives and distributes the fused model parameters to the other nodes among the N nodes.
- The fusion condition is that the number of nodes that have completed the specified calculation reaches a preset value; or that the number of times the specified calculation has been completed reaches a preset number; or that a preset duration has elapsed.
- The preset value, the preset number, and the preset duration may be set in advance, and each may be constant or may change, which is not limited by the present invention.
- When determining the N nodes, the fusion controller may select from among the nodes that have already completed a fusion and the nodes that have not yet fused but have completed the specified calculation.
- The fusion controller is a first node, where the first node is any one of K nodes among the M nodes; before the fusion controller receives the address and fusion state information sent by the nodes among the M nodes that have completed the specified calculation, the method further includes:
- the first node, as the fusion controller of a first time period, sends its address to the other nodes among the M nodes.
- When serving as the fusion controller of the first time period, the first node may be any one of the K nodes among the M nodes, and it may be specified in advance.
- the method further includes:
- after a preset condition is met, the first node determines a second node as the fusion controller of a second time period, where the second node is any one of the K nodes among the M nodes, K ≤ M;
- the first node sends node fusion information to the second node, where the node fusion information includes the addresses and fusion state information of the M nodes;
- the first node sends the address of the second node to the nodes other than the second node.
- The preset condition may be that a certain time has elapsed, that a certain number of fusions has been reached, or that a certain number of iterations has been reached, which is not limited by the present invention.
- The time, the number of fusions, and the number of iterations may be set in advance, and each may be fixed or may change.
- That is, the first node, which acts as the fusion controller in the first time period, designates the node that will act as the fusion controller in the second time period, referred to as the second node; the second node takes over from the first node as the fusion controller. The first node sends the node fusion information of the M nodes to the second node, and the second node sends its own address to the other nodes so that, after completing a fusion, the other nodes report their addresses and fusion state information to it.
- If the second node fails in the second time period, the first node determines a third node as the fusion controller of the second time period, where the third node is any one of the K nodes among the M nodes.
- That is, the first node, which previously acted as the fusion controller, re-determines one of the M nodes as the fusion controller in the second time period; the re-determined node can be referred to as the third node.
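- The handover and the fallback to a third node can be sketched as follows. The `Node` class, its fields, the `hand_over` function, and the rule of taking the first healthy candidate are all hypothetical; how a failure is detected is outside the scope of this sketch and is represented only by a set of failed addresses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    address: str
    controller: str = ""  # address of the current fusion controller
    fusion_info: Dict[str, int] = field(default_factory=dict)  # address -> iteration count


def hand_over(first: Node, candidates: List[Node], nodes: List[Node], failed: Set[str]) -> Node:
    """Pick a healthy successor, copy the node fusion information, announce its address."""
    healthy = [n for n in candidates if n is not first and n.address not in failed]
    if not healthy:
        raise RuntimeError("no healthy candidate available")
    successor = healthy[0]
    successor.fusion_info = dict(first.fusion_info)  # node fusion information of the M nodes
    for n in nodes:
        n.controller = successor.address             # every node learns where to report
    return successor


a, b, c = Node("a"), Node("b"), Node("c")
a.fusion_info = {"c": 7}
second = hand_over(a, [a, b, c], [a, b, c], failed=set())  # b takes over
third = hand_over(a, [a, b, c], [a, b, c], failed={"b"})   # b has failed, so c takes over
print(second.address, third.address)
```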
- The fusion controller is at least one of the M nodes. The at least one node receives the address and fusion state information that each node sends after completing the specified calculation; the fusion controller determining, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition and sending fusion indication information to each of the N nodes then includes: any one of the at least one node determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sends the fusion indication information to each of the N nodes.
- That is, when one or more of the M nodes simultaneously record the node fusion information of the M nodes, each node, after completing the fusion, sends its own address and fusion state information (such as its calculation state and iteration count) to the at least one node that records the node fusion information; any one of the at least one node then determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sends the fusion indication information to each of the N nodes.
- a model parameter fusion device for use in a machine learning system, the machine learning system comprising M nodes, the device comprising:
- a dividing unit configured to divide its own model parameters into N blocks, where N is the number of model parameter fusion devices participating in the fusion among the M model parameter fusion devices, the i-th of the N blocks is the i-th block model parameter, and 1 ≤ i ≤ N ≤ M;
- a first receiving unit configured to receive the i-th block model parameters sent by each of the other model parameter fusion devices among the N model parameter fusion devices;
- a merging unit configured to fuse its own i-th block model parameter with the i-th block model parameters sent by the other model parameter fusion devices to obtain the total model parameter of the i-th block;
- a first sending unit configured to distribute the total model parameter of the i-th block to the other model parameter fusion devices among the N model parameter fusion devices.
- When receiving the i-th block model parameters sent by the other model parameter fusion devices among the N model parameter fusion devices and distributing the total model parameter of the i-th block to them, a full-duplex data transmission mode can be adopted; that is, data sent by other model parameter fusion devices can be received while data is being sent to other model parameter fusion devices, for example by using a full-duplex network card. This is not limited in this embodiment of the present invention.
- The N model parameter fusion devices participating in the fusion are determined from the M model parameter fusion devices by using a preset fusion condition. The fusion condition may be that the number of nodes that have completed the iterative calculation reaches a preset value, where the preset value may be constant or may change at each fusion; or that the number of times the specified calculation has been completed reaches a preset number, where the preset number may be constant or may change at each fusion; or that the iterative calculation has run for a preset duration, where the preset duration may be constant or may differ at each fusion.
- The fusion condition may also be another condition, which is not specifically limited in this embodiment of the present invention.
- the device further includes:
- a second sending unit configured to send its own address and fusion state information to the fusion controller after completing the specified iterative task, where the fusion state information includes a calculation state and/or an iteration number of the model parameter fusion device;
- a second receiving unit configured to receive fusion indication information, where the fusion indication information is sent by the fusion controller after it determines, according to the received addresses and fusion state information of K model parameter fusion devices, the N model parameter fusion devices that satisfy the fusion condition; the fusion indication information includes the addresses and/or numbers of the N model parameter fusion devices, and the K model parameter fusion devices are the model parameter fusion devices among the M model parameter fusion devices that have completed the specified iterative task, 1 ≤ K ≤ M.
- The fusion controller is a first model parameter fusion device, where the first model parameter fusion device is, in rotation, any one of T model parameter fusion devices among the M model parameter fusion devices, T ≤ M; the device further includes:
- a third receiving unit configured to receive an address of the first model parameter fusion device sent by the first model parameter fusion device
- the second sending unit is specifically configured to:
- When the fusion controller is the first model parameter fusion device, the first model parameter fusion device may be any one of the K model parameter fusion devices among the M model parameter fusion devices, and the role may rotate: the first model parameter fusion device can designate any one of the K model parameter fusion devices among the M model parameter fusion devices as the next fusion controller, that device can in turn designate the one after it, and so on.
- the device further includes:
- a broadcast unit configured to broadcast its own address and fusion state information to each of the M model parameter fusion devices; that is, each of the M model parameter fusion devices can record the address and fusion state information at the same time;
- a fourth receiving unit configured to receive fusion indication information sent by a second model parameter fusion device, where the second model parameter fusion device is any one of K model parameter fusion devices among the M model parameter fusion devices, and the fusion indication information is sent by the second model parameter fusion device after it determines, according to the received addresses and fusion state information of the K model parameter fusion devices, the N model parameter fusion devices that satisfy the fusion condition; the fusion indication information includes the addresses and/or numbers of the N nodes, and the K model parameter fusion devices are the model parameter fusion devices among the M model parameter fusion devices that have completed the specified iterative task, 1 ≤ K ≤ M.
- That is, any one of the model parameter fusion devices that simultaneously record the addresses and fusion state information of the M model parameter fusion devices can serve as the second model parameter fusion device, and the second model parameter fusion device acts as the fusion controller.
- the device further includes:
- a fifth receiving unit configured to receive the addresses and fusion state information of K model parameter fusion devices, where the fusion state information includes the calculation state and/or the iteration count of the model parameter fusion device, and the K model parameter fusion devices are the model parameter fusion devices among the M model parameter fusion devices that have completed the specified iterative task, 1 ≤ K ≤ M;
- a determining unit configured to determine, according to the received address and fusion state information of the K model parameter fusion devices, the N model parameter fusion devices that meet the fusion condition;
- a third sending unit configured to send fusion indication information to the other model parameter fusion devices among the N model parameter fusion devices, so that those devices perform parameter fusion according to the fusion indication information, where the fusion indication information includes the addresses and/or numbers of the N nodes.
- the device further includes:
- a fourth sending unit configured to send its own address to the other model parameter fusion devices among the M model parameter fusion devices, so that those devices send their own addresses and fusion state information according to the received address.
- the device further includes:
- a fifth sending unit configured to send its own j-th block model parameter to the j-th model parameter fusion device among the N model parameter fusion devices, where 1 ≤ j ≤ N and j ≠ i.
- That is, the model parameter blocks other than the i-th block of the divided model parameters are sent to the other model parameter fusion devices among the N model parameter fusion devices, with the j-th block sent to the device with the same number, the j-th model parameter fusion device, which is responsible for fusing the model parameters of the j-th block.
- the device further includes:
- a sixth receiving unit configured to receive the total model parameter of the j-th block, fused by and sent from the j-th model parameter fusion device;
- a summary unit configured to combine the received total model parameters of the blocks into a new total model parameter;
- a calculation unit configured to perform an iterative calculation according to the new total model parameter.
- a model parameter fusion device for use in a machine learning system, the machine learning system comprising M nodes, the device comprising:
- a receiving unit configured to receive the address and fusion state information sent by each node among the M nodes that has completed the specified calculation, where the fusion state information includes the calculation state and/or the iteration count of the node;
- a first determining unit configured to determine, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, where the number N of nodes determined to satisfy the fusion condition may be the same or different each time;
- a first sending unit configured to send fusion indication information to each of the N nodes, where the fusion indication information includes the addresses and/or numbers of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends the i-th block of its divided model parameters to the i-th node, where 1 ≤ i ≤ N; each of the N nodes fuses the model parameters it receives and distributes the fused model parameters to the other nodes among the N nodes.
- The fusion condition is that the number of nodes that have completed the specified calculation reaches a preset value; or that the number of times the specified calculation has been completed reaches a preset number; or that a preset duration has elapsed.
- the first determining unit is further specifically configured to:
- The model parameter fusion device is a first node, where the first node is any one of the M nodes; the device further includes:
- a second sending unit configured to send an address of the first node to other nodes of the M nodes except the first node.
- the device further includes:
- a second determining unit configured to determine, after a preset condition is met, a second node as the model parameter fusion device of a second time period, where the second node is any one of K nodes among the M nodes, K ≤ M;
- a third sending unit configured to send node fusion information to the second node, where the node fusion information includes address and fusion state information of the M nodes;
- a fourth sending unit configured to send an address of the second node to another node other than the second node.
- The preset condition may be that a certain time has elapsed, that a certain number of fusions has been reached, or that a certain number of iterations has been reached, which is not limited by the present invention.
- The time, the number of fusions, and the number of iterations may be set in advance, and each may be fixed or may change.
- the device further includes:
- a third determining unit configured to: if the second node fails in the second time period, determine a third node as the model parameter fusion device of the second time period, where the third node is any one of the K nodes among the M nodes.
- That is, the third determining unit re-determines one of the M nodes as the model parameter fusion device in the second time period; this node may be referred to as the third node.
- The model parameter fusion device receiving the address and fusion state information that each node sends after completing the specified calculation, determining the N nodes that satisfy the fusion condition, and sending the fusion indication information to each of the N nodes includes: any one of the at least one node determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sends the fusion indication information to each of the N nodes.
- That is, when one or more of the M nodes simultaneously record the node fusion information of the M nodes, each node, after completing the fusion, sends its own address and fusion state information (such as its calculation state and iteration count) to the at least one node that records the node fusion information; any one of the at least one node then determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sends the fusion indication information to each of the N nodes.
- a node comprising a processor and a memory, the memory storing code and data, the processor being operable to execute the code in the memory so as to perform the model parameter fusion method according to the first aspect or any one of the first to fifth possible implementations of the first aspect.
- a fusion controller comprising a processor and a memory, the memory storing code and data, the processor being operable to execute the code in the memory so as to perform the model parameter fusion method according to the second aspect or any one of the first to fourth possible implementations of the second aspect.
- a machine learning system comprising the node of the fifth aspect, and the fusion controller of the sixth aspect.
- the fusion controller is configured to be independent of the node or configured on the node.
- In the model parameter fusion method and device provided by embodiments of the present invention, N nodes satisfying a fusion condition are determined; the i-th node divides its model parameters into N blocks, receives the i-th block model parameters sent by each of the other nodes among the N nodes, fuses its own i-th block model parameter with those received to obtain the total model parameter of the i-th block, and distributes the total model parameter of the i-th block to the other nodes among the N nodes, where the i-th node is any one of the N nodes participating in the fusion. This addresses dynamic adjustment of computing resources and provides the ability to dynamically remove and add nodes.
- each node participating in the fusion can send and receive model parameters at the same time, which improves the utilization of network resources and the stability of the system.
- FIG. 1 is a schematic structural diagram of a machine learning system according to an embodiment of the present invention.
- FIG. 2 is a flowchart of a model parameter fusion method according to an embodiment of the present invention.
- FIG. 3 is a schematic diagram of an i-th node receiving the i-th block model parameters according to an embodiment of the present invention.
- FIG. 4 is a schematic diagram of an i-th node fusing and sending the total model parameters of the i-th block according to an embodiment of the present invention.
- FIG. 5 is a schematic structural diagram of a first model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 6 is a schematic structural diagram of a second model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 7 is a schematic structural diagram of a third model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 8 is a schematic structural diagram of a fourth model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 8a is a schematic structural diagram of a fifth model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 8b is a schematic structural diagram of a sixth model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 8c is a schematic structural diagram of a seventh model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 9 is a schematic structural diagram of an eighth model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 10 is a schematic structural diagram of a ninth model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 11 is a schematic structural diagram of a tenth model parameter fusion apparatus according to an embodiment of the present invention.
- FIG. 12 is a schematic structural diagram of a node according to an embodiment of the present disclosure.
- FIG. 13 is a schematic structural diagram of a fusion controller according to an embodiment of the present invention.
- the machine learning system architecture applied by the embodiment of the present invention is shown in FIG. 1.
- the system architecture diagram includes a data storage device 101, a model parameter training platform 102, and a model parameter storage device 103.
- the data storage device 101 can be a data storage server 101.
- the data storage server 101 can be used to store raw data for model parameter training.
- the storage capacity of the data storage server 101 is much larger than the storage capacity of the computing server 1021 in the model parameter training platform 102.
- the original data may be language data, image data, video data, etc. The original data is composed of a plurality of data sets, and each data set further comprises a plurality of type subsets, each type subset carrying a data label that indicates its category.
- the type subsets included in the same data set have the same label.
- for example, a data set may contain multiple person images carrying a person label, or multiple animal images carrying an animal label, or images of other categories, and so on.
- the model parameter training platform 102 includes computing servers 1021 for iterative computing, which may also be referred to as nodes and may each be a general computer, a mobile terminal, a workstation, a general-purpose server, a dedicated server, or the like, and a switch 1022 for data communication between the computing servers.
- the computing server 1021 has local storage and its capacity is smaller than the data storage server 101.
- each computing server reads a portion of the data from the data storage server 101 into its local storage by sampling, for use in model parameter training.
- the model parameter training platform 102 can obtain a total model parameter of the final fusion output by performing model parameter training fusion on the data set with the data label, and the data type of the new data can be identified by the total model parameter.
- for example, when model parameter fusion is performed using an image data set with a person label, person images in new image data can be identified by the finally output model parameters.
- when model parameter fusion is performed using an image data set with an animal label, animal images or the like in new image data can be identified by the finally output model parameters.
- the model parameter storage server 103 is configured to store the model parameters obtained by the training.
- the model parameters obtained by the final fusion may be sent to the model parameter storage server 103 to be stored by the model parameter storage server 103.
- the model parameters initially used by the computing server 1021 in the model parameter training platform 102 for model parameter training and fusion may also be acquired from the model parameter storage server 103.
- FIG. 2 is a flowchart of a model parameter fusion method according to an embodiment of the present invention. The method is applied to a machine learning system, where the machine learning system includes M nodes, and the method includes the following steps.
- Step 201: A node for performing model parameter fusion acquires a data subset of the data set.
- the data set refers to a data set used for iterative calculation of model parameters. The data set may be language data, image data, video data, etc., and is composed of multiple type subsets; each type subset has a data label representing its category, and the labels of the type subsets included in the same data set are the same.
- the data set may be stored in advance in a storage device such as a hard disk or a magnetic disk, or may be stored in advance in a data storage server.
- when acquiring a data subset from the data set, the node may obtain the data subset directly from a storage device connected to the device where the node is located, or obtain the data from the data storage server, and so on.
- when the node acquires the data subset from the data set, the node can extract a certain amount of data from the data set; if the computing power of each node is known in advance, the amount of data in the data subset acquired by a node can be allocated according to the computing power of that node.
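- A minimal sketch of allocating data subset sizes according to computing power. The function name `allocate_subset_sizes`, the use of integer capacity weights, and the proportional rule itself are assumptions for illustration; the description only states that the amount of data can be allocated according to the computing power of each node.

```python
def allocate_subset_sizes(total_samples, compute_power):
    """Split total_samples across nodes in proportion to compute_power.

    compute_power holds one positive integer weight per node (a hypothetical
    unit, for example samples processed per second).
    """
    total_power = sum(compute_power)
    # Floor of each node's proportional share.
    sizes = [total_samples * p // total_power for p in compute_power]
    # Hand out the samples lost to rounding, most powerful nodes first.
    leftover = total_samples - sum(sizes)
    strongest_first = sorted(range(len(sizes)), key=lambda n: -compute_power[n])
    for n in strongest_first[:leftover]:
        sizes[n] += 1
    return sizes


sizes = allocate_subset_sizes(1000, [1, 2, 2])
```

- With the weights above, the first node is assigned 200 samples and the other two 400 each.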
- Step 202: Each node performs a specified iterative calculation based on the data subset and the current model parameters.
- when performing the iterative calculation for the first time, each node can perform the iterative calculation based on the acquired data subset and the initial model parameters.
- when an iterative calculation is completed, each node can perform the next iterative calculation based on the data subset and the currently obtained model parameters.
- the initial model parameters refer to the initial model parameters of each node, and the initial model parameters of each node may be the same.
- the currently obtained model parameters refer to the model parameters calculated by each node to complete the current iteration, or the currently received model parameters, that is, the current latest model parameters.
- Step 203: The kth node sends the address and the fusion state information of the kth node to the fusion controller, where the fusion state information includes the calculation state and/or the number of iterations of the node, and the kth node is a node among the M nodes that has completed the specified iteration task, 1 ≤ k ≤ M.
- the M nodes included in the machine learning system each perform the specified iterative calculation based on the acquired data subset and model parameters, and when any one of the M nodes completes the specified calculation, that node sends its own address and fusion state information to the fusion controller.
- the fusion state information includes a calculation state and/or an iteration number of the node; that is, when the kth node sends the fusion state information to the fusion controller, it may send the current calculation state, or the number of iterations currently completed, or both, where the calculation state indicates whether the specified iterative calculation has been completed.
- the address of the kth node may be an IP address of the node, a MAC (Media Access Control) address, also referred to as a physical address, or a number of the node, and the like.
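- A minimal sketch of the message a node might send in step 203. The class name `FusionReport` and its field names are hypothetical; the description only requires an address plus a calculation state and/or an iteration count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FusionReport:
    address: str                      # IP address, MAC address, or node number
    calc_done: Optional[bool] = None  # calculation state: specified iteration finished?
    iterations: Optional[int] = None  # number of iterations completed so far

    def __post_init__(self):
        # The fusion state information carries at least one of the two fields.
        if self.calc_done is None and self.iterations is None:
            raise ValueError("need a calculation state and/or an iteration count")


report = FusionReport(address="10.0.0.7", calc_done=True, iterations=50)
```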
- Step 204: The fusion controller receives the addresses and fusion state information sent by the nodes among the M nodes that have completed the specified calculation, determines the N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes.
- the fusion indication information is sent by the fusion controller after it determines, based on the received address and fusion state information of the kth node, the N nodes that satisfy the fusion condition, and the fusion indication information includes the addresses and/or numbers of the N nodes.
- the number N of nodes satisfying the fusion condition that the fusion controller determines may be the same or different each time.
- the fusion indication information includes addresses and/or numbers of the N nodes.
- the N nodes participating in the fusion are determined from the M nodes by using a preset fusion condition.
- the fusion condition may be that the number of nodes that have completed the iterative calculation reaches a preset value, where the preset value may be constant or variable at each fusion; or
- the fusion condition is that the number of times the specified calculation has been completed reaches a preset number, where the preset number may be constant or variable at each fusion; or
- the fusion condition is that the iterative calculation has been performed for a preset duration, where the preset duration may be constant or different at each fusion.
- the fusion condition may also be another condition, which is not specifically limited in the present invention.
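- The three example fusion conditions can be sketched as a single predicate. The function name and every parameter name below are hypothetical; thresholds are passed on each call so that they may stay constant or change from one fusion to the next.

```python
import time


def fusion_condition_met(finished_nodes, completed_rounds, started_at,
                         min_nodes=None, min_rounds=None, max_seconds=None,
                         now=None):
    """Return True if any configured fusion condition holds."""
    now = time.monotonic() if now is None else now
    # Condition 1: enough nodes have completed the iterative calculation.
    if min_nodes is not None and len(finished_nodes) >= min_nodes:
        return True
    # Condition 2: the specified calculation has been completed enough times.
    if min_rounds is not None and completed_rounds >= min_rounds:
        return True
    # Condition 3: the iterative calculation has run for the preset duration.
    if max_seconds is not None and now - started_at >= max_seconds:
        return True
    return False
```

- Leaving a threshold as `None` disables that condition, so a deployment can use any one of the three on its own.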
- the fusion controller can determine the N nodes from among the nodes that have completed fusion and the nodes that have not completed fusion but have completed the specified calculation.
- the role of fusion controller may be taken by a fixed node, or by different nodes in turn, or by at least one node in a distributed manner. These three types of fusion controller are described below.
- in the first type, the fusion controller is a fixed node that can be set in advance, and any one of the M nodes can send its own address and fusion state information to the fixed fusion controller after completing the specified calculation.
- the fixed fusion controller determines the N nodes satisfying the fusion condition based on the received address and the fusion state information, and sends the fusion indication information to each of the N nodes.
- in the second type, different nodes take turns acting as the fusion controller.
- the node that first takes its turn as the fusion controller may be called the first node, and the first node is any one of T nodes among the M nodes, T ≤ M.
- since different nodes take turns acting as the fusion controller, the M nodes need to know where to send their addresses and fusion state information after completing the specified calculation. Therefore, before the kth node sends its address and fusion state information to the fusion controller in step 203, the kth node receives the address of the first node sent by the first node; that is, the node currently serving as the fusion controller sends its own address to the M nodes.
- the kth node sending the address and the fusion state information of the kth node to the fusion controller then includes: the kth node sends the address and the fusion state information of the kth node to the first node according to the address of the first node.
- the first node, acting as the fusion controller for a first time period, receives the addresses and fusion state information sent by the nodes that have completed the specified iterative calculation, determines the N nodes that satisfy the fusion condition based on the received addresses and fusion state information, and sends the fusion indication information to each of the N nodes.
- the first node determines a second node as the fusion controller for a second time period, where the second node is any one of K nodes among the M nodes; the first node sends node fusion information to the second node, where the node fusion information includes the addresses and fusion state information of the M nodes; and the first node sends the address of the second node to the nodes other than the second node.
- the first node may designate any node of the K nodes among the M nodes as the node that acts as the fusion controller next time, and the designated node acts as the second node for the second time period.
- the first node sends the node fusion information of the M nodes to the next fusion controller, and sends the address of the next fusion controller to the other nodes.
- the node that acts as the fusion controller next time can specify the next node to act as the fusion controller, and so on.
- the first node may determine a third node as the fusion controller for the second time period, where the third node is any one of the K nodes among the M nodes.
- that is, the first node that previously acted as the fusion controller re-determines any one of the K nodes among the M nodes as the fusion controller for the second time period.
- the re-determined node can be called the third node.
- the preset condition may be that a certain period of time has elapsed, that a certain number of fusions have been performed, or that a certain number of iterations have been completed, etc., which is not limited by the embodiment of the present invention.
- the period of time, the number of fusions, and the number of iterations may be set in advance, and each of them may be fixed or may vary from one setting to the next, which is not limited by the embodiment of the present invention.
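- A minimal sketch of how the role of fusion controller might pass from one node to the next in the second type. The `Node` class, the `hand_over` function, and the random choice of successor are assumptions for illustration; the description only says that the current controller designates a node, sends it the node fusion information, and sends the successor's address to the other nodes.

```python
import random


class Node:
    def __init__(self, address):
        self.address = address
        self.controller_address = None  # where this node sends its reports
        self.node_fusion_info = {}      # address -> fusion state, kept by the controller


def hand_over(current, candidates, all_nodes, rng=random):
    """Current controller designates a successor and announces its address."""
    successor = rng.choice(candidates)  # assumption: successor picked at random
    # Transfer the recorded addresses and fusion state information.
    successor.node_fusion_info = dict(current.node_fusion_info)
    # Tell every node except the successor where to report from now on.
    for node in all_nodes:
        if node is not successor:
            node.controller_address = successor.address
    return successor


nodes = [Node("a"), Node("b"), Node("c")]
nodes[0].node_fusion_info = {"b": {"iterations": 3}}
second = hand_over(nodes[0], [nodes[1]], nodes)
```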
- in the third type, at least one node acts as the fusion controller in a distributed manner, and the at least one node may be all or part of the M nodes.
- the kth node broadcasts the address and fusion state information of the kth node to each of the M nodes, and at least one of the M nodes receives the address and fusion state information that the kth node sends after completing the specified calculation.
- the kth node is any node of the M nodes that completes the specified calculation.
- when one or more of the M nodes simultaneously record the node fusion information of the M nodes, each node, after completing the fusion, sends its own address and fusion state information (such as the calculation state of the node and/or the number of iterations) to at least one of the nodes that simultaneously record the node fusion information.
- any one of the at least one node determines the N nodes satisfying the fusion condition according to the received address and fusion state information of the kth node, and sends the fusion indication information to each of the N nodes. Each node receives the fusion indication information sent by any one of the at least one node; the fusion indication information is sent after the N nodes that satisfy the fusion condition are determined, and the fusion indication information includes addresses and/or numbers of the N nodes.
- the number of a node is used to uniquely represent the node, and may be a sequence number randomly assigned to the node or any other value randomly assigned to the node, etc., which is not limited in this embodiment of the present invention.
- Step 205: When receiving the fusion indication information, the i-th node divides the model parameters of the i-th node into N blocks, and receives the i-th block model parameters respectively sent by the nodes other than the i-th node among the N nodes.
- the i-th node is any one of the N nodes participating in the fusion among the M nodes, 1 ≤ i ≤ N ≤ M, and the i-th block among the N blocks into which the model parameters are divided is the i-th block model parameter.
- the i-th block model parameter refers to a block model parameter corresponding to the i-th node among the divided N-block model parameters, and the i-th node is responsible for performing subsequent fusion operations on the i-th block model parameters.
- the i-th node among the N nodes participating in the fusion divides the model parameters of the i-th node into N blocks, where each block of model parameters corresponds to one node, and the corresponding node performs the subsequent model parameter fusion operation.
- the i-th block model parameter corresponds to the i-th node, and the i-th node is responsible for performing the subsequent fusion operation on it.
- when the i-th node receives the i-th block model parameters respectively sent by the nodes other than the i-th node among the N nodes, and when it distributes the total model parameters of the i-th block to the other nodes of the N nodes, full-duplex data transmission can be adopted; that is, the i-th node can receive data sent by other nodes while sending data to other nodes, for example by using a full-duplex network card, which is not limited in the present invention.
- Step 206: The i-th node sends the j-th block model parameter of the i-th node to the j-th node of the N nodes, where 1 ≤ j ≤ N and j ≠ i.
- the i-th node sends the blocks of its divided model parameters other than the i-th block to the other nodes of the N nodes; that is, the j-th block model parameter is sent to the j-th node, and the j-th node is responsible for the model parameter fusion of the j-th block.
- the jth block model parameter here is the model parameter corresponding to the jth node among the divided N block model parameters, and the jth node is responsible for the subsequent fusion operation.
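- The division in step 205 and the dispatch in step 206 can be sketched as follows. The function names, the flat-list representation of model parameters, the contiguous near-equal split, and the 0-based node index are all assumptions for illustration; the description does not specify how the N blocks are formed.

```python
def split_into_blocks(params, n):
    """Split a flat parameter list into n contiguous blocks of near-equal size."""
    base, extra = divmod(len(params), n)
    blocks, start = [], 0
    for b in range(n):
        size = base + (1 if b < extra else 0)  # first `extra` blocks get one more
        blocks.append(params[start:start + size])
        start += size
    return blocks


def outgoing_blocks(i, params, n):
    """Blocks node i (0-based) sends in step 206: block j goes to node j, j != i."""
    blocks = split_into_blocks(params, n)
    return {j: blocks[j] for j in range(n) if j != i}


to_send = outgoing_blocks(0, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 3)
```

- Every participating node has to apply the same split so that block j means the same parameter range everywhere; node 0 above keeps its own block 0 and sends blocks 1 and 2.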
- Step 207: The i-th node fuses the i-th block model parameter of the i-th node with the i-th block model parameters respectively sent by the other nodes to obtain the total model parameters of the i-th block, and distributes the total model parameters of the i-th block to the other nodes of the N nodes.
- that is, the i-th node fuses the i-th block of its own model parameters with the i-th blocks of the model parameters sent by the other nodes to obtain the total model parameters of the i-th block, and then distributes the total model parameters of the i-th block to the other nodes of the N nodes.
- Step 208: The i-th node receives the total model parameters of the j-th block, fused and sent by the j-th node, and aggregates the corresponding parts of the fused total model parameters received from the nodes other than the i-th node among the N nodes to generate the new total model parameters of the i-th node.
- when the j-th node among the N nodes participating in the fusion obtains the total model parameters of the j-th block through fusion, the j-th node sends the total model parameters of the j-th block to the i-th node, and the i-th node receives the total model parameters of the j-th block fused by the j-th node, where 1 ≤ j ≤ N and j ≠ i.
- the i-th node aggregates the fused total model parameter blocks received from the nodes other than the i-th node among the N nodes, together with the total model parameters of the i-th block obtained by its own fusion, to obtain the new total model parameters after fusion by the N nodes.
- the i-th node may then return to step 202 and perform iterative calculation based on the data subset and the new total model parameters obtained from the fusion of the N nodes, until the final model parameters are output.
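- Steps 205 to 208 can be simulated in a single process as follows. The function name `fuse_round` is hypothetical, and fusing by element-wise averaging is an assumption; the description does not state which fusion operation is used. Model parameters are represented as flat lists of equal length, and nodes are indexed from 0.

```python
def fuse_round(node_params):
    """Simulate one fusion among N nodes; returns each node's new total parameters."""
    n = len(node_params)
    size = len(node_params[0])
    # Every node uses the same block boundaries (step 205).
    bounds = [(b * size // n, (b + 1) * size // n) for b in range(n)]

    # Steps 205-207: the node that owns a block collects that block from
    # every node (its own included) and fuses it, here by averaging.
    fused_blocks = []
    for lo, hi in bounds:
        received = [params[lo:hi] for params in node_params]
        fused = [sum(values) / n for values in zip(*received)]
        fused_blocks.append(fused)

    # Step 208: each node concatenates the N fused blocks it now holds.
    new_total = [value for block in fused_blocks for value in block]
    return [list(new_total) for _ in range(n)]


result = fuse_round([[1.0, 2.0, 3.0, 4.0],
                     [3.0, 4.0, 5.0, 6.0]])
```

- With two nodes, node 0 fuses the first half of the parameters and node 1 the second half, and both end the round holding `[2.0, 3.0, 4.0, 5.0]`.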
- in the model parameter fusion method provided by the embodiment of the present invention, the fusion controller determines N nodes satisfying the fusion condition, and the i-th node divides its model parameters into N blocks and receives the i-th block model parameters respectively sent by the nodes other than the i-th node among the N nodes.
- the i-th node then fuses its own i-th block model parameters with the i-th block model parameters sent by the other nodes to obtain the total model parameters of the i-th block, and finally distributes the total model parameters of the i-th block to the other nodes of the N nodes, where the i-th node is any one of the N nodes participating in the fusion.
- this addresses the problem of dynamically adjusting computing resources and provides the ability to dynamically delete and add nodes.
- each node participating in the fusion can send and receive model parameters at the same time, which improves the utilization of network resources and the stability of the system.
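- As a rough worked estimate of the traffic implied by steps 206 and 207, assume (this is not stated in the description) that each node holds P model parameters and that the N blocks are of equal size P/N. Each node then sends N−1 blocks in step 206 and distributes its fused block to N−1 nodes in step 207:

```latex
S_{\text{sent per node}}
  = \underbrace{(N-1)\,\frac{P}{N}}_{\text{step 206}}
  + \underbrace{(N-1)\,\frac{P}{N}}_{\text{step 207}}
  = \frac{2(N-1)}{N}\,P \;<\; 2P
```

- By symmetry each node also receives the same amount, so under these assumptions the per-node traffic stays below 2P regardless of how many nodes take part in the fusion.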
- FIG. 5 is a schematic structural diagram of a model parameter fusion apparatus according to an embodiment of the present disclosure, which is applied to a machine learning system, where the machine learning system includes M nodes, as shown in FIG. 5, the apparatus includes:
- the dividing unit 301 is configured to divide the model parameters of the model parameter fusion device into N blocks, where N is the number of model parameter fusion devices participating in the fusion among the M model parameter fusion devices, the i-th block among the N blocks into which the model parameters are divided is the i-th block model parameter, and 1 ≤ i ≤ N ≤ M.
- the i-th block model parameter herein refers to the block, among the divided N blocks of model parameters, that corresponds to the i-th model parameter fusion device.
- the i-th model parameter fusion device is responsible for performing the subsequent fusion operation on the i-th block model parameters.
- the first receiving unit 302 is configured to receive respective ith block model parameters respectively sent by the model parameter fusion devices of the N model parameter fusion devices except themselves;
- the merging unit 303 is configured to fuse its own i-th block model parameters with the i-th block model parameters respectively sent by the other model parameter fusion devices to obtain the total model parameters of the i-th block;
- the first sending unit 304 is configured to distribute the total model parameters of the i-th block to other model parameter fusion devices of the N model parameter fusion devices except itself.
- when the i-th block model parameters respectively sent by the model parameter fusion devices other than itself among the N model parameter fusion devices are received, and when the total model parameters of the i-th block are distributed to the other model parameter fusion devices among the N model parameter fusion devices, a full-duplex data transmission mode can be adopted; that is, data sent by other model parameter fusion devices can be received while data is being sent to other model parameter fusion devices, for example by using a full-duplex network card, which is not limited in this embodiment of the present invention.
- the N model parameter fusion devices participating in the fusion are determined from the M model parameter fusion devices by using a preset fusion condition. The fusion condition may be that the number of nodes that have completed the iterative calculation reaches a preset value, where the preset value may be constant or variable at each fusion; or the fusion condition is that the number of times the specified calculation has been completed reaches a preset number of times, which may be constant or variable at each fusion; or
- the fusion condition is that the iterative calculation has been performed for a preset duration, where the preset duration may be constant or different at each fusion.
- the fusion condition may also be another condition, which is not specifically limited in the embodiment of the present invention.
- the apparatus further includes:
- the second sending unit 305 is configured to send its own address and fusion state information to the fusion controller after completing the specified iterative task, where the fusion state information includes a calculation state and/or an iteration number of the model parameter fusion device;
- the second receiving unit 306 is configured to receive the fusion indication information, where the fusion indication information is sent by the fusion controller after it determines, according to the received addresses and fusion state information of K model parameter fusion devices, the N model parameter fusion devices that satisfy the fusion condition; the fusion indication information includes the addresses and/or numbers of the N model parameter fusion devices; and the K model parameter fusion devices are the model parameter fusion devices among the M model parameter fusion devices that have completed the specified iteration task, 1 ≤ K ≤ M.
- the fusion controller is a first model parameter fusion device, where the role of first model parameter fusion device is taken in turn by any one of T model parameter fusion devices among the M model parameter fusion devices, T ≤ M.
- the device further includes:
- a third receiving unit 307, configured to receive the address of the first model parameter fusion device sent by the first model parameter fusion device;
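- the second sending unit 305 is specifically configured to send its own address and fusion state information to the first model parameter fusion device according to the address of the first model parameter fusion device.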
- when the model parameter fusion device is the first model parameter fusion device, the first model parameter fusion device may be any one of the K model parameter fusion devices among the M model parameter fusion devices, and the role of first model parameter fusion device may rotate; that is, the first model parameter fusion device may designate any one of the K model parameter fusion devices among the M model parameter fusion devices as the next first model parameter fusion device, which may in turn designate the one after it, and so on.
- the apparatus further includes:
- the broadcasting unit 308 is configured to broadcast its own address and fusion state information to each of the M model parameter fusion devices; that is, each of the M model parameter fusion devices can simultaneously record the addresses and fusion state information of the model parameter fusion devices.
- the fourth receiving unit 309 is configured to receive the fusion indication information sent by a second model parameter fusion device, where the second model parameter fusion device is any one of the K model parameter fusion devices among the M model parameter fusion devices.
- the fusion indication information is sent by the second model parameter fusion device after it determines, according to the received addresses and fusion state information of the K model parameter fusion devices, the N model parameter fusion devices that satisfy the fusion condition.
- the fusion indication information includes the addresses and/or numbers of the N nodes, and the K model parameter fusion devices are the model parameter fusion devices among the M model parameter fusion devices that have completed the specified iteration task, 1 ≤ K ≤ M.
- any one of the model parameter fusion devices that simultaneously record the addresses and fusion state information of the M model parameter fusion devices is used as the second model parameter fusion device, and the second model parameter fusion device acts as the next model parameter fusion device.
- the device further includes:
- the fifth receiving unit 310 is configured to receive the addresses and fusion state information of K model parameter fusion devices, where the fusion state information includes a calculation state and/or an iteration number of the model parameter fusion device, and the K model parameter fusion devices are the model parameter fusion devices among the M model parameter fusion devices that have completed the specified iteration task, 1 ≤ K ≤ M;
- a determining unit 311, configured to determine, according to the received address and fusion state information of the K model parameter fusion devices, the N model parameter fusion devices that meet the fusion condition;
- a third sending unit 312, configured to send fusion indication information to the model parameter fusion devices other than itself among the N model parameter fusion devices, so that those model parameter fusion devices perform parameter fusion according to the fusion indication information, where the fusion indication information includes the addresses and/or numbers of the N nodes.
- the device further includes:
- a fourth sending unit 313, configured to send its own address to the model parameter fusion devices other than itself among the M model parameter fusion devices, so that those model parameter fusion devices send their own addresses and fusion state information according to the received address.
- the device further includes:
- a fifth sending unit, configured to send the j-th block of the model parameters of the i-th model parameter fusion device to the j-th model parameter fusion device among the N model parameter fusion devices, where 1 ≤ j ≤ N and j ≠ i.
- that is, the blocks of the divided model parameters other than the i-th block are sent to the other model parameter fusion devices among the N model parameter fusion devices, with the j-th block sent to the j-th model parameter fusion device of the same number, and the j-th model parameter fusion device is responsible for the fusion of the model parameters of the j-th block.
- the device further includes:
- a sixth receiving unit 314, configured to receive a total model parameter of the jth block that is merged by the jth model parameter fusion device sent by the jth model parameter fusion device;
- a summary unit 315, configured to aggregate the received total model parameters of the j-th blocks fused by the other model parameter fusion devices, together with the total model parameters of the i-th block obtained by its own fusion, to generate new total model parameters;
- the calculating unit 316 is configured to perform an iterative calculation according to the new total model parameter.
- the embodiment of the present invention provides a model parameter fusion device. After N model parameter fusion devices satisfying the fusion condition are determined, the i-th model parameter fusion device divides its model parameters into N blocks and receives the i-th block model parameters respectively sent by the model parameter fusion devices other than the i-th model parameter fusion device among the N model parameter fusion devices.
- the i-th block model parameters of the i-th model parameter fusion device and the i-th block model parameters respectively sent by the other model parameter fusion devices are fused to obtain the total model parameters of the i-th block, and finally the total model parameters of the i-th block are distributed to the other model parameter fusion devices among the N model parameter fusion devices.
- this addresses the problem of dynamically adjusting computing resources and also improves the utilization of network resources and the stability of the system.
- FIG. 9 is a schematic structural diagram of a model parameter fusion apparatus according to an embodiment of the present invention, which is applied to a machine learning system, where the machine learning system includes M nodes. As shown in FIG. 9, the apparatus includes:
- a receiving unit 401, configured to receive addresses and fusion state information sent by those of the M nodes that have completed the specified calculation, where the fusion state information includes the calculation state and/or iteration count of a node;
- a first determining unit 402, configured to determine, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, where the number N of nodes determined to satisfy the fusion condition may be the same or different each time;
- a first sending unit 403, configured to send fusion indication information to each of the N nodes, where the fusion indication information includes the addresses and/or numbers of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends its i-th model parameter block to the i-th node, where 1 ≤ i ≤ N; each of the N nodes fuses the model parameters it receives and distributes the fused model parameters to the other nodes among the N nodes.
- The fusion condition is that the number of nodes that have completed the specified calculation reaches a preset value; or that the number of times the specified calculation has been completed reaches a preset number of times; or that a preset duration has elapsed.
- The preset value, the preset number of times, and the preset duration may be set in advance, and each may be constant or may vary from one fusion to the next.
- The fusion condition may also be another condition; this is not limited here.
- Optionally, the first determining unit is further configured to determine the N nodes from among the nodes that have already completed fusion and the nodes that have completed the specified calculation but have not yet been fused.
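The fusion conditions described above can be illustrated with a small sketch. It is not the patent's implementation: the class name `FusionController`, its fields, and the rule that a timeout only triggers a fusion when at least two nodes are ready are all assumptions made for illustration. The sketch records reports from finished nodes and selects the participants once either the node-count condition or the duration condition is met, so N can differ from one fusion to the next.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FusionController:
    preset_count: int      # hypothetical: number of ready nodes that triggers a fusion
    preset_seconds: float  # hypothetical: maximum wait before fusing anyway
    ready: dict = field(default_factory=dict)  # address -> iteration count
    window_start: float = field(default_factory=time.monotonic)

    def report(self, address, iterations):
        """Record the address and fusion state info of a node that finished its calculation."""
        self.ready[address] = iterations

    def select_nodes(self, now=None):
        """Return the addresses of the N nodes to fuse, or [] if no condition is met."""
        now = time.monotonic() if now is None else now
        count_met = len(self.ready) >= self.preset_count
        # Assumption: fusing needs at least two nodes, even after a timeout.
        timeout_met = (now - self.window_start >= self.preset_seconds
                       and len(self.ready) >= 2)
        if not (count_met or timeout_met):
            return []
        chosen = sorted(self.ready)
        self.ready.clear()       # start collecting reports for the next fusion
        self.window_start = now
        return chosen
```

A controller built this way could return three nodes in one round (count condition) and two in the next (duration condition), which matches the statement that the number of selected nodes may be the same or different each time.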
- the model parameter fusion device is a first node, where the first node is any one of the M nodes, and the device further includes:
- the second sending unit 404 is configured to send an address of the first node to other nodes of the M nodes except the first node.
- the apparatus further includes:
- a second determining unit 405, configured to determine, after a preset condition is met, a second node as the model parameter fusion device for a second time period, where the second node is any one of K nodes among the M nodes, and K ≤ M;
- a third sending unit 406, configured to send node fusion information to the second node, where the node fusion information includes the addresses and fusion state information of the M nodes;
- a fourth sending unit 407, configured to send the address of the second node to the nodes other than the second node.
- The preset condition may be the elapse of a certain time, a certain number of fusions, a certain number of iterations, and so on; this is not limited by the present invention.
- The time, the number of fusions, and the number of iterations may be set in advance, and each may be fixed or may vary.
- The device further includes:
- a third determining unit, configured to determine, if the second node fails during the second time period, a third node as the model parameter fusion device for the second time period, where the third node is any one of the K nodes among the M nodes.
- That is, when the second node fails, the third determining unit re-determines one of the M nodes as the model parameter fusion device for the second time period; this node may be referred to as the third node.
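The handover between time periods can be sketched as follows. The source only says that the next controller is any one of the K candidate nodes; the round-robin order, the `is_alive` health check, and the message dictionaries below are hypothetical choices made for illustration. Skipping a failed candidate stands in for the fallback to a third node.

```python
def next_controller(current, candidates, is_alive):
    """Pick the controller for the next time period.

    candidates: ordered list of the K node addresses allowed to take the role.
    is_alive: callable(address) -> bool, a hypothetical health check.
    """
    start = candidates.index(current)
    for step in range(1, len(candidates)):
        node = candidates[(start + step) % len(candidates)]
        if is_alive(node):
            return node
    return current  # no other candidate is available, keep the role


def hand_over(current, candidates, is_alive, node_fusion_info, all_nodes):
    """Return the new controller and the messages the current controller would send."""
    new = next_controller(current, candidates, is_alive)
    if new == current:
        return current, []
    # The node fusion information (addresses and fusion state) goes to the new controller.
    messages = [(new, {"type": "node_fusion_info", "info": dict(node_fusion_info)})]
    # Every other node is told where to send its future status reports.
    messages += [
        (node, {"type": "controller_address", "address": new})
        for node in all_nodes
        if node not in (new, current)
    ]
    return new, messages
```

With candidates `["n1", "n2", "n3"]` and `n2` unreachable, a handover from `n1` selects `n3` and notifies the remaining nodes of its address.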
- Alternatively, the model parameter fusion device is at least one of the M nodes. The at least one node receives the address and fusion state information that each node sends after completing the specified calculation, and any one of the at least one node determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition and sends the fusion indication information to each of the N nodes.
- That is, when one or more of the M nodes simultaneously record the node fusion information of the M nodes, each node, after completing fusion, sends its own address and fusion state information, such as its calculation state and/or iteration count, to at least one of the nodes that record the node fusion information; any one of those nodes then determines the N nodes that satisfy the fusion condition and sends the fusion indication information to each of them.
- The model parameter fusion device provided by this embodiment determines the N nodes satisfying the fusion condition based on the addresses and fusion state information sent by those of the M nodes that have completed the specified calculation, and sends the fusion indication information to each of the N nodes. Each of the N nodes then divides its model parameters into N blocks, sends its i-th block to the i-th node, fuses the model parameters it receives, and distributes the fused model parameters to the other nodes among the N nodes. This solves the problem of dynamically adjusting computing resources and makes it possible to remove and add nodes dynamically; in addition, every node participating in the fusion can send and receive model parameters at the same time, which improves the utilization of network resources and the stability of the system.
- FIG. 12 is a schematic diagram of a node, where the node includes a memory 1201, a processor 1202, a power component 1203, an input/output interface 1204, and a communication component 1205.
- the processor 1202 is configured to execute the model parameter fusion method described in Embodiment 2.
- FIG. 12 is merely illustrative and does not limit the structure of the node.
- the node may also include more or fewer components than shown in FIG. 12, or have a configuration different from that shown in FIG. 12.
- the memory 1201 can be used to store data, software programs, and modules. It mainly includes a program storage area and a data storage area: the program storage area can store an operating system, an application required for at least one function, and the like, and the data storage area can store data created through use of the model parameter fusion device, and the like. Further, the memory may include high-speed random access memory and may also include nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
- the processor 1202 is the control center of the node. It connects the various parts of the node through various interfaces and lines and, by running or executing the software programs and/or modules stored in the memory 1201 and calling the data stored in the memory 1201, performs the various functions of the node and processes data, thereby monitoring the node as a whole.
- the processor 1202 may include one or more processing units; preferably, the processor 1202 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, applications, and the like.
- the modem processor mainly handles wireless communications. It can be understood that the modem processor may also not be integrated into the processor 1202.
- Power component 1203 provides power to the various components of the node and may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the node.
- the input/output interface 1204 provides an interface between the processor 1202 and the peripheral interface module.
- the peripheral interface module can be a keyboard, a mouse, or the like.
- Communication component 1205 is configured to facilitate wired or wireless communication between a node and other devices.
- the node can access a wireless network based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
- the node may also include an audio component, a multimedia component, and the like, which are not described herein again.
- With the node provided by the embodiment of the present invention, each of the N nodes participating in the fusion divides its model parameters into N blocks and sends its i-th model parameter block to the i-th node; each of the N nodes fuses the model parameters it receives and distributes the fused model parameters to the other nodes among the N nodes. Nodes can therefore be removed and added dynamically, and every node participating in the fusion can send and receive model parameters at the same time, which improves the utilization of network resources and the stability of the system.
- FIG. 13 provides a fusion controller, where the fusion controller includes a memory 1301, a processor 1302, a power component 1303, an input/output interface 1304, a communication component 1305, and the like.
- the processor 1302 is configured to perform the model parameter fusion method described in the second embodiment.
- FIG. 13 is merely illustrative and does not limit the structure of the fusion controller.
- the fusion controller may also include more or fewer components than shown in FIG. 13, or have a configuration different from that shown in FIG. 13.
- the memory 1301 can be used to store data, software programs, and modules. It mainly includes a program storage area and a data storage area: the program storage area can store an operating system, an application required for at least one function, and the like, and the data storage area can store data created through use of the model parameter fusion device, and the like.
- the memory may include high-speed random access memory and may also include nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
- the processor 1302 is the control center of the fusion controller. It connects the various parts of the fusion controller through various interfaces and lines and, by running or executing the software programs and/or modules stored in the memory 1301 and calling the data stored in the memory 1301, performs the various functions of the fusion controller and processes data, thereby monitoring the fusion controller as a whole.
- the processor 1302 may include one or more processing units; preferably, the processor 1302 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, applications, and the like.
- the modem processor primarily handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 1302.
- Power component 1303 is used to provide power to various components of the fusion controller, which may include a power management system, one or more power sources, and other components associated with the fusion controller to generate, manage, and distribute power.
- the input/output interface 1304 provides an interface between the processor 1302 and the peripheral interface module.
- the peripheral interface module can be a keyboard, a mouse, or the like.
- Communication component 1305 is configured to facilitate wired or wireless communication between the fusion controller and other devices.
- the fusion controller can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
- the fusion controller may further include an audio component, a multimedia component, and the like, which are not described herein again.
- The fusion controller provided by this embodiment determines the N nodes satisfying the fusion condition based on the addresses and fusion state information sent by those of the M nodes that have completed the specified calculation, and sends the fusion indication information to each of the N nodes. Each of the N nodes then divides its model parameters into N blocks, sends its i-th model parameter block to the i-th node, fuses the model parameters it receives, and distributes the fused model parameters to the other nodes among the N nodes. This solves the problem of dynamically adjusting computing resources and improves the utilization of network resources and the stability of the system.
- An embodiment of the present invention provides a machine learning system, where the machine learning system includes a node according to the fifth embodiment, and a fusion controller according to the sixth embodiment.
- the fusion controller is set independently of the node or configured on the node.
- In this machine learning system, the fusion controller determines the N nodes satisfying the fusion condition based on the addresses and fusion state information sent by those of the M nodes that have completed the specified calculation, and sends the fusion indication information to each of the N nodes. Each of the N nodes participating in the fusion then divides its model parameters into N blocks, sends its i-th model parameter block to the i-th node, fuses the model parameters it receives, and distributes the fused model parameters to the other nodes among the N nodes. This solves the problem of dynamically adjusting computing resources and makes it possible to remove and add nodes dynamically. Every node participating in the fusion can send and receive model parameters at the same time, which improves the utilization of network resources and the stability of the system.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Operations Research (AREA)
- Computer Networks & Wireless Communication (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
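Embodiments of the present invention provide a model parameter fusion method and apparatus, which relate to the field of machine learning and address the problems of large data transmission volume and dynamic adjustment of computing resources in model parameter fusion. The method includes: an i-th node divides its model parameters into N blocks, where the i-th node is any one of the N nodes participating in the fusion and 1 ≤ i ≤ N ≤ M; the i-th node receives the respective i-th model parameter blocks sent by each of the N nodes other than the i-th node; the i-th node fuses its own i-th model parameter block with the i-th blocks sent by the other nodes to obtain the total model parameter of the i-th block; and the i-th node distributes the total model parameter of the i-th block to the other nodes among the N nodes.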
Description
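The present invention relates to the field of machine learning, and in particular to a model parameter fusion method and apparatus.
Model parameters are parameters, composed of multiple constraint parameters, that describe a model. Model parameters can be used to filter out data that share common features. For example, when the model parameters are image-type model parameters, different model parameters can be used to pick out image data containing people, animals, or human faces from a large body of image data. As data volumes and data types grow rapidly, more and more model parameters are used for data filtering, and these model parameters are obtained by repeatedly computing over and fusing large amounts of data with common features.
At present, model parameter fusion divides the data into multiple data subsets and assigns them to different nodes, each of which trains on its assigned subset by iterative calculation. After every one or more iterations, the model parameters that the nodes obtained from their different data subsets are fused once, and the fused model parameters serve as the initial model parameters for the next iteration. After multiple fusions, the final total model parameters are obtained.
In the prior art there are two main approaches to model parameter fusion. In the first, after each node has completed multiple iterations on its data subsets, a parameter server aggregates and fuses the model parameters trained by the nodes to obtain new model parameters, and each node then performs the next iteration on its data subsets using the new model parameters. In the second, after a node completes multiple iterations on its assigned data subset, it sends the resulting model parameters to designated other nodes for fusion with the model parameters of those nodes' data subsets, and then resumes iterative calculation from the model parameters it has received from other nodes. However, the first approach places high performance demands on the parameter server used for fusion, which is prone to crashing, while the second requires storing a large amount of data and transmits a large volume of data.
Summary of the Invention
Embodiments of the present invention provide a model parameter fusion method and apparatus, to solve the problems in model parameter fusion of high performance requirements on the parameter server and large data transmission volume.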
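To make the data-volume argument concrete, the following sketch counts how many parameter values one node sends in a single fusion when the model is exchanged block by block, and compares that with the number of values a central parameter server would have to receive. The function names are hypothetical, and the arithmetic assumes N equally sized blocks; the source does not give these formulas.

```python
def per_node_traffic(n, p):
    """Values sent by one node in one block-wise fusion among n nodes with p parameters."""
    scatter = (n - 1) * p / n     # block j goes to node j, for every j != i
    distribute = (n - 1) * p / n  # the fused block i goes to the other n - 1 nodes
    return scatter + distribute


def parameter_server_inbound(n, p):
    """Values a central parameter server receives when n nodes each upload p parameters."""
    return n * p
```

For example, with 4 nodes and 1000 parameters each node sends 1500 values, while a central server would receive 4000. Under these assumptions the per-node load stays below 2p regardless of n, whereas the load on a central server grows linearly with n.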
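To achieve the above objective, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a model parameter fusion method is provided, applied to a machine learning system that includes M nodes. The method includes:
an i-th node divides the model parameters of the i-th node into N blocks, where the i-th node is any one of the N nodes, among the M nodes, that participate in the fusion, 1 ≤ i ≤ N ≤ M, and the i-th of the N blocks is the i-th model parameter block;
the i-th node receives the respective i-th model parameter blocks sent by each of the N nodes other than the i-th node;
the i-th node fuses its own i-th model parameter block with the i-th model parameter blocks sent by the other nodes to obtain the total model parameter of the i-th block;
the i-th node distributes the total model parameter of the i-th block to the nodes among the N nodes other than the i-th node.
When the i-th node receives the i-th model parameter blocks sent by the other nodes among the N nodes, and when it distributes the total model parameter of the i-th block to the other nodes, full-duplex data transmission may be used; that is, the i-th node can receive data from other nodes while it is sending data to them, for example by using a full-duplex network adapter. This is not limited by the present invention.
In addition, the N nodes participating in the fusion are determined from the M nodes by a preset fusion condition. The fusion condition may be that the number of nodes that have completed the iterative calculation reaches a preset value, which may be constant or may vary from one fusion to the next; or that the number of times the specified calculation has been completed reaches a preset number of times, which may likewise be constant or vary; or that the iterative calculation has run for a preset duration, which may likewise be constant or vary. The fusion condition may also be another condition; this is not specifically limited by the present invention.
Further, if N nodes have already completed a fusion, then when the fusion controller determines N nodes again, it may select the N nodes from among the nodes that have completed fusion and the nodes that have not been fused but have completed the specified calculation.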
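With reference to the first aspect, in a first possible implementation of the first aspect, before the i-th node divides its model parameters into N blocks, the method further includes:
a k-th node sends its address and fusion state information to a fusion controller, where the fusion state information includes the calculation state and/or iteration count of the node, the k-th node is a node among the M nodes that has completed the specified iterative task, and 1 ≤ k ≤ M;
the i-th node receives fusion indication information sent by the fusion controller. The fusion indication information is issued by the fusion controller after it determines, from the received address and fusion state information of the k-th node, the N nodes that satisfy the fusion condition, and it includes the addresses and/or numbers of the N nodes. The number N of nodes that the fusion controller determines to satisfy the fusion condition may be the same or different each time.
Note that after the k-th node completes the specified calculation, it sends its own address and its currently recorded fusion state information to the fusion controller. The fusion controller may be a fixed node, and it determines the N nodes participating in the fusion according to the fusion condition described above.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the fusion controller is a first node, where the role of first node rotates among T nodes of the M nodes; before the k-th node sends its address and fusion state information to the fusion controller, the method further includes:
the k-th node receives the address of the first node, sent by the first node; that is, the node currently acting as the fusion controller sends its own address to the k-th node.
Correspondingly, the k-th node sending its address and fusion state information to the fusion controller includes:
the k-th node sends its address and fusion state information to the first node according to the address of the first node. That is, when the role of fusion controller rotates to another node, the k-th node sends its address and fusion state information to the node currently acting as the fusion controller; for example, the fusion state information may be the calculation state and/or iteration count of the node.
Note that when the fusion controller is the first node, the first node may be any one of T nodes among the M nodes, and the role may rotate: the current fusion controller may designate any one of the T nodes as the next fusion controller, which in turn may designate the one after that, and so on.
With reference to the first aspect, in a third possible implementation of the first aspect, before the i-th node divides its model parameters into N blocks, the method further includes:
the k-th node broadcasts its address and fusion state information to every one of the M nodes; that is, every one of the M nodes may simultaneously record the address and fusion state information of the k-th node;
the k-th node receives fusion indication information sent by a second node, where the second node is any one of K nodes among the M nodes. The fusion indication information is issued by the second node after it determines, from the received address and fusion state information of the k-th node, the N nodes that satisfy the fusion condition, and it includes the addresses and/or numbers of the N nodes. That is, any one of the nodes that simultaneously record the address and fusion state information of the k-th node serves as the second node and acts as the fusion controller.
With reference to any one of the first aspect through the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the method further includes:
the i-th node sends its j-th model parameter block to the j-th node among the N nodes, where 1 ≤ j ≤ N and j ≠ i.
That is, the i-th node sends the model parameter blocks other than the i-th block to the other nodes among the N nodes, with the j-th block going to the j-th node, which is responsible for fusing the j-th block.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the method further includes:
the i-th node receives the total model parameter of the j-th block, as fused and sent by the j-th node;
the i-th node aggregates the corresponding parts of the fused total model parameters received from each of the N nodes other than the i-th node, generating the new total model parameters of the i-th node;
the i-th node performs iterative calculation according to the new total model parameters.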
第二方面,提供一种模型参数融合方法,应用于机器学习系统,所述机器学习系统包括M个节点,所述方法包括:
融合控制器接收所述M个节点中完成指定计算的节点发送的地址和融合状态信息,所述融合状态信息包括节点的计算状态和/或迭代次数;
所述融合控制器根据接收的所述地址和融合状态信息,确定满足融合条件的N个节点;其中,所述融合控制器每次确定满足融合条件的N个节点的个数相同或者不同;
所述融合控制器向所述N个节点中的每个节点发送融合指示信息,所述融合指示信息包括所述N个节点的地址和/或编号,以使得所述N个节点中的每个节点分别将各自的模型参数划分为N块;将各自模型参数中划分的第i块模型参数发送给第i个节点,其中,1≤所述i≤所述N;所述N个节点中每个节点分别融合各自收到的模型参数,以及所述N个节点中每个节点分别将融合后的模型参数分发给所述N个节点中的除自身之外的其他节点。
结合第二方面,在第二方面的第一种可能的实现方式中,所述融合条件为所述完成指定计算的节点个数达到预设数值;或者,所述完成指定计算的次数达到预设次数;或者,经过预设时长。
需要说明的是,预设数值、预设次数和预设时长可以事先设置,且预设数值、预设次数和预设时长可以是常数,也可以是变化的,本发明对此不作限定。
另外,当融合控制器确定满足融合条件的N个节点时,融合控制器可以确定已完成融合的节点、以及未完成融合且完成指定计算的节点中的N个节点。
结合第二方面或第二方面的第一种可能的实现方式,在第二方面的第二种可能的实现方式中,所述融合控制器为第一节点,其中,所述第一节点为所述M个节点中K个节点的任一个节点;在所述融合控制器接收所述M个节点中完成指定计算的节点发送的地址以及融合状态信息之前,所述方法还包括:
所述第一节点作为第一时间段的融合控制器向所述M个节点中除所述第一节点之外的其他节点发送所述第一节点的地址。
需要说明的是,当第一节点作为第一时间段的融合控制器时,该第一节点可以为M个节点中K个节点的任一节点,且第一节点可以事先指定。
结合第二方面的第二种可能的实现方式,在第二方面的第三种可能的实现方式中,所述方法还包括:
在满足预设条件后,所述第一节点确定第二节点作为第二时间段的融合控制器;其中,所述第二节点为所述M个节点中K个节点的任一个节点,所述K≤所述M;
所述第一节点向所述第二节点发送节点融合信息,所述节点融合信息包括所述M个节点的地址和融合状态信息;
所述第一节点向所述第二节点之外的其他节点发送所述第二节点的地址。
其中,预设条件可以为经过一定的时间,或者经过一定的融合次数,或者经过一定的迭代次数等,本发明对此不作限定。
需要说明的是,一定时间、一定的融合次数和一定的迭代次数可以事先设置,且一定时间、一定的融合次数和一定的迭代次数可以是固定不变的,也可以是变化的。
具体地,在满足预设条件后,第一时间段内担当融合控制器的第一节点指定第二时间段内的担当融合控制器的节点,该节点称为第二节点,也即是,由第二节点接替第一节点作为融合控制器,并且第一节点将M个节点的节点融合信息发送给第二节点,第二节点向其他的节点发送自身的地址,以使其他节点在完成融合后将地址和融合状态信息进行上报。
进一步,若所述第二节点在所述第二时间段内故障,所述第一节点确定第三节点作为所述第二时间段的参数融合控制器,其中,所述第三节点为所述M个节点中K个节点的任一个节点。
也即是,当第二节点出现故障,无法作为融合控制器时,由之前担当融合控制器的第一节点重新确定M个节点中一个节点作为第二时间段内的融合控制器,重新确定的节点可以称为第三节点。
结合第二方面或第二方面的第一种可能的实现方式,在第二方面的第四种可能的实现方式中,所述融合控制器为所述M个节点中的至少一个节点,所述至少一个节点接收每个节点在完成指定计算后发送的地址和融合状态信息,则所述融合控制器根据接收的所述地址和融合状态信息,确定满足融合条件的N个节点,向所述N个节点中的每个节点发送融合指
示信息,为:所述至少一个节点中的任一节点根据接收的所述地址和融合状态信息,确定满足融合条件的N个节点,向所述N个节点中的每个节点发送融合指示信息。
也即是,当M个节点中的一个节点或者多个节点同时记录M个节点的节点融合信息时,每个节点在完成融合后都将自身的地址和融合状态信息,比如节点的计算状态和/或迭代次数,发送给同时记录节点融合信息的该M个节点中的至少一个节点,由至少一个节点中的任一节点根据接收的所述地址和融合状态信息,确定满足融合条件的N个节点,向所述N个节点中的每个节点发送融合指示信息。
第三方面,提供一种模型参数融合装置,应用于机器学习系统,所述机器学习系统包括M个节点,所述装置包括:
划分单元,用于将自身的模型参数划分为N块;其中,所述N为所述M个模型参数融合装置中参与融合的模型参数融合装置的个数,所述模型参数划分的N块中的第i块为第i块模型参数,1≤所述i≤所述N≤所述M;
第一接收单元,用于接收所述N个模型参数融合装置中除自身之外的其他模型参数融合装置分别发送的各自的第i块模型参数;
融合单元,用于将自身的第i块模型参数以及其他模型参数融合装置分别发送的各自的第i块模型参数进行融合,获得第i块的总模型参数;
第一发送单元,用于将所述第i块的总模型参数分发给所述N个模型参数融合装置中除自身之外的其他模型参数融合装置。
其中,在向N个模型参数融合装置中除自身之外的其他模型参数融合装置分别发送的各自的第i块模型参数,以及将第i块的总模型参数分发给N个模型参数融合装置中的其他模型参数融合装置时,可以采用全双工的数据传输方式,也即是,在向其他模型参数融合装置发送数据时同时可以接收其他模型参数融合装置发送的数据,比如,采用全双工网卡等,本发明实施例对此不作限定。
另外,参与融合的N个模型参数融合装置是通过预设的融合条件从M个模型参数融合装置中确定的,该融合条件可以是完成迭代计算的节点个数达到预设数值,该预设数值在每次融合时可以是常数,也可以是变
化的;或者,该融合条件是完成指定计算的次数达到预设次数,该预设次数在每次融合时可以是常数,也可是变化的;或者,该融合条件是迭代计算经过预设时长,该预设时长在每次融合时可以是常数,也可是变化的,当然,该融合条件也可以是其他的条件等,本发明实施例对此不作具体限定。
结合第三方面,在第三方面的第一种可能的实现方式中,所述装置还包括:
第二发送单元,用于在完成指定的迭代任务后,发送自身的地址和融合状态信息给融合控制器,所述融合状态信息包括模型参数融合装置的计算状态和/或迭代次数;
第二接收单元,用于接收融合指示信息,所述融合指示信息是所述融合控制器在根据接收的K个模型参数融合装置的地址和融合状态信息确定满足融合条件的所述N个模型参数融合装置之后发出的,所述融合指示信息包括所述N个模型参数融合装置的地址和/或编号,所述K个模型参数融合装置为所述M个模型参数融合装置中完成指定的迭代任务的模型参数融合装置,1≤所述K≤所述M。
结合第三方面,在第三方面的第二种可能的实现方式中,若所述融合控制器为第一模型参数融合装置,其中,所述第一模型参数融合装置轮流为所述M个节点中T个模型参数融合装置的任一个模型参数融合装置,所述T≤所述M,所述装置还包括:
第三接收单元,用于接收所述第一模型参数融合装置发送的所述第一模型参数融合装置的地址;
相应的,所述第二发送单元具体用于:
所根据所述第一模型参数融合装置的地址向所述第一模型参数融合装置发送自身的地址和融合状态信息。
需要说明的是,当模型参数融合装置为第一模型参数融合装置时,该第一模型参数融合装置可以为M个模型参数融合装置点中K个模型参数融合装置的任一模型参数融合装置,且该第一模型参数融合装置可以是轮流的,也即是,第一模型参数融合装置可以指定M个模型参数融合装置中的K个模型参数融合装置的任一模型参数融合装置作为下次的模型参
数融合装置,下次的模型参数融合装置可以指定下下次的模型参数融合装置,以此类推。
结合第三方面,在第三方面的第三种可能的实现方式中,所述装置还包括:
广播单元,用于向所述M个模型参数融合装置中每个模型参数融合装置广播所述自身的地址和融合状态信息;也即是,M个模型参数融合装置中每个模型参数融合装置可以同时记录地址和融合状态信息。
第四接收单元,用于接收第二模型参数融合装置发送的融合指示信息,所述第二模型参数融合装置为所述M个模型参数融合装置中K个模型参数融合装置的任一模型参数融合装置,所述融合指示信息是所述第二模型参数融合装置在根据接收的K个模型参数融合装置的地址和融合状态信息确定满足融合条件的所述N个模型参数融合装置之后发出的,所述融合指示信息包括所述N个节点的地址和/或编号,所述K个模型参数融合装置为所述M个模型参数融合装置中完成指定的迭代任务的模型参数融合装置,1≤所述K≤所述M。
也即是,将同时记录M个模型参数融合装置的地址和融合状态信息的节点中任一个模型参数融合装置作为第二模型参数融合装置,由第二模型参数融合装置担当下次的模型参数融合装置。
结合第三方面,在第三方面的第四种可能的实现方式中,所述装置还包括:
第五接收单元,用于接收K个模型参数融合装置的地址和融合状态信息,所述融合状态信息包括模型参数融合装置的计算状态和/或迭代次数,所述K个模型参数融合装置为所述M个模型参数融合装置中完成指定的迭代任务的模型参数融合装置,1≤所述K≤所述M;
确定单元,用于根据接收的所述K个模型参数融合装置的地址和融合状态信息确定满足融合条件的所述N个模型参数融合装置;
第三发送单元,用于向所述N个模型参数融合装置中除自身之外的其他模型参数融合装置发送融合指示信息,以便于所述N个模型参数融合装置中除自身之外的其他模型参数融合装置根据所述融合指示信息进行参数融合,所述融合指示信息包括所述N个节点的地址和/或编号。
结合第三方面的第三种可能的实现方式,在第三方面的第五种可能的实现方式中,所述装置还包括:
第四发送单元,用于向所述M个模型参数融合装置中除自身之外的其他模型参数融合装置发送自身的地址,以便于所述M个模型参数融合装置中除自身之外的其他模型参数融合装置根据接收的所述地址发送自身的地址和融合状态信息。
结合第三方面至第三方面的第六种可能的实现方式中的任一种可能的实现方式,在第三方面的第七种可能的实现方式中,所述装置还包括:
第五发送单元,用于将自身的第j块模型参数发送给所述N个模型参数融合装置中的第j模型参数融合装置,其中,1≤所述j≤所述N,且所述j≠i。
也即是,将划分的模型参数中除第i块之外的其他模型参数块发送给N个模型参数融合装置中的其他模型参数融合装置,且将相同编号的第j块发送给第j模型参数融合装置,由第j模型参数融合装置负责第j块的模型参数融合。
结合第三方面的第七种可能的实现方式,在第三方面的第五种可能的实现方式中,所述装置还包括:
第六接收单元,用于接收所述第j模型参数融合装置发送的所述第j模型参数融合装置融合后的第j块的总模型参数;
汇总单元,用于接收所述第j模型参数融合装置发送的所述第j模型参数融合装置融合后的第j块的总模型参数;
计算单元,用于根据所述新的总模型参数,进行迭代计算。
第四方面,提供一种模型参数融合装置,应用于机器学习系统,所述机器学习系统包括M个节点,所述装置包括:
接收单元,用于接收所述M个节点中完成指定计算的节点发送的地址和融合状态信息,所述融合状态信息包括节点的计算状态和/或迭代次数;
第一确定单元,用于根据接收的所述地址和融合状态信息,确定满足融合条件的N个节点;其中,所述融合控制器每次确定满足融合条件的N个节点的个数相同或者不同;
第一发送单元,用于向所述N个节点中的每个节点发送融合指示信息,所述融合指示信息包括所述N个节点的地址和/或编号,以使得所述N个节点中的每个节点分别将各自的模型参数划分为N块;将各自模型参数中划分的第i块模型参数发送给第i个节点,其中,1≤所述i≤所述N;所述N个节点中每个节点分别融合各自收到的模型参数,以及所述N个节点中每个节点分别将融合后的模型参数分发给所述N个节点中除自身之外的其他节点。
结合第四方面,在第四方面的第一种可能的实现方式中,所述融合条件为所述完成指定计算的节点个数达到预设数值;或者,所述完成指定计算的次数达到预设次数;或者,经过预设时长。
可选的,第一确定单元还具体用于:
确定已完成融合的节点、以及未完成融合且完成指定计算的节点中的N个节点。
结合第四方面或第四方面的第一种可能的实现方式,在第四方面的第二种可能的实现方式中,所述模型参数融合装置为第一节点,其中,所述第一节点为所述M个节点中任一个节点,所述装置还包括:
第二发送单元,用于向所述M个节点中除所述第一节点之外的其他节点发送所述第一节点的地址。
结合第四方面的第二种可能的实现方式,在第四方面的第三种可能的实现方式中,所述装置还包括:
第二确定单元,用于在满足预设条件后,确定第二节点作为第二时间段的模型参数融合装置;其中,所述第二节点为所述M个节点中K个节点的任一个节点,所述K≤所述M;
第三发送单元,用于向所述第二节点发送节点融合信息,所述节点融合信息包括所述M个节点的地址和融合状态信息;
第四发送单元,用于向所述第二节点之外的其他节点发送所述第二节点的地址。
其中,预设条件可以为经过一定的时间,或者经过一定的融合次数,或者经过一定的迭代次数等,本发明对此不作限定。
需要说明的是,一定时间、一定的融合次数和一定的迭代次数可以事
先设置,且一定时间、一定的融合次数和一定的迭代次数可以是固定不变的,也可以是变化的。
结合第四方面的第四种可能的实现方式,在第四方面的第五种可能的实现方式中,所述装置还包括:
第三确定单元,用于若所述第二节点在所述第二时间段内故障,确定第三节点作为所述第二时间段的模型参数融合装置,其中,所述第三节点为所述M个节点中K个节点任一个节点。
也即是,当第二节点出现故障,由第三确定单元重新确定M个节点中一个节点作为第二时间段内的模型参数融合装置,此时,该节点可以称为第三节点。
结合第四方面或第四方面的第一种可能的实现方式,在第四方面的第四种可能的实现方式中,所述模型参数融合装置所述M个节点中的至少一个节点,所述至少一个节点接收每个节点在完成指定计算后发送的地址和融合状态信息,确定满足融合条件的N个节点,向所述N个节点中的每个节点发送融合指示信息,为:所述至少一个节点中的任一节点根据接收的所述地址和融合状态信息,确定满足融合条件的N个节点,向所述N个节点中的每个节点发送融合指示信息。
也即是,当M个节点中的一个节点或者多个节点同时记录M个节点的节点融合信息时,每个节点在完成融合后都将自身的地址和融合状态信息,比如节点的计算状态和/或迭代次数,发送给同时记录节点融合信息的该M个节点中的至少一个节点,由至少一个节点中的任一节点根据接收的所述地址和融合状态信息,确定满足融合条件的N个节点,向所述N个节点中的每个节点发送融合指示信息。
第五方面,提供一种节点,所述节点包括处理器和存储器,所述存储器中存储代码和数据,所述处理器可运行存储器中的代码,所述处理器用于执行上述第一方面至第一方面的第五种可能的实现方式中任一项所述的模型参数融合方法。
第六方面,提供一种融合控制器,所述融合控制器包括处理器和存储器,所述存储器中存储代码和数据,所述处理器可运行存储器中的代码,所述处理器用于执行第二方面至第二方面的第四种可
能的实现方式中任一项所述的模型参数融合方法。
第七方面,提供一种机器学习系统,所述机器学习系统包括上述第五方面所述的一种节点,以及第六方面所述的一种融合控制器。
结合第七方面,在第七方面的第一种可能的实现方式中,所述融合控制器为独立于所述节点设置,或者配置在所述节点上。
本发明的实施例提供的一种模型参数方法及装置,通过确定满足融合条件的N个节点,并将第i节点的模型参数划分为N块,以及接收N个节点中除第i节点之外的其他节点分别发送的各自的第i块模型参数,再将第i节点的第i块模型参数以及其他节点分别发送的各自的第i块模型参数进行融合,获得第i块的总模型参数,最后将第i块的总模型参数分发给所述N个节点中的其他节点,其中,第i节点为参与融合的N个节点中任一节点,从而解决了计算资源中动态调整的问题,并具有动态删除和增加节点的能力,同时参与融合的每个节点可以同时发送和接收模型参数,提高了网络资源的利用率和系统的稳定性。
为了更清楚地说明本发明实施例的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
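FIG. 1 is a schematic structural diagram of a machine learning system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a model parameter fusion method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an i-th node receiving the i-th model parameter blocks according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an i-th node fusing and sending the total model parameter of the i-th block according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a first model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a second model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a third model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a fourth model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 8a is a schematic structural diagram of a fifth model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 8b is a schematic structural diagram of a sixth model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 8c is a schematic structural diagram of a seventh model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an eighth model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a ninth model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a tenth model parameter fusion apparatus according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a node according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a fusion controller according to an embodiment of the present invention.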
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
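Embodiment 1
The architecture of the machine learning system to which embodiments of the present invention apply is shown in FIG. 1. It includes a data storage device 101, a model parameter training platform 102, and a model parameter storage device 103.
The data storage device 101 may be a data storage server 101, which stores the raw data used for model parameter training. Its storage capacity is far greater than that of the computing servers 1021 in the model parameter training platform 102. The raw data may be language data, image data, video data, and so on. It consists of multiple data sets, each of which comprises multiple type subsets; every type subset carries a data label indicating its category, and the type subsets within one data set share the same label. For example, a data set may contain multiple images of people labeled as people, multiple images of animals labeled as animals, or images of other categories.
The model parameter training platform 102 includes computing servers 1021 for iterative calculation, also called nodes, which may be ordinary computers, mobile terminals, workstations, general-purpose servers, dedicated servers, and so on, as well as a switch 1022 responsible for data communication between the computing servers. Each computing server 1021 has local storage with a capacity smaller than that of the data storage server 101. During model training, each computing server reads a certain amount of data from the data storage server 101 into its local storage by sampling, for use in model parameter training. By training and fusing model parameters on labeled data sets, the model parameter training platform 102 obtains one final fused total model parameter, which can then be used to identify the data type of new data. For example, fusing model parameters on an image data set labeled as people yields model parameters that can identify images of people in new image data, and fusing on an image data set labeled as animals yields model parameters that can identify images of animals in new image data.
The model parameter storage server 103 stores the trained model parameters. When the model parameter training platform 102 finishes training and fusion, it may send the final fused model parameters to the model parameter storage server 103 for storage and later use. In addition, the model parameters initially used by the computing servers 1021 in the model parameter training platform 102 for training and fusion may also be obtained from the model parameter storage server 103.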
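Embodiment 2
FIG. 2 is a flowchart of a model parameter fusion method according to an embodiment of the present invention. The method is applied to a machine learning system that includes M nodes, and comprises the following steps.
Step 201: A node used for model parameter fusion obtains a data subset from a data set.
The data set here is the data set used for iterative calculation of model parameters. It may be language data, image data, video data, and so on, and it consists of multiple type subsets, each carrying a data label indicating its category; the type subsets within one data set share the same label.
The data set may be stored in advance on a storage device such as a hard disk or magnetic disk, or on a data storage server. To obtain a data subset, the storage device may be connected directly to the device hosting the node, or the data may be fetched from the data storage server.
Note that the data set available for model parameter fusion is far larger than the amount of data the model parameters actually use. Therefore, when a node obtains a data subset, it may sample a certain amount of data from the data set; if the computing capability of each node is known in advance, the amount of data in each node's subset may be allocated according to that capability.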
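One way to allocate subset sizes according to computing capability is a simple proportional split. The sketch below is an illustration only: the function name `allocate_samples` and the use of integer capacity scores are assumptions, and the source does not prescribe any particular allocation rule.

```python
def allocate_samples(total, capacities):
    """Split `total` samples across nodes in proportion to integer capacity scores."""
    weight = sum(capacities)
    shares = [total * c // weight for c in capacities]
    # Integer division can leave a few samples unassigned;
    # hand them to the most capable nodes first.
    leftover = total - sum(shares)
    order = sorted(range(len(capacities)), key=lambda k: capacities[k], reverse=True)
    for k in order[:leftover]:
        shares[k] += 1
    return shares
```

For instance, `allocate_samples(100, [1, 2, 1])` gives the middle node half of the samples and each of the other two a quarter.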
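Step 202: Each node performs the specified iterative calculation based on its data subset and the current model parameters.
For the first iteration, each node may perform iterative calculation based on the obtained data subset and the initial model parameters; once an iteration completes, each node may perform the next iteration based on the data subset and the model parameters currently obtained.
The initial model parameters are the model parameters each node starts with, and they may be the same for every node. The currently obtained model parameters are those a node obtained from its current iteration or those it has currently received, that is, the most recent model parameters.
Step 203: A k-th node sends its address and fusion state information to the fusion controller. The fusion state information includes the calculation state and/or iteration count of the node; the k-th node is a node among the M nodes that has completed the specified iterative task, and 1 ≤ k ≤ M.
The M nodes in the machine learning system each perform the specified iterative calculation based on their data subsets and model parameters. When any of the M nodes completes the specified calculation, it sends its own address and fusion state information to the fusion controller.
The fusion state information includes the calculation state and/or iteration count of the node. That is, the k-th node may send its current calculation state, the number of iterations it has completed, or both. The calculation state indicates whether the specified iterative calculation has been completed.
The address of the k-th node may be its IP address, its MAC (Media Access Control) address, also called a physical address, or the node's number, and so on; this is not limited in the embodiments of the present invention.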
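The report sent in step 203 can be pictured as a small record. The sketch below is hypothetical: the source specifies only that the message carries an address plus a calculation state and/or iteration count, so the field names, the `FusionStatus` type, and the JSON encoding are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FusionStatus:
    address: str     # IP address, MAC address, or node number
    finished: bool   # calculation state: has the specified iterative calculation completed?
    iterations: int  # number of iterations completed so far


def encode(status):
    """Serialize a status report for sending to the fusion controller."""
    return json.dumps(asdict(status), sort_keys=True).encode("utf-8")


def decode(payload):
    """Rebuild the status report on the fusion controller side."""
    return FusionStatus(**json.loads(payload.decode("utf-8")))
```

A node would build one `FusionStatus` after finishing its specified calculation and send the encoded bytes to whichever node currently acts as the fusion controller.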
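Step 204: The fusion controller receives the addresses and fusion state information sent by those of the M nodes that have completed the specified calculation, determines the N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes. The fusion indication information is issued after the fusion controller determines, from the received address and fusion state information of the k-th node, the N nodes that satisfy the fusion condition, and it includes the addresses and/or numbers of the N nodes.
The number N of nodes that the fusion controller determines to satisfy the fusion condition may be the same or different each time.
Note that the N nodes participating in the fusion are determined from the M nodes by a preset fusion condition. The fusion condition may be that the number of nodes that have completed the iterative calculation reaches a preset value, which may be constant or may vary from one fusion to the next; or that the number of times the specified calculation has been completed reaches a preset number of times, which may likewise be constant or vary; or that the iterative calculation has run for a preset duration, which may likewise be constant or vary. The fusion condition may also be another condition; this is not specifically limited by the present invention.
In addition, if N nodes have already completed a fusion, then when the fusion controller determines N nodes again, it may select the N nodes from among the nodes that have completed fusion and the nodes that have not been fused but have completed the specified calculation.
Further, the fusion controller may be a fixed node, may rotate among different nodes, or may be implemented in a distributed manner by at least one node. The three variants are described below.
First, the fusion controller is a fixed node, which may be set in advance. After any of the M nodes completes the specified calculation, it may send its address and fusion state information to this fixed fusion controller, which determines, from the received addresses and fusion state information, the N nodes that satisfy the fusion condition and sends fusion indication information to each of them.
Second, the role of fusion controller rotates among different nodes. The first node to take the role may be called the first node, and it is any one of T nodes among the M nodes, where T ≤ M.
Because the role rotates, the M nodes need to know where to send their addresses and fusion state information after completing the specified calculation. Therefore, before the k-th node sends its address and fusion state information to the fusion controller in step 203, the k-th node receives the address of the first node from the first node; that is, the node currently acting as the fusion controller sends its own address to the M nodes.
Correspondingly, the k-th node sending its address and fusion state information to the fusion controller includes: the k-th node sends its address and fusion state information to the first node according to the address of the first node.
The first node then acts as the fusion controller for a first time period: it receives the addresses and fusion state information sent by nodes that have completed the specified iterative calculation, determines from them the N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes.
After a preset condition is met, the first node determines a second node as the fusion controller for a second time period, where the second node is any one of K nodes among the M nodes. The first node sends node fusion information, which includes the addresses and fusion state information of the M nodes, to the second node, and sends the address of the second node to the nodes other than the second node.
That is, after the preset condition is met, the first node may designate any one of the K nodes among the M nodes as the next fusion controller; the designated node, as the second node, acts as the fusion controller during the second time period. At the same time, the first node sends the node fusion information of the M nodes to the next fusion controller and sends that controller's address to the other nodes. Likewise, the next fusion controller may designate the one after it, and so on.
If the second node fails during the second time period, the first node may determine a third node as the fusion controller for the second time period, where the third node is any one of the K nodes among the M nodes.
That is, when the second node fails and cannot act as the fusion controller, the first node, which previously held the role, re-determines any one of the K nodes among the M nodes as the fusion controller for the second time period; the re-determined node may be called the third node.
Note that the preset condition may be the elapse of a certain time, a certain number of fusions, a certain number of iterations, and so on; this is not limited in the embodiments of the present invention.
The time, the number of fusions, and the number of iterations may be set in advance, and the values set each time may be fixed or may vary; this is not limited in the embodiments of the present invention.
Third, the fusion controller is implemented in a distributed manner by at least one node, which may be all or some of the M nodes.
When any of the M nodes completes the specified calculation, that node, the k-th node, broadcasts its address and fusion state information to every one of the M nodes, and at least one of the M nodes receives the address and fusion state information that the k-th node sends after completing the specified calculation. Here the k-th node is any of the M nodes that has completed the specified calculation.
That is, when one or more of the M nodes simultaneously record the node fusion information of the M nodes, each node, after completing fusion, sends its own address and fusion state information, such as the calculation state and/or iteration count of the node, to at least one of the nodes that record the node fusion information.
Then any one of the at least one node determines, from the received address and fusion state information of the k-th node, the N nodes that satisfy the fusion condition and sends fusion indication information to each of the N nodes. Each node receives the fusion indication information sent by that node; the information is issued after the N nodes satisfying the fusion condition are determined and includes the addresses and/or numbers of the N nodes.
Note that a node's number uniquely identifies the node. It may be a sequence number randomly assigned to the node or any value randomly assigned to the node; this is not limited in the embodiments of the present invention.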
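Step 205: On receiving the fusion indication information, the i-th node divides its model parameters into N blocks and receives the respective i-th model parameter blocks sent by each of the N nodes other than the i-th node. Here the i-th node is any one of the N nodes, among the M nodes, that participate in the fusion, 1 ≤ i ≤ N ≤ M, and the i-th of the N blocks is the i-th model parameter block. The i-th model parameter block is the block, among the N blocks, that corresponds to the i-th node, which is responsible for the subsequent fusion of that block.
For example, as shown in FIG. 3, the i-th of the N participating nodes divides its model parameters into N blocks, each block corresponding to one node that performs the subsequent fusion of that block; the i-th block corresponds to the i-th node. The i-th node also receives the respective i-th model parameter blocks sent by each of the N nodes other than itself.
When the i-th node receives the i-th model parameter blocks sent by the other nodes among the N nodes, and when it distributes the total model parameter of the i-th block to the other nodes, full-duplex data transmission may be used; that is, the i-th node can receive data from other nodes while it is sending data to them, for example by using a full-duplex network adapter. This is not limited by the present invention.
Step 206: The i-th node sends its j-th model parameter block to the j-th node among the N nodes, where 1 ≤ j ≤ N and j ≠ i.
That is, the i-th node sends the model parameter blocks other than the i-th block to the other nodes among the N nodes: the j-th block is sent to the j-th node, which is responsible for fusing the j-th block. The j-th model parameter block is the block, among the N blocks, that corresponds to the j-th node, which is responsible for the subsequent fusion of that block.
Step 207: The i-th node fuses its own i-th model parameter block with the i-th model parameter blocks sent by the other nodes to obtain the total model parameter of the i-th block, and distributes the total model parameter of the i-th block to the other nodes among the N nodes.
For example, as shown in FIG. 4, the i-th node fuses the i-th block of its own model parameters with the i-th blocks of the model parameters sent by the other nodes to obtain the total model parameter of the i-th block, and distributes it to the other nodes among the N nodes.
Step 208: The i-th node receives the total model parameter of the j-th block, as fused and sent by the j-th node, and aggregates the corresponding parts of the fused total model parameters received from each of the N nodes other than the i-th node, generating the new total model parameters of the i-th node.
When the j-th of the N participating nodes has fused the total model parameter of the j-th block, it sends that total to the i-th node, and the i-th node receives the total model parameter of the j-th block fused by the j-th node, where 1 ≤ j ≤ N and j ≠ i.
The i-th node then aggregates the fused total model parameters received from each of the N nodes other than itself with the total model parameter of the i-th block that it fused itself, obtaining the new total model parameters fused across the N nodes.
Further, once the i-th node has obtained the new total model parameters fused across the N nodes, it may return to step 202 and perform iterative calculation based on its data subset and the new total model parameters, until the final model parameters are output.
In the model parameter fusion method provided by this embodiment of the present invention, the fusion controller determines the N nodes that satisfy the fusion condition; the i-th node divides its model parameters into N blocks, receives the respective i-th model parameter blocks sent by each of the N nodes other than the i-th node, fuses its own i-th block with the i-th blocks sent by the other nodes to obtain the total model parameter of the i-th block, and finally distributes the total model parameter of the i-th block to the other nodes among the N nodes, where the i-th node is any one of the N nodes participating in the fusion. This solves the problem of dynamically adjusting computing resources and makes it possible to remove and add nodes dynamically; in addition, every node participating in the fusion can send and receive model parameters at the same time, which improves the utilization of network resources and the stability of the system.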
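Steps 205 to 208 can be simulated in a single process to show how the blocks move. The sketch below makes several assumptions that the source does not state: parameters are flat lists of numbers, blocks are contiguous and of near-equal size, and the fusion operation is an element-wise average. The function names are hypothetical, and indices are 0-based in the code while the text counts from 1.

```python
def split_blocks(params, n):
    """Split a flat parameter list into n contiguous blocks of near-equal size."""
    base, extra = divmod(len(params), n)
    blocks, start = [], 0
    for b in range(n):
        size = base + (1 if b < extra else 0)
        blocks.append(params[start:start + size])
        start += size
    return blocks


def fuse_round(node_params):
    """Simulate one fusion among N nodes; node_params[i] is node i's parameter list."""
    n = len(node_params)
    # Step 205: every node divides its own parameters into N blocks.
    blocks = [split_blocks(p, n) for p in node_params]
    # Steps 206-207: node i gathers block i from every node and fuses the copies.
    # Averaging is an assumed fusion rule, not one given in the source.
    fused = []
    for i in range(n):
        received = [blocks[src][i] for src in range(n)]
        fused.append([sum(vals) / n for vals in zip(*received)])
    # Step 208: every node collects all fused blocks into its new total parameters.
    total = [x for block in fused for x in block]
    return [list(total) for _ in range(n)]
```

With two nodes holding `[1.0, 2.0, 3.0, 4.0]` and `[3.0, 4.0, 5.0, 6.0]`, node 0 fuses the first half, node 1 fuses the second half, and both end the round with `[2.0, 3.0, 4.0, 5.0]`. Each node handles only one block of the model, which is why no single node has to receive every node's full parameters.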
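Embodiment 3
FIG. 5 is a schematic structural diagram of a model parameter fusion apparatus according to an embodiment of the present invention. The apparatus is applied to a machine learning system that includes M nodes. As shown in FIG. 5, the apparatus includes:
a dividing unit 301, configured to divide the apparatus's own model parameters into N blocks, where N is the number of model parameter fusion apparatuses participating in fusion among the M model parameter fusion apparatuses, the i-th of the N blocks is the i-th block of model parameters, and 1 ≤ i ≤ N ≤ M; here, the i-th block of model parameters is the block among the N blocks that corresponds to the i-th model parameter fusion apparatus, which is responsible for the subsequent fusion of that block;
a first receiving unit 302, configured to receive the i-th block of model parameters sent by each of the other model parameter fusion apparatuses among the N model parameter fusion apparatuses;
a fusion unit 303, configured to fuse the apparatus's own i-th block of model parameters with the i-th blocks sent by the other model parameter fusion apparatuses to obtain the total model parameters of the i-th block; and
a first sending unit 304, configured to distribute the total model parameters of the i-th block to the other model parameter fusion apparatuses among the N model parameter fusion apparatuses.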
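When receiving the i-th blocks of model parameters sent by the other model parameter fusion apparatuses among the N model parameter fusion apparatuses, and when distributing the total model parameters of the i-th block to those other apparatuses, full-duplex data transmission may be used; that is, the apparatus can receive data from other model parameter fusion apparatuses while it is sending data to them, for example by using a full-duplex network interface card, which is not limited in the embodiments of the present invention.
In addition, the N model parameter fusion apparatuses participating in fusion are determined from the M model parameter fusion apparatuses by a preset fusion condition. The fusion condition may be that the number of nodes that have completed the iterative computation reaches a preset value, which may be constant or may vary from one fusion to the next; or that the number of times the specified computation has been completed reaches a preset count, which may likewise be constant or vary; or that the iterative computation has run for a preset duration, which may likewise be constant or vary. The fusion condition may also be another condition, which is not specifically limited in the embodiments of the present invention.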
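The three example fusion conditions can be expressed as one predicate. This is a sketch only: `fusion_condition_met` and its parameter names are hypothetical, and treating the configured thresholds as alternatives (any one being reached triggers fusion) is an assumption; a system could equally configure exactly one of them.

```python
def fusion_condition_met(finished_nodes, completed_rounds, elapsed_seconds,
                         *, min_nodes=None, min_rounds=None, max_seconds=None):
    """Return True if any configured fusion condition is satisfied.

    finished_nodes   -- number of nodes that completed the computation
    completed_rounds -- number of times the specified computation completed
    elapsed_seconds  -- time the iterative computation has been running
    A threshold left as None is not checked. The thresholds may be changed
    between fusions, matching the "constant or varying" wording.
    """
    checks = []
    if min_nodes is not None:
        checks.append(finished_nodes >= min_nodes)
    if min_rounds is not None:
        checks.append(completed_rounds >= min_rounds)
    if max_seconds is not None:
        checks.append(elapsed_seconds >= max_seconds)
    return any(checks)


ready_by_count = fusion_condition_met(3, 0, 0.0, min_nodes=3)
ready_by_time = fusion_condition_met(0, 0, 12.0, max_seconds=10.0)
```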
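Optionally, as shown in FIG. 6, the apparatus further includes:
a second sending unit 305, configured to send, after the specified iteration task is completed, the apparatus's own address and fusion state information to the fusion controller, where the fusion state information includes the computation state and/or number of iterations of the model parameter fusion apparatus; and
a second receiving unit 306, configured to receive fusion indication information, where the fusion indication information is sent by the fusion controller after it determines, according to the received addresses and fusion state information of K model parameter fusion apparatuses, the N model parameter fusion apparatuses that satisfy the fusion condition; the fusion indication information includes the addresses and/or numbers of the N model parameter fusion apparatuses; and the K model parameter fusion apparatuses are those among the M model parameter fusion apparatuses that have completed the specified iteration task, 1 ≤ K ≤ M.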
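Optionally, as shown in FIG. 7, the fusion controller is a first model parameter fusion apparatus, where the role of the first model parameter fusion apparatus is taken in turn by any one of T model parameter fusion apparatuses among the M nodes, T ≤ M, and the apparatus further includes:
a third receiving unit 307, configured to receive the address of the first model parameter fusion apparatus sent by the first model parameter fusion apparatus.
Correspondingly, the second sending unit 305 is specifically configured to:
send the apparatus's own address and fusion state information to the first model parameter fusion apparatus according to the address of the first model parameter fusion apparatus.
It should be noted that, when the model parameter fusion apparatus is the first model parameter fusion apparatus, the first model parameter fusion apparatus may be any one of K model parameter fusion apparatuses among the M model parameter fusion apparatuses, and the role may rotate; that is, the current first model parameter fusion apparatus may designate any one of the K model parameter fusion apparatuses among the M as the first model parameter fusion apparatus for the next round, which may in turn designate the one for the round after that, and so on.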
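The rotation of the controller role can be sketched as below. The text lets the current holder designate any candidate as its successor; the round-robin order used here is just one possible policy, and `next_controller` is a hypothetical name.

```python
def next_controller(candidates, current):
    """Pick the successor of `current` in round-robin order.

    candidates -- ordered list of addresses eligible for the controller role
    current    -- address of the node currently holding the role
    """
    idx = candidates.index(current)
    return candidates[(idx + 1) % len(candidates)]


candidates = ["node-a", "node-b", "node-c"]
successor = next_controller(candidates, "node-c")
```

After choosing a successor, the outgoing controller would still need to announce the successor's address so that the other nodes know where to send their reports.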
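Optionally, as shown in FIG. 8, the apparatus further includes:
a broadcasting unit 308, configured to broadcast the apparatus's own address and fusion state information to every one of the M model parameter fusion apparatuses; that is, every one of the M model parameter fusion apparatuses may simultaneously record the addresses and fusion state information of the model parameter fusion apparatuses; and
a fourth receiving unit 309, configured to receive fusion indication information sent by a second model parameter fusion apparatus, where the second model parameter fusion apparatus is any one of K model parameter fusion apparatuses among the M model parameter fusion apparatuses; the fusion indication information is sent by the second model parameter fusion apparatus after it determines, according to the received addresses and fusion state information of the K model parameter fusion apparatuses, the N model parameter fusion apparatuses that satisfy the fusion condition; the fusion indication information includes the addresses and/or numbers of the N nodes; and the K model parameter fusion apparatuses are those among the M model parameter fusion apparatuses that have completed the specified iteration task, 1 ≤ K ≤ M.
That is, any one of the model parameter fusion apparatuses that simultaneously record the addresses and fusion state information of the M model parameter fusion apparatuses serves as the second model parameter fusion apparatus, and the second model parameter fusion apparatus acts as the model parameter fusion apparatus for the next round.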
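Optionally, as shown in FIG. 8a, the apparatus further includes:
a fifth receiving unit 310, configured to receive the addresses and fusion state information of K model parameter fusion apparatuses, where the fusion state information includes the computation state and/or number of iterations of a model parameter fusion apparatus, and the K model parameter fusion apparatuses are those among the M model parameter fusion apparatuses that have completed the specified iteration task, 1 ≤ K ≤ M;
a determining unit 311, configured to determine, according to the received addresses and fusion state information of the K model parameter fusion apparatuses, the N model parameter fusion apparatuses that satisfy the fusion condition; and
a third sending unit 312, configured to send fusion indication information to the other model parameter fusion apparatuses among the N model parameter fusion apparatuses, so that they perform parameter fusion according to the fusion indication information, where the fusion indication information includes the addresses and/or numbers of the N nodes.
Optionally, as shown in FIG. 8b, the apparatus further includes:
a fourth sending unit 313, configured to send the apparatus's own address to the other model parameter fusion apparatuses among the M model parameter fusion apparatuses, so that they send their own addresses and fusion state information according to the received address.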
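Optionally, the apparatus further includes:
a fifth sending unit, configured to send the j-th block of the model parameters of the i-th model parameter fusion apparatus to the j-th model parameter fusion apparatus among the N model parameter fusion apparatuses, where 1 ≤ j ≤ N and j ≠ i.
That is, every block of the divided model parameters other than the i-th block is sent to the other model parameter fusion apparatuses among the N model parameter fusion apparatuses, with the j-th block sent to the model parameter fusion apparatus of the same number j, which is responsible for fusing the j-th block.
Optionally, as shown in FIG. 8c, the apparatus further includes:
a sixth receiving unit 314, configured to receive the total model parameters of the j-th block fused and sent by the j-th model parameter fusion apparatus;
an aggregating unit 315, configured to aggregate the corresponding portions of the fused total model parameters received from the other model parameter fusion apparatuses among the N model parameter fusion apparatuses to generate new total model parameters; and
a computing unit 316, configured to perform iterative computation according to the new total model parameters.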
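The model parameter fusion apparatus provided in this embodiment of the present invention determines N model parameter fusion apparatuses that satisfy the fusion condition, divides the model parameters of the i-th model parameter fusion apparatus into N blocks, receives the i-th blocks of model parameters sent by the other model parameter fusion apparatuses among the N, fuses the i-th block of the i-th model parameter fusion apparatus with the i-th blocks sent by the other apparatuses to obtain the total model parameters of the i-th block, and finally distributes the total model parameters of the i-th block to the other model parameter fusion apparatuses among the N. This solves the problem of dynamically adjusting computing resources and also improves network resource utilization and system stability.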
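Embodiment 4
FIG. 9 is a schematic structural diagram of a model parameter fusion apparatus according to an embodiment of the present invention. The apparatus is applied to a machine learning system that includes M nodes. As shown in FIG. 9, the apparatus includes:
a receiving unit 401, configured to receive the addresses and fusion state information sent by the nodes among the M nodes that have completed the specified computation, where the fusion state information includes a node's computation state and/or number of iterations;
a first determining unit 402, configured to determine, according to the received addresses and fusion state information, N nodes that satisfy the fusion condition, where the number N of nodes determined to satisfy the fusion condition may be the same or different each time; and
a first sending unit 403, configured to send fusion indication information to each of the N nodes, where the fusion indication information includes the addresses and/or numbers of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends the i-th block of its divided model parameters to the i-th node, 1 ≤ i ≤ N; each of the N nodes fuses the model parameters it receives; and each of the N nodes distributes the fused model parameters to the other nodes among the N nodes.
Optionally, the fusion condition is that the number of nodes that have completed the specified computation reaches a preset value; or that the number of times the specified computation has been completed reaches a preset count; or that a preset duration has elapsed. The preset value, preset count, and preset duration may each be constant or may vary from one fusion to the next. In practice, the fusion condition may also be another condition, which is not limited in the embodiments of the present invention.
Optionally, the first determining unit is further specifically configured to:
determine the N nodes from among nodes that have already completed fusion and nodes that have not completed fusion but have completed the specified computation.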
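Optionally, as shown in FIG. 10, the model parameter fusion apparatus is a first node, where the first node is any one of the M nodes, and the apparatus further includes:
a second sending unit 404, configured to send the address of the first node to the nodes among the M nodes other than the first node.
Optionally, as shown in FIG. 11, the apparatus further includes:
a second determining unit 405, configured to determine, after a preset condition is satisfied, a second node as the model parameter fusion apparatus for a second time period, where the second node is any one of K nodes among the M nodes, K ≤ M;
a third sending unit 406, configured to send node fusion information to the second node, where the node fusion information includes the addresses and fusion state information of the M nodes; and
a fourth sending unit 407, configured to send the address of the second node to the nodes other than the second node.
The preset condition may be that a certain period of time has elapsed, that a certain number of fusions has been performed, that a certain number of iterations has been performed, or the like, which is not limited in the present invention.
It should be noted that the period of time, the number of fusions, and the number of iterations may be set in advance, and each may be fixed or may vary.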
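Optionally, the apparatus further includes:
a third determining unit, configured to determine, if the second node fails within the second time period, a third node as the model parameter fusion apparatus for the second time period, where the third node is any one of the K nodes among the M nodes.
That is, when the second node fails, the third determining unit re-determines one of the M nodes as the model parameter fusion apparatus for the second time period; that node may be called the third node.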
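The failover from the second node to a third node can be sketched as follows. How a failure is detected is not described in the text, so the `is_alive` callback is a placeholder assumption, as are the name `choose_controller` and the policy of taking the first healthy candidate.

```python
def choose_controller(candidates, preferred, is_alive):
    """Return the controller for the current time period.

    candidates -- addresses of the K nodes eligible for the role
    preferred  -- the node already designated (the "second node")
    is_alive   -- callable(address) -> bool; assumed health check
    Falls back to the first healthy candidate (the "third node").
    """
    if is_alive(preferred):
        return preferred
    for node in candidates:
        if node != preferred and is_alive(node):
            return node
    raise RuntimeError("no healthy candidate available")


failed = {"node-b"}
controller = choose_controller(
    ["node-a", "node-b", "node-c"], "node-b",
    is_alive=lambda addr: addr not in failed,
)
```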
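Optionally, the model parameter fusion apparatus is at least one node among the M nodes, and the at least one node receives the address and fusion state information that each node sends after completing the specified computation. In this case, determining the N nodes that satisfy the fusion condition and sending fusion indication information to each of the N nodes means: any one of the at least one node determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes.
That is, when one or more of the M nodes simultaneously record the node fusion information of the M nodes, each node, after completing the specified computation, sends its own address and fusion state information (for example, the node's computation state and/or number of iterations) to the at least one node among the M nodes that records the node fusion information; any one of the at least one node then determines, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes.
The model parameter fusion apparatus provided in this embodiment of the present invention determines, based on the addresses and fusion state information sent by the nodes among the M nodes that have completed the specified computation, N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends its i-th block to the i-th node, each of the N nodes fuses the model parameters it receives, and each of the N nodes distributes the fused model parameters to the other nodes among the N nodes. This solves the problem of dynamically adjusting computing resources and provides the ability to dynamically remove and add nodes; in addition, every node participating in fusion can send and receive model parameters at the same time, which improves network resource utilization and system stability.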
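Embodiment 5
FIG. 12 shows a node according to an embodiment of the present invention. The node includes a memory 1201, a processor 1202, a power supply component 1203, an input/output interface 1204, a communication component 1205, and the like. The processor 1202 is configured to perform the model parameter fusion method described in Embodiment 2.
A person of ordinary skill in the art can understand that the structure shown in FIG. 12 is merely illustrative and does not limit the structure of the node. For example, the node may include more or fewer components than shown in FIG. 12, or have a configuration different from that shown in FIG. 12.
The components of the node are described in detail below:
The memory 1201 may be used to store data, software programs, and modules. It mainly includes a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created through use of the model parameter fusion apparatus, and the like. In addition, the memory may include high-speed random access memory and may further include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The processor 1202 is the control center of the node. It connects all parts of the node through various interfaces and lines, and performs the node's functions and processes data by running or executing the software programs and/or modules stored in the memory 1201 and invoking the data stored in the memory 1201, thereby monitoring the node as a whole. Optionally, the processor 1202 may include one or more processing units. Preferably, the processor 1202 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1202.
The power supply component 1203 supplies power to the components of the node. The power supply component 1203 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the node.
The input/output interface 1204 provides an interface between the processor 1202 and peripheral interface modules; for example, a peripheral interface module may be a keyboard, a mouse, or the like.
The communication component 1205 is configured to facilitate wired or wireless communication between the node and other devices. The node may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
Although not shown, the node may further include an audio component, a multimedia component, and the like, which are not described here.
In the node provided in this embodiment of the present invention, each of the N nodes participating in fusion divides its model parameters into N blocks and sends its i-th block of model parameters to the i-th node, each of the N nodes fuses the model parameters it receives, and each of the N nodes distributes the fused model parameters to the other nodes among the N nodes. This provides the ability to dynamically remove and add nodes; in addition, every node participating in fusion can send and receive model parameters at the same time, which improves network resource utilization and system stability.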
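Embodiment 6
FIG. 13 shows a fusion controller according to an embodiment of the present invention. The fusion controller includes a memory 1301, a processor 1302, a power supply component 1303, an input/output interface 1304, a communication component 1305, and the like. The processor 1302 is configured to perform the model parameter fusion method described in Embodiment 2.
A person of ordinary skill in the art can understand that the structure shown in FIG. 13 is merely illustrative and does not limit the structure of the fusion controller. For example, the fusion controller may include more or fewer components than shown in FIG. 13, or have a configuration different from that shown in FIG. 13.
The components of the fusion controller are described in detail below:
The memory 1301 may be used to store data, software programs, and modules. It mainly includes a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created through use of the model parameter fusion apparatus, and the like. In addition, the memory may include high-speed random access memory and may further include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The processor 1302 is the control center of the fusion controller. It connects all parts of the fusion controller through various interfaces and lines, and performs the fusion controller's functions and processes data by running or executing the software programs and/or modules stored in the memory 1301 and invoking the data stored in the memory 1301, thereby monitoring the fusion controller as a whole. Optionally, the processor 1302 may include one or more processing units. Preferably, the processor 1302 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1302.
The power supply component 1303 supplies power to the components of the fusion controller. The power supply component 1303 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the fusion controller.
The input/output interface 1304 provides an interface between the processor 1302 and peripheral interface modules; for example, a peripheral interface module may be a keyboard, a mouse, or the like.
The communication component 1305 is configured to facilitate wired or wireless communication between the fusion controller and other devices. The fusion controller may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
Although not shown, the fusion controller may further include an audio component, a multimedia component, and the like, which are not described here.
The fusion controller provided in this embodiment of the present invention determines, based on the addresses and fusion state information sent by the nodes among the M nodes that have completed the specified computation, N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends its i-th block of model parameters to the i-th node, each of the N nodes fuses the model parameters it receives, and each of the N nodes distributes the fused model parameters to the other nodes among the N nodes. This solves the problem of dynamically adjusting computing resources and also improves network resource utilization and system stability.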
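Embodiment 7
An embodiment of the present invention provides a machine learning system, which includes the node described in Embodiment 5 and the fusion controller described in Embodiment 6.
Optionally, the fusion controller is set up independently of the node or is configured on the node.
In the machine learning system provided in this embodiment of the present invention, the fusion controller determines, based on the addresses and fusion state information sent by the nodes among the M nodes that have completed the specified computation, N nodes that satisfy the fusion condition, and sends fusion indication information to each of the N nodes, so that each of the N nodes participating in fusion divides its model parameters into N blocks and sends its i-th block of model parameters to the i-th node, each of the N nodes fuses the model parameters it receives, and each of the N nodes distributes the fused model parameters to the other nodes among the N nodes. This solves the problem of dynamically adjusting computing resources and provides the ability to dynamically remove and add nodes; in addition, every node participating in fusion can send and receive model parameters at the same time, which improves network resource utilization and system stability.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.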
Claims (28)
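- A model parameter fusion method, applied to a machine learning system that includes M nodes, the method comprising: dividing, by an i-th node, the model parameters of the i-th node into N blocks, where the i-th node is any one of N nodes participating in fusion among the M nodes, 1 ≤ i ≤ N ≤ M, and the i-th of the N blocks is the i-th block of model parameters; receiving, by the i-th node, the i-th block of model parameters sent by each of the other nodes among the N nodes; fusing, by the i-th node, its own i-th block of model parameters with the i-th blocks sent by the other nodes to obtain total model parameters of the i-th block; and distributing, by the i-th node, the total model parameters of the i-th block to the other nodes among the N nodes.
- The method according to claim 1, wherein before the i-th node divides its model parameters into N blocks, the method further comprises: receiving, by the i-th node, fusion indication information sent by a fusion controller, where the fusion indication information is sent by the fusion controller after it determines, according to a received address and fusion state information of a k-th node, the N nodes that satisfy a fusion condition; the fusion indication information includes the addresses and/or numbers of the N nodes; the k-th node is a node among the M nodes that has completed a specified iteration task, 1 ≤ k ≤ M; and the fusion state information includes a node's computation state and/or number of iterations.
- The method according to claim 2, wherein the fusion controller is a first node, the role of the first node being taken in turn by any one of T nodes among the M nodes, T ≤ M, and the method further comprises: receiving, by the k-th node, the address of the first node sent by the first node; and sending, by the k-th node, the address and fusion state information of the k-th node to the first node according to the address of the first node.
- The method according to claim 1, wherein before the i-th node divides its model parameters into N blocks, the method further comprises: broadcasting, by a k-th node, the address and fusion state information of the k-th node to every one of the M nodes; and receiving, by the k-th node, fusion indication information sent by a second node, where the second node is any one of the M nodes, the fusion indication information is sent by the second node after it determines, according to the received address and fusion state information of the k-th node, N nodes that satisfy a fusion condition, and the fusion indication information includes the addresses and/or numbers of the N nodes.
- The method according to any one of claims 1 to 4, further comprising: sending, by the i-th node, the j-th block of its model parameters to the j-th node among the N nodes, where 1 ≤ j ≤ N and j ≠ i.
- The method according to claim 5, further comprising: receiving, by the i-th node, the total model parameters of the j-th block fused and sent by the j-th node; aggregating, by the i-th node, the corresponding portions of the fused total model parameters received from the other nodes among the N nodes to generate new total model parameters of the i-th node; and performing, by the i-th node, iterative computation according to the new total model parameters.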
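- A model parameter fusion method, applied to a machine learning system that includes M nodes, the method comprising: receiving, by a fusion controller, the addresses and fusion state information sent by the nodes among the M nodes that have completed a specified computation, where the fusion state information includes a node's computation state and/or number of iterations; determining, by the fusion controller according to the received addresses and fusion state information, N nodes that satisfy a fusion condition; and sending, by the fusion controller, fusion indication information to each of the N nodes, where the fusion indication information includes the addresses and/or numbers of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends the i-th block of its divided model parameters to the i-th node, 1 ≤ i ≤ N; each of the N nodes fuses the model parameters it receives; and each of the N nodes distributes the fused model parameters to the other nodes among the N nodes.
- The method according to claim 7, wherein the fusion condition is that the number of nodes that have completed the specified computation reaches a preset value; or that the number of times the specified computation has been completed reaches a preset count; or that a preset duration has elapsed.
- The method according to claim 7 or 8, wherein the fusion controller is a first node, the first node being any one of the M nodes, and before the fusion controller receives the addresses and fusion state information sent by the nodes among the M nodes that have completed the specified computation, the method further comprises: sending, by the first node acting as the fusion controller for a first time period, the address of the first node to the nodes among the M nodes other than the first node.
- The method according to claim 9, further comprising: determining, by the first node after a preset condition is satisfied, a second node as the fusion controller for a second time period, where the second node is any one of K nodes among the M nodes, K ≤ M; sending, by the first node, node fusion information to the second node, where the node fusion information includes the addresses and fusion state information of the M nodes; and sending, by the first node, the address of the second node to the nodes other than the second node.
- The method according to claim 7 or 8, wherein the fusion controller is at least one node among the M nodes, the at least one node receives the address and fusion state information that each node sends after completing the specified computation, and the fusion controller determining, according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition and sending fusion indication information to each of the N nodes comprises: determining, by any one of the at least one node according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sending fusion indication information to each of the N nodes.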
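- A model parameter fusion apparatus, applied to a machine learning system that includes M model parameter fusion apparatuses, the apparatus comprising: a dividing unit, configured to divide the apparatus's own model parameters into N blocks, where N is the number of model parameter fusion apparatuses participating in fusion among the M model parameter fusion apparatuses, the i-th of the N blocks is the i-th block of model parameters, and 1 ≤ i ≤ N ≤ M; a first receiving unit, configured to receive the i-th block of model parameters sent by each of the other model parameter fusion apparatuses among the N model parameter fusion apparatuses; a fusion unit, configured to fuse the apparatus's own i-th block of model parameters with the i-th blocks sent by the other model parameter fusion apparatuses to obtain total model parameters of the i-th block; and a first sending unit, configured to distribute the total model parameters of the i-th block to the other model parameter fusion apparatuses among the N model parameter fusion apparatuses.
- The apparatus according to claim 12, further comprising: a second sending unit, configured to send, after a specified iteration task is completed, the apparatus's own address and fusion state information to a fusion controller, where the fusion state information includes the computation state and/or number of iterations of the model parameter fusion apparatus; and a second receiving unit, configured to receive fusion indication information, where the fusion indication information is sent by the fusion controller after it determines, according to the received addresses and fusion state information of K model parameter fusion apparatuses, the N model parameter fusion apparatuses that satisfy a fusion condition; the fusion indication information includes the addresses and/or numbers of the N model parameter fusion apparatuses; and the K model parameter fusion apparatuses are those among the M model parameter fusion apparatuses that have completed the specified iteration task, 1 ≤ K ≤ M.
- The apparatus according to claim 13, wherein the fusion controller is a first model parameter fusion apparatus, the role of the first model parameter fusion apparatus being taken in turn by any one of T model parameter fusion apparatuses among the M nodes, T ≤ M, and the apparatus further comprises: a third receiving unit, configured to receive the address of the first model parameter fusion apparatus sent by the first model parameter fusion apparatus; and correspondingly, the second sending unit is specifically configured to send the apparatus's own address and fusion state information to the first model parameter fusion apparatus according to the address of the first model parameter fusion apparatus.
- The apparatus according to claim 12, further comprising: a broadcasting unit, configured to broadcast the apparatus's own address and fusion state information to every one of the M model parameter fusion apparatuses; and a fourth receiving unit, configured to receive fusion indication information sent by a second model parameter fusion apparatus, where the second model parameter fusion apparatus is any one of K model parameter fusion apparatuses among the M model parameter fusion apparatuses; the fusion indication information is sent by the second model parameter fusion apparatus after it determines, according to the received addresses and fusion state information of the K model parameter fusion apparatuses, the N model parameter fusion apparatuses that satisfy a fusion condition; the fusion indication information includes the addresses and/or numbers of the N nodes; and the K model parameter fusion apparatuses are those among the M model parameter fusion apparatuses that have completed a specified iteration task, 1 ≤ K ≤ M.
- The apparatus according to claim 12, further comprising: a fifth receiving unit, configured to receive the addresses and fusion state information of K model parameter fusion apparatuses, where the fusion state information includes the computation state and/or number of iterations of a model parameter fusion apparatus, and the K model parameter fusion apparatuses are those among the M model parameter fusion apparatuses that have completed a specified iteration task, 1 ≤ K ≤ M; a determining unit, configured to determine, according to the received addresses and fusion state information of the K model parameter fusion apparatuses, the N model parameter fusion apparatuses that satisfy a fusion condition; and a third sending unit, configured to send fusion indication information to the other model parameter fusion apparatuses among the N model parameter fusion apparatuses, so that they perform parameter fusion according to the fusion indication information, where the fusion indication information includes the addresses and/or numbers of the N nodes.
- The apparatus according to claim 16, further comprising: a fourth sending unit, configured to send the apparatus's own address to the other model parameter fusion apparatuses among the M model parameter fusion apparatuses, so that they send their own addresses and fusion state information according to the received address.
- The apparatus according to any one of claims 12 to 17, further comprising: a fifth sending unit, configured to send the apparatus's own j-th block of model parameters to the j-th model parameter fusion apparatus among the N model parameter fusion apparatuses, where 1 ≤ j ≤ N and j ≠ i.
- The apparatus according to claim 18, further comprising: a sixth receiving unit, configured to receive the total model parameters of the j-th block fused and sent by the j-th model parameter fusion apparatus; an aggregating unit, configured to aggregate the corresponding portions of the fused total model parameters received from the other model parameter fusion apparatuses among the N model parameter fusion apparatuses to generate new total model parameters; and a computing unit, configured to perform iterative computation according to the new total model parameters.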
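- A model parameter fusion apparatus, applied to a machine learning system that includes M nodes, the apparatus comprising: a receiving unit, configured to receive the addresses and fusion state information sent by the nodes among the M nodes that have completed a specified computation, where the fusion state information includes a node's computation state and/or number of iterations; a first determining unit, configured to determine, according to the received addresses and fusion state information, N nodes that satisfy a fusion condition; and a first sending unit, configured to send fusion indication information to each of the N nodes, where the fusion indication information includes the addresses and/or numbers of the N nodes, so that each of the N nodes divides its model parameters into N blocks and sends the i-th block of its divided model parameters to the i-th node, 1 ≤ i ≤ N; each of the N nodes fuses the model parameters it receives; and each of the N nodes distributes the fused model parameters to the other nodes among the N nodes.
- The apparatus according to claim 20, wherein the fusion condition is that the number of nodes that have completed the specified computation reaches a preset value; or that the number of times the specified computation has been completed reaches a preset count; or that a preset duration has elapsed.
- The apparatus according to claim 20 or 21, wherein the model parameter fusion apparatus is a first node, the first node being any one of the M nodes, and the apparatus further comprises: a second sending unit, configured to send the address of the first node to the nodes among the M nodes other than the first node.
- The apparatus according to claim 22, further comprising: a second determining unit, configured to determine, after a preset condition is satisfied, a second node as the model parameter fusion apparatus for a second time period, where the second node is any one of K nodes among the M nodes, K ≤ M; a third sending unit, configured to send node fusion information to the second node, where the node fusion information includes the addresses and fusion state information of the M nodes; and a fourth sending unit, configured to send the address of the second node to the nodes other than the second node.
- The apparatus according to claim 22 or 23, wherein the model parameter fusion apparatus is at least one node among the M nodes, the at least one node receives the address and fusion state information that each node sends after completing the specified computation, and determining the N nodes that satisfy the fusion condition and sending fusion indication information to each of the N nodes comprises: determining, by any one of the at least one node according to the received addresses and fusion state information, the N nodes that satisfy the fusion condition, and sending fusion indication information to each of the N nodes.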
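- A node, comprising a processor and a memory, where the memory stores code and data, the processor can run the code in the memory, and the processor is configured to perform the model parameter fusion method according to any one of claims 1 to 6.
- A fusion controller, comprising a processor and a memory, where the memory stores code and data, the processor can run the code in the memory, and the processor is configured to perform the model parameter fusion method according to any one of claims 7 to 11.
- A machine learning system, comprising the node according to claim 25 and the fusion controller according to claim 26.
- The machine learning system according to claim 27, wherein the fusion controller is set up independently of the node or is configured on the node.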
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2015/094746 WO2017084020A1 (zh) | 2015-11-16 | 2015-11-16 | 模型参数融合方法及装置 |
| EP20160309.9A EP3745284A1 (en) | 2015-11-16 | 2015-11-16 | Model parameter fusion method and apparatus |
| EP15908517.4A EP3370166B1 (en) | 2015-11-16 | 2015-11-16 | Method and apparatus for model parameter fusion |
| CN201580001419.1A CN107004003B (zh) | 2015-11-16 | 2015-11-16 | 模型参数融合方法及装置 |
| US15/980,496 US11373116B2 (en) | 2015-11-16 | 2018-05-15 | Model parameter fusion method and apparatus |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2015/094746 WO2017084020A1 (zh) | 2015-11-16 | 2015-11-16 | 模型参数融合方法及装置 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/980,496 Continuation US11373116B2 (en) | 2015-11-16 | 2018-05-15 | Model parameter fusion method and apparatus |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017084020A1 true WO2017084020A1 (zh) | 2017-05-26 |
Family
ID=58717186
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2015/094746 Ceased WO2017084020A1 (zh) | 2015-11-16 | 2015-11-16 | 模型参数融合方法及装置 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US11373116B2 (zh) |
| EP (2) | EP3370166B1 (zh) |
| CN (1) | CN107004003B (zh) |
| WO (1) | WO2017084020A1 (zh) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107209746B (zh) * | 2015-11-16 | 2019-10-22 | 华为技术有限公司 | 模型参数融合方法及装置 |
| US20210350234A1 (en) * | 2019-01-28 | 2021-11-11 | Intel Corporation | Techniques to detect fusible operators with machine learning |
| CN109840501B (zh) * | 2019-01-31 | 2021-06-01 | 深圳市商汤科技有限公司 | 一种图像处理方法及装置、电子设备、存储介质 |
| JP7238610B2 (ja) * | 2019-06-04 | 2023-03-14 | 富士フイルムビジネスイノベーション株式会社 | 情報処理装置及びプログラム |
| CN110705177B (zh) * | 2019-09-29 | 2023-05-16 | 支付宝(杭州)信息技术有限公司 | 基于机器学习的终端风险评估模型的生成方法及其系统 |
| CN111178443B (zh) * | 2019-12-31 | 2023-10-31 | 东软集团股份有限公司 | 模型参数选择、图像分类、信息识别方法及装置、设备 |
| CN115665227B (zh) * | 2022-12-28 | 2023-04-07 | 北京交通大学 | 一种普适的异构融合算网资源智慧适配网络架构及方法 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101996197A (zh) * | 2009-08-31 | 2011-03-30 | 中国移动通信集团公司 | 聚类实现方法及系统 |
| CN102467570A (zh) * | 2010-11-17 | 2012-05-23 | 日电(中国)有限公司 | 用于分布式数据仓库的连接查询系统和方法 |
| CN103914528A (zh) * | 2014-03-28 | 2014-07-09 | 南京邮电大学 | 一种关联分析算法的并行化方法 |
| CN104994172A (zh) * | 2015-07-16 | 2015-10-21 | 浪潮(北京)电子信息产业有限公司 | 一种云存储系统的监控管理系统和方法 |
Family Cites Families (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7321843B2 (en) * | 2005-09-30 | 2008-01-22 | Universidad De Salamanca | Method for generating new flow cytometry data files containing an infinite number of dimensions based on data estimation |
| CN101141413A (zh) * | 2006-09-06 | 2008-03-12 | 同济大学 | 一种公交信息采集和信息传输的方法及其实现系统 |
| US8027938B1 (en) * | 2007-03-26 | 2011-09-27 | Google Inc. | Discriminative training in machine learning |
| CN100581119C (zh) * | 2008-03-05 | 2010-01-13 | 中科院嘉兴中心微系统所分中心 | 无线传感器网络分布式融合识别方法 |
| JP5584914B2 (ja) * | 2010-07-15 | 2014-09-10 | 株式会社日立製作所 | 分散計算システム |
| EP2785058A4 (en) * | 2011-11-23 | 2014-12-03 | Huawei Tech Co Ltd | METHOD, DEVICE AND SYSTEM FOR BROADCASTING VIDEO TURNING |
| CN102869064A (zh) * | 2012-07-27 | 2013-01-09 | 南京邮电大学 | 基于特征级与决策级联合融合的分簇调制识别方法 |
| US9037519B2 (en) * | 2012-10-18 | 2015-05-19 | Enjoyor Company Limited | Urban traffic state detection based on support vector machine and multilayer perceptron |
| CN103578092A (zh) * | 2013-11-11 | 2014-02-12 | 西北大学 | 一种多聚焦图像融合方法 |
| CN104598600B (zh) * | 2015-01-23 | 2017-10-10 | 南京师范大学 | 一种基于分布式内存的并行数字地形分析优化方法 |
| CN105894087A (zh) * | 2015-01-26 | 2016-08-24 | 华为技术有限公司 | 用于神经网络中训练参数集的系统和方法 |
| CN107209746B (zh) * | 2015-11-16 | 2019-10-22 | 华为技术有限公司 | 模型参数融合方法及装置 |
| CN105959987B (zh) * | 2016-04-14 | 2019-05-14 | 北京邮电大学 | 一种提高无线传感器网络能量利用率和服务性能的数据融合算法 |
2015
- 2015-11-16: EP application EP15908517.4A, patent EP3370166B1, active
- 2015-11-16: CN application CN201580001419.1A, patent CN107004003B, active
- 2015-11-16: EP application EP20160309.9A, patent EP3745284A1, withdrawn
- 2015-11-16: WO application PCT/CN2015/094746, patent WO2017084020A1, ceased
2018
- 2018-05-15: US application US15/980,496, patent US11373116B2, active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3370166A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3370166B1 (en) | 2020-05-06 |
| CN107004003A (zh) | 2017-08-01 |
| US11373116B2 (en) | 2022-06-28 |
| EP3370166A4 (en) | 2018-10-31 |
| CN107004003B (zh) | 2020-04-28 |
| EP3370166A1 (en) | 2018-09-05 |
| EP3745284A1 (en) | 2020-12-02 |
| US20180267927A1 (en) | 2018-09-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15908517; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 2015908517; Country of ref document: EP |