WO2024067404A1 - Method, device and system for model training management - Google Patents
Method, device and system for model training management
- Publication number
- WO2024067404A1 (application PCT/CN2023/120765)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model training
- entity
- training
- target
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5094—Allocation of resources, e.g. of the central processing unit [CPU] where the allocation takes into account power or heat criteria
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/098—Distributed learning, e.g. federated learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
Definitions
- The present application relates to the field of communication technology, and more specifically, to a method, device and system for model training management.
- Models are usually obtained through training.
- A model training entity can be configured with a model training task, and a usable model is obtained by performing the model training task using training data.
- Multiple model training entities are usually configured in a communication system.
- Multiple base stations can respectively deploy multiple model training entities.
- The training tasks of the model training entity in each base station can be set according to the needs of that base station.
- The model training tasks of different model training entities will therefore also differ.
- Some model training entities in the communication system may be overloaded with model training tasks while others are largely idle, resulting in low overall model training efficiency in the communication system.
- The present application provides a method, device and system for model training management, which can improve the efficiency of model training.
- In a first aspect, a method for model training management is provided. The method can be implemented by a model training management entity or by a chip in the model training management entity, and includes: the model training management entity receives training status information of a first model training entity, where the training status information indicates at least one model training task of the first model training entity; the model training management entity obtains multiple pieces of computing power resource information, which respectively indicate the idle computing power resources that multiple model training entities have available for model training; the model training management entity determines a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information; and the model training management entity sends first training task configuration information to the first target model training entity, where the first training task configuration information indicates assisting the first model training entity in performing a target model training task among the at least one model training task.
- There can be one or more target model training tasks.
- There can be one or more first target model training entities.
- The model training management entity can thus manage and orchestrate the computing power resources of multiple model training entities, and assign training tasks of the first model training entity to other model training entities that have sufficient computing power resources to assist in completing the training, thereby reducing the waiting time of the first model training entity's training tasks and improving the efficiency of model training.
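The selection step of the first aspect can be illustrated with a minimal Python sketch. The text does not say how the first target model training entity is chosen from the computing power resource information, so the rule used here (exclude the overloaded entity, require a minimum amount of idle compute, prefer the most idle entity) is an assumption, and `ComputeReport`, `select_target_entities`, and the `mte-*` identifiers are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ComputeReport:
    """Computing power resource information reported by one model training entity."""
    entity_id: str
    idle_compute: float  # hypothetical unit of idle compute available for training


def select_target_entities(reports, source_entity_id, required_compute, max_targets=1):
    """Assumed policy: pick the entities with the most idle compute.

    The overloaded (source) entity is excluded, and only entities with at
    least `required_compute` idle compute are considered.
    """
    candidates = [
        r for r in reports
        if r.entity_id != source_entity_id and r.idle_compute >= required_compute
    ]
    candidates.sort(key=lambda r: r.idle_compute, reverse=True)
    return [r.entity_id for r in candidates[:max_targets]]


reports = [
    ComputeReport("mte-1", 0.0),   # the overloaded first model training entity
    ComputeReport("mte-2", 40.0),
    ComputeReport("mte-3", 75.0),
]
targets = select_target_entities(reports, "mte-1", required_compute=30.0)
```

Raising `max_targets` reflects the statement that there can be one or more first target model training entities.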
- The model training management entity obtaining multiple pieces of computing power resource information may include: the model training management entity periodically receives the multiple pieces of computing power resource information from the multiple model training entities.
- The periods at which the multiple model training entities send computing power resource information can be the same or different.
- Each model training entity can thus report its own computing power resource information in a timely manner, so that the model training management entity can perform timely orchestration and management, thereby improving the efficiency of model training.
- Alternatively, the model training management entity obtaining multiple pieces of computing power resource information may include: the model training management entity sends multiple pieces of computing power resource query information to the multiple model training entities respectively; and the model training management entity receives the multiple pieces of computing power resource information from the multiple model training entities respectively.
- When the model training management entity determines, based on the training status information, that the first model training entity is unable to independently perform all of the at least one model training task, it sends the multiple pieces of computing power resource query information to the multiple model training entities respectively.
- The model training entity can return computing power resource information in response to the query information of the model training management entity. That is, the model training management entity can send query information to the model training entities when there is a demand, for example when it decides to allocate assisted training.
- Querying on demand can save transmission resources and obtain up-to-date computing power resource information.
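The on-demand variant can be sketched as follows. The criterion for deciding that the first model training entity cannot perform all of its tasks on its own is not given in the text, so the comparison of total pending cost against idle compute is an assumption. `needs_assistance`, `collect_compute_info`, and the stand-in transport `fake_send_query` are hypothetical names.

```python
def needs_assistance(pending_task_costs, idle_compute):
    """Assumed criterion: total pending cost exceeds the entity's own idle compute."""
    return sum(pending_task_costs) > idle_compute


def collect_compute_info(pending_task_costs, idle_compute, entity_ids, send_query):
    """Query every candidate entity only when assistance is actually needed."""
    if not needs_assistance(pending_task_costs, idle_compute):
        return {}  # no query messages are sent, which saves transmission resources
    return {entity_id: send_query(entity_id) for entity_id in entity_ids}


# Usage with a stand-in transport that records which entities were queried.
queried = []


def fake_send_query(entity_id):
    queried.append(entity_id)
    return {"mte-2": 40.0, "mte-3": 75.0}[entity_id]


idle_case = collect_compute_info([5.0], 10.0, ["mte-2", "mte-3"], fake_send_query)
busy_case = collect_compute_info([20.0, 15.0], 10.0, ["mte-2", "mte-3"], fake_send_query)
```

In the first call nothing is sent because the entity can cope alone; only the second call triggers queries.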
- The model training management entity receiving the training status information of the first model training entity may include: the model training management entity periodically receives the training status information from the first model training entity.
- The periods at which multiple model training entities send training status information may be the same or different.
- Each model training entity can thus report its own training status information in a timely manner, so that the model training management entity can perform orchestration and management in a timely manner, thereby improving the efficiency of model training.
- The method may further include: the model training management entity sends training task configuration notification information to the first model training entity, where the training task configuration notification information indicates that the first target model training entity assists in performing the target model training task.
- The method may further include: the model training management entity determines the target length of the training data with which the first target model training entity assists in performing the target model training task; and the model training management entity sends second training task configuration information to the first model training entity, where the second training task configuration information indicates that the first target model training entity uses training data of the target length to assist in performing the target model training task.
- The training status information may also indicate the total length of the training data used to complete the target model training task, in which case the model training management entity determining the target length includes: the model training management entity determines the target length based on the total length and the computing power resource information of the first target model training entity.
- The model training management entity can thus decompose the target model training task and have multiple model training entities collaborate to complete it. This reduces the training burden on the original owner of the target model training task, i.e., the first model training entity, makes full use of the resources of multiple model training entities, and reduces the waiting time of the training task.
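One way to derive target lengths from the total length and the reported computing power resources is a proportional split, sketched below. The text only says the target length is based on these two inputs, so the proportional rule and the handling of the rounding remainder are assumptions, and `split_training_data` is a hypothetical name.

```python
def split_training_data(total_length, idle_compute):
    """Assumed policy: share training samples in proportion to idle compute.

    `idle_compute` maps entity id to idle compute; the returned lengths are
    integers that always sum to `total_length`.
    """
    total_compute = sum(idle_compute.values())
    if total_compute <= 0:
        raise ValueError("no idle compute available")
    lengths = {
        entity: int(total_length * compute / total_compute)
        for entity, compute in idle_compute.items()
    }
    # Give any rounding remainder to the entity with the most idle compute.
    remainder = total_length - sum(lengths.values())
    best = max(idle_compute, key=idle_compute.get)
    lengths[best] += remainder
    return lengths


shares = split_training_data(1000, {"mte-1": 10, "mte-2": 30, "mte-3": 60})
```

Here `mte-1` stands for the first model training entity, which keeps a small share, while the more idle entities take the rest.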
- The method may also include: the model training management entity receives assisted training feedback information from the first target model training entity, where the assisted training feedback information indicates at least one of the following: the accuracy achieved by the first target model training entity in performing the target model training task, the time it has spent performing the task, its execution progress on the task, and the amount of resources the task occupies on it.
- The method may also include: the model training management entity receives network status change information from the first target model training entity; the model training management entity determines, based on the network status change information and the multiple pieces of computing power resource information, to replace the first target model training entity with a second target model training entity among the multiple model training entities; and the model training management entity sends third training task configuration information to the second target model training entity, where the third training task configuration information indicates assisting the first model training entity in performing the target model training task.
- A model training entity can thus report its execution status to the model training management entity while assisting with the model training task, so that the model training management entity stays informed about the execution of the target model training task and can make timely adjustments, improving the reliability of model training.
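The replacement step can be sketched as below. The text does not specify how the second target model training entity is chosen, so picking the remaining entity with the most idle compute is an assumption, and `choose_replacement` is a hypothetical name.

```python
def choose_replacement(failed_entity, source_entity, idle_compute):
    """Assumed policy: pick the most idle entity other than the failed helper
    and the entity that owns the task; return None if nobody can take over."""
    candidates = {
        entity: compute
        for entity, compute in idle_compute.items()
        if entity not in (failed_entity, source_entity) and compute > 0
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


idle = {"mte-1": 0.0, "mte-2": 40.0, "mte-3": 75.0, "mte-4": 55.0}
# mte-3 reported a network status change and can no longer assist mte-1.
replacement = choose_replacement("mte-3", "mte-1", idle)
```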
- The method may also include: the model training management entity obtains policy information, where the policy information indicates a method for determining the first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information, and/or a method for determining, based on the total length of the training data used to complete the target model training task, the target length of the training data with which the first target model training entity assists in performing the target model training task.
- In a second aspect, a method for model training management is provided. The method can be implemented by a model training entity or by a chip in the model training entity, and includes: the model training entity sends training status information to a model training management entity, where the training status information indicates at least one model training task of the model training entity; the model training entity sends computing power resource information to the model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has available for model training; and the model training entity receives training task information from the model training management entity, where the training task information indicates a first target model training entity that assists in performing a target model training task among the at least one model training task.
- The model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and orchestrate the computing power resources of multiple model training entities and assign training tasks of the model training entity to other model training entities that have sufficient computing power resources to assist in completing the training, thereby reducing the waiting time of the model training entity's training tasks and improving the efficiency of model training.
- The model training entity sending computing power resource information to the model training management entity may include: the model training entity periodically sends the computing power resource information to the model training management entity; or the model training entity receives computing power resource query information from the model training management entity and, in response, sends the computing power resource information to the model training management entity.
- The model training entity sending training status information to the model training management entity may include: the model training entity periodically sends the training status information to the model training management entity; or the model training entity sends the training status information to the model training management entity based on a trigger event.
- The training task information may further indicate that the first target model training entity uses training data of a target length to assist in performing the target model training task.
- The method may also include: the model training entity receives model training report information from the first target model training entity, where the model training report information indicates the sub-model obtained by completing the target model training task; and the model training entity performs model aggregation based on the sub-model.
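The text does not say how model aggregation is performed. As one illustrative possibility, the sketch below averages sub-model parameters weighted by the length of training data each sub-model was trained on (in the style of federated averaging). That choice, the flat list-of-floats parameter format, and the name `aggregate_sub_models` are all assumptions.

```python
def aggregate_sub_models(sub_models):
    """Weighted average of sub-model parameters.

    `sub_models` is a list of (parameters, data_length) pairs, where
    `parameters` is a flat list of floats of identical size for every
    sub-model. This is an assumed aggregation rule, not one from the text.
    """
    total_length = sum(length for _, length in sub_models)
    if total_length <= 0:
        raise ValueError("sub-models must be trained on some data")
    size = len(sub_models[0][0])
    return [
        sum(params[i] * length for params, length in sub_models) / total_length
        for i in range(size)
    ]


# The sub-model trained on 300 samples counts three times as much as the
# one trained on 100 samples.
aggregated = aggregate_sub_models([([1.0, 2.0], 100), ([3.0, 4.0], 300)])
```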
- The implementations of the second aspect are the methods of the first model training entity that correspond to the implementations of the first aspect.
- For the beneficial technical effects of the implementations of the second aspect, reference can be made to the description of the relevant implementations of the first aspect; they are not repeated here.
- In a third aspect, a method for model training management is provided. The method can be implemented by a model training entity or by a chip in the model training entity, and includes: the model training entity sends computing power resource information to a model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has available for model training; the model training entity receives first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting a first model training entity in performing a target model training task; and the model training entity assists the first model training entity in performing the target model training task.
- The model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and orchestrate the computing power resources of multiple model training entities and assign training tasks of a model training entity to other model training entities that have sufficient computing power resources to assist in completing the training, thereby reducing the waiting time of the training tasks and improving the efficiency of model training.
- The model training entity sending computing power resource information to the model training management entity may include: the model training entity periodically sends the computing power resource information to the model training management entity; or the model training entity receives computing power resource query information from the model training management entity and, in response, sends the computing power resource information to the model training management entity.
- The model training entity assisting the first model training entity in performing the target model training task may include: the model training entity obtains target training data; and the model training entity uses the target training data to assist the first model training entity in performing the target model training task.
- The method may also include: the model training entity sends model training report information to the first model training entity, where the model training report information indicates the sub-model obtained by completing the target model training task.
- The method may also include: the model training entity sends assisted training feedback information to the model training management entity, where the assisted training feedback information indicates at least one of the following: the accuracy achieved by the model training entity in performing the target model training task, the time it has spent performing the task, its execution progress on the task, and the amount of resources the task occupies on it.
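The assisted training feedback information carries "at least one of" four indicators, which can be modeled as a record with optional fields. The sketch below is only an illustration: the field names, units, and the `AssistedTrainingFeedback` class are hypothetical, and no message encoding is implied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssistedTrainingFeedback:
    """Feedback from an assisting model training entity (hypothetical layout)."""
    entity_id: str
    accuracy: Optional[float] = None         # accuracy achieved so far
    elapsed_seconds: Optional[float] = None  # time spent on the task
    progress: Optional[float] = None         # fraction of the task completed
    resources_used: Optional[float] = None   # amount of resources occupied

    def __post_init__(self):
        indicators = (
            self.accuracy,
            self.elapsed_seconds,
            self.progress,
            self.resources_used,
        )
        # At least one of the four indicators must be present.
        if all(value is None for value in indicators):
            raise ValueError("feedback must carry at least one indicator")


feedback = AssistedTrainingFeedback("mte-3", accuracy=0.91, progress=0.5)
```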
- The method may also include: the model training entity sends network status change information to the model training management entity, where the network status change information indicates that the model training entity is unable to complete the target model training task.
- The implementations of the third aspect are the methods of the first target model training entity that correspond to the implementations of the first aspect.
- For the beneficial technical effects of the implementations of the third aspect, reference can be made to the description of the relevant implementations of the first aspect; they are not repeated here.
- In a fourth aspect, a method for model training management is provided. The method can be applied to a communication system that includes a model training management entity and multiple model training entities, and includes: a first model training entity sends training status information to the model training management entity, and the model training management entity receives the training status information of the first model training entity, where the training status information indicates at least one model training task of the first model training entity; the model training management entity obtains multiple pieces of computing power resource information, which respectively indicate the idle computing power resources that the multiple model training entities have available for model training; the model training management entity determines a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information; the model training management entity sends first training task configuration information to the first target model training entity, and the first target model training entity receives the first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting the first model training entity in performing the target model training task; and the first target model training entity assists the first model training entity in performing the target model training task.
- The model training entity can report computing power resource information and training status information to the model training management entity.
- The model training management entity can manage and orchestrate the computing power resources of multiple model training entities, and assign training tasks of a model training entity to other model training entities that have sufficient computing power resources to assist in completing the training, thereby reducing the waiting time of the training tasks and improving the efficiency of model training.
- The implementations of the fourth aspect are the system-level methods corresponding to the implementations of the first aspect.
- For the beneficial technical effects of the implementations of the fourth aspect, reference can be made to the description of the relevant implementations of the first aspect; they are not repeated here.
- In a fifth aspect, a communication device is provided, which includes a transceiver module and a processing module. The transceiver module is used to receive training status information of a first model training entity, where the training status information indicates at least one model training task of the first model training entity. The transceiver module is also used to obtain multiple pieces of computing power resource information, which respectively indicate the idle computing power resources that multiple model training entities have available for model training. The processing module is used to determine a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information. The transceiver module is also used to send first training task configuration information to the first target model training entity, where the first training task configuration information indicates assisting the first model training entity in performing a target model training task among the at least one model training task.
- The communication device described in the fifth aspect has the function of implementing the method in the first aspect, or in any possible implementation of the first aspect.
- The function can be implemented by hardware, or by hardware executing corresponding software.
- The hardware or software includes one or more units corresponding to the above functions.
- The implementations of the fifth aspect are the devices of the model training management entity that correspond to the implementations of the first aspect.
- For the beneficial technical effects of the implementations of the fifth aspect, reference can be made to the description of the relevant implementations of the first aspect; they are not repeated here.
- In a sixth aspect, a communication device is provided, which includes a transceiver module and a processing module. The processing module is used to generate training status information. The transceiver module is used to send the training status information to a model training management entity, where the training status information indicates at least one model training task of the model training entity. The transceiver module is also used to send computing power resource information to the model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has available for model training. The transceiver module is also used to receive training task information from the model training management entity, where the training task information indicates a first target model training entity that assists in performing a target model training task among the at least one model training task.
- The communication device described in the sixth aspect has the function of implementing the method in the second aspect, or in any possible implementation of the second aspect.
- The function can be implemented by hardware, or by hardware executing corresponding software.
- The hardware or software includes one or more units corresponding to the above functions.
- The implementations of the sixth aspect are the devices of the first model training entity that correspond to the implementations of the first aspect.
- For the beneficial technical effects of the implementations of the sixth aspect, reference can be made to the description of the relevant implementations of the first aspect; they are not repeated here.
- In a seventh aspect, a communication device is provided, which includes a transceiver module and a processing module.
- The transceiver module is used to send computing power resource information to a model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has available for model training.
- The transceiver module is also used to receive first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting a first model training entity in performing a target model training task.
- The processing module is used to assist the first model training entity in performing the target model training task.
- The communication device described in the seventh aspect has the function of implementing the method in the third aspect, or in any possible implementation of the third aspect.
- The function can be implemented by hardware, or by hardware executing corresponding software.
- The hardware or software includes one or more units corresponding to the above functions.
- The implementations of the seventh aspect are the devices of the first target model training entity that correspond to the implementations of the first aspect.
- For the beneficial technical effects of the implementations of the seventh aspect, reference can be made to the description of the relevant implementations of the first aspect; they are not repeated here.
- A communication device is provided, comprising a processor and a memory.
- A transceiver may also be included.
- The memory is used to store a computer program.
- The processor is used to call and run the computer program stored in the memory, and to control the transceiver to send and receive signals, so that the communication device performs the method in any of the first to fourth aspects, or in any possible implementation of any of these aspects.
- The communication device is a model training management function.
- A communication device is provided, comprising a processor and a memory.
- A transceiver may also be included.
- The memory is used to store a computer program.
- The processor is used to call and run the computer program stored in the memory, and to control the transceiver to send and receive signals, so that the communication device performs the method in any of the first to fourth aspects, or in any possible implementation of any of these aspects.
- The communication device is a model training function.
- A communication device is provided, comprising a processor and a communication interface. The communication interface is used to receive data and/or information and pass it to the processor, the processor processes the data and/or information, and the communication interface is also used to output the data and/or information processed by the processor, so that the method in any of the first to fourth aspects, or in any possible implementation of any of these aspects, is executed.
- The communication device can be a chip applied to a model training management function.
- A communication device is provided, comprising a processor and a communication interface. The communication interface is used to receive data and/or information and pass it to the processor, the processor processes the data and/or information, and the communication interface is also used to output the data and/or information processed by the processor, so that the method in any of the first to fourth aspects, or in any possible implementation of any of these aspects, is executed.
- The communication device can be a chip applied to a model training function.
- A computer-readable storage medium is provided, which stores computer instructions. When the computer instructions are run on a computer, the method in any of the first to fourth aspects, or in any possible implementation of any of these aspects, is executed.
- A computer program product is provided, comprising computer program code which, when run on a computer, causes the method in any of the first to fourth aspects, or in any possible implementation of any of these aspects, to be executed.
- A wireless communication system is provided, comprising the communication device described in the fifth aspect, and/or the communication device described in the sixth aspect, and/or the communication device described in the seventh aspect.
- FIG1 is a schematic structural diagram of a communication system applicable to an embodiment of the present application.
- FIG2 is a schematic structural diagram of a first application scenario applicable to an embodiment of the present application.
- FIG3 is a schematic structural diagram of a second application scenario applicable to the embodiment of the present application.
- FIG4 is a schematic diagram of multiple model training entities performing model training tasks respectively.
- FIG5 is a schematic flow chart of a method for model training management provided in an embodiment of the present application.
- FIG6 is a schematic flow chart of a method for obtaining training status information and computing power resource information provided in an embodiment of the present application.
- FIG7 is a schematic flowchart of a method for overall training of a target model training task provided in an embodiment of the present application.
- FIG8 is a schematic flowchart of a method for decomposing target model training tasks provided in an embodiment of the present application.
- FIG9 is a schematic flowchart of a method for providing feedback during the execution of a target model training task provided by an embodiment of the present application.
- FIG10 is a schematic flow chart of an implementation method of two model training management entities performing model training management provided in an embodiment of the present application.
- FIG11 to FIG13 are schematic structural diagrams of possible devices provided in embodiments of the present application.
- LTE long term evolution
- LTE-A long term evolution-advanced
- eLTE enhanced long term evolution
- 5G fifth generation mobile communication system
- NR new radio
- WiFi wireless fidelity
- WiMAX worldwide interoperability for microwave access
- 3GPP third generation partnership project
- Inference model (also referred to as model for short): a function learned from data that can achieve a specific function/mapping. Models can be obtained based on artificial intelligence (AI) or machine learning (ML) technologies, and therefore can also be called artificial intelligence/AI models, machine learning/ML models, etc. Common algorithms used to generate AI/ML models include supervised learning, unsupervised learning, and reinforcement learning. The corresponding models can be called supervised learning models, unsupervised learning models, and reinforcement learning models.
- the supervised learning model can be a classification model, a prediction model, a regression model, etc.
- the unsupervised learning model can be a clustering model.
- the model can also be obtained based on neural network (NN) technology, which can also be called a neural network model, a deep learning model, etc.
- NN neural network
- the training entity of the inference model is called the model training entity.
- the capability or function of the model training entity can be deployed on a certain network element, which is called the model training network element; the capability or function of the model training entity can also be deployed on other devices, which is not limited in the embodiments of the present application; for the convenience of description, the embodiments of the present application take the model training network element as an example for explanation, but they can all be replaced by other devices with the capability or function of training the inference model.
- the model training management entity is used to manage and orchestrate the training tasks and computing resources of multiple model training entities.
- the capabilities or functions of the model training management entity can be deployed on a certain network element, which is called the model training management network element; the capabilities or functions of the model training management entity can also be deployed on other devices, which is not limited in the embodiments of the present application; for the convenience of description, the embodiments of the present application take the model training management network element as an example for explanation, but they can all be replaced by other devices that manage the capabilities or functions of the model training network element.
- Model reasoning entity: an entity that performs reasoning or prediction based on a model is called a model reasoning entity.
- the capability or function of a model reasoning entity can be deployed on a certain network element, which is called a model reasoning network element; the capability or function of a model reasoning entity can also be deployed on other devices, which is not limited in the embodiments of the present application; for the convenience of description, the embodiments of the present application take the model reasoning network element as an example for explanation, but they can all be replaced by other devices with the capability or function of performing reasoning or prediction based on a model.
- a model training task is the basic unit of work that a model training entity can divide when training a model.
- Fig. 1 is a schematic structural diagram of a communication system applicable to an embodiment of the present application. First, the devices that may be involved in the communication system 100 are described.
- Model training management network element 110 can be used to manage and arrange the training tasks and computing resources of multiple model training entities.
- the model training management network element 110 can be deployed in a network management system (NMS) or in an element management system (EMS).
- NMS network management system
- EMS element management system
- the model training management network element 110 can be connected to at least one model training network element.
- the model training management network element 110 is connected to the model training network element 121 and the model training network element 122 respectively.
- Model training network element 121 and model training network element 122 can be used to train the model.
- the model training network element 121 can be deployed in the EMS, the NMS, network equipment in the radio access network (RAN) domain, or core network elements in the core network domain, such as network data analytics function (NWDAF) network elements.
- the model training network element 122 can also be deployed in the EMS, the NMS, network equipment or core network elements.
- model training network elements can be deployed in the same system, device or network element, such as model training network element 121 and model training network element 122 can be deployed in the same network device; or, different model training network elements can be deployed in different systems, devices or network elements, such as model training network element 121 and model training network element 122 are respectively deployed in different network devices, and for example, model training network element 121 is deployed in a network device, and model training network element 122 is deployed in a core network element, etc.
- This application does not specifically limit this.
- Model reasoning network element 130 can be used for reasoning or prediction based on the model.
- Model reasoning network element 130 can be deployed in EMS, such as management data analytics function (MDAF) in EMS, or model reasoning network element 130 can also be deployed in network equipment in RAN domain, or core network element in core network domain, such as NWDAF network element.
- MDAF management data analytics function
- NWDAF network data analytics function
- Figure 1 uses the model inference network element 130 and the model training network element 121 as an example to illustrate the connection.
- the model inference network element 130 and the model training network element 121 can be deployed on different devices, such as the model training network element 121 can be deployed on the NMS, and the model inference network element 130 can be deployed on the network device; the model inference network element 130 and the model training network element 121 can also be deployed on the same device, such as the model inference network element 130 and the model training network element 121 can be deployed on the same network device or core network element. This application does not specifically limit this.
- the communication system 100 may include multiple model training management network elements, for example, model training management network elements can be deployed separately in the NMS and the EMS.
- the communication system 100 can also include multiple model reasoning network elements, for example, the model training network element 122 can also be connected to a model reasoning network element. This application does not limit the number of model training network elements, model training management function network elements, and model reasoning network elements.
- FIG2 is a schematic structural diagram of the first application scenario applicable to the embodiment of the present application.
- NMS 210 is deployed with a model training management network element 211 and a model training network element 212.
- NMS 210 can manage network device 220, network device 230, NWDAF 240, and NWDAF 250.
- Model training network element 221 is deployed on network device 220
- model training network element 231 is deployed on network device 230
- model training network element 241 is deployed on NWDAF 240
- model training network element 251 is deployed on NWDAF 250.
- Model training management network element 211 and model training network element 221 can communicate through the communication interface between NMS 210 and network device 220.
- Model training management network element 211 and model training network element 231 can communicate through the communication interface between NMS 210 and network device 230.
- Model training management network element 211 and model training network element 241 can communicate through the communication interface between NMS 210 and NWDAF 240.
- the model training management network element 211 and the model training network element 251 can communicate with each other through the communication interface between the NMS 210 and the NWDAF 250.
- the model training management network element 211 and the model training network element 212 can communicate with each other through the internal interface in the NMS 210.
- FIG3 is a schematic structural diagram of a second application scenario applicable to an embodiment of the present application.
- NMS 310 is deployed with model training management network element 311 and model training network element 312, and EMS 320 is deployed with model training management network element 321 and model training network element 322.
- NMS 310 can manage network device 330, network device 340, NWDAF 350 and NWDAF 360 through EMS 320.
- Model training network element 331 is deployed on network device 330
- model training network element 341 is deployed on network device 340
- model training network element 351 is deployed on NWDAF 350
- model training network element 361 is deployed on NWDAF 360.
- the model training management network element 311 and the model training management network element 321 can jointly manage the model training network element 312, the model training network element 322, the model training network element 331, the model training network element 341, the model training network element 351, and the model training network element 361.
- the model training management network element 321 in the EMS 320 can communicate with the model training network element 312, the model training network element 322, the model training network element 331, the model training network element 341, the model training network element 351, and the model training network element 361 respectively to obtain the information of each model training network element.
- the model training management network element 311 in the NMS 310 can provide the model training management network element 321 with a strategy for analyzing information.
- the model training management network element 311 and the model training management network element 321 can communicate through the interface between the NMS 310 and the EMS 320.
- the model training management network element 321 and the model training network element 331 can communicate through the communication interface between the EMS 320 and the network device 330.
- the model training management network element 321 and the model training network element 341 can communicate through the communication interface between the EMS 320 and the network device 340.
- the model training management network element 321 and the model training network element 351 can communicate through the communication interface between the EMS 320 and the NWDAF 350.
- the model training management network element 321 and the model training network element 361 can communicate through the communication interface between the EMS 320 and the NWDAF 360.
- the model training management network element 311 and the model training network element 312 can communicate through the internal interface in the NMS 310.
- the model training management network element 321 and the model training network element 322 can communicate through the internal interface in EMS 320.
- the NMS can also manage multiple model training network elements through multiple EMSs, and this application does not specifically limit this.
- the solution of the present application can be applied to other systems including corresponding entities, and the present application does not limit this.
- the above-mentioned entity or function can be a network element in a hardware device, or a software function running on dedicated hardware, or a virtualization function instantiated on a platform (e.g., a cloud platform).
- the above-mentioned entity or function can be implemented by one device, or by multiple devices, or a functional module in a device, and the embodiments of the present application do not specifically limit this.
- model training network elements can be deployed in the communication system, and each model training network element can have multiple training tasks.
- model training tasks of different model training network elements will also be different.
- Figure 4 is a schematic diagram of multiple model training entities executing model training tasks respectively.
- Figure 4 shows three model training network elements that execute model training tasks respectively. Since the network environments and network requirements of different model training entities may be different and dynamically changing, the model training tasks of different model training entities will also be different.
- the training tasks of model training network element #1 are overloaded, and some training tasks are in the queue.
- the training tasks of model training network element #3 are also at full load, and there are no idle computing resources.
- Model training network element #2 still has a large amount of idle computing resources that are not used. It can be seen that, within a certain period of time, the distribution of model training tasks among multiple model training network elements is unbalanced. Some model training tasks cannot be executed in a timely manner, while some computing resources are idle and unused, resulting in low overall efficiency of model training in the communication system.
- the present application proposes a method, device and system for model training management, which can improve the efficiency of model training.
- the method for model training management is first described below in conjunction with Figure 5.
- model training entity #1 may be an example of a first model training entity that is assisted in a target model training task
- model training entity #2 may be an example of a target model training entity that assists in performing a target model training task.
- the present application does not specifically limit the number of model training entities, and illustratively, the number of model training entities #2 may be one or more, that is, one or more target model training entities may assist in performing a target model training task.
- the model training management entity can be any model training management entity described in Figures 1 to 3, and model training entity #1 and model training entity #2 can be any model training entities described in Figures 1 to 3, and this application does not make any special limitations on this.
- FIG5 is a schematic flowchart of a model training management method provided in an embodiment of the present application.
- model training entity #1 sends training status information to the model training management entity
- model training management entity receives the training status information from model training entity #1.
- the training status information indicates that model training entity #1 has at least one model training task.
- the training status information includes at least one of the following information: model training task identification information, priority information of the model training task, process information of the model training task, and performance information of the model training task.
- the model training task identification information indicates the training identification of at least one model training task that the model training entity #1 has. For example, if the model training entity #1 has three model training tasks 1-3, then the model training task identification information may include the identifications of the three model training tasks.
- the priority information of the model training task indicates the priority of at least one model training task that the model training entity #1 has.
- the priority information may respectively indicate the priority of each model training task in at least one model training task that the model training entity #1 has, for example, the priority information indicates that the priority of model training task 1 is high, and the priority of model training tasks 2 and 3 is low.
- Priority can be expressed as high, medium, or low, or as a number (1, 2, 3, etc.), where a smaller number means a higher priority.
- the priority information may also indicate the number of model training tasks with high priority that the model training entity #1 has, for example, the priority information indicates that the model training entity #1 has 1 model training task with high priority. It should be noted that the present application does not impose any limitation on the setting of the priority of the model training task, for example, the priority may be determined based on the chronological order of the execution of the request, or based on the importance of the model training task, etc.
- the process information of the model training task indicates the process of model training entity #1 performing model training.
- the process information may indicate the process status of each model training task in at least one model training task that model training entity #1 has.
- the process status may include waiting to run, running, and completed running.
- the process information indicates that model training task 1 has completed running, model training task 2 is running, and model training task 3 has completed running.
- the process information may indicate the overall process of model training entity #1 performing model training. For example, the process information indicates the number of model training tasks that have not been completed (i.e., are running or waiting to be run) in model training entity #1, such as indicating that there are still two model training tasks in model training entity #1 that have not been completed.
- the performance information of the model training task indicates the performance of the model training entity #1 in performing the model training.
- the performance information may indicate at least one of the following: the time required for a single training of the model training task, the computing resources required for a single training of the model training task, the number of trainings required for the model training task, the average time required for each training when the model training task performs multiple trainings, the average computing resources required for each training when the model training task performs multiple trainings, the total time required for multiple trainings of the model training task, the total computing resources required for multiple trainings of the model training task, the average training time for executing multiple model training tasks, and the average computing resources required for executing multiple model training tasks.
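- For illustration only, the identification, priority, process, and performance information described above could be carried in a structure such as the following. This is a minimal sketch and not a format defined by this application: the class names, field names, and units (`TaskStatus`, `TrainingStatusInfo`, `time_per_training_s`, `compute_per_training`) are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ProcessStatus(Enum):
    """Process states named in the description."""
    WAITING = "waiting to run"
    RUNNING = "running"
    COMPLETED = "completed running"


@dataclass
class TaskStatus:
    """Status of one model training task (hypothetical layout)."""
    task_id: str                                   # model training task identification
    priority: int                                  # smaller number = higher priority
    status: ProcessStatus                          # process information
    time_per_training_s: Optional[float] = None    # performance information (assumed unit: seconds)
    compute_per_training: Optional[float] = None   # performance information (assumed unit: FLOPS)


@dataclass
class TrainingStatusInfo:
    """Training status information reported by one model training entity."""
    entity_id: str
    tasks: List[TaskStatus] = field(default_factory=list)

    def unfinished_count(self) -> int:
        """Number of tasks that are running or waiting to be run."""
        return sum(1 for t in self.tasks if t.status is not ProcessStatus.COMPLETED)


info = TrainingStatusInfo(
    entity_id="entity-1",
    tasks=[
        TaskStatus("task-1", 1, ProcessStatus.COMPLETED),
        TaskStatus("task-2", 2, ProcessStatus.RUNNING),
        TaskStatus("task-3", 2, ProcessStatus.WAITING),
    ],
)
print(info.unfinished_count())
```

- In this sketch, `unfinished_count` corresponds to the overall process information, i.e. the number of model training tasks that have not been completed.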
- the training status information further indicates that the model training entity #1 requests collaborative training, that is, the model training entity #1 can determine whether to request collaborative training.
- collaborative training may mean that model training entity #1 does not perform a certain model training task alone.
- the model training task may be collaboratively performed by multiple model training entities.
- the training task may be collaboratively performed by other model training entities (such as model training entity #2).
- model training entity #1 determines whether to request collaborative training according to whether its idle computing resources can meet the requirements of the training task. If the requirements can be met, model training entity #1 determines not to request collaborative training. If they cannot be met, model training entity #1 requests collaborative training.
- model training entity #1 determines to request collaborative training.
- model training entity #1 determines whether collaborative training is required based on the priority of the model training task.
- model training entity #1 may have multiple training tasks waiting to be run, and the training tasks with higher priorities among the multiple training tasks will be run before the training tasks with lower priorities. If model training entity #1 has a training task with a higher priority than the newly added training task, and the idle computing resources are not enough to complete all the training tasks waiting to be run, that is, after deducting the resources required by the high-priority training tasks from the idle computing resources, the remaining computing resources cannot meet the needs of the training tasks with lower priorities, then model training entity #1 can determine to request collaborative training.
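- The decision described above can be sketched as follows. This is one possible reading, under the assumptions that computing resources are expressed as a single number and that only waiting tasks with a strictly higher priority than the newly added task are reserved first. The function name `should_request_collaboration` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# A task is modelled as (priority, required_compute); smaller priority number = higher priority.
Task = Tuple[int, float]


def should_request_collaboration(idle_compute: float,
                                 waiting: List[Task],
                                 new_task: Task) -> bool:
    """Return True if the entity should request collaborative training.

    Resources needed by waiting tasks with a higher priority than the new
    task are deducted from the idle resources first. Collaboration is
    requested when the remainder cannot cover the new task.
    """
    new_priority, new_demand = new_task
    reserved = sum(demand for priority, demand in waiting if priority < new_priority)
    return idle_compute - reserved < new_demand


# No waiting tasks: the check reduces to "idle resources vs. task requirement".
print(should_request_collaboration(10.0, [], (2, 5.0)))
# A higher-priority waiting task consumes most of the idle resources.
print(should_request_collaboration(10.0, [(1, 6.0)], (2, 5.0)))
```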
- the training status information also indicates the target model training task for which collaborative training is requested, that is, the model training entity #1 may determine the target model training task for which collaborative training is requested.
- the target model training task for requesting collaborative training can be any one or more of the at least one model training task in model training entity #1, that is, the target model training task can be a newly added model training task or other model training task, and this application does not specifically limit this.
- the model training entity #1 determines a target model training task according to a training process of at least one training task. Exemplarily, the model training entity #1 determines a model training task that is waiting to be run as the target model training task.
- model training entity #1 determines the target model training task according to the priority of at least one model training task.
- model training entity #1 determines a model training task with a low priority as a target model training task.
- model training entity #1 can set a priority value for each model training task, such as values 1 to 10 representing different priorities from high priority to low priority. If the priority value of the model training task is greater than or equal to a specific threshold (for example, 5), then the model training task is determined as the target model training task.
- the specific threshold can be pre-configured or dynamically determined (for example, determined according to computing power or current network status), and this application does not specifically limit this.
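- As a minimal sketch of the threshold-based selection above, assuming priority values from 1 (high) to 10 (low) and a pre-configured threshold of 5 as in the example; the function name `select_target_tasks` is hypothetical:

```python
from typing import Dict, List


def select_target_tasks(priorities: Dict[str, int], threshold: int = 5) -> List[str]:
    """Return the identifiers of tasks whose priority value is >= threshold.

    A larger value means a lower priority, so the selected tasks are the
    low-priority ones that are candidates for collaborative training.
    """
    return [task_id for task_id, value in priorities.items() if value >= threshold]


print(select_target_tasks({"task-1": 1, "task-2": 5, "task-3": 9}))
```

- A dynamically determined threshold could be passed through the `threshold` parameter instead of the default value.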
- model training entity #1 can directly indicate the target model training task for which collaborative training is requested.
- the identifier of the model training task for which collaborative training is requested can be carried in the training status information, such as carrying the identifier of model training task 2, indicating that model training task 2 is recommended for collaborative training.
- model training entity #1 may also indirectly indicate a target model training task that is recommended for collaborative training.
- the training status information may indicate that a training task whose training process is in a waiting state is to be trained collaboratively, or may indicate that a training task whose training process is in a waiting state and a low priority is to be trained collaboratively.
- model training entity #1 and the model training management entity may also pre-configure or negotiate in advance how to select training tasks for collaborative training, and this application does not specifically limit this.
- if the training status information includes identification information of the target model training task, then the identification information of the target model training task can implicitly request collaborative training.
- the target model training task is the model training task for collaborative training recommended by model training entity #1.
- the model training management entity can change the target model training task of collaborative training based on the actual network status and computing power resources of other model training entities. This application does not specifically limit this.
- model training entity #1 may periodically send training status information to the model training management entity.
- multiple model training entities may be configured to periodically report training status information, wherein the reporting period may be pre-set or configured by the model training management entity, and this application does not specifically limit this.
- model training entity #1 may send training status information to the model training management entity based on a trigger.
- the model training entity may send training status information to the model training management entity when a new model training task is added.
- the training status information may only include identification information, priority information, and other information of the newly added model training task.
- the model training management entity obtains multiple computing resource information.
- the multiple computing power resource information respectively indicate the idle computing power resources used for model training possessed by the multiple model training entities.
- the model training management entity receives computing resource information #1 from model training entity #1, where computing resource information #1 indicates the idle computing resources for model training possessed by model training entity #1.
- the model training management entity receives computing resource information #2 from model training entity #2, where computing resource information #2 indicates the idle computing resources for model training possessed by model training entity #2.
- the computing power resource information may include at least one of the following: hardware resource information, resource usage information.
- the hardware resource information may indicate the performance of the hardware resource, for example, the hardware resource information may indicate at least one of the following: the type of hardware resource, the number of cores of the hardware resource, and the processing frequency of the hardware resource.
- the hardware resource information may also indicate the quantified computing power, such as the number of floating-point operations per second (FLOPS).
- FLOPS floating-point operations per second
- the model training management function may determine the computing performance of the hardware resources through the above information.
- the resource usage information may indicate the utilization rate of the hardware resources, for example, the resource usage information indicates the idle computing power of the hardware resource, the computing power that has been used, or the computing power that can still be supported for model training.
- the model training management function may obtain the computing power that the model training entity can use for model training.
- the above-mentioned hardware resources may include a processor, a memory, etc.
- the processor may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processing unit (NPU).
- the memory may be a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media that can store program codes, and this application does not specifically limit this.
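- For illustration, the hardware resource information and resource usage information could be combined to estimate the computing power still available for model training. The sketch below is a simplification: it assumes the idle computing power equals the quantified peak computing power multiplied by the unused share, which is not stated in this application. All names (`ComputeResourceInfo`, `peak_flops`, `utilization`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComputeResourceInfo:
    """Computing power resource information of one model training entity."""
    entity_id: str
    hardware_type: str   # e.g. "CPU", "GPU", "NPU"
    cores: int           # number of cores of the hardware resource
    peak_flops: float    # quantified computing power (FLOPS)
    utilization: float   # share of computing power already used, in [0, 1]

    def idle_flops(self) -> float:
        """Estimated idle computing power under the linear assumption above."""
        if not 0.0 <= self.utilization <= 1.0:
            raise ValueError("utilization must be between 0 and 1")
        return self.peak_flops * (1.0 - self.utilization)


resource = ComputeResourceInfo("entity-2", "GPU", 64, 1e12, 0.25)
print(resource.idle_flops())
```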
- multiple model training entities may periodically send computing power resource information to the model training management entity respectively, for example, multiple model training entities may be configured to periodically report computing power resource information, wherein the reporting period may be pre-set or configured by the model training management entity, and this application does not specifically limit this.
- the model training entity sends computing power resource information to the model training management entity based on a request, for example, the model training management entity sends computing power resource query information to the model training entity, and then the model training entity sends computing power resource information to the model training management entity.
- the model training management entity sends multiple computing power resource query information to multiple model training entities respectively, and the model training management entity receives multiple computing power resource information from multiple model training entities respectively. That is, the model training entity can return computing power resource information based on the query information of the model training management entity.
- if the model training management entity determines, based on the training status information, that the first model training entity is unable to independently perform all of the at least one model training task, it sends multiple computing power resource query information to multiple model training entities respectively.
- the model training entity can return computing power resource information based on the query information of the model training management entity, that is, the model training management entity can send query information to the model training entity when there is a demand, such as when it is determined to allocate assisted training, which can save transmission resources.
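- The request-based reporting described above can be sketched as follows. The example is a hedged illustration in which each model training entity is represented by a hypothetical callable that returns its idle computing power when queried. No query is sent unless assistance is needed, which is where the saving in transmission resources comes from.

```python
from typing import Callable, Dict


def collect_compute_info(needs_assistance: bool,
                         query_fns: Dict[str, Callable[[], float]]) -> Dict[str, float]:
    """Query each model training entity only when assistance is needed.

    query_fns maps an entity identifier to a callable that stands in for
    sending query information and receiving computing power resource
    information in return.
    """
    if not needs_assistance:
        return {}  # no query messages are sent
    return {entity_id: query() for entity_id, query in query_fns.items()}


entities = {"entity-2": lambda: 8.0, "entity-3": lambda: 0.0}
print(collect_compute_info(False, entities))
print(collect_compute_info(True, entities))
```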
- the model training management entity determines at least one model training entity #2 among multiple model training entities based on multiple computing resource information.
- the at least one model training entity #2 is used to collaboratively perform the target model training task.
- If the training status information in step S501 indicates the target model training task for which model training entity #1 requests collaborative training, then the model training management entity can directly determine the model training entity #2 among the multiple model training entities based on the information of the target model training task and the multiple computing power resource information. If the training status information in step S501 indicates that the model training entity #1 requests collaborative training, but does not indicate the target model training task, then before the model training management entity determines the model training entity #2, the model training management entity determines whether to configure collaborative training for the model training entity #1 based on the training status information and the multiple computing power resource information.
- If the training status information in step S501 does not indicate that the model training entity #1 requests collaborative training, nor does it indicate the target model training task, then before the model training management entity determines the model training entity #2, the model training management entity determines whether to configure collaborative training for the model training entity #1 based on the training status information and the multiple computing power resource information, and determines the target model training task if it determines to configure collaborative training for the model training entity #1.
- the following describes the determination of whether the model training management entity configures collaborative training for the model training entity #1 and the determination of the target model training task.
- the model training management entity determines whether to configure collaborative training for model training entity #1 based on the computing power resource information and training status information reported by model training entity #1. In this implementation, the model training management entity determines whether to configure collaborative training for model training entity #1 in a similar way to how model training entity #1 determines whether to request collaborative training, such as based on idle computing power resources or the priority of the model training task. For a detailed description, please refer to step S501 above, which will not be repeated here.
- the model training management entity determines whether to configure collaborative training for model training entity #1 based on multiple computing resource information and training status information reported by multiple model training entities.
- the model training management entity determines, from the idle computing power of the multiple model training entities indicated by the multiple pieces of computing power resource information, which model training tasks can be processed. If the idle computing power cannot support assisting in the execution of the model training tasks indicated by the training status information, the model training management entity determines not to configure collaborative training for model training entity #1. If the idle computing power can support assisting in the execution of the model training tasks indicated by the training status information, the model training management entity determines to configure collaborative training for model training entity #1.
- the target model training task is determined.
- the model training management entity determines the target model training task based on the computing resource information and training status information reported by model training entity #1.
- the model training management entity determines the target model training task in a similar way to how model training entity #1 determines the target model training task, for example, it can be determined based on the training process or priority.
- for a detailed description, please refer to step S501 above, which will not be repeated here.
- the model training management entity determines the target model training task based on multiple computing power resource information and training status information reported by multiple model training entities.
- the model training management entity can determine candidate model training tasks for collaborative training based on the training status information, for example by taking all model training tasks that are in a waiting state as candidates. The model training management entity then determines the target model training task from the candidate model training tasks based on the idle computing power of the multiple model training entities indicated by the multiple pieces of computing power resource information, for example by selecting a model training task that the idle computing power of the multiple model training entities can process.
- This application does not specifically limit the specific method for determining the model training task.
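- as a minimal, non-limiting sketch of the candidate selection described above, the following Python code takes the waiting tasks as candidates and keeps those that the idle computing power of at least one model training entity can process. The `TrainingTask` type, the `required_flops` field, and the rule of comparing against the largest single idle capacity are assumptions made for illustration and are not specified in this application.

```python
from dataclasses import dataclass

@dataclass
class TrainingTask:
    task_id: str
    state: str             # "running" or "waiting"
    required_flops: float  # hypothetical estimate of the compute the task needs (FLOPS)

def select_target_tasks(tasks, idle_flops_per_entity):
    """Return the waiting tasks that at least one entity's idle compute can process."""
    candidates = [t for t in tasks if t.state == "waiting"]
    best_idle = max(idle_flops_per_entity, default=0.0)
    return [t for t in candidates if t.required_flops <= best_idle]

def should_configure_collaboration(tasks, idle_flops_per_entity):
    """Configure collaborative training only if some candidate task can be assisted."""
    return bool(select_target_tasks(tasks, idle_flops_per_entity))
```

- with this sketch, an empty list of idle capacities simply yields no target task, which corresponds to the case where collaborative training is not configured.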
- the model training management entity can also use the following two methods to implement collaborative training of the target model training task, where Method 1 can be regarded as overall collaborative training, that is, the target model training task is configured to model training entities other than model training entity #1 for training, and Method 2 can be regarded as decomposed collaborative training, such as decomposing the target model training task into multiple target model training subtasks, each target model training subtask corresponds to training data of a target length, and multiple model training entities use the training data of the target length for collaborative training, or, for example, decomposing the target model into multiple target sub-models, each target sub-model can be trained independently, and the training of the target sub-model can be understood as a target model training sub-task.
- the model training management entity can preset which method to use, or can determine which method to use based on the target model training task, the computing power resource information, or the capabilities of the model training entities. For example, if there is a model training entity whose idle computing power resources can support executing the target model training task alone, the model training management entity can use Method 1, that is, that model training entity executes the target model training task alone; if no model training entity has idle computing power resources that can support executing the target model training task alone, the model training management entity can use Method 2, that is, it decomposes the target model training task into multiple target model training subtasks and selects multiple target model training entities to execute the subtasks respectively.
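- the choice between Method 1 and Method 2 can be sketched as follows. The function names are hypothetical, and splitting the training data in proportion to idle computing power is only one possible way of choosing the target lengths; this application does not prescribe it.

```python
def choose_method(required_flops, idle_flops_per_entity):
    """Return 1 (overall collaborative training) if a single entity has enough
    idle compute for the whole task, otherwise 2 (decomposed training)."""
    if any(idle >= required_flops for idle in idle_flops_per_entity):
        return 1
    return 2

def split_data_lengths(total_samples, idle_flops_per_entity):
    """Assumed policy for Method 2: give each assisting entity a target data
    length proportional to its idle compute. Requires a non-empty list with a
    positive total."""
    total_idle = sum(idle_flops_per_entity)
    lengths = [int(total_samples * f / total_idle) for f in idle_flops_per_entity]
    lengths[-1] += total_samples - sum(lengths)  # assign any rounding remainder
    return lengths
```

- for instance, with a requirement of 0.5 TFLOPS and idle capacities of 0.2 and 0.4 TFLOPS, no single entity suffices, so the sketch selects Method 2.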
- the manner in which the model training management entity determines one target model training entity or multiple target model training entities (i.e., model training entity #2 in the embodiments of the present application) is similar. For ease of description, the following explains how one model training entity #2 is determined.
- the model training management function estimates the computing power requirement of the target model training task, and the computing power requirement of the target model training task can be used to determine the model training entity #2.
- the model training management entity determines the computing power requirement of the target model training task according to the performance requirement of the target model training task.
- the performance requirement may include at least one of the following: the training accuracy of the target model training task, the training duration of the target model training task.
- the computing power requirement may include at least one of the following: the total floating point operations (FLOPs) required to train the target model training task, and the floating point operations per second (FLOPS) of the target model training task.
- the model training management entity can determine the total number of floating-point operations required to achieve the training accuracy of the target model training task, and then determine the FLOPS based on the total number of floating-point operations and the training duration. For example: for a model training task for a model used for channel estimation, the amount of computation required to achieve a training accuracy of 90% for the model training task is estimated to be 0.5TFLOPs (number of operations per layer * number of layers * number of iterations). If the training duration expected to complete the training is 1s, then the model training management entity can determine the computing power requirement to be 0.5TFLOPS.
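- the arithmetic in the example above can be sketched as follows. Only the total of 0.5 TFLOPs and the 1 s duration come from the example; the per-layer operation count, layer count, and iteration count below are hypothetical values chosen so that their product equals that total.

```python
def total_flops(ops_per_layer, num_layers, num_iterations):
    """Total floating-point operations: operations per layer * layers * iterations."""
    return ops_per_layer * num_layers * num_iterations

def required_flops_per_second(total_operations, training_duration_s):
    """Computing power requirement (FLOPS) needed to finish within the duration."""
    if training_duration_s <= 0:
        raise ValueError("training duration must be positive")
    return total_operations / training_duration_s

# Hypothetical breakdown: 5e6 operations per layer, 10 layers, 10,000 iterations.
total = total_flops(5e6, 10, 10_000)             # 0.5e12 operations (0.5 TFLOPs)
demand = required_flops_per_second(total, 1.0)   # 0.5e12 FLOPS (0.5 TFLOPS)
```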
- the model training management function determines the model training entity #2 based on the computing power requirements of the target model training task and multiple computing power resource information.
- the model training management entity may determine model training entity #2 based on computing power resource information of each model training entity among multiple model training entities.
- the model training management entity obtains hardware resource information and resource usage information of multiple model training entities, determines at least one model training entity with idle computing power based on the resource usage information, and then determines model training entity #2 from the at least one model training entity that can support the computing power requirements of the target model training task based on the hardware resource information.
- the model training management entity may determine the model training entity that is closest to model training entity #1 as model training entity #2 that performs collaborative training on the target model training task.
- the model training entity that is closest to model training entity #1 may refer to the model training entity at the shortest distance from model training entity #1, or to the model training entity separated from model training entity #1 by the fewest transmission nodes; this application does not specifically limit this.
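- the selection of model training entity #2 described above can be sketched as a filter followed by a nearest-entity choice. The `EntityResources` type and its fields are hypothetical, and idle computing power is approximated here as peak capability multiplied by the unused fraction, which is an assumption rather than a rule stated in this application.

```python
from dataclasses import dataclass

@dataclass
class EntityResources:
    entity_id: str
    peak_flops: float        # from the hardware resource information
    utilization: float       # from the resource usage information, 0.0 to 1.0
    hops_to_requester: int   # transmission nodes between this entity and entity #1

    @property
    def idle_flops(self):
        return self.peak_flops * (1.0 - self.utilization)

def select_assistant(entities, required_flops):
    """Pick the capable entity with the fewest hops to entity #1, or None."""
    capable = [e for e in entities if e.idle_flops >= required_flops]
    if not capable:
        return None
    return min(capable, key=lambda e: e.hops_to_requester)
```

- returning `None` corresponds to the case where no single model training entity can support the computing power requirement, in which case decomposed training may be considered instead.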
- the model training management entity sends first training task configuration information to model training entity #2;
- model training entity #2 receives the first training task configuration information from the model training management entity.
- the first training task configuration information instructs model training entity #2 to assist model training entity #1 in performing a target model training task in at least one model training task.
- the first training task configuration information indicates an identifier of the target model training task.
- the model training entity #2 can obtain the model to be trained and the training data based on the identifier of the target model training task.
- the first training task configuration information further indicates target training data of the target model training task.
- the model training management entity may also indicate the information about the target training data to model training entity #2.
- the first training task configuration information further indicates an identifier of the model training entity #1.
- based on this identifier, model training entity #2 can determine that it is assisting model training entity #1 and perform collaborative training accordingly.
- the model training management entity sends training task information to the model training entity #1;
- model training entity #1 receives training task information from the model training management entity.
- the training task information can also be called training task configuration notification information, which indicates that model training entity #2 assists in performing the target model training task. That is, the training task configuration notification information notifies model training entity #1 that model training entity #2 assists in performing the target model training task.
- the training task configuration notification information may indicate the identifier of model training entity #2. If the target model training task is determined by the model training management entity, the training task configuration notification information may also indicate the identifier of the target model training task.
- the training task information can also be called second training task configuration information, which instructs model training entity #2 to use training data of target length to assist in executing the target model training task.
- the second training task configuration information may indicate the identifier of model training entity #2 and the target length of the training data used by model training entity #2 for collaborative training. If the target model training task is determined by the model training management entity, the second training task configuration information may also indicate the identifier of the target model training task.
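- the two kinds of information described above can be pictured as simple records. The class and field names below are hypothetical and only mirror the items listed in the text; they do not represent a defined message format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FirstTrainingTaskConfig:
    """Sent by the model training management entity to model training entity #2."""
    target_task_id: str                          # identifier of the target task
    target_training_data: Optional[str] = None   # e.g. an address of the training data
    requester_entity_id: Optional[str] = None    # identifier of model training entity #1

@dataclass
class TrainingTaskInfo:
    """Sent by the model training management entity to model training entity #1."""
    assistant_entity_id: str                # identifier of model training entity #2
    target_task_id: Optional[str] = None    # set when the management entity chose the task
    target_length: Optional[int] = None     # set for decomposed collaborative training

cfg = FirstTrainingTaskConfig("task-2", requester_entity_id="mte-1")
info = TrainingTaskInfo("mte-2", target_length=1000)
```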
- model training entity #2 obtains training data information and assists in executing the target model training task.
- the training data information indicates the training data used by model training entity #2 to assist in performing the target model training task.
- the model training entity #2 obtains training data information from the model training management entity, for example, the first training task configuration information also indicates target training data of the target model training task.
- model training entity #2 may obtain training data information from model training entity #1.
- model training entity #1 may obtain the identifier of model training entity #2 based on the received training task information, and model training entity #1 actively sends training data information to model training entity #2.
- the training data information also indicates the identifier of the target model training task and the data address of the training data.
- the data address of the training data may be the address of the device storing the training data, and then the model training entity #2 may download the training data from the device according to the data address of the training data.
- the training data may be stored in the model training entity #1, or in the device deployed by the model training entity #1.
- the data address of the training data may be the collection address of the training data, and then the model training entity #2 may collect data according to the data address, and the collected data may be used for training the target model training task.
- the training data can be stored in the database.
- the training data information may indicate the identifier or address of the DCCF or ADRF
- the model training entity #2 may send information for requesting training data (e.g., Ndccf_DataManagement_Subscribe or Ndccf_DataManagement_Fetch or Nadrf_DataManagement_RetrievalRequest or Nadrf_DataManagement_RetrievalSubscribe) to the DCCF or ADRF, and the DCCF or ADRF sends the training data to the model training entity #2.
- the training data information also indicates a target length
- the model training entity #2 can then obtain training data of the target length for training.
- Model training entity #2 can use training data to train the target model training task. This application does not specifically limit the method of training the target model training task.
- model training entity #2 can use batch gradient descent, mini-batch gradient descent, or stochastic gradient descent.
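- purely as an illustration of the optimization methods named above, the following sketch fits a one-parameter linear model by mini-batch gradient descent on mean squared error. The model, data, learning rate, and batch size are assumptions for the example; setting `batch_size` to the data length gives batch gradient descent, and a batch size of 1 gives a per-sample update (a stochastic variant would additionally sample or shuffle the data).

```python
def minibatch_gd(xs, ys, batch_size=2, lr=0.01, epochs=200):
    """Fit y = w * x by mini-batch gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradient of the batch mean of (w*x - y)^2 with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
            w -= lr * grad
    return w

w = minibatch_gd([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```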
- the model training entity #2 may send assistance training feedback information to the model training management entity;
- model training management entity receives the assisted training feedback information from model training entity #2.
- model training entity #2 may send assistance training feedback information to the model training management entity.
- the assistance training feedback information may indicate at least one of the following: the accuracy achieved by model training entity #2 in performing the target model training task, the time spent by model training entity #2 in performing the target model training task, the execution progress of model training entity #2 in performing the target model training task, and the amount of resources occupied by model training entity #2 in performing the target model training task.
- model training entity #2 may periodically send assistance training feedback information to the model training management entity.
- the period may be pre-set or configured by the model training management entity, and this application does not specifically limit this.
- model training entity #2 sends network status change information to the model training management entity, and the network status change information indicates that the network status of model training entity #2 has changed such that it can no longer assist in the execution of the target model training task.
- the network status change information indicates that model training entity #2 has entered the energy-saving state, that is, model training entity #2 will be shut down to save energy.
- the network status change information indicates that the resources used for model training by model training entity #2 are occupied, that is, the computing power resources available to model training entity #2 for assisting in the execution of the target model training task are insufficient.
- the model training management entity can then, based on the network status change information, change the model training entity that assists model training entity #1 in executing the target model training task.
- the model training management entity can select other model training entities to assist in training. In order to facilitate understanding of the embodiments of the present application, this is described in more detail below in conjunction with Figure 9.
- the network status change information can be the assisted training feedback information sent on a periodic basis after the network status changes.
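- the replacement of the assisting entity after a network status change can be sketched as follows. The assignment table, the candidate ordering, and the function name are hypothetical; how the candidates are ranked (for example by idle computing power) is left open here.

```python
def reassign_assistant(assignments, task_id, failed_entity, candidates):
    """Replace the assisting entity for task_id after failed_entity reports that
    it can no longer assist. Returns the new assistant, or None if none is left."""
    if assignments.get(task_id) != failed_entity:
        return assignments.get(task_id)  # the report does not concern this task
    for entity_id in candidates:
        if entity_id != failed_entity:
            assignments[task_id] = entity_id
            return entity_id
    assignments.pop(task_id, None)  # no replacement: the task is no longer assisted
    return None

assignments = {"task-2": "mte-2"}
new_assistant = reassign_assistant(assignments, "task-2", "mte-2", ["mte-2", "mte-3"])
```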
- model training entity #2 sends model training report information
- after model training entity #2 completes the training of the target model training task, a trained model is generated, and the model training report information can include the trained model.
- the training report information also includes performance information of the model after training, such as the accuracy of training, the time taken for training, etc.
- if model training entity #2 performs the target model training task alone, model training entity #2 can send model training report information to model reasoning entity #1, where model reasoning entity #1 can be the model reasoning entity that requested the target model training task; for example, model training entity #1 and model reasoning entity #1 are entities deployed on the same network device, or model training entity #1 and model reasoning entity #1 are deployed on different NWDAFs respectively.
- if model training entity #2 performs a target model training subtask decomposed from the target model training task, model training entity #2 can send model training report information to model training entity #1, and model training entity #1 can then perform model aggregation according to the trained model in the model training report information.
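- this application does not specify how the model aggregation is performed. As one assumed possibility, the sketch below averages the parameter vectors reported by the assisting entities, optionally weighted (for example by the target length of training data each entity used). The function name and the flat-list parameter representation are hypothetical.

```python
def aggregate_models(models, weights=None):
    """Weighted average of parameter vectors that all have the same length."""
    if not models:
        raise ValueError("at least one trained model is required")
    if weights is None:
        weights = [1.0] * len(models)
    total = sum(weights)
    size = len(models[0])
    return [
        sum(w * m[i] for w, m in zip(weights, models)) / total
        for i in range(size)
    ]

merged = aggregate_models([[1.0, 2.0], [3.0, 4.0]])
```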
- model training entity #2 can send model training report information to model training entity #1, or send model training report information to model reasoning entity #1.
- the model training report information can be Nnwdaf_MLModelProvision_Notify or Nnwdaf_MLModelInfo_Request response.
- the model training management entity can manage and orchestrate the training tasks and computing resources of multiple model training entities, and assign the training tasks of model training entity #1 to other model training entities with sufficient resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of model training entity #1 and improving the efficiency of model training.
- Figure 6 is a schematic flowchart of a method for obtaining training status information and computing resource information provided in an embodiment of the present application.
- the model training management entity sends computing resource subscription information #1 to the model training entity #1;
- model training entity #1 receives computing power resource subscription information #1 from the model training management entity.
- the computing power resource subscription information #1 is used to subscribe to the computing power resource information #1 of the model training entity #1.
- For the content of the computing power resource information please refer to the introduction of step S502 in Figure 5, which will not be repeated here.
- the computing power resource subscription information #1 also indicates the period for the model training entity #1 to report the computing power resource information #1.
- the model training management entity may determine the period according to the average time taken by the model training entity to perform the training task.
- the model training entity #1 sends computing resource information #1 to the model training management entity;
- the model training management entity receives the computing power resource information #1 from the model training entity #1.
- the model training entity #1 may send computing power resource information #1 to the model training management entity in response to the computing power resource subscription information #1.
- the computing power resource information #1 may include the computing power resource information subscribed by the computing power resource subscription information #1.
- model training entity #1 periodically sends computing resource information #1 to the model training management entity.
- model training management entity sends computing resource subscription information #2 to model training entity #2;
- model training entity #2 receives computing power resource subscription information #2 from the model training management entity.
- the computing power resource subscription information #2 is used to subscribe to the computing power resource information #2 of the model training entity #2. This step is similar to the description in step S601 above and will not be repeated here.
- the computing power resource subscription information #2 also indicates the period for model training entity #2 to report computing power resource information #2, which can be the same as or different from the period that computing power resource subscription information #1 indicates for model training entity #1 to report computing power resource information #1.
- the period for each model training entity to report computing power resource information can be determined based on the average time taken to execute each training task, and this application does not make any special limitations on this.
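- one simple way to derive a reporting period from the average task execution time is sketched below. Using the plain arithmetic mean and falling back to a default when no history exists are assumptions; the function name and default value are hypothetical.

```python
def reporting_period(task_durations_s, default_period_s=60.0):
    """Return the mean task duration in seconds as the reporting period,
    or a default period when no task history is available."""
    if not task_durations_s:
        return default_period_s
    return sum(task_durations_s) / len(task_durations_s)

period = reporting_period([10.0, 20.0, 30.0])
```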
- the model training entity #2 sends computing resource information #2 to the model training management entity;
- the model training management entity receives the computing power resource information #2 from the model training entity #2.
- the model training entity #2 may send computing power resource information #2 to the model training management entity in response to the computing power resource subscription information #2.
- the computing power resource information #2 may include the computing power resource information subscribed to by the computing power resource subscription information #2.
- model training entity #2 periodically sends computing resource information #2 to the model training management entity.
- the model training management entity sends training status subscription information #1 to the model training entity #1;
- model training entity #1 receives training status subscription information #1 from the model training management entity.
- the training status subscription information #1 is used to subscribe to the training status information #1 of the model training entity #1.
- For the content of the training status information, please refer to the introduction of step S501 in Figure 5, which will not be described in detail here.
- training status subscription information #1 indicates that model training entity #1 periodically sends training status information #1.
- the training status subscription information #1 indicates that the model training entity #1 sends the training status information #1 based on a trigger event
- the trigger event may include at least one of the following: the model training entity #1 adds a new training task, and the model training entity #1 completes a training task.
- the model training entity #1 sends training status information #1 to the model training management entity;
- model training management entity receives training status information #1 from model training entity #1.
- the model training entity #1 may send training status information #1 to the model training management entity in response to the training status subscription information #1.
- the training status information #1 may include the training status information subscribed to by the training status subscription information #1.
- model training entity #1 periodically sends training status information #1 to the model training management entity.
- model training entity #1 sends training status information #1 to the model training management entity based on a triggering event.
- model training management entity sends training status subscription information #2 to model training entity #2;
- model training entity #2 receives training status subscription information #2 from the model training management entity.
- Training status subscription information #2 is used to subscribe to training status information #2 of model training entity #2. This step is similar to step S605 above and will not be repeated here.
- the training status subscription information #2 also indicates the period for model training entity #2 to report training status information #2, which can be the same as or different from the period that training status subscription information #1 indicates for model training entity #1 to report training status information #1.
- the period for each model training entity to report training status information can be determined based on the average time taken to execute each training task, and this application does not make any special limitations on this.
- the model training entity #2 sends training status information #2 to the model training management entity;
- model training management entity receives training status information #2 from model training entity #2.
- Model training entity #2 may send training status information #2 to the model training management entity in response to training status subscription information #2.
- model training entity #2 periodically sends training status information #2 to the model training management entity.
- model training entity #2 sends training status information #2 to the model training management entity based on the triggering event.
- Figure 6 uses two model training entities only as an example.
- the model training management entity can also manage three or more model training entities.
- the model training management entity can send computing resource subscription information and training status subscription information to each model training entity. This application does not specifically limit this.
- each model training entity can promptly report its own computing resource information and training status information, so that the model training management entity can perform timely orchestration and management, thereby improving the efficiency of model training.
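- the periodic and event-triggered reporting behaviours described for Figure 6 can be sketched with a small helper that decides when a report is due. The class name, the event names, and the use of caller-supplied timestamps instead of real timers are assumptions made to keep the example self-contained.

```python
class StatusReporter:
    """Decides when a model training entity should report, given a subscription
    that specifies a period, a set of trigger events, or both."""

    def __init__(self, period_s=None, trigger_events=()):
        self.period_s = period_s
        self.trigger_events = set(trigger_events)
        self.last_report_s = None

    def should_report(self, now_s, event=None):
        # Event-triggered reporting, e.g. a task was added or completed.
        if event is not None and event in self.trigger_events:
            return True
        # Periodic reporting.
        if self.period_s is None:
            return False
        return self.last_report_s is None or now_s - self.last_report_s >= self.period_s

    def mark_reported(self, now_s):
        self.last_report_s = now_s

reporter = StatusReporter(period_s=10.0, trigger_events={"task_added", "task_completed"})
```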
- Figure 7 is a schematic flowchart of a method for overall training of a target model training task provided in an embodiment of the present application.
- the steps shown in Figure 7 can be performed after the steps shown in Figure 6, that is, the model training entity can send training status information to the model training management entity periodically or based on a trigger event, and the model training entity can also send computing power resource information to the model training management entity periodically.
- model reasoning entity #1 sends a model training request message to model training entity #1.
- the model training request message is used to request the model training entity #1 to train the inference model.
- the model training request message includes at least one of the following information: inference model, identification information of the inference model, model performance requirement information, and expected training duration information.
- the inference model is the inference model to be trained, and the inference model and the identification information of the inference model can be used to mark the inference model.
- the inference model can refer to the model file of the inference model, and the model file can include an identifier or inference type for identifying the inference model.
- the model performance requirement information can indicate the requirements for model training, for example, that the accuracy of the inference model after training is expected to be greater than or equal to a certain threshold.
- the expected training duration information can indicate the training duration that the expected model training takes.
- the expected training duration information can include a first duration, indicating that the model training entity completes the model training within the first duration from the receipt of the model training request message.
- model reasoning entity #1 and model training entity #1 can be two modules deployed in one device, such as the model training network element 121 and the model reasoning network element 130 shown in FIG. 1. In that case, model reasoning entity #1 can by default send the model training request message to the model training entity #1 in the same device, and the model training request message of step S701 is transmitted through the internal interface, thereby reducing the external interface overhead.
- model training entity #1 determines the training task.
- Model training entity #1 can determine the training task based on the model training request message.
- model training entity #1 can generate or index the training task of the inference model according to the inference model, such as model training entity #1 can generate the training task according to the information in the model training request message, or can set the parameters of the training task according to the model training request message, such as weight, etc., which is not particularly limited in this application.
- model training entity #1 may determine whether to request collaborative training based on the training task. For example, model training entity #1 determines whether to request collaborative training according to whether its idle computing resources can meet the needs of the training task, or according to the priority of the model training task. For the content of this part, please refer to the description of step S501 in Figure 5, which will not be repeated here.
- the model training entity #1 determines the target model training task for which the collaborative training is requested. For the contents of this part, reference may be made to the description of step S501 in FIG. 5 , which will not be described in detail here.
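- The decision in this step can be sketched as follows. This is a minimal illustration, assuming that a task's demand and the entity's idle computing resources are each expressed as a single number and that a larger priority value means a higher priority; the names `TrainingTask`, `needs_collaboration`, `idle_compute`, and `priority_threshold` are hypothetical and are not defined by this application. Combining the two checks with a logical OR is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingTask:
    task_id: str
    priority: int            # larger value = higher priority (assumption)
    required_compute: float  # abstract compute units (assumption)


def needs_collaboration(task: TrainingTask, idle_compute: float,
                        priority_threshold: Optional[int] = None) -> bool:
    """Return True if the entity should request collaborative training."""
    # Resource check: the idle resources cannot meet the needs of the task.
    if task.required_compute > idle_compute:
        return True
    # Optional priority check: a task at or above the threshold requests help.
    if priority_threshold is not None and task.priority >= priority_threshold:
        return True
    return False


heavy_task = TrainingTask("task-1", priority=1, required_compute=10.0)
light_task = TrainingTask("task-2", priority=1, required_compute=2.0)
urgent_task = TrainingTask("task-3", priority=5, required_compute=2.0)
```

- With 4.0 idle units, `heavy_task` would request collaboration and `light_task` would not; `urgent_task` would request it only when a priority threshold at or below 5 is configured.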
- model training entity #1 sends a training task update notification message to the model training management entity.
- the training task update notification message indicates that a new model training task has been added to model training entity #1.
- the model training entity #1 sends training status information based on a triggering event.
- the triggering event is a new model training task being added to the model training entity #1, such as the model training entity #1 receiving the model training request message from the model reasoning entity in step S701.
- the training task update notification information can be used to notify the model training management entity of the new model training task added to the model training entity #1.
- the training task update notification message can be the training status information sent based on the triggering event.
- the training task update notification message may indicate information about all training tasks on model training entity #1, or the training task update notification message may indicate information only about changed training tasks, where the changed training tasks may include training tasks whose task progress has changed, newly added training tasks, etc.
- the training task update notification message indicates at least one of the following: at least one training task on the model training entity #1, the task progress of each training task in at least one training task, and the priority of each training task in at least one training task.
- For example, the training task update notification message includes three training tasks 1 to 3: training task 1 is running, training tasks 2 and 3 are waiting to run, and the priority of training task 2 is higher than that of training task 3. The model training management entity can then determine, in combination with the computing power resource information #1 reported by model training entity #1, whether model training entity #1 can independently complete the three training tasks, that is, whether collaborative training is needed.
- the training task update notification message includes at least one of the following: the identifier of the training task whose task progress has been changed and the type of task progress change, the identifier of the newly added training task, and the priority of the newly added training task.
- transmission resources can be saved.
- For example, the training task update notification message notifies the model training management entity that, on model training entity #1, training task #1 has completed training and training task #3 has been added. It should be noted that the training task update notification message does not include information about training task #2, which may indicate that the task progress of training task #2 has not changed; for example, training task #2 may still be in training, or still waiting to be run.
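- The variant of the notification that reports only changed tasks can be sketched as a difference between two task snapshots. This is a minimal illustration under stated assumptions: the progress labels, the field names, and the helper `build_delta_notification` are hypothetical, and the application does not define a message encoding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaskState:
    progress: str  # e.g. "waiting", "running", "completed" (assumed labels)
    priority: int


def build_delta_notification(previous: Dict[str, TaskState],
                             current: Dict[str, TaskState]) -> Dict[str, List[dict]]:
    """List only tasks whose progress changed or that were newly added."""
    changed = [
        {"task_id": tid, "progress": state.progress}
        for tid, state in current.items()
        if tid in previous and previous[tid].progress != state.progress
    ]
    added = [
        {"task_id": tid, "priority": state.priority}
        for tid, state in current.items()
        if tid not in previous
    ]
    return {"changed": changed, "added": added}


previous = {"task1": TaskState("running", 2), "task2": TaskState("waiting", 1)}
current = {
    "task1": TaskState("completed", 2),
    "task2": TaskState("waiting", 1),
    "task3": TaskState("waiting", 3),
}
notification = build_delta_notification(previous, current)
```

- Here `task2` is absent from the notification because its progress did not change, which mirrors the example in the text and is how the delta form saves transmission resources.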
- the model training management entity determines model training entity #2 among multiple model training entities based on multiple computing resource information.
- the at least one model training entity #2 is used to collaboratively perform the target model training task.
- the model training management entity may obtain the target model training task from the training task update notification message, or the model training management entity may determine the target model training task based on the training task update notification message.
- For the content of the model training management entity determining model training entity #2, please refer to the description of step S503 in FIG. 5, which will not be described in detail here.
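- One possible selection rule for this step is sketched below. The actual rule is the one described for step S503; the rule here (keep entities whose idle computing power covers the task, prefer the most idle) and the names `select_assisting_entities`, `idle_compute`, and `required_compute` are assumptions made only for illustration.

```python
from typing import Dict, List


def select_assisting_entities(idle_compute: Dict[str, float],
                              requesting_entity: str,
                              required_compute: float,
                              max_entities: int = 1) -> List[str]:
    """Pick assisting entities from the reported computing resource information."""
    candidates = [
        (idle, entity_id)
        for entity_id, idle in idle_compute.items()
        if entity_id != requesting_entity and idle >= required_compute
    ]
    # Most idle first; ties are broken by identifier for a stable result.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [entity_id for _, entity_id in candidates[:max_entities]]


reported = {"entity1": 1.0, "entity2": 8.0, "entity3": 5.0}
selected = select_assisting_entities(reported, "entity1", required_compute=4.0)
```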
- the model training management entity sends first training task configuration information to model training entity #2.
- the first training task configuration information indicates that the model training entity #2 assists the model training entity #1 in performing the target model training task in at least one model training task.
- For the content of the first training task configuration information, please refer to the description of step S504 in Figure 5, which will not be repeated here.
- the model training management entity sends training task configuration notification information to model training entity #1.
- the training task configuration notification information indicates that model training entity #2 assists in executing the target model training task.
- For the content of the training task configuration notification information, please refer to the description of step S505 in FIG. 5, which will not be described in detail here.
- model training entity #1 sends target training data indication information to model training entity #2.
- the target training data indication information indicates the target training data used to train the target model training task.
- the model training management entity may not have the information of the target training data, that is, the information of the target training data is not carried in the first training task configuration information in step S705.
- in this case, model training entity #1 can learn from the training task configuration notification information that model training entity #2 assists in executing the target model training task, and then send the target training data indication information to model training entity #2.
- model training entity #2 obtains target training data.
- model training entity #2 downloads the target training data based on the target training data indication information. For example, if the target training data is stored in model training entity #1, model training entity #2 sends a download request to model training entity #1, and model training entity #1 sends the target training data to model training entity #2.
- For the content of obtaining the target training data, please refer to the description of step S506 in FIG. 5, which will not be described in detail here.
- model training entity #2 assists in executing the target model training task.
- Model training entity #2 can use the target training data to train the target model training task.
- model training entity #2 sends training report information to model reasoning entity #1.
- After model training entity #2 completes the training of the target model training task, a trained model is generated, and the training report information can include the trained model.
- the training report information also includes performance information of the trained model, such as training accuracy, training time, etc.
- the model reasoning entity can directly obtain the trained and usable model from the model training entity #2.
- the model training management entity can manage and orchestrate the training tasks and computing resources of multiple model training entities, assign the training tasks of model training entity #1 to other model training entities with sufficient resources to assist in completing the training, reduce the training waiting time of the training tasks of model training entity #1, improve the utilization of system resources, and improve the efficiency of model training.
- FIG. 8 is a schematic flowchart of a method for decomposing a target model training task for training, provided in an embodiment of the present application.
- the steps shown in Figure 8 can be performed after the steps shown in Figure 6, that is, the model training entity can send training status information to the model training management entity periodically or based on a trigger event, and the model training entity can also send computing power resource information to the model training management entity periodically.
- the model reasoning entity sends a model training request message to the model training entity #1.
- the model training request message is used to request the model training entity #1 to train the inference model.
- For the content of this part, please refer to the description of step S701 in Figure 7, which will not be repeated here.
- model training entity #1 determines the training task.
- Model training entity #1 can determine the training task based on the model training request message. For the content of this part, please refer to the description of step S702 in Figure 7, which will not be repeated here.
- model training entity #1 sends a training task update notification message to the model training management entity.
- the training task update notification message indicates that a new model training task is added to the model training entity #1.
- For the content of the training task update notification message please refer to the description of step S703 in Figure 7, which will not be repeated here.
- the model training request message may further include training data length information.
- the training data length information indicates the length of the training data, and the training data is used to perform the training task.
- the training data length information can indicate the length of the training data of each training task on the model training entity #1, and then the model training management entity can determine the target model training task for collaborative training based on the length of the training data corresponding to each training task, thereby improving the reliability of determining the target model training task.
- model training entity #1 determines that a target model training task requires collaborative training, that is, the training task update notification message includes information about the target model training task
- the training data length information may only indicate the length of the target training data corresponding to the target model training task, thereby saving transmission resources.
- the model training management entity decomposes the target model training task and determines the model training entity #2.
- the model training management entity can obtain the target model training task from the training task update notification message, or the model training management entity can determine the target model training task based on the training task update notification message.
- model training management entity may decompose the target model training task by splitting the target model, or the model training management entity may also decompose the target model training task by splitting the training data.
- the model training management entity can determine, based on the computing power resource information of multiple model training entities, the M model training entities that participate in the training of the target model training task, and split the target model into M target sub-models. Each target sub-model can be trained separately by one of the M model training entities, and the training of one target sub-model is one model training sub-task, where M is a positive integer.
- For example, if the target model is a 4-layer neural network model, the model training management entity can decompose the neural network model into two target sub-models: the first two layers of the neural network model are used as one target sub-model, and the last two layers are used as another target sub-model.
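- The 4-layer example can be sketched as splitting an ordered list of layers into M contiguous groups. This is a minimal illustration; the near-even split and the helper name `split_layers` are assumptions, since the application only gives the two-plus-two example and does not fix a splitting rule.

```python
from typing import List, Sequence


def split_layers(layers: Sequence[str], m: int) -> List[List[str]]:
    """Split an ordered list of layers into m contiguous target sub-models."""
    if not 1 <= m <= len(layers):
        raise ValueError("m must be between 1 and the number of layers")
    base, extra = divmod(len(layers), m)
    sub_models, start = [], 0
    for i in range(m):
        # The first `extra` sub-models take one additional layer.
        size = base + (1 if i < extra else 0)
        sub_models.append(list(layers[start:start + size]))
        start += size
    return sub_models


sub_models = split_layers(["layer1", "layer2", "layer3", "layer4"], m=2)
```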
- the model training management entity can decompose the target model training task into multiple target model training subtasks based on the computing power resource information of multiple model training entities, and determine the training data sub-length corresponding to each target model training subtask.
- the model training management entity may determine N model training entities participating in training the target model training task based on the computing power resource information of multiple model training entities, divide the target model training task into N target model training subtasks, and determine the training data sub-length of each target model training subtask in the N target model training subtasks.
- the N model training entities may respectively perform the training of the N target model training subtasks.
- N is a positive integer greater than or equal to 1.
- the N model training entities may include model training entity #1 or may not include model training entity #1.
- model training entity #1 can train one of the target model training subtasks. If model training entity #1 does not have spare computing power, then model training entity #1 may not participate in the training of the target model training task, and the other N model training entities #2 may perform the training separately.
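- Determining the training data sub-length of each of the N subtasks can be sketched as a proportional split. This is a minimal illustration, assuming each entity's spare computing power is summarized as a non-negative integer weight; the helper name `split_data_length` and the proportional rule are hypothetical, and the application does not prescribe how the sub-lengths are computed.

```python
from typing import Dict


def split_data_length(total_length: int, weights: Dict[str, int]) -> Dict[str, int]:
    """Split total_length across entities in proportion to integer weights."""
    total_weight = sum(weights.values())
    if total_length < 0 or total_weight <= 0 or min(weights.values()) < 0:
        raise ValueError("need a non-negative length and non-negative weights with a positive sum")
    # Integer division keeps the arithmetic exact.
    sub_lengths = {e: total_length * w // total_weight for e, w in weights.items()}
    # Hand the few samples lost to rounding to the entities with the most capacity.
    remainder = total_length - sum(sub_lengths.values())
    for entity in sorted(weights, key=lambda e: -weights[e])[:remainder]:
        sub_lengths[entity] += 1
    return sub_lengths


sub_lengths = split_data_length(1000, {"entity1": 1, "entity2": 3})
```

- An entity with weight 0, such as a model training entity #1 that has no spare computing power, receives a sub-length of 0 and so does not participate in the training.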
- model training management entity can set model training entity #1 as the main model training entity and other model training entities as collaborative model training entities.
- the main model training entity can receive the sub-model obtained by the collaborative model training entity through training the target model training sub-task from the collaborative model training entity.
- Model training entity #1 as the main model training entity can improve the reliability of training task allocation.
- model training management entity determines, based on computing resource information, that two model training entities, model training entity #1 and model training entity #2, perform collaborative training on the target model training task, wherein model training entity #1 can serve as the main model training entity and model training entity #2 as the collaborative model training entity.
- the model training management entity determines that model training entity #2 assists in performing the target model training task, and determines that the training data used to assist in performing the target model training task is of a target length.
- model training management entity may first determine the model training entity #2 and then decompose the target model training task, or the model training management entity may first decompose the target model training task and then determine the model training entity #2, or the model training management entity may also simultaneously decompose the target model training task and determine the model training entity #2, and this application does not specifically limit this.
- the content related to the model training management entity determining the model training entity #2 can refer to the description of step S503 in Figure 5, which will not be repeated here.
- the model training management entity sends first training task configuration information to model training entity #2.
- the first training task configuration information indicates that the model training entity #2 assists the model training entity #1 in performing the target model training task in at least one model training task.
- the content of the first training task configuration information please refer to the description of step S504 in Figure 5, which will not be repeated here.
- the model training management entity sends second training task configuration information to model training entity #1.
- the second training task configuration information indicates that model training entity #2 assists model training entity #1 in performing the target model training task.
- the second training task configuration information indicates the identifier of the target model training task and the identifier of the model training entity #2.
- the model training entity #1 can know that the target model training task is performed by the model training entity #2.
- the second training task configuration information also indicates splitting method information, such as a splitting node identifier. The splitting method information indicates how the target model is split, so that model training entity #1 can obtain, according to the splitting method information, the target sub-models into which the target model is split.
- the second training task configuration information also indicates the target length of the training data used by model training entity #2 to assist in executing the target model training task. Model training entity #1 can then learn the target length of the training data to be used by model training entity #2.
- model training entity #1 preliminarily executes the target model training task.
- Model training entity #1 determines the model structure, training algorithm, training hyperparameters, etc., and uses the training data of the target length at model training entity #1 to perform several rounds of initial model training to obtain an initial model.
- model training entity #1 sends target training data indication information to model training entity #2.
- the target training data indication information indicates the target training data used to train the target model training task, and is used to instruct model training entity #2 to use training data of the target length to assist in performing the target model training task.
- the target training data indication information indicates the identifier of the target collaborative training task, the address of the training data, and the address of the target sub-model corresponding to the training sub-task executed by the model training entity #2. Then, the model training entity #2 can obtain the target sub-model based on the address of the target sub-model, and obtain the training data of the target length based on the address of the training data.
- the target training data indication information indicates the identifier of the target collaborative training task, the target length, the address of the training data, and the address of the initial model. Then, the model training entity #2 can obtain the initial model based on the address of the initial model, and obtain the training data of the target length based on the address of the training data.
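- The two variants of the target training data indication information described above can be sketched as one record with optional fields. This is a minimal illustration; the class name, the field names, and the placeholder address strings are hypothetical, and the application does not define a concrete message format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetTrainingDataIndication:
    task_id: str                                    # identifier of the target collaborative training task
    training_data_address: str
    target_sub_model_address: Optional[str] = None  # model-splitting variant
    target_length: Optional[int] = None             # data-splitting variant
    initial_model_address: Optional[str] = None     # data-splitting variant

    def decomposition_mode(self) -> str:
        """Tell which of the two described variants this indication carries."""
        if self.target_sub_model_address is not None:
            return "model_split"
        if self.target_length is not None and self.initial_model_address is not None:
            return "data_split"
        raise ValueError("indication matches neither described variant")


model_split = TargetTrainingDataIndication(
    task_id="task-1",
    training_data_address="placeholder-data-address",
    target_sub_model_address="placeholder-sub-model-address",
)
data_split = TargetTrainingDataIndication(
    task_id="task-1",
    training_data_address="placeholder-data-address",
    target_length=500,
    initial_model_address="placeholder-initial-model-address",
)
```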
- model training entity #2 obtains target training data.
- Model training entity #2 can obtain training data of target length based on target training data indication information. For details on obtaining target training data, please refer to the description of step S708 in FIG. 7 , which will not be described in detail here.
- model training entity #2 assists in executing the target model training task.
- Model training entity #2 uses the training data of the target length to perform several rounds of training on the initial model to generate a sub-model.
- model training entity #2 uses the training data to perform several rounds of training on the target sub-model to generate a trained target sub-model.
- model training entity #2 sends model transfer information to model training entity #1.
- the model transfer information indicates a trained sub-model, that is, the initial model after several rounds of training using training data of the target length, or the target sub-model after several rounds of training.
- the model transfer information indicates a model gradient of the sub-model or a model address of the sub-model.
- the model transfer information also indicates the identifier of the target sub-model.
- the identifier of the target sub-model can be determined based on the splitting method information when the model training management entity splits the target model, or it can be agreed upon by model training entity #1 and model training entity #2. This application does not specifically limit this.
- model training entity #1 performs model aggregation.
- model training entity #1 can perform model aggregation in different ways, depending on how the target model training task was decomposed. For example, when the model training management entity decomposes the target model training task by splitting the target model, model training entity #1 obtains the trained target sub-model from model training entity #2 and aggregates the target sub-models. Alternatively, when the model training management entity decomposes the target model training task by splitting the training data, model training entity #1 can obtain, based on the model transfer information, the sub-model that model training entity #2 trained from the initial model, and aggregate it with the initial model obtained from its own preliminary training to obtain a trained inference model. For example, model training entity #1 can average the corresponding gradients/weights of the multiple sub-models and the initial model to obtain the final aggregated model gradients/weights.
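- The averaging mentioned for the data-splitting case can be sketched as an element-wise mean over models that share the same structure. This is a minimal illustration, assuming each model is represented as a mapping from a layer name to a flat list of weights; the helper name `average_weights` and this representation are hypothetical, and the unweighted mean is only one possible aggregation.

```python
from typing import Dict, List


def average_weights(models: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise average of the weights of models with identical structure."""
    if not models:
        raise ValueError("at least one model is required")
    aggregated = {}
    for name in models[0]:
        # Group the i-th weight of this layer from every model, then average.
        columns = zip(*(model[name] for model in models))
        aggregated[name] = [sum(values) / len(models) for values in columns]
    return aggregated


initial_model = {"layer1": [0.5, 1.0], "layer2": [2.0]}  # from model training entity #1
sub_model = {"layer1": [1.5, 0.0], "layer2": [4.0]}      # from model training entity #2
aggregated = average_weights([initial_model, sub_model])
```

- The same function applies to gradients if the lists hold gradient values instead of weights.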
- model training entity #1 sends training report information to model reasoning entity #1.
- the training report information may include the trained inference model.
- For the content of the training report information, please refer to the description of step S710 in FIG. 7, which will not be described in detail here.
- the model training management entity can decompose the target model training task and have multiple model training entities collaborate to complete the training task, which can reduce the training burden on model training entity #1, the original training entity of the target model training task, make full use of the resources of multiple model training entities, and reduce the training waiting time of model training entity #1.
- model training entity #2 can send the assistance training feedback information and/or network status change information described in S507 in Figure 5 to the model training management entity during the execution of the target model training task, so that the model training management entity can be aware of the execution status of the target model training task and can make timely adjustments to improve the reliability of model training. This is explained below in conjunction with Figure 9.
- Figure 9 is a schematic flowchart of a method for providing feedback during the execution of a target model training task provided in an embodiment of the present application.
- model training management entity is used to manage three model training entities.
- the model training management entity can be any model training management entity described in FIG. 1 to FIG. 3, and model training entity #1, model training entity #2, and model training entity #3 can be any model training entity described in FIG. 1 to FIG. 3, and the present application does not specifically limit this.
- the method described in FIG. 9 can be combined with any method described in FIG. 5 to FIG. 8.
- model training entity #2 sends assistance training feedback information and/or network status change information to the model training management entity.
- the assistance training feedback information may indicate at least one of the following: the accuracy achieved by model training entity #2 in executing the target model training task, the time taken by model training entity #2 to execute the target model training task, the execution progress of the target model training task, and the amount of resources occupied by model training entity #2 to execute the target model training task.
- the network status change information indicates that the network status of model training entity #2 has changed so that it can no longer assist in executing the target model training task, or in other words, it can no longer complete the target model training task.
- the network status change information indicates that model training entity #2 has switched to an energy-saving state, that is, model training entity #2 will be shut down to save energy.
- the network status change information indicates that the resources used for model training by model training entity #2 are occupied, that is, the computing resources of model training entity #2 for assisting in executing the target model training task are insufficient.
- model training entity #2 can send only assistance training feedback information to the model training management entity, and the model training management entity can determine, based on the content of the assistance training feedback information, whether another model training entity needs to take over assisting in executing the target model training task.
- Model training entity #2 can also send only network status change information to the model training management entity, and the model training management entity can then directly determine, based on the network status change information, that another model training entity is to take over assisting in executing the target model training task.
- Model training entity #2 can also send training assistance feedback information and network status change information to the model training management entity, and this application does not specifically limit this.
- the model training management entity adjusts the allocation method.
- Adjusting the allocation method may include: adjusting the manner in which the model training entity assists in executing the target model training task (such as changing the target length), and replacing the model training entity (such as replacing model training entity #2 with model training entity #3, so that model training entity #3 assists in executing the target model training task). In that case, model training entity #3 may be an example of a second target model training entity determined by the model training management entity among the multiple model training entities.
- the model training management entity can adjust the allocation method.
- the model training management entity can adjust the allocation method of the target collaborative training task. If the network status change information indicates that the resources used for model training by model training entity #2 are occupied, or that model training entity #2 is to be shut down, the model training management entity can also adjust the allocation method.
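- How the model training management entity might choose between the two kinds of adjustment can be sketched as follows. This is a minimal illustration under stated assumptions: the rule that projects the total training time from the reported progress, the returned labels, and the names `AssistFeedback` and `decide_adjustment` are hypothetical and are not specified by this application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssistFeedback:
    progress: float         # fraction of the task completed, 0.0 to 1.0 (assumption)
    elapsed_seconds: float  # time spent on the task so far


def decide_adjustment(feedback: Optional[AssistFeedback],
                      network_status_changed: bool,
                      expected_duration_seconds: float) -> str:
    """Return "replace_entity", "change_target_length", or "keep"."""
    # A network status change means the assisting entity can no longer assist.
    if network_status_changed:
        return "replace_entity"
    if feedback is None or feedback.progress <= 0:
        return "keep"
    # Hypothetical rule: estimate the total time from the progress so far.
    projected_seconds = feedback.elapsed_seconds / feedback.progress
    if projected_seconds > expected_duration_seconds:
        return "change_target_length"
    return "keep"


slow = AssistFeedback(progress=0.25, elapsed_seconds=100.0)
on_track = AssistFeedback(progress=0.5, elapsed_seconds=100.0)
```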
- the model training management entity sends configuration change information #1 to model training entity #1.
- Configuration change information #1 indicates the adjusted allocation method, such as indicating the changed target length, or indicating that the target model training task is changed from being assisted by model training entity #2 to being assisted by model training entity #3.
- the model training management entity sends configuration change information #2 to model training entity #2.
- Configuration change information #2 instructs model training entity #2 to assist in executing the target model training task in an adjusted manner, or instructs model training entity #2 to stop assisting in executing the target model training task.
- the method further includes step S905:
- the model training management entity sends third training task configuration information to model training entity #3.
- the third training task configuration information instructs model training entity #3 to assist model training entity #1 in performing the target model training task.
- the content of the third training task configuration information is similar to the description of the first training task configuration information in step S504 in FIG. 5, and will not be repeated here.
- model training entity #3 can assist in executing the target model training task based on the third training task configuration information, that is, model training entity #3 can execute the contents of steps S506 to S508 in Figure 5, steps S707 to S710 in Figure 7, or steps S809 to S813 in Figure 8, which will not be elaborated here.
- the model training entity can provide feedback on the execution status to the model training management entity while assisting in the execution of the model training task, so that the model training management entity can be informed of the execution status of the target model training task and can make timely adjustments to improve the reliability of model training.
- Figures 5 to 9 can all be applied to the architectures described in Figures 1 to 3. It can be understood that, to facilitate understanding of the embodiments of the present application, the above description uses a single model training management entity for illustration. If there are multiple model training management entities in the system, for example, as shown in Figure 3, where the EMS and the NMS each contain a model training management entity, then the model training management entity in Figures 5 to 9 can be regarded as the model training management entity in the EMS, and it can interact with the model training management entity in the NMS to jointly manage multiple model training entities. For ease of understanding, the implementation method is described below in conjunction with Figure 10.
- FIG. 10 is a schematic flowchart of an implementation method in which two model training management entities perform model training management, provided in an embodiment of the present application.
- FIG. 10 takes the model training management entity #1 as the model training management entity in the NMS and the model training management entity #2 as the model training management entity in the EMS as an example.
- model training management entity #1 sends total computing power resource subscription information to model training management entity #2.
- the total computing power resource subscription information indicates that model training management entity #1 subscribes, from model training management entity #2, to the computing power resource information of multiple model training entities.
- the multiple model training entities are the model training entities managed by the model training management entity #2.
- For the content of the computing power resource information please refer to the description of step S502 in Figure 5, which will not be repeated here.
- Model training management entity #2, model training entity #1, and model training entity #2 can execute the contents of steps S601 to S604 as described in FIG. 6 , which will not be described in detail here.
- model training management entity #2 sends total computing power resource information to model training management entity #1.
- Model training management entity #2 can send the computing power resource information #1 and computing power resource information #2 received from model training entity #1 and model training entity #2 directly to model training management entity #1, or model training management entity #2 can also process the computing power resource information #1 and computing power resource information #2 and send them to model training management entity #1. This application does not make any special restrictions on this.
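- The two options in this step, forwarding the per-entity reports directly or processing them first, can be sketched as follows. This is a minimal illustration; the summary fields and the helper name `build_total_resource_info` are assumptions, since the application does not state how the information is processed.

```python
from typing import Dict, Union


def build_total_resource_info(
    reports: Dict[str, float], summarize: bool
) -> Dict[str, Union[int, float, Dict[str, float]]]:
    """Build the total computing power resource information sent upward."""
    if summarize:
        # Processed form: one summary for all managed model training entities.
        return {
            "entity_count": len(reports),
            "total_idle_compute": sum(reports.values()),
        }
    # Direct form: pass the per-entity reports through unchanged.
    return {"per_entity": dict(reports)}


reports = {"entity1": 1.5, "entity2": 6.5}
forwarded = build_total_resource_info(reports, summarize=False)
summary = build_total_resource_info(reports, summarize=True)
```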
- model training management entity #1 sends total training status subscription information to model training management entity #2.
- the total training status subscription information indicates that model training management entity #1 subscribes, from model training management entity #2, to the training status information of multiple model training entities.
- For the content of the training status information, please refer to the description of step S501 in Figure 5, which will not be repeated here.
- Model training management entity #2, model training entity #1, and model training entity #2 may execute the contents of steps S605 to S608 as described in FIG. 6, which will not be described in detail here.
- model training management entity #2 sends total training status information to model training management entity #1.
- Model training management entity #2 may send the training status information #1 and training status information #2 received from model training entity #1 and model training entity #2 directly to model training management entity #1, or model training management entity #2 may process the training status information #1 and training status information #2 and send them to model training management entity #1. This application does not make any special limitation on this.
- model training management entity #1 sends policy information to model training management entity #2.
- the strategy information indicates a method for determining a target model training entity (i.e., model training entity #2 in the embodiment of the present application) for assisting in executing a target model training task among multiple model training entities based on multiple computing resource information, and/or indicates a method for determining a target length of training data for a target model training task based on the total length of training data used to complete the target model training task.
- the policy information also indicates a manner of determining whether collaborative training is required.
- the method of determining whether collaborative training is required includes at least one of the following:
- Priority-based method: determine whether collaborative training is required based on the priority of the training task. When the training task priority is greater than or equal to the priority threshold, collaborative training is required; when the training task priority is less than the priority threshold, collaborative training is not required;
- Method based on computing power resource utilization: determine whether collaborative training is required based on computing power resource utilization. When the computing power resource utilization of the model training entity is greater than or equal to the utilization threshold, it is determined that the model training entity needs collaborative training; when the computing power resource utilization of the model training entity is less than the utilization threshold, it is determined that the model training entity does not need collaborative training;
- Method based on training data volume: determine whether collaborative training is required based on the amount of training data required for the training task. When the amount of training data for the training task is greater than the data volume threshold, collaborative training is determined to be required; when the amount of training data for the training task is less than or equal to the data volume threshold, collaborative training is determined not to be required.
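The three threshold checks listed above can be sketched as simple predicates. The threshold values, units, and function names below are assumptions chosen for illustration; the text only fixes the comparison direction (greater than or equal to for priority and utilization, strictly greater than for data volume).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical thresholds; the text does not fix their values or units."""
    priority: int = 5
    utilization: float = 0.8   # assumed: fraction of computing power in use
    data_volume: int = 10_000  # assumed: number of training samples


def needs_collab_by_priority(task_priority, t):
    # Required when the priority is greater than or equal to the threshold.
    return task_priority >= t.priority


def needs_collab_by_utilization(utilization, t):
    # Required when utilization is greater than or equal to the threshold.
    return utilization >= t.utilization


def needs_collab_by_data_volume(data_volume, t):
    # Required only when the data volume is strictly greater than the threshold.
    return data_volume > t.data_volume


t = Thresholds()
decisions = (
    needs_collab_by_priority(5, t),           # equal to the threshold
    needs_collab_by_utilization(0.5, t),      # below the threshold
    needs_collab_by_data_volume(10_000, t),   # equal to the threshold
)
```

Note the asymmetry at the boundary: a value equal to the threshold triggers collaborative training for the first two methods but not for the data volume method.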
- the method of determining the target model training entity includes at least one of the following:
- Computing resource-based method: determine the target model training entity based on idle computing resources. For example, select the model training entity with the most idle computing resources as the target model training entity;
- Node location-based method: determine the target model training entity based on the locations of the candidate model training entities relative to the main model training entity (e.g., model training entity #1 in the embodiment of the present application). For example, select the model training entity closest to the main model training entity as the target model training entity.
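The two selection methods above might look as follows in a minimal sketch. The `Candidate` fields, the planar coordinates, and the use of Euclidean distance as the meaning of "closest" are assumptions; the text does not say how location or distance is measured.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of a candidate model training entity."""
    entity_id: str
    idle_tflops: float  # assumed unit for idle computing resources
    position: tuple     # assumed (x, y) coordinates of the node


def select_by_idle_resources(candidates):
    """Pick the candidate with the most idle computing resources."""
    return max(candidates, key=lambda c: c.idle_tflops)


def select_by_location(candidates, main_position):
    """Pick the candidate closest to the main model training entity."""
    return min(candidates, key=lambda c: math.dist(c.position, main_position))


candidates = [
    Candidate("MTE#2", idle_tflops=6.5, position=(0.0, 3.0)),
    Candidate("MTE#3", idle_tflops=4.0, position=(1.0, 0.0)),
]
most_idle = select_by_idle_resources(candidates)
nearest = select_by_location(candidates, main_position=(0.0, 0.0))
```

With these sample values the two methods disagree (the entity with the most idle resources is not the nearest one), which is why the policy information has to state which method applies.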
- the method of determining the target length of the training data includes at least one of the following:
- the target length is determined based on the idle computing resources.
- for example, the target length is set to the length of training data that the idle computing resources of the model training entity can process;
- the target length is determined based on the number of model training entities with idle computing resources. For example, if the model training management entity selects three model training entities to assist in executing the target model training task, the training data of the target model training task can be decomposed into three training data of target length.
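Both ways of determining the target length can be sketched briefly. The function names are hypothetical, lengths are assumed to be sample counts, and the near-equal split is an assumption: the text says only that the data is decomposed into one part per assisting entity.

```python
def length_by_idle_capacity(total_length, idle_capacity):
    """Target length is what the idle resources can process, capped at the total."""
    return min(total_length, idle_capacity)


def lengths_by_helper_count(total_length, helper_count):
    """Split the data into helper_count parts whose lengths differ by at most one."""
    if helper_count <= 0:
        raise ValueError("helper_count must be positive")
    base, extra = divmod(total_length, helper_count)
    # The first `extra` parts take one additional sample so nothing is dropped.
    return [base + 1 if i < extra else base for i in range(helper_count)]


capped = length_by_idle_capacity(total_length=1000, idle_capacity=400)
parts = lengths_by_helper_count(total_length=10, helper_count=3)
```

The split keeps the sum of the parts equal to the total length, so every training sample is assigned to exactly one assisting entity.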
- the policy information may include a specific method, or may also include an index of the method, such as the name or identifier of the method, or may also include parameters of the method, such as a priority threshold, a data volume threshold, etc. This application does not specifically limit this.
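One way to picture policy information that carries a method index plus parameters is a small record that is resolved against a table of known methods. Everything here is an assumption for illustration: the `PolicyInfo` fields, the method identifiers, and the parameter names are not defined by the application.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyInfo:
    """Hypothetical policy information: a method identifier plus its parameters."""
    method_id: str
    parameters: dict = field(default_factory=dict)


def by_priority(task, params):
    return task["priority"] >= params["priority_threshold"]


def by_data_volume(task, params):
    return task["data_volume"] > params["data_volume_threshold"]


# Table mapping a method identifier to its implementation on the receiving side.
METHODS = {"priority": by_priority, "data_volume": by_data_volume}


def needs_collaboration(policy, task):
    """Resolve the method named in the policy and apply it with the policy parameters."""
    method = METHODS[policy.method_id]
    return method(task, policy.parameters)


policy = PolicyInfo(method_id="priority", parameters={"priority_threshold": 5})
task = {"priority": 7, "data_volume": 100}
result = needs_collaboration(policy, task)
```

Sending only an identifier and thresholds keeps the message small, at the cost of requiring both management entities to share the same table of methods.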
- Model training entity #1 and model training management entity #2 can execute step S503 in Figure 5, or steps S701 to S703 in Figure 7, or steps S801 to S803 in Figure 8, which will not be repeated here.
- model training management entity #2 sends training task update notification information to model training management entity #1.
- the training task update notification information may be the training task update notification information obtained by model training management entity #2 from model training entity #1, which indicates that a new model training task has been added to model training entity #1.
- model training management entity #2 can directly send the training task update notification information obtained from model training entity #1 to model training management entity #1, or can process the training task update notification information and then send it to model training management entity #1. This application does not specifically limit this.
- Model training management entity #2, model training entity #1, and model training entity #2 can execute the contents of steps S504 to S508 in Figure 5, or steps S704 to S710 in Figure 7, or steps S804 to S813 in Figure 8, and optionally can also execute the entire contents of Figure 9, which will not be repeated here.
- model training management entity #2 can determine the allocation method based on the policy information obtained from model training management entity #1.
- model training management entity #2 sends training task information to model training management entity #1.
- the training task information may be the training task information sent by model training management entity #2 to model training entity #1.
- for the relevant content, refer to the description of step S505 in FIG. 5, which will not be elaborated here.
- model training management entity #2 can directly send the training task information that it sent to model training entity #1 to model training management entity #1, or it can process the training task information and then send it to model training management entity #1. This application does not specifically limit this.
- the model training management function in the NMS can send policy information to the model training management function in the EMS to improve the efficiency of model training.
- a piece of information can be carried in one or more messages or one or more information elements in the same message, such as two messages, or two information elements in the same message, and this application does not impose any special limitations on this.
- FIG11 is a schematic diagram of a communication device provided in an embodiment of the present application.
- the device 1100 may include a transceiver module 1110 and a processing module 1120.
- the transceiver module 1110 may communicate with the outside of the device, and the processing module 1120 is used for data processing.
- the transceiver module 1110 may also be referred to as a communication interface or a transceiver unit.
- the device 1100 can implement a process corresponding to the execution of the model training management function in the method embodiments shown in Figures 5 to 10 above, wherein the processing module 1120 is used to execute processing-related operations of the model training management function in the method embodiments shown in Figures 5 to 10 above, and the transceiver module 1110 is used to execute transceiver-related operations of the model training management function in the method embodiments shown in Figures 5 to 10 above.
- the transceiver module 1110 is used to receive training status information of a first model training entity, where the training status information indicates at least one model training task of the first model training entity; the transceiver module 1110 is also used to obtain a plurality of pieces of computing power resource information, which respectively indicate the idle computing power resources that a plurality of model training entities have for model training; the processing module 1120 is used to determine a first target model training entity among the plurality of model training entities based on the plurality of pieces of computing power resource information; the transceiver module 1110 is also used to send first training task configuration information to the first target model training entity, where the first training task configuration information indicates assisting the first model training entity in performing a target model training task among the at least one model training task.
- the model training management entity can manage and arrange the computing resources of multiple model training entities, and assign the training tasks of the first model training entity to other model training entities with sufficient computing resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the first model training entity and improving the efficiency of model training.
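The division of work between the transceiver module and the processing module on the management side can be sketched as below. This is a hedged illustration, not the device's actual interface: the class name, the `send` callback standing in for the transceiver module, the message fields, the choice of "most idle resources" as the selection rule, and the exclusion of the first entity from the candidates are all assumptions.

```python
class ModelTrainingManager:
    """Hypothetical management-side logic; `send` stands in for the transceiver module."""

    def __init__(self, send):
        self._send = send  # callable(entity_id, message)
        self._tasks = {}   # entity_id -> list of task ids from training status information
        self._idle = {}    # entity_id -> idle computing power from resource information

    def on_training_status(self, entity_id, task_ids):
        self._tasks[entity_id] = list(task_ids)

    def on_resource_info(self, entity_id, idle):
        self._idle[entity_id] = idle

    def assign_helper(self, first_entity_id, task_id):
        """Pick a target entity for the task and send it the task configuration."""
        if task_id not in self._tasks.get(first_entity_id, []):
            raise KeyError(f"{task_id} was not reported by {first_entity_id}")
        # Assumption: the entity that owns the task is not a candidate helper.
        candidates = {e: v for e, v in self._idle.items() if e != first_entity_id}
        if not candidates:
            return None
        target = max(candidates, key=candidates.get)
        self._send(target, {
            "type": "training_task_configuration",
            "assist_entity": first_entity_id,
            "task_id": task_id,
        })
        return target


sent = []
manager = ModelTrainingManager(send=lambda entity, msg: sent.append((entity, msg)))
manager.on_training_status("MTE#1", ["task-A"])
manager.on_resource_info("MTE#1", 0.5)
manager.on_resource_info("MTE#2", 6.5)
manager.on_resource_info("MTE#3", 4.0)
target = manager.assign_helper("MTE#1", "task-A")
```

Passing `send` in as a callback keeps the selection logic independent of how messages are actually transported.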
- the device 1100 may implement a process corresponding to the execution of the first model training entity (i.e., model training entity #1) in the method embodiments shown in Figures 5 to 10 above, wherein the transceiver module 1110 is used to execute the transceiver-related operations of model training entity #1 in the method embodiments shown in Figures 5 to 10 above, and the processing module 1120 is used to execute the processing-related operations of model training entity #1 in the method embodiments shown in Figures 5 to 10 above.
- the model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and arrange the computing power resources of multiple model training entities, and assign the training tasks of the model training entity to other model training entities with sufficient computing power resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the model training entity and improving the efficiency of model training.
- the processing module 1120 is used to generate training status information; the transceiver module 1110 is used to send training status information to the model training management entity, the training status information indicating at least one model training task of the model training entity; the transceiver module is also used to send computing power resource information to the model training management entity, the computing power resource information indicates the idle computing power resources of the model training entity for model training; the transceiver module 1110 is also used to receive training task information from the model training management entity, the training task information indicates the first target model training entity that assists in executing the target model training task in at least one model training task.
- the device 1100 can implement a process corresponding to the execution of the first target model training entity (i.e., model training entity #2) in the method embodiments shown in Figures 5 to 10 above, wherein the transceiver module 1110 is used to execute the transceiver-related operations of model training entity #2 in the method embodiments shown in Figures 5 to 10 above, and the processing module 1120 is used to execute the processing-related operations of model training entity #2 in the method embodiments shown in Figures 5 to 10 above.
- the transceiver module 1110 is used to send computing power resource information to the model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has for model training; the transceiver module 1110 is also used to receive first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting the first model training entity in performing the target model training task; the processing module 1120 is used to assist the first model training entity in performing the target model training task.
- the model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and arrange the computing power resources of multiple model training entities, and assign the training tasks of the model training entity to other model training entities with sufficient computing power resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the model training entity and improving the efficiency of model training.
- the device 1100 here is embodied in the form of a functional unit.
- the term "unit" here may refer to an application specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor or a group processor, etc.) and a memory for executing one or more software or firmware programs, a merged logic circuit and/or other suitable components that support the described functions.
- the device 1100 may be specifically the model training management function in the above-mentioned embodiment or a chip applied to the model training management function, and may be used to execute the process corresponding to the model training management function in the above-mentioned method embodiment, or the device 1100 may be specifically the model training function in the above-mentioned embodiment or a chip applied to the model training function, and may be used to execute the process corresponding to the model training function in the above-mentioned method embodiment. To avoid repetition, it will not be described here.
- the above-mentioned device 1100 has the function of implementing the corresponding steps performed by the model training management function in the above-mentioned method, or the above-mentioned device 1100 has the function of implementing the corresponding steps performed by the model training function in the above-mentioned method.
- the function can be implemented by hardware, or by hardware executing the corresponding software.
- the hardware or software includes one or more modules corresponding to the above-mentioned functions; for example, the transceiver module can be replaced by a transceiver (for example, the sending unit in the transceiver module can be replaced by a transmitter, and the receiving unit in the transceiver module can be replaced by a receiver), and other units, such as processing modules, can be replaced by processors to respectively perform the transceiver operations and related processing operations in each method embodiment.
- the above-mentioned transceiver module can also be a transceiver circuit (for example, it can include a receiving circuit and a sending circuit), and the processing module can be a processing circuit.
- the device in Figure 11 can be the model training management function or the model training function in the aforementioned embodiment, or it can be a chip or a chip system, such as a system on chip (SoC).
- the transceiver module can be an input/output circuit or a communication interface.
- the processing module is a processor or microprocessor or integrated circuit integrated on the chip. This is not limited here.
- FIG12 shows a communication device 1200 provided in an embodiment of the present application.
- the device 1200 includes a processor 1210 and a memory 1220.
- the memory 1220 is used to store instructions, and the processor 1210 can call the instructions stored in the memory 1220 to execute the model training management function or the process corresponding to the model training function in the above method embodiment.
- the memory 1220 is used to store instructions, and the processor 1210 can call the instructions stored in the memory 1220 to execute the process corresponding to the model training management function in the above method embodiment.
- the memory 1220 is used to store instructions, and the processor 1210 can call the instructions stored in the memory 1220 to execute the process corresponding to the model training function in the above method embodiment.
- the device 1200 can be specifically the model training management function or the model training function in the above embodiment, or it can be a chip or chip system for the model training management function or the model training function. Specifically, the device 1200 can be used to execute the process corresponding to the model training management function or the model training function in the above method embodiment.
- the memory 1220 may include a read-only memory and a random access memory, and provide instructions and data to the processor.
- a portion of the memory may also include a non-volatile random access memory.
- the memory may also store information about the device type.
- the processor 1210 may be used to execute instructions stored in the memory, and when the processor 1210 executes instructions stored in the memory, the processor 1210 is used to execute the process of the method embodiment corresponding to the model training management function or the model training function.
- each step of the above method can be completed by an integrated logic circuit of hardware in a processor or an instruction in the form of software.
- the steps of the method disclosed in conjunction with the embodiment of the present application can be directly embodied as a hardware processor for execution, or a combination of hardware and software modules in a processor for execution.
- the software module can be located in a storage medium mature in the art such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, etc.
- the storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above method in conjunction with its hardware. To avoid repetition, it is not described in detail here.
- the processor in the embodiment of the present application can be an integrated circuit chip with signal processing capabilities.
- each step of the above method embodiment can be completed by an integrated logic circuit of hardware in the processor or an instruction in the form of software.
- the above processor can be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the processor in the embodiment of the present application can implement or execute the methods, steps and logic block diagrams disclosed in the embodiment of the present application.
- the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
- the steps of the method disclosed in the embodiment of the present application can be directly embodied as a hardware decoding processor to execute, or the hardware and software modules in the decoding processor can be combined and executed.
- the software module can be located in a mature storage medium in the field such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, etc.
- the storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
- the memory in the embodiments of the present application can be a volatile memory or a non-volatile memory, or can include both volatile and non-volatile memories.
- the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
- the volatile memory can be a random access memory (RAM), which is used as an external cache.
- by way of example, RAM may take forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
- FIG13 shows a communication device 1300 provided in an embodiment of the present application.
- the device 1300 includes a processing circuit 1310 and a transceiver circuit 1320.
- the processing circuit 1310 and the transceiver circuit 1320 communicate with each other through an internal connection path, and the processing circuit 1310 is used to execute instructions to control the transceiver circuit 1320 to send and/or receive signals.
- the device 1300 may further include a storage medium 1330, which communicates with the processing circuit 1310 and the transceiver circuit 1320 via an internal connection path.
- the storage medium 1330 is used to store instructions, and the processing circuit 1310 may execute the instructions stored in the storage medium 1330.
- the device 1300 is used to implement the process corresponding to the model training management entity in the above method embodiment.
- the processing circuit 1310 is used to implement the function of the above-mentioned processing unit 1120
- the transceiver circuit 1320 is used to implement the functions of the above-mentioned transceiver unit 1110 or the transceiver unit 1110 and the processing unit 1120.
- the device 1300 is used to implement the process corresponding to the model training entity in the above method embodiment.
- the processing circuit 1310 is used to implement the function of the above-mentioned processing unit 1120
- the transceiver circuit 1320 is used to implement the functions of the above-mentioned transceiver unit 1110 or the transceiver unit 1110 and the processing unit 1120.
- the present application also provides a computer program product, which includes computer program code; when the computer program code is run on a computer, the computer executes the method in the embodiments shown in Figures 5 to 10.
- the present application also provides a computer-readable medium, which stores program code.
- when the program code runs on a computer, the computer executes the method in the embodiments shown in Figures 5 to 10.
- the present application also provides a system, which includes the aforementioned model training management function and multiple model training functions.
- "At least one of" herein refers to all or any combination of the listed items.
- for example, "at least one of A, B, and C" may refer to the following seven situations: A exists alone, B exists alone, C exists alone, A and B exist at the same time, A and C exist at the same time, B and C exist at the same time, and A, B, and C exist at the same time.
- "At least one" herein refers to one or more.
- “More than one” refers to two or more.
- B corresponding to A means that B is associated with A, and B can be determined according to A.
- determining B according to A does not mean determining B only according to A, but B can also be determined according to A and/or other information.
- the terms “include”, “comprises”, “has” and their variations all mean “including but not limited to”, unless otherwise specifically emphasized.
- indication may include direct indication and indirect indication, and may also include explicit indication and implicit indication.
- the information indicated by a certain piece of information (such as the first information described above) is referred to as the information to be indicated.
- the information to be indicated can be directly indicated, such as the information to be indicated itself or the index of the information to be indicated.
- the information to be indicated can also be indirectly indicated by indicating other information, wherein the other information has an association relationship with the information to be indicated. It is also possible to indicate only a part of the information to be indicated, while the other parts of the information to be indicated are known or agreed in advance.
- the indication of specific information can also be achieved by means of the arrangement order of each information agreed in advance (such as specified by the protocol), thereby reducing the indication overhead to a certain extent.
- pre-configuration can be achieved by pre-saving corresponding codes, tables or other methods that can be used to indicate relevant information in a device (for example, a first terminal device), and the present application does not limit its specific implementation method.
- the disclosed systems, devices and methods can be implemented in other ways.
- the device embodiments described above are only schematic.
- the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
- Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes a number of instructions to enable a computer device (which can be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application.
- the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Neurology (AREA)
- Mobile Radio Communication Systems (AREA)
Claims (41)
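- A method for model training management, characterized in that the method comprises: a model training management entity receives training status information of a first model training entity, the training status information indicating at least one model training task of the first model training entity; the model training management entity obtains a plurality of pieces of computing power resource information, the plurality of pieces of computing power resource information respectively indicating idle computing power resources that a plurality of model training entities have for model training; the model training management entity determines a first target model training entity among the plurality of model training entities based on the plurality of pieces of computing power resource information; the model training management entity sends first training task configuration information to the first target model training entity, the first training task configuration information indicating assisting the first model training entity in executing a target model training task among the at least one model training task.
- The method according to claim 1, characterized in that the model training management entity obtaining the plurality of pieces of computing power resource information comprises: the model training management entity periodically receives the plurality of pieces of computing power resource information from the plurality of model training entities respectively.
- The method according to claim 1, characterized in that the model training management entity obtaining the plurality of pieces of computing power resource information comprises: the model training management entity sends a plurality of pieces of computing power resource query information to the plurality of model training entities respectively; the model training management entity receives the plurality of pieces of computing power resource information from the plurality of model training entities respectively.
- The method according to any one of claims 1 to 3, characterized in that the model training management entity receiving the training status information of the first model training entity comprises: the model training management entity periodically receives the training status information from the first model training entity.
- The method according to any one of claims 1 to 4, characterized in that the method further comprises: the model training management entity sends training task configuration notification information to the first model training entity, the training task configuration notification information indicating that the first target model training entity assists in executing the target model training task.
- The method according to any one of claims 1 to 4, characterized in that the method further comprises: the model training management entity determines a target length of the training data used by the first target model training entity to assist in executing the target model training task; the model training management entity sends second training task configuration information to the first model training entity, the second training task configuration information indicating that the first target model training entity uses training data of the target length to assist in executing the target model training task.
- The method according to claim 6, characterized in that the training status information further indicates a total length of the training data used to complete the target model training task, and the model training management entity determining the target length comprises: the model training management entity determines the target length based on the total length and the computing power resource information of the first target model training entity.
- The method according to any one of claims 1 to 7, characterized in that the method further comprises: the model training management entity receives assisted training feedback information from the first target model training entity, the assisted training feedback information indicating at least one of the following: the accuracy achieved by the first target model training entity in executing the target model training task, the time consumed by the first target model training entity in executing the target model training task, the execution progress of the first target model training entity in executing the target model training task, and the amount of resources occupied by the first target model training entity in executing the target model training task.
- The method according to any one of claims 1 to 8, characterized in that the method further comprises: the model training management entity receives network status change information from the first target model training entity; the model training management entity determines, based on the network status change information and the plurality of pieces of computing power resource information, to replace the first target model training entity with a second target model training entity among the plurality of model training entities; the model training management entity sends third training task configuration information to the second target model training entity, the third training task configuration information indicating assisting the first model training entity in executing the target model training task.
- The method according to any one of claims 1 to 9, characterized in that the method further comprises: the model training management entity obtains policy information, the policy information indicating a manner of determining the first target model entity among the plurality of model training entities based on the plurality of pieces of computing power resource information, and/or indicating a manner of determining, based on the total length of the training data used to complete the target model training task, the target length of the training data used by the first target model entity to assist in executing the target model training task.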
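- A method for model training management, characterized in that the method comprises: a model training entity sends training status information to a model training management entity, the training status information indicating at least one model training task of the model training entity; the model training entity sends computing power resource information to the model training management entity, the computing power resource information indicating idle computing power resources that the model training entity has for model training; the model training entity receives training task information from the model training management entity, the training task information indicating a first target model training entity that assists in executing a target model training task among the at least one model training task.
- The method according to claim 11, characterized in that the model training entity sending the computing power resource information to the model training management entity comprises: the model training entity periodically sends the computing power resource information to the model training management entity; or, the model training entity receives computing power resource query information from the model training management entity; the model training entity sends the computing power resource information to the model training management entity.
- The method according to claim 11 or 12, characterized in that the model training entity sending the training status information to the model training management entity comprises: the model training entity periodically sends the training status information to the model training management entity; or, the model training entity sends the training status information to the model training management entity based on a trigger event.
- The method according to any one of claims 11 to 13, characterized in that the training task information further indicates that the first target model training entity uses training data of a target length to assist in executing the target model training task.
- The method according to any one of claims 11 to 14, characterized in that the method further comprises: the model training entity receives model training report information from the target model training entity, the model training report information indicating a sub-model obtained by completing the target model training task; the model training entity performs model aggregation based on the sub-model.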
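- A model training management method, wherein the method comprises: sending, by a model training entity, computing power resource information to a model training management entity, wherein the computing power resource information indicates idle computing power resources that the model training entity has for model training; receiving, by the model training management entity, first training task configuration information from the model training management entity, wherein the first training task configuration information indicates to assist a first model training entity in executing a target model training task; and assisting, by the model training management entity, the first model training management entity in executing the target model training task.
- The method according to claim 16, wherein the sending, by the model training entity, the computing power resource information to the model training management entity comprises: periodically sending, by the model training entity, the computing power resource information to the model training management entity; or receiving, by the model training entity, computing power resource query information from the model training management entity, and sending, by the model training entity, the computing power resource information to the model training management entity.
- The method according to claim 16 or 17, wherein the assisting, by the model training management entity, the first model training management entity in executing the target model training task comprises: obtaining, by the model training entity, target training data; and assisting, by the model training entity by using the target training data, the first model training management entity in executing the target model training task.
- The method according to any one of claims 16 to 18, wherein the method further comprises: sending, by the model training entity, model training report information to the first model training entity, wherein the model training report information indicates a sub-model obtained by completing the target model training task.
- The method according to any one of claims 16 to 19, wherein the method further comprises: sending, by the model training entity, assisted training feedback information to the model training management entity, wherein the assisted training feedback information indicates at least one of the following: an accuracy achieved by the model training entity in executing the target model training task, a duration consumed by the model training entity in executing the target model training task, an execution progress of the model training entity in executing the target model training task, and a quantity of resources occupied by the model training entity in executing the target model training task.
- The method according to any one of claims 16 to 20, wherein the method further comprises: sending, by the model training entity, network status change information to the model training management entity, wherein the network status change information indicates that the model training entity is unable to complete the target model training task.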
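- A model training management method, wherein the method comprises: sending, by a first model training entity, training status information to a model training management entity, wherein the training status information indicates at least one model training task of the first model training entity; receiving, by the model training management entity, the training status information of the first model training entity; obtaining, by the model training management entity, a plurality of pieces of computing power resource information, wherein the plurality of pieces of computing power resource information respectively indicate idle computing power resources that a plurality of model training entities have for model training; determining, by the model training management entity, a first target model training entity from the plurality of model training entities based on the plurality of pieces of computing power resource information; sending, by the model training management entity, first training task configuration information to the first target model training entity, wherein the first training task configuration information indicates to assist the first model training entity in executing a target model training task; receiving, by the first target model training management entity, the first training task configuration information from the model training management entity; and assisting, by the first target model training management entity, the first model training management entity in executing the target model training task.
- The method according to claim 22, wherein the obtaining, by the model training management entity, the plurality of pieces of computing power resource information comprises: periodically receiving, by the model training management entity, the plurality of pieces of computing power resource information respectively from the plurality of model training entities; and periodically sending, by the plurality of model training entities, the computing power resource information to the model training management entity.
- The method according to claim 22, wherein the obtaining, by the model training management entity, the plurality of pieces of computing power resource information comprises: sending, by the model training management entity, a plurality of pieces of computing power resource query information respectively to the plurality of model training entities; receiving, by the plurality of model training entities, the plurality of pieces of computing power resource query information respectively from the model training management entity; receiving, by the model training management entity, the plurality of pieces of computing power resource information respectively from the plurality of model training entities; and sending, by the plurality of model training entities, the plurality of pieces of computing power resource information respectively to the model training management entity.
- The method according to any one of claims 22 to 24, wherein the sending, by the first model training entity, the training status information of the first model training entity to the model training management entity and the receiving, by the model training management entity, the training status information of the first model training entity comprise: periodically receiving, by the model training management entity, the training status information from the first model training entity, and periodically sending, by the first model training entity, the training status information of the first model training entity to the model training management entity; or sending, by the first model training entity, the training status information to the model training management entity based on a trigger event, and receiving, by the model training management entity, the training status information from the first model training entity.
- The method according to any one of claims 22 to 25, wherein the method further comprises: sending, by the model training management entity, training task configuration notification information to the first model training entity, wherein the training task configuration notification information indicates that the first target model training entity assists in executing the target model training task; and receiving, by the first model training entity, the training task configuration notification information from the model training management entity.
- The method according to any one of claims 22 to 25, wherein the method further comprises: determining, by the model training management entity, a target length of training data used by the first target model training entity to assist in executing the target model training task; sending, by the model training management entity, second training task configuration information to the first model training entity, wherein the second training task configuration information indicates that the first target model training entity uses training data of the target length to assist in executing the target model training task; and receiving, by the first model training entity, the second training task configuration information from the model training management entity.
- The method according to claim 27, wherein the determining, by the model training management entity, the target length comprises: determining, by the model training management entity, the target length based on the total length and the computing power resource information of the first target model training entity, wherein the training status information further indicates the total length of training data used to complete the target model training task.
- The method according to any one of claims 22 to 28, wherein the method further comprises: receiving, by the model training management entity, assisted training feedback information from the first target model training entity, wherein the assisted training feedback information indicates at least one of the following: an accuracy achieved by the first target model training entity in executing the target model training task, a duration consumed by the first target model training entity in executing the target model training task, an execution progress of the first target model training entity in executing the target model training task, and a quantity of resources occupied by the first target model training entity in executing the target model training task; and sending, by the first target model training entity, the assisted training feedback information to the model training management entity.
- The method according to any one of claims 22 to 29, wherein the method further comprises: receiving, by the model training management entity, network status change information from the first target model training entity; sending, by the first target model training entity, the network status change information to the model training management entity; determining, by the model training management entity based on the network status change information and the plurality of pieces of computing power resource information, to replace the first target model training entity with a second target model training entity in the plurality of model training entities; sending, by the model training management entity, third training task configuration information to the second target model training entity, wherein the third training task configuration information indicates to assist the first model training entity in executing the target model training task; and receiving, by the second target model training entity, the third training task configuration information from the model training management entity.
- The method according to any one of claims 22 to 30, wherein the method further comprises: obtaining, by the model training management entity, policy information, wherein the policy information indicates a manner of determining the first target model entity from the plurality of model training entities based on the plurality of pieces of computing power resource information, and/or indicates a manner of determining, based on a total length of training data used to complete the target model training task, a target length of training data used by the first target model entity to assist in executing the target model training task.
- The method according to any one of claims 22 to 31, wherein the method further comprises: sending, by the target model training entity, the model training report information to the model training entity, wherein the model training report information indicates a sub-model obtained by completing the target model training task; receiving, by the model training entity, the model training report information from the target model training entity; and performing, by the model training entity, model aggregation based on the sub-model.
- The method according to claim 22 or 32, wherein the assisting, by the model training management entity, the first model training management entity in executing the target model training task comprises: obtaining, by the model training entity, target training data; and assisting, by the model training entity by using the target training data, the first model training management entity in executing the target model training task.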
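- A communication apparatus, comprising modules configured to perform the method according to any one of claims 1 to 10, or comprising modules configured to perform the method according to any one of claims 11 to 15, or comprising modules configured to perform the method according to any one of claims 16 to 21, or comprising modules configured to perform the method according to any one of claims 22 to 33.
- A communication apparatus, comprising at least one processor, wherein the at least one processor is configured to execute a computer program stored in a memory, so that the apparatus implements the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
- A communication apparatus, comprising at least one processor, wherein the at least one processor is coupled to at least one memory, and the at least one processor is configured to execute a computer program or instructions stored in the at least one memory, so that the method according to any one of claims 1 to 10 is performed, the method according to any one of claims 11 to 15 is performed, the method according to any one of claims 16 to 21 is performed, or the method according to any one of claims 22 to 33 is performed.
- A computer-readable storage medium, comprising a computer program, wherein when the computer program is run on a computer, the computer is enabled to perform the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
- A communication system, comprising an apparatus configured to perform the method according to any one of claims 1 to 10, an apparatus configured to perform the method according to any one of claims 11 to 15, and an apparatus configured to perform the method according to any one of claims 16 to 21.
- A computer program product, comprising computer program code, wherein when the computer program code is run on a computer, the computer is enabled to implement the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
- A communication apparatus, comprising: a processor, configured to execute a computer program stored in a memory, so that the apparatus performs the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
- The apparatus according to claim 40, wherein the apparatus further comprises the memory.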
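The management flow recited in the claims (select an assisting entity by idle computing power, size its share of the training data, replace it after a network status change, then aggregate the returned sub-models) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `TrainingEntity`, `select_target`, `target_length`, `reselect_target`, and `aggregate` are hypothetical, and the "most idle computing power", "proportional split", and "length-weighted averaging" policies are assumptions chosen for the example. The claims leave the selection and splitting manner to the policy information.

```python
from dataclasses import dataclass


@dataclass
class TrainingEntity:
    """A model training entity and the idle computing power it reports.

    The unit of idle_compute is hypothetical; the claims do not define one.
    """
    name: str
    idle_compute: float


def select_target(entities, requester):
    """Pick the entity with the most idle compute, excluding the requester.

    Assumed policy. Returns None when no other entity has idle compute.
    """
    candidates = [e for e in entities
                  if e.name != requester and e.idle_compute > 0]
    return max(candidates, key=lambda e: e.idle_compute, default=None)


def target_length(total_length, requester_idle, target_idle):
    """Size the helper's share of the data in proportion to idle compute.

    Assumed policy: the claims only say the target length is based on the
    total length and the target entity's computing power resource information.
    """
    return int(total_length * target_idle / (requester_idle + target_idle))


def reselect_target(entities, requester, failed):
    """Choose a replacement after `failed` reports a network status change."""
    remaining = [e for e in entities if e.name != failed]
    return select_target(remaining, requester)


def aggregate(sub_models):
    """Average sub-model parameters weighted by training data length.

    sub_models is a list of (weights, data_length) pairs. Weighted averaging
    is an assumption; the claims only say aggregation uses the sub-model.
    """
    total = sum(length for _, length in sub_models)
    dim = len(sub_models[0][0])
    return [sum(w[i] * length for w, length in sub_models) / total
            for i in range(dim)]


if __name__ == "__main__":
    entities = [TrainingEntity("A", 2.0),
                TrainingEntity("B", 8.0),
                TrainingEntity("C", 5.0)]
    helper = select_target(entities, requester="A")
    print(helper.name, target_length(1000, 2.0, helper.idle_compute))
    print(reselect_target(entities, requester="A", failed="B").name)
```

In this sketch entity `A` is the first model training entity, `B` is selected as the first target model training entity, and `C` becomes the second target model training entity if `B` reports that it can no longer complete the task.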
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23870635.2A EP4567638A4 (en) | 2022-09-27 | 2023-09-22 | Model training management method, apparatus, and system |
| US19/060,546 US20250190880A1 (en) | 2022-09-27 | 2025-02-21 | Model training management method, apparatus, and system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211181988.7A CN117828341A (zh) | 2022-09-27 | 2022-09-27 | Model training management method, apparatus, and system |
| CN202211181988.7 | 2022-09-27 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US19/060,546 Continuation US20250190880A1 (en) | 2022-09-27 | 2025-02-21 | Model training management method, apparatus, and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024067404A1 (zh) | 2024-04-04 |
Family
ID=90476149
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/120765 Ceased WO2024067404A1 (zh) | 2022-09-27 | 2023-09-22 | Model training management method, apparatus, and system |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250190880A1 (zh) |
| EP (1) | EP4567638A4 (zh) |
| CN (1) | CN117828341A (zh) |
| WO (1) | WO2024067404A1 (zh) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119725805A (zh) * | 2024-12-12 | 2025-03-28 | 广芯微电子(广州)股份有限公司 | Wireless battery management method and system based on computing power collaboration |
| WO2025213018A1 (en) * | 2024-04-05 | 2025-10-09 | Interdigital Patent Holdings, Inc. | Edge-aware aiml enabler service for distributed and split learning |
| WO2025256547A1 (zh) * | 2024-06-14 | 2025-12-18 | 华为技术有限公司 | Communication method and apparatus |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120872424A (zh) * | 2024-04-30 | 2025-10-31 | 华为技术有限公司 | Training task processing method and related device |
| CN120803755B (zh) * | 2025-09-15 | 2025-11-25 | 浪潮电子信息产业股份有限公司 | Model training control method, program product, electronic device, and medium |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
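| CN110221913A (zh) * | 2019-04-26 | 2019-09-10 | 深圳市致宸信息科技有限公司 | Method, terminal, device, and storage medium for monitoring the cloud computing power of a server |
| CN111368991A (zh) * | 2018-12-25 | 2020-07-03 | 杭州海康威视数字技术股份有限公司 | Deep learning model training method and apparatus, and electronic device |
| WO2021105313A1 (en) * | 2019-11-28 | 2021-06-03 | Secondmind Limited | Parallelised training of machine learning models |
| CN113849295A (zh) * | 2020-06-28 | 2021-12-28 | 华为技术有限公司 | Model training method and apparatus, and computer-readable storage medium |
| CN114089889A (zh) * | 2021-02-09 | 2022-02-25 | 京东科技控股股份有限公司 | Model training method, apparatus, and storage medium |
| CN114169427A (zh) * | 2021-12-06 | 2022-03-11 | 北京百度网讯科技有限公司 | End-to-end adaptive distributed training method, apparatus, and device |
| CN114595058A (zh) * | 2022-03-02 | 2022-06-07 | 北京金山云网络技术有限公司 | GPU resource-based model training method and apparatus, electronic device, and storage medium |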
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190317825A1 (en) * | 2018-04-16 | 2019-10-17 | Kazuhm, Inc. | System for managing deployment of distributed computing resources |
| US11423254B2 (en) * | 2019-03-28 | 2022-08-23 | Intel Corporation | Technologies for distributing iterative computations in heterogeneous computing environments |
| CN112202928B (zh) * | 2020-11-16 | 2022-05-17 | 绍兴文理学院 | System and method for selecting trusted offloading cooperative nodes in a sensing edge cloud blockchain network |
- 2022-09-27: CN, application CN202211181988.7A, patent CN117828341A (zh), active, Pending
- 2023-09-22: WO, application PCT/CN2023/120765, patent WO2024067404A1 (zh), not active, Ceased
- 2023-09-22: EP, application EP23870635.2A, patent EP4567638A4 (en), active, Pending
- 2025-02-21: US, application US19/060,546, patent US20250190880A1 (en), active, Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4567638A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4567638A1 (en) | 2025-06-11 |
| US20250190880A1 (en) | 2025-06-12 |
| CN117828341A (zh) | 2024-04-05 |
| EP4567638A4 (en) | 2025-12-31 |
Similar Documents
| Publication | Title |
|---|---|
| WO2024067404A1 (zh) | Model training management method, apparatus, and system |
| US12328602B2 (en) | ML model management in O-RAN |
| US20230224752A1 (en) | Communication method, apparatus, and system |
| CN111967605B (zh) | Machine learning in radio access networks |
| US20210235277A1 (en) | Method and apparatus for dynamically allocating radio resources in a wireless communication system |
| US20220012645A1 (en) | Federated learning in o-ran |
| Wang et al. | Cooperative end-edge-cloud computing and resource allocation for digital twin enabled 6G industrial IoT |
| WO2022060923A1 (en) | Non-realtime services for ai/ml |
| JP2025520969A (ja) | Federated learning method, apparatus, communication device, and readable storage medium |
| Kumar et al. | Quality of service‐aware adaptive radio resource management based on deep federated Q‐learning for multi‐access edge computing in beyond 5G cloud‐radio access network |
| Joloudari et al. | The state-of-the-art review on resource allocation problem using artificial intelligence methods on various computing paradigms |
| Cao et al. | Learning-based multitier split computing for efficient convergence of communication and computation |
| Singh et al. | Digital Twin-Assisted Adaptive Federated Multi-Agent DRL with GenAI for Optimized Resource Allocation in IoV Networks |
| Yu et al. | Snake learning: A communication-and computation-efficient distributed learning framework for 6g |
| WO2021152629A1 (en) | Method and apparatus for dynamically allocating radio resources in a wireless communication system |
| WO2026077104A1 (zh) | Algorithm function management method, apparatus, storage medium, and program product |
| Umamaheswaran et al. | A Hybrid Machine Learning Framework for Dynamic Resource Optimization in 5G Networks |
| Bano et al. | A novel approach to distributed model aggregation using Apache Kafka |
| Zhao et al. | Online optimal task offloading with one-bit feedback |
| US20190149613A1 (en) | Method, communication terminal, and communication node device for associating resources |
| Deb et al. | Loop-the-loops: Fragmented learning over networks for constrained IoT devices |
| WO2024125787A1 (en) | Using distributed learning to develop a machine learning model |
| Cao et al. | A deep Q-network-based edge service offloading in cloud–edge–terminal environment |
| CN118802564A (zh) | Network construction method, apparatus, and storage medium |
| CN120916114A (zh) | In-network computing management method, apparatus, and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23870635; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2023870635; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023870635; Country of ref document: EP; Effective date: 20250304 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | WIPO information: published in national office | Ref document number: 2023870635; Country of ref document: EP |