WO2024067404A1 - Method, device and system for model training management - Google Patents

Method, device and system for model training management

Info

Publication number
WO2024067404A1
Authority
WO
WIPO (PCT)
Prior art keywords
model training
entity
training
target
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/120765
Other languages
English (en)
French (fr)
Inventor
黄谢田
曹龙雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP23870635.2A (EP4567638A4)
Publication of WO2024067404A1 (zh)
Priority to US19/060,546 (US20250190880A1)
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5094 Allocation of resources, e.g. of the central processing unit [CPU] where the allocation takes into account power or heat criteria
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0495 Quantised networks; Sparse networks; Compressed networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/098 Distributed learning, e.g. federated learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass

Definitions

  • The present application relates to the field of communication technology, and more specifically, to a method, device and system for model training management.
  • Models are usually obtained through training.
  • A model training entity can be configured with a model training task, and it obtains a usable model by performing that task on training data.
  • Multiple model training entities are usually configured in a communication system.
  • For example, multiple base stations can each deploy a model training entity.
  • The training tasks of the model training entity in each base station can be set according to the needs of that base station.
  • The model training tasks of different model training entities will therefore also differ.
  • As a result, some model training entities in the communication system may be overloaded with model training tasks while others sit idle, which lowers the overall efficiency of model training in the communication system.
  • The present application provides a method, device and system for model training management, which can improve the efficiency of model training.
  • In a first aspect, a method for model training management is provided, which can be implemented by a model training management entity or a chip in the model training management entity, and the method includes: the model training management entity receives training status information of a first model training entity, and the training status information indicates at least one model training task of the first model training entity; the model training management entity obtains multiple computing power resource information, and the multiple computing power resource information respectively indicates idle computing power resources for model training of multiple model training entities; the model training management entity determines a first target model training entity among the multiple model training entities based on the multiple computing power resource information; the model training management entity sends first training task configuration information to the first target model training entity, and the first training task configuration information indicates assisting the first model training entity in performing a target model training task among the at least one model training task.
  • the number of target model training tasks can be one or more.
  • the number of first target model training entities can be one or more.
  • the model training management entity can manage and arrange the computing resources of multiple model training entities, and assign the training tasks of the first model training entity to other model training entities with sufficient computing resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the first model training entity and improving the efficiency of model training.
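The selection step of the first aspect can be illustrated with a minimal Python sketch. The source does not specify a selection rule, so the rule used here (exclude the overloaded entity, keep entities whose idle computing power covers the requirement, prefer the most idle) is an assumption, as are the names `ComputeReport`, `select_target_entities`, `idle_units` and the entity identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeReport:
    """Idle computing power reported by one model training entity (hypothetical format)."""
    entity_id: str
    idle_units: float  # the unit is an assumption; the source does not define one


def select_target_entities(reports, overloaded_id, required_units, max_helpers=1):
    """Pick helper entities with the most idle computing power.

    The overloaded (first) entity is excluded, and only entities whose idle
    computing power covers `required_units` are considered. Returns up to
    `max_helpers` entity ids, most idle first.
    """
    candidates = [
        r for r in reports
        if r.entity_id != overloaded_id and r.idle_units >= required_units
    ]
    candidates.sort(key=lambda r: r.idle_units, reverse=True)
    return [r.entity_id for r in candidates[:max_helpers]]


reports = [
    ComputeReport("mte-1", 0.5),  # the first (overloaded) model training entity
    ComputeReport("mte-2", 6.0),
    ComputeReport("mte-3", 3.0),
]
helpers = select_target_entities(reports, overloaded_id="mte-1", required_units=2.0)
print(helpers)
```

Passing `max_helpers=2` would return more than one first target model training entity, which the text explicitly allows.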
  • the model training management entity obtains multiple computing power resource information, including: the model training management entity periodically receives multiple computing power resource information from multiple model training entities.
  • the periods at which multiple model training entities send computing resource information can be the same or different.
  • each model training entity can report its own computing power resource information in a timely manner, so that the model training management entity can perform timely arrangement and management, thereby improving the efficiency of model training.
  • the model training management entity obtains multiple computing power resource information, including: the model training management entity sends multiple computing power resource query information to multiple model training entities respectively; the model training management entity receives multiple computing power resource information from multiple model training entities respectively.
  • When the model training management entity determines, based on the training status information, that the first model training entity is unable to independently perform all of the at least one model training task, it sends the multiple computing power resource query information to the multiple model training entities respectively.
  • The model training entity can return computing power resource information in response to the query information from the model training management entity. That is, the model training management entity can send query information to the model training entities only when there is a demand, such as when it determines that assisted training needs to be allocated, which saves transmission resources and yields real-time computing power resource information.
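The on-demand query mode can be sketched as follows. This is an illustration under stated assumptions: the overload test (`needs_assistance`, comparing summed task costs with idle capacity) is invented for the example, since the source only says the decision is based on the training status information, and `TrainingEntityStub` and `collect_compute_info` are hypothetical names.

```python
class TrainingEntityStub:
    """Stand-in for a model training entity that answers compute queries."""

    def __init__(self, entity_id, idle_units):
        self.entity_id = entity_id
        self.idle_units = idle_units
        self.queries_received = 0

    def on_compute_query(self):
        # Reply to computing power resource query information.
        self.queries_received += 1
        return self.idle_units


def needs_assistance(pending_task_costs, own_idle_units):
    """Assumed overload test: pending work exceeds the entity's idle capacity."""
    return sum(pending_task_costs) > own_idle_units


def collect_compute_info(entities, pending_task_costs, own_idle_units):
    """Query the other entities only when assistance is actually needed."""
    if not needs_assistance(pending_task_costs, own_idle_units):
        return {}  # no query is sent, so no transmission resources are used
    return {e.entity_id: e.on_compute_query() for e in entities}


others = [TrainingEntityStub("mte-2", 6.0), TrainingEntityStub("mte-3", 3.0)]
no_demand = collect_compute_info(others, pending_task_costs=[1.0, 1.0], own_idle_units=5.0)
on_demand = collect_compute_info(others, pending_task_costs=[4.0, 4.0], own_idle_units=5.0)
```

In the first call nothing is sent; in the second each entity is queried exactly once and replies with its current idle computing power.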
  • the model training management entity receives the training status information of the first model training entity, including: the model training management entity periodically receives the training status information from the first model training entity.
  • the periods at which multiple model training entities send training status information may be the same or different.
  • each model training entity can report its own training status information in a timely manner, so that the model training management entity can perform orchestration and management in a timely manner, thereby improving the efficiency of model training.
  • The method further includes: the model training management entity sends training task configuration notification information to the first model training entity, and the training task configuration notification information indicates that the first target model training entity will assist in performing the target model training task.
  • the method further includes: the model training management entity determines the target length of training data for the first target model training entity to assist in performing the target model training task; the model training management entity sends second training task configuration information to the first model training entity, and the second training task configuration information instructs the first target model training entity to use training data of the target length to assist in performing the target model training task.
  • the training status information also indicates the total length of the training data used to complete the target model training task, and the model training management entity determines the target length, including: the model training management entity determines the target length based on the total length and the computing power resource information of the first target model training entity.
  • the model training management entity can decompose the target model training task and use multiple model training entities to collaborate to complete the training task, which can reduce the training task burden of the original training subject of the target model training task, i.e., the first model training entity, make full use of the resources of multiple model training entities, and reduce the training waiting time of the training task.
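One way to picture the target-length determination is a proportional split of the training data. The source only states that the target length is based on the total length and the computing power resource information; splitting in proportion to idle computing power is an assumption made for this sketch, and `split_training_data` is a hypothetical name.

```python
def split_training_data(total_length, idle_units_by_entity):
    """Split `total_length` samples in proportion to idle computing power.

    Returns a dict mapping entity id to its target length. Integer shares are
    floored, and the last entity (in sorted order) receives the remainder so
    that the shares always sum to `total_length`.
    """
    total_idle = sum(idle_units_by_entity.values())
    if total_length < 0 or total_idle <= 0:
        raise ValueError("need a non-negative length and positive idle capacity")

    entities = sorted(idle_units_by_entity)  # deterministic order
    shares = {}
    assigned = 0
    for entity_id in entities[:-1]:
        share = int(total_length * idle_units_by_entity[entity_id] / total_idle)
        shares[entity_id] = share
        assigned += share
    shares[entities[-1]] = total_length - assigned
    return shares


# The first entity keeps a quarter of the data; the helper takes the rest.
shares = split_training_data(1000, {"mte-1": 1.0, "mte-2": 3.0})
print(shares)
```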
  • the method also includes: the model training management entity receives assisted training feedback information from the first target model training entity, the assisted training feedback information indicates at least one of the following: the accuracy achieved by the first target model training entity in performing the target model training task, the time spent by the first target model training entity in performing the target model training task, the execution progress of the first target model training entity in performing the target model training task, and the amount of resources occupied by the first target model training entity in performing the target model training task.
  • the method also includes: the model training management entity receives network status change information from the first target model training entity; the model training management entity determines to replace the first target model training entity with a second target model training entity among multiple model training entities based on the network status change information and multiple computing power resource information; the model training management entity sends third training task configuration information to the second target model training entity, and the third training task configuration information indicates to assist the first model training entity in performing the target model training task.
  • the model training entity can provide feedback on the execution status to the model training management entity while assisting in the execution of the model training task, so that the model training management entity can be informed of the execution status of the target model training task and can make timely adjustments to improve the reliability of model training.
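The feedback and replacement behaviour can be sketched as below. The optional fields of `AssistFeedback` mirror the "at least one of" items listed in the text; the field names, and the rule of replacing a failed helper with the most idle remaining entity, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssistFeedback:
    """Assisted training feedback; every field is optional ("at least one of")."""
    entity_id: str
    accuracy: Optional[float] = None
    elapsed_seconds: Optional[float] = None
    progress: Optional[float] = None
    resources_used: Optional[float] = None


def choose_replacement(idle_units_by_entity, failed_id, owner_id):
    """Pick a second target entity after a network status change.

    Excludes the failed helper and the task owner, then returns the remaining
    entity with the most idle computing power, or None if none is available.
    """
    candidates = {
        entity_id: idle
        for entity_id, idle in idle_units_by_entity.items()
        if entity_id not in (failed_id, owner_id) and idle > 0
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


idle = {"mte-1": 0.5, "mte-2": 6.0, "mte-3": 3.0, "mte-4": 4.0}
feedback = AssistFeedback("mte-2", progress=0.4)
replacement = choose_replacement(idle, failed_id="mte-2", owner_id="mte-1")
```

The manager would then send the third training task configuration information to the returned entity.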
  • The method also includes: the model training management entity obtains policy information, where the policy information indicates a method for determining the first target model training entity among the multiple model training entities based on the multiple computing power resource information, and/or indicates a method for determining the target length of the training data that the first target model training entity uses to assist in performing the target model training task, based on the total length of the training data used to complete the target model training task.
  • In a second aspect, a method for model training management is provided, which can be implemented by a model training entity or a chip in the model training entity, and the method includes: the model training entity sends training status information to a model training management entity, and the training status information indicates at least one model training task that the model training entity has; the model training entity sends computing power resource information to the model training management entity, and the computing power resource information indicates idle computing power resources that the model training entity has for model training; the model training entity receives training task information from the model training management entity, and the training task information indicates a first target model training entity that assists in executing a target model training task among the at least one model training task.
  • the model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and arrange the computing power resources of multiple model training entities, and assign the training tasks of the model training entity to other model training entities with sufficient computing power resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the model training entity and improving the efficiency of model training.
  • The model training entity sends computing power resource information to the model training management entity, including: the model training entity periodically sends the computing power resource information to the model training management entity; or, the model training entity receives computing power resource query information from the model training management entity, and the model training entity sends the computing power resource information to the model training management entity.
  • the model training entity sends training status information to the model training management entity, including: the model training entity periodically sends training status information to the model training management entity; or, the model training entity sends training status information to the model training management entity based on a trigger event.
  • the training task information further instructs the first target model training entity to use training data of a target length to assist in performing the target model training task.
  • the method also includes: the model training entity receives model training report information from the target model training entity, the model training report information indicates the sub-model obtained by completing the target model training task; and the model training entity performs model aggregation based on the sub-model.
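The aggregation step can be illustrated with a short sketch. The source does not say how sub-models are aggregated; the data-length-weighted average of parameter vectors used here (in the style of federated averaging) is a swapped-in assumption, and `aggregate_sub_models` is a hypothetical name.

```python
def aggregate_sub_models(sub_models):
    """Aggregate sub-models by a data-length-weighted average of parameters.

    `sub_models` is a list of (weights, data_length) pairs, where `weights`
    is a flat list of floats and all weight lists have the same size.
    """
    if not sub_models:
        raise ValueError("no sub-models to aggregate")
    total = sum(length for _, length in sub_models)
    if total <= 0:
        raise ValueError("total data length must be positive")
    size = len(sub_models[0][0])
    if any(len(weights) != size for weights, _ in sub_models):
        raise ValueError("sub-models must have the same number of parameters")

    return [
        sum(weights[i] * length for weights, length in sub_models) / total
        for i in range(size)
    ]


# Sub-model trained locally on 250 samples, helper's sub-model on 750 samples.
aggregated = aggregate_sub_models([([1.0, 2.0], 250), ([3.0, 6.0], 750)])
print(aggregated)
```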
  • The various implementations of the second aspect are methods of the first model training entity corresponding to the various implementations of the first aspect.
  • For the beneficial technical effects of the various implementations of the second aspect, reference can be made to the description of the relevant implementations of the first aspect, which will not be repeated here.
  • In a third aspect, a method for model training management is provided, which can be implemented by a model training entity or a chip in the model training entity, and the method includes: the model training entity sends computing power resource information to a model training management entity, and the computing power resource information indicates idle computing power resources of the model training entity for model training; the model training entity receives first training task configuration information from the model training management entity, and the first training task configuration information indicates assisting the first model training entity in performing a target model training task; the model training entity assists the first model training entity in performing the target model training task.
  • the model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and arrange the computing power resources of multiple model training entities, and assign the training tasks of the model training entity to other model training entities with sufficient computing power resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the model training entity and improving the efficiency of model training.
  • the model training entity sends computing power resource information to the model training management entity, including: the model training entity periodically sends computing power resource information to the model training management entity; or, the model training entity receives computing power resource query information from the model training management entity; the model training entity sends computing power resource information to the model training management entity.
  • The model training entity assists the first model training entity in performing the target model training task, including: the model training entity obtains target training data; the model training entity uses the target training data to assist the first model training entity in performing the target model training task.
  • the method also includes: the model training entity sends model training report information to the first model training entity, and the model training report information indicates the sub-model obtained by completing the target model training task.
  • the method also includes: the model training entity sends assistance training feedback information to the model training management entity, and the assistance training feedback information indicates at least one of the following: the accuracy achieved by the model training entity in executing the target model training task, the time spent by the model training entity in executing the target model training task, the execution progress of the model training entity in executing the target model training task, and the amount of resources occupied by the model training entity in executing the target model training task.
  • the method also includes: the model training entity sends network status change information to the model training management entity, and the network status change information indicates that the model training entity is unable to complete the target model training task.
  • The various implementations of the third aspect are methods of the first target model training entity corresponding to the various implementations of the first aspect.
  • For the beneficial technical effects of the various implementations of the third aspect, reference can be made to the description of the relevant implementations of the first aspect, which will not be repeated here.
  • In a fourth aspect, a method for model training management is provided, which can be applied to a communication system, wherein the communication system includes a model training management entity and multiple model training entities, and the method includes: a first model training entity sends training status information to the model training management entity, and the model training management entity receives the training status information of the first model training entity, where the training status information indicates at least one model training task of the first model training entity; the model training management entity obtains multiple computing power resource information, and the multiple computing power resource information respectively indicates the idle computing power resources for model training of the multiple model training entities; the model training management entity determines a first target model training entity among the multiple model training entities based on the multiple computing power resource information; the model training management entity sends first training task configuration information to the first target model training entity, and the first target model training entity receives the first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting the first model training entity in performing the target model training task; the first target model training entity assists the first model training entity in performing the target model training task.
  • the model training entity can report computing resource information and training status information to the model training management entity.
  • the model training management entity can manage and orchestrate the computing resources of multiple model training entities, and assign the training tasks of the model training entities to other model training entities with sufficient computing resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the model training entities and improving the efficiency of model training.
  • The various implementations of the fourth aspect are system methods corresponding to the various implementations of the first aspect.
  • For the beneficial technical effects of the various implementations of the fourth aspect, reference can be made to the description of the relevant implementations of the first aspect, which will not be repeated here.
  • In a fifth aspect, a communication device is provided, which includes a transceiver module and a processing module, wherein the transceiver module is used to receive training status information of a first model training entity, the training status information indicating at least one model training task of the first model training entity; the transceiver module is also used to obtain multiple computing power resource information, the multiple computing power resource information respectively indicating the idle computing power resources for model training of multiple model training entities; the processing module is used to determine a first target model training entity among the multiple model training entities based on the multiple computing power resource information; the transceiver module is also used to send first training task configuration information to the first target model training entity, the first training task configuration information indicating assisting the first model training entity in performing a target model training task among the at least one model training task.
  • the communication device described in the fifth aspect has the function of implementing the method in the first aspect, or any possible implementation of the first aspect.
  • the function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units corresponding to the above functions.
  • the various implementations of the fifth aspect are devices of the model training management entity corresponding to the various implementations of the first aspect.
  • For the beneficial technical effects of the various implementations of the fifth aspect, reference can be made to the description of the relevant implementations of the first aspect, which will not be repeated here.
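The division of labour between the transceiver module and the processing module in the fifth aspect can be sketched as follows. This is a structural illustration only: the class names, the message dictionary layout, and the "most idle entity" rule inside `determine_target` are all assumptions, and the transceiver merely records messages instead of using a real network interface.

```python
class TransceiverModule:
    """Records outbound messages; stands in for a real network interface."""

    def __init__(self):
        self.sent = []

    def send(self, destination, message):
        self.sent.append((destination, message))


class ProcessingModule:
    """Decides which entity should assist (assumed rule: most idle wins)."""

    def determine_target(self, idle_units_by_entity, owner_id):
        candidates = {e: u for e, u in idle_units_by_entity.items() if e != owner_id}
        return max(candidates, key=candidates.get) if candidates else None


class ManagementDevice:
    """Communication device composed of a transceiver and a processing module."""

    def __init__(self):
        self.transceiver = TransceiverModule()
        self.processing = ProcessingModule()

    def handle(self, owner_id, task_id, idle_units_by_entity):
        target = self.processing.determine_target(idle_units_by_entity, owner_id)
        if target is not None:
            # Hypothetical layout of the first training task configuration information.
            self.transceiver.send(
                target,
                {"type": "first_training_task_configuration",
                 "assist": owner_id,
                 "task": task_id},
            )
        return target


device = ManagementDevice()
target = device.handle("mte-1", "task-7", {"mte-1": 0.5, "mte-2": 6.0})
```

Keeping the decision logic in the processing module and all message handling in the transceiver matches the module split the text describes, whether the modules are realized in hardware or in software.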
  • In a sixth aspect, a communication device is provided, which includes a transceiver module and a processing module, wherein the processing module is used to generate training status information; the transceiver module is used to send the training status information to a model training management entity, the training status information indicating at least one model training task of the model training entity; the transceiver module is also used to send computing power resource information to the model training management entity, the computing power resource information indicating the idle computing power resources of the model training entity for model training; the transceiver module is also used to receive training task information from the model training management entity, the training task information indicating a first target model training entity that assists in executing a target model training task among the at least one model training task.
  • the communication device described in the sixth aspect has the function of implementing the method in the second aspect, or any possible implementation of the second aspect.
  • the function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units corresponding to the above functions.
  • the various implementations of the sixth aspect are devices of the first model training entity corresponding to the various implementations of the first aspect.
  • For the beneficial technical effects of the various implementations of the sixth aspect, reference can be made to the description of the relevant implementations of the first aspect, which will not be repeated here.
  • In a seventh aspect, a communication device is provided, which includes a transceiver module and a processing module.
  • The transceiver module is used to send computing power resource information to a model training management entity, and the computing power resource information indicates the idle computing power resources of the model training entity for model training.
  • The transceiver module is also used to receive first training task configuration information from the model training management entity, and the first training task configuration information indicates assisting the first model training entity in performing a target model training task.
  • The processing module is used to assist the first model training entity in performing the target model training task.
  • the communication device described in the seventh aspect has the function of implementing the method in the third aspect, or any possible implementation of the third aspect.
  • the function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units corresponding to the above functions.
  • the various implementation methods of the seventh aspect are devices of the first target model training entity corresponding to the various implementation methods of the first aspect.
  • For the beneficial technical effects of the various implementations of the seventh aspect, reference can be made to the description of the relevant implementations of the first aspect, which will not be repeated here.
  • A communication device is provided, comprising a processor and a memory.
  • A transceiver may also be included.
  • The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory and to control the transceiver to send and receive signals, so that the communication device performs the method in any of the first to fourth aspects, or any possible implementation of any of these aspects.
  • The communication device is a model training management function.
  • A communication device is provided, comprising a processor and a memory.
  • A transceiver may also be included.
  • The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory and to control the transceiver to send and receive signals, so that the communication device performs the method in any of the first to fourth aspects, or any possible implementation of any of these aspects.
  • The communication device is a model training function.
  • A communication device is provided, comprising a processor and a communication interface, wherein the communication interface is used to receive data and/or information and transmit the received data and/or information to the processor, the processor processes the data and/or information, and the communication interface is also used to output the data and/or information processed by the processor, so that the method in any of the first to fourth aspects, or any possible implementation of any of these aspects, is executed.
  • The communication device can be a chip applied to a model training management function.
  • A communication device is provided, comprising a processor and a communication interface, wherein the communication interface is used to receive data and/or information and transmit the received data and/or information to the processor, the processor processes the data and/or information, and the communication interface is also used to output the data and/or information processed by the processor, so that the method in any of the first to fourth aspects, or any possible implementation of any of these aspects, is executed.
  • The communication device can be a chip applied to a model training function.
  • A computer-readable storage medium is provided, which stores computer instructions; when the computer instructions are executed on a computer, the method in any of the first to fourth aspects, or any possible implementation of any of these aspects, is executed.
  • A computer program product is provided, comprising computer program code which, when executed on a computer, causes the method in any of the first to fourth aspects, or any possible implementation of any of these aspects, to be executed.
  • A wireless communication system is provided, comprising the communication device described in the fifth aspect, and/or the communication device described in the sixth aspect, and/or the communication device described in the seventh aspect.
  • FIG1 is a schematic structural diagram of a communication system applicable to an embodiment of the present application.
  • FIG2 is a schematic structural diagram of a first application scenario applicable to an embodiment of the present application.
  • FIG3 is a schematic structural diagram of a second application scenario applicable to the embodiment of the present application.
  • FIG4 is a schematic diagram of multiple model training entities performing model training tasks respectively;
  • FIG5 is a schematic flow chart of a method for model training management provided in an embodiment of the present application.
  • FIG6 is a schematic flow chart of a method for obtaining training status information and computing power resource information provided in an embodiment of the present application
  • FIG7 is a schematic flowchart of a method for overall training of a target model training task provided in an embodiment of the present application
  • FIG8 is a schematic flowchart of a method for decomposing target model training tasks provided in an embodiment of the present application.
  • FIG9 is a schematic flowchart of a method for providing feedback during the execution of a target model training task provided by an embodiment of the present application.
  • FIG10 is a schematic flow chart of an implementation method of two model training management entities performing model training management provided in an embodiment of the present application
  • FIG11 to FIG13 are schematic structural diagrams of possible devices provided in embodiments of the present application.
  • LTE long term evolution
  • LTE-A long term evolution-advanced
  • eLTE enhanced long term evolution
  • 5G fifth generation mobile communication system new radio
  • WiFi wireless fidelity
  • WiMAX worldwide interoperability for microwave access
  • 3GPP third generation partnership project
  • Inference model (also referred to as a model for short): a function learned from data that can implement a specific function or mapping. A model can be obtained based on artificial intelligence (AI) or machine learning (ML) technologies, and can therefore also be called an artificial intelligence (AI) model, a machine learning (ML) model, etc. Common approaches used to generate AI/ML models include supervised learning, unsupervised learning, and reinforcement learning; the corresponding models can be called supervised learning models, unsupervised learning models, and reinforcement learning models.
  • the supervised learning model can be a classification model, a prediction model, a regression model, etc.
  • the unsupervised learning model can be a clustering model.
  • the model can also be obtained based on neural network (NN) technology, which can also be called a neural network model, a deep learning model, etc.
  • NN neural network
  • the training entity of the inference model is called the model training entity.
  • the capability or function of the model training entity can be deployed on a certain network element, which is called the model training network element; the capability or function of the model training entity can also be deployed on other devices, which is not limited in the embodiments of the present application; for the convenience of description, the embodiments of the present application take the model training network element as an example for explanation, but they can all be replaced by other devices with the capability or function of training the inference model.
  • the model training management entity is used to manage and orchestrate the training tasks and computing resources of multiple model training entities.
  • the capabilities or functions of the model training management entity can be deployed on a certain network element, which is called the model training management network element; the capabilities or functions of the model training management entity can also be deployed on other devices, which is not limited in the embodiments of the present application; for the convenience of description, the embodiments of the present application take the model training management network element as an example for explanation, but they can all be replaced by other devices that manage the capabilities or functions of the model training network element.
  • Model reasoning entity: an entity that performs reasoning or prediction based on a model is called a model reasoning entity.
  • the capability or function of a model reasoning entity can be deployed on a certain network element, which is called a model reasoning network element; the capability or function of a model reasoning entity can also be deployed on other devices, which is not limited in the embodiments of the present application; for the convenience of description, the embodiments of the present application take the model reasoning network element as an example for explanation, but they can all be replaced by other devices with the capability or function of performing reasoning or prediction based on a model.
  • a model training task is the basic unit of work that a model training entity can divide when training a model.
  • Fig. 1 is a schematic structural diagram of a communication system applicable to an embodiment of the present application. First, the devices that may be involved in the communication system 100 are described.
  • Model training management network element 110 can be used to manage and arrange the training tasks and computing resources of multiple model training entities.
  • the model training management network element 110 can be deployed in a network management system (NMS) or in an element management system (EMS).
  • NMS network management system
  • EMS element management system
  • the model training management network element 110 can be connected to at least one model training network element.
  • the model training management network element 110 is connected to the model training network element 121 and the model training network element 122 respectively.
  • Model training network element 121 and model training network element 122 can be used to train the model.
  • the model training network element 121 can be deployed in the EMS, in the NMS, in network equipment in the radio access network (RAN) domain, or in a core network element in the core network domain, such as a network data analytics function (NWDAF) network element.
  • the model training network element 122 can also be deployed in the EMS, the NMS, network equipment or a core network element.
  • different model training network elements can be deployed in the same system, device or network element, for example, model training network element 121 and model training network element 122 can be deployed in the same network device; or, different model training network elements can be deployed in different systems, devices or network elements, for example, model training network element 121 and model training network element 122 are respectively deployed in different network devices, or model training network element 121 is deployed in a network device and model training network element 122 is deployed in a core network element, etc.
  • This application does not specifically limit this.
  • Model reasoning network element 130 can be used for reasoning or prediction based on the model.
  • Model reasoning network element 130 can be deployed in EMS, such as management data analytics function (MDAF) in EMS, or model reasoning network element 130 can also be deployed in network equipment in RAN domain, or core network element in core network domain, such as NWDAF network element.
  • MDAF management data analytics function
  • Figure 1 uses the model inference network element 130 and the model training network element 121 as an example to illustrate the connection.
  • the model inference network element 130 and the model training network element 121 can be deployed on different devices, such as the model training network element 121 can be deployed on the NMS, and the model inference network element 130 can be deployed on the network device; the model inference network element 130 and the model training network element 121 can also be deployed on the same device, such as the model inference network element 130 and the model training network element 121 can be deployed on the same network device or core network element. This application does not specifically limit this.
  • the communication system 100 may include multiple model training management network elements, for example, model training management network elements can be deployed separately in the NMS and in the EMS.
  • the communication system 100 can also include multiple model reasoning network elements, for example, the model training network element 122 can also be connected to a model reasoning network element. This application does not limit the number of model training network elements, model training management function network elements, and model reasoning network elements.
  • FIG2 is a schematic structural diagram of the first application scenario applicable to the embodiment of the present application.
  • NMS 210 is deployed with a model training management network element 211 and a model training network element 212.
  • NMS 210 can manage network device 220, network device 230, NWDAF 240, and NWDAF 250.
  • Model training network element 221 is deployed on network device 220
  • model training network element 231 is deployed on network device 230
  • model training network element 241 is deployed on NWDAF 240
  • model training network element 251 is deployed on NWDAF 250.
  • Model training management network element 211 and model training network element 221 can communicate through the communication interface between NMS 210 and network device 220.
  • Model training management network element 211 and model training network element 231 can communicate through the communication interface between NMS 210 and network device 230.
  • Model training management network element 211 and model training network element 241 can communicate through the communication interface between NMS 210 and NWDAF 240.
  • the model training management network element 211 and the model training network element 251 can communicate with each other through the communication interface between the NMS 210 and the NWDAF 250.
  • the model training management network element 211 and the model training network element 212 can communicate with each other through the internal interface in the NMS 210.
  • FIG3 is a schematic structural diagram of a second application scenario applicable to an embodiment of the present application.
  • NMS 310 is deployed with model training management network element 311 and model training network element 312, and EMS 320 is deployed with model training management network element 321 and model training network element 322.
  • NMS 310 can manage network device 330, network device 340, NWDAF 350 and NWDAF 360 through EMS 320.
  • Model training network element 331 is deployed on network device 330
  • model training network element 341 is deployed on network device 340
  • model training network element 351 is deployed on NWDAF 350
  • model training network element 361 is deployed on NWDAF 360.
  • the model training management network element 311 and the model training management network element 321 can jointly manage the model training network element 312, the model training network element 322, the model training network element 331, the model training network element 341, the model training network element 351, and the model training network element 361.
  • the model training management network element 321 in the EMS 320 can communicate with the model training network element 312, the model training network element 322, the model training network element 331, the model training network element 341, the model training network element 351, and the model training network element 361 respectively to obtain the information of each model training network element.
  • the model training management network element 311 in the NMS 310 can provide the model training management network element 321 with a strategy for analyzing information.
  • the model training management network element 311 and the model training management network element 321 can communicate through the interface between the NMS 310 and the EMS 320.
  • the model training management network element 321 and the model training network element 331 can communicate through the communication interface between the EMS 320 and the network device 330.
  • the model training management network element 321 and the model training network element 341 can communicate through the communication interface between the EMS 320 and the network device 340.
  • the model training management network element 321 and the model training network element 351 can communicate through the communication interface between the EMS 320 and the NWDAF 350.
  • the model training management network element 321 and the model training network element 361 can communicate through the communication interface between the EMS 320 and the NWDAF 360.
  • the model training management network element 311 and the model training network element 312 can communicate through the internal interface in the NMS 310.
  • the model training management network element 321 and the model training network element 322 can communicate through the internal interface in EMS 320.
  • the NMS can also manage multiple model training network elements through multiple EMSs, and this application does not specifically limit this.
  • the solution of the present application can be applied to other systems including corresponding entities, and the present application does not limit this.
  • the above-mentioned entity or function can be a network element in a hardware device, or a software function running on dedicated hardware, or a virtualization function instantiated on a platform (e.g., a cloud platform).
  • the above-mentioned entity or function can be implemented by one device, or by multiple devices, or a functional module in a device, and the embodiments of the present application do not specifically limit this.
  • model training network elements can be deployed in the communication system, and each model training network element can have multiple training tasks.
  • model training tasks of different model training network elements will also be different.
  • Figure 4 is a schematic diagram of multiple model training entities executing model training tasks respectively.
  • Figure 4 shows three model training network elements that execute model training tasks respectively. Since the network environments and network requirements of different model training entities may be different and dynamically changing, the model training tasks of different model training entities will also be different.
  • the training tasks of model training network element #1 are overloaded, and some training tasks are in the queue.
  • the training tasks of model training network element #3 are also at full load, and there are no idle computing resources.
  • Model training network element #2 still has a large amount of idle computing resources that are not used. It can be seen that, within a certain period of time, the distribution of model training tasks across multiple model training network elements is unbalanced. Some model training tasks cannot be executed in a timely manner, while some computing resources sit idle and unused, resulting in low overall efficiency of model training in the communication system.
  • the present application proposes a method, device and system for model training management, which can improve the efficiency of model training.
  • the method for model training management is first described below in conjunction with Figure 5.
  • model training entity #1 may be an example of a first model training entity that is assisted in a target model training task
  • model training entity #2 may be an example of a target model training entity that assists in performing a target model training task.
  • the present application does not specifically limit the number of model training entities, and illustratively, the number of model training entities #2 may be one or more, that is, one or more target model training entities may assist in performing a target model training task.
  • the model training management entity can be any model training management entity described in Figures 1 to 3, and model training entity #1 and model training entity #2 can be any model training entities described in Figures 1 to 3, and this application does not make any special limitations on this.
  • FIG5 is a schematic flowchart of a model training management method provided in an embodiment of the present application.
  • model training entity #1 sends training status information to the model training management entity
  • model training management entity receives the training status information from model training entity #1.
  • the training status information indicates that model training entity #1 has at least one model training task.
  • the training status information includes at least one of the following information: model training task identification information, priority information of the model training task, process information of the model training task, and performance information of the model training task.
  • the model training task identification information indicates the training identification of at least one model training task that the model training entity #1 has. For example, if the model training entity #1 has three model training tasks 1-3, then the model training task identification information may include the identifications of the three model training tasks.
  • the priority information of the model training task indicates the priority of at least one model training task that the model training entity #1 has.
  • the priority information may respectively indicate the priority of each model training task in at least one model training task that the model training entity #1 has, for example, the priority information indicates that the priority of model training task 1 is high, and the priority of model training tasks 2 and 3 is low.
  • Priority can be expressed as high, medium, or low, or as a number (1, 2, 3, etc.), the smaller the number, the higher the priority.
  • the priority information may also indicate the number of model training tasks with high priority that the model training entity #1 has, for example, the priority information indicates that the model training entity #1 has 1 model training task with high priority. It should be noted that the present application does not impose any limitation on the setting of the priority of the model training task, for example, the priority may be determined based on the chronological order of the execution of the request, or based on the importance of the model training task, etc.
  • the process information of the model training task indicates the process of model training entity #1 performing model training.
  • the process information may indicate the process status of each model training task in at least one model training task that model training entity #1 has.
  • the process status may include waiting to run, running, and completed running.
  • the process information indicates that model training task 1 has completed running, model training task 2 is running, and model training task 3 has completed running.
  • the process information may also indicate the overall progress of model training entity #1 in performing model training, such as the number of model training tasks in model training entity #1 that have not been completed (i.e., are running or waiting to be run); for example, the process information indicates that there are still two model training tasks in model training entity #1 that have not been completed.
  • the performance information of the model training task indicates the performance of the model training entity #1 in performing the model training.
  • the performance information may indicate at least one of the following: the time required for a single training of the model training task, the computing resources required for a single training of the model training task, the number of trainings required for the model training task, the average time required for each training when the model training task performs multiple trainings, the average computing resources required for each training when the model training task performs multiple trainings, the total time required for multiple trainings of the model training task, the total computing resources required for multiple trainings of the model training task, the average training time for executing multiple model training tasks, and the average computing resources required for executing multiple model training tasks.
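  • the fields of the training status information described above can be pictured as a simple record. The following is only a minimal sketch under stated assumptions: the class names `ProcessStatus`, `TaskStatus` and `TrainingStatusInfo`, the field names, and the example task identifiers are hypothetical and are not defined by the present application.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ProcessStatus(Enum):
    """Process states named in the description (hypothetical encoding)."""
    WAITING = "waiting to run"
    RUNNING = "running"
    COMPLETED = "completed running"


@dataclass
class TaskStatus:
    task_id: str                                     # model training task identification
    priority: int                                    # smaller number = higher priority
    process: ProcessStatus                           # process state of this task
    single_training_time_s: Optional[float] = None   # one optional performance item


@dataclass
class TrainingStatusInfo:
    entity_id: str
    tasks: List[TaskStatus] = field(default_factory=list)

    def unfinished_count(self) -> int:
        # Aggregate form of the process information: tasks running or waiting.
        return sum(1 for t in self.tasks if t.process is not ProcessStatus.COMPLETED)


# Illustrative values only.
status = TrainingStatusInfo(
    entity_id="model training entity #1",
    tasks=[
        TaskStatus("task-1", priority=1, process=ProcessStatus.COMPLETED),
        TaskStatus("task-2", priority=3, process=ProcessStatus.RUNNING),
        TaskStatus("task-3", priority=3, process=ProcessStatus.WAITING),
    ],
)
```

  • in this sketch the per-task form and the aggregate form of the process information are both derivable from the same record; an actual message may carry only one of them.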
  • the training status information further indicates that the model training entity #1 requests collaborative training, that is, the model training entity #1 can determine whether to request collaborative training.
  • collaborative training may mean that model training entity #1 does not perform a certain model training task alone.
  • the model training task may be collaboratively performed by multiple model training entities.
  • the training task may be collaboratively performed by other model training entities (such as model training entity #2).
  • model training entity #1 determines whether to request collaborative training according to whether its idle computing resources can meet the requirements of the training task. If the requirements can be met, model training entity #1 determines not to request collaborative training. If they cannot be met, model training entity #1 requests collaborative training.
  • model training entity #1 determines to request collaborative training.
  • model training entity #1 determines whether collaborative training is required based on the priority of the model training task.
  • model training entity #1 may have multiple training tasks waiting to be run, and the training tasks with higher priorities among them will be run before the training tasks with lower priorities. If model training entity #1 has a training task with a higher priority than the newly added training task, and the idle computing resources are not enough to complete all the training tasks waiting to be run (that is, after the resources required by the higher-priority training tasks are deducted from the idle computing resources, the remaining computing resources cannot meet the needs of the lower-priority training tasks, such as the newly added training task), then model training entity #1 can determine to request collaborative training.
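  • the priority-based check can be sketched as follows. This is one possible reading only: the function name `should_request_collaboration`, the `(priority, required_compute)` tuple layout, and the use of a single scalar for computing resources are assumptions made for illustration, with a smaller priority number meaning a higher priority.

```python
from typing import List, Tuple

Task = Tuple[int, float]  # (priority, required_compute); smaller number = higher priority


def should_request_collaboration(idle_compute: float,
                                 waiting_tasks: List[Task],
                                 new_task: Task) -> bool:
    """Return True if the newly added task cannot be served locally."""
    new_priority, new_required = new_task
    # Resources that higher-priority waiting tasks will consume first.
    reserved = sum(req for prio, req in waiting_tasks if prio < new_priority)
    remaining = idle_compute - reserved
    # Collaboration is requested when what is left cannot cover the new task.
    return remaining < new_required
```

  • with no higher-priority task waiting, the check reduces to comparing the idle computing resources directly with the requirement of the training task.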
  • the training status information also indicates the target model training task for which collaborative training is requested, that is, the model training entity #1 may determine the target model training task for which collaborative training is requested.
  • the target model training task for requesting collaborative training can be any one or more of the at least one model training task in model training entity #1, that is, the target model training task can be a newly added model training task or other model training task, and this application does not specifically limit this.
  • the model training entity #1 determines a target model training task according to a training process of at least one training task. Exemplarily, the model training entity #1 determines a model training task that is waiting to be run as the target model training task.
  • model training entity #1 determines the target model training task according to the priority of at least one model training task.
  • model training entity #1 determines a model training task with a low priority as a target model training task.
  • model training entity #1 can set a priority value for each model training task, such as values 1 to 10 representing different priorities from high priority to low priority. If the priority value of the model training task is greater than or equal to a specific threshold (for example, 5), then the model training task is determined as the target model training task.
  • the specific threshold can be pre-configured or dynamically determined (for example, determined according to computing power or current network status), and this application does not specifically limit this.
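  • the threshold rule above can be sketched in a few lines. This is a minimal sketch: the function name `select_target_tasks` and the dictionary layout are hypothetical, and the default threshold of 5 simply repeats the example value given above.

```python
from typing import Dict, List


def select_target_tasks(priority_values: Dict[str, int], threshold: int = 5) -> List[str]:
    """Pick tasks whose priority value (1 = highest, 10 = lowest) is at or above the threshold."""
    return [task_id for task_id, value in priority_values.items() if value >= threshold]


# Illustrative values only.
targets = select_target_tasks({"task-1": 1, "task-2": 5, "task-3": 9})
```

  • because a larger value means a lower priority, the tasks selected here are the low-priority ones, which are offered for collaborative training while the high-priority ones stay local.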
  • model training entity #1 can directly indicate the target model training task for which collaborative training is requested.
  • the identifier of the model training task for which collaborative training is requested can be carried in the training status information, such as carrying the identifier of model training task 2, indicating that model training task 2 is recommended for collaborative training.
  • model training entity #1 may also indirectly indicate a target model training task that is recommended for collaborative training.
  • the training status information may indicate that a training task whose training process is in a waiting state is to be trained collaboratively, or may indicate that a training task whose training process is in a waiting state and a low priority is to be trained collaboratively.
  • model training entity #1 and the model training management entity may also pre-configure or negotiate in advance how to select training tasks for collaborative training, and this application does not specifically limit this.
  • if the training status information includes identification information of the target model training task, the identification information of the target model training task can implicitly request collaborative training.
  • the target model training task is the model training task for collaborative training recommended by model training entity #1.
  • the model training management entity can change the target model training task of collaborative training based on the actual network status and computing power resources of other model training entities. This application does not specifically limit this.
  • model training entity #1 may periodically send training status information to the model training management entity.
  • multiple model training entities may be configured to periodically report training status information, wherein the reporting period may be pre-set or configured by the model training management entity, and this application does not specifically limit this.
  • model training entity #1 may send training status information to the model training management entity based on a trigger.
  • the model training entity may send training status information to the model training management entity when a new model training task is added.
  • the training status information may only include identification information, priority information, and other information of the newly added model training task.
  • the model training management entity obtains multiple computing resource information.
  • the multiple computing power resource information respectively indicate the idle computing power resources used for model training possessed by the multiple model training entities.
  • the model training management entity receives computing resource information #1 from model training entity #1, where computing resource information #1 indicates the idle computing resources for model training possessed by model training entity #1.
  • the model training management entity receives computing resource information #2 from model training entity #2, where computing resource information #2 indicates the idle computing resources for model training possessed by model training entity #2.
  • the computing power resource information may include at least one of the following: hardware resource information, resource usage information.
  • the hardware resource information may indicate the performance of the hardware resource, for example, the hardware resource information may indicate at least one of the following: the type of hardware resource, the number of cores of the hardware resource, and the processing frequency of the hardware resource.
  • the hardware resource information may also indicate the quantified computing power, such as the number of floating-point operations per second (FLOPS).
  • FLOPS floating-point operations per second
  • the model training management function may determine the computing performance of the hardware resources through the above information.
  • the resource usage information may indicate the utilization rate of the hardware resources, for example, the resource usage information indicates the idle computing power of the hardware resource, the computing power that has been used, or the computing power that can still be supported for model training.
  • the model training management function may obtain the computing power that the model training entity can use for model training.
  • the above-mentioned hardware resources may include a processor, a memory, etc.
  • the processor may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processing unit (NPU).
  • the memory may be a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media that can store program codes, and this application does not specifically limit this.
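  • the computing power resource information described above can be pictured as a record combining hardware resource information and resource usage information. The following is a minimal sketch under stated assumptions: the class name `ComputeResourceInfo`, its field names, and the idea of deriving idle computing power as total minus used are illustrative and are not prescribed by the present application.

```python
from dataclasses import dataclass


@dataclass
class ComputeResourceInfo:
    entity_id: str
    hardware_type: str    # e.g. "CPU", "GPU" or "NPU"
    cores: int            # number of cores of the hardware resource
    total_flops: float    # quantified computing power (FLOPS)
    used_flops: float     # computing power already in use

    def idle_flops(self) -> float:
        # Idle computing power still available for model training (assumed: total - used).
        return max(self.total_flops - self.used_flops, 0.0)


# Illustrative values only.
info = ComputeResourceInfo("model training entity #2", "GPU", 8, 1.0e12, 2.5e11)
```

  • a report could equally carry the idle computing power directly instead of the total and used values; the sketch only shows that either form lets the management side obtain the computing power usable for model training.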
  • multiple model training entities may periodically send computing power resource information to the model training management entity respectively, for example, multiple model training entities may be configured to periodically report computing power resource information, wherein the reporting period may be pre-set or configured by the model training management entity, and this application does not specifically limit this.
  • the model training entity sends computing power resource information to the model training management entity based on a request, for example, the model training management entity sends computing power resource query information to the model training entity, and then the model training entity sends computing power resource information to the model training management entity.
  • the model training management entity sends multiple computing power resource query information to multiple model training entities respectively, and the model training management entity receives multiple computing power resource information from multiple model training entities respectively. That is, the model training entity can return computing power resource information based on the query information of the model training management entity.
  • when the model training management entity determines, based on the training status information, that the first model training entity is unable to independently perform all of the at least one model training task, it sends multiple computing power resource query information to multiple model training entities respectively.
  • the model training entity can return computing power resource information based on the query information of the model training management entity, that is, the model training management entity can send query information to the model training entity when there is a demand, such as when it is determined to allocate assisted training, which can save transmission resources.
  • the model training management entity determines at least one model training entity #2 among multiple model training entities based on multiple computing resource information.
  • the at least one model training entity #2 is used to collaboratively perform the target model training task.
  • if the training status information in step S501 indicates that the model training entity #1 requests collaborative training for the target model training task, then the model training management entity can directly determine the model training entity #2 among the multiple model training entities based on the information of the target model training task and the multiple computing power resource information. If the training status information in step S501 indicates that the model training entity #1 requests collaborative training, but does not indicate the target model training task, then before the model training management entity determines the model training entity #2, the model training management entity determines whether to configure collaborative training for the model training entity #1 based on the training status information and the multiple computing power resource information.
  • if the training status information in step S501 does not indicate that the model training entity #1 requests collaborative training, nor does it indicate the target model training task, then before the model training management entity determines the model training entity #2, the model training management entity determines whether to configure collaborative training for the model training entity #1 based on the training status information and the multiple computing power resource information, and determines the target model training task if it determines to configure collaborative training for the model training entity #1.
  • the following describes the determination of whether the model training management entity configures collaborative training for the model training entity #1 and the determination of the target model training task.
  • the model training management entity determines whether to configure collaborative training for model training entity #1 based on the computing power resource information and training status information reported by model training entity #1. In this implementation, the model training management entity determines whether to configure collaborative training for model training entity #1 in a similar way to how model training entity #1 determines whether to request collaborative training, such as based on idle computing power resources or the priority of the model training task. For a detailed description, please refer to step S501 above, which will not be repeated here.
  • the model training management entity determines whether to configure collaborative training for model training entity #1 based on multiple computing resource information and training status information reported by multiple model training entities.
  • for example, the model training management entity determines the model training tasks that can be processed based on the idle computing power of the multiple model training entities indicated by the multiple computing power resource information. If the idle computing power cannot support assisting in the execution of the model training tasks indicated by the training status information, then the model training management entity determines not to configure collaborative training for model training entity #1. If the idle computing power can support assisting in the execution of the model training tasks indicated by the training status information, then the model training management entity determines to configure collaborative training for model training entity #1.
  • the target model training task is determined.
  • the model training management entity determines the target model training task based on the computing resource information and training status information reported by model training entity #1.
  • the model training management entity determines the target model training task in a similar way to how model training entity #1 determines the target model training task, for example, it can be determined based on the training process or priority.
  • for a detailed description, please refer to step S501 above, which will not be repeated here.
  • the model training management entity determines the target model training task based on multiple computing power resource information and training status information reported by multiple model training entities.
  • the model training management entity can determine candidate model training tasks for collaborative training based on the training status information, such as taking all model training tasks that are in a waiting state as candidate model training tasks for collaborative training, and the model training management entity then determines the target model training task from the selected model training tasks based on the idle computing power of multiple model training entities indicated by the multiple computing power resource information, such as determining the model training task that can be processed by the idle computing power of multiple model training entities as the target model training task.
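  • The two-stage selection described above (waiting tasks as candidates, then a check against idle computing power) can be sketched as follows. This is a minimal illustration, not the method of this application: the `TrainingTask` fields, the state names, and the rule of comparing a task's requirement with the summed idle computing power are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrainingTask:
    task_id: str
    state: str             # assumed state names: "running" or "waiting"
    required_flops: float  # assumed per-task computing power requirement (FLOPS)


def select_target_task(tasks: List[TrainingTask],
                       idle_flops: Dict[str, float]) -> Optional[TrainingTask]:
    """Pick the first waiting task that the reported idle computing power can process."""
    # Stage 1: every task in the waiting state is a candidate for collaborative training.
    candidates = [t for t in tasks if t.state == "waiting"]
    # Stage 2: compare each candidate with the total idle computing power reported
    # by the model training entities (an assumed feasibility rule).
    total_idle = sum(idle_flops.values())
    for task in candidates:
        if task.required_flops <= total_idle:
            return task
    return None
```

  • If no waiting task fits the reported idle computing power, the sketch returns `None`, which corresponds to not configuring collaborative training.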
  • This application does not specifically limit the specific method for determining the model training task.
  • the model training management entity can also use the following two methods to implement collaborative training of the target model training task, where Method 1 can be regarded as overall collaborative training, that is, the target model training task is configured to model training entities other than model training entity #1 for training, and Method 2 can be regarded as decomposed collaborative training, such as decomposing the target model training task into multiple target model training subtasks, each target model training subtask corresponds to training data of a target length, and multiple model training entities use the training data of the target length for collaborative training, or, for example, decomposing the target model into multiple target sub-models, each target sub-model can be trained independently, and the training of the target sub-model can be understood as a target model training sub-task.
  • the model training management entity can pre-set which processing method to use, or the model training management entity can also determine which processing method to use based on the target model training task, computing power resource information or the capabilities of the model training entity. For example, if there is a model training entity with idle computing power resources that can support the execution of the target model training task, then the model training management entity can be configured in method 1, that is, the model training entity can execute the target model training task alone; if there is no model training entity with idle computing power resources that can support the execution of the target model training task alone, then the model training management entity can be configured in method 2, that is, decomposing the target model training task into multiple target model training subtasks, and selecting multiple target model training entities to execute multiple target model training subtasks respectively.
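  • The choice between Method 1 and Method 2 can be sketched as follows. The function names, the use of integer GFLOPS units, and the proportional split of training data length are assumptions for illustration; this application does not prescribe how the target length of each subtask is chosen.

```python
from typing import Dict


def choose_method(required_gflops: int, idle_gflops: Dict[str, int]) -> str:
    """Return "method_1" if a single entity can run the task alone, else "method_2"."""
    if any(idle >= required_gflops for idle in idle_gflops.values()):
        return "method_1"
    return "method_2"


def split_data_length(total_samples: int, idle_gflops: Dict[str, int]) -> Dict[str, int]:
    """Assumed rule for Method 2: target data length proportional to idle computing power."""
    total_idle = sum(idle_gflops.values())
    if total_idle <= 0:
        raise ValueError("no idle computing power available")
    # Integer floor share for each entity.
    shares = {e: total_samples * idle // total_idle for e, idle in idle_gflops.items()}
    # Hand out the samples lost to rounding, starting with the entities that have the most idle power.
    leftover = total_samples - sum(shares.values())
    for e in sorted(idle_gflops, key=idle_gflops.get, reverse=True)[:leftover]:
        shares[e] += 1
    return shares
```

  • For example, with 1000 training samples and idle computing power of 300 and 100 GFLOPS, the sketch assigns target lengths of 750 and 250 samples.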
  • the manner in which the model training management entity determines one or more target model training entities (i.e., model training entity #2 in the embodiment of the present application) is similar. For ease of description, the following explains the manner in which one model training entity #2 is determined.
  • the model training management entity estimates the computing power requirement of the target model training task, and the computing power requirement of the target model training task can be used to determine the model training entity #2.
  • the model training management entity determines the computing power requirement of the target model training task according to the performance requirement of the target model training task.
  • the performance requirement may include at least one of the following: the training accuracy of the target model training task, the training duration of the target model training task.
  • the computing power requirement may include at least one of the following: the total floating point operations (FLOPs) required to train the target model training task, and the floating point operations per second (FLOPS) of the target model training task.
  • the model training management entity can determine the total number of floating-point operations required to achieve the training accuracy of the target model training task, and then determine the FLOPS based on the total number of floating-point operations and the training duration. For example: for a model training task for a model used for channel estimation, the amount of computation required to achieve a training accuracy of 90% for the model training task is estimated to be 0.5TFLOPs (number of operations per layer * number of layers * number of iterations). If the training duration expected to complete the training is 1s, then the model training management entity can determine the computing power requirement to be 0.5TFLOPS.
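  • The arithmetic in this example can be written out as a short sketch. The per-layer operation count, layer count, and iteration count below are hypothetical values chosen only so that their product equals the 0.5 TFLOPs stated in the example.

```python
def total_operations(ops_per_layer: float, num_layers: int, num_iterations: int) -> float:
    """Total floating-point operations (FLOPs): operations per layer * layers * iterations."""
    return ops_per_layer * num_layers * num_iterations


def required_flops_per_second(total_ops: float, training_duration_s: float) -> float:
    """Computing power requirement (FLOPS) needed to finish within the expected duration."""
    if training_duration_s <= 0:
        raise ValueError("training duration must be positive")
    return total_ops / training_duration_s


# Hypothetical breakdown: 5e8 operations per layer, 10 layers, 100 iterations = 0.5 TFLOPs.
total = total_operations(5e8, 10, 100)
requirement = required_flops_per_second(total, 1.0)  # 0.5 TFLOPS for a 1 s duration
```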
  • the model training management entity determines the model training entity #2 based on the computing power requirements of the target model training task and multiple computing power resource information.
  • the model training management entity may determine model training entity #2 based on computing power resource information of each model training entity among multiple model training entities.
  • the model training management entity obtains hardware resource information and resource usage information of multiple model training entities, determines at least one model training entity with idle computing power based on the resource usage information, and then determines model training entity #2 from the at least one model training entity that can support the computing power requirements of the target model training task based on the hardware resource information.
  • the model training management entity may determine the model training entity that is closest to model training entity #1 as model training entity #2 that performs collaborative training on the target model training task.
  • the model training entity that is closest to model training entity #1 may refer to the model training entity at the shortest physical distance from model training entity #1, or may be the model training entity that is separated from model training entity #1 by the fewest transmission nodes, and this application does not specifically limit this.
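  • The selection of model training entity #2 described above can be sketched as a filter followed by a nearest-entity choice. The `EntityInfo` fields and the use of a hop count as the closeness measure are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityInfo:
    entity_id: str
    idle_flops: float       # idle computing power derived from resource usage information
    hops_to_requester: int  # assumed closeness measure: transmission nodes to entity #1


def select_assistant(entities: List[EntityInfo], required_flops: float) -> Optional[EntityInfo]:
    """Among entities whose idle computing power meets the requirement, pick the closest one."""
    capable = [e for e in entities if e.idle_flops >= required_flops]
    if not capable:
        return None
    return min(capable, key=lambda e: e.hops_to_requester)
```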
  • the model training management entity sends first training task configuration information to model training entity #2;
  • model training entity #2 receives the first training task configuration information from the model training management entity.
  • the first training task configuration information instructs model training entity #2 to assist model training entity #1 in performing a target model training task in at least one model training task.
  • the first training task configuration information indicates an identifier of the target model training task.
  • the model training entity #2 can obtain the model to be trained and the training data based on the identifier of the target model training task.
  • the first training task configuration information further indicates target training data of the target model training task.
  • the model training management entity may also indicate the information about the target training data to model training entity #2.
  • the first training task configuration information further indicates an identifier of the model training entity #1.
  • in this way, the model training entity #2 can perform the collaborative training knowing that it is assisting the model training entity #1.
  • the model training management entity sends training task information to the model training entity #1;
  • model training entity #1 receives training task information from the model training management entity.
  • the training task information can also be called training task configuration notification information, which indicates that model training entity #2 assists in performing the target model training task. That is, the training task configuration notification information indicates to model training entity #1 that model training entity #2 assists in performing the target model training task.
  • the training task configuration notification information may indicate the identifier of the model training entity #2. If the target model training task is determined by the model training management entity, the training task configuration notification information may also indicate the identifier of the target model training task.
  • the training task information can also be called second training task configuration information, which instructs model training entity #2 to use training data of target length to assist in executing the target model training task.
  • the second training task configuration information may indicate the identifier of the model training entity #2 and the target length of the training data used by the model training entity #2 for collaborative training. If the target model training task is determined by the model training management entity, the second training task configuration information may also indicate the identifier of the target model training task.
  • model training entity #2 obtains training data information and assists in executing the target model training task.
  • the training data information indicates the training data used by model training entity #2 to assist in performing the target model training task.
  • the model training entity #2 obtains training data information from the model training management entity, for example, the first training task configuration information also indicates target training data of the target model training task.
  • model training entity #2 may obtain training data information from model training entity #1.
  • model training entity #1 may obtain the identifier of model training entity #2 based on the received training task information, and model training entity #1 actively sends training data information to model training entity #2.
  • the training data information also indicates the identifier of the target model training task and the data address of the training data.
  • the data address of the training data may be the address of the device storing the training data, and then the model training entity #2 may download the training data from the device according to the data address of the training data.
  • the training data may be stored in the model training entity #1, or in the device deployed by the model training entity #1.
  • the data address of the training data may be the collection address of the training data, and then the model training entity #2 may collect data according to the data address, and the collected data may be used for training the target model training task.
  • the training data can be stored in the database.
  • the training data information may indicate the identifier or address of the DCCF or ADRF
  • the model training entity #2 may send information for requesting training data (e.g., Ndccf_DataManagement_Subscribe or Ndccf_DataManagement_Fetch or Nadrf_DataManagement_RetrievalRequest or Nadrf_DataManagement_RetrievalSubscribe) to the DCCF or ADRF, and the DCCF or ADRF sends the training data to the model training entity #2.
  • the training data information also indicates a target length
  • the model training entity #2 can then obtain training data of the target length for training.
  • Model training entity #2 can use training data to train the target model training task. This application does not specifically limit the method of training the target model training task.
  • model training entity #2 can use batch gradient descent, mini-batch gradient descent, or stochastic gradient descent.
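  • The three gradient descent variants differ only in how many samples are used per update. The following sketch fits a one-parameter linear model to show this; the model, learning rate, and epoch count are assumptions for illustration, and shuffling between epochs is omitted for brevity.

```python
from typing import Sequence


def train_linear(xs: Sequence[float], ys: Sequence[float], batch_size: int,
                 lr: float = 0.01, epochs: int = 200) -> float:
    """Fit y ~ w * x by gradient descent on the mean squared error.

    batch_size == len(xs) gives batch gradient descent, batch_size == 1 gives
    stochastic gradient descent, and values in between give mini-batch gradient descent.
    """
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
            w -= lr * grad
    return w


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
w_batch = train_linear(xs, ys, batch_size=4)  # batch gradient descent
w_mini = train_linear(xs, ys, batch_size=2)   # mini-batch gradient descent
w_sgd = train_linear(xs, ys, batch_size=1)    # stochastic gradient descent
```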
  • the model training entity #2 may send assistance training feedback information to the model training management entity;
  • model training management entity receives the assisted training feedback information from model training entity #2.
  • model training entity #2 may send assistance training feedback information to the model training management entity.
  • the assistance training feedback information may indicate at least one of the following: the accuracy achieved by model training entity #2 in performing the target model training task, the time spent by model training entity #2 in performing the target model training task, the execution progress of model training entity #2 in performing the target model training task, and the amount of resources occupied by model training entity #2 in performing the target model training task.
  • model training entity #2 may periodically send assistance training feedback information to the model training management entity.
  • the period may be pre-set or configured by the model training management entity, and this application does not specifically limit this.
  • model training entity #2 sends network status change information to the model training management entity, and the network status change information indicates that the network status of model training entity #2 has changed and that it can no longer assist in the execution of the target model training task.
  • the network status change information indicates that model training entity #2 has entered the energy-saving state, that is, model training entity #2 will be shut down to save energy.
  • the network status change information indicates that the resources used for model training by model training entity #2 are occupied, that is, the computing power resources with which model training entity #2 assists in the execution of the target model training task are insufficient.
  • the model training management entity can then change the model training entity that assists the first model training entity in executing the target model training task based on the network status information.
  • the model training management entity can select other model training entities to assist in training. In order to facilitate understanding of the embodiments of the present application, this is described in more detail below in conjunction with Figure 9.
  • the network status change information can be the assisted training feedback information sent on a periodic basis after the network status changes.
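  • The replacement of an assisting entity after a network status change can be sketched as follows. The rule of preferring the remaining entity with the most idle computing power is an assumption for this example; this application leaves the selection rule open.

```python
from typing import Dict, Optional, Set


def replace_assistant(unavailable: Set[str], idle_flops: Dict[str, float],
                      required_flops: float) -> Optional[str]:
    """Pick a new assisting entity, skipping entities that reported a status change."""
    candidates = {
        entity: idle
        for entity, idle in idle_flops.items()
        if entity not in unavailable and idle >= required_flops
    }
    if not candidates:
        return None
    # Assumed tie-breaking rule: prefer the entity with the most idle computing power.
    return max(candidates, key=candidates.get)
```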
  • model training entity #2 sends model training report information
  • after model training entity #2 completes the training of the target model training task, a trained model can be generated, and then the model training report information can include the trained model.
  • the training report information also includes performance information of the model after training, such as the accuracy of training, the time taken for training, etc.
  • if model training entity #2 performs the target model training task alone, then model training entity #2 can send model training report information to model reasoning entity #1, where model reasoning entity #1 can be the model reasoning entity that requests execution of the target model training task. For example, model training entity #1 and model reasoning entity #1 are entities deployed on the same network device, or model training entity #1 and model reasoning entity #1 are deployed on different NWDAFs respectively.
  • if model training entity #2 performs a target model training subtask decomposed from the target model training task, then model training entity #2 can send model training report information to model training entity #1, and then model training entity #1 can perform model aggregation according to the trained model in the model training report information.
  • model training entity #2 can send model training report information to model training entity #1, or send model training report information to model reasoning entity #1.
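  • Model aggregation by model training entity #1 can be sketched as follows. This application does not specify an aggregation rule; the example assumes that each trained model is reported as a flat list of parameters and that the parameters are averaged with weights equal to the training data length of each subtask, in the style of federated averaging.

```python
from typing import List, Sequence


def aggregate_models(models: Sequence[Sequence[float]],
                     data_lengths: Sequence[int]) -> List[float]:
    """Average model parameters, weighting each model by its training data length."""
    if not models or len(models) != len(data_lengths):
        raise ValueError("need one data length per reported model")
    total = sum(data_lengths)
    if total <= 0:
        raise ValueError("total data length must be positive")
    n_params = len(models[0])
    return [
        sum(model[i] * length for model, length in zip(models, data_lengths)) / total
        for i in range(n_params)
    ]
```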
  • the model training report information can be Nnwdaf_MLModelProvision_Notify or Nnwdaf_MLModelInfo_Request response.
  • the model training management entity can manage and orchestrate the training tasks and computing resources of multiple model training entities, and assign the training tasks of model training entity #1 to other model training entities with sufficient resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of model training entity #1 and improving the efficiency of model training.
  • FIG. 6 is a schematic flowchart of a method for obtaining training status information and computing resource information provided in an embodiment of the present application.
  • the model training management entity sends computing resource subscription information #1 to the model training entity #1;
  • model training entity #1 receives computing power resource subscription information #1 from the model training management entity.
  • the computing power resource subscription information #1 is used to subscribe to the computing power resource information #1 of the model training entity #1.
  • For the content of the computing power resource information please refer to the introduction of step S502 in Figure 5, which will not be repeated here.
  • the computing power resource subscription information #1 also indicates the period for the model training entity #1 to report the computing power resource information #1.
  • the model training management entity may determine the period according to the average time taken by the model training entity to perform the training task.
  • the model training entity #1 sends computing resource information #1 to the model training management entity;
  • the model training management entity receives the computing power resource information #1 from the model training entity #1.
  • the model training entity #1 may send computing power resource information #1 to the model training management entity in response to the computing power resource subscription information #1.
  • the computing power resource information #1 may include the computing power resource information subscribed by the computing power resource subscription information #1.
  • model training entity #1 periodically sends computing resource information #1 to the model training management entity.
  • model training management entity sends computing resource subscription information #2 to model training entity #2;
  • model training entity #2 receives computing power resource subscription information #2 from the model training management entity.
  • the computing power resource subscription information #2 is used to subscribe to the computing power resource information #2 of the model training entity #2. This step is similar to the description in step S601 above and will not be repeated here.
  • the computing power resource subscription information #1 also indicates the period for model training entity #1 to report computing power resource information #1, and the computing power resource subscription information #2 also indicates the period for model training entity #2 to report computing power resource information #2, which can be the same as or different from the period indicated by the computing power resource subscription information #1.
  • the period for each model training entity to report computing power resource information can be determined based on the average time taken to execute each training task, and this application does not make any special limitations on this.
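  • Deriving a reporting period from the average task execution time can be sketched as follows. Using the mean directly as the period, and the fallback value when no task has been executed yet, are assumptions for this example.

```python
from statistics import fmean
from typing import Sequence


def reporting_period(task_durations_s: Sequence[float], default_s: float = 60.0) -> float:
    """Reporting period (seconds) set to the average time taken to execute a training task."""
    if not task_durations_s:
        # Assumed fallback when the entity has no execution history yet.
        return default_s
    return fmean(task_durations_s)
```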
  • the model training entity #2 sends computing resource information #2 to the model training management entity;
  • the model training management entity receives the computing power resource information #2 from the model training entity #2.
  • the model training entity #2 may send computing power resource information #2 to the model training management entity in response to the computing power resource subscription information #2.
  • the computing power resource information #2 may include the computing power resource information subscribed to by the computing power resource subscription information #2.
  • model training entity #2 periodically sends computing resource information #2 to the model training management entity.
  • the model training management entity sends training status subscription information #1 to the model training entity #1;
  • model training entity #1 receives training status subscription information #1 from the model training management entity.
  • the training state subscription information #1 is used to subscribe to the training state information #1 of the model training entity #1.
  • For the content of the training state information please refer to the introduction of step S501 in FIG. 5, which will not be described in detail here.
  • training status subscription information #1 indicates that model training entity #1 periodically sends training status information #1.
  • the training status subscription information #1 indicates that the model training entity #1 sends the training status information #1 based on a trigger event
  • the trigger event may include at least one of the following: the model training entity #1 adds a new training task, and the model training entity #1 completes a training task.
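  • The two reporting modes of a training status subscription (periodic and event-triggered) can be sketched as follows. The class name, field names, and event names are hypothetical; this application does not define a message format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TrainingStatusSubscription:
    period_s: Optional[float] = None           # set for periodic reporting
    trigger_events: FrozenSet[str] = frozenset()  # set for event-triggered reporting


def should_report(sub: TrainingStatusSubscription, elapsed_s: float,
                  event: Optional[str] = None) -> bool:
    """Decide whether the model training entity sends training status information now."""
    if event is not None and event in sub.trigger_events:
        return True
    return sub.period_s is not None and elapsed_s >= sub.period_s


# Hypothetical event names for the two trigger events listed above.
sub = TrainingStatusSubscription(trigger_events=frozenset({"task_added", "task_completed"}))
```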
  • the model training entity #1 sends training status information #1 to the model training management entity;
  • model training management entity receives training status information #1 from model training entity #1.
  • the model training entity #1 may send training state information #1 to the model training management entity in response to the training state subscription information #1.
  • the training state information #1 may include the training state information #1 subscribed to by the training state subscription information #1.
  • model training entity #1 periodically sends training status information #1 to the model training management entity.
  • model training entity #1 sends training status information #1 to the model training management entity based on a triggering event.
  • the model training management entity sends training status subscription information #2 to model training entity #2;
  • model training entity #2 receives training status subscription information #2 from the model training management entity.
  • Training status subscription information #2 is used to subscribe to training status information #2 of model training entity #2. This step is similar to the description of step S605 above and will not be repeated here.
  • the training status subscription information #1 also indicates the period for model training entity #1 to report training status information #1, and the training status subscription information #2 also indicates the period for model training entity #2 to report training status information #2, which can be the same as or different from the period indicated by the training status subscription information #1.
  • the period for each model training entity to report training status information can be determined based on the average time taken to execute each training task, and this application does not make any special limitations on this.
  • the model training entity #2 sends training status information #2 to the model training management entity;
  • the model training management entity receives training status information #2 from model training entity #2.
  • Model training entity #2 may send training status information #2 to the model training management entity in response to training status subscription information #2.
  • model training entity #2 periodically sends training status information #2 to the model training management entity.
  • model training entity #2 sends training status information #2 to the model training management entity based on the triggering event.
  • Figure 6 shows two model training entities only as an example.
  • the model training management entity can also manage three or more model training entities.
  • the model training management entity can send computing resource subscription information and training status subscription information to each model training entity. This application does not specifically limit this.
  • each model training entity can promptly report its own computing resource information and training status information, so that the model training management entity can perform timely orchestration and management, thereby improving the efficiency of model training.
  • Figure 7 is a schematic flowchart of a method for overall training of a target model training task provided in an embodiment of the present application.
  • the steps shown in Figure 7 can be performed after the steps shown in Figure 6, that is, the model training entity can send training status information to the model training management entity periodically or based on a trigger event, and the model training entity can also send computing power resource information to the model training management entity periodically.
  • model reasoning entity #1 sends a model training request message to model training entity #1.
  • the model training request message is used to request the model training entity #1 to train the inference model.
  • the model training request message includes at least one of the following information: inference model, identification information of the inference model, model performance requirement information, and expected training duration information.
  • the inference model is the inference model to be trained, and the inference model and the identification information of the inference model can be used to mark the inference model.
  • the inference model can refer to the model file of the inference model, and the model file can include an identifier or inference type for identifying the inference model.
  • the model performance requirement information can indicate the requirements for model training, for example, that the accuracy of the inference model after training is expected to be greater than or equal to a certain threshold, etc.
  • the expected training duration information can indicate the training duration that the expected model training takes.
  • the expected training duration information can include a first duration, indicating that the model training entity completes the model training within the first duration from the receipt of the model training request message.
  • model reasoning entity #1 and model training entity #1 can be two modules deployed in one device, such as the model training network element 121 and the model reasoning network element 130 shown in Figure 1. In that case, model reasoning entity #1 can by default send the model training request message to model training entity #1 in the same device, and the model training request message of step S701 is transmitted through an internal interface, thereby reducing external interface overhead.
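  • The content of the model training request message described above can be sketched as a small data structure. This is only an illustrative sketch: the class name, field names, and the use of seconds for the first duration are assumptions and are not defined in this application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelTrainingRequest:
    """Hypothetical container for the listed fields; every field is optional."""
    inference_model: Optional[bytes] = None      # model file of the inference model
    model_id: Optional[str] = None               # identification information
    min_accuracy: Optional[float] = None         # model performance requirement
    expected_duration_s: Optional[float] = None  # the "first duration" (assumed seconds)

    def is_valid(self) -> bool:
        # The message carries at least one of the listed items.
        return any(v is not None for v in (
            self.inference_model, self.model_id,
            self.min_accuracy, self.expected_duration_s))

    def deadline(self, received_at_s: float) -> Optional[float]:
        # Training is expected to finish within the first duration,
        # counted from receipt of the request message.
        if self.expected_duration_s is None:
            return None
        return received_at_s + self.expected_duration_s
```

  • For instance, a request received at time 1000 with a first duration of 600 yields a completion deadline of 1600 under these assumptions.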
  • model training entity #1 determines the training task.
  • Model training entity #1 can determine the training task based on the model training request message.
  • model training entity #1 can generate or index the training task of the inference model according to the inference model. For example, model training entity #1 can generate the training task from the information in the model training request message, or set parameters of the training task, such as weights, according to the model training request message. This application does not specifically limit this.
  • model training entity #1 may determine, based on the training task, whether to request collaborative training. For example, model training entity #1 may decide according to whether its idle computing resources can meet the needs of the training task, or according to the priority of the model training task. For the content of this part, please refer to the description of step S501 in Figure 5, which will not be repeated here.
  • the model training entity #1 determines the target model training task for which the collaborative training is requested. For the contents of this part, reference may be made to the description of step S501 in FIG. 5 , which will not be described in detail here.
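  • One possible way for a model training entity to decide which tasks need collaborative training is sketched below. It assumes, purely for illustration, that each task has a numeric priority (larger means more urgent) and a required amount of computing resources in abstract units; the names `Task` and `tasks_needing_collaboration` are hypothetical and the actual criteria of step S501 may differ.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    task_id: str
    priority: int     # larger value = more urgent (assumption)
    required: float   # required computing resources, abstract units (assumption)


def tasks_needing_collaboration(tasks: List[Task], idle: float) -> List[str]:
    """Serve tasks locally in priority order; return the ones that do not fit."""
    remaining = idle
    offload = []
    for task in sorted(tasks, key=lambda t: t.priority, reverse=True):
        if task.required <= remaining:
            remaining -= task.required    # enough idle resources: train locally
        else:
            offload.append(task.task_id)  # candidate target model training task
    return offload
```

  • With idle resources of 6 units and three tasks requiring 4, 5, and 2 units in descending priority, only the second task would be selected for collaborative training in this sketch.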
  • model training entity #1 sends a training task update notification message to the model training management entity.
  • the training task update notification message indicates that a new model training task has been added to model training entity #1.
  • For example, model training entity #1 sends training status information based on a triggering event, where the triggering event is that a new model training task is added to model training entity #1, such as model training entity #1 receiving the model training request message from the model reasoning entity in step S701.
  • In this case, the training task update notification message can be the training status information sent based on the triggering event, and it notifies the model training management entity of the new model training task added to model training entity #1.
  • the training task update notification message may indicate information about all training tasks on model training entity #1, or the training task update notification message may indicate information about only the changed training tasks, where the changed training tasks may include training tasks whose task progress has changed, newly added training tasks, etc.
  • the training task update notification message indicates at least one of the following: at least one training task on the model training entity #1, the task progress of each training task in at least one training task, and the priority of each training task in at least one training task.
  • For example, the training task update notification message includes three training tasks 1 to 3: training task 1 is running, training tasks 2 and 3 are waiting to run, and training task 2 has a higher priority than training task 3. The model training management entity can then combine this with the computing power resource information #1 reported by model training entity #1 to determine whether model training entity #1 can complete the three training tasks on its own, that is, whether collaborative training is needed.
  • the training task update notification message includes at least one of the following: the identifier of the training task whose task progress has been changed and the type of task progress change, the identifier of the newly added training task, and the priority of the newly added training task.
  • In this way, transmission resources can be saved.
  • For example, the training task update notification message notifies the model training management entity that, on model training entity #1, training task #1 has completed training and training task #3 has been added. It should be noted that the training task update notification message does not include information about training task #2, which may indicate that the task progress of training task #2 has not changed; for example, training task #2 may still be in training, or still waiting to run.
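  • The difference between reporting all training tasks and reporting only the changed ones can be illustrated with the following sketch. The dictionary layout and the function name `build_delta_notification` are assumptions made for illustration; this application does not define a message encoding.

```python
from typing import Dict, List


def build_delta_notification(previous: Dict[str, str],
                             current: Dict[str, str]) -> Dict[str, List[dict]]:
    """Report only newly added tasks and tasks whose progress changed."""
    delta: Dict[str, List[dict]] = {"added": [], "changed": []}
    for task_id, status in current.items():
        if task_id not in previous:
            delta["added"].append({"id": task_id, "status": status})
        elif previous[task_id] != status:
            delta["changed"].append({"id": task_id, "status": status})
    # Tasks whose status is unchanged are omitted, which saves transmission resources.
    return delta
```

  • In this sketch, if training task 1 changes from running to completed, training task 2 stays waiting, and training task 3 is added, the notification carries entries for tasks 1 and 3 only.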
  • the model training management entity determines at least one model training entity #2 among the multiple model training entities based on the multiple pieces of computing resource information.
  • the at least one model training entity #2 is used to collaboratively perform the target model training task.
  • the model training management entity may obtain the target model training task from the training task update notification message, or the model training management entity may determine the target model training task based on the training task update notification message.
  • For details of how the model training management entity determines model training entity #2, please refer to the description of step S503 in Figure 5, which will not be repeated here.
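  • As a rough illustration of selecting an assisting entity from reported computing resource information, the sketch below picks the entity with the most idle resources that can still satisfy the task. The selection rule, the function name, and the abstract resource units are assumptions; the actual criteria are those of step S503.

```python
from typing import Dict, Optional


def select_assisting_entity(idle_compute: Dict[str, float],
                            requester: str,
                            required: float) -> Optional[str]:
    """Pick the non-requesting entity with the most idle resources, if any suffices."""
    candidates = {
        entity: idle
        for entity, idle in idle_compute.items()
        if entity != requester and idle >= required
    }
    if not candidates:
        return None  # no entity can assist with this task
    return max(candidates, key=candidates.get)
```

  • The function returns `None` when no other entity has enough idle resources, in which case the task would simply stay with the requesting entity in this sketch.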
  • the model training management entity sends first training task configuration information to model training entity #2.
  • the first training task configuration information indicates that model training entity #2 assists model training entity #1 in performing the target model training task among the at least one model training task.
  • For the content of the first training task configuration information, please refer to the description of step S504 in Figure 5, which will not be repeated here.
  • the model training management entity sends training task configuration notification information to model training entity #1.
  • the training task configuration notification information indicates that model training entity #2 assists in executing the target model training task.
  • For the content of the training task configuration notification information, please refer to the description of step S505 in Figure 5, which will not be repeated here.
  • model training entity #1 sends target training data indication information to model training entity #2.
  • the target training data indication information indicates the target training data used to train the target model training task.
  • If the model training management entity does not have the information of the target training data, that is, the information of the target training data is not carried in the first training task configuration information in step S705, model training entity #1 can learn from the training task configuration notification information that the target model training task is to be trained by model training entity #2, and then send the target training data indication information to model training entity #2.
  • model training entity #2 obtains target training data.
  • model training entity #2 downloads the target training data based on the target training data indication information. For example, if the target training data is stored in model training entity #1, model training entity #2 sends a download request to model training entity #1, and model training entity #1 sends the target training data to model training entity #2.
  • For details on obtaining the target training data, please refer to the description of step S506 in Figure 5, which will not be repeated here.
  • model training entity #2 assists in executing the target model training task.
  • Model training entity #2 can use the target training data to train the target model training task.
  • model training entity #2 sends training report information to model reasoning entity #1.
  • After model training entity #2 completes the training of the target model training task, a trained model is generated, and the training report information can include the trained model.
  • the training report information also includes performance information of the trained model, such as training accuracy, training time, etc.
  • the model reasoning entity can directly obtain the trained and usable model from the model training entity #2.
  • the model training management entity can manage and orchestrate the training tasks and computing resources of multiple model training entities, assign the training tasks of model training entity #1 to other model training entities with sufficient resources to assist in completing the training, reduce the training waiting time of the training tasks of model training entity #1, improve the utilization of system resources, and improve the efficiency of model training.
  • Figure 8 is a schematic flowchart of a method for training a target model training task by task decomposition, provided in an embodiment of the present application.
  • the steps shown in Figure 8 can be performed after the steps shown in Figure 6, that is, the model training entity can send training status information to the model training management entity periodically or based on a trigger event, and the model training entity can also send computing power resource information to the model training management entity periodically.
  • the model reasoning entity sends a model training request message to the model training entity #1.
  • the model training request message is used to request the model training entity #1 to train the inference model.
  • For the content of this part, please refer to the description of step S701 in Figure 7, which will not be repeated here.
  • model training entity #1 determines the training task.
  • Model training entity #1 can determine the training task based on the model training request message. For the content of this part, please refer to the description of step S702 in Figure 7, which will not be repeated here.
  • model training entity #1 sends a training task update notification message to the model training management entity.
  • the training task update notification message indicates that a new model training task is added to the model training entity #1.
  • For the content of the training task update notification message please refer to the description of step S703 in Figure 7, which will not be repeated here.
  • the training task update notification message may further include training data length information.
  • the training data length information indicates the length of the training data, and the training data is used to perform the training task.
  • the training data length information can indicate the length of the training data of each training task on the model training entity #1, and then the model training management entity can determine the target model training task for collaborative training based on the length of the training data corresponding to each training task, thereby improving the reliability of determining the target model training task.
  • If model training entity #1 has determined that a target model training task requires collaborative training, that is, the training task update notification message includes information about the target model training task, the training data length information may indicate only the length of the target training data corresponding to the target model training task, thereby saving transmission resources.
  • the model training management entity decomposes the target model training task and determines the model training entity #2.
  • the model training management entity can obtain the target model training task from the training task update notification message, or the model training management entity can determine the target model training task based on the training task update notification message.
  • model training management entity may decompose the target model training task by splitting the target model, or the model training management entity may also decompose the target model training task by splitting the training data.
  • the model training management entity can determine, based on the computing power resource information of multiple model training entities, the M model training entities that participate in the training of the target model training task, and split the target model into M target sub-models. Each target sub-model can be trained separately by one of the M model training entities, and the training of one target sub-model is one model training subtask, where M is a positive integer.
  • For example, if the target model is a 4-layer neural network model, the model training management entity can decompose it into two target sub-models: the first two layers form one target sub-model, and the last two layers form the other target sub-model.
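  • The 4-layer example can be sketched as splitting an ordered list of layers into M contiguous groups. The even split used here is an assumption for illustration; this application only requires that the target model be split into M target sub-models, and the function name is hypothetical.

```python
from typing import List, Sequence


def split_layers(layers: Sequence[str], m: int) -> List[List[str]]:
    """Split an ordered list of layers into m contiguous target sub-models."""
    if not 1 <= m <= len(layers):
        raise ValueError("m must be between 1 and the number of layers")
    base, extra = divmod(len(layers), m)
    parts, start = [], 0
    for i in range(m):
        size = base + (1 if i < extra else 0)  # earlier parts absorb the remainder
        parts.append(list(layers[start:start + size]))
        start += size
    return parts
```

  • Calling `split_layers(["layer1", "layer2", "layer3", "layer4"], 2)` gives the first two layers as one sub-model and the last two layers as the other, matching the example above.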
  • the model training management entity can decompose the target model training task into multiple target model training subtasks based on the computing power resource information of multiple model training entities, and determine the training data sub-length corresponding to each target model training subtask.
  • the model training management entity may determine N model training entities participating in training the target model training task based on the computing power resource information of multiple model training entities, divide the target model training task into N target model training subtasks, and determine the training data sub-length of each target model training subtask in the N target model training subtasks.
  • the N model training entities may respectively perform the training of the N target model training subtasks.
  • N is a positive integer greater than or equal to 1.
  • the N model training entities may include model training entity #1 or may not include model training entity #1.
  • For example, if model training entity #1 has spare computing power, model training entity #1 can train one of the target model training subtasks. If model training entity #1 does not have spare computing power, model training entity #1 may not participate in the training of the target model training task, and the N model training entities #2 perform the training of the subtasks separately.
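  • Determining the training data sub-length of each subtask can be sketched as a split of the total data length in proportion to each participating entity's idle computing resources. The proportional rule, the function name, and the treatment of the rounding remainder are assumptions for illustration only.

```python
from typing import Dict


def split_data_length(total_length: int,
                      idle_compute: Dict[str, float]) -> Dict[str, int]:
    """Assign each entity a training data sub-length proportional to its idle resources."""
    total_compute = sum(idle_compute.values())
    if total_compute <= 0:
        raise ValueError("at least one entity needs idle computing resources")
    shares = {
        entity: int(total_length * idle / total_compute)
        for entity, idle in idle_compute.items()
    }
    # Rounding down may leave a few samples unassigned; give them to the
    # entity with the most idle resources so the sub-lengths sum to the total.
    leftover = total_length - sum(shares.values())
    shares[max(idle_compute, key=idle_compute.get)] += leftover
    return shares
```

  • With a total length of 1000 and idle resources of 1 and 3 units, the sub-lengths in this sketch are 250 and 750. An entity without spare computing power would simply be left out of the input.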
  • model training management entity can set model training entity #1 as the main model training entity and other model training entities as collaborative model training entities.
  • the main model training entity can receive, from the collaborative model training entity, the sub-model that the collaborative model training entity obtained by training its target model training subtask.
  • Model training entity #1 as the main model training entity can improve the reliability of training task allocation.
  • model training management entity determines, based on computing resource information, that two model training entities, model training entity #1 and model training entity #2, perform collaborative training on the target model training task, wherein model training entity #1 can serve as the main model training entity and model training entity #2 as the collaborative model training entity.
  • the model training management entity determines that model training entity #2 assists in performing the target model task, and determines that the training data used to assist in performing the target model task is of a target length.
  • model training management entity may first determine the model training entity #2 and then decompose the target model training task, or the model training management entity may first decompose the target model training task and then determine the model training entity #2, or the model training management entity may also simultaneously decompose the target model training task and determine the model training entity #2, and this application does not specifically limit this.
  • the content related to the model training management entity determining the model training entity #2 can refer to the description of step S503 in Figure 5, which will not be repeated here.
  • the model training management entity sends first training task configuration information to model training entity #2.
  • the first training task configuration information indicates that model training entity #2 assists model training entity #1 in performing the target model training task among the at least one model training task.
  • For the content of the first training task configuration information, please refer to the description of step S504 in Figure 5, which will not be repeated here.
  • the model training management entity sends second training task configuration information to model training entity #1.
  • the second training task configuration information indicates that model training entity #2 assists model training entity #1 in performing the target model training task.
  • the second training task configuration information indicates the identifier of the target model training task and the identifier of the model training entity #2.
  • the model training entity #1 can know that the target model training task is performed by the model training entity #2.
  • the second training task configuration information also indicates splitting method information, such as a splitting node identifier. The splitting method information indicates how the target model is split, so that model training entity #1 can obtain, according to the splitting method information, the target sub-models into which the target model is split.
  • the second training task configuration information also indicates the target length of the training data used by model training entity #2 to assist in executing the target model training task. Model training entity #1 can thus learn the target length of the training data that it is to indicate to model training entity #2.
  • model training entity #1 preliminarily executes the target model training task.
  • Model training entity #1 determines the model structure, training algorithm, training hyperparameters, etc., and uses the training data of the target length at model training entity #1 to perform several rounds of initial model training to obtain an initial model.
  • model training entity #1 sends target training data indication information to model training entity #2.
  • the target training data indication information indicates the target training data used to train the target model training task, and instructs model training entity #2 to use training data of the target length to assist in performing the target model training task.
  • the target training data indication information indicates the identifier of the target collaborative training task, the address of the training data, and the address of the target sub-model corresponding to the training sub-task executed by the model training entity #2. Then, the model training entity #2 can obtain the target sub-model based on the address of the target sub-model, and obtain the training data of the target length based on the address of the training data.
  • the target training data indication information indicates the identifier of the target collaborative training task, the target length, the address of the training data, and the address of the initial model. Then, the model training entity #2 can obtain the initial model based on the address of the initial model, and obtain the training data of the target length based on the address of the training data.
  • model training entity #2 obtains target training data.
  • Model training entity #2 can obtain training data of target length based on target training data indication information. For details on obtaining target training data, please refer to the description of step S708 in FIG. 7 , which will not be described in detail here.
  • model training entity #2 assists in executing the target model training task.
  • Model training entity #2 uses the training data of the target length to perform several rounds of training on the initial model to generate a sub-model.
  • model training entity #2 uses the training data to perform several rounds of training on the target sub-model to generate a trained target sub-model.
  • model training entity #2 sends model transfer information to model training entity #1.
  • the model transfer information indicates a trained sub-model, that is, the initial model after several rounds of training with the training data of the target length, or the target sub-model after several rounds of training.
  • the model transfer information indicates a model gradient of the sub-model or a model address of the sub-model.
  • the model transfer information also indicates the identifier of the target sub-model.
  • the identifier of the target sub-model can be determined based on the splitting method information when the model training management entity performs the split, or it can be agreed upon by model training entity #1 and model training entity #2. This application does not specifically limit this.
  • model training entity #1 performs model aggregation.
  • model training entity #1 can perform model aggregation in different ways, depending on how the target model training task was decomposed. For example, when the model training management entity decomposes the target model training task by splitting the target model, model training entity #1 obtains the trained target sub-model from model training entity #2 and aggregates the target sub-models. Alternatively, when the model training management entity decomposes the target model training task by splitting the training data, model training entity #1 can obtain, based on the model transfer information, the model that model training entity #2 trained from the initial model, and aggregate it with the initial model obtained from its own preliminary training to obtain the trained inference model. For example, model training entity #1 can average the gradients or weights of the multiple sub-models and the initial model to obtain the gradients or weights of the final aggregated model.
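  • The averaging form of model aggregation mentioned above can be sketched as an element-wise mean over the weights of several models with the same structure. Representing a model as a dictionary of flat weight lists, and weighting all models equally, are simplifying assumptions for illustration.

```python
from typing import Dict, List


def average_weights(models: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise average of the weights of models that share the same structure."""
    if not models:
        raise ValueError("at least one model is required")
    aggregated = {}
    for name in models[0]:
        # zip(*...) groups the i-th weight of every model together.
        columns = zip(*(model[name] for model in models))
        aggregated[name] = [sum(column) / len(models) for column in columns]
    return aggregated
```

  • For example, averaging the locally trained initial model with one sub-model received from an assisting entity yields, for each parameter, the midpoint of the two values.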
  • model training entity #1 sends training report information to model reasoning entity #1.
  • the training report information may include the trained inference model.
  • For the content of the training report information, please refer to the description of step S710 in Figure 7, which will not be repeated here.
  • the model training management entity can decompose the target model training task and use multiple model training entities to collaborate to complete the training task, which can reduce the training task burden of the original training subject model training entity #1 of the target model training task, make full use of the resources of multiple model training entities, and reduce the training waiting time of model training entity #1.
  • model training entity #2 can send the assistance training feedback information and/or network status change information described in S507 in Figure 5 to the model training management entity during the execution of the target model training task, so that the model training management entity is aware of the execution status of the target model training task and can make timely adjustments to improve the reliability of model training. This is explained below in conjunction with Figure 9.
  • Figure 9 is a schematic flowchart of a method for providing feedback during the execution of a target model training task provided in an embodiment of the present application.
  • model training management entity is used to manage three model training entities.
  • the model training management entity can be any model training management entity described in FIG. 1 to FIG. 3, and model training entity #1, model training entity #2, and model training entity #3 can be any model training entity described in FIG. 1 to FIG. 3, and the present application does not specifically limit this.
  • the method described in FIG. 9 can be combined with any method described in FIG. 5 to FIG. 8.
  • model training entity #2 sends assistance training feedback information and/or network status change information to the model training management entity.
  • the assistance training feedback information may indicate at least one of the following: the accuracy achieved by model training entity #2 in executing the target model training task, the time taken by model training entity #2 to execute the target model training task, the execution progress of the target model training task on model training entity #2, and the amount of resources occupied by model training entity #2 to execute the target model training task.
  • the network status change information indicates that the network status of model training entity #2 has changed so that it can no longer assist in executing the target model training task, or in other words, it can no longer complete the target model training task.
  • For example, the network status change information indicates that the energy-saving mode of model training entity #2 has changed to an energy-saving state, that is, model training entity #2 will be shut down to save energy.
  • For example, the network status change information indicates that the resources used for model training by model training entity #2 are occupied, that is, the computing resources of model training entity #2 for assisting in executing the target model training task are insufficient.
  • model training entity #2 may send only the assistance training feedback information to the model training management entity, and the model training management entity can determine, based on the content of the assistance training feedback information, whether another model training entity needs to take over assisting in executing the target model training task.
  • Model training entity #2 may also send only the network status change information to the model training management entity, and the model training management entity can then directly determine, based on the network status change information, that another model training entity is to take over assisting in executing the target model training task.
  • Model training entity #2 can also send training assistance feedback information and network status change information to the model training management entity, and this application does not specifically limit this.
  • the model training management entity adjusts the allocation method.
  • Adjusting the allocation method may include: adjusting the way in which a model training entity assists in executing the target model training task (such as changing the target length), and replacing the model training entity (such as replacing model training entity #2 with model training entity #3, so that model training entity #3 assists in executing the target model training task). In the latter case, model training entity #3 is an example of a second target model training entity that the model training management entity determines among the multiple model training entities.
  • the model training management entity can adjust the allocation method.
  • For example, the model training management entity can adjust the allocation method of the target collaborative training task. If the network status change information indicates that the resources used for model training by model training entity #2 are occupied, or that model training entity #2 is to be shut down, the model training management entity can also adjust the allocation method.
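  • A possible decision rule for adjusting the allocation method is sketched below. It assumes that a network status change always leads to replacing the assisting entity, and that feedback showing the task will overrun an expected duration leads to changing the target length. The class, field names, and the linear projection of the remaining time are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    progress: Optional[float] = None      # fraction of the task completed, 0..1
    elapsed_s: Optional[float] = None     # time spent so far (assumed seconds)
    network_status_changed: bool = False  # e.g. energy saving or occupied resources


def decide_adjustment(fb: Feedback, expected_duration_s: float) -> str:
    """Return 'replace_entity', 'change_target_length', or 'keep'."""
    if fb.network_status_changed:
        # The entity can no longer assist, so another entity takes over.
        return "replace_entity"
    if fb.progress and fb.elapsed_s is not None:
        projected_total_s = fb.elapsed_s / fb.progress  # naive linear projection
        if projected_total_s > expected_duration_s:
            return "change_target_length"
    return "keep"
```

  • For example, feedback reporting 25% progress after 100 seconds projects to 400 seconds in total, so with an expected duration of 300 seconds this sketch would change the target length.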
  • the model training management entity sends configuration change information #1 to model training entity #1.
  • Configuration change information #1 indicates the adjusted allocation method, such as indicating the changed target length, or indicating that the target model training task is changed from being assisted by model training entity #2 to being assisted by model training entity #3.
  • the model training management entity sends configuration change information #2 to model training entity #2.
  • Configuration change information #2 instructs model training entity #2 to assist in executing the target model training task in an adjusted manner, or instructs model training entity #2 to stop assisting in executing the target model training task.
  • the method further includes step S905:
  • the model training management entity sends third training task configuration information to model training entity #3.
  • the third training task configuration information instructs model training entity #3 to assist model training entity #1 in performing the target model training task.
  • the content of the third training task configuration information is similar to the description of the first training task configuration information in step S504 in Figure 5, and will not be repeated here.
  • model training entity #3 can assist in executing the target model training task based on the third training task configuration information, that is, model training entity #3 can execute steps S506 to S508 in Figure 5, steps S707 to S710 in Figure 7, or steps S809 to S813 in Figure 8, which will not be elaborated here.
  • the model training entity can provide feedback on the execution status to the model training management entity while assisting in the execution of the model training task, so that the model training management entity can be informed of the execution status of the target model training task and can make timely adjustments to improve the reliability of model training.
  • Figures 5 to 9 can all be applied to the architectures described in Figures 1 to 3. It can be understood that, in order to facilitate understanding of the embodiments of the present application, a model training management entity is used for illustration in the above description process. If there are multiple model training management entities in the system, for example, refer to Figure 3, there is a model training management entity in EMS and NMS respectively, then the model training management entity in Figures 5 to 9 can be regarded as the model training management entity in EMS, and the model training management entity in EMS can interact with the model training management entity in NMS to achieve the joint management of multiple model training entities. For ease of understanding, the implementation method is described below in conjunction with Figure 10.
  • Figure 10 is a schematic flowchart of an implementation method of two model training management entities performing model training management provided in an embodiment of the present application.
  • Figure 10 takes the model training management entity #1 as the model training management entity in the NMS and the model training management entity #2 as the model training management entity in the EMS as an example.
  • model training management entity #1 sends total computing power resource subscription information to model training management entity #2.
  • the total computing power resource subscription information indicates a subscription to the computing power resource information of multiple model training entities.
  • the multiple model training entities are the model training entities managed by model training management entity #2.
  • For the content of the computing power resource information please refer to the description of step S502 in Figure 5, which will not be repeated here.
  • Model training management entity #2, model training entity #1, and model training entity #2 can execute the contents of steps S601 to S604 as described in FIG. 6 , which will not be described in detail here.
  • model training management entity #2 sends total computing power resource information to model training management entity #1.
  • Model training management entity #2 can send the computing power resource information #1 and computing power resource information #2 received from model training entity #1 and model training entity #2 directly to model training management entity #1, or model training management entity #2 can also process the computing power resource information #1 and computing power resource information #2 and send them to model training management entity #1. This application does not make any special restrictions on this.
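  • The two reporting options described above (forwarding the per-entity reports unchanged, or processing them first) can be sketched as follows. This is only an illustrative sketch: the class `ComputeResourceInfo`, its fields, the unit of `idle_units`, and the summary format are all assumptions, since the application does not limit how the information is processed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ComputeResourceInfo:
    """Idle computing power reported by one model training entity (hypothetical fields)."""
    entity_id: str
    idle_units: float  # assumed abstract measure of idle computing power


def forward_directly(reports: List[ComputeResourceInfo]) -> List[ComputeResourceInfo]:
    """Option 1: pass the per-entity reports upstream unchanged."""
    return list(reports)


def forward_processed(reports: List[ComputeResourceInfo]) -> Dict[str, object]:
    """Option 2: process the reports first; a summary is one assumed form of processing."""
    return {
        "entity_count": len(reports),
        "total_idle_units": sum(r.idle_units for r in reports),
        "idle_units_by_entity": {r.entity_id: r.idle_units for r in reports},
    }


# Example: management entity #2 holds reports from two model training entities.
reports = [
    ComputeResourceInfo("model_training_entity_1", 2.0),
    ComputeResourceInfo("model_training_entity_2", 6.0),
]
total_direct = forward_directly(reports)
total_summary = forward_processed(reports)
```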
  • model training management entity #1 sends total training status subscription information to model training management entity #2.
  • the total training status subscription information indicates a subscription to the training status information of multiple model training entities, namely the model training entities managed by model training management entity #2.
  • for the content of the training status information, please refer to the description of step S501 in Figure 5, which will not be repeated here.
  • Model training management entity #2, model training entity #1, and model training entity #2 may execute the contents of steps S605 to S608 as described in FIG6 , which will not be described in detail here.
  • model training management entity #2 sends total training status information to model training management entity #1.
  • Model training management entity #2 may send the training status information #1 and training status information #2 received from model training entity #1 and model training entity #2 directly to model training management entity #1, or model training management entity #2 may process the training status information #1 and training status information #2 and send them to model training management entity #1. This application does not make any special limitation on this.
  • model training management entity #1 sends policy information to model training management entity #2.
  • the policy information indicates a method for determining, among multiple model training entities and based on multiple pieces of computing power resource information, a target model training entity (i.e., model training entity #2 in the embodiment of the present application) that assists in executing a target model training task, and/or indicates a method for determining a target length of training data for the target model training task based on the total length of training data used to complete the target model training task.
  • the policy information also indicates a manner of determining whether collaborative training is required.
  • the method of determining whether collaborative training is required includes at least one of the following:
  • Priority-based method: determine whether collaborative training is required based on the priority of the training task. When the training task priority is greater than or equal to the priority threshold, collaborative training is required; when the training task priority is less than the priority threshold, collaborative training is not required;
  • Computing power resource utilization-based method: determine whether collaborative training is required based on computing power resource utilization. When the computing power resource utilization of the model training entity is greater than or equal to the utilization threshold, it is determined that the model training entity needs collaborative training; when the computing power resource utilization of the model training entity is less than the utilization threshold, it is determined that the model training entity does not need collaborative training;
  • Training data volume-based method: determine whether collaborative training is required based on the amount of training data required for the training task. When the amount of training data for the training task is greater than the data volume threshold, collaborative training is determined to be required; when the amount of training data for the training task is less than or equal to the data volume threshold, collaborative training is determined not to be required.
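  • The three threshold rules above can be sketched as simple predicates. The names `Thresholds` and `needs_collab_by_*`, and the units chosen for each threshold, are assumptions made for illustration only. Note that the comparison is inclusive (greater than or equal) for priority and utilization but strict (greater than) for data volume, as in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the thresholds named in the policy."""
    priority: int        # priority threshold
    utilization: float   # utilization threshold, assumed to be a fraction in [0, 1]
    data_volume: int     # data volume threshold, assumed to be a sample count


def needs_collab_by_priority(task_priority: int, t: Thresholds) -> bool:
    # Collaborative training is required when priority >= threshold.
    return task_priority >= t.priority


def needs_collab_by_utilization(utilization: float, t: Thresholds) -> bool:
    # Collaborative training is required when utilization >= threshold.
    return utilization >= t.utilization


def needs_collab_by_data_volume(data_volume: int, t: Thresholds) -> bool:
    # Collaborative training is required only when data volume > threshold.
    return data_volume > t.data_volume
```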
  • the method of determining the target model training entity includes at least one of the following:
  • Computing resource-based method: determine the target model training entity based on idle computing resources. For example, select the model training entity with the most idle computing resources as the target model training entity;
  • Node location-based method: determine the target model training entity based on the locations of the model training entity and the main model training entity (e.g., model training entity #1 in the embodiment of the present application). For example, select the model training entity closest to the main model training entity as the target model training entity.
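  • The two selection methods above can be sketched as follows. The `Candidate` class and its fields are hypothetical, and the use of Euclidean distance over 2D coordinates as the measure of "closest" is an assumption; the text does not specify how node distance is measured.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A model training entity that could assist (hypothetical fields)."""
    entity_id: str
    idle_units: float              # assumed measure of idle computing resources
    position: Tuple[float, float]  # assumed 2D node coordinates


def select_by_idle_resources(candidates: List[Candidate]) -> Candidate:
    # Computing resource-based method: pick the entity with the most idle resources.
    return max(candidates, key=lambda c: c.idle_units)


def select_by_location(candidates: List[Candidate],
                       main_position: Tuple[float, float]) -> Candidate:
    # Node location-based method: pick the entity closest to the main entity.
    return min(candidates, key=lambda c: math.dist(c.position, main_position))
```

  • With candidates at (0, 1) and (5, 5) and the main model training entity at (0, 0), the location-based method selects the first candidate even if the second has more idle resources, which shows that the two methods can disagree.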
  • the method of determining the target length of the training data includes at least one of the following:
  • the target length is determined based on the idle computing resources of the model training entity.
  • for example, the target length is set to the length of training data that the idle computing resources of the model training entity can process;
  • the target length is determined based on the number of model training entities with idle computing resources. For example, if the model training management entity selects three model training entities to assist in executing the target model training task, the training data of the target model training task can be decomposed into three training data of target length.
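  • The two ways of determining the target length can be sketched as follows. Both functions are illustrative assumptions: the text does not say how a remainder is handled when the data does not divide evenly, nor how idle computing resources map to a processable data length, so the sketch spreads the remainder over the first entities and uses a hypothetical `units_per_sample` cost.

```python
from typing import List


def split_by_entity_count(total_length: int, n_entities: int) -> List[int]:
    """Decompose the training data into n_entities pieces of near-equal length."""
    if n_entities <= 0:
        raise ValueError("n_entities must be positive")
    base, remainder = divmod(total_length, n_entities)
    # Assumption: the first `remainder` entities each take one extra sample.
    return [base + (1 if i < remainder else 0) for i in range(n_entities)]


def length_from_idle_resources(total_length: int,
                               idle_units: float,
                               units_per_sample: float) -> int:
    """Target length that the idle resources can process, capped at the total length."""
    if units_per_sample <= 0:
        raise ValueError("units_per_sample must be positive")
    capacity = int(idle_units // units_per_sample)
    return min(total_length, capacity)
```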
  • the policy information may include a specific method, or may also include an index of the method, such as the name or identifier of the method, or may also include parameters of the method, such as a priority threshold, a data volume threshold, etc. This application does not specifically limit this.
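  • One way to picture policy information that carries a method index plus parameters is sketched below. The `PolicyInfo` class, the registry `COLLAB_METHODS`, the method names, and the `"threshold"` parameter key are all hypothetical; the application does not define a concrete encoding.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical registry mapping a method name (the "index") to a decision rule.
COLLAB_METHODS: Dict[str, Callable[[float, float], bool]] = {
    "priority": lambda value, threshold: value >= threshold,
    "utilization": lambda value, threshold: value >= threshold,
    "data_volume": lambda value, threshold: value > threshold,
}


@dataclass(frozen=True)
class PolicyInfo:
    """Policy information carrying a method name and its parameters."""
    method_name: str
    parameters: Dict[str, float] = field(default_factory=dict)


def apply_policy(policy: PolicyInfo, observed_value: float) -> bool:
    """Look up the named method and evaluate it with the carried threshold."""
    decide = COLLAB_METHODS[policy.method_name]
    return decide(observed_value, policy.parameters["threshold"])
```

  • Carrying only a name and parameters keeps the message small, at the cost of requiring both management entities to agree on the set of named methods in advance.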
  • Model training entity #1 and model training management entity #2 can execute step S503 in Figure 5, or steps S701 to S703 in Figure 7, or steps S801 to S803 in Figure 8, which will not be repeated here.
  • model training management entity #2 sends training task update notification information to model training management entity #1.
  • the training task update notification information may be the training task update notification information obtained by model training management entity #2 from model training entity #1, which indicates that a new model training task has been added to model training entity #1.
  • model training management entity #2 can directly send the training task update notification information obtained from model training entity #1 to model training management entity #1, or can process the training task update notification information and then send it to model training management entity #1. This application does not specifically limit this.
  • Model training management entity #2, model training entity #1, and model training entity #2 can execute the contents of steps S504 to S508 in Figure 5, or steps S704 to S710 in Figure 7, or steps S804 to S813 in Figure 8, and optionally can also execute the entire contents of Figure 9, which will not be repeated here.
  • model training management entity #2 can determine the allocation method based on the policy information obtained from model training management entity #1.
  • model training management entity #2 sends training task information to model training management entity #1.
  • the training task information may be the training task information sent by model training management entity #2 to model training entity #1.
  • for the relevant content, refer to the description of step S505 in FIG. 5, which is not elaborated here.
  • model training management entity #2 can directly send to model training management entity #1 the training task information that it sent to model training entity #1, or it can process the training task information and then send it to model training management entity #1. This application does not specifically limit this.
  • the model training management function in the NMS can send policy information to the model training management function in the EMS to improve the efficiency of model training.
  • a piece of information can be carried in one or more messages or one or more information elements in the same message, such as two messages, or two information elements in the same message, and this application does not impose any special limitations on this.
  • FIG11 is a schematic diagram of a communication device provided in an embodiment of the present application.
  • the device 1100 may include a transceiver module 1110 and a processing module 1120.
  • the transceiver module 1110 may communicate with the outside of the device, and the processing module 1120 is used for data processing.
  • the transceiver module 1110 may also be referred to as a communication interface or a transceiver unit.
  • the device 1100 can implement a process corresponding to the execution of the model training management function in the method embodiments shown in Figures 5 to 10 above, wherein the processing module 1120 is used to execute processing-related operations of the model training management function in the method embodiments shown in Figures 5 to 10 above, and the transceiver module 1110 is used to execute transceiver-related operations of the model training management function in the method embodiments shown in Figures 5 to 10 above.
  • the transceiver module 1110 is used to receive training status information of a first model training entity, where the training status information indicates at least one model training task of the first model training entity; the transceiver module 1110 is also used to obtain a plurality of computing power resource information, where the plurality of computing power resource information respectively indicates idle computing power resources for model training of the plurality of model training entities; the processing module 1120 is used to determine a first target model training entity among the plurality of model training entities based on the plurality of computing power resource information; the transceiver module 1110 is also used to send first training task configuration information to the first target model training entity, where the first training task configuration information indicates assisting the first model training entity in performing a target model training task among the at least one model training task.
  • the model training management entity can manage and arrange the computing resources of multiple model training entities, and assign the training tasks of the first model training entity to other model training entities with sufficient computing resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the first model training entity and improving the efficiency of model training.
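  • The management-side flow described above (receive training status, obtain idle computing power, determine a target, send configuration) can be sketched as follows. All class and field names are hypothetical. Two choices are assumptions not stated in the text: the target is the entity with the most idle computing power, and the first pending task is the one offloaded.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrainingStatusInfo:
    """Indicates the model training tasks held by one entity (hypothetical fields)."""
    entity_id: str
    pending_tasks: List[str]


@dataclass(frozen=True)
class TrainingTaskConfig:
    """Tells a target entity which entity and task to assist (hypothetical fields)."""
    assisted_entity_id: str
    task_id: str


class ManagementEntitySketch:
    def __init__(self) -> None:
        # Stand-in for the transceiver module: records (target, config) pairs.
        self.sent: List[Tuple[str, TrainingTaskConfig]] = []

    def determine_target(self, idle_by_entity: Dict[str, float], exclude: str) -> str:
        # Processing step: pick the entity with the most idle computing power,
        # excluding the entity that already holds the task.
        candidates = {e: idle for e, idle in idle_by_entity.items() if e != exclude}
        return max(candidates, key=lambda e: candidates[e])

    def handle(self, status: TrainingStatusInfo,
               idle_by_entity: Dict[str, float]) -> str:
        target = self.determine_target(idle_by_entity, exclude=status.entity_id)
        # Assumption: the first pending task is chosen as the target task.
        config = TrainingTaskConfig(status.entity_id, status.pending_tasks[0])
        self.sent.append((target, config))  # "send" the configuration
        return target
```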
  • the device 1100 may implement a process corresponding to the execution of the first model training entity (i.e., model training entity #1) in the method embodiment shown in Figures 5 to 10 above, wherein the transceiver module 1110 is used to execute the transceiver-related operations of model training entity #1 in the method embodiment shown in Figures 5 to 10 above, and the processing module 1120 is used to execute the processing-related operations of model training entity #1 in the method embodiment shown in Figures 5 to 10 above.
  • the model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and arrange the computing power resources of multiple model training entities, and assign the training tasks of the model training entity to other model training entities with sufficient computing power resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the model training entity and improving the efficiency of model training.
  • the processing module 1120 is used to generate training status information; the transceiver module 1110 is used to send training status information to the model training management entity, the training status information indicating at least one model training task of the model training entity; the transceiver module is also used to send computing power resource information to the model training management entity, the computing power resource information indicates the idle computing power resources of the model training entity for model training; the transceiver module 1110 is also used to receive training task information from the model training management entity, the training task information indicates the first target model training entity that assists in executing the target model training task in at least one model training task.
  • the device 1100 can implement a process corresponding to the execution of the first target model training entity (i.e., model training entity #2) in the method embodiment shown in Figures 5 to 10 above, wherein the transceiver module 1110 is used to execute the transceiver-related operations of model training entity #2 in the method embodiment shown in Figures 5 to 10 above, and the processing module 1120 is used to execute the processing-related operations of model training entity #2 in the method embodiment shown in Figures 5 to 10 above.
  • the transceiver module 1110 is used to send computing power resource information to the model training management entity, and the computing power resource information indicates the idle computing power resources of the model training entity for model training; the transceiver module 1110 is also used to receive first training task configuration information from the model training management entity, and the first training task configuration information indicates assisting the first model training entity to perform the target model training task; the processing module 1120 is used to assist the first model training entity to perform the target model training task.
  • the model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and arrange the computing power resources of multiple model training entities, and assign the training tasks of the model training entity to other model training entities with sufficient computing power resources to assist in completing the training, thereby reducing the training waiting time of the training tasks of the model training entity and improving the efficiency of model training.
  • the device 1100 here is embodied in the form of a functional unit.
  • the term "unit” here may refer to an application specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor or a group processor, etc.) and a memory for executing one or more software or firmware programs, a merged logic circuit and/or other suitable components that support the described functions.
  • the device 1100 may be specifically the model training management function in the above-mentioned embodiment or a chip applied to the model training management function, and may be used to execute the process corresponding to the model training management function in the above-mentioned method embodiment, or the device 1100 may be specifically the model training function in the above-mentioned embodiment or a chip applied to the model training function, and may be used to execute the process corresponding to the model training function in the above-mentioned method embodiment. To avoid repetition, it will not be described here.
  • the above-mentioned device 1100 has the function of implementing the corresponding steps performed by the model training management function in the above-mentioned method, or the above-mentioned device 1100 has the function of implementing the corresponding steps performed by the model training function in the above-mentioned method.
  • the function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above-mentioned functions; for example, the transceiver module can be replaced by a transceiver (for example, the sending unit in the transceiver module can be replaced by a transmitter, and the receiving unit in the transceiver module can be replaced by a receiver), and other units, such as processing modules, can be replaced by processors to respectively perform the transceiver operations and related processing operations in each method embodiment.
  • the above-mentioned transceiver module can also be a transceiver circuit (for example, it can include a receiving circuit and a sending circuit), and the processing module can be a processing circuit.
  • the device in Figure 11 can be the model training management function or the model training function in the aforementioned embodiment, or it can be a chip or a chip system, such as a system on chip (SoC).
  • the transceiver module can be an input/output circuit or a communication interface.
  • the processing module is a processor or microprocessor or integrated circuit integrated on the chip. This is not limited here.
  • FIG12 shows a communication device 1200 provided in an embodiment of the present application.
  • the device 1200 includes a processor 1210 and a memory 1220.
  • the memory 1220 is used to store instructions, and the processor 1210 can call the instructions stored in the memory 1220 to execute the model training management function or the process corresponding to the model training function in the above method embodiment.
  • the memory 1220 is used to store instructions, and the processor 1210 can call the instructions stored in the memory 1220 to execute the process corresponding to the model training management function in the above method embodiment.
  • the memory 1220 is used to store instructions, and the processor 1210 can call the instructions stored in the memory 1220 to execute the process corresponding to the model training function in the above method embodiment.
  • the device 1200 can be specifically the model training management function or the model training function in the above embodiment, or it can be a chip or chip system for the model training management function or the model training function. Specifically, the device 1200 can be used to execute the process corresponding to the model training management function or the model training function in the above method embodiment.
  • the memory 1220 may include a read-only memory and a random access memory, and provide instructions and data to the processor.
  • a portion of the memory may also include a non-volatile random access memory.
  • the memory may also store information about the device type.
  • the processor 1210 may be used to execute instructions stored in the memory, and when the processor 1210 executes instructions stored in the memory, the processor 1210 is used to execute the process of the method embodiment corresponding to the model training management function or the model training function.
  • each step of the above method can be completed by an integrated logic circuit of hardware in a processor or an instruction in the form of software.
  • the steps of the method disclosed in conjunction with the embodiment of the present application can be directly embodied as a hardware processor for execution, or a combination of hardware and software modules in a processor for execution.
  • the software module can be located in a storage medium mature in the art such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, etc.
  • the storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above method in conjunction with its hardware. To avoid repetition, it is not described in detail here.
  • the processor in the embodiment of the present application can be an integrated circuit chip with signal processing capabilities.
  • each step of the above method embodiment can be completed by an integrated logic circuit of hardware in the processor or an instruction in the form of software.
  • the above processor can be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component.
  • the processor in the embodiment of the present application can implement or execute the methods, steps and logic block diagrams disclosed in the embodiment of the present application.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the steps of the method disclosed in the embodiment of the present application can be directly embodied as a hardware decoding processor to execute, or the hardware and software modules in the decoding processor can be combined and executed.
  • the software module can be located in a mature storage medium in the field such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, etc.
  • the storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiments of the present application can be a volatile memory or a non-volatile memory, or can include both volatile and non-volatile memories.
  • the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory can be a random access memory (RAM), which is used as an external cache. Examples of RAM include static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • FIG13 shows a communication device 1300 provided in an embodiment of the present application.
  • the device 1300 includes a processing circuit 1310 and a transceiver circuit 1320.
  • the processing circuit 1310 and the transceiver circuit 1320 communicate with each other through an internal connection path, and the processing circuit 1310 is used to execute instructions to control the transceiver circuit 1320 to send and/or receive signals.
  • the device 1300 may further include a storage medium 1330, which communicates with the processing circuit 1310 and the transceiver circuit 1320 via an internal connection path.
  • the storage medium 1330 is used to store instructions, and the processing circuit 1310 may execute the instructions stored in the storage medium 1330.
  • the device 1300 is used to implement the process corresponding to the model training management entity in the above method embodiment.
  • the processing circuit 1310 is used to implement the function of the above-mentioned processing unit 1120
  • the transceiver circuit 1320 is used to implement the functions of the above-mentioned transceiver unit 1110 or the transceiver unit 1110 and the processing unit 1120.
  • the device 1300 is used to implement the process corresponding to the model training entity in the above method embodiment.
  • the processing circuit 1310 is used to implement the function of the above-mentioned processing unit 1120
  • the transceiver circuit 1320 is used to implement the functions of the above-mentioned transceiver unit 1110 or the transceiver unit 1110 and the processing unit 1120.
  • the present application also provides a computer program product, which includes computer program code; when the computer program code is run on a computer, the computer executes the method in the embodiments shown in Figures 5 to 10.
  • the present application also provides a computer-readable medium, which stores program code.
  • when the program code runs on a computer, the computer executes the method in the embodiments shown in Figures 5 to 10.
  • the present application also provides a system, which includes the aforementioned model training management function and multiple model training functions.
  • "At least one of" herein refers to all or any combination of the listed items.
  • for example, "at least one of A, B, and C" may refer to the following seven situations: A exists alone, B exists alone, C exists alone, A and B exist at the same time, A and C exist at the same time, B and C exist at the same time, and A, B, and C exist at the same time.
  • At least one herein refers to one or more.
  • “More than one” refers to two or more.
  • B corresponding to A means that B is associated with A, and B can be determined according to A.
  • determining B according to A does not mean determining B only according to A, but B can also be determined according to A and/or other information.
  • the terms “include”, “comprises”, “has” and their variations all mean “including but not limited to”, unless otherwise specifically emphasized.
  • indication may include direct indication and indirect indication, and may also include explicit indication and implicit indication.
  • the information indicated by a certain information (such as the first information described above) is referred to as information to be indicated.
  • the information to be indicated can be directly indicated, such as the information to be indicated itself or the index of the information to be indicated.
  • the information to be indicated can also be indirectly indicated by indicating other information, wherein the other information has an association relationship with the information to be indicated. It is also possible to indicate only a part of the information to be indicated, while the other parts of the information to be indicated are known or agreed in advance.
  • the indication of specific information can also be achieved by means of the arrangement order of each information agreed in advance (such as specified by the protocol), thereby reducing the indication overhead to a certain extent.
  • pre-configuration can be achieved by pre-saving corresponding codes, tables or other methods that can be used to indicate relevant information in a device (for example, a first terminal device), and the present application does not limit its specific implementation method.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the contributing part of it, or a part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes a number of instructions to enable a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

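The present application provides a method, apparatus, and system for model training management. The method includes: a model training management entity receives training status information of a first model training entity, where the training status information indicates at least one model training task of the first model training entity; the model training management entity obtains multiple pieces of computing power resource information, which respectively indicate the idle computing power resources that multiple model training entities have available for model training; the model training management entity determines a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information; and the model training management entity sends first training task configuration information to the first target model training entity, where the first training task configuration information indicates assisting the first model training entity in executing a target model training task among the at least one model training task. The efficiency of model training can thereby be improved.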

Description

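Method, apparatus, and system for model training management
This application claims priority to Chinese patent application No. 202211181988.7, entitled "Method, apparatus, and system for model training management" and filed with the China National Intellectual Property Administration on September 27, 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of communication technologies, and more specifically, to a method, apparatus, and system for model training management.
Background
To raise the level of network intelligence and automation, inference models, such as artificial intelligence (AI) models and machine learning (ML) models, are being used in more and more technical fields. A model is usually obtained through training; for example, a model training entity may be configured with a model training task and execute it with training data to obtain a usable model. A communication system is usually configured with multiple model training entities; for example, multiple base stations may each deploy a model training entity. The training tasks of the model training entity in each base station may be set according to the needs of that base station.
However, because the network environments and network requirements of different model training entities may differ and change dynamically, the model training tasks of different model training entities differ accordingly. As a result, some model training entities in the communication system may be overloaded with model training tasks while others are largely idle, which lowers the overall efficiency of model training in the communication system.
Summary
The present application provides a method, apparatus, and system for model training management that can improve the efficiency of model training.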
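According to a first aspect, a method for model training management is provided. The method may be implemented by a model training management entity or by a chip in a model training management entity, and includes: the model training management entity receives training status information of a first model training entity, where the training status information indicates at least one model training task of the first model training entity; the model training management entity obtains multiple pieces of computing power resource information, which respectively indicate the idle computing power resources that multiple model training entities have available for model training; the model training management entity determines a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information; and the model training management entity sends first training task configuration information to the first target model training entity, where the first training task configuration information indicates assisting the first model training entity in executing a target model training task among the at least one model training task.
There may be one or more target model training tasks. There may be one or more first target model training entities.
Based on this technical solution, the model training management entity can manage and orchestrate the computing power resources of the multiple model training entities and allocate a training task of the first model training entity to other model training entities with sufficient computing power resources to assist in completing the training. This reduces the waiting time of the training tasks of the first model training entity and improves the efficiency of model training.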
结合第一方面,在第一方面的某些实现方式中,模型训练管理实体获取多个算力资源信息,包括:模型训练管理实体分别周期接收来自多个模型训练实体的多个算力资源信息。
多个模型训练实体发送算力资源信息的周期可以相同或者不同。
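Based on this technical solution, in a communication system whose network state changes frequently, each model training entity can report its own computing power resource information in a timely manner, so that the model training management entity can orchestrate and manage in a timely manner, improving the efficiency of model training.
With reference to the first aspect, in some implementations of the first aspect, the model training management entity obtaining the multiple pieces of computing power resource information includes: the model training management entity sends multiple pieces of computing power resource query information to the multiple model training entities respectively; and the model training management entity receives the multiple pieces of computing power resource information from the multiple model training entities respectively.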
可选地,模型训练管理实体基于训练状态信息确定第一模型训练实体不能够独立执行全部所述至少一个模型训练任务时,分别向多个模型训练实体发送多个算力资源查询信息。
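Based on this technical solution, a model training entity can return computing power resource information in response to query information from the model training management entity. That is, the model training management entity can send query information to the model training entities only when there is a need, for example when it has determined that assisted training is to be allocated, which saves transmission resources and yields real-time computing power resource information.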
结合第一方面,在第一方面的某些实现方式中,模型训练管理实体接收第一模型训练实体的训练状态信息,包括:模型训练管理实体周期接收来自第一模型训练实体的训练状态信息。
多个模型训练实体发送训练状态信息的周期可以相同或者不同。
基于本技术方案,对于网络状态频繁变化的通信系统中,各个模型训练实体可以及时上报自己的训练状态信息,使得模型训练管理实体可以及时进行编排与管理,提升模型训练的效率。
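With reference to the first aspect, in some implementations of the first aspect, the method further includes: the model training management entity sends training task configuration notification information to the first model training entity, where the training task configuration notification information indicates that the first target model training entity assists in executing the target model training task.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: the model training management entity determines a target length of the training data used by the first target model training entity to assist in executing the target model training task; and the model training management entity sends second training task configuration information to the first model training entity, where the second training task configuration information indicates that the first target model training entity uses training data of the target length to assist in executing the target model training task.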
结合第一方面,在第一方面的某些实现方式中,训练状态信息还指示用于完成目标模型训练任务的训练数据的总长度,模型训练管理实体确定目标长度,包括:模型训练管理实体基于总长度和第一目标模型训练实体的算力资源信息确定目标长度。
基于本技术方案,模型训练管理实体可以对目标模型训练任务进行分解,利用多个模型训练实体协作完成训练任务,可以减轻目标模型训练任务的原训练主体即第一模型训练实体的训练任务负担,充分利用多个模型训练实体的资源,减少训练任务的训练等待时间。
结合第一方面,在第一方面的某些实现方式中,方法还包括:模型训练管理实体接收来自第一目标模型训练实体的协助训练反馈信息,协助训练反馈信息指示以下至少一项:第一目标模型训练实体执行目标模型训练任务达到的精度、第一目标模型训练实体执行目标模型训练任务耗费的时长、第一目标模型训练实体执行目标模型训练任务的执行进度、第一目标模型训练实体执行目标模型训练任务占用的资源数量。
结合第一方面,在第一方面的某些实现方式中,方法还包括:模型训练管理实体接收来自第一目标模型训练实体的网络状态更改信息;模型训练管理实体基于网络状态更改信息和多个算力资源信息确定将第一目标模型训练实体更换为多个模型训练实体中的第二目标模型训练实体;模型训练管理实体向第二目标模型训练实体发送第三训练任务配置信息,第三训练任务配置信息指示协助第一模型训练实体执行目标模型训练任务。
基于本技术方案,模型训练实体在协助执行模型训练任务的过程中,可以向模型训练管理实体反馈执行的情况,使得模型训练管理实体能够获知执行目标模型训练任务的情况,并能够及时作出调整,提升模型训练的可靠性。
结合第一方面,在第一方面的某些实现方式中,方法还包括:模型训练管理实体获取策略信息,策略信息指示基于多个算力资源信息在多个模型训练实体确定第一目标模型实体的方式,和/或,指示基于用于完成目标模型训练任务的训练数据的总长度确定用于第一目标模型实体协助执行目标模型训练任务的训练数据的目标长度的方式。
第二方面,提供了一种模型训练管理的方法,该方法可以由模型训练实体或者模型训练实体中的芯片实现,方法包括:模型训练实体向模型训练管理实体发送训练状态信息,训练状态信息指示模型训练实体具有的至少一个模型训练任务;模型训练实体向模型训练管理实体发送算力资源信息,算力资源信息指示模型训练实体具有的用于模型训练的空闲算力资源;模型训练实体接收来自模型训练管理实体的训练任务信息,训练任务信息指示协助执行至少一个模型训练任务中的目标模型训练任务的第一目标模型训练实体。
基于本技术方案,模型训练实体可以向模型训练管理实体上报算力资源信息和训练状态信息,使得模型训练管理实体可以对多个模型训练实体的算力资源进行管理编排,将模型训练实体的训练任务分配给其他算力资源充足的模型训练实体协助完成训练,减少模型训练实体的训练任务的训练等待时间,提高模型训练的效率。
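With reference to the second aspect, in some implementations of the second aspect, the model training entity sending the computing power resource information to the model training management entity includes: the model training entity periodically sends the computing power resource information to the model training management entity; or, the model training entity receives computing power resource query information from the model training management entity, and the model training entity sends the computing power resource information to the model training management entity.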
结合第二方面,在第二方面的某些实现方式中,模型训练实体向模型训练管理实体发送训练状态信息,包括:模型训练实体向模型训练管理实体周期发送训练状态信息;或者,模型训练实体基于触发事件向模型训练管理实体发送训练状态信息。
结合第二方面,在第二方面的某些实现方式中,训练任务信息还指示第一目标模型训练实体采用目标长度的训练数据协助执行目标模型训练任务。
结合第二方面,在第二方面的某些实现方式中,方法还包括:模型训练实体接收来自目标模型训练实体的模型训练报告信息,模型训练报告信息指示完成目标模型训练任务获得的子模型;模型训练实体基于子模型执行模型聚合。
第二方面的各种实现方式是与第一方面的各种实现方式对应的第一模型训练实体的方法,关于第二方面的各种实现方式的有益技术效果,可以参考第一方面的相关实现方式的说明,在此不予以赘述。
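According to a third aspect, a method for model training management is provided. The method may be implemented by a model training entity or by a chip in a model training entity, and includes: the model training entity sends computing power resource information to a model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has available for model training; the model training entity receives first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting a first model training entity in executing a target model training task; and the model training entity assists the first model training entity in executing the target model training task.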
基于本技术方案,模型训练实体可以向模型训练管理实体上报算力资源信息和训练状态信息,使得模型训练管理实体可以对多个模型训练实体的算力资源进行管理编排,将模型训练实体的训练任务分配给其他算力资源充足的模型训练实体协助完成训练,减少模型训练实体的训练任务的训练等待时间,提高模型训练的效率。
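With reference to the third aspect, in some implementations of the third aspect, the model training entity sending the computing power resource information to the model training management entity includes: the model training entity periodically sends the computing power resource information to the model training management entity; or, the model training entity receives computing power resource query information from the model training management entity, and the model training entity sends the computing power resource information to the model training management entity.
With reference to the third aspect, in some implementations of the third aspect, the model training entity assisting the first model training entity in executing the target model training task includes: the model training entity obtains target training data; and the model training entity uses the target training data to assist the first model training entity in executing the target model training task.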
结合第三方面,在第三方面的某些实现方式中,方法还包括:模型训练实体向第一模型训练实体发送模型训练报告信息,模型训练报告信息指示完成目标模型训练任务获得的子模型。
结合第三方面,在第三方面的某些实现方式中,方法还包括:模型训练实体向模型训练管理实体发送协助训练反馈信息,协助训练反馈信息指示以下至少一项:模型训练实体执行目标模型训练任务达到的精度、模型训练实体执行目标模型训练任务耗费的时长、模型训练实体执行目标模型训练任务的执行进度、模型训练实体执行目标模型训练任务占用的资源数量。
结合第三方面,在第三方面的某些实现方式中,方法还包括:模型训练实体向模型训练管理实体发送网络状态更改信息,网络状态更改信息指示模型训练实体不能够完成目标模型训练任务。
第三方面的各种实现方式是与第一方面的各种实现方式对应的第一目标模型训练实体的方法,关于第三方面的各种实现方式的有益技术效果,可以参考第一方面的相关实现方式的说明,在此不予以赘述。
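According to a fourth aspect, a method for model training management is provided. The method may be applied to a communication system that includes a model training management entity and multiple model training entities, and includes: a first model training entity sends training status information to the model training management entity, and the model training management entity receives the training status information of the first model training entity, where the training status information indicates at least one model training task of the first model training entity; the model training management entity obtains multiple pieces of computing power resource information, which respectively indicate the idle computing power resources that the multiple model training entities have available for model training; the model training management entity determines a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information; the model training management entity sends first training task configuration information to the first target model training entity, and the first target model training entity receives the first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting the first model training entity in executing a target model training task; and the first target model training entity assists the first model training entity in executing the target model training task.
Based on this technical solution, a model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and orchestrate the computing power resources of the multiple model training entities and allocate a training task of a model training entity to other model training entities with sufficient computing power resources to assist in completing the training. This reduces the waiting time of the training tasks of that model training entity and improves the efficiency of model training.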
第四方面的各种实现方式是与第一方面的各种实现方式对应的系统的方法,关于第四方面的各种实现方式的有益技术效果,可以参考第一方面的相关实现方式的说明,在此不予以赘述。
第五方面,提供了一种通信装置,该装置包括收发模块和处理模块,其中收发模块用于接收第一模型训练实体的训练状态信息,训练状态信息指示第一模型训练实体具有的至少一个模型训练任务;收发模块还用于获取多个算力资源信息,多个算力资源信息分别指示多个模型训练实体具有的用于模型训练的空闲算力资源;处理模块用于基于多个算力资源信息在多个模型训练实体中确定第一目标模型训练实体;收发模块还用于向第一目标模型训练实体发送第一训练任务配置信息,第一训练任务配置信息指示协助第一模型训练实体执行至少一个模型训练任务中的目标模型训练任务。
第五方面所述的通信装置具有实现第一方面,或第一方面的任一可能的实现方式中的方法的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的单元。
第五方面的各种实现方式是与第一方面的各种实现方式对应的模型训练管理实体的装置,关于第五方面的各种实现方式的有益技术效果,可以参考第一方面的相关实现方式的说明,在此不予以赘述。
第六方面,提供了一种通信装置,该装置包括收发模块和处理模块,处理模块用于生成训练状态信息;收发模块用于向模型训练管理实体发送训练状态信息,训练状态信息指示模型训练实体具有的至少一个模型训练任务;收发模块还用于向模型训练管理实体发送算力资源信息,算力资源信息指示模型训练实体具有的用于模型训练的空闲算力资源;收发模块还用于接收来自模型训练管理实体的训练任务信息,训练任务信息指示协助执行至少一个模型训练任务中的目标模型训练任务的第一目标模型训练实体。
第六方面所述的通信装置具有实现第二方面,或第二方面的任一可能的实现方式中的方法的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的单元。
第六方面的各种实现方式是与第一方面的各种实现方式对应的第一模型训练实体的装置,关于第六方面的各种实现方式的有益技术效果,可以参考第一方面的相关实现方式的说明,在此不予以赘述。
第七方面,提供了一种通信装置,该装置包括收发模块和处理模块,收发模块用于向模型训练管理实体发送算力资源信息,算力资源信息指示模型训练实体具有的用于模型训练的空闲算力资源;收发模块还用于接收来自模型训练管理实体的第一训练任务配置信息,第一训练任务配置信息指示协助第一模型训练实体执行目标模型训练任务;处理模块用于协助第一模型训练管理实体执行目标模型训练任务。
第七方面所述的通信装置具有实现第三方面,或第三方面的任一可能的实现方式中的方法的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的单元。
第七方面的各种实现方式是与第一方面的各种实现方式对应的第一目标模型训练实体的装置,关于第七方面的各种实现方式的有益技术效果,可以参考第一方面的相关实现方式的说明,在此不予以赘述。
第八方面,提供一种通信装置,包括处理器和存储器。可选地,还可以包括收发器。其中,存储器用于存储计算机程序,处理器用于调用并运行存储器中存储的计算机程序,并控制收发器收发信号,以使通信装置执行如第一方面至第四方面中的任一方面,或这些方面中的任一方面的任一可能的实现方式中的方法。
示例性地,该通信装置为模型训练管理功能。
第九方面,提供一种通信装置,包括处理器和存储器。可选地,还可以包括收发器。其中,存储器用于存储计算机程序,处理器用于调用并运行存储器中存储的计算机程序,并控制收发器收发信号,以使通信装置执行如第一方面至第四方面中的任一方面,或这些方面中的任一方面的任一可能的实现方式中的方法。
示例性地,该通信装置为模型训练功能。
第十方面,提供一种通信装置,包括处理器和通信接口,所述通信接口用于接收数据和/或信息,并将接收到的数据和/或信息传输至所述处理器,所述处理器处理所述数据和/或信息,以及,通信接口还用于输出经处理器处理之后的数据和/或信息,以使得如第一方面至第四方面中的任一方面,或这些方面中的任一方面的任一可能的实现方式中的方法被执行。
其中,该通信装置可以为应用于模型训练管理功能的芯片。
第十一方面,提供一种通信装置,包括处理器和通信接口,所述通信接口用于接收数据和/或信息,并将接收到的数据和/或信息传输至所述处理器,所述处理器处理所述数据和/或信息,以及,通信接口还用于输出经处理器处理之后的数据和/或信息,以使得如第一方面至第四方面中的任一方面,或这些方面中的任一方面的任一可能的实现方式中的方法被执行。
其中,该通信装置可以为应用于模型训练功能的芯片。
第十二方面,提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机指令,当计算机指令在计算机上运行时,使得如第一方面至第四方面中的任一方面,或这些方面中的任一方面的任一可能的实现方式中的方法被执行。
第十三方面,提供一种计算机程序产品,所述计算机程序产品包括计算机程序代码,当所述计算机程序代码在计算机上运行时,使得如第一方面至第四方面中的任一方面,或这些方面中的任一方面的任一可能的实现方式中的方法被执行。
第十四方面,提供一种无线通信系统,包括如第五方面所述的通信装置,和/或如第六方面所述的通信装置,和/或如第七方面所述的通信装置。
附图说明
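FIG. 1 is a schematic structural diagram of a communication system to which the embodiments of the present application are applicable;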
图2是本申请实施例适用的第一种应用场景的示意性结构图;
图3是本申请实施例适用的第二种应用场景的示意性结构图;
图4是多个模型训练实体分别执行模型训练任务的示意图;
图5是本申请实施例提供的一种模型训练管理的方法的示意性流程图;
图6是本申请实施例提供的一种获取训练状态信息和算力资源信息的方式的示意性流程图;
图7是本申请实施例提供的一种目标模型训练任务整体训练的方法的示意性流程图;
图8是本申请实施例提供的一种目标模型训练任务分解训练的方法的示意性流程图;
图9是本申请实施例提供的一种执行目标模型训练任务的过程中进行反馈的方法的示意性流程图;
图10是本申请实施例提供的一种两个模型训练管理实体进行模型训练管理的一种实现方式的示意性流程图;
图11至图13是本申请实施例提供的可能的装置的示意性结构图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
本申请实施例的方法可以应用于长期演进技术(long term evolution,LTE)系统,长期演进高级技术(long term evolution-advanced,LTE-A)系统,增强的长期演进技术(enhanced long term evolution-advanced,eLTE),第五代(the 5th Generation,5G)移动通信系统新空口(New Radio,NR)系统,也可以扩展到类似的无线通信系统中,如无线保真(wireless-fidelity,WiFi),全球微波互联接入(worldwide interoperability for microwave access,WIMAX),以及第三代合作伙伴计划(3rd generation partnership project,3gpp)相关的蜂窝系统。
为了清楚,以下对本申请实施例中的部分术语进行解释。
1.推理模型(也可以简称为模型):从数据中学习到的,可以实现特定功能/映射的函数。模型可以基于人工智能(artificial intelligence,AI)或者机器学习(machine learning,ML)的技术得到,因此,也可以称为人工智能/AI模型、机器学习/ML模型等。常用的用于生成AI/ML模型的算法包括:监督学习、无监督学习、增强学习,对应的模型可以称为监督学习模型、无监督学习模型、增强学习模型。 示例的,监督学习模型可以是分类模型、预测模型、回归模型等,无监督学习模型可以是聚类模型。此外,模型还可以基于神经网络(neural network,NN)技术得到,这种模型也可以称为神经网络模型、深度学习模型等。
2.模型训练实体
推理模型的训练实体称为模型训练实体。示例性地,模型训练实体的能力或功能可以部署在某个网元上,该网元称为模型训练网元;模型训练实体的能力或功能也可以部署在其他设备上,对此,本申请实施例不作限定;为叙述方便,本申请实施例以模型训练网元为例进行说明,但都可以替换为训练推理模型的能力或功能的其他设备。
3.模型训练管理实体
模型训练管理实体用于对多个模型训练实体的训练任务和算力资源进行管理编排。示例性地,模型训练管理实体的能力或功能可以部署在某个网元上,该网元称为模型训练管理网元;模型训练管理实体的能力或功能也可以部署在其他设备上,对此,本申请实施例不作限定;为叙述方便,本申请实施例以模型训练管理网元为例进行说明,但都可以替换为管理模型训练网元的能力或功能的其他设备。
4.模型推理实体
基于模型进行推理或预测的实体称为模型推理实体。示例性地,模型推理实体的能力或功能可以部署在某个网元上,该网元称为模型推理网元;模型推理实体的能力或功能也可以部署在其他设备上,对此,本申请实施例不作限定;为叙述方便,本申请实施例以模型推理网元为例进行说明,但都可以替换为基于模型进行推理或预测的能力或功能的其他设备。
5.模型训练任务
模型训练任务是模型训练实体对模型进行模型训练时能够划分的基本工作单位。
上文对本申请实施例涉及的相关术语进行了说明,以下结合图1至图4对本申请实施例适用的应用场景进行说明。
图1是本申请实施例适用的一种通信系统的示意性结构图。首先对该通信系统100中可能涉及的装置进行说明。
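1. Model training management network element 110: can be used to manage and orchestrate the training tasks and computing power resources of multiple model training entities. The model training management network element 110 may be deployed in a network management system (NMS) or in an element management system (EMS). The NMS is used for the operation, administration, and maintenance of a network and may also be called a cross-domain management system. The EMS is used to manage one or more network elements of a specific type and may also be called a domain management system or a single-domain management system. For example, the model training management network element 110 may be connected to at least one model training network element; see FIG. 1, where the model training management network element 110 is connected to the model training network element 121 and the model training network element 122 respectively.
2. Model training network element 121 and model training network element 122: can be used to train models. The model training network element 121 may be deployed in an EMS, an NMS, a network device in the radio access network (RAN) domain, or a core network element in the core network domain, such as a network data analytics function (NWDAF) network element. Similarly, the model training network element 122 may also be deployed in an EMS, an NMS, a network device, or a core network element. It should be noted that different model training network elements may be deployed in the same system, device, or network element; for example, the model training network elements 121 and 122 may be deployed in the same network device. Alternatively, different model training network elements may be deployed in different systems, devices, or network elements; for example, the model training network elements 121 and 122 may be deployed in different network devices, or the model training network element 121 may be deployed in a network device while the model training network element 122 is deployed in a core network element. The present application does not specifically limit this.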
3、模型推理网元130:能够用于基于模型进行推理或预测。模型推理网元130可以部署于EMS中,比如EMS中的管理数据分析功能(management data analytics function,MDAF)、或者模型推理网元130也可以部署于RAN域的网络设备中、或者核心网域中的核心网网元,比如,NWDAF网元。
需要说明的是,图1以模型推理网元130与模型训练网元121相连进行示例性说明,模型推理网元130和模型训练网元121可以部署于不同的设备,比如模型训练网元121可以部署于NMS,模型推理网元130可以部署于网络设备;模型推理网元130和模型训练网元121也可以部署于同一个的设备,比如模型推理网元130和模型训练网元121可以部署于同一个网络设备或核心网网元,本申请对此不作特别限定。
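It should also be noted that, although not shown in the figure, the communication system 100 may include multiple model training management network elements; for example, a model training management network element may be deployed in each of the NMS and the EMS. The communication system 100 may also include multiple model inference network elements; for example, the model training network element 122 may also be connected to a model inference network element. The present application does not limit the number of model training network elements, model training management network elements, or model inference network elements.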
为了更加便于理解本申请实施例,以下结合图2至图4对可能的本申请实施例适用的应用场景进行说明。
图2是本申请实施例适用的第一种应用场景的示意性结构图。
参见图2,NMS 210部署有模型训练管理网元211和模型训练网元212,NMS 210可以对网络设备220、网络设备230、NWDAF 240和NWDAF 250进行管理,网络设备220上部署有模型训练网元221,网络设备230上部署有模型训练网元231,NWDAF 240上部署有模型训练网元241,NWDAF 250上部署有模型训练网元251。模型训练管理网元211和模型训练网元221之间可以通过NMS 210和网络设备220之间的通信接口进行通信。模型训练管理网元211和模型训练网元231之间可以通过NMS 210和网络设备230之间的通信接口进行通信。模型训练管理网元211和模型训练网元241之间可以通过NMS 210和NWDAF 240之间的通信接口进行通信。模型训练管理网元211和模型训练网元251之间可以通过NMS 210和NWDAF 250之间的通信接口进行通信。模型训练管理网元211和模型训练网元212之间可以通过NMS 210中的内部接口进行通信。
图3是本申请实施例适用的第二种应用场景的示意性结构图。
参见图3,NMS 310部署有模型训练管理网元311和模型训练网元312,EMS 320部署有模型训练管理网元321和模型训练网元322。NMS 310可以通过EMS 320管理网络设备330、网络设备340、NWDAF 350和NWDAF 360。模型训练网元331部署于网络设备330,模型训练网元341部署于网络设备340,模型训练网元351部署于NWDAF 350,模型训练网元361部署于NWDAF 360。模型训练管理网元311和模型训练管理网元321可以共同对模型训练网元312、模型训练网元322、模型训练网元331、模型训练网元341、模型训练网元351、模型训练网元361进行管理,比如EMS 320中的模型训练管理网元321可以分别与模型训练网元312、模型训练网元322、模型训练网元331、模型训练网元341、模型训练网元351、模型训练网元361进行通信,获取每个模型训练网元的信息,NMS 310中的模型训练管理网元311可以为模型训练管理网元321提供分析信息的策略。
模型训练管理网元311和模型训练管理网元321之间可以通过NMS 310和EMS 320之间的接口进行通信。模型训练管理网元321和模型训练网元331之间可以通过EMS 320和网络设备330之间的通信接口进行通信。模型训练管理网元321和模型训练网元341之间可以通过EMS 320和网络设备340之间的通信接口进行通信。模型训练管理网元321和模型训练网元351之间可以通过EMS 320和NWDAF 350之间的通信接口进行通信。模型训练管理网元321和模型训练网元361之间可以通过EMS 320和NWDAF 360之间的通信接口进行通信。模型训练管理网元311和模型训练网元312之间可以通过NMS 310中的内部接口进行通信。模型训练管理网元321和模型训练网元322之间可以通过EMS 320中的内部接口进行通信。
另外,虽然图未示出,NMS也可以通过多个EMS管理多个模型训练网元,本申请对此不作特别限定。
需要说明的是,本申请的方案可以应用于包含相应实体的其它系统中,本申请不作限定。可以理解的是,上述实体或者功能既可以是硬件设备中的网络元件,也可以是在专用硬件上运行软件功能,或者是平台(例如,云平台)上实例化的虚拟化功能。可选的,上述实体或者功能可以由一个设备实现,也可以由多个设备共同实现,还可以是一个设备内的一个功能模块,本申请实施例对此不作具体限定。
通过上文可知,通信系统中可以部署多个模型训练网元,每个模型训练网元中可以具有多个训练任务。然而,由于不同的模型训练网元所处的网络环境和网络需求可能不同且动态变化,不同的模型训练网元具有的模型训练任务也会随之不同。为了便于理解本申请实施例,以下结合图4进行说明。
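FIG. 4 is a schematic diagram of multiple model training entities each executing model training tasks. FIG. 4 shows three model training network elements, each executing model training tasks. Because the network environments and network requirements of different model training entities may differ and change dynamically, their model training tasks differ accordingly. Referring to FIG. 4, the training tasks of model training network element #1 exceed its capacity, and some training tasks are queued. Model training network element #3 is also fully loaded and has no idle computing power resources. Model training network element #2 still has a large amount of unused idle computing power resources. Thus, within a given period, the model training tasks are unevenly distributed across the model training network elements: some model training tasks cannot be executed in time, while some computing power resources sit idle, which lowers the overall efficiency of model training in the communication system.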
本申请提出了一种模型训练管理的方法、装置和系统,能够提高模型训练的效率,以下结合图5首先对模型训练管理的方法进行说明。
首先需要说明的是,为了便于描述,图5所示的实施例以一个模型训练管理实体和两个模型训练实体进行描述,其中,两个模型训练实体分别用模型训练实体#1和模型训练实体#2标识,模型训练实体#1可以是被协助目标模型训练任务的第一模型训练实体的一种示例,模型训练实体#2可以是协助执行目标模型训练任务的目标模型训练实体的一种示例。本申请对模型训练实体的数量不作特别限定,示例性地,模型训练实体#2的数量可以是一个或多个,即可以有一个或多个目标模型训练实体协助执行目标模型训练任务。
模型训练管理实体可以是图1至图3所述的任意的模型训练管理实体,模型训练实体#1和模型训练实体#2可以是图1至图3所述的任意的模型训练实体,本申请对此不作特别限定。
图5是本申请实施例提供的一种模型训练管理的方法的示意性流程图。
S501,模型训练实体#1向模型训练管理实体发送训练状态信息;
对应地,模型训练管理实体接收来自模型训练实体#1的该训练状态信息。
训练状态信息指示模型训练实体#1具有的至少一个模型训练任务。
可选地,训练状态信息包括以下信息中的至少一项:模型训练任务标识信息、模型训练任务的优先级信息、模型训练任务的进程信息、模型训练任务的性能信息。
其中,模型训练任务标识信息指示模型训练实体#1具有的至少一个模型训练任务的训练标识,比如,模型训练实体#1具有三个模型训练任务1-3,那么模型训练任务标识信息可以包括三个模型训练任务的标识。
模型训练任务的优先级信息指示模型训练实体#1具有的至少一个模型训练任务的优先级。比如,优先级信息可以分别指示模型训练实体#1具有的至少一个模型训练任务中每一个模型训练任务的优先级,例如优先级信息指示模型训练任务1的优先级为高、模型训练任务2和3的优先级为低。优先级可以用高、中、低表示,也可以用数字(1、2、3等)表示,数字越小表示优先级越高。或者,优先级信息也可以指示模型训练实体#1具有的优先级为高的模型训练任务的个数,例如优先级信息指示模型训练实体#1具有1个优先级为高的模型训练任务。需要说明的是,本申请对模型训练任务的优先级的设置不作任何限定,比如优先级可以基于请求执行的时间先后顺序确定,或者基于模型训练任务的重要程度确定等。
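The progress information of the model training tasks indicates the progress of model training at model training entity #1. For example, the progress information may indicate the progress state of each of the at least one model training task of model training entity #1, where the progress state may be waiting to run, running, or finished. For example, the progress information indicates that model training task 1 has finished, model training task 2 is running, and model training task 3 has finished. Alternatively, the progress information may indicate the overall progress of model training at model training entity #1; for example, it may indicate the number of model training tasks that are not yet finished (that is, running or waiting to run), such as indicating that two model training tasks at model training entity #1 remain unfinished.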
模型训练任务的性能信息指示模型训练实体#1进行模型训练的性能。示例性地,性能信息可以指示以下至少一项:模型训练任务进行单次训练需要的时间、模型训练任务进行单次训练需要占用的算力资源、模型训练任务需要训练的次数、模型训练任务进行多次训练时每次训练需要的平均时间、模型训练任务进行多次训练时每次训练需要占用的平均算力资源、模型训练任务进行多次训练需要的总时间、模型训练任务需要多次训练需要占用的总算力资源、执行多个模型训练任务的平均训练时间、执行多个模型训练任务需要占用的平均算力资源。
可选地,训练状态信息还指示模型训练实体#1请求协作训练。即模型训练实体#1可以确定是否请求协作训练。
其中,协作训练可以是指不是模型训练实体#1单独执行某个模型训练任务,比如,该模型训练任务可以由多个模型训练实体协作执行,再比如,该训练任务可以是由其它模型训练实体(例如模型训练实体#2)协作执行。
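In a first possible implementation, model training entity #1 determines whether to request collaborative training based on whether its idle computing power resources can meet the needs of its training tasks. If they can, model training entity #1 determines not to request collaborative training. If they cannot, model training entity #1 requests collaborative training.
For example, if model training entity #1 expects to need 50% of its computing power resources to meet the needs of its training tasks but currently has only 30% idle, model training entity #1 determines to request collaborative training.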
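The first implementation above reduces to a comparison between the estimated demand and the idle share. The following is a minimal sketch of that check, not the method prescribed by the application; the function name and the use of fractions of total computing power are assumptions made for illustration.

```python
def needs_collaborative_training(required_fraction: float, idle_fraction: float) -> bool:
    """Return True when the idle computing power cannot cover the estimated demand.

    Both arguments are fractions of the entity's total computing power (0.0 to 1.0).
    This representation is a hypothetical choice; the application does not fix one.
    """
    if not (0.0 <= required_fraction <= 1.0 and 0.0 <= idle_fraction <= 1.0):
        raise ValueError("fractions must lie between 0.0 and 1.0")
    return required_fraction > idle_fraction


# The example from the text: 50% needed, only 30% idle.
print(needs_collaborative_training(0.50, 0.30))  # True
```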
在第二种可能的实现方式中,模型训练实体#1基于模型训练任务的优先级确定是否需要进行协作训练。
示例性地,模型训练实体#1可能具有多个等待运行的训练任务,多个训练任务中优先级高的训练任务将会比优先级低的训练任务先运行。若模型训练实体#1具有比新增的训练任务的优先级还高的训练任务,且空闲的算力资源不够完成全部等待运行的训练任务,即空闲的算力资源中扣除优先级高的训练任务需要的资源后,剩余的算力资源不能够满足优先级较高的训练任务的需求,那么模型训练实体#1可以确定请求协作训练。
可选地,训练状态信息还指示请求协作训练的目标模型训练任务。即模型训练实体#1可以确定请求协作训练的目标模型训练任务。
需要说明的是,请求协作训练的目标模型训练任务可以是模型训练实体#1中的至少一个模型训练任务中的任意一个或多个,即目标模型训练任务可以是新增的模型训练任务,也可以是其它模型训练任务,本申请对此不作特别限定。
在第一种可能的实现方式中,模型训练实体#1根据至少一个训练任务的训练进程确定目标模型训练任务。示例性地,模型训练实体#1将处于等待运行的模型训练任务确定为目标模型训练任务。
在第二种可能的实现方式中,模型训练实体#1根据至少一个模型训练任务的优先级确定目标模型训练任务。示例性地,模型训练实体#1将优先级低的模型训练任务确定为目标模型训练任务,例如,模型训练实体#1可以设置每个模型训练任务的优先级值,比如数值1至10表示从优先级高至优先级低的不同的优先级,如果模型训练任务的优先级值大于或等于特定阈值(例如5),那么将该模型训练任务确定为目标模型训练任务。其中,该特定阈值可以是预先配置的或者是动态确定(比如根据算力或者当前的网络状态确定)的,本申请对此不作特别限定。
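The priority-threshold rule in the second implementation can be sketched as follows. This is only an illustration under stated assumptions: the `TrainingTask` class, its field names, and `pick_target_tasks` are hypothetical, and the threshold of 5 is the example value from the text.

```python
from dataclasses import dataclass


@dataclass
class TrainingTask:
    task_id: str
    priority: int  # 1 (highest priority) to 10 (lowest priority), as in the text


def pick_target_tasks(tasks: list[TrainingTask], threshold: int = 5) -> list[str]:
    """Return the IDs of tasks whose priority value is at or above the threshold.

    A larger value means a lower priority, so these are the low-priority
    tasks proposed for collaborative training.
    """
    return [t.task_id for t in tasks if t.priority >= threshold]


tasks = [TrainingTask("task1", 1), TrainingTask("task2", 5), TrainingTask("task3", 8)]
print(pick_target_tasks(tasks))  # ['task2', 'task3']
```

A dynamically determined threshold, which the text also allows, would simply be passed in as the `threshold` argument.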
示例性地,模型训练实体#1可以直接指示请求协作训练的目标模型训练任务,例如可以在训练状态信息中携带请求协作训练的模型训练任务的标识,比如携带模型训练任务2的标识,表示建议模型训练任务2进行协作训练。
再示例性地,模型训练实体#1也可以间接指示建议进行协作训练的目标模型训练任务。例如,训练状态信息可以指示将训练进程处于等待运行状态的训练任务进行协作训练,或者,也可以指示将训练进程处于等待运行状态、以及优先级低的训练任务进行协作训练。
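It should be noted that model training entity #1 and the model training management entity may also be preconfigured with, or negotiate in advance, how to select the training tasks for collaborative training; the present application does not specifically limit this.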
可以理解的是,如果训练状态信息包括目标模型训练任务的标识信息,那么该目标模型训练任务的标识信息可以隐含请求协作训练。
还可以理解的是,该目标模型训练任务是模型训练实体#1建议的进行协作训练的模型训练任务,模型训练管理实体可以根据实际的网络状态和其它模型训练实体的算力资源更改协作训练的目标模型训练任务,本申请对此不作特别限定。
可选地,模型训练实体#1可以周期向模型训练管理实体发送训练状态信息,比如多个模型训练实体可以被配置为周期上报训练状态信息,其中上报的周期可以是预先设置的或者模型训练管理实体配置的,本申请对此不作特别限定。或者,模型训练实体#1可以基于触发向模型训练管理实体发送训练状态信息,比如模型训练实体可以在新增模型训练任务的情况下向模型训练管理实体发送训练状态信息,在这种情况下,训练状态信息可以仅包括新增的模型训练任务的标识信息、优先级信息等信息。为了便于理解本申请实施例,有关模型训练实体向模型训练管理实体发送训练状态信息的更为详细的描述可以参见下文图6的介绍,在此不予赘述。
S502,模型训练管理实体获取多个算力资源信息。
多个算力资源信息分别指示多个模型训练实体具有的用于模型训练的空闲算力资源。
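For example, the model training management entity receives computing power resource information #1 from model training entity #1, which indicates the idle computing power resources that model training entity #1 has available for model training. The model training management entity receives computing power resource information #2 from model training entity #2, which indicates the idle computing power resources that model training entity #2 has available for model training.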
算力资源信息可以包括以下至少一项:硬件资源信息、资源使用信息。其中,硬件资源信息可以指示硬件资源的性能,比如硬件资源信息可以指示以下至少一项:硬件资源的类型、硬件资源的核数、硬件资源的处理频率。或者,硬件资源信息也可以指示量化后的运算能力,比如每秒浮点运算次数(floating-point operations per second,FLOPS)。进而,模型训练管理功能可以通过上述信息确定硬件资源的运算性能。另外,资源使用信息可以指示硬件资源的利用率,比如资源使用信息指示硬件资源空闲的算力、已被使用的算力、或者还能够支持的用于模型训练的算力等。进而模型训练管理功能可以获知模型训练实体能够用于模型训练的运算能力。
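For example, the hardware resources may include a processor, a memory, and the like. The processor may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processing unit (NPU). The memory may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc; the present application does not specifically limit this.
Optionally, the multiple model training entities may each periodically send computing power resource information to the model training management entity. For example, the multiple model training entities may be configured to report computing power resource information periodically, where the reporting period may be preset or configured by the model training management entity; the present application does not specifically limit this. Alternatively, a model training entity sends computing power resource information to the model training management entity on request: the model training management entity sends computing power resource query information to the model training entity, and the model training entity then sends computing power resource information to the model training management entity. For a more detailed description of how the model training management entity obtains computing power resource information, see the description of FIG. 6 below; details are not repeated here.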
可选地,模型训练管理实体分别向多个模型训练实体发送多个算力资源查询信息,模型训练管理实体分别接收来自多个模型训练实体的多个算力资源信息。即模型训练实体可以基于模型训练管理实体的查询信息返回算力资源信息。
示例性地,模型训练管理实体基于训练状态信息确定第一模型训练实体不能够独立执行全部所述至少一个模型训练任务时,分别向多个模型训练实体发送多个算力资源查询信息。模型训练实体可以基于模型训练管理实体的查询信息返回算力资源信息,即模型训练管理实体可以在具有需求的情况下,比如确定要分配协助训练的情况下向模型训练实体发送查询信息,能够节省传输资源。
S503,模型训练管理实体基于多个算力资源信息在多个模型训练实体中确定至少一个模型训练实体#2。
该至少一个模型训练实体#2用于协作执行目标模型训练任务。
可以理解的是,如果步骤S501中的训练状态信息指示模型训练实体#1请求协作训练的目标模型训练任务,那么模型训练管理实体可以直接基于该目标模型训练任务的信息和多个算力资源信息在多个模型训练实体中确定模型训练实体#2。如果步骤S501中的训练状态信息指示模型训练实体#1请求协作训练,但是没有指示目标模型训练任务,那么模型训练管理实体在确定模型训练实体#2之前,模型训练管理实体基于训练状态信息和多个算力资源信息确定是否为模型训练实体#1配置协作训练。如果步骤S501中的训练状态信息没有指示模型训练实体#1请求协作训练,也没有指示目标模型训练任务,那么模型训练管理实体在确定模型训练实体#2之前,模型训练管理实体基于训练状态信息和多个算力资源信息确定是否为模型训练实体#1配置协作训练,在确定为模型训练实体#1配置协作训练的情况下,确定目标模型训练任务。以下分别对模型训练管理实体确定是否为模型训练实体#1配置协作训练,以及确定目标模型训练任务进行说明。
针对模型训练管理实体确定是否为模型训练实体#1配置协作训练,在第一种可能的实现方式中,模型训练管理实体基于模型训练实体#1上报的算力资源信息和训练状态信息确定是否为模型训练实体#1配置协作训练。在这种实现方式中,模型训练管理实体确定是否为模型训练实体#1配置协作训练与模型训练实体#1确定是否请求协作训练的方式类似,比如可以基于空闲的算力资源确定或者模型训练任务的优先级确定,详细的描述请参考上文步骤S501,在此不予赘述。
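In a second possible implementation, the model training management entity determines whether to configure collaborative training for model training entity #1 based on the multiple pieces of computing power resource information and the training status information reported by the multiple model training entities. For example, the model training management entity determines, from the idle computing power of the multiple model training entities indicated by the multiple pieces of computing power resource information, which model training tasks can be handled. If the idle computing power cannot support assisting with the model training tasks indicated by the training status information, the model training management entity determines not to configure collaborative training for model training entity #1; if it can, the model training management entity determines to configure collaborative training for model training entity #1.
Next, consider how the model training management entity determines the target model training task once it has determined to configure collaborative training for model training entity #1. In a first possible implementation, the model training management entity determines the target model training task based on the computing power resource information and training status information reported by model training entity #1. In this implementation, the model training management entity determines the target model training task in a way similar to how model training entity #1 does, for example based on training progress or priority; for details, see step S501 above, which are not repeated here.
In a second possible implementation, the model training management entity determines the target model training task based on the multiple pieces of computing power resource information and the training status information reported by the multiple model training entities. For example, the model training management entity may determine candidate model training tasks for collaborative training based on the training status information, for example treating all model training tasks that are waiting to run as candidates. It then determines the target model training task among the candidates based on the idle computing power of the multiple model training entities indicated by the multiple pieces of computing power resource information, for example choosing the model training tasks that this idle computing power can handle. The present application does not specifically limit how the model training task is determined.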
上文对是否进行协助训练以及协作训练的目标模型训练任务进行了说明,可选地,模型训练管理实体还可以采用如下两种方式实现目标模型训练任务的协作训练,其中,方式1可以视为整体协作训练,即将目标模型训练任务配置给除模型训练实体#1以外的模型训练实体进行训练,方式2可以视为分解协作训练,比如将目标模型训练任务分解为多个目标模型训练子任务,每个目标模型训练子任务对应目标长度的训练数据,由多个模型训练实体分别采用目标长度的训练数据进行协作训练,或者比如将目标模型分解为多个目标子模型,每个目标子模型可以被独立训练,对目标子模型的训练可以理解为是一个目标模型训练子任务。为了便于了解本申请实施例,有关方式1更详细的说明请参见下文图7的描述,有关方式2更详细的说明请参见下文图8的描述,在此不予赘述。
可以理解的是,模型训练管理实体可以预先设置采用哪一种方式进行处理,或者模型训练管理实体也可以基于目标模型训练任务、算力资源信息或模型训练实体的能力确定采用哪一种方式进行处理,比如如果存在一个模型训练实体的空闲算力资源能够支持执行目标模型训练任务,那么模型训练管理实体可以采用方式1进行配置,即该模型训练实体可以单独执行目标模型训练任务;如果不存在一个模型训练实体的空闲算力资源能够支持单独执行目标模型训练任务,那么模型训练管理实体可以采用方式2进行配置,即将目标模型训练任务分解成多个目标模型训练子任务,并选择多个目标模型训练实体分别执行多个目标模型训练子任务。
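The choice between mode 1 (whole-task collaborative training) and mode 2 (decomposed collaborative training) described above can be sketched as a single check. This is a hedged illustration only: `choose_mode`, the mode labels, and the use of FLOPS as the common unit are assumptions, and the application also allows the mode to be preset.

```python
def choose_mode(demand_flops: float, idle_flops_by_entity: dict[str, float]) -> str:
    """Pick mode 1 if a single entity can run the whole task, otherwise mode 2.

    demand_flops: estimated computing power the target task needs (FLOPS).
    idle_flops_by_entity: idle computing power per candidate entity (FLOPS).
    """
    if any(idle >= demand_flops for idle in idle_flops_by_entity.values()):
        return "mode 1: whole-task collaborative training"
    return "mode 2: decomposed collaborative training"


# Hypothetical numbers: no single entity reaches 0.5 TFLOPS, so the task is split.
print(choose_mode(0.5e12, {"entity2": 0.3e12, "entity3": 0.4e12}))
```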
还可以理解的是,模型训练管理实体确定一个或多个目标模型训练实体(即本申请实施例中的模型训练实体#2)的方式类似,为了便于描述,以下以确定一个模型训练实体#2的方式进行说明。
可选地,模型训练管理功能对目标模型训练任务的算力需求进行预估。进而该目标模型训练任务的算力需求可以用于确定模型训练实体#2。
在一种可能的实现方式中,模型训练管理实体根据目标模型训练任务的性能需求确定目标模型训练任务的算力需求。其中,性能需求可以包括以下至少一项:目标模型训练任务的训练精确度、目标模型训练任务的训练时长。算力需求可以包括以下至少一项:预计训练目标模型训练任务所需的总浮点运算数(FLOPs)、预计模型训练目标模型训练任务的每秒浮点运算次数(FLOPS)。
示例性地,模型训练管理实体可以根据目标模型训练任务的训练精确度确定为了达到该训练精确度所需的总浮点运算次数,再根据总浮点运算次数和训练时长确定FLOPS。例如:针对用于进行信道估计的模型的模型训练任务,该模型训练任务要达到90%的训练精确度需要的计算量预计为0.5TFLOPs(每层的运算次数*层数*迭代次数),如果期望完成训练的训练时长为1s,那么模型训练管理实体可以确定算力需求为0.5TFLOPS。
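The arithmetic in this example (total floating-point operations divided by the expected training duration) can be checked with a few lines. The sketch below assumes a hypothetical per-layer breakdown that happens to multiply out to the 0.5 TFLOPs quoted in the text; only the 0.5 TFLOPs total and the 1 s duration come from the source.

```python
def total_flops(ops_per_layer: float, layers: int, iterations: int) -> float:
    """Total floating-point operations: operations per layer * layers * iterations."""
    return ops_per_layer * layers * iterations


def required_flops_per_second(total: float, duration_s: float) -> float:
    """Computing power demand (FLOPS) needed to finish `total` FLOPs in `duration_s`."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return total / duration_s


# Hypothetical breakdown: 1e8 operations per layer, 5 layers, 1000 iterations = 5e11 FLOPs.
workload = total_flops(1e8, 5, 1000)
demand = required_flops_per_second(workload, 1.0)
print(demand / 1e12)  # 0.5 (TFLOPS)
```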
可选地,模型训练管理功能基于目标模型训练任务的算力需求和多个算力资源信息确定模型训练实体#2。
在一种可能的实现方式中,模型训练管理实体可以根据多个模型训练实体中每一个模型训练实体的算力资源信息确定模型训练实体#2。
示例性地,模型训练管理实体获取多个模型训练实体的硬件资源信息和资源使用信息,基于资源使用信息确定还具有空闲算力的至少一个模型训练实体,再基于硬件资源信息从至少一个模型训练实体中确定能够支持目标模型训练任务算力需求的模型训练实体#2。
可选地,如果存在多个支持目标模型训练任务算力需求的候选的模型训练实体,那么在多个候选的模型训练实体中,模型训练管理实体可以将与模型训练实体#1最邻近的模型训练实体确定为对目标模型训练任务进行协作训练的模型训练实体#2。其中,与模型训练实体#1最邻近的模型训练实体可以是指距离模型训练实体#1最近的模型训练实体,或者可以是与模型训练实体#1相隔传输节点最少的模型训练实体,本申请对此不作特别限定。
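The selection of model training entity #2 described above (filter by idle computing power, then prefer the entity nearest to model training entity #1) can be sketched as follows. The `ComputeInfo` fields, the idle-power formula, and the hop-count measure of proximity are all assumptions for illustration; the application leaves the exact representation open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComputeInfo:
    entity_id: str
    peak_flops: float        # hardware resource information, quantized as FLOPS
    utilization: float       # resource usage information, 0.0 to 1.0
    hops_to_requester: int   # transmission nodes between this entity and entity #1

    @property
    def idle_flops(self) -> float:
        return self.peak_flops * (1.0 - self.utilization)


def select_assistant(infos: list[ComputeInfo], demand_flops: float,
                     requester_id: str) -> Optional[str]:
    """Return the nearest entity whose idle computing power meets the demand, if any."""
    candidates = [i for i in infos
                  if i.entity_id != requester_id and i.idle_flops >= demand_flops]
    if not candidates:
        return None
    return min(candidates, key=lambda i: i.hops_to_requester).entity_id


infos = [
    ComputeInfo("entity1", 1e12, 1.0, 0),   # the overloaded requester
    ComputeInfo("entityA", 2e12, 0.5, 3),   # enough idle power, but farther away
    ComputeInfo("entityB", 1e12, 0.25, 1),  # enough idle power and closest
    ComputeInfo("entityC", 1e12, 0.9, 1),   # close, but too little idle power
]
print(select_assistant(infos, 0.5e12, "entity1"))  # entityB
```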
S504,模型训练管理实体向模型训练实体#2发送第一训练任务配置信息;
对应地,模型训练实体#2接收来自模型训练管理实体的第一训练任务配置信息。
该第一训练任务配置信息指示模型训练实体#2协助模型训练实体#1执行至少一个模型训练任务中的目标模型训练任务。
可选地,该第一训练任务配置信息指示目标模型训练任务的标识。进而模型训练实体#2可以基于目标模型训练任务的标识获取待训练的模型和训练数据。
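Optionally, the first training task configuration information further indicates the target training data of the target model training task. For example, if the model training management entity has information about the target training data used for the target model training task, such as the target training data itself, its address, or how to obtain it, the model training management entity may also indicate that information to model training entity #2.
Optionally, the first training task configuration information further indicates the identifier of model training entity #1, so that model training entity #2 can assist model training entity #1 in collaborative training according to that identifier.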
可选地,S505,模型训练管理实体向模型训练实体#1发送训练任务信息;
对应地,模型训练实体#1接收来自模型训练管理实体的训练任务信息。
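If the model training management entity performs the configuration in mode 1 above, the training task information may also be called training task configuration notification information, which indicates that model training entity #2 assists in executing the target model training task. That is, the training task configuration notification information informs model training entity #1 that it is model training entity #2 that assists in executing the target model training task.
For example, the training task configuration notification information may indicate the identifier of model training entity #2. If the target model training task was determined by the model training management entity, the training task configuration notification information may further indicate the identifier of the target model training task.
If the model training management entity performs the configuration in mode 2 above, the training task information may also be called second training task configuration information, which indicates that model training entity #2 uses training data of the target length to assist in executing the target model training task.
For example, the second training task configuration information may indicate the identifier of model training entity #2 and the target length of the training data that model training entity #2 uses for collaborative training. If the target model training task was determined by the model training management entity, the second training task configuration information may further indicate the identifier of the target model training task.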
S506,模型训练实体#2获取训练数据信息,并协助执行目标模型训练任务。
该训练数据信息指示模型训练实体#2用于协助执行目标模型训练任务的训练数据。
在第一种可能的实现方式中,模型训练实体#2从模型训练管理实体获取训练数据信息,比如第一训练任务配置信息还指示目标模型训练任务的目标训练数据。
在第二种可能的实现方式中,模型训练实体#2可以从模型训练实体#1获取训练数据信息,示例性地,模型训练实体#1可以基于接收的训练任务信息获取模型训练实体#2的标识,模型训练实体#1主动向模型训练实体#2发送训练数据信息。
可选地,该训练数据信息还指示目标模型训练任务的标识、训练数据的数据地址。
需要说明的是,训练数据的数据地址可以是存储训练数据的装置的地址,进而模型训练实体#2可以根据训练数据的数据地址从该装置中下载训练数据,比如,该训练数据可以存储于模型训练实体#1中,或者模型训练实体#1所部署的装置中。或者,训练数据的数据地址可以是训练数据的采集地址,进而模型训练实体#2可以根据数据地址进行数据采集,采集获得的数据可以用于目标模型训练任务的训练。
示例性地,如果模型训练功能#1和模型训练功能#2部署于NWDAF,那么训练数据可以存储于数 据收集协调功能(data collection coordination function,DCCF)实体或者分析数据库功能(analytics data repository function,ADRF)实体中。例如,训练数据信息可以指示DCCF或ADRF的标识或者地址,模型训练实体#2可以向DCCF或ADRF发送用于请求训练数据的信息(例如Ndccf_DataManagement_Subscribe或Ndccf_DataManagement_Fetch或Nadrf_DataManagement_RetrievalRequest或Nadrf_DataManagement_RetrievalSubscribe),DCCF或ADRF向模型训练实体#2发送训练数据。
可选地,该训练数据信息还指示目标长度。进而模型训练实体#2可以获取目标长度的训练数据进行训练。
模型训练实体#2可以采用训练数据对目标模型训练任务进行训练。本申请对目标模型训练任务进行训练的方式不作特别限定,比如模型训练实体#2可以采用批量梯度下降、小批量梯度下降或随机梯度下降等训练方式。
可选地,S507,模型训练实体#2可以向模型训练管理实体发送协助训练反馈信息;
对应地,模型训练管理实体接收来自模型训练实体#2的协助训练反馈信息。
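While model training entity #2 assists model training entity #1 in executing the target model training task, model training entity #2 may send assisted-training feedback information to the model training management entity. The assisted-training feedback information may indicate at least one of the following: the accuracy that model training entity #2 has reached in executing the target model training task, the time it has spent executing the task, its execution progress, and the amount of resources it occupies for the task.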
示例性地,模型训练实体#2可以周期向模型训练管理实体发送协助训练反馈信息,该周期可以是预先设置的或者是模型训练管理实体配置的,本申请对此不作特别限定。
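Optionally, model training entity #2 sends network state change information to the model training management entity. The network state change information indicates that the network state of model training entity #2 has changed such that it can no longer assist in executing the target model training task. For example, the network state change information indicates that model training entity #2 is entering an energy-saving state, that is, model training entity #2 will be shut down to save energy. As another example, the network state change information indicates that the resources of model training entity #2 used for model training have been occupied, that is, the computing power resources with which model training entity #2 assists in executing the target model training task are insufficient. The model training management entity can then, based on the network state change information, change the model training entity that assists the first model training entity in executing the target model training task, for example by selecting another model training entity for assisted training. This is described in more detail below with reference to FIG. 9.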
可以理解的是,如果步骤S507中模型训练实体#2周期向模型训练管理实体发送协助训练反馈信息,那么网络状态更改信息可以是网络状态改变后基于周期发送的协助训练反馈信息。
可选地,S508,模型训练实体#2发送模型训练报告信息;
模型训练实体#2完成目标模型训练任务的训练后,可以生成训练完成的模型,进而,该模型训练报告信息可以包括训练完成的模型。
可选地,该训练报告信息还包括训练完成的模型性能信息,比如训练完成的精确度、训练耗费时长等。
在第一种可能的实现方式中,模型训练实体#2独自执行目标模型训练任务,那么模型训练实体#2可以向模型推理实体#1发送模型训练报告信息,其中模型推理实体#1可以是请求执行目标模型训练任务的模型推理实体,比如模型训练实体#1和模型推理实体#1是同一个网络设备部署的实体,或者模型训练实体#1和模型推理实体#1分别部署于不同的NWDAF。更详细的内容可以参见图7的描述,在此不予赘述。
在第二种可能的实现方式中,模型训练实体#2执行的是目标模型训练任务分解而成的目标模型训练子任务,那么模型训练实体#2可以向模型训练实体#1发送模型训练报告信息,进而模型训练实体#1可以根据模型训练报告信息中的训练完成的模型进行模型聚合,更详细的内容可以参见图8的描述,在此不予赘述。
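It should be noted that if model training entity #1 and model inference entity #1 are deployed in different NWDAFs, model training entity #2 may send the model training report information either to model training entity #1 or to model inference entity #1. Optionally, the model training report information may be Nnwdaf_MLModelProvision_Notify or Nnwdaf_MLModelInfo_Request response.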
基于本技术方案,模型训练管理实体可以对多个模型训练实体的训练任务和算力资源进行管理编排,将模型训练实体#1的训练任务分配给其他资源充足的模型训练实体协助完成训练,减少模型训练实体#1的训练任务的训练等待时间,提高模型训练的效率。
以上对模型训练管理实体对多个模型训练实体的训练任务和算力资源进行管理编排的方法进行了说明,以下结合图6对图5中步骤S501和步骤S502提及的训练状态信息和算力资源信息的获取方式进行说明。
图6是本申请实施例提供的一种获取训练状态信息和算力资源信息的方式的示意性流程图。
S601,模型训练管理实体向模型训练实体#1发送算力资源订阅信息#1;
对应地,模型训练实体#1接收来自模型训练管理实体的算力资源订阅信息#1。
算力资源订阅信息用于订阅模型训练实体#1的算力资源信息#1。有关算力资源信息的内容可以参考图5中步骤S502的介绍,在此不予赘述。
可选地,该算力资源订阅信息#1还指示模型训练实体#1上报算力资源信息#1的周期。示例性地,模型训练管理实体可以根据模型训练实体执行训练任务的平均耗时确定该周期。
S602,模型训练实体#1向模型训练管理实体发送算力资源信息#1;
对应地,模型训练管理实体接收来自模型训练实体#1的该算力资源信息#1。
模型训练实体#1可以响应于算力资源订阅信息#1向模型训练管理实体发送算力资源信息#1。该算力资源信息#1可以包括算力资源订阅信息#1订阅的算力资源信息。
可选地,模型训练实体#1周期向模型训练管理实体发送算力资源信息#1。
S603,模型训练管理实体向模型训练实体#2发送算力资源订阅信息#2;
对应地,模型训练实体#2接收来自模型训练管理实体的算力资源订阅信息#2。
算力资源订阅信息#2用于订阅模型训练实体#2的算力资源信息#2。该步骤与上文步骤S601中的描述类似,在此不予赘述。
可以理解的是,该算力资源订阅信息#1还指示模型训练实体#1上报算力资源信息#1的周期,与算力资源订阅信息#2还指示模型训练实体#2上报算力资源信息#2的周期可以相同也可以不同,比如,每个模型训练实体上报算力资源信息的周期可以基于各自执行训练任务的平均耗时确定,本申请对此不作特别限定。
S604,模型训练实体#2向模型训练管理实体发送算力资源信息#2;
对应地,模型训练管理实体接收来自模型训练实体#2的该算力资源信息#2。
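Model training entity #2 may send computing power resource information #2 to the model training management entity in response to computing power resource subscription information #2. Computing power resource information #2 may include the computing power resource information subscribed to by computing power resource subscription information #2.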
可选地,模型训练实体#2周期向模型训练管理实体发送算力资源信息#2。
S605,模型训练管理实体向模型训练实体#1发送训练状态订阅信息#1;
对应地,模型训练实体#1接收来自模型训练管理实体的训练状态订阅信息#1。
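Training status subscription information #1 is used to subscribe to training status information #1 of model training entity #1. For the content of the training status information, see the description of step S501 in FIG. 5; details are not repeated here.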
可选地,训练状态订阅信息#1指示模型训练实体#1周期发送训练状态信息#1。
可选地,训练状态订阅信息#1指示模型训练实体#1基于触发事件发送训练状态信息#1,该触发事件可以包括以下至少一项:模型训练实体#1新增训练任务、模型训练实体#1执行完成某个训练任务。
S606,模型训练实体#1向模型训练管理实体发送训练状态信息#1;
对应地,模型训练管理实体接收来自模型训练实体#1的训练状态信息#1。
模型训练实体#1可以响应于训练状态订阅信息#1向模型训练管理实体发送训练状态信息#1。该训练状态信息#1可以包括训练状态订阅信息#1订阅的训练状态信息#1。
可选地,模型训练实体#1周期向模型训练管理实体发送训练状态信息#1。
可选地,模型训练实体#1基于触发事件向模型训练管理实体发送训练状态信息#1。
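S607, the model training management entity sends training status subscription information #2 to model training entity #2;
Correspondingly, model training entity #2 receives training status subscription information #2 from the model training management entity.
Training status subscription information #2 is used to subscribe to training status information #2 of model training entity #2. This step is similar to step S605 above; details are not repeated here.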
可以理解的是,该训练状态订阅信息#1还指示模型训练实体#1上报训练状态信息#1的周期,与训练状态订阅信息#2还指示模型训练实体#2上报训练状态信息#2的周期可以相同也可以不同,比如,每个模型训练实体上报训练状态信息的周期可以基于各自执行训练任务的平均耗时确定,本申请对此不作特别限定。
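S608, model training entity #2 sends training status information #2 to the model training management entity;
Correspondingly, the model training management entity receives training status information #2 from model training entity #2.
Model training entity #2 may send training status information #2 to the model training management entity in response to training status subscription information #2.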
可选地,模型训练实体#2周期向模型训练管理实体发送训练状态信息#2。
可选地,模型训练实体#2基于触发事件向模型训练管理实体发送训练状态信息#2。
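It can be understood that, for ease of understanding, FIG. 6 uses only two model training entities as an example. The model training management entity may also manage three or more model training entities and may send computing power resource subscription information and training status subscription information to each of them; the present application does not specifically limit this.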
基于本技术方案,对于网络状态频繁变化的通信系统中,各个模型训练实体可以及时上报自己的算力资源信息和训练状态信息,使得模型训练管理实体可以及时进行编排与管理,提升模型训练的效率。
上文对模型训练管理实体获取训练状态信息和算力资源信息的方式进行了说明,以下结合图7和图8分别对图5中步骤S503提及的方式1和方式2进行说明,其中图7描述方式1,即目标模型训练任务整体训练的一种实现方式,图8描述方式2,即目标模型训练任务分解训练的一种实现方式。
图7是本申请实施例提供的一种目标模型训练任务整体训练的方法的示意性流程图。
需要说明的是,图7所示的步骤可以在图6所示的步骤之后执行,即模型训练实体可以周期或基于触发事件向模型训练管理实体发送训练状态信息,模型训练实体也可以周期向模型训练管理实体发送算力资源信息。
S701,模型推理实体#1向模型训练实体#1发送模型训练请求消息。
模型训练请求消息用于请求模型训练实体#1训练推理模型。
可选地,模型训练请求消息包括以下信息中至少一项:推理模型、推理模型的标识信息、模型性能需求信息、期望训练时长信息。
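The inference model is the inference model to be trained. Both the inference model and its identification information can be used to identify the inference model; for example, the inference model may refer to its model file, which may include an identifier or inference type that identifies the inference model. The model performance requirement information may indicate the requirements on model training, for example that the precision of the trained inference model be greater than or equal to a specific threshold, or that its accuracy be greater than or equal to a specific threshold. The expected training duration information may indicate the training time that model training is expected to take; for example, it may include a first duration, indicating that the model training entity should complete the training within the first duration counted from receipt of the model training request message.
It should be noted that model inference entity #1 and model training entity #1 may be two modules deployed in one device, such as the model training network element 121 and the model inference network element 130 shown in FIG. 1. In that case, model inference entity #1 may by default send the model training request message to model training entity #1 in the same device, and the model training request message of step S701 is passed over an internal interface, which reduces external interface overhead. Model inference entity #1 and model training entity #1 may also be deployed in different NWDAFs, in which case the model training request message may be exchanged over the NWDAF interface. The model training request message may also be a model subscription message or a model request message (for example, Nnwdaf_MLModelProvision_Subscribe or Nnwdaf_MLModelInfo_Request). The message may include the analytics identifier (Analytics ID) corresponding to the requested model, which indicates a specific analytics type, for example slice load level analytics (Analytics ID = Load level information) or network function load analytics (Analytics ID = NF load information).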
S702,模型训练实体#1确定训练任务。
模型训练实体#1可以基于模型训练请求消息确定训练任务。示例性地,模型训练实体#1可以根据推理模型生成或者索引到推理模型的训练任务,比如模型训练实体#1可以根据模型训练请求消息中的信息生成训练任务,或者说可以根据模型训练请求消息设置训练任务的参数,比如权重等,本申请对此不作特别限定。
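Optionally, model training entity #1 may determine, based on the training task, whether to request collaborative training. For example, model training entity #1 determines this based on whether its idle computing power resources can meet the needs of the training task, or based on the priorities of its model training tasks. For details, see the description of step S501 in FIG. 5, which are not repeated here.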
可选地,模型训练实体#1确定请求协作训练的目标模型训练任务。有关该部分的内容可以参考图5中步骤S501的描述,在此不予赘述。
S703,模型训练实体#1向模型训练管理实体发送训练任务更新通知消息。
训练任务更新通知消息指示模型训练实体#1上新增模型训练任务。
可以理解的是,如果模型训练实体#1基于触发事件发送训练状态信息,触发事件为模型训练实体#1上新增模型训练任务,比如步骤S701中模型训练实体#1接收了来自模型推理实体的模型训练请求信息,训练任务更新通知信息可以用于通知模型训练管理实体模型训练实体#1上新增模型训练任务。那么训练任务更新通知消息可以是基于触发事件发送的训练状态信息。
训练任务更新通知消息可以指示模型训练实体#1上的全部训练任务的信息,或者,该训练任务更新消息也可以指示改变的训练任务的信息,其中改变的训练任务可以包括任务进程改变的训练任务、新增的训练任务等。
如果训练任务更新通知消息通知模型训练实体#1上的全部训练任务的信息,可选地,该训练任务更新通知消息指示以下至少一项:模型训练实体#1上的至少一个训练任务、至少一个训练任务中每个训练任务的任务进程、至少一个训练任务中每个训练任务的优先级。
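For example, the training task update notification message includes three training tasks 1 to 3: training task 1 is running, training tasks 2 and 3 are waiting to run, and training task 2 has a higher priority than training task 3. The model training management entity can then combine this with computing power resource information #1 reported by model training entity #1 to determine whether model training entity #1 can complete all three training tasks on its own, that is, whether collaborative training is needed.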
如果训练任务更新通知消息通知模型训练实体#1上任务进程改变的训练任务、新增的训练任务的信息等,可选地,该训练任务更新通知消息包括以下至少一项:任务进程改变的训练任务的标识以及任务进程改变的类型、新增的训练任务的标识、新增的训练任务的优先级。从而,传输资源能够得以节省。
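For example, the training task update notification message reports that training task 1 at model training entity #1 has finished training and that training task 3 has been added. Note that the message contains no information about training task 2, which can mean that the progress of training task 2 has not changed; for example, training task 2 may still be training or still be waiting to run.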
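The incremental form of the training task update notification message, which carries only changed and newly added tasks, can be sketched as a diff between two snapshots of task states. The dictionary layout, state names, and `diff_task_states` are hypothetical; the application does not define a message encoding.

```python
def diff_task_states(previous: dict[str, str], current: dict[str, str]) -> dict:
    """Build an incremental update: tasks whose state changed, plus newly added tasks.

    Tasks whose state is unchanged are omitted, which is what saves transmission
    resources compared with reporting every task.
    """
    changed = {tid: state for tid, state in current.items()
               if tid in previous and previous[tid] != state}
    added = [tid for tid in current if tid not in previous]
    return {"changed": changed, "added": added}


previous = {"task1": "running", "task2": "waiting"}
current = {"task1": "finished", "task2": "waiting", "task3": "waiting"}
print(diff_task_states(previous, current))
# {'changed': {'task1': 'finished'}, 'added': ['task3']}
```

As in the example in the text, `task2` does not appear in the update because its state has not changed.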
S704,模型训练管理实体基于多个算力资源信息在多个模型训练实体中确定模型训练实体#2。
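Model training entity #2 is used to collaboratively execute the target model training task. The model training management entity may obtain the target model training task from the training task update notification message, or may itself determine the target model training task based on that message. For how the model training management entity determines model training entity #2, see the description of step S503 in FIG. 5; details are not repeated here.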
S705,模型训练管理实体向模型训练实体#2发送第一训练任务配置信息。
该第一训练任务配置信息指示模型训练实体#2协助模型训练实体#1执行至少一个模型训练任务中的目标模型训练任务。有关第一训练任务配置信息的内容可以参考图5中步骤S504的描述,在此不予赘述。
S706,模型训练管理实体向模型训练实体#1发送训练任务配置通知信息。
该训练任务配置通知信息指示模型训练实体#2协助执行目标模型训练任务。有关训练任务配置通知信息的内容可以参考图5中步骤S505的描述,在此不予赘述。
S707,模型训练实体#1向模型训练实体#2发送目标训练数据指示信息。
目标训练数据指示信息指示用于训练目标模型训练任务的目标训练数据。
可以理解的是,如果模型训练管理实体不具有目标模型训练数据的信息,即没有在步骤S705中的第一训练任务配置信息中携带目标训练数据的信息,那么模型训练实体#1可以基于训练任务配置通知信息获知由模型训练实体#2训练目标模型训练任务,进而向模型训练实体#2发送目标训练数据指示信息。
S708,模型训练实体#2获取目标训练数据。
示例性地,模型训练实体#2基于目标训练数据指示信息下载目标训练数据,比如目标训练数据存储于模型训练实体#1,模型训练实体#2向模型训练实体#1发送下载请求,模型训练实体#1向模型训练实体#2发送目标训练数据。有关获取目标训练数据的内容可以参考图5中步骤S506的描述,在此不予赘述。
S709,模型训练实体#2协助执行目标模型训练任务。
模型训练实体#2可以采用目标训练数据对目标模型训练任务进行训练。有关协助执行目标模型训练任务的内容可以参考图5中步骤S506的描述,在此不予赘述。
S710,模型训练实体#2向模型推理实体#1发送训练报告信息。
模型训练实体#2完成目标模型训练任务的训练后,可以生成训练完成的模型,进而,该训练报告信息可以包括训练完成的模型。
可选地,该训练报告信息还包括训练完成的模型性能信息,比如训练完成的精确度、训练耗费时长等。进而模型推理实体可以从模型训练实体#2直接获取训练完成能够使用的模型。
基于本技术方案,模型训练管理实体可以对多个模型训练实体的训练任务和算力资源进行管理编排,将模型训练实体#1的训练任务分配给其他资源充足的模型训练实体协助完成训练,减少模型训练实体#1的训练任务的训练等待时间,提高系统资源的利用率,以及提升模型训练的效率。
图8是本申请实施例提供的一种目标模型训练任务分解训练的方法的示意性流程图。
需要说明的是,图8所示的步骤可以在图6所示的步骤之后执行,即模型训练实体可以周期或基于触发事件向模型训练管理实体发送训练状态信息,模型训练实体也可以周期向模型训练管理实体发送算力资源信息。
S801,模型推理实体向模型训练实体#1发送模型训练请求消息。
模型训练请求消息用于请求模型训练实体#1训练推理模型。有关该部分的内容可以参考图7中步骤S701的描述,在此不予赘述。
S802,模型训练实体#1确定训练任务。
模型训练实体#1可以基于模型训练请求消息确定训练任务。有关该部分的内容可以参考图7中步骤S702的描述,在此不予赘述。
S803,模型训练实体#1向模型训练管理实体发送训练任务更新通知消息。
训练任务更新通知消息指示模型训练实体#1上新增模型训练任务。有关训练任务更新通知消息的内容可以参考图7中步骤S703的描述,在此不予赘述。
可选地,该模型训练请求消息还可以包括训练数据长度信息。训练数据长度信息指示训练数据的长度,训练数据用于执行训练任务。
示例性地,该训练数据长度信息可以指示模型训练实体#1上的每个训练任务的训练数据的长度,进而模型训练管理实体能够根据每个训练任务对应的训练数据的长度确定进行协作训练的目标模型训练任务,提升确定目标模型训练任务的可靠性。
再示例性地,如果模型训练实体#1确定需要协作训练的目标模型训练任务,即训练任务更新通知消息包括了目标模型训练任务的信息,那么训练数据长度信息可以仅指示目标模型训练任务对应的目标训练数据的长度,能够节省传输资源。
S804,模型训练管理实体分解目标模型训练任务,以及确定模型训练实体#2。
模型训练管理实体可以从训练任务更新通知消息中获取目标模型训练任务,或者模型训练管理实体可以自己基于训练任务更新通知消息确定目标模型训练任务,有关内容可以参考图5中步骤S503的描述,在此不予赘述。
可选地,模型训练管理实体可以通过切分目标模型分解目标模型训练任务,或者,模型训练管理实体也可以通过切分训练数据分解目标模型训练任务。
如果模型训练管理实体通过切分目标模型分解获得目标模型训练任务。模型训练管理实体可以基于多个模型训练实体的算力资源信息确定参与对目标模型训练任务进行训练的M个模型训练实体,并将目标模型切分为M个目标子模型,每个目标子模型可以被模型训练实体单独训练,一个目标子模型的训练为一个模型训练子任务,M为正整数。
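For example, if the target model is a 4-layer neural network model, the model training management entity may split it into two target sub-models: the first two layers form one target sub-model and the last two layers form the other.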
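Splitting a layered model at given cut points can be sketched as below. The sketch treats a model as an ordered list of layer names, which is an assumption made for illustration; `split_model` and the cut-point representation are hypothetical and stand in for the splitting-mode information mentioned later in the text.

```python
def split_model(layers: list[str], cut_points: list[int]) -> list[list[str]]:
    """Split an ordered list of layers into consecutive sub-models.

    cut_points are the indices at which a new sub-model starts, in ascending order.
    """
    bounds = [0, *cut_points, len(layers)]
    return [layers[start:end] for start, end in zip(bounds, bounds[1:])]


# The 4-layer example from the text, cut after the second layer.
print(split_model(["layer1", "layer2", "layer3", "layer4"], [2]))
# [['layer1', 'layer2'], ['layer3', 'layer4']]
```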
如果模型训练管理实体通过切分训练数据分解获得目标模型训练任务。模型训练管理实体可以基于多个模型训练实体的算力资源信息将目标模型训练任务分解成多个目标模型训练子任务,并确定每个目标模型训练子任务对应的训练数据子长度。
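For example, the model training management entity may determine, based on the computing power resource information of the multiple model training entities, N model training entities that take part in training the target model training task, divide the target model training task into N target model training sub-tasks, and determine the training data sub-length of each of the N sub-tasks. The N model training entities may each execute one of the N target model training sub-tasks. N is a positive integer.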
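One possible way to derive the training data sub-lengths is to divide the total length in proportion to each entity's idle computing power. The application only says the sub-lengths are determined from the computing power resource information, so the proportional rule, the function name, and the handling of the rounding remainder below are all assumptions.

```python
def split_lengths(total_length: int, idle_compute: dict[str, float]) -> dict[str, int]:
    """Divide `total_length` training samples in proportion to idle computing power.

    Rounding leftovers are given to the entity with the most idle computing power,
    so the sub-lengths always sum to the total length.
    """
    total_idle = sum(idle_compute.values())
    if total_idle <= 0:
        raise ValueError("at least one entity must have idle computing power")
    shares = {entity: int(total_length * idle / total_idle)
              for entity, idle in idle_compute.items()}
    strongest = max(idle_compute, key=idle_compute.get)
    shares[strongest] += total_length - sum(shares.values())
    return shares


# Hypothetical case: entity #2 has three times the idle computing power of entity #1.
print(split_lengths(1000, {"entity1": 1.0, "entity2": 3.0}))
# {'entity1': 250, 'entity2': 750}
```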
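It can be understood that the N model training entities may or may not include model training entity #1. For example, if model training entity #1 still has idle computing power, it may train one of the target model training subtasks. If model training entity #1 has no idle computing power, it may not participate in training the target model training task, and the N other model training entities (for example, model training entity #2) train the subtasks respectively.
It should be noted that the model training management entity may set model training entity #1 as the primary model training entity and the other model training entities as collaborative model training entities. The primary model training entity may receive, from a collaborative model training entity, the sub-model that the collaborative model training entity obtains by training a target model training subtask. Using model training entity #1 as the primary model training entity can make training task allocation more reliable.
For example, the model training management entity determines, based on the computing power resource information, that model training entity #1 and model training entity #2 collaboratively train the target model training task, where model training entity #1 may serve as the primary model training entity and model training entity #2 as the collaborative model training entity.
Optionally, the model training management entity determines that model training entity #2 assists in executing the target model training task, and determines that the training data used to assist in executing the target model training task has the target length.
It can be understood that the model training management entity may first determine model training entity #2 and then decompose the target model training task, or first decompose the target model training task and then determine model training entity #2, or do both at the same time. This is not specifically limited in this application. For how the model training management entity determines model training entity #2, refer to the description of step S503 in FIG. 5. Details are not repeated here.
For ease of understanding, the embodiments of this application are described below using the foregoing example. Other allocation manners, such as more than two model training entities executing the target model training task, are similar and are not described again below.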
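S805: The model training management entity sends first training task configuration information to model training entity #2.
The first training task configuration information instructs model training entity #2 to assist model training entity #1 in executing the target model training task among the at least one model training task. For the content of the first training task configuration information, refer to the description of step S504 in FIG. 5. Details are not repeated here.
S806: The model training management entity sends second training task configuration information to model training entity #1.
The second training task configuration information indicates that model training entity #2 assists model training entity #1 in executing the target model training task.
Optionally, the second training task configuration information indicates the identifier of the target model training task and the identifier of model training entity #2, so that model training entity #1 learns that model training entity #2 assists in executing the target model training task.
Optionally, if the model training management entity decomposes the model training task by splitting the target model, the second training task configuration information further indicates split manner information, for example a split node identifier, which indicates how the target model is split. Model training entity #1 can then learn, from the split manner information, the target sub-models into which the target model is split.
Optionally, if the model training management entity decomposes the model training task by splitting the training data, the second training task configuration information further indicates the target length of the training data that model training entity #2 uses to assist in executing the target model training task, so that model training entity #1 can indicate the training data of the target length to model training entity #2.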
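S807: Model training entity #1 performs initial execution of the target model training task.
Model training entity #1 determines the model structure, training algorithm, training hyperparameters, and the like, and performs several rounds of initial model training using the training data of the target length that is available at model training entity #1, to obtain an initial model.
S808: Model training entity #1 sends target training data indication information to model training entity #2.
The target training data indication information indicates the target training data used for training the target model training task. It instructs model training entity #2 to use training data of the target length to assist in executing the target model training task.
Optionally, if the model training management entity decomposes the model training task by splitting the target model, the target training data indication information indicates the identifier of the target model training task, the address of the training data, and the address of the target sub-model corresponding to the training subtask that model training entity #2 executes. Model training entity #2 can then obtain the target sub-model based on the address of the target sub-model, and obtain the training data of the target length based on the address of the training data.
Optionally, if the model training management entity decomposes the model training task by splitting the training data, the target training data indication information indicates the identifier of the target model training task, the target length, the address of the training data, and the address of the initial model. Model training entity #2 can then obtain the initial model based on the address of the initial model, and obtain the training data of the target length based on the address of the training data.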
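The two variants of the target training data indication information can be pictured as one record with optional fields. The sketch below is only an illustration: the class name, field names, and example addresses are hypothetical, and the application does not define an encoding for this information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetTrainingDataIndication:
    """Hypothetical container for the fields listed for step S808."""
    task_id: str                                  # identifier of the target task
    data_address: str                             # where the training data is stored
    target_length: Optional[int] = None           # data-split case
    initial_model_address: Optional[str] = None   # data-split case
    sub_model_address: Optional[str] = None       # model-split case

    def decomposition(self) -> str:
        """Infer which decomposition manner this indication belongs to."""
        if self.sub_model_address is not None:
            return "model_split"
        if self.initial_model_address is not None and self.target_length is not None:
            return "data_split"
        raise ValueError("indication matches neither decomposition manner")


by_model = TargetTrainingDataIndication(
    task_id="task-1", data_address="store://data", sub_model_address="store://sub-model-2")
by_data = TargetTrainingDataIndication(
    task_id="task-1", data_address="store://data", target_length=500,
    initial_model_address="store://initial-model")
```

Keeping both variants in one record lets the receiving entity decide what to fetch from the fields that are present.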
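S809: Model training entity #2 obtains the target training data.
Model training entity #2 may obtain the training data of the target length based on the target training data indication information. For how the target training data is obtained, refer to the description of step S708 in FIG. 7. Details are not repeated here.
S810: Model training entity #2 assists in executing the target model training task.
Model training entity #2 trains the initial model for several rounds using the training data of the target length to generate a sub-model. Alternatively, model training entity #2 trains the target sub-model for several rounds using the training data to generate a trained target sub-model.
S811: Model training entity #2 sends model transfer information to model training entity #1.
The model transfer information indicates the trained sub-model, that is, the initial model after several rounds of training with training data of the target length, or the target sub-model after several rounds of training.
Optionally, the model transfer information indicates the model gradients of the sub-model or the model address of the sub-model.
Optionally, if the model training management entity decomposed the target model training task by splitting the target model, the model transfer information further indicates the identifier of the target sub-model. The identifier of the target sub-model may be determined based on the split manner information used when the model training management entity performed the split, or may be agreed on by model training entity #1 and model training entity #2. This is not specifically limited in this application.
S812: Model training entity #1 performs model aggregation.
It can be understood that model training entity #1 may perform model aggregation differently depending on how the model training task was decomposed. If the model training management entity decomposed the target model training task by splitting the target model, model training entity #1 obtains the trained target sub-model from model training entity #2 and may aggregate the target sub-models. If the model training management entity decomposed the target model training task by splitting the training data, model training entity #1 may obtain, based on the model transfer information, the initial model as further trained by model training entity #2, and aggregate it with the initial model obtained from its own initial training, to obtain the fully trained inference model. For example, model training entity #1 may average each gradient or weight across the multiple sub-models and the initial model to obtain the final aggregated model gradients or weights.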
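The averaging example for the data-split case can be sketched as follows. This is a minimal illustration, assuming every model exposes the same parameter names with equal-length weight vectors and that all models are weighted equally. The function name `aggregate_by_average` and the parameter names are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def aggregate_by_average(models: List[Weights]) -> Weights:
    """Average each weight element-wise across the given models."""
    if not models:
        raise ValueError("at least one model is required")
    count = len(models)
    return {
        name: [sum(values) / count
               for values in zip(*(model[name] for model in models))]
        for name in models[0]
    }


# Initial model held by model training entity #1 and the sub-model
# returned by model training entity #2.
initial_model = {"w": [1.0, 2.0], "b": [0.0]}
sub_model = {"w": [3.0, 4.0], "b": [1.0]}
aggregated = aggregate_by_average([initial_model, sub_model])
```

An implementation could instead weight each model by the length of the training data it was trained on. The application does not specify the weighting, so equal weights are used here.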
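S813: Model training entity #1 sends training report information to the model inference entity.
The training report information may include the fully trained inference model. For the content of the training report information, refer to the description of step S710 in FIG. 7. Details are not repeated here.
Based on this technical solution, the model training management entity can decompose the target model training task and use multiple model training entities to complete the training collaboratively. This reduces the training load on model training entity #1, which originally owns the target model training task, makes full use of the resources of multiple model training entities, and reduces the training waiting time of model training entity #1.
The foregoing describes the manners in which model training entity #2 completes the target model training task. In a possible implementation, while executing the target model training task, model training entity #2 may send the assisted training feedback information and/or the network status change information described in S507 in FIG. 5 to the model training management entity, so that the model training management entity can learn how execution of the target model training task is going and make timely adjustments, which improves the reliability of model training. This is described below with reference to FIG. 9.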
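FIG. 9 is a schematic flowchart of a method for providing feedback while a target model training task is being executed according to an embodiment of this application.
It should first be noted that, for ease of understanding, the example in FIG. 9 uses a model training management entity that manages three model training entities. The model training management entity may be any model training management entity described in FIG. 1 to FIG. 3, and model training entity #1, model training entity #2, and model training entity #3 may be any model training entities described in FIG. 1 to FIG. 3. This is not specifically limited in this application. In addition, the method described in FIG. 9 may be combined with any of the methods described in FIG. 5 to FIG. 8.
S901: Model training entity #2 sends assisted training feedback information and/or network status change information to the model training management entity.
The assisted training feedback information may indicate at least one of the following: the accuracy that model training entity #2 achieves in executing the target model training task, the time that model training entity #2 spends executing the target model training task, the progress of model training entity #2 in executing the target model training task, and the amount of resources that model training entity #2 occupies in executing the target model training task.
The network status change information indicates that the network status of model training entity #2 has changed such that it can no longer assist in executing the target model training task, in other words, such that it cannot complete the target model training task. For example, the network status change information indicates that model training entity #2 is entering an energy saving state, that is, model training entity #2 will be shut down to save energy. For another example, the network status change information indicates that the resources of model training entity #2 used for model training are occupied, that is, model training entity #2 has insufficient computing power resources to assist in executing the target model training task.
It can be understood that model training entity #2 may send only the assisted training feedback information to the model training management entity, in which case the model training management entity may determine, based on the content of the assisted training feedback information, whether another model training entity needs to take over assisting with the target model training task. Model training entity #2 may also send only the network status change information, in which case the model training management entity may determine directly from the network status change information that another model training entity is to take over. Model training entity #2 may also send both the assisted training feedback information and the network status change information. This is not specifically limited in this application.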
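S902: The model training management entity adjusts the allocation manner.
Adjusting the allocation manner may include adjusting the manner in which a model training entity assists in executing the target model training task (for example, changing the target length), or replacing the model training entity (for example, replacing model training entity #2 with model training entity #3). In the latter case, model training entity #3 assists in executing the target model training task. Model training entity #3 is an example of the second target model training entity that the model training management entity determines among the multiple model training entities.
For example, if the assisted training feedback information indicates that the training process in which model training entity #2 assists with the target model training task is abnormal, for example the training time is greater than or equal to a specific threshold, the model accuracy is lower than or equal to a specific threshold, or the completion percentage indicated by the training progress is lower than or equal to a specific threshold, the model training management entity may adjust the allocation manner.
For example, if the network status change information indicates that model training entity #2 is entering an energy saving state, that is, model training entity #2 will be shut down to save energy, the model training management entity may adjust the allocation manner of the target model training task. If the network status change information indicates that the resources of model training entity #2 used for model training are occupied, that is, model training entity #2 has insufficient computing power resources for model training, the model training management entity may also adjust the allocation manner.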
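The decision in S902 can be sketched as a simple check over the reported feedback. The class names, field names, and threshold values below are hypothetical. The application only states that each value is compared with "a specific threshold" and that a network status change may also trigger an adjustment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssistFeedback:
    """Hypothetical subset of the assisted training feedback information."""
    accuracy: Optional[float] = None
    elapsed_s: Optional[float] = None
    progress_pct: Optional[float] = None


@dataclass
class Thresholds:
    min_accuracy: float
    max_elapsed_s: float
    min_progress_pct: float


def needs_adjustment(feedback: Optional[AssistFeedback],
                     thresholds: Thresholds,
                     network_status_changed: bool = False) -> bool:
    """Return True if the management entity should adjust the allocation."""
    if network_status_changed:
        # Energy saving state or occupied resources: adjust directly.
        return True
    if feedback is None:
        return False
    too_slow = feedback.elapsed_s is not None and feedback.elapsed_s >= thresholds.max_elapsed_s
    inaccurate = feedback.accuracy is not None and feedback.accuracy <= thresholds.min_accuracy
    behind = feedback.progress_pct is not None and feedback.progress_pct <= thresholds.min_progress_pct
    return too_slow or inaccurate or behind


limits = Thresholds(min_accuracy=0.8, max_elapsed_s=600.0, min_progress_pct=30.0)
```

Fields that the feedback does not report are ignored, which matches the "at least one of the following" wording of the feedback content.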
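S903: The model training management entity sends configuration change information #1 to model training entity #1.
Configuration change information #1 indicates the adjusted allocation manner, for example the changed target length, or that assistance with the target model training task changes from model training entity #2 to model training entity #3.
S904: The model training management entity sends configuration change information #2 to model training entity #2.
Configuration change information #2 instructs model training entity #2 to assist in executing the target model training task in the adjusted manner, or instructs model training entity #2 to stop assisting in executing the target model training task.
Optionally, if the adjustment made by the model training management entity is to replace the model training entity that assists with training, for example replacing model training entity #2 with model training entity #3, the method further includes step S905.
S905 (optional): The model training management entity sends third training task configuration information to model training entity #3.
The third training task configuration information instructs model training entity #3 to assist model training entity #1 in executing the target model training task. Its content is similar to that of the first training task configuration information described in step S504 in FIG. 5. Details are not repeated here.
It can be understood that model training entity #3 may assist in executing the target model training task based on the third training task configuration information, that is, model training entity #3 may perform steps S506 to S508 in FIG. 5, steps S707 to S710 in FIG. 7, or steps S809 to S813 in FIG. 8. Details are not repeated here.
Based on this technical solution, a model training entity that assists in executing a model training task can report the execution status to the model training management entity during execution, so that the model training management entity can learn how the target model training task is being executed and make timely adjustments, which improves the reliability of model training.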
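The methods in FIG. 5 to FIG. 9 can all be applied to the architectures described in FIG. 1 to FIG. 3. It can be understood that, for ease of understanding, the foregoing description uses a single model training management entity. If multiple model training management entities exist in the system, for example, as shown in FIG. 3, one in the EMS and one in the NMS, the model training management entity in FIG. 5 to FIG. 9 may be regarded as the one in the EMS, and the model training management entity in the EMS may interact with the one in the NMS so that they jointly manage multiple model training entities. For ease of understanding, this implementation is described below with reference to FIG. 10.
FIG. 10 is a schematic flowchart of an implementation in which two model training management entities perform model training management according to an embodiment of this application.
It should be noted that FIG. 10 uses an example in which model training management entity #1 is the model training management entity in the NMS and model training management entity #2 is the model training management entity in the EMS.
S1001: Model training management entity #1 sends total computing power resource subscription information to model training management entity #2.
The total computing power resource subscription information instructs model training management entity #2 to subscribe to the computing power resource information of multiple model training entities, where the multiple model training entities are the model training entities managed by model training management entity #2. For the content of the computing power resource information, refer to the description of step S502 in FIG. 5. Details are not repeated here.
S1002: Computing power resource information is subscribed to and reported.
Model training management entity #2, model training entity #1, and model training entity #2 may perform steps S601 to S604 described in FIG. 6. Details are not repeated here.
S1003: Model training management entity #2 sends total computing power resource information to model training management entity #1.
Model training management entity #2 may directly forward computing power resource information #1 and computing power resource information #2, received from model training entity #1 and model training entity #2, to model training management entity #1, or may process computing power resource information #1 and computing power resource information #2 before sending the result to model training management entity #1. This is not specifically limited in this application.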
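The two options in S1003, forwarding the reports as received or processing them first, can be sketched as follows. The function name, the dictionary keys, and the choice of a sum as the processing step are assumptions for illustration. The application does not specify what the processing consists of.

```python
from typing import Dict


def build_total_compute_info(reports: Dict[str, float], summarize: bool) -> dict:
    """Build the total computing power resource information sent to the NMS side."""
    if not summarize:
        # Forward each entity's report unchanged.
        return {"per_entity": dict(reports)}
    # One possible processing step: report only the aggregate.
    return {"total_idle": sum(reports.values()), "entity_count": len(reports)}


reports = {"MTE#1": 2.0, "MTE#2": 6.0}
forwarded = build_total_compute_info(reports, summarize=False)
summarized = build_total_compute_info(reports, summarize=True)
```

Forwarding keeps per-entity detail for policy decisions upstream, whereas summarizing reduces the amount of information exchanged between the two management entities.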
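S1004: Model training management entity #1 sends total training status subscription information to model training management entity #2.
The total training status subscription information instructs model training management entity #2 to subscribe to the training status information of the multiple model training entities. For the content of the training status information, refer to the description of step S501 in FIG. 5. Details are not repeated here.
S1005: Training status information is subscribed to and reported.
Model training management entity #2, model training entity #1, and model training entity #2 may perform steps S605 to S608 described in FIG. 6. Details are not repeated here.
S1006: Model training management entity #2 sends total training status information to model training management entity #1.
Model training management entity #2 may directly forward training status information #1 and training status information #2, received from model training entity #1 and model training entity #2, to model training management entity #1, or may process training status information #1 and training status information #2 before sending the result to model training management entity #1. This is not specifically limited in this application.
S1007: Model training management entity #1 sends policy information to model training management entity #2.
The policy information indicates a manner of determining, among the multiple model training entities and based on the multiple pieces of computing power resource information, the target model training entity that assists in executing the target model training task (that is, model training entity #2 in the embodiments of this application), and/or indicates a manner of determining, based on the total length of the training data used to complete the target model training task, the target length of the training data used for the target model training task.
Optionally, if it is the model training management entity that determines whether collaborative training is needed, the policy information further indicates a manner of determining whether collaborative training is needed.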
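For example, the manner of determining whether collaborative training is needed includes at least one of the following:
Priority-based: whether collaborative training is needed is determined based on the priority of the training task. When the priority of the training task is greater than or equal to a priority threshold, collaborative training is needed. When the priority of the training task is less than the priority threshold, collaborative training is not needed.
Computing power resource utilization-based: whether collaborative training is needed is determined based on computing power resource utilization. When the computing power resource utilization of a model training entity is greater than or equal to a utilization threshold, that model training entity needs collaborative training. When its computing power resource utilization is less than the utilization threshold, it does not need collaborative training.
Training data volume-based: whether collaborative training is needed is determined based on the amount of training data that the training task requires. When the training data volume of the training task is greater than a data volume threshold, collaborative training is needed. When the training data volume is less than or equal to the data volume threshold, collaborative training is not needed.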
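The three manners listed above differ only in the quantity compared and in whether the comparison is inclusive. A minimal sketch follows, in which the function name and the manner labels (`"priority"`, `"utilization"`, `"data_volume"`) are hypothetical:

```python
def needs_collaboration(manner: str, value: float, threshold: float) -> bool:
    """Decide whether collaborative training is needed under one manner."""
    if manner in ("priority", "utilization"):
        # Priority and utilization use an inclusive comparison (>=).
        return value >= threshold
    if manner == "data_volume":
        # Training data volume uses a strict comparison (>).
        return value > threshold
    raise ValueError(f"unknown manner: {manner}")
```

For instance, a task whose priority equals the priority threshold is trained collaboratively, whereas a task whose data volume equals the data volume threshold is not.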
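For example, the manner of determining the target model training entity includes at least one of the following:
Computing power resource-based: the target model training entity is determined based on idle computing power resources. For example, the model training entity with the most idle computing power resources is selected as the target model training entity.
Node position-based: the target model training entity is determined based on the position of a model training entity relative to the primary model training entity (for example, model training entity #1 in the embodiments of this application). For example, the model training entity nearest to the primary model training entity is selected as the target model training entity.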
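Both selection manners reduce to picking the best candidate under one metric. The sketch below assumes idle computing power and distance to the primary entity are each available as a single number per candidate. The function names, units, and entity labels are hypothetical.

```python
from typing import Dict


def select_by_idle_compute(idle_compute: Dict[str, float]) -> str:
    """Pick the candidate with the most idle computing power resources."""
    return max(idle_compute, key=lambda entity: idle_compute[entity])


def select_by_position(distance_to_primary: Dict[str, float]) -> str:
    """Pick the candidate nearest to the primary model training entity."""
    return min(distance_to_primary, key=lambda entity: distance_to_primary[entity])


idle = {"MTE#2": 4.0, "MTE#3": 7.5}
distance = {"MTE#2": 1.0, "MTE#3": 3.0}
chosen_by_compute = select_by_idle_compute(idle)
chosen_by_position = select_by_position(distance)
```

The two manners can disagree, as they do here, which is one reason the policy information identifies the manner to apply.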
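For example, the manner of determining the target length of the training data includes at least one of the following:
Computing power resource-based: the target length is determined based on idle computing power resources. For example, the target length is the length of training data that the idle computing power resources of the model training entity can process.
Entity count-based: the target length is determined based on the number of model training entities that have idle computing power resources. For example, if the model training management entity selects three model training entities to assist in executing the target model training task, the training data of the target model training task may be divided into three portions, each of a target length.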
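The entity count-based manner can be sketched as an even division of the total length. Even division, and the handling of a remainder, are assumptions made for illustration. The application only says that the training data is divided into as many portions as there are selected entities. The function name is hypothetical.

```python
from typing import List


def equal_target_lengths(total_length: int, num_entities: int) -> List[int]:
    """Divide the total training data length into num_entities target lengths."""
    if num_entities < 1:
        raise ValueError("at least one model training entity is required")
    base, extra = divmod(total_length, num_entities)
    # Spread any remainder over the first `extra` entities.
    return [base + (1 if i < extra else 0) for i in range(num_entities)]


target_lengths = equal_target_lengths(3000, 3)
```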
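It can be understood that the policy information may include the specific manner, or an index of the manner such as its name or identifier, or parameters of the manner such as the priority threshold or the data volume threshold. This is not specifically limited in this application.
S1008: The training task is determined.
Model training entity #1 and model training management entity #2 may perform step S503 in FIG. 5, steps S701 to S703 in FIG. 7, or steps S801 to S803 in FIG. 8. Details are not repeated here.
S1009: Model training management entity #2 sends training task update notification information to model training management entity #1.
The training task update notification information may be the training task update notification information that model training management entity #2 obtains from model training entity #1, and indicates that a model training task has been added on model training entity #1. For related content, refer to the description of step S703 in FIG. 7 or step S803 in FIG. 8. Details are not repeated here.
It can be understood that model training management entity #2 may directly forward the training task update notification information obtained from model training entity #1 to model training management entity #1, or may process the training task update notification information before sending it to model training management entity #1. This is not specifically limited in this application.
S1010: The allocation manner is determined and collaborative training is executed.
Model training management entity #2, model training entity #1, and model training entity #2 may perform steps S504 to S508 in FIG. 5, steps S704 to S710 in FIG. 7, or steps S804 to S813 in FIG. 8, and optionally may also perform all of FIG. 9. Details are not repeated here.
It can be understood that model training management entity #2 may determine the allocation manner based on the policy information obtained from model training management entity #1.
S1011: Model training management entity #2 sends training task information to model training management entity #1.
The training task information may be the training task information that model training management entity #2 sends to model training entity #1. For related content, refer to the description of step S505 in FIG. 5. Details are not repeated here.
It can be understood that model training management entity #2 may directly send to model training management entity #1 the training task information that it sent to model training entity #1, or may process the training task information before sending it to model training management entity #1. This is not specifically limited in this application.
Based on this technical solution, the model training management entity in the NMS can send policy information to the model training management entity in the EMS, which improves the efficiency of model training.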
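It should be understood that, in the foregoing description of the embodiments, one piece of information may be carried in one or more messages, or in one or more information elements of the same message, for example in two messages or in two information elements of the same message. This is not specifically limited in this application.
The foregoing describes the method embodiments of this application. The corresponding apparatus embodiments are described below. It should be understood that the description of the apparatus embodiments corresponds to the description of the method embodiments, so for parts not described in detail, refer to the foregoing method embodiments.
FIG. 11 is a schematic diagram of a communication apparatus according to an embodiment of this application. As shown in FIG. 11, the apparatus 1100 may include a transceiver module 1110 and a processing module 1120. The transceiver module 1110 may communicate with the outside of the apparatus, and the processing module 1120 is configured to process data. The transceiver module 1110 may also be referred to as a communication interface or a transceiver unit.
In a possible design, the apparatus 1100 may implement the procedure performed by the model training management entity in the method embodiments shown in FIG. 5 to FIG. 10, where the processing module 1120 is configured to perform the processing-related operations of the model training management entity in those method embodiments, and the transceiver module 1110 is configured to perform the sending- and receiving-related operations of the model training management entity in those method embodiments.
For example, the transceiver module 1110 is configured to receive training status information of a first model training entity, where the training status information indicates at least one model training task that the first model training entity has. The transceiver module 1110 is further configured to obtain multiple pieces of computing power resource information, which respectively indicate the idle computing power resources that multiple model training entities have for model training. The processing module 1120 is configured to determine a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information. The transceiver module 1110 is further configured to send first training task configuration information to the first target model training entity, where the first training task configuration information indicates assisting the first model training entity in executing a target model training task among the at least one model training task.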
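The division of work between the transceiver module and the processing module in this design can be sketched as a small class. Everything named here is hypothetical: the class, its methods, the message dictionary, and the rule of picking the other entity with the most idle computing power. The application leaves the selection rule to the policy in use.

```python
from typing import Callable, Dict, List, Optional


class ManagementApparatus:
    """Sketch of the management-side apparatus; `send` stands in for the transceiver."""

    def __init__(self, send: Callable[[str, dict], None]) -> None:
        self._send = send
        self.tasks: Dict[str, List[str]] = {}   # training status per entity
        self.idle: Dict[str, float] = {}        # idle computing power per entity

    # Transceiver-side operations: record what is received.
    def on_training_status(self, entity_id: str, task_ids: List[str]) -> None:
        self.tasks[entity_id] = list(task_ids)

    def on_compute_resource(self, entity_id: str, idle: float) -> None:
        self.idle[entity_id] = idle

    # Processing-side operation: determine the first target model training entity.
    def determine_target(self, first_entity: str) -> Optional[str]:
        candidates = {e: c for e, c in self.idle.items()
                      if e != first_entity and c > 0}
        if not candidates:
            return None
        return max(candidates, key=lambda e: candidates[e])

    def assign(self, first_entity: str, task_id: str) -> Optional[str]:
        if task_id not in self.tasks.get(first_entity, []):
            raise ValueError("unknown task for this entity")
        target = self.determine_target(first_entity)
        if target is not None:
            # First training task configuration information.
            self._send(target, {"assist": first_entity, "task_id": task_id})
        return target


sent = []
apparatus = ManagementApparatus(lambda dst, msg: sent.append((dst, msg)))
apparatus.on_training_status("MTE#1", ["task-1"])
apparatus.on_compute_resource("MTE#1", 0.0)
apparatus.on_compute_resource("MTE#2", 5.0)
apparatus.on_compute_resource("MTE#3", 2.0)
target = apparatus.assign("MTE#1", "task-1")
```

Injecting `send` as a callable keeps the selection logic testable without a real transport.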
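Based on this technical solution, the model training management entity can manage and orchestrate the computing power resources of multiple model training entities, and assign a training task of the first model training entity to another model training entity with sufficient computing power resources to assist with the training. This reduces the training waiting time of the training task of the first model training entity and improves the efficiency of model training.
In another possible design, the apparatus 1100 may implement the procedure performed by the first model training entity (that is, model training entity #1) in the method embodiments shown in FIG. 5 to FIG. 10, where the transceiver module 1110 is configured to perform the sending- and receiving-related operations of model training entity #1 in those method embodiments, and the processing module 1120 is configured to perform the processing-related operations of model training entity #1 in those method embodiments.
Based on this technical solution, a model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and orchestrate the computing power resources of multiple model training entities and assign a training task of the model training entity to another model training entity with sufficient computing power resources to assist with the training. This reduces the training waiting time of the training task of the model training entity and improves the efficiency of model training.
For example, the processing module 1120 is configured to generate training status information. The transceiver module 1110 is configured to send the training status information to the model training management entity, where the training status information indicates at least one model training task that the model training entity has. The transceiver module 1110 is further configured to send computing power resource information to the model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has for model training. The transceiver module 1110 is further configured to receive training task information from the model training management entity, where the training task information indicates a first target model training entity that assists in executing a target model training task among the at least one model training task.
In yet another possible design, the apparatus 1100 may implement the procedure performed by the first target model training entity (that is, model training entity #2) in the method embodiments shown in FIG. 5 to FIG. 10, where the transceiver module 1110 is configured to perform the sending- and receiving-related operations of model training entity #2 in those method embodiments, and the processing module 1120 is configured to perform the processing-related operations of model training entity #2 in those method embodiments.
For example, the transceiver module 1110 is configured to send computing power resource information to the model training management entity, where the computing power resource information indicates the idle computing power resources that the model training entity has for model training. The transceiver module 1110 is further configured to receive first training task configuration information from the model training management entity, where the first training task configuration information indicates assisting a first model training entity in executing a target model training task. The processing module 1120 is configured to assist the first model training entity in executing the target model training task.
Based on this technical solution, a model training entity can report computing power resource information and training status information to the model training management entity, so that the model training management entity can manage and orchestrate the computing power resources of multiple model training entities and assign a training task of a model training entity to another model training entity with sufficient computing power resources to assist with the training. This reduces the training waiting time of the training task and improves the efficiency of model training.
It should be understood that the apparatus 1100 here is embodied in the form of functional units. The term "unit" here may refer to an application-specific integrated circuit (ASIC), an electronic circuit, a processor (for example, a shared processor, a dedicated processor, or a group processor) configured to execute one or more software or firmware programs, a memory, a merged logic circuit, and/or another suitable component that supports the described functions. In an optional example, a person skilled in the art can understand that the apparatus 1100 may specifically be the model training management entity in the foregoing embodiments or a chip applied to the model training management entity, and may be configured to perform the procedure corresponding to the model training management entity in the foregoing method embodiments. Alternatively, the apparatus 1100 may specifically be the model training entity in the foregoing embodiments or a chip applied to the model training entity, and may be configured to perform the procedure corresponding to the model training entity in the foregoing method embodiments. To avoid repetition, details are not repeated here.
The apparatus 1100 has the function of implementing the corresponding steps performed by the model training management entity in the foregoing methods, or the function of implementing the corresponding steps performed by the model training entity in the foregoing methods. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the foregoing functions. For example, the transceiver module may be replaced by a transceiver (for example, the sending unit in the transceiver module may be replaced by a transmitter, and the receiving unit in the transceiver module may be replaced by a receiver), and other units, such as the processing module, may be replaced by a processor, to respectively perform the sending and receiving operations and the related processing operations in the method embodiments.
In addition, the transceiver module may alternatively be a transceiver circuit (for example, it may include a receiving circuit and a sending circuit), and the processing module may be a processing circuit. In the embodiments of this application, the apparatus in FIG. 11 may be the model training management entity or the model training entity in the foregoing embodiments, or may be a chip or a chip system, for example a system on chip (SoC). The transceiver module may be an input/output circuit or a communication interface. The processing module is a processor, microprocessor, or integrated circuit integrated on the chip. This is not limited here.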
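FIG. 12 shows a communication apparatus 1200 according to an embodiment of this application. The apparatus 1200 includes a processor 1210 and a memory 1220. The memory 1220 is configured to store instructions, and the processor 1210 may invoke the instructions stored in the memory 1220 to perform the procedure corresponding to the model training management entity or the model training entity in the foregoing method embodiments.
Specifically, in a possible implementation, the memory 1220 is configured to store instructions, and the processor 1210 may invoke the instructions stored in the memory 1220 to perform the procedure corresponding to the model training management entity in the foregoing method embodiments.
Specifically, in another possible implementation, the memory 1220 is configured to store instructions, and the processor 1210 may invoke the instructions stored in the memory 1220 to perform the procedure corresponding to the model training entity in the foregoing method embodiments.
It should be understood that the apparatus 1200 may specifically be the model training management entity or the model training entity in the foregoing embodiments, or may be a chip or chip system used for the model training management entity or the model training entity. Specifically, the apparatus 1200 may be configured to perform the procedure corresponding to the model training management entity or the model training entity in the foregoing method embodiments.
Optionally, the memory 1220 may include a read-only memory and a random access memory, and provide instructions and data to the processor. A part of the memory may further include a non-volatile random access memory. For example, the memory may further store information about the device type. The processor 1210 may be configured to execute the instructions stored in the memory, and when the processor 1210 executes the instructions stored in the memory, the processor 1210 is configured to perform the procedure of the method embodiments corresponding to the model training management entity or the model training entity.
In an implementation process, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The steps of the methods disclosed with reference to the embodiments of this application may be directly performed by a hardware processor, or performed by a combination of hardware and software modules in the processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the foregoing methods in combination with its hardware. To avoid repetition, details are not described here.
It should be noted that the processor in the embodiments of this application may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor in the embodiments of this application may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed with reference to the embodiments of this application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the foregoing methods in combination with its hardware.
It can be understood that the memory in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically EPROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described in this document is intended to include, but is not limited to, these and any other suitable types of memory.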
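FIG. 13 shows a communication apparatus 1300 according to an embodiment of this application. The apparatus 1300 includes a processing circuit 1310 and a transceiver circuit 1320. The processing circuit 1310 and the transceiver circuit 1320 communicate with each other through an internal connection path, and the processing circuit 1310 is configured to execute instructions to control the transceiver circuit 1320 to send and/or receive signals.
Optionally, the apparatus 1300 may further include a storage medium 1330, which communicates with the processing circuit 1310 and the transceiver circuit 1320 through an internal connection path. The storage medium 1330 is configured to store instructions, and the processing circuit 1310 may execute the instructions stored in the storage medium 1330.
In a possible implementation, the apparatus 1300 is configured to implement the procedure corresponding to the model training management entity in the foregoing method embodiments.
When the communication apparatus 1300 is configured to implement the methods shown in FIG. 5 to FIG. 10, the processing circuit 1310 is configured to implement the functions of the processing module 1120, and the transceiver circuit 1320 is configured to implement the functions of the transceiver module 1110, or of the transceiver module 1110 and the processing module 1120.
In another possible implementation, the apparatus 1300 is configured to implement the procedure corresponding to the model training entity in the foregoing method embodiments.
When the communication apparatus 1300 is configured to implement the methods shown in FIG. 5 to FIG. 10, the processing circuit 1310 is configured to implement the functions of the processing module 1120, and the transceiver circuit 1320 is configured to implement the functions of the transceiver module 1110, or of the transceiver module 1110 and the processing module 1120.
According to the methods provided in the embodiments of this application, this application further provides a computer program product. The computer program product includes computer program code, and when the computer program code is run on a computer, the computer is enabled to perform the methods in the embodiments shown in FIG. 5 to FIG. 10.
According to the methods provided in the embodiments of this application, this application further provides a computer-readable medium. The computer-readable medium stores program code, and when the program code is run on a computer, the computer is enabled to perform the methods in the embodiments shown in FIG. 5 to FIG. 10.
According to the methods provided in the embodiments of this application, this application further provides a system, which includes the foregoing model training management entity and multiple model training entities.
In this document, the term "at least one of ..." means all or any combination of the listed items. For example, "at least one of A, B, and C" may represent the following seven cases: A alone, B alone, C alone, both A and B, both A and C, both B and C, and all of A, B, and C. In this document, "at least one" means one or more, and "multiple" means two or more.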
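The meaning of "at least one of" can be checked by enumerating every non-empty combination of the listed items. The sketch below uses `itertools.combinations` from the Python standard library, and the function name `at_least_one_of` is hypothetical. For n items there are 2^n - 1 such combinations, which gives seven for A, B, and C.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def at_least_one_of(items: Sequence[str]) -> List[Tuple[str, ...]]:
    """List every non-empty combination of the given items."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


cases = at_least_one_of(["A", "B", "C"])
```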
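It should be understood that, in the embodiments of this application, "B corresponding to A" means that B is associated with A and that B can be determined based on A. It should also be understood that determining B based on A does not mean that B is determined based only on A. B may also be determined based on A and/or other information. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
It should also be understood that, in the embodiments of this application, "first", "second", and other numerals are used merely for ease of description to distinguish between items, for example between different pieces of information, and are not intended to limit the scope of the embodiments of this application.
It should also be understood that, in the embodiments of this application, "indicate" may include direct indication and indirect indication, and may include explicit indication and implicit indication. If the information indicated by a piece of information (for example, the first information described above) is called to-be-indicated information, there are many ways to indicate the to-be-indicated information in a specific implementation. For example, without limitation, the to-be-indicated information may be indicated directly, such as by the to-be-indicated information itself or by an index of the to-be-indicated information. The to-be-indicated information may also be indicated indirectly by indicating other information that is associated with it. Alternatively, only a part of the to-be-indicated information may be indicated, with the other parts being known or agreed on in advance. For example, specific information may also be indicated by means of a pre-agreed (for example, protocol-defined) order of the pieces of information, which reduces indication overhead to some extent.
It should also be understood that, in the embodiments of this application, "preconfigured" may be implemented by pre-storing corresponding code, a table, or another means that can indicate the related information in a device (for example, a first terminal device). The specific implementation is not limited in this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed in this document can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments. Details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application essentially, or the part contributing to the technical solutions, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.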

Claims (41)

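  1. A model training management method, wherein the method comprises:
    receiving, by a model training management entity, training status information of a first model training entity, wherein the training status information indicates at least one model training task that the first model training entity has;
    obtaining, by the model training management entity, multiple pieces of computing power resource information, wherein the multiple pieces of computing power resource information respectively indicate idle computing power resources that multiple model training entities have for model training;
    determining, by the model training management entity, a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information; and
    sending, by the model training management entity, first training task configuration information to the first target model training entity, wherein the first training task configuration information indicates assisting the first model training entity in executing a target model training task among the at least one model training task.
  2. The method according to claim 1, wherein the obtaining, by the model training management entity, the multiple pieces of computing power resource information comprises:
    periodically receiving, by the model training management entity, the multiple pieces of computing power resource information respectively from the multiple model training entities.
  3. The method according to claim 1, wherein the obtaining, by the model training management entity, the multiple pieces of computing power resource information comprises:
    sending, by the model training management entity, multiple pieces of computing power resource query information respectively to the multiple model training entities; and
    receiving, by the model training management entity, the multiple pieces of computing power resource information respectively from the multiple model training entities.
  4. The method according to any one of claims 1 to 3, wherein the receiving, by the model training management entity, the training status information of the first model training entity comprises:
    periodically receiving, by the model training management entity, the training status information from the first model training entity.
  5. The method according to any one of claims 1 to 4, wherein the method further comprises:
    sending, by the model training management entity, training task configuration notification information to the first model training entity, wherein the training task configuration notification information indicates that the first target model training entity assists in executing the target model training task.
  6. The method according to any one of claims 1 to 4, wherein the method further comprises:
    determining, by the model training management entity, a target length of training data used by the first target model training entity to assist in executing the target model training task; and
    sending, by the model training management entity, second training task configuration information to the first model training entity, wherein the second training task configuration information indicates that the first target model training entity uses training data of the target length to assist in executing the target model training task.
  7. The method according to claim 6, wherein the training status information further indicates a total length of training data used to complete the target model training task, and the determining, by the model training management entity, the target length comprises:
    determining, by the model training management entity, the target length based on the total length and the computing power resource information of the first target model training entity.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    receiving, by the model training management entity, assisted training feedback information from the first target model training entity, wherein the assisted training feedback information indicates at least one of the following:
    the accuracy achieved by the first target model training entity in executing the target model training task,
    the time spent by the first target model training entity in executing the target model training task,
    the progress of the first target model training entity in executing the target model training task, and
    the amount of resources occupied by the first target model training entity in executing the target model training task.
  9. The method according to any one of claims 1 to 8, wherein the method further comprises:
    receiving, by the model training management entity, network status change information from the first target model training entity;
    determining, by the model training management entity based on the network status change information and the multiple pieces of computing power resource information, to replace the first target model training entity with a second target model training entity among the multiple model training entities; and
    sending, by the model training management entity, third training task configuration information to the second target model training entity, wherein the third training task configuration information indicates assisting the first model training entity in executing the target model training task.
  10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    obtaining, by the model training management entity, policy information, wherein the policy information indicates a manner of determining the first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information, and/or indicates a manner of determining, based on a total length of training data used to complete the target model training task, a target length of training data used by the first target model training entity to assist in executing the target model training task.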
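  11. A model training management method, wherein the method comprises:
    sending, by a model training entity, training status information to a model training management entity, wherein the training status information indicates at least one model training task that the model training entity has;
    sending, by the model training entity, computing power resource information to the model training management entity, wherein the computing power resource information indicates idle computing power resources that the model training entity has for model training; and
    receiving, by the model training entity, training task information from the model training management entity, wherein the training task information indicates a first target model training entity that assists in executing a target model training task among the at least one model training task.
  12. The method according to claim 11, wherein the sending, by the model training entity, the computing power resource information to the model training management entity comprises:
    periodically sending, by the model training entity, the computing power resource information to the model training management entity; or
    receiving, by the model training entity, computing power resource query information from the model training management entity, and sending, by the model training entity, the computing power resource information to the model training management entity.
  13. The method according to claim 11 or 12, wherein the sending, by the model training entity, the training status information to the model training management entity comprises:
    periodically sending, by the model training entity, the training status information to the model training management entity; or
    sending, by the model training entity, the training status information to the model training management entity based on a trigger event.
  14. The method according to any one of claims 11 to 13, wherein the training task information further indicates that the first target model training entity uses training data of a target length to assist in executing the target model training task.
  15. The method according to any one of claims 11 to 14, wherein the method further comprises:
    receiving, by the model training entity, model training report information from the first target model training entity, wherein the model training report information indicates a sub-model obtained by completing the target model training task; and
    performing, by the model training entity, model aggregation based on the sub-model.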
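  16. A model training management method, wherein the method comprises:
    sending, by a model training entity, computing power resource information to a model training management entity, wherein the computing power resource information indicates idle computing power resources that the model training entity has for model training;
    receiving, by the model training entity, first training task configuration information from the model training management entity, wherein the first training task configuration information indicates assisting a first model training entity in executing a target model training task; and
    assisting, by the model training entity, the first model training entity in executing the target model training task.
  17. The method according to claim 16, wherein the sending, by the model training entity, the computing power resource information to the model training management entity comprises:
    periodically sending, by the model training entity, the computing power resource information to the model training management entity; or
    receiving, by the model training entity, computing power resource query information from the model training management entity, and sending, by the model training entity, the computing power resource information to the model training management entity.
  18. The method according to claim 16 or 17, wherein the assisting, by the model training entity, the first model training entity in executing the target model training task comprises:
    obtaining, by the model training entity, target training data; and
    assisting, by the model training entity, the first model training entity in executing the target model training task by using the target training data.
  19. The method according to any one of claims 16 to 18, wherein the method further comprises:
    sending, by the model training entity, model training report information to the first model training entity, wherein the model training report information indicates a sub-model obtained by completing the target model training task.
  20. The method according to any one of claims 16 to 19, wherein the method further comprises:
    sending, by the model training entity, assisted training feedback information to the model training management entity, wherein the assisted training feedback information indicates at least one of the following:
    the accuracy achieved by the model training entity in executing the target model training task,
    the time spent by the model training entity in executing the target model training task,
    the progress of the model training entity in executing the target model training task, and
    the amount of resources occupied by the model training entity in executing the target model training task.
  21. The method according to any one of claims 16 to 20, wherein the method further comprises:
    sending, by the model training entity, network status change information to the model training management entity, wherein the network status change information indicates that the model training entity cannot complete the target model training task.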
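  22. A model training management method, wherein the method comprises:
    sending, by a first model training entity, training status information to a model training management entity, wherein the training status information indicates at least one model training task that the first model training entity has;
    receiving, by the model training management entity, the training status information of the first model training entity;
    obtaining, by the model training management entity, multiple pieces of computing power resource information, wherein the multiple pieces of computing power resource information respectively indicate idle computing power resources that multiple model training entities have for model training;
    determining, by the model training management entity, a first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information;
    sending, by the model training management entity, first training task configuration information to the first target model training entity, wherein the first training task configuration information indicates assisting the first model training entity in executing a target model training task;
    receiving, by the first target model training entity, the first training task configuration information from the model training management entity; and
    assisting, by the first target model training entity, the first model training entity in executing the target model training task.
  23. The method according to claim 22, wherein the obtaining, by the model training management entity, the multiple pieces of computing power resource information comprises:
    periodically sending, by the multiple model training entities, the computing power resource information to the model training management entity; and
    periodically receiving, by the model training management entity, the multiple pieces of computing power resource information respectively from the multiple model training entities.
  24. The method according to claim 22, wherein the obtaining, by the model training management entity, the multiple pieces of computing power resource information comprises:
    sending, by the model training management entity, multiple pieces of computing power resource query information respectively to the multiple model training entities;
    receiving, by the multiple model training entities, the multiple pieces of computing power resource query information respectively from the model training management entity;
    sending, by the multiple model training entities, the multiple pieces of computing power resource information respectively to the model training management entity; and
    receiving, by the model training management entity, the multiple pieces of computing power resource information respectively from the multiple model training entities.
  25. The method according to any one of claims 22 to 24, wherein the sending, by the first model training entity, the training status information to the model training management entity and the receiving, by the model training management entity, the training status information of the first model training entity comprise:
    periodically sending, by the first model training entity, the training status information of the first model training entity to the model training management entity, and periodically receiving, by the model training management entity, the training status information from the first model training entity; or
    sending, by the first model training entity, the training status information to the model training management entity based on a trigger event, and receiving, by the model training management entity, the training status information from the first model training entity.
  26. The method according to any one of claims 22 to 25, wherein the method further comprises:
    sending, by the model training management entity, training task configuration notification information to the first model training entity, wherein the training task configuration notification information indicates that the first target model training entity assists in executing the target model training task; and
    receiving, by the first model training entity, the training task configuration notification information from the model training management entity.
  27. The method according to any one of claims 22 to 25, wherein the method further comprises:
    determining, by the model training management entity, a target length of training data used by the first target model training entity to assist in executing the target model training task;
    sending, by the model training management entity, second training task configuration information to the first model training entity, wherein the second training task configuration information indicates that the first target model training entity uses training data of the target length to assist in executing the target model training task; and
    receiving, by the first model training entity, the second training task configuration information from the model training management entity.
  28. The method according to claim 27, wherein the training status information further indicates a total length of training data used to complete the target model training task, and the determining, by the model training management entity, the target length comprises:
    determining, by the model training management entity, the target length based on the total length and the computing power resource information of the first target model training entity.
  29. The method according to any one of claims 22 to 28, wherein the method further comprises:
    sending, by the first target model training entity, assisted training feedback information to the model training management entity; and
    receiving, by the model training management entity, the assisted training feedback information from the first target model training entity, wherein the assisted training feedback information indicates at least one of the following:
    the accuracy achieved by the first target model training entity in executing the target model training task,
    the time spent by the first target model training entity in executing the target model training task,
    the progress of the first target model training entity in executing the target model training task, and
    the amount of resources occupied by the first target model training entity in executing the target model training task.
  30. The method according to any one of claims 22 to 29, wherein the method further comprises:
    sending, by the first target model training entity, network status change information to the model training management entity;
    receiving, by the model training management entity, the network status change information from the first target model training entity;
    determining, by the model training management entity based on the network status change information and the multiple pieces of computing power resource information, to replace the first target model training entity with a second target model training entity among the multiple model training entities;
    sending, by the model training management entity, third training task configuration information to the second target model training entity, wherein the third training task configuration information indicates assisting the first model training entity in executing the target model training task; and
    receiving, by the second target model training entity, the third training task configuration information from the model training management entity.
  31. The method according to any one of claims 22 to 30, wherein the method further comprises:
    obtaining, by the model training management entity, policy information, wherein the policy information indicates a manner of determining the first target model training entity among the multiple model training entities based on the multiple pieces of computing power resource information, and/or indicates a manner of determining, based on a total length of training data used to complete the target model training task, a target length of training data used by the first target model training entity to assist in executing the target model training task.
  32. The method according to any one of claims 22 to 31, wherein the method further comprises:
    sending, by the first target model training entity, model training report information to the first model training entity, wherein the model training report information indicates a sub-model obtained by completing the target model training task;
    receiving, by the first model training entity, the model training report information from the first target model training entity; and
    performing, by the first model training entity, model aggregation based on the sub-model.
  33. The method according to claim 22 or 32, wherein the assisting, by the first target model training entity, the first model training entity in executing the target model training task comprises:
    obtaining, by the first target model training entity, target training data; and
    assisting, by the first target model training entity, the first model training entity in executing the target model training task by using the target training data.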
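  34. A communication apparatus, comprising a module configured to perform the method according to any one of claims 1 to 10, or a module configured to perform the method according to any one of claims 11 to 15, or a module configured to perform the method according to any one of claims 16 to 21, or a module configured to perform the method according to any one of claims 22 to 33.
  35. A communication apparatus, comprising at least one processor, wherein the at least one processor is configured to execute a computer program stored in a memory, so that the apparatus implements the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
  36. A communication apparatus, comprising at least one processor, wherein the at least one processor is coupled to at least one memory, and the at least one processor is configured to execute a computer program or instructions stored in the at least one memory, so that the method according to any one of claims 1 to 10 is performed, or the method according to any one of claims 11 to 15 is performed, or the method according to any one of claims 16 to 21 is performed, or the method according to any one of claims 22 to 33 is performed.
  37. A computer-readable storage medium, comprising a computer program, wherein when the computer program is run on a computer, the computer is enabled to perform the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
  38. A communication system, comprising an apparatus configured to perform the method according to any one of claims 1 to 10, an apparatus configured to perform the method according to any one of claims 11 to 15, and an apparatus configured to perform the method according to any one of claims 16 to 21.
  39. A computer program product, comprising computer program code, wherein when the computer program code is run on a computer, the computer is enabled to implement the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
  40. A communication apparatus, comprising a processor configured to execute a computer program stored in a memory, so that the apparatus performs the method according to any one of claims 1 to 10, the method according to any one of claims 11 to 15, the method according to any one of claims 16 to 21, or the method according to any one of claims 22 to 33.
  41. The apparatus according to claim 40, wherein the apparatus further comprises the memory.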
PCT/CN2023/120765 2022-09-27 2023-09-22 一种模型训练管理的方法、装置和系统 Ceased WO2024067404A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP23870635.2A EP4567638A4 (en) 2022-09-27 2023-09-22 METHOD, APPARATUS AND MODEL LEARNING MANAGEMENT SYSTEM
US19/060,546 US20250190880A1 (en) 2022-09-27 2025-02-21 Model training management method, apparatus, and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211181988.7A CN117828341A (zh) 2022-09-27 2022-09-27 一种模型训练管理的方法、装置和系统
CN202211181988.7 2022-09-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/060,546 Continuation US20250190880A1 (en) 2022-09-27 2025-02-21 Model training management method, apparatus, and system

Publications (1)

Publication Number Publication Date
WO2024067404A1 true WO2024067404A1 (zh) 2024-04-04

Family

ID=90476149

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/120765 Ceased WO2024067404A1 (zh) 2022-09-27 2023-09-22 一种模型训练管理的方法、装置和系统

Country Status (4)

Country Link
US (1) US20250190880A1 (zh)
EP (1) EP4567638A4 (zh)
CN (1) CN117828341A (zh)
WO (1) WO2024067404A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119725805A (zh) * 2024-12-12 2025-03-28 广芯微电子(广州)股份有限公司 一种基于算力协同的无线电池管理方法及系统
WO2025213018A1 (en) * 2024-04-05 2025-10-09 Interdigital Patent Holdings, Inc. Edge-aware aiml enabler service for distributed and split learning
WO2025256547A1 (zh) * 2024-06-14 2025-12-18 华为技术有限公司 通信方法及装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120872424A (zh) * 2024-04-30 2025-10-31 华为技术有限公司 一种训练任务的处理方法以及相关设备
CN120803755B (zh) * 2025-09-15 2025-11-25 浪潮电子信息产业股份有限公司 一种模型训练的控制方法、程序产品、电子设备及介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110221913A (zh) * 2019-04-26 2019-09-10 深圳市致宸信息科技有限公司 监控服务器的云算力的方法、终端、设备及存储介质
CN111368991A (zh) * 2018-12-25 2020-07-03 杭州海康威视数字技术股份有限公司 深度学习模型的训练方法、装置及电子设备
WO2021105313A1 (en) * 2019-11-28 2021-06-03 Secondmind Limited Parallelised training of machine learning models
CN113849295A (zh) * 2020-06-28 2021-12-28 华为技术有限公司 模型训练的方法、装置及计算机可读存储介质
CN114089889A (zh) * 2021-02-09 2022-02-25 京东科技控股股份有限公司 模型训练方法、装置以及存储介质
CN114169427A (zh) * 2021-12-06 2022-03-11 北京百度网讯科技有限公司 基于端到端自适应的分布式训练方法、装置、设备
CN114595058A (zh) * 2022-03-02 2022-06-07 北京金山云网络技术有限公司 基于gpu资源的模型训练方法和装置、电子设备和存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190317825A1 (en) * 2018-04-16 2019-10-17 Kazuhm, Inc. System for managing deployment of distributed computing resources
US11423254B2 (en) * 2019-03-28 2022-08-23 Intel Corporation Technologies for distributing iterative computations in heterogeneous computing environments
CN112202928B (zh) * 2020-11-16 2022-05-17 绍兴文理学院 传感边缘云区块链网络可信卸载协作节点选择系统及方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4567638A4

Also Published As

Publication number Publication date
EP4567638A1 (en) 2025-06-11
US20250190880A1 (en) 2025-06-12
CN117828341A (zh) 2024-04-05
EP4567638A4 (en) 2025-12-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 23870635; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2023870635; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2023870635; Country of ref document: EP; Effective date: 20250304)
NENP Non-entry into the national phase (Ref country code: DE)
WWP Wipo information: published in national office (Ref document number: 2023870635; Country of ref document: EP)