WO2024017283A1 - 一种模型训练系统、方法及相关设备 - Google Patents
一种模型训练系统、方法及相关设备 Download PDFInfo
- Publication number
- WO2024017283A1 WO2024017283A1 PCT/CN2023/108091 CN2023108091W WO2024017283A1 WO 2024017283 A1 WO2024017283 A1 WO 2024017283A1 CN 2023108091 W CN2023108091 W CN 2023108091W WO 2024017283 A1 WO2024017283 A1 WO 2024017283A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- memory
- node
- parameters
- model
- parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/10—Program control for peripheral devices
- G06F13/12—Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor
- G06F13/124—Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor where hardware is a sequential transfer control unit, e.g. microprocessor, peripheral processor or state-machine
- G06F13/128—Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor where hardware is a sequential transfer control unit, e.g. microprocessor, peripheral processor or state-machine for dedicated transfers to a network
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/16—Handling requests for interconnection or transfer for access to memory bus
- G06F13/1605—Handling requests for interconnection or transfer for access to memory bus based on arbitration
- G06F13/1652—Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
- G06F13/1657—Access to multiple memories
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/40—Bus structure
- G06F13/4004—Coupling between buses
- G06F13/4027—Coupling between buses using bus bridges
- G06F13/4045—Coupling between buses using bus bridges where the bus bridge performs an extender function
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- This application relates to the field of artificial intelligence, and in particular to a model training system, method and related equipment.
- Search and recommendation tasks are common business scenarios for Internet applications and are also the core technologies of the Internet.
- the attribute information of an object is represented as an embedding vector, and the embedding vector is updated during the model training process.
- Model parameters During model training, the node responsible for model training can obtain the required model parameters from the node responsible for storing training samples in the node cluster.
- the node responsible for training sample storage can send model parameters one by one to the node responsible for model training.
- the number of model parameters is large, the data transmission efficiency is low, which in turn results in model training efficiency. lower.
- This application provides a model training system for improving data transmission efficiency between nodes in a computing node cluster during model training.
- this application also provides a model training system, method, computer-readable storage medium and computer program product.
- this application provides a model training system.
- the model training system includes multiple nodes. Parameters required for model training are distributed among the multiple nodes.
- the first node among the multiple nodes includes a first memory and a second memory. , some parameters of the model are stored in the first memory and the second memory respectively.
- the first memory and the second memory communicate with the processor in the first node through different protocols; the second node among the multiple nodes is used to send Multiple parameter acquisition requests are sent to the first node.
- the multiple parameter acquisition requests are used to acquire the parameters required by the second node to train the model.
- the first node is used to determine the parameters acquired by the multiple parameter acquisition requests according to the multiple parameter acquisition requests. Stored in the first memory and the second memory, merge the parameters obtained from the first memory and the second memory, and send the merged parameters to the second node; the second node is also used to receive the merged parameters, and Use the merged parameters to train the model.
- the parameters obtained from the first memory and the second memory are combined and then sent.
- the transmitted report can be made The number of files is greatly reduced, thereby improving data transmission efficiency.
- the first node when used to merge the parameters obtained from the first memory and the second memory, it is specifically used to: obtain the parameters from the first memory and the parameters obtained from the second memory. are cached into the first memory for merging.
- the first node when used to merge the parameters obtained from the first memory and the second memory, it is specifically used to: obtain the parameters from the first memory and the parameters obtained from the second memory. are cached into the second memory for merging.
- the first node when used to merge the parameters obtained from the first memory and the second memory, it is specifically used to: cache the parameters obtained from the first memory to the second memory, and cache the parameters obtained from the first memory to the second memory.
- the parameters obtained in the second memory are cached in the first memory.
- Caching the parameters obtained from the first memory to the second memory can reduce the bandwidth occupation of the DDR in the writing process compared to caching the parameters obtained from the first memory to the first memory.
- Caching the parameters obtained from the second memory to the first memory can reduce the data transmission delay compared to caching the parameters obtained from the second memory to the second memory.
- the parameters obtained from the first memory and the parameters obtained from the second memory are combined and cached in the first memory, allowing data transmission delay.
- the parameters obtained from the first memory and the parameters obtained from the second memory are combined and cached in the second memory, which can reduce the bandwidth occupation of the DDR in the writing process.
- the second node is also used to, after receiving the merged parameters, sort the merged parameters with parameters received from other nodes, and then use the sorted parameters to train the model.
- the second node is also used to: after training the model, obtain the gradient data generated after the model training, and determine the gradient corresponding to the parameters of the first node based on the gradient data; convert the parameters of the first node The corresponding gradients are merged and sent to the first node; the first node is used to: after receiving the gradient, distinguish the gradient corresponding to the parameters stored in the first memory and the gradient corresponding to the parameters stored in the second memory, and update the first node respectively. Parameters in one memory and parameters in the second memory.
- the first memory is connected to the processor through the DDR protocol
- the second memory is an extended memory
- the second memory is connected to the processor through the open interconnect protocol CXL.
- the model parameters are stored in the extended memory of the first node.
- the DDR bandwidth is not occupied, thereby reducing the occupation of the DDR bandwidth.
- due to the memory Its own storage space is limited. In scenarios where the number of model parameters such as recommended models is huge, the number of nodes storing model parameters can be reduced by storing model parameters in extended memory.
- the model is a recommendation model; the parameters are embedding vector embeddings.
- this application provides a model training method, which is applied to a distributed training system.
- the distributed training system includes multiple nodes, and the parameters required for model training are distributed among the multiple nodes;
- the first node among the plurality of nodes includes a first memory and a second memory.
- the first memory and the second memory respectively store some parameters of the model.
- the first memory and the second memory communicate with the first node through different protocols.
- the processor communicates; methods include:
- the second node among the multiple nodes sends multiple parameter acquisition requests to the first node, and the multiple parameter acquisition requests are used to acquire parameters required by the second node when training the model;
- the first node determines according to the multiple parameter acquisition requests that the parameters obtained by the multiple parameter acquisition requests are stored in the first memory and the second memory respectively, merges the parameters obtained from the first memory and the second memory, and merges the parameters obtained by the first memory and the second memory. Parameters are sent to the second node;
- the second node receives the merged parameters and uses the merged parameters to train the model.
- the parameters obtained from the first memory and the second memory are merged, including:
- the parameters obtained from the first memory and the parameters obtained from the second memory are cached in the first memory for merging.
- the parameters obtained from the first memory and the second memory are merged, including:
- the parameters obtained from the first memory and the parameters obtained from the second memory are cached into the second memory for merging.
- the parameters obtained from the first memory and the second memory are merged, including:
- the parameters obtained from the first memory are cached in the second memory, and the parameters obtained from the second memory are cached in the first memory.
- the method also includes:
- the second node After receiving the merged parameters, the second node sorts the merged parameters with the parameters received from other nodes, and then uses the sorted parameters to train the model.
- the method also includes:
- the second node After the second node trains the model, it obtains the gradient data generated after the model training, and determines the gradient corresponding to the parameters of the first node based on the gradient data;
- the first node After receiving the gradient, the first node distinguishes the gradient corresponding to the parameters stored in the first memory and the gradient corresponding to the parameters stored in the second memory, and updates the parameters in the first memory and the parameters in the second memory respectively.
- the first memory is connected to the processor through the DDR protocol
- the second memory is an extended memory
- the second memory is connected to the processor through the open interconnect protocol CXL.
- the model is a recommendation model
- the parameter is the embedding vector embedding.
- the model training method provided in the second aspect corresponds to the model training system provided in the first aspect, so any method in the second aspect may be implemented
- any method in the second aspect may be implemented
- the present application provides a computing node, which includes a processor, a memory, and a network card; the memory is used to store instructions, and when the computing node is running, the processor and the network card are used to execute the instructions stored in the memory. Instructions to cause the computing node to execute the steps executed by the first node or the second node in the above first aspect or any possible implementation manner.
- the memory can be integrated into the processor or independent of the processor.
- Compute nodes may also include buses. Among them, the processor is connected to the memory through a bus.
- the memory may include readable memory and random access memory; in addition, the memory may also include internal memory and extended memory.
- the present application provides a computer-readable storage medium that stores instructions that, when run on a computing node, cause the computing node to execute the above-mentioned second aspect or any one of the second aspects. A method to achieve this.
- the present application provides a computer program product containing instructions that, when run on a computing node, causes the computing node to execute the above-mentioned second aspect or the method of any implementation of the second aspect, or the above-mentioned third aspect.
- Figure 1 is a schematic diagram of a recommendation scenario provided by an embodiment of this application.
- Figure 2A is an architectural schematic diagram of model training and inference provided by the embodiment of the present application.
- Figure 2B is a schematic architectural diagram of the training system provided by an embodiment of the present application.
- Figure 3A is a schematic diagram of an exemplary node architecture provided by an embodiment of the present application.
- Figure 3B is a schematic diagram of another exemplary node architecture provided by an embodiment of the present application.
- Figure 4 is a schematic diagram of another exemplary node architecture provided by an embodiment of the present application.
- Figure 5 is a schematic flow chart of a model training method provided by an embodiment of the present application.
- Figure 6 is a schematic diagram of data storage provided by an embodiment of the present application.
- Figure 7 is a schematic flow chart of data pulling provided by an embodiment of the present application.
- Figure 8 is a schematic flow chart of data push provided by an embodiment of the present application.
- Figure 9 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
- At least one means one or more, and “plurality” means two or more.
- A/B can represent A or B.
- references to the terms “including”, “including” and “having” in the description of this application are intended to cover a non-exclusive inclusion.
- a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes other unlisted steps or units, or optionally also Includes other steps or units that are inherent to such processes, methods, products, or devices.
- the nodes may be the first node and the second node in the embodiment of this application.
- a node is a device that has at least one of the two functions of processing data and storing data.
- There is an operating system running on the node and the node can be distinguished by the operating system, that is, different nodes run different operating systems. In other words, the hardware and software used to run an operating system can visually belong to the same node.
- the node can be a complete physical machine.
- a terminal, or a network device such as a server, server agent, etc.
- Data access in this application can be understood as one node accessing data maintained by another node through an instance.
- the devices in the node refer to the components or components in the node.
- the CPU and memory are both devices in the node.
- a physical machine refers to a computer packaged into a product, such as a server, desktop computer, all-in-one PC (AIO), laptop or smartphone, etc.
- the main working principle is to use the relationship between the amount of charge stored in the capacitor and the threshold value to represent a binary bit (bit), with a value of 1 or 0. Since transistors have leakage current in reality, the amount of charge stored on the capacitor is not enough to correctly judge the data, resulting in data corruption. Therefore, for DRAM, periodic charging (also called refreshing) is an unavoidable condition. Because of this characteristic of requiring regular refresh, it is called “dynamic" random access memory. Relatively speaking, as long as the static random access memory stores data, the data will not be lost even if it is not refreshed.
- DDR is a standard of the Joint Electronic Engineering Design and Development Council (JEDEC) and a parallel bus communication standard.
- JEDEC Joint Electronic Engineering Design and Development Council
- the standard includes the physical specifications of the connection interface and related protocols for data transmission.
- (4) Local For an instance, the node running the instance (for example, an instance related to model training) is local. For example, local memory, the full name should be "local memory of an instance", which refers to the memory of the node running the instance.
- the nodes on which instances run can be described at different granularities. For example, it can be just a processor, such as a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU), or it can be a complete physical machine, including processing circuits and storage media. The specific description used depends on whether the data access process involved spans physical machines.
- CPU Central Processing Unit
- GPU Graphics Processing Unit
- Remote is a concept opposite to "local". That is to say, for an instance, except for the node running the instance, other nodes are remote.
- the remote end can be a device with computing capabilities or a device used to store data.
- Identification used to distinguish one or one thing from other things of the same or different types.
- the identity of the node the identity of the network, and the identity of the network card.
- the identification can be a name, a number, or it can be identified by a certain distinguishing feature, such as a category identification.
- This application does not place restrictions on the implementation of various logos, as long as distinctions can be made.
- the identifier of the virtual address space of the instance is used as the identifier of the instance, instead of using the common name or number of the instance as the identifier of the instance.
- Address space It can also be called storage space, which refers to one or more addresses that can be used by a certain device or instance.
- the virtual address space of a device or instance can be used by the device or instance, that is, one or more virtual addresses belonging to the device or instance.
- the virtual address space of a device or instance is allocated by the operating system in which the device or instance is located.
- the physical address space of a device or instance is one or more physical addresses allocated to the device or instance. When the device or instance is in this physical address space, other devices or instances cannot use this physical address space. address.
- the physical address space of an instance is allocated by the operating system that the instance is running on. This allocation may be dynamic. For example, as the instance runs, the occupied physical address space becomes larger and larger, but there will be an upper limit.
- the size and range of a device's physical address space are usually fixed.
- Embodiments of this application can be applied to the field of information recommendation.
- This scenario includes but is not limited to e-commerce product recommendation, search engine result recommendation, software recommendation in the application market, music recommendation, video recommendation, news recommendation, reading content recommendation, and lifelong partner related application scenarios, among which items recommended in various application scenarios can also be called "recommended objects".
- the recommended object can be a media content item, such as an APP, or a video (such as a short video or live broadcast video), or music, or a certain product (such as the presentation interface of an online shopping platform, which will be displayed based on the user's preferences). Different products are selected for presentation), or articles.
- users can trigger the recommendation module of the app market by opening the mobile app market.
- the recommendation module of the app market will be based on the user's historical download records, user click records, the application's own characteristics, time, location and other environmental characteristics. information to predict a user's download likelihood for each given candidate application. Based on the predicted results, the application market is displayed in descending order of likelihood, achieving the effect of increasing the probability of application downloads. Specifically, apps that are more likely to be downloaded are ranked higher, and apps that are less likely to be downloaded are ranked lower.
- the user's behavior will also be stored in the log and the parameters of the prediction model will be trained and updated through the offline training module.
- a cognitive brain in applications related to lifelong partners, can be built based on the user's historical data in video, music, news and other fields through various models and algorithms, imitating the human brain mechanism, and building a user lifelong learning system framework.
- Lifelong Companion can record the user's past events based on system data and application data, understand the user's current intentions, predict the user's future actions or behaviors, and ultimately implement intelligent services.
- the recommendation process can be implemented by a recommendation system, which can present information of interest to users. In order to determine the information that the user is interested in, it is necessary to match users and items based on context information, user attribute information, item attribute information, etc.
- Recommendation systems usually involve user behavior log collection, log data preprocessing (such as quantification, sampling, etc.), sample set training to obtain recommendation models, and objects involved in the scenarios corresponding to the training sample items (such as APPs, music, etc.) based on the recommendation models. etc.) for analysis and processing.
- the samples selected in the recommendation model training process come from the operating behavior of users in the mobile application market for the recommended APP, then the recommendation model trained thereby is suitable for the above-mentioned mobile APP application market, or It can be used in the APP application market of other types of terminals to recommend terminal APPs.
- the recommendation model will eventually calculate the recommendation probability or score of each object to be recommended.
- the recommendation system selects the recommendation results according to certain selection rules, such as sorting according to the recommendation probability or score, and presents them to the user through the corresponding application or terminal device. , the user operates the objects in the recommendation results to generate user behavior logs and other links.
- This application can be applied to recommendation systems. Specifically, it can be applied to the training process of recommendation models in recommendation systems.
- Figure 1 is a schematic diagram of a recommendation system provided by an embodiment of the present application.
- the recommendation system will input the request and related information into the recommendation model, and then predict the user's selection rate of items in the system. Furthermore, the items are arranged in descending order according to the predicted selection rate or a function based on the selection rate, that is, the recommendation system can display the items in different locations in order as a recommendation result to the user. Users browse different located items and perform user actions such as browsing, selection, and downloading. At the same time, the user's actual behavior will be stored in the log as model parameters, and the parameters of the recommended model will be continuously updated through the training module (for example, offline or online) to improve the prediction effect of the model.
- the training module for example, offline or online
- the recommendation system in the application market can be triggered.
- the recommendation system of the application market will predict each candidate application recommended by the user based on the user's historical behavior logs, such as the user's historical download records, user selection records, and the application market's own characteristics, such as time, location and other environmental feature information. , APP) probability.
- the recommendation system of the application market can display the candidate APPs in descending order according to the predicted probability value, thereby increasing the download probability of the candidate APPs.
- APPs with a higher predicted user selection rate may be displayed in the front recommendation position
- APPs with a lower predicted user selection rate may be displayed in the lower recommendation position
- the above recommendation model may be a neural network model.
- the relevant terms and concepts of neural networks that may be involved in the embodiments of this application are introduced below.
- the neural network can be composed of neural units.
- the neural unit can refer to an operation unit that takes xs (ie, input data) and intercept 1 as input.
- the output of the operation unit can be:
- s 1, 2,...n, n is a natural number greater than 1
- Ws is the weight of xs
- b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
- the output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
- a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
- the local receptive field can be an area composed of several neural units.
- the error back propagation (BP) algorithm can be used to correct the size of the parameters in the initial model during the training process, so that the error loss of the model becomes smaller and smaller. Specifically, forward propagation of the input signal until the output will produce an error loss, and backward propagation of the error loss information is used to update the parameters in the initial model, so that the error loss converges.
- the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain optimal model parameters, such as weight matrices.
- Figure 2A, Figure 2B, Figure 3A, Figure 3B and Figure 4 are schematic diagrams of the application architecture of this application.
- Figure 2A is a schematic diagram of the architecture described from the perspective of training and reasoning of the recommendation system
- Figure 2B is a distributed training system.
- Architectural diagram, Figure 3A and Figure 3B provide richer details for the node hardware architecture in the distributed training system
- Figure 4 provides richer details for the software architecture of the nodes in the distributed training system.
- this embodiment provides a recommendation system architecture 200.
- the data collection device 260 is used to collect samples.
- a model parameter can be composed of multiple feature information (or described as attribute information, such as user attributes and item attributes), or an embedding vector embedding obtained based on the feature information (such as the embodiment of the present application).
- the characteristic information can be of many kinds, specifically it can include user characteristic information, object characteristic information and tag characteristics.
- the user characteristic information is used to characterize the user's characteristics, such as gender, age, occupation, and hobbies. etc.
- object feature information is used to characterize the characteristics of objects pushed to users.
- Different recommendation systems correspond to different objects, and the types of features that need to be extracted are not the same for different objects.
- the objects extracted from the model parameters of the APP market Features can be the name (logo), type, size, etc. of the APP; while the object features mentioned in the model parameters of the e-commerce APP can be the name of the product, its category, price range, etc.; the label feature is Used to indicate whether the sample is a positive or negative example.
- the label characteristics of the sample can be obtained from the user's operation information on the recommended object. A sample where the user has operated on the recommended object is a positive example, and a sample where the user has not operated on the recommended object is a positive example. Samples that perform operations or only browse are negative examples.
- the label feature is 1, indicating that the sample is a positive example, and if the user does not perform any operations on the recommended object , then the label feature is 0, indicating that the sample is a negative example.
- the sample can be stored in the database 230.
- Some or all of the characteristic information in the sample in the database 230 can also be obtained directly from the client device 240, such as user characteristic information, user operation information on the object (used to determine the type identification ), object characteristic information (such as object identification), etc.
- the model parameters collected by the data collection device 260 can be stored in the database 230.
- the database 230 can be specifically a memory in multiple nodes.
- the memory can be a memory or an extended memory.
- the distributed training system 220 trains the recommendation model 201 based on sample training in the database 230 .
- the recommended model 201 can be used to The objects are evaluated to obtain the scores of each object to be recommended. Further, a specified or preset number of objects can be recommended from the evaluation results of a large number of objects.
- the calculation module 211 obtains the recommendation results based on the evaluation results of the recommendation model 201, and uses I /O interface 212 is recommended to client devices.
- the distributed training system 220 can use the samples in the training set to train the recommendation model to obtain the trained recommendation model; for the implementation details of the calculation module 211, please refer to the details of the method embodiment shown in Figure 5 describe.
- the distributed training system 220 After the distributed training system 220 obtains the trained recommendation model 201 through training, it can send the recommended model 201 to the execution device 210, or directly send the model parameter matrix to the execution device 210, and build the recommendation model in the execution device 210 for Carry out corresponding system recommendations.
- the recommendation model obtained by training based on video-related samples can be used to recommend videos to users in video websites or APPs.
- the recommendation model obtained by training based on APP-related samples can be used to recommend users in the application market. Make APP recommendations.
- the execution device 210 is configured with an I/O interface 212 for data interaction with external devices.
- the execution device 210 can obtain user characteristic information from the client device 240 through the I/O interface 212, such as user identification, user identity, gender, occupation, hobbies, etc. , this part of information can also be obtained from the system database.
- the recommendation model 201 recommends target recommendation objects to the user based on the user characteristic information and the characteristic information of the object to be recommended.
- the execution device 210 can be set in the cloud server or in the user client.
- the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store the output data in the data storage system 250.
- the data storage system 250 can be set up in the execution device 210, can be set up independently, or can be set up in other network entities, and the number can be one or multiple.
- the calculation module 211 uses the recommendation model 201 to process the user feature information and the feature information of the objects to be recommended. For example, the calculation module 211 uses the recommendation model 201 to analyze and process the user feature information and the feature information of the objects to be recommended, thereby obtaining the According to the scores of the objects to be recommended, the objects to be recommended are sorted according to their scores, and the objects with the highest ranking will be used as objects recommended to the client device 240 .
- the I/O interface 212 returns the recommendation results to the client device 240 and presents them to the user.
- the distributed training system 220 may include multiple nodes, and the database 230 may deploy training parameters in multiple nodes.
- the database 230 may deploy training parameters in multiple nodes.
- one or more nodes among the multiple nodes may be configured through a network node cluster. Pull the training parameters required for training from other nodes to perform model training, and return the gradient obtained after training to other nodes, and the other nodes update the training parameters.
- the node 100 can be a node in the distributed training system introduced in Figure 2B. It should be understood that the architecture described in Figure 3A is only an illustration for ease of understanding, and is not a limitation on the architecture that can be used by the node mentioned in this application. Other software parts in the node 100, such as the operating system and other hardware parts, such as The monitor etc. are not displayed.
- the architecture diagram of this node includes the hardware part. Specifically, the hardware includes:
- CPU Central processing unit
- MMU Memory Management Unit
- the two are generally packaged into one chip.
- the CPU runs the application and initiates a request for reading or writing data, which is also referred to as a memory access request below, that is, a request to access the storage medium, because reading or writing data requires finding the location in the storage medium (such as memory or extended content). address.
- the MMU is responsible for address translation of memory access requests initiated by the CPU, that is, converting the virtual address in the memory access request into a physical address.
- Memory (Memory) 1006 Figure 3A takes memory as an example to illustrate the storage medium on the node.
- the physical form of the memory can be a memory stick.
- the memory in addition to being provided to local instances (such as processes), the memory can also be used by instances of other nodes. The way instances of other nodes use it is to write data to this memory through requests, or request to read. the data in this memory.
- the CPU 1004 reads and writes data to the memory 1006, it occupies double data rate (DDR) bandwidth.
- DDR double data rate
- the extended memory component 1002 may include an expansion chip and an expansion memory.
- the expansion chip may include a controller.
- the controller may be an open interconnect protocol (compute express link, CXL) controller and a memory controller.
- CXL control The connector is used to bring out the CXL interface.
- the CXL interface can be connected to the CPU 1004.
- the CXL interface can be connected to the CPU 1004 through the PCIe gold finger. Since the CXL bus is compatible with the physical layer of the PCIe bus, the memory expansion component 1002 can use the PCIe slot in the server chassis like a PCIe device without the need for additional communication cables.
- Extended memory is based on the CXL protocol (or CXL-like protocol) and is a piece of hardware that is inserted into the PCIe slot and has functions similar to memory. It does not have persistence features.
- the internal medium is DRAM, so it can be regarded as an extension of DDR DRAM.
- Its bandwidth is determined by the PCIe protocol The bandwidth (PCIe 5.0: 128GB/S), the latency is about 350ns, which is slightly worse than the read and write latency and bandwidth of DDR.
- CPU 1004 is connected to the expansion chip through the CXL interface, and the expansion chip is connected to the expanded memory through the memory controller, thereby realizing the CPU 1004 to connect to the expanded memory and realizing memory expansion within the node.
- the training module 1001 can be an AI training card such as a neural network processing unit (NPU), a graphics processing unit (GPU), or a tensor processing unit (TPU).
- NPU neural network processing unit
- GPU graphics processing unit
- TPU tensor processing unit
- the node 100 includes the training module 1001 in the architecture shown in FIG. 3A , for a node used to store model parameters, the training module 1001 may not be included, such as that shown in FIG. 3B .
- the transceiver module 1003 can be a network card based on the interconnection protocol between physical machines.
- the interconnection protocol between physical machines can be ROCE, IB, TCP, CXL, etc.
- the CPU 1004 can control the transceiver module 1003 to transmit data to other computing nodes through the network, and the transceiver module 1003 can receive data transmitted from other nodes.
- Figure 4 shows the software architecture of the node cluster.
- each node can include multiple workers and multiple servers (only one is shown in Figure 4). Both workers and servers are separate processes. Each worker is mainly used for training and inference. Related model data, and each server is used to store model parameters (such as the embedding table shown in Figure 4).
- each worker contains a training module, which is responsible for data training and inference work.
- Each server mainly runs on the CPU.
- the embedding table of the server can be placed in the memory or extended memory.
- other data such as calculation parameters in the optimizer can also be placed in memory or extended memory.
- an iterative training process of the model includes two processes: forward propagation and back propagation.
- the forward propagation process is to input the model parameters into the model to be trained and obtain the processing results of the model to be trained.
- the processing results can determine the update gradient corresponding to the model parameters, and back propagation updates the model based on the update gradient corresponding to the model parameters.
- the node responsible for forward propagation can obtain the model parameters from the node that stores the model parameters (such as the first node in the embodiment of the present application).
- the model parameters Obtain the update gradient, and transmit the update gradient back to the node that stores the model parameters, so that the node that stores the model parameters obtains the updated model parameters based on the update gradient, and replaces the model parameters stored before the update with the updated model parameters.
- the process in which the node responsible for forward propagation obtains model parameters can be called the data pull process
- the process in which the node responsible for forward propagation obtains updated gradients based on the model parameters can be called the training process.
- the above-mentioned back-transmission of updated gradients to the node that stores the model parameters can be called a data push process.
- the model parameters are deployed in the memory of the node (such as dynamic random access memory (DRAM)), and the node responsible for forward propagation obtains the model from other nodes responsible for storage. parameters, the DDR bandwidth needs to be occupied, which will cause DDR bandwidth congestion.
- DRAM dynamic random access memory
- the node when the model parameters required for the current model iteration are stored in the memory located at the remote node of the training module (for example, the training module and the memory are located on different nodes), the node (the node where the training module is located)
- the processor of the remote node can notify the processor in the remote node (the node where the model parameters required for the current model iteration are stored) to obtain the model parameters required for the current model iteration.
- the processor of the remote node can obtain the model parameters required for the current model iteration from the corresponding storage in the memory through DDR.
- the transceiver module can pass the data to the node where the training module is located.
- the processor of the remote node reads the model parameters from the memory or writes the model parameters to the cache, it needs to occupy DDR bandwidth.
- the node responsible for forward propagation can include a training module.
- the training module can be a processor with AI processing capabilities such as CPU, NPU, or GPU.
- the current model iteration requires
- the node's processor such as the CPU
- the node's processor can read the current model iteration parameters from the corresponding storage location of the memory through DDR. required model parameters, and write the data to the cache (attribute) through DDR in the memory storage space) and passed to the training module.
- the processor obtains model parameters from memory or writes model parameters to cache, it requires DDR bandwidth.
- some model parameters can be stored in memories that communicate with the processor through different protocols (such as the first memory and the second memory in this application), where the processor reads the data from the second memory.
- the processor reads the data from the second memory.
- PCIE peripheral component interconnect express
- FIG. 5 a schematic flow chart of a model training method provided by an embodiment of the present application is shown. This method can be applied to the network architecture shown in Figures 2 to 4, or can be used in other network architectures. As shown in Figure 5, the method may specifically include:
- S401 The second node among the multiple nodes is used to send multiple parameter acquisition requests to the first node, and the multiple parameter acquisition requests are used to acquire parameters required by the second node when training the model.
- model parameters that need to be updated is large. Therefore, a large number of model parameters can be stored in multiple nodes. During model training, the node responsible for forward propagation needs to obtain it from multiple nodes. Model parameters required for the current batch.
- the model parameters can be the embedding vectors (embeddings) corresponding to the user or items.
- the recommendation model can determine which of the multiple items to recommend to the user based on the user's embedding vector and the embedding vectors of multiple items. one.
- the user's attribute information can be embedded through the embedding layer to obtain the user's embedding vector
- the item's attribute information can be embedded through the embedding layer to obtain the item's embedding vector
- the user's attribute information may be attributes related to the user's preference characteristics, including at least one of gender, age, occupation, income, hobbies, and educational level.
- the gender may be male or female, and the age may be 0-100.
- the number between, the occupation can be teachers, programmers, chefs, etc.
- the hobbies can be basketball, tennis, running, etc.
- the education level can be elementary school, junior high school, high school, university, etc.; this application does not limit the user's The specific type of attribute information.
- the items can be physical items or virtual items, such as applications (APPs), audio and video, web pages, news information, etc.
- the attribute information of the items can be item name, developer, installation package size, At least one of category and positive rating.
- the category of the item can be chatting, parkour games, office, etc., and the positive rating can be ratings, comments, etc. for the item. ; This application does not limit the specific type of attribute information of items.
- model parameters (such as the above embedding vectors of users and items, or other model parameters) can be stored in corresponding nodes.
- model parameters corresponding to the model to be trained can be deployed in the node cluster. Specifically, the proportion of data placed on each node can be adjusted according to the needs of the business and model.
- some parameters of the model can be deployed in the first node.
- some parameters of the model can be stored in the first memory and the second memory of the first node respectively.
- the first node may be the remote node introduced above.
- the first memory and the second memory communicate with the processor of the first node through different protocols.
- the first memory can communicate with the processor through the DDR protocol
- the second memory can communicate with the processor through the CXL protocol.
- Processor communication where when the processor reads and writes data from the first memory, it occupies the DDR bandwidth, and when the processor reads and writes data from the second memory, it occupies the PCIE bandwidth.
- the second node can determine the model parameters required for the current model training iteration round and store them in the first node. Specifically, the processor of the second node determines the model parameters required for the current model training iteration round.
- the second node can execute step S401. S401 is specifically the second The node sends multiple parameter acquisition requests to the first node, and the multiple parameter acquisition requests are used to acquire parameters required by the second node when training the model.
- the second node can send multiple sample acquisition requests to the transceiver module (such as a network card) of the first node, and the transceiver module of the first node can notify the processor of the contents in the multiple sample acquisition requests.
- the transceiver module such as a network card
- the first node determines according to the multiple parameter acquisition requests that the parameters acquired by the multiple parameter acquisition requests are stored in the first memory and the second memory respectively, merges the parameters acquired from the first memory and the second memory, and merges the parameters. The following parameters are sent to the second node.
- step S402 may specifically include: step S4021, step S4022, step S4023 and step S4024.
- Step S4021 specifically is: the first node determines the parameters obtained by the multiple parameter acquisition requests according to the multiple parameter acquisition requests. stored respectively in the A memory and a second memory
- step S4022 is specifically: obtaining parameters from the first memory and the second memory
- step S4023 is specifically: obtaining parameters from the first memory and the second memory and merging them
- step S4024 is specifically: merging the parameters are sent to the second node.
- the multiple parameter acquisition requests may include virtual addresses of data that the processor needs to acquire (that is, some parameters of the model).
- the first node may maintain a corresponding storage page table.
- the storage page table may include a mapping relationship between the virtual address of the data and the physical address.
- the physical address is the address of the storage space where the data is stored.
- the multiple parameter acquisition requests may include identifiers indicating the data (that is, some parameters of the model).
- the first node may maintain a corresponding storage page table, and the storage page table may include identifiers of the data.
- the mapping relationship with the virtual address, the virtual address corresponds to the physical address of the data storage space.
- the processor of the first node can determine the virtual address corresponding to some parameters of the model indicated in the request.
- some of the multiple parameter acquisition requests are used to acquire parameters stored in the first memory of the first node (for convenience of description, the embodiment of the present application may use parameters stored in the first memory to be The parameters are called first parameters, and the first parameters may include one or more parameters of the model). Another part of the multiple parameter acquisition requests is used to obtain parameters stored in the second memory of the first node ( For convenience of description, in this embodiment of the present application, the parameters stored in the second memory may be called second parameters, and the second parameters may include one or more parameters of the model). Therefore, the processor can confirm that the parameters acquired by the multiple parameter acquisition requests are stored in the first memory and the second memory respectively according to the multiple parameter acquisition requests.
- step S4021 the first node determines according to the multiple parameter acquisition requests that the parameters acquired by the multiple parameter acquisition requests are stored in the first memory and the second memory respectively
- step S4022 from the first memory and the second memory.
- the second memory gets the parameters).
- the processor of the first node can read the first parameter from the first memory of the first node. Specifically, the processor of the first node can read the first parameter from the first memory based on the virtual address of the first parameter. The first parameter is read from a corresponding storage location in the first memory of a node.
- the processor of the first node when the processor of the first node reads the first parameter from the corresponding storage location in the first memory of the first node based on the virtual address of the first parameter, the virtual address corresponding to the first parameter The address can be used to determine the physical address of the first parameter in the first memory of the first node (this mapping process can be implemented through translation of the memory controller, which will not be described again here).
- the processor of the first node can read the second parameter from the second memory of the first node. Specifically, the processor of the first node can read the second parameter from the second memory based on the virtual address of the second parameter. The second parameter is read from a corresponding storage location in the second memory of a node.
- the virtual address corresponding to the second parameter can be used to determine the physical address of the second parameter in the second memory of the first node (this mapping process can be implemented through translation of the memory controller, which will not be described again here), and the virtual address corresponding to the second parameter can be used to Determine the physical address of the second parameter in the second memory of the first node.
- the first memory can communicate with the processor through the DDR protocol
- the second memory can communicate with the processor through the CXL protocol.
- the processor of the first node reads the first parameter, it can be a line based on DDR.
- the processor reads the second parameter, it can be based on CXL or CXL-like lines.
- CXL and DDR are two completely different sets of data lines.
- CXL is based on PCIe
- DDR is based on the CPU's memory controller. They are mutually exclusive. The bandwidth between them is not affected.
- the model parameters are stored in the extended memory of the first node.
- the DDR bandwidth is not occupied, thereby reducing the occupation of the DDR bandwidth.
- due to the memory Its own storage space is limited. In scenarios where the number of model parameters such as recommended models is huge, the number of nodes storing model parameters can be reduced by storing model parameters in extended memory.
- the processor may perform step S4023 (merging the parameters obtained from the first memory and the second memory).
- the merged parameters can be the parameters required by the model in a certain round of iteration process, or the processor can merge the obtained parameters after a certain interval of time, or the processor can obtain them every time. After reaching a certain number of parameters, the obtained parameters are merged.
- merging can be understood as writing the read parameters into one or more consecutive reading storage spaces in the cache.
- the cache may be a storage space provided by the first memory, a storage space provided by the second memory, or a storage space provided by both the first memory and the second memory.
- the first parameter can be one or more embeddings
- the second parameter can be one or more embeddings.
- the first parameter is multiple embeddings
- the multiple embeddings in the first parameter can be aggregated and written to In cache. Multiple embeddings in the first parameter occupy a continuous storage space in the cache.
- the second parameter contains multiple embeddings
- the multiple embeddings in the second parameter can be aggregated and written into the cache.
- the multiple embeddings in the second parameter occupy a continuous storage space in the cache.
- step S4023 may specifically include: the processor of the first node writes the first parameter into the second memory of the first node, and writes the second parameter into the first memory of the first node. in the memory; or, S4023 is specifically: the processor of the first node writes the first parameter into the first memory of the first node, and writes the second parameter into the second memory of the first node; or, S4023 specifically includes: the processor of the first node writes the first parameter into the first memory of the first node, and writes the second parameter into the second memory of the first node.
- the first memory is a main memory
- the second memory is an extended memory. Reading and writing data from the second memory is based on the CXL protocol, which restricts both reading and writing of data. CXL-related message headers, and the latency during reading and writing will be longer than that of DDR.
- Caching the parameters obtained from the first memory to the second memory can reduce the bandwidth occupation of the DDR in the writing process compared to caching the parameters obtained from the first memory to the first memory.
- Caching the parameters obtained from the second memory to the first memory can reduce the data transmission delay compared to caching the parameters obtained from the second memory to the second memory.
- the parameters obtained from the first memory and the parameters obtained from the second memory are combined and cached in the first memory, thereby reducing the data transmission delay.
- the parameters obtained from the first memory and the parameters obtained from the second memory are combined and cached in the second memory, which can reduce the bandwidth occupation of the DDR in the writing process.
- Step S4024 is specifically: the first node sends the combined parameters to the second node.
- Step S4024 may specifically include: Step S40241: The processor of the first node notifies the transceiver module of the first node to read the merged parameters from the cache of the first node. The notification can carry the address in the cache where the merged parameters are stored. Step S40242: The transceiver module of the first node reads the merged parameters from the cache of the first node. Step S40243: The transceiver module of the first node sends the combined parameters to the second node.
- the parameters obtained from the first memory and the second memory are combined and then sent.
- the transmitted report can be made The number of files is greatly reduced, thereby improving data transmission efficiency.
- the processor of the first node can trigger the transceiver module of the first node to obtain the first parameter and the second parameter, and send the first parameter and the second parameter to the second node, for example, to
- the transceiver module of the second node is optional.
- the transceiver module can be a network card.
- the transceiver module can send the first parameter and the second parameter to the second node through remote direct memory access (RDMA), CXL or other high-speed transmission protocols.
- RDMA remote direct memory access
- CXL CXL
- the size of a single model parameter is small, often only a few bytes or hundreds of bytes. If the transceiver module directly sends a message carrying a single model parameter to the receiving module of the second node, the transmission message header
- the data may be larger than the actual model parameters that need to be transmitted.
- a slightly larger model needs to pull millions of model parameters each time, so excessive packet transmission will bring huge requests to the network card. pressure.
- the message data can be sent to the second node, and the first parameter and the second parameter in the message data share the same message header (that is, combined and sent), thereby reducing the data size of the message. size, and the number of packets sent when data is pulled.
- Step S403 The second node receives the merged parameters and uses the merged parameters to train the model.
- step S403 may include: step S4031: the transceiver module of the second node receives the first parameter and Second parameter, step S4032: transfer the first parameter and the second parameter to the training module.
- step S4032 may include step S40321 and step S40322.
- Step S40321 may be: the transceiver module of the second node writes the first parameter and the second parameter into the cache of the second node.
- Step S40322 may be: second The processor of the node may read the first parameter and the second parameter from the cache, and transfer the first parameter and the second parameter to the training module of the second node.
- the training module can perform step S404: the training module can perform the training process of the model to be trained according to the first parameter and the second parameter, and obtain the update gradient corresponding to the first parameter. And the update gradient corresponding to the second parameter, or directly obtain the updated first parameter and the updated second parameter.
- the second node after obtaining the first parameter and the second parameter, can pass the first parameter and the second parameter to the training module.
- the training module can obtain according to the first parameter and the second parameter.
- the processor of the second node can write the update gradient corresponding to the first parameter and the update gradient corresponding to the second parameter, or the updated first parameter and the updated second parameter to the location.
- the processor of the second node notifies the transceiver module of the second node to read the update gradient corresponding to the first parameter and the update gradient corresponding to the second parameter, or the updated first parameter and the update, from the cache of the memory. the second parameter after.
- the transceiver module of the second node may perform step S405: combine the gradient corresponding to the first parameter and the gradient corresponding to the second parameter and send them to the first node.
- the correspondence between the gradient or updated data and the corresponding virtual address or identifier can be carried in the message data. relation.
- gradients or updated data for the same node can be placed in the same message data and sent by the transceiver module of the second node to the transceiver module of the first node, thereby reducing the message size.
- step S406 may also be included: after receiving the gradient, the first node distinguishes the gradient corresponding to the parameter stored in the first memory and the gradient corresponding to the parameter stored in the second memory, and updates the gradient in the first memory respectively. parameters and parameters in the second memory.
- the update here may include updating parameters based on gradients, and writing the updated parameters to the corresponding storage location in the memory. That is to say, S406 may include step S4061 and step S4062.
- step S4061 is: the processor of the first node (for example, it can be an optimizer included in the processor) can update the first parameter according to the update gradient corresponding to the first parameter to obtain the updated first parameter,
- the processor of the first node may update the second parameter according to the update gradient corresponding to the second parameter to obtain the updated second parameter.
- Step S4062 is specifically: the processor of the first node can write the updated first parameter into the storage location where the first parameter is located.
- the processor of the first node may write the updated second parameter to the storage location where the second parameter is located.
- Figure 7 shows the model parameter pulling (pull) process of the model training process
- Figure 8 shows the training (training) process and data push (push) process of the model training process.
- the model parameters are embedding vectors (embedding) as an example for illustration:
- model parameter deployment phase embedding data and other data (other value) can be deployed on the memory DRAM and extended memory (Memory Expander) of different physical nodes. These data are placed in the virtual address applied for by the server process and can be deployed according to the business Adjust the proportion of data placement according to the needs of the model.
- the deployment diagram of model parameters can be shown in Figure 6.
- the training module on node 1 serves as the training initiator.
- the model parameters required for the current model training round (data in embedding table 1 and data in embedding table 2)
- On node 1 perform a local memory copy (copy can be understood as reading and writing).
- the copy here is divided into two forms: a copy from the memory to the cache in the memory, and a copy from the extended memory to the memory.
- the former uses DDR bandwidth when reading and writing
- the latter uses PCIe bandwidth when reading data
- DDR bandwidth when writing uses DDR bandwidth when writing.
- the model parameters required for the current model training round are on node 2.
- the data in embedding table 4 are aggregated and copied to
- the data in the extended memory is aggregated and copied to the memory, and each pair forms a continuous buffer.
- the purpose of staggering the data here is to balance the copy delay and bandwidth pressure between PCIe and DDR, so that the overall delay bandwidth can be optimized.
- Node 2 sends the data in the buffer to the memory of node 1 through the network card, etc.
- Node 1 can pass the shared memory buffer Send the data to the training module and start the training process.
- Figure 8 shows a data push process.
- the training module sends data (such as update gradient) to the memory.
- the data needs to be transferred to the destination. If the destination is at node 1, the data is placed together with other value. Enter the optimizer for calculation.
- the embedding table and other value can exist in the same hardware or not. The bandwidth occupation of DDR and PCIe can be balanced according to the actual situation.
- the aggregated data of node 1 is sent to node 2 through the network card.
- node 2 After node 2 receives the model parameters, it places the model parameters in the memory or extended memory according to the address requirements. Put the model parameters into the optimizer for calculation. After the calculation is completed, they are scattered into the embedding table to complete the process.
- the model training method provided by the present application is described in detail above with reference to FIGS. 1 to 8 .
- the model training device provided by the present application will be described with reference to FIG. 9 .
- Figure 9 is a schematic structural diagram of a model training device provided by this application. As shown in Figure 9, the device is applied to the first node, and the device 900 includes a processor 901 and a transceiver module 902;
- the processor 901 is configured to receive multiple parameter acquisition requests, and the multiple parameter acquisition requests are used to acquire parameters required for the second node to train the model;
- the transceiver module 902 is configured to send the combined parameters to the second node.
- the processor 901 when used to merge parameters obtained from the first memory and the second memory, is specifically used to:
- the parameters obtained from the first memory and the parameters obtained from the second memory are cached in the first memory for merging.
- the processor 901 when used to merge parameters obtained from the first memory and the second memory, is specifically used to:
- the parameters obtained from the first memory and the parameters obtained from the second memory are cached in the second memory for merging.
- the processor 901 when used to merge parameters obtained from the first memory and the second memory, is specifically used to:
- the parameters obtained from the first memory are cached in the second memory, and the parameters obtained from the second memory are cached in the first memory.
- the first memory is connected to the processor through the DDR protocol
- the second memory is an extended memory
- the second memory is connected to the processor through the open interconnect protocol CXL.
- the model is a recommendation model; the parameters are embedding vector embeddings.
- the model training device 900 shown in Figure 9 corresponds to the first node in the model training method in the embodiment shown in Figure 5. Therefore, the functions and technical effects of the model training device 900 can be found in the implementation shown in Figure 5 The relevant descriptions in the example will not be repeated here.
- embodiments of the present application also provide a computer-readable storage medium.
- the computer-readable storage medium stores instructions. When run on a computing node, the computing node causes the computing node to execute the model training method in the above embodiment. .
- embodiments of the present application also provide a computer program product.
- the computer program product When the computer program product is executed by a model training device, one or more model training devices execute any of the foregoing model training methods.
- the computer program product can be a software installation package. If it is necessary to use any of the foregoing model training methods, the computer program product can be downloaded and executed on the computer.
- the device embodiments described above are only illustrative.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units. , that is, it can be located in one place, or it can be distributed to multiple network units.
- Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
- the present application can be implemented by software plus necessary general hardware. Of course, it can also be implemented by dedicated hardware including dedicated integrated circuits, dedicated CPUs, dedicated memories, Special components, etc. to achieve. In general, all functions performed by computer programs can be easily implemented with corresponding hardware. Moreover, the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits or special-purpose circuits. circuit etc. However, for this application, software program implementation is a better implementation in most cases. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product in essence or that contributes to the existing technology.
- the computer software product is stored in a readable storage medium, such as a computer floppy disk. , U disk, mobile hard disk, ROM, RAM, magnetic disk or optical disk, etc., including several instructions to cause a computer device (which can be a personal computer, training device, or network device, etc.) to execute the methods of various embodiments of the present application.
- a computer device which can be a personal computer, training device, or network device, etc.
- the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination.
- the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
- a computer program product includes one or more computer instructions. When computer program instructions are loaded or executed on a computer, processes or functions according to embodiments of the present application are generated in whole or in part.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- Computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, e.g., computer instructions may be transmitted from a website, computer, server or data center via a wired link (e.g.
- Coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means to transmit to another website site, computer, server or data center.
- Computer-readable storage media can be any available media that can be accessed by the computer, or data storage devices such as servers and data centers that contain one or more sets of available media.
- Usable media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
- the semiconductor medium may be a solid state drive (SSD).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Neurology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (19)
- 一种模型训练系统,其特征在于,所述模型训练系统包括多个节点,所述模型训练需要的参数分布在所述多个节点中;所述多个节点中的第一节点包括第一内存和第二内存,所述第一内存中和第二内存中分别存储所述模型的部分参数,所述第一内存和所述第二内存通过不同的协议与所述第一节点中的处理器进行通信;所述多个节点中的第二节点用于发送多个参数获取请求至第一节点,所述多个参数获取请求用于获取所述第二节点训练所述模型时需要的参数;所述第一节点用于根据所述多个参数获取请求确定所述多个参数获取请求所获取的参数分别存储在所述第一内存及所述第二内存,将从所述第一内存及所述第二内存获取的参数进行合并,并将合并后的参数发送至所述第二节点;所述第二节点还用于接收所述合并后的参数,并使用所述合并后的参数对所述模型进行训练。
- 根据权利要求1所述的系统,其特征在于,所述第一节点在用于对所述第一内存及所述第二内存获取的参数进行合并时具体用于:将从所述第一内存中获取的参数及从所述第二内存中获取的参数都缓存至所述第一内存中进行合并。
- 根据权利要求1所述的系统,其特征在于,所述第一节点在用于对所述第一内存及所述第二内存获取的参数进行合并时具体用于:将从所述第一内存中获取的参数及从所述第二内存中获取的参数都缓存至所述第二内存中进行合并。
- 根据权利要求1所述的系统,其特征在于,所述第一节点在用于对所述第一内存及所述第二内存获取的参数进行合并时具体用于:将从所述第一内存中获取的参数缓存至所述第二内存,将从所述第二内存中获取的参数缓存至所述第一内存。
- 根据权利要求1至4任意一项所述的系统,其特征在于,所述第二节点还用于接收到所述合并后的参数后,将所述合并后的参数与从其他节点接收的参数进行排序后,再使用排序后的参数对所述模型进行训练。
- 根据权利要求1至5任意一项所述的系统,所述第二节点还用于:对所述模型训练完后,获取所述模型训练后产生的梯度数据,并根据所述梯度数据确定所述第一节点的参数对应的梯度;将所述第一节点的参数对应的梯度合并后发送至所述第一节点;所述第一节点用于:在接收到所述梯度后,区分所述第一内存中存储的参数对应的梯度及所述第二内存中存储的参数对应的梯度,并分别更新所述第一内存中的参数及所述第二内存中的参数。
- 根据权利要求1至6任意一项所述的系统,其特征在于,所述第一内存通过DDR协议连接至所述处理器,所述第二内存为扩展内存,所述第二内存通过开放式互连协议CXL连接至所述处理器。
- 根据权利要求1至7任意一项所述的系统,其特征在于,所述模型为推荐模型;所述参数为嵌入向量embedding。
- 一种模型训练方法,其特征在于,应用于分布式训练系统,所述分布式训练系统包括多个节点,所述模型训练需要的参数分布在所述多个节点中;所述多个节点中的第一节点包括第一内存和第二内存,所述第一内存中和第二内存中分别存储所述模型的部分参数,所述第一内存和所述第二内存通过不同的协议与所述第一节点中的处理器进行通信;所述方法包括:所述多个节点中的第二节点发送多个参数获取请求至第一节点,所述多个参数获取请求用于获取所述第二节点训练所述模型时需要的参数;所述第一节点根据所述多个参数获取请求确定所述多个参数获取请求所获取的参数分别存储在所 述第一内存及所述第二内存,将从所述第一内存及所述第二内存获取的参数进行合并,并将合并后的参数发送至所述第二节点;所述第二节点接收所述合并后的参数,并使用所述合并后的参数对所述模型进行训练。
- 根据权利要求9所述的方法,其特征在于,所述对所述第一内存及所述第二内存获取的参数进行合并,包括:将从所述第一内存中获取的参数及从所述第二内存中获取的参数都缓存至所述第一内存中进行合并。
- 根据权利要求9所述的方法,其特征在于,所述对所述第一内存及所述第二内存获取的参数进行合并,包括:将从所述第一内存中获取的参数及从所述第二内存中获取的参数都缓存至所述第二内存中进行合并。
- 根据权利要求9所述的方法,其特征在于,所述对所述第一内存及所述第二内存获取的参数进行合并,包括:将从所述第一内存中获取的参数缓存至所述第二内存,将从所述第二内存中获取的参数缓存至所述第一内存。
- 根据权利要求9至12任意一项所述的方法,其特征在于,所述方法还包括:所述第二节点接收到所述合并后的参数后,将所述合并后的参数与从其他节点接收的参数进行排序后,再使用排序后的参数对所述模型进行训练。
- 根据权利要求9至13任意一项所述的方法,所述方法还包括:所述第二节点对所述模型训练完后,获取所述模型训练后产生的梯度数据,并根据所述梯度数据确定所述第一节点的参数对应的梯度;将所述第一节点的参数对应的梯度合并后发送至所述第一节点;所述第一节点在接收到所述梯度后,区分所述第一内存中存储的参数对应的梯度及所述第二内存中存储的参数对应的梯度,并分别更新所述第一内存中的参数及所述第二内存中的参数。
- 根据权利要求9至14任意一项所述的方法,其特征在于,所述第一内存通过DDR协议连接至所述处理器,所述第二内存为扩展内存,所述第二内存通过开放式互连协议CXL连接至所述处理器。
- 根据权利要求9至15任意一项所述的方法,其特征在于,所述模型为推荐模型;所述参数为嵌入向量embedding。
- 一种计算节点,其特征在于,所述计算节点包括处理器及存储器,所述存储器存储有程序指令,所述处理器运行所述程序指令以执行权利要求9至16任意一项所述的方法中所述处理器及所述训练模块所执行的步骤。
- 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,当其在计算节点上运行时,使得所述计算节点执行如权利要求9至16任一项所述的方法。
- 一种包含指令的计算机程序产品,当其在计算节点上运行时,使得所述计算节点执行如权利要求9至16任一项所述的方法。
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23842345.3A EP4542449A4 (en) | 2022-07-22 | 2023-07-19 | MODEL LEARNING SYSTEM AND METHOD AND ASSOCIATED DEVICE |
| US19/024,744 US20260037796A1 (en) | 2022-07-22 | 2025-01-16 | Model training system and method, and related device |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210869512.6 | 2022-07-22 | ||
| CN202210869512 | 2022-07-22 | ||
| CN202211634067.1A CN117436499A (zh) | 2022-07-22 | 2022-12-19 | 一种模型训练系统、方法及相关设备 |
| CN202211634067.1 | 2022-12-19 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US19/024,744 Continuation US20260037796A1 (en) | 2022-07-22 | 2025-01-16 | Model training system and method, and related device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024017283A1 true WO2024017283A1 (zh) | 2024-01-25 |
Family
ID=89548619
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/108091 Ceased WO2024017283A1 (zh) | 2022-07-22 | 2023-07-19 | 一种模型训练系统、方法及相关设备 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20260037796A1 (zh) |
| EP (1) | EP4542449A4 (zh) |
| CN (1) | CN117436499A (zh) |
| WO (1) | WO2024017283A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118175585A (zh) * | 2024-05-11 | 2024-06-11 | 中国电信股份有限公司 | 数据传输方法及相关设备 |
| CN118396073A (zh) * | 2024-06-28 | 2024-07-26 | 山东海量信息技术研究院 | 异构计算系统及其模型训练方法、设备、介质、程序产品 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN121764645A (zh) * | 2024-09-30 | 2026-03-31 | 华为技术有限公司 | 一种模型数据的处理方法及装置 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107330516A (zh) * | 2016-04-29 | 2017-11-07 | 腾讯科技(深圳)有限公司 | 模型参数训练方法、装置及系统 |
| CN111813869A (zh) * | 2020-08-21 | 2020-10-23 | 支付宝(杭州)信息技术有限公司 | 一种基于分布式数据的多任务模型训练方法及系统 |
| US20210083950A1 (en) * | 2019-09-13 | 2021-03-18 | Oracle International Corporation | Determining optimum software update transmission parameters |
| CN113033800A (zh) * | 2019-12-25 | 2021-06-25 | 香港理工大学深圳研究院 | 分布式深度学习方法、装置、参数服务器及主工作节点 |
| CN113283596A (zh) * | 2021-05-18 | 2021-08-20 | 北京达佳互联信息技术有限公司 | 一种模型参数训练方法、服务器、系统及存储介质 |
-
2022
- 2022-12-19 CN CN202211634067.1A patent/CN117436499A/zh active Pending
-
2023
- 2023-07-19 WO PCT/CN2023/108091 patent/WO2024017283A1/zh not_active Ceased
- 2023-07-19 EP EP23842345.3A patent/EP4542449A4/en active Pending
-
2025
- 2025-01-16 US US19/024,744 patent/US20260037796A1/en active Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107330516A (zh) * | 2016-04-29 | 2017-11-07 | 腾讯科技(深圳)有限公司 | 模型参数训练方法、装置及系统 |
| US20210083950A1 (en) * | 2019-09-13 | 2021-03-18 | Oracle International Corporation | Determining optimum software update transmission parameters |
| CN113033800A (zh) * | 2019-12-25 | 2021-06-25 | 香港理工大学深圳研究院 | 分布式深度学习方法、装置、参数服务器及主工作节点 |
| CN111813869A (zh) * | 2020-08-21 | 2020-10-23 | 支付宝(杭州)信息技术有限公司 | 一种基于分布式数据的多任务模型训练方法及系统 |
| CN113283596A (zh) * | 2021-05-18 | 2021-08-20 | 北京达佳互联信息技术有限公司 | 一种模型参数训练方法、服务器、系统及存储介质 |
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4542449A4 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118175585A (zh) * | 2024-05-11 | 2024-06-11 | 中国电信股份有限公司 | 数据传输方法及相关设备 |
| CN118396073A (zh) * | 2024-06-28 | 2024-07-26 | 山东海量信息技术研究院 | 异构计算系统及其模型训练方法、设备、介质、程序产品 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4542449A4 (en) | 2025-10-22 |
| CN117436499A (zh) | 2024-01-23 |
| EP4542449A1 (en) | 2025-04-23 |
| US20260037796A1 (en) | 2026-02-05 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20210256403A1 (en) | Recommendation method and apparatus | |
| WO2024017283A1 (zh) | 一种模型训练系统、方法及相关设备 | |
| CN109902708B (zh) | 一种推荐模型训练方法及相关装置 | |
| CN111143686B (zh) | 资源推荐方法及装置 | |
| CN110474820B (zh) | 流量回放方法、装置、电子设备 | |
| CN106250464B (zh) | 排序模型的训练方法及装置 | |
| WO2020093289A1 (zh) | 资源推荐方法、装置、电子设备及存储介质 | |
| CN107292326A (zh) | 一种模型的训练方法和装置 | |
| CN110008397A (zh) | 一种推荐模型训练方法及装置 | |
| WO2023051678A1 (zh) | 一种推荐方法及相关装置 | |
| CN115640470A (zh) | 一种推荐方法及电子设备 | |
| JP7803006B2 (ja) | オブジェクト認識モデルの更新方法およびその装置、電子機器、並びにコンピュータプログラム | |
| CN116204709A (zh) | 一种数据处理方法及相关装置 | |
| CN112653579A (zh) | 一种基于OpenResty的灰度发布方法及相关设备 | |
| CN116974898A (zh) | 一种数据处理方法、装置、设备以及计算机可读存储介质 | |
| CN114600125A (zh) | 相同组内跨异类子组的鲁棒模型性能 | |
| WO2022268089A1 (zh) | 一种数据处理方法、系统及相关设备 | |
| CN116610873A (zh) | 信息推荐方法及装置、存储介质 | |
| CN113761004B (zh) | 网络模型数据处理、数据展示方法、装置和存储介质 | |
| CN115687810A (zh) | 网页搜索方法、装置及相关设备 | |
| CN113934637A (zh) | 一种测试方法、装置、设备及存储介质 | |
| CN113900820B (zh) | 分类任务处理方法、装置、电子设备及存储介质 | |
| CN111813711B (zh) | 训练样本数据的读取方法和装置、存储介质及电子设备 | |
| CN114021739B (zh) | 业务处理、业务处理模型训练方法、装置及电子设备 | |
| CN115757976A (zh) | 基于产品订阅关键词的信息推送方法、系统、介质及设备 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23842345 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023842345 Country of ref document: EP |
|
| ENP | Entry into the national phase |
Ref document number: 2023842345 Country of ref document: EP Effective date: 20250117 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| WWP | Wipo information: published in national office |
Ref document number: 2023842345 Country of ref document: EP |