CN120596258B

CN120596258B - PDU load distribution system based on reinforcement learning

Info

Publication number: CN120596258B
Application number: CN202510689960.1A
Authority: CN
Inventors: 孙远东; 时亮
Original assignee: Anhui Weiyuan New Energy Technology Co ltd
Current assignee: Anhui Weiyuan New Energy Technology Co ltd
Priority date: 2025-05-27
Filing date: 2025-05-27
Publication date: 2025-11-04
Anticipated expiration: 2045-05-27
Also published as: CN120596258A

Abstract

The invention discloses a PDU load distribution system based on reinforcement learning, relating to the technical field of data center resource management. By combining a multidimensional execution cost model with a digital twin environment, the system schedules PDU load, server migration and server start-stop efficiently. Offline pre-training and hierarchical control balance energy consumption and performance across load scenarios, and online learning triggered by concept drift detection keeps the strategy optimized. Batch migration and staged start-stop further avoid the resource impact and performance jitter caused by large-scale operations, so the system can respond flexibly to multidimensional risks and keep scheduling robust and sustainable under heterogeneous loads.

Description

PDU load distribution system based on reinforcement learning

Technical Field

The invention relates to the technical field of data center resource management, in particular to a PDU load distribution system based on reinforcement learning.

Background

With the rapid concentration of energy-intensive workloads such as cloud computing, artificial intelligence inference and training, and big data analysis in hyperscale data centers, computing loads show transient peaks, non-stationary waveforms and increasingly diverse service types. The power consumption of a server cabinet often multiplies as second-level traffic surges, so the power supply side must coordinate PDU (Power Distribution Unit) phase balance, standby power switching and bus redundancy at the millisecond level, while hot and cold aisle temperature control, liquid cooling pump speed and UPS energy storage scheduling must be linked in real time. Traditional heuristic scheduling that uses CPU utilization or task queue depth as a single dimension struggles to suppress power peaks, improve energy utilization and meet the stability requirements of long-tail AI training loads at the same time. In recent years deep reinforcement learning (RL) has been applied to resource scheduling, but most studies use only server power consumption or electricity cost as the reward signal and neglect physical and operational limits in the PDU topology, such as phase line capacity differences, standby battery depth of discharge, security isolation of critical services and multi-copy fault tolerance overhead. As a result, policies converge slowly when deployed in real scenarios and easily breach power risk thresholds, making it difficult to meet the zero-fault, zero-data-loss reliability requirements of financial-grade and government-grade hosting environments.

A search shows that Chinese patent application publication No. CN109324875A discloses a data center server power consumption management and optimization method based on reinforcement learning. The method uses reinforcement learning to address power consumption management and optimization in the data center, making decisions sequentially by continuously observing the load arrival, load distribution and power usage information of the data center as a stochastic system. That is, at each moment an action is selected from the set of available actions based on the observed state; the decision maker then makes a new decision based on the newly observed state, and so on. That invention can optimize the load distribution strategy of the data center directly online without any prior knowledge, thereby reducing the overall operating power consumption of the data center.

However, existing scheduling frameworks cannot simultaneously quantify the composite cost of the instantaneous impact of virtual machine migration or server start-stop on the power supply link and of the safety redundancy of critical services, so the RL agent lacks an effective mechanism for autonomous judgment when facing power supply fluctuations, load pattern drift or a raised security level. For example, when several GPU training tasks in one cabinet are migrated in batches to adjacent cabinets at night to save cooling energy, the algorithm does not consider the instantaneous margin and phase imbalance of the target PDU circuit, which often causes phase overload or current surges; and if a financial core service virtual machine is migrated at that moment to a node that has only a single backup and is in a maintenance window, the fault tolerance level drops abruptly. Once a PDU overload trips a breaker or dual-host hot standby fails, the result is AI task interruption, transaction system SLA violations and corruption of origin-server log data, which in turn lead to cascading compensation claims and loss of brand reputation.

Therefore, there is a need for an intelligent scheduling method that embeds power topology limits, security isolation levels and fault-tolerant redundancy overhead in a multidimensional execution cost model at the same time, and that can retrain adaptively when load concept drift occurs, so as to jointly optimize the energy consumption, performance and security reliability of the data center.

Disclosure of Invention

(I) Technical problems to be solved

To overcome the defects of the prior art, a PDU load distribution system based on reinforcement learning is provided. By combining a multidimensional execution cost model with a digital twin environment, it schedules PDU load, server migration and server start-stop efficiently; offline pre-training and hierarchical control balance energy consumption and performance across load scenarios, and online learning triggered by concept drift detection provides continuous optimization. In addition, once safety cost and fault tolerance cost are superimposed, critical services and sensitive data are protected preferentially, which effectively reduces security risk and downtime loss and ultimately achieves efficient, secure and scalable data center management with multidimensional coordination. Batch migration and staged start-stop further avoid the resource impact and performance jitter caused by large-scale operations, so the system can respond flexibly to multidimensional risks, which solves the technical problems described in the background.

(II) Technical solution

A PDU load distribution system based on reinforcement learning, wherein: when load fluctuation of the data center is detected, a real-time monitoring component is called to acquire raw monitoring data, the energy consumption and delay of virtual machine migration and server start-stop are quantified in depth through a multidimensional execution cost model, and a dynamic cost sequence is output;

after the dynamic cost sequence and the raw monitoring data are received, multiple rounds of offline reinforcement learning training are performed in a digital twin environment on the PDU power distribution topology, the server cluster and typical load scenarios, and a pre-training strategy for hierarchical scheduling is screened out;

after the pre-training strategy is successfully loaded, the upper reinforcement learning controller generates a macro migration and start-stop instruction set, and the lower flexible scheduling module executes it in batches according to the multidimensional execution cost and the safety limits and records the specific scheduling feedback in a system log;

if the concept drift metric exceeds a set threshold or the strategy gain drops sharply, small-scale online learning is performed based on the digital twin environment and the latest scheduling feedback, and a resource allocation scheme is deployed in advance using a load prediction model;

safety cost and fault tolerance cost are injected into the original multidimensional execution cost model, reinforcement learning strategies are screened by priority through safety-related penalties or rewards in offline pre-training, and redundancy, encryption or batch migration is applied to critical servers or sensitive services in the hierarchical scheduling and online adaptation stages.

Preferably, the multidimensional execution cost model sets independent dimensions for virtual machine migration, server start-stop and PDU switching respectively, and weights the time delay, energy consumption overhead and business performance influence of each dimension to dynamically measure the comprehensive cost of each scheduling action in the current state and write the comprehensive cost into the monitoring database.

Preferably, the deployed monitoring component collects the server utilization rate, PDU load, power supply starting duration and network bandwidth in a fixed time window, maps the collection result to the cost model in real time to dynamically update the cost data, and generates an available margin threshold value according to the bandwidth safety coefficient and the temperature safety coefficient to serve as a subsequent batch scheduling judgment basis.

Preferably, the constructed digital twin environment fully maps the PDU distribution topology, server hardware configuration, network bandwidth limits and typical traffic load curves of the real data center, and uses a performance incentive coefficient and a cost penalty coefficient in offline simulation to guide the reinforcement learning agent in generating a pre-training strategy.

Preferably, multiple rounds of reinforcement learning simulation training are performed on multiple load scenarios in the digital twin environment; during offline training the cost sensitivity is adjusted to amplify the negative return of high-cost actions, the difference between the new and old strategies is regularized through a strategy stability coefficient, and a pre-training strategy that balances energy efficiency and performance is output together with a performance evaluation result.

Preferably, after the upper reinforcement learning controller calls the pre-training strategy to generate an instruction set, instructions at cost peaks are screened out according to real-time cost and priority labels, and the remaining instructions are executed in batches by the lower module according to the scheduling amplification factor and the flexibility control parameter; before the instruction set is generated, instructions whose cost exceeds a threshold are screened out based on the multidimensional execution cost model, and execution priorities are assigned to the remaining instructions.

Preferably, the lower layer scheduling module performs batch execution on the instructions in the instruction set according to the bandwidth allowance and the server temperature allowance, and returns scheduling feedback including actual execution duration, power consumption change and performance jitter to the upper layer controller together with the updated cost vector after each batch is completed for use in an online self-adaption stage.
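The batching rule is not specified further in this text. A minimal sketch, assuming each instruction carries an estimated bandwidth demand and temperature rise and that a batch is closed as soon as the next instruction would exceed either margin, might look like this. The `Instruction` fields, the function name `plan_batches` and the greedy rule are hypothetical choices for illustration.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    name: str
    bandwidth_mbps: float   # assumed bandwidth demand of the action
    temp_rise_c: float      # assumed server temperature rise caused by the action

def plan_batches(instructions, bandwidth_margin, temp_margin):
    """Greedily pack instructions, in order, into batches that fit both margins."""
    batches, current, bw, temp = [], [], 0.0, 0.0
    for ins in instructions:
        if ins.bandwidth_mbps > bandwidth_margin or ins.temp_rise_c > temp_margin:
            raise ValueError(f"{ins.name} exceeds a margin on its own")
        over_bw = bw + ins.bandwidth_mbps > bandwidth_margin
        over_temp = temp + ins.temp_rise_c > temp_margin
        if current and (over_bw or over_temp):
            # Close the running batch before either margin would be exceeded.
            batches.append(current)
            current, bw, temp = [], 0.0, 0.0
        current.append(ins)
        bw += ins.bandwidth_mbps
        temp += ins.temp_rise_c
    if current:
        batches.append(current)
    return batches
```

Keeping the original instruction order preserves any priority ranking assigned upstream; a real scheduler would also re-read the margins between batches, since they change as each batch completes.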

Preferably, the environmental drift degree is calculated by monitoring the difference between the actual system state and the expected state;

When the drift degree exceeds a preset threshold, a drift alarm is generated, the trigger timestamp and the main drift index are recorded, and subsequent online training or transfer learning is triggered.
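The text does not define how the drift degree is computed. As an illustrative sketch, assuming it is the mean relative deviation between actual and expected values over a set of monitored metrics, drift detection could be written as follows. The metric names, the function name `detect_drift` and the averaging rule are assumptions.

```python
import time

def detect_drift(actual, expected, threshold, now=None):
    """Return a drift alarm dict, or None when the drift degree is within the threshold."""
    # Relative deviation per metric; the small constant guards against a zero expectation.
    deviations = {
        k: abs(actual[k] - expected[k]) / (abs(expected[k]) + 1e-9) for k in expected
    }
    degree = sum(deviations.values()) / len(deviations)
    if degree <= threshold:
        return None
    return {
        "timestamp": time.time() if now is None else now,   # trigger timestamp
        "drift_degree": degree,
        "main_index": max(deviations, key=deviations.get),  # metric that drifted most
    }
```

Returning the dominant metric alongside the timestamp mirrors the two items the alarm is said to record.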

Preferably, the online training process selects several of the most recent scheduling periods to form a micro training dataset, updates the old strategy incrementally at a set learning rate to obtain a new strategy, and redeploys the new strategy after a rapid regression test has been completed in the digital twin environment.

The micro training dataset includes current state, actions generated by the upper reinforcement learning controller, multi-dimensional execution costs, and actual benefits.
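The source does not say how the strategy is represented or updated. As a rough sketch, assuming a tabular value estimate per (state, action) pair and treating the net benefit as actual benefit minus execution cost, an incremental update over the micro training dataset could look like this. The table representation and the names `Sample` and `incremental_update` are assumptions for illustration only, not the patented update rule.

```python
from collections import namedtuple

# One record of the micro training dataset (fields follow the description above).
Sample = namedtuple("Sample", ["state", "action", "cost", "benefit"])

def incremental_update(old_values, samples, learning_rate=0.1):
    """Return updated value estimates; the old table is left untouched for rollback."""
    new_values = dict(old_values)
    for s in samples:
        key = (s.state, s.action)
        target = s.benefit - s.cost            # assumed net benefit of the action
        current = new_values.get(key, 0.0)
        # Move the estimate a fraction of the way toward the observed target.
        new_values[key] = current + learning_rate * (target - current)
    return new_values
```

Because the old table is copied rather than modified, the previous strategy stays available if the regression test in the digital twin environment fails.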

Preferably, the historical load sequence and recent monitoring data are retrieved and a long short-term memory (LSTM) network is used to predict the load for several future time periods; when the mean absolute percentage error (MAPE) of the prediction exceeds a set threshold, it is fed back to the time-series prediction model to update its parameters iteratively, and available computing and power resources are reserved in advance accordingly.
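The error check in this step can be illustrated with the standard definition of mean absolute percentage error. The sketch below covers only the error computation and the threshold test; the prediction model itself is out of scope, and the function names and the threshold value are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def needs_model_update(actual, predicted, threshold_pct):
    """True when the forecast error is large enough to trigger a parameter update."""
    return mape(actual, predicted) > threshold_pct
```

For example, actual loads of 100 and 200 against forecasts of 110 and 180 give an error of 10 percent in each period, so an 8 percent threshold would trigger an update while a 12 percent threshold would not.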

Preferably, the safety cost parameter and the fault-tolerant overhead parameter are newly added in the original multidimensional execution cost model to form a comprehensive cost model, wherein the priority of key service migration, redundant copy synchronization and sensitive node power management is controlled through the safety weight coefficient and the fault-tolerant weight coefficient.

Preferably, a positive safety reward and a negative fault-tolerance penalty are added to the reward function during offline simulation, guiding the scheduling strategy to give priority to the security isolation of sensitive services and the high-availability redundancy of key nodes; if the host on which a sensitive service resides has not completed encryption isolation or copy backup, the shutdown operation is postponed and a log entry is recorded in the execution feedback.

Preferably, after introducing additional offset related to the security alarm and the redundant state of the key node, during online self-adaption, dynamically adjusting the security and fault-tolerant weight coefficient according to the frequency of the security alarm and the number of times of fault triggering so as to maintain a preset balance target among energy consumption, performance and reliability, and feeding back the actual execution result to the digital twin environment again.
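How the safety and fault-tolerance terms are combined with the base cost, and how their weights are adjusted, is not given in this text. A minimal sketch, assuming an additive combination and a multiplicative weight increase driven by recent alarm and fault counts, is shown below. The function names, the step size and the weight cap are hypothetical.

```python
def comprehensive_cost(base_cost, safety_cost, fault_cost, w_safe, w_fault):
    """Hypothetical C'(a_i, s_t): base cost plus weighted safety and fault-tolerance terms."""
    return base_cost + w_safe * safety_cost + w_fault * fault_cost

def adjust_weights(w_safe, w_fault, alarms, faults, step=0.1, w_max=5.0):
    """Raise each weight in proportion to recent alarm or fault counts, capped at w_max."""
    new_safe = min(w_max, w_safe * (1 + step * alarms))
    new_fault = min(w_max, w_fault * (1 + step * faults))
    return new_safe, new_fault
```

The cap keeps a burst of alarms from letting the safety term dominate permanently, which is one way to preserve the balance among energy consumption, performance and reliability that the paragraph describes.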

(III) Beneficial effects

The invention provides a PDU load distribution system based on reinforcement learning, which has the following beneficial effects:

In this scheme, steps one to five are linked in a closed loop, and the deep interaction between the multidimensional cost function C(a_i, s_t) and the digital twin environment Ω enables overall management of PDU load distribution, server start-stop and safety fault tolerance, achieving the following beneficial effects:

In step two, the multidimensional execution cost is embedded in the offline training process, and the digital twin environment Ω is used to simulate various load scenarios and iterate a pre-training strategy over many rounds, ensuring that the reinforcement learning agent learns to avoid high-cost actions on its own before actual deployment;

In step three, a hierarchical control architecture is adopted: the upper reinforcement learning controller π_top generates macroscopic load instructions based on the pre-training strategy, and the lower module performs flexible scheduling and batch execution, avoiding the impact on the system of a one-off large-scale migration or frequent start-stop. The hierarchical mechanism not only balances energy consumption and performance but also continuously collects scheduling feedback during execution, providing high-precision data for the online adaptation of step four.

In step four, when the load is non-stationary, environmental changes are captured through the concept drift metric, and the strategy is adjusted quickly through small-scale online training in the digital twin environment Ω, so that the system responds quickly and stably and remains efficient in long-term operation.

In step five, a safety cost function and a fault-tolerant overhead function are added and combined with the original multidimensional cost function C(a_i, s_t) to obtain the comprehensive cost function C'(a_i, s_t), so that safety and fault-tolerance rewards or limits can be applied in offline training (step two) and hierarchical scheduling (step three), and their weights can be raised dynamically as needed during online adaptation (step four), giving fine-grained control over the redundancy of critical servers and the isolation of sensitive services.

The unified state s_t, action a_i and cost function C'(a_i, s_t) are coupled with one another: early offline simulation lays the policy foundation for later hierarchical scheduling, hierarchical execution feedback in turn feeds online adaptation, and the added safety and fault-tolerance factors make the whole system more resilient and widen the range of applicable policies. The system can therefore keep improving energy consumption optimization while responding flexibly to multidimensional risks (performance jitter, environmental drift and security threats), achieving comprehensive management and control.

Drawings

Fig. 1 is a schematic diagram of a PDU load distribution system based on reinforcement learning according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, the present invention provides a PDU load distribution system based on reinforcement learning, comprising:

Step one: when load fluctuation of the data center is detected, a real-time monitoring component is called to collect indexes such as server utilization, PDU load, power supply start-up time and network bandwidth in depth; the action cost is calculated nonlinearly according to the energy consumption, delay and performance jitter weights in the multidimensional cost function C(a_i, s_t), and the comprehensive cost data are updated dynamically, so that virtual machine migration or server start-stop is evaluated and quantified in real time, and the generated time-series cost information is provided as key input to the subsequent offline pre-training and online adaptation;

The first step comprises the following steps:

Step 101, establishing a multidimensional execution cost model

Step 101 is directed to various execution actions (including virtual machine migration, server start-stop and PDU switching) in the data center, and defines a unified multidimensional execution cost function based on historical operation and maintenance data and a predictive analysis result.

Let a_i denote the i-th type of action that can be executed in the data center environment, such as migrating a virtual machine to a target host or enabling a dormant server, and let s_t denote the current system state, such as a comprehensive description of server utilization, PDU remaining power margin and network bandwidth load.

Let E(a_i, s_t) denote the additional energy consumption expected from executing action a_i in state s_t, let M(a_i, s_t) denote the migration overhead or degree of service jitter that the action may cause, and let L(a_i, s_t) denote the operation delay caused by the action (the time from issuing the instruction until it takes effect). Based on these three key cost factors, a multidimensional cost function C(a_i, s_t) can be constructed under dimensionless conditions, specifically as follows:

where ζ₁, ζ₂ and ζ₃ are positive weighting coefficients, each between 0 and 1 and summing to 1, which balance the relative importance of energy consumption, performance jitter and delay;

φ₁, φ₂ and φ₃ are action sensitivity coefficients, each greater than 0, which perform dimension matching according to the average cost magnitude of the different action categories and control how strongly each cost factor is amplified or attenuated in the exponential or logarithmic transformation;

The higher the value of the multidimensional cost C(a_i, s_t), the greater the overall execution cost of the action in the current state. Splitting the cost into the three key dimensions of energy consumption, performance and delay makes high-risk or high-overhead actions easier to identify and characterizes action cost accurately. The parameter system is also general and portable: because the weighting coefficients ζ_i and the sensitivity coefficients φ_i are adjustable, one function structure can be adapted to data centers of different scales and service requirements and reused across scenarios.
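The formula itself is not reproduced in this text. As a minimal sketch, assuming a logarithmic transformation of each factor (one of the two transformation types the description mentions), the cost could be computed as follows. The function name `multidim_cost`, the `log1p` form and the default coefficient values are illustrative assumptions, not the patented formula.

```python
import math

def multidim_cost(e, m, l, zeta=(0.4, 0.3, 0.3), phi=(1.0, 1.0, 1.0)):
    """Hypothetical C(a_i, s_t) from dimensionless energy e, jitter m and delay l."""
    if any(not 0.0 <= z <= 1.0 for z in zeta) or abs(sum(zeta) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    if any(p <= 0 for p in phi):
        raise ValueError("sensitivity coefficients must be positive")
    # Assumed form: weighted sum of log-compressed, sensitivity-scaled factors.
    return sum(z * math.log1p(p * x) for z, p, x in zip(zeta, phi, (e, m, l)))
```

With this form the cost is zero when all three factors are zero and grows monotonically with each factor, while the logarithm keeps a single very large factor from swamping the others.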

Step 102, performing real-time monitoring and dynamic updating

After the multidimensional cost function C(a_i, s_t) is established, step 102 deploys a real-time monitoring component in the data center to dynamically collect and update the key parameters defined in step 101;

A monitoring probe is deployed to collect hardware-layer data such as server CPU utilization, network I/O utilization and power supply start-up time;

A load analysis engine is deployed to track the real-time performance state and service delay of each virtual machine and convert them into the statistics required for the migration overhead M(a_i, s_t) and the operation delay L(a_i, s_t); the output of the power monitoring module is integrated to calculate and summarize the estimated additional energy consumption E(a_i, s_t) generated when the action is executed.

The incremental energy consumption, performance jitter and delay information obtained through monitoring is continuously fed into the multidimensional cost function C(a_i, s_t) of step 101 to obtain the latest execution cost value. The latest value of C(a_i, s_t) is stored together with the current system state s_t to form a traceable, time-evolving cost time-series dataset that subsequent offline pre-training and online adaptive learning can call. Real-time updating describes the execution cost of each action dynamically under different load conditions and avoids decision bias caused by historical or static configuration. When the scale or hardware types of the data center expand, new cost factors can be incorporated into the C(a_i, s_t) system by adding or upgrading monitoring components, which gives the system good flexibility and maintainability.
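The storage layout of the cost time-series dataset is not described. One simple way to picture it, assuming an in-memory append-only log keyed by timestamp, is sketched below. The class name `CostLog` and its record fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CostLog:
    """Append-only time series of (timestamp, state, action, cost) records."""
    records: list = field(default_factory=list)

    def append(self, timestamp, state, action, cost):
        if self.records and timestamp < self.records[-1]["timestamp"]:
            raise ValueError("records must be appended in time order")
        # Copy the state so later changes to the caller's dict do not alter history.
        self.records.append(
            {"timestamp": timestamp, "state": dict(state), "action": action, "cost": cost}
        )

    def window(self, start, end):
        """Records with start <= timestamp < end, e.g. for one training batch."""
        return [r for r in self.records if start <= r["timestamp"] < end]
```

Storing the state snapshot next to each cost value is what makes the series traceable: a later training run can replay exactly what the system looked like when the cost was measured.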

Step two: once the execution cost data and real-time monitoring records have been obtained, the PDU distribution topology, server cluster and typical load patterns are loaded into the digital twin environment Ω, and a reinforcement learning agent is trained offline through multiple rounds of simulated interaction; a pre-training strategy is screened out through a nonlinear objective function that balances energy consumption, performance and the overhead given by the multidimensional cost function C(a_i, s_t), and the strategy performance is evaluated and fixed, providing accurate and feasible macroscopic decision templates for the subsequent hierarchical scheduling;

The second step comprises the following steps:

Step 201, building a digital twin environment and loading an execution cost model

Based on the system state s_t, the action a_i and the corresponding multidimensional cost function C(a_i, s_t) output in step one, a digital twin environment Ω that is highly consistent with the real data center is built on a virtual simulation platform;

The complete PDU distribution topology, the computing nodes (server cluster) and the network link attributes are defined in the digital twin environment Ω, and corresponding load injection modules are set up to simulate the workloads of various typical service scenarios (such as daytime peak load, nighttime low load and highly fluctuating AI training load);

The real-time monitoring data of step one are combined with historical load patterns to generate the initial state set of the digital twin environment Ω, in which each state s_t contains server utilization, PDU power supply margin, network occupancy and so on;

Using an action set comprising the different actions a_i, the effect of executing each action is strictly reproduced in the digital twin environment Ω, and the multidimensional cost function C(a_i, s_t) is called to evaluate the comprehensive cost, such as energy consumption, delay and performance jitter, that each action generates in the simulation environment;

Reproducing the PDU topology, server cluster and load characteristics in the digital twin environment Ω brings the training and verification stage as close as possible to the production environment and reduces the risk of trial and error on the real system. Because the cost of each action is estimated directly with the multidimensional cost function C(a_i, s_t) inside Ω, offline simulation reflects the operating cost of later deployment and provides a reliable basis for strategy training. High-cost actions can be identified in the digital twin environment, so frequent migration or low-gain switching actions are effectively suppressed during simulation training.

Step 202, offline pretraining and policy verification

Taking into account both the positive incentive for system performance improvement and the negative penalty for execution cost, an instant benefit function R(s_t, a_i) of the reinforcement learning agent is defined, for example as follows:

where α is the performance incentive coefficient, greater than 0, used for dimension matching and for balancing convergence speed; it amplifies the benefit of the overall system performance index λ(s_t) in the current state s_t. λ(s_t) can be obtained by combining measures such as data center throughput and task completion rate; a larger value indicates better performance, and the specific value can be obtained by weighting after linear normalization. ε is the cost penalty coefficient, greater than 0, used to suppress high-cost actions. δ is the cost sensitivity, greater than 0, used for dimension matching and for amplifying the influence of C(a_i, s_t) in the logarithmic or exponential operation.

In the digital twin environment Ω, starting from a state s_t, the agent tries different actions a_i and updates the strategy over multiple iterations according to the instant benefit R(s_t, a_i). Real load fluctuations are simulated from the monitoring data collected in step one, so that the agent learns an optimal or near-optimal load distribution pattern under conditions such as high load, low load, network redundancy and network congestion. In the later stage of training convergence, the influence of execution cost can be amplified further by adjusting the cost sensitivity δ, guiding the agent to reduce high-cost actions that yield limited performance benefit.
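The benefit formula is not reproduced in this text. A minimal sketch, assuming the performance term enters linearly and the cost term enters through a logarithm scaled by the cost sensitivity, is given below. The function name `instant_reward` and the `log1p` form are assumptions chosen to match the roles the description gives to the three coefficients.

```python
import math

def instant_reward(perf_index, cost, alpha=1.0, epsilon=1.0, delta=1.0):
    """Hypothetical R(s_t, a_i) = alpha * perf_index - epsilon * ln(1 + delta * cost)."""
    if min(alpha, epsilon, delta) <= 0:
        raise ValueError("alpha, epsilon and delta must be positive")
    if cost < 0:
        raise ValueError("cost must be non-negative")
    return alpha * perf_index - epsilon * math.log1p(delta * cost)
```

Under this assumed form, raising `delta` late in training lowers the reward of any action with non-zero cost, which is the effect the paragraph attributes to adjusting the cost sensitivity.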

After training, indexes such as the agent's cumulative benefit, average execution cost and load balance in each scenario are tallied to obtain several candidate strategies. The strategy with the highest composite score on energy efficiency and SLA performance is selected as the pre-training strategy, and a corresponding performance evaluation result θ is generated, including the average migration frequency, overall energy consumption and response time under that strategy. The pre-training strategy and the performance evaluation result θ are output together.

Here, SLA performance refers to the key quality indexes agreed in a Service Level Agreement (SLA) for measuring whether a system or service meets contract requirements, including availability, response delay and throughput;
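How the composite score is formed is not stated. As a small sketch, assuming each candidate already has a normalized energy-efficiency score and SLA score and that the composite is their weighted sum, selection could be written as follows. The dictionary keys, the weights and the function name are hypothetical.

```python
def select_pretrained(candidates, w_energy=0.5, w_sla=0.5):
    """Pick the candidate with the highest weighted energy-efficiency and SLA score."""
    if not candidates:
        raise ValueError("no candidate strategies")

    def score(c):
        return w_energy * c["energy_score"] + w_sla * c["sla_score"]

    return max(candidates, key=score)
```

Changing the weights shifts the choice between a strategy that saves more energy and one that holds SLA targets more tightly.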

Large-scale migration or testing in a real data center carries high risk and cost; step 202 instead completes a large number of simulation iterations in the digital twin environment Ω, so a better strategy can be found quickly without affecting production services. By embedding nonlinear amplification in the reinforcement learning objective function R(s_t, a_i), the agent learns on its own to avoid operations that contribute little to system benefit but cost too much to execute, which suppresses low-benefit, high-cost actions and improves stability and energy efficiency in later deployment. Once simulation training converges, it yields a pre-training strategy that can be applied directly to the subsequent hierarchical control architecture, greatly shortening the online tuning period and giving good scheduling efficiency from the initial stage.

Step three: when the pre-training strategy is called and deployed to the actual data center, the upper reinforcement learning controller generates a migration or start-stop instruction set {a_i'} according to the macroscopic load distribution scheme screened out in step two and transmits it to the lower scheduling module, which executes it in batches with reference to the real-time monitoring information of step one and the dynamic data of the multidimensional cost function C(a_i, s_t); segmented control and safety-limit checks ensure that virtual machine migration and server start-stop are carried out stably, and traceable execution feedback is generated for later use by online adaptation;

The third step comprises the following steps:

Step 301, deployment of the upper reinforcement learning controller and global load distribution

The pre-training strategy output in step two is loaded into the upper reinforcement learning controller π_top, the decision module that is deployed at the top layer of the data center scheduling architecture and runs the reinforcement learning strategy. Given the current system state s_t, π_top(s_t) generates a macroscopic load distribution decision, that is, a high-level instruction set {a_i} covering each PDU load, server group and virtual machine migration direction;

The performance evaluation result θ generated in step two is placed in the reference database of the upper reinforcement learning controller π_top so that actual effects can be compared and deviations detected during strategy execution. The controller is linked with the multidimensional cost function C(a_i, s_t) constructed in step one: the latest real-time monitoring information (such as server utilization, PDU load and power supply start-up time) is fed into the state observation of π_top, ensuring that the controller is aware of operations that may currently be costly;

After the upper reinforcement learning controller π_top computes the action instruction set {a_i}, the instructions are filtered again based on the multidimensional cost function C(a_i, s_t), and the priority of actions with excessive cost or unclear benefit is automatically lowered, so that large-scale or overly frequent operations are avoided at the source;

The action instruction set {a_i'} finally retained by the upper reinforcement learning controller π_top is defined as the global load distribution instruction, for example migrating several virtual machines to a designated host or enabling/disabling a specific server group. The instruction set {a_i'} is output, together with its priority labels, to the lower scheduling module of step 302, which carries out the specific operations in batches or in segments.
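The re-filtering described in step 301 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Action` record, its `benefit` field, the thresholds `cost_limit` and `min_benefit`, and the one-step priority demotion are all assumptions introduced here. The text only states that actions with excessive cost or unclear benefit have their priority lowered.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    cost: float      # value of C(a_i, s_t), assumed to be precomputed
    benefit: float   # expected benefit (hypothetical field)
    priority: int    # larger value means more urgent


def screen_actions(actions, cost_limit, min_benefit):
    """Lower the priority of costly or unclear-benefit actions, then order them.

    ASSUMPTION: a penalised action is demoted by exactly one priority level.
    """
    screened = []
    for a in actions:
        penalised = a.cost > cost_limit or a.benefit < min_benefit
        priority = a.priority - 1 if penalised else a.priority
        screened.append(Action(a.name, a.cost, a.benefit, priority))
    # Highest priority first; cheaper actions win ties.
    return sorted(screened, key=lambda a: (-a.priority, a.cost))


candidates = [
    Action("migrate_vm_17", cost=2.0, benefit=5.0, priority=3),
    Action("power_on_group_B", cost=9.0, benefit=4.0, priority=3),
    Action("switch_pdu_2", cost=1.0, benefit=0.1, priority=2),
]
retained = screen_actions(candidates, cost_limit=5.0, min_benefit=1.0)
```

Demoting rather than deleting keeps every instruction visible to the lower scheduling module, which can still defer the low-priority ones.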

During execution, if the lower scheduling module (step 302) reports execution delay or partial operation failure, the upper reinforcement learning controller π_top performs a fast simulation evaluation in the digital twin environment Ω constructed in step two to decide whether the strategy needs temporary adjustment. This intermediate feedback lets the controller fine-tune the current strategy and also supplies more realistic operating data for step four, namely online adaptive learning and continuous iterative optimization.

In use, scheduling in the real environment starts directly from the pre-training strategy, which greatly shortens online tuning time; the initial strategy already offers a good trade-off between energy consumption and performance, which keeps macroscopic decisions trending toward the global optimum. Because the multidimensional cost function C(a_i, s_t) is consulted in real time when actions are generated, control instructions with excessive overhead can be eliminated or postponed first, which reduces the execution load of the lower layer. If the lower-layer execution results do not match expectations, the strategy can be quickly checked and revised in the digital twin environment Ω, forming a fast test-and-tune closed loop. In conventional systems, excessive migration or overly frequent start-stop is often discovered only at the execution stage, whereas this scheme pre-screens at the upper layer according to C(a_i, s_t), so execution oscillation can be significantly reduced.

Step 302, lower flexible dispatch and multi-stage execution

The lower scheduling module obtains the action instruction set {a_i'} output by the upper layer and reads the corresponding priority labels;

According to the dynamic monitoring data of step one (such as the network bandwidth BW_t and the server temperature Temp_t), the module determines whether the system can currently execute the instructions {a_i'} simultaneously; if not, they are scheduled in priority or dependency order. For example, the system first computes the currently available bandwidth BW_avail = BW_cap - BW_t (total bandwidth minus used bandwidth) and, for each server s, the tolerable temperature margin ΔT_avail(s) = Temp_max - Temp_t(s) (safe upper temperature limit minus current temperature), and estimates the resource requirement of each migration or start-stop instruction a_i'. It then sums the bandwidth requirements of all instructions and compares the total with BW_avail × μ_bw (μ_bw being the bandwidth safety factor), and, for each target server, compares the temperature change ΔT_i caused by the instruction with ΔT_avail(s) × μ_tmp (μ_tmp being the temperature safety factor). If both the network and the temperature conditions are satisfied for all instructions, the instructions can be executed in parallel; otherwise they are split into smaller batches so that every operation stays within the bandwidth and temperature safety margins. This avoids performance jitter or service interruption caused by bandwidth overload or machine room overheating.
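The bandwidth and temperature checks above can be illustrated with a short sketch. The tuple layout of an instruction, the function names and the greedy splitting rule are assumptions made for illustration; only the two margin comparisons come from the text.

```python
def can_run_in_parallel(instructions, bw_cap, bw_used, temp_max, temp_now,
                        mu_bw=0.8, mu_tmp=0.8):
    """instructions: list of (bandwidth_demand, target_server, delta_temp).

    Implements the two conditions from the text:
      sum of bandwidth demands <= BW_avail * mu_bw
      delta_temp of each instruction <= dT_avail(server) * mu_tmp
    """
    bw_avail = bw_cap - bw_used
    if sum(bw for bw, _, _ in instructions) > bw_avail * mu_bw:
        return False
    for _, server, delta_t in instructions:
        margin = temp_max - temp_now[server]
        if delta_t > margin * mu_tmp:
            return False
    return True


def split_into_batches(instructions, **limits):
    """Greedy split (an assumed rule): grow a batch until the next instruction
    would break a margin. An instruction that violates the limits on its own
    still forms a singleton batch and would have to be deferred by the caller.
    """
    batches, current = [], []
    for ins in instructions:
        if current and not can_run_in_parallel(current + [ins], **limits):
            batches.append(current)
            current = []
        current.append(ins)
    if current:
        batches.append(current)
    return batches


limits = dict(bw_cap=10.0, bw_used=4.0, temp_max=80.0,
              temp_now={"s1": 60.0, "s2": 70.0}, mu_bw=0.5, mu_tmp=0.5)
pending = [(2.0, "s1", 5.0), (2.0, "s2", 4.0), (1.0, "s1", 3.0)]
batches = split_into_batches(pending, **limits)
```

With these illustrative numbers the three instructions exceed the bandwidth budget together, so they are split into two batches.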

For a large number of virtual machines to be migrated or a plurality of servers to be enabled, a batch and segment execution strategy is adopted:

where k represents the number of operations in the current batch, such as the number of virtual machines to be migrated;

ω and γ are the scheduling amplification coefficient and the flexibility control parameter respectively; both are positive real numbers that determine the execution scale of each batch. When γ is greater than 1, the execution scale of subsequent batches increases gradually with the batch index, and vice versa.

Through this batch control function, a small number of migrations or start-stop operations can be executed first as a trial to verify whether their impact on server load and network I/O stays within the range simulated in step two; the batch size is then scaled according to the observed effect.
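The batch control function itself is not reproduced in this text, so the sketch below substitutes an assumed form purely for illustration: batch n (counted from 0) has size round(ω · γ^n), with a minimum of 1. This matches the stated behaviour (sizes grow with the batch index when γ is greater than 1 and shrink otherwise) but is not claimed to be the patented formula.

```python
def batch_sizes(total, omega, gamma):
    """Return batch sizes that together cover `total` operations.

    ASSUMPTION: k_n = max(1, round(omega * gamma**n)) for n = 0, 1, 2, ...
    omega: scheduling amplification coefficient (positive real)
    gamma: flexibility control parameter (positive real)
    """
    if omega <= 0 or gamma <= 0:
        raise ValueError("omega and gamma must be positive")
    sizes, remaining, n = [], total, 0
    while remaining > 0:
        k = max(1, round(omega * gamma ** n))
        k = min(k, remaining)  # the last batch only takes what is left
        sizes.append(k)
        remaining -= k
        n += 1
    return sizes


growing = batch_sizes(20, omega=2, gamma=2)      # small trial batch first
shrinking = batch_sizes(10, omega=4, gamma=0.5)  # front-loaded schedule
```

Starting with a small batch and growing it corresponds to the trial-then-scale usage described above.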

After each batch or segment completes, the module checks whether the current system state s_t has reached or exceeded a preset safety limit (for example the maximum available margin of PDU load or the critical value of network occupancy); if so, subsequent operations are suspended and the condition is reported to the upper reinforcement learning controller π_top;

The execution process data (including actual migration time, server power-on time and success/failure logs) are synchronized to the digital twin environment Ω of step two and to the online adaptive learning of the next step as reference.
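The check-after-every-batch behaviour can be sketched as a small loop. The callables `execute` and `read_state`, the dictionary-based state and the shape of the returned report are hypothetical; the text only requires that execution pauses and the upper controller is informed once a monitored index reaches or exceeds its limit.

```python
def run_batches(batches, execute, read_state, limits):
    """Run batches in order and pause as soon as a safety limit is reached.

    execute(batch): carries out one batch (hypothetical callback)
    read_state():   returns the current monitored indexes as a dict
    limits:         index name -> safety limit ("reaches or exceeds" pauses)
    """
    completed = []
    for batch in batches:
        execute(batch)
        completed.append(batch)
        state = read_state()
        breached = {k: v for k, v in state.items()
                    if k in limits and v >= limits[k]}
        if breached:
            # Report to the upper controller instead of continuing.
            return {"completed": completed, "paused": True, "breached": breached}
    return {"completed": completed, "paused": False, "breached": {}}


# Simulated monitoring readings, one per finished batch.
readings = iter([{"pdu_load": 0.60}, {"pdu_load": 0.95}, {"pdu_load": 0.50}])
executed = []
report = run_batches([["vm1"], ["vm2"], ["vm3"]], executed.append,
                     lambda: next(readings), {"pdu_load": 0.90})
```

The returned report is the kind of execution feedback that could be forwarded to the digital twin environment and to online adaptive learning.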

If the upper-layer instruction includes shutting down several idle or low-load servers, the lower module first checks whether the current task migration has completed, so that no server is powered off before its migration finishes; once shutdown is confirmed, the servers are shut down sequentially in small batches to avoid excessive current surges. The lower module is the execution unit deployed in the data center control architecture that is responsible for carrying out the upper-layer scheduling instructions;
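A minimal sketch of this shutdown ordering is given below. The function name, the `migrating` set and the default batch size are assumptions for illustration; the two rules taken from the text are that servers with unfinished migrations are not powered off and that the rest are shut down in small batches.

```python
def plan_shutdown(servers, migrating, batch_size=2):
    """Split shutdown candidates into small batches.

    servers:   ids of idle or low-load servers selected for shutdown
    migrating: ids of servers whose task migration has not finished yet
    Returns (batches, deferred); deferred servers are retried later.
    """
    ready = [s for s in servers if s not in migrating]
    deferred = [s for s in servers if s in migrating]
    batches = [ready[i:i + batch_size] for i in range(0, len(ready), batch_size)]
    return batches, deferred


shutdown_batches, deferred = plan_shutdown(["a", "b", "c", "d", "e"], {"c"})
```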

The advantage is that large-scale migration or start-stop actions are divided into several segments and the system impact of each segment is observed step by step, which effectively prevents resource bottlenecks and performance jitter. Because key indexes in the state s_t are continuously monitored and compared with safety thresholds, the operation scale can be adjusted flexibly during execution, protecting both the SLA and equipment health.

When generating the global load distribution instruction, the upper reinforcement learning controller π_top uses the pre-training strategy output in step two and incorporates the multidimensional execution cost data C(a_i, s_t) of step one, reducing high-cost actions at the source. The lower flexible scheduling then arranges the execution order according to the safety limits from real-time monitoring and the multi-segment batch strategy, so that the system keeps a stable transition during large-scale migration, start-stop or PDU switching. Instructions generated by the upper controller are put into effect step by step at the lower layer, and the lower-layer execution data are fed back to the upper layer and to the digital twin environment Ω, providing real-scenario feedback for the online adaptive training of step four.

Combining the pre-training strategy with the batch/segment scheduling mechanism therefore markedly improves safety and efficiency when the scheme is deployed in a real data center, avoids the common defect of relying purely on macroscopic optimization while ignoring the impact on the execution layer, and improves both global optimization capability and execution stability.

Step four: when the concept drift metric CD(Δ(s_t)) is detected to exceed a threshold or the strategy return drops significantly, fast small-scale online training is performed using the actual scheduling feedback collected in step three and the digital twin environment Ω, and the reinforcement learning strategy is incrementally fine-tuned; a prediction model is used to anticipate the short- and medium-term load, virtual machine migration or standby power start-up schemes are arranged in advance, and the updated strategy is put into the real environment so that multidimensional cost and performance indexes remain globally optimized under non-stationary load;

the fourth step comprises the following steps:

Step 401, environmental offset monitoring and concept drift detection

After the lower-layer flexible scheduling of step three is completed, the actual system performance of each execution batch is compared with the expected state verified in the digital twin environment Ω in step two, and the errors of key indexes (such as server utilization rate, PDU power supply margin and network occupancy rate) after load distribution are collected;

Let Δ(s_τ) denote the difference vector between the actual system state and the ideal or expected state at time τ. To capture both the accumulated deviation and the rate of change of the deviation over the most recent period, a sliding time window of length L is defined; when drift is evaluated at the current time t, the evolution of the difference over the interval [t-L, t] is considered as a whole, giving the following metric function:

Δ(s_τ) represents the difference between the actual state and the expected state of the system at time τ; it is a vector that measures the deviation of the multidimensional load indexes (such as CPU utilization rate, PDU load and network occupancy rate) at that time;

W represents a configurable positive definite symmetric weighting matrix that assigns different weights to the indexes within the vector; for example, CPU utilization deviations may be given higher weights while temperature or network indexes are given secondary weights;

[Δ(s_τ)]^T W [Δ(s_τ)] can be regarded as a quadratic form under the weighting matrix, measuring the combined effect of the multidimensional deviation.

The first derivative of the deviation vector with respect to time τ reflects how fast the deviation changes over time; κ is the drift speed sensitivity coefficient, with a value greater than 0, which controls how much weight is given to the rate of change of Δ(s_τ);

β is the drift integral sensitivity coefficient, with a value greater than 0, and L is the sliding time window length;

If the resulting deviation degree CD(Δ_t) exceeds the deviation threshold τ_cd, significant drift is considered to have occurred; when the drift monitoring module detects CD(Δ_t) > τ_cd, the subsequent online training or transfer learning is triggered.
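The metric function is not reproduced in this text. One discrete form consistent with the definitions above (a quadratic form under the weighting matrix, a rate-of-change term scaled by the drift speed sensitivity coefficient, and a window sum scaled by the drift integral sensitivity coefficient) might look like the sketch below. How the terms are combined, the use of the squared norm of a finite-difference rate, and the sampling step `dt` are assumptions, not the patented formula.

```python
def quad_form(v, w):
    """Compute v^T W v for a list vector v and a nested-list matrix w."""
    n = len(v)
    return sum(v[i] * w[i][j] * v[j] for i in range(n) for j in range(n))


def concept_drift(deltas, w, beta, kappa, dt=1.0):
    """Approximate a drift score over a sliding window.

    deltas: deviation vectors sampled every `dt` across the window [t-L, t]
    ASSUMED form: beta * sum_tau (delta^T W delta + kappa * |d delta/d tau|^2) * dt
    """
    total = 0.0
    for idx, d in enumerate(deltas):
        term = quad_form(d, w)
        if idx > 0:
            prev = deltas[idx - 1]
            rate = [(x - y) / dt for x, y in zip(d, prev)]  # finite difference
            term += kappa * sum(r * r for r in rate)
        total += term * dt
    return beta * total


window = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]   # e.g. CPU and PDU deviations
weights = [[2.0, 0.0], [0.0, 1.0]]              # positive definite, symmetric
score = concept_drift(window, weights, beta=0.5, kappa=1.0)
drift_alarm = score > 5.0                       # 5.0 stands in for the threshold
```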

When the deviation CD(Δ_t) is found to exceed the deviation threshold τ_cd, a drift alarm is generated and details such as the trigger timestamp and the main deviating indexes are recorded. The alarm is sent to step 402 to start a targeted strategy fine-tuning process, and step 403 is notified so that the strategy can be adjusted in advance in combination with the load prediction mechanism.

In use, continuously monitoring the difference between the actual and expected states allows environmental changes to be captured quickly and strategy mismatch to be identified in real time, so that an outdated strategy is not used for a long time under non-stationary load. Because subsequent learning is carried out only when CD(Δ_t) exceeds the deviation threshold τ_cd, unnecessary training overhead is reduced and system stability is maintained.

Step 402, small-scale online training and transfer learning

After the drift alarm is received, data are extracted from the real running logs of the most recent execution periods and from the execution feedback collected in step three to form a micro training data set D. The micro training data set D includes the current state, the action a_i' generated by the upper reinforcement learning controller π_top, the multidimensional execution cost, and information such as the actual return (SLA performance, energy consumption).

The micro training data set D is imported into the digital twin environment Ω, and small-scale online training or transfer learning is performed with the pre-training strategy output in step two as the starting point:

where π_old is the original strategy function, such as the strategy currently used by the upper reinforcement learning controller π_top, and η is the learning rate, which can be adjusted according to the drift amplitude and the system stability requirement;

The loss function of reinforcement learning or deep learning is minimized over the micro training data set D. The upper reinforcement learning controller is trained with a composite loss function based on policy gradient:

The first term computes the negative log-probability-weighted return over the state-action-return triples (s, a, r) in the data set D, thereby maximizing the performance incentive αΛ(s) while penalizing the cost through ε(e^{δC(a,s)} - 1);

The second term introduces Kullback-Leibler divergence regularization between the old strategy π_old and the new strategy π, with weight coefficient λ, to smooth strategy updates and avoid unstable decisions caused by a single large adjustment, so that across the offline pre-training and online fine-tuning stages the strategy keeps improving the overall benefit of the system while remaining continuous and robust with respect to the verified strategy;
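The two-term loss described above can be illustrated for a small tabular policy. This is a sketch under stated assumptions: the policy is a dictionary mapping each state to a list of action probabilities, the return is shaped as performance incentive minus exponential cost penalty, both terms are averaged over the data set, and the divergence is taken from the old policy to the new one. The source text does not fix these details, and all function names are hypothetical.

```python
import math


def shaped_return(perf, cost, alpha, eps, delta):
    """Return r = alpha * perf - eps * (exp(delta * cost) - 1)."""
    return alpha * perf - eps * (math.exp(delta * cost) - 1.0)


def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def composite_loss(batch, new_policy, old_policy, lam):
    """batch: list of (state, action_index, return) triples from the data set.

    First term: negative log-probability weighted by the return.
    Second term: lam * KL(old || new), averaged over the visited states.
    """
    n = len(batch)
    pg = -sum(math.log(new_policy[s][a]) * r for s, a, r in batch) / n
    reg = sum(kl(old_policy[s], new_policy[s]) for s, _, _ in batch) / n
    return pg + lam * reg


old = {"s0": [0.5, 0.5]}
new = {"s0": [0.9, 0.1]}
r = shaped_return(perf=2.0, cost=0.0, alpha=1.5, eps=0.3, delta=0.8)
data = [("s0", 0, r)]
loss_unregularised = composite_loss(data, new, old, lam=0.0)
loss_regularised = composite_loss(data, new, old, lam=1.0)
```

A larger regularization weight raises the loss of policies that move far from the old one, which is the smoothing effect the text attributes to the second term.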

The trained strategy π_new undergoes fast multi-scenario verification in the digital twin environment Ω (with fewer simulations than the full training of step two). If the return improves significantly and system fluctuation is controllable, π_new replaces the strategy in the upper controller; if the new strategy is found to be unstable during verification or is limited by an insufficient data set, π_old is retained and only part of the actions are incrementally updated.

In this way the parts of the strategy that no longer match the new environment can be corrected promptly after a concept drift alarm, avoiding the high cost and long delay of full retraining, and fast verification of π_new in the digital twin environment Ω screens out high-risk schemes before they are applied to the real system, protecting the service continuity of the data center. The real execution logs reflect the latest environment while the fine-tuning itself runs as efficient simulation in the digital twin environment Ω, so online training completes quickly without disturbing the production system, linking the micro training data with the digital twin.

Step 403, combining the prediction model to perform prospective strategy layout

In parallel with concept drift monitoring and small-scale online training, the historical load sequence and recent monitoring data are retrieved and a time series prediction model (for example one based on a long short-term memory network, LSTM) is applied to obtain a load trend estimate for the next T time periods. Combined with the multidimensional cost function C(a_i, s_t) of step one, possible virtual machine migration or server start-stop actions are evaluated in advance, and sufficient available servers or load offline windows are prepared with priority;

The predicted load trend estimate is fed to the strategy π updated (or retained) in step 402 to simulate scheduling decisions for the coming period in the digital twin environment Ω. If the upcoming high-load interval is likely to trigger a large number of migrations or equipment start-ups, batch preheating or resource preparation is carried out in the real system in advance so that operations are not concentrated just before the high load, realizing a forward-looking strategy layout. The layout is submitted to the upper reinforcement learning controller π_top together with the execution log, and the next round of concept drift detection observes whether it further reduces system fluctuation;

If the prediction deviation is large, that is, the error index between the load trend estimate and the load actually observed in the corresponding period exceeds a preset threshold, the difference is fed back to the time series prediction model and the prediction parameters are iteratively updated to improve subsequent accuracy.
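The feedback trigger can be sketched with the mean absolute percentage error, which claim 10 names as the error index. The function names, the handling of zero observations and the sample numbers are assumptions for illustration.

```python
def mape(predicted, observed):
    """Mean absolute percentage error in percent.

    Periods with an observed load of zero are skipped (an assumed convention).
    """
    pairs = [(p, o) for p, o in zip(predicted, observed) if o != 0]
    if not pairs:
        raise ValueError("no non-zero observations to compare against")
    return 100.0 * sum(abs(p - o) / abs(o) for p, o in pairs) / len(pairs)


def needs_refit(predicted, observed, threshold_pct):
    """True when the prediction error exceeds the preset threshold."""
    return mape(predicted, observed) > threshold_pct


forecast = [110.0, 90.0, 100.0]   # predicted load per period (illustrative)
actual = [100.0, 100.0, 100.0]    # load observed once the periods arrive
error_pct = mape(forecast, actual)
refit = needs_refit(forecast, actual, threshold_pct=5.0)
```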

By grasping the load fluctuation trend in advance, the system can utilize a low-load period to finish resource preparation or redundant equipment offline operation, so that emergency scheduling in a load peak period can be actively avoided, and burst migration and start-stop risks in the peak period are greatly reduced. When dealing with non-stationary load, not only passive on-line learning can be performed, but also more prospective strategy layout can be performed under the support of predicted data.

Through this online adaptive mechanism, the method ensures a quick response to real-time environmental changes and can also guard proactively against future fluctuation, achieving a deep fusion of the four dimensions of monitoring, learning, predicting and executing. In this cooperation, concept drift detection, online training, load prediction and the other links are not simply stacked but reinforce one another, so that PDU load distribution and energy consumption optimization are continuously maintained in an optimal state over long-term operation, providing an effective technical guarantee for the sustainable operation of large-scale data centers under non-stationary load.

Step five: when the data center needs to strengthen security protection or raise its fault tolerance level, a security cost and a fault-tolerance overhead are added to the multidimensional cost function C(a_i, s_t) to form a comprehensive cost function C'(a_i, s_t). In offline pre-training, the reinforcement learning strategy is screened by priority through security-related penalties or rewards; in the hierarchical scheduling and online adaptive stages, redundancy, encryption or batch migration is applied to key servers or sensitive services, ultimately achieving high-dimensional coordination of data security and redundant fault tolerance while still meeting energy consumption and performance requirements.

The fifth step comprises the following steps:

Step 501, incorporating security and fault tolerance overheads in the multidimensional execution cost model

Building on the multidimensional cost function C(a_i, s_t) of step one, a security cost function and a fault-tolerance overhead function are further defined, measuring respectively how strongly executing action a_i in state s_t affects the system security strategy and the fault tolerance strategy;

The security cost function can represent the extra load caused by encryption, authentication and isolation operations; the fault-tolerance overhead function can describe the additional resources consumed by fault-tolerant redundant deployments (for example multi-copy synchronization or dual hot standby of critical nodes);

The security and fault tolerance costs are incorporated into the original cost model, and a comprehensive cost function C'(a_i, s_t) is defined:

where C(a_i, s_t) is the original multidimensional cost covering energy consumption, delay and performance jitter (defined in step one);

ρ₁ and ρ₂ are the weight coefficients of security and fault tolerance, with values between 0 and 1, which determine how strongly security and fault tolerance are prioritized; when either term is larger, the weighted cost suppresses the probability of selecting high-risk or low-fault-tolerance actions in subsequent decisions;
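The comprehensive cost function is not reproduced in this text. A simple additive form consistent with the definitions (original cost plus weighted security cost plus weighted fault-tolerance overhead) is sketched below as an assumption; the actual formula may combine the terms differently. The action names and numbers are purely illustrative.

```python
def comprehensive_cost(base_cost, sec_cost, ft_cost, rho1, rho2):
    """ASSUMED form: C' = C + rho1 * C_sec + rho2 * C_ft, with rho1, rho2 in [0, 1]."""
    if not (0.0 <= rho1 <= 1.0 and 0.0 <= rho2 <= 1.0):
        raise ValueError("rho1 and rho2 are expected to lie in [0, 1]")
    return base_cost + rho1 * sec_cost + rho2 * ft_cost


# Two candidate actions as (base cost, security cost, fault-tolerance overhead).
candidate_costs = {
    "shutdown_unreplicated_host": (1.0, 0.0, 8.0),
    "migrate_then_shutdown": (3.0, 0.0, 1.0),
}


def cheapest(rho1, rho2):
    """Pick the action with the lowest comprehensive cost."""
    return min(candidate_costs,
               key=lambda name: comprehensive_cost(*candidate_costs[name], rho1, rho2))


without_ft = cheapest(rho1=0.0, rho2=0.0)
with_ft = cheapest(rho1=0.5, rho2=0.5)
```

Raising the fault tolerance weight flips the preference toward the action that preserves redundancy, which is the suppression effect described above.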

Using the monitoring component of step one, the collection of information such as security alarms, vulnerability detection results and key node availability is added, and the security cost and fault-tolerance overhead terms are updated in real time, so that the comprehensive cost function C'(a_i, s_t) reflects the latest security and fault tolerance state at every moment. By relying on the original monitoring mechanism and the newly added security alarm interface, states such as security vulnerabilities and key node redundancy are tracked dynamically, so security and fault tolerance requirements can be perceived in real time.

Step 502, introducing security/fault tolerance limits or rewards in offline pretraining and hierarchical scheduling

The newly defined comprehensive cost function C'(a_i, s_t) is evaluated within the digital twin environment Ω;

Negative returns or positive incentives related to security and fault tolerance are incorporated into the reinforcement learning objective function, for example giving an additional positive reward to necessary security operations (such as high-priority encryption) and applying an exponential penalty to unsafe or non-fault-tolerant actions, so that the agent learns to take both security and fault tolerance into account during the offline training phase;

When the upper reinforcement learning controller π_top generates a macroscopic load distribution scheme, it gives priority to the security state of key servers and to whether redundant nodes are complete; if an action is likely to weaken security or leave fault tolerance insufficient, its execution cost is automatically increased through the comprehensive cost function C'(a_i, s_t) and its scheduling priority is lowered;

When the lower flexible scheduling module migrates in batches or starts and stops equipment, it also consults the security and fault tolerance indexes: if the host running a sensitive service has not completed encryption isolation or copy backup, the shutdown operation is delayed and the event is logged into the execution feedback for fine-tuning in the subsequent online adaptation. No separate security strategy training therefore needs to be built; security-critical and fault-tolerance-critical operations can be batched or staged more cautiously during hierarchical scheduling, hidden dangers to system security are avoided, and high-priority security requirements are reliably executed.

Step 503, on-line self-adapting enhanced security and fault tolerance

When concept drift is detected in step four, an additional offset related to security alarms and the redundancy state of key nodes is introduced. For example, when the overall system load is still normal but a high-density network attack or a multi-node hardware failure occurs, the security dimension weight within the offset CD(Δ(s_t)) rises quickly and triggers the alarm, focusing the subsequent online training on security countermeasures.

If security alarms are frequent or redundant nodes are insufficient, the fault-tolerance overhead function rises sharply and the small-scale online training or transfer learning of step four is started;

Recent real execution logs are collected in the form of a micro training data set with security and fault tolerance indexes incorporated, and the penalty coefficient for unsafe actions is increased or the incentive value for redundant operations is raised, so that the agent converges quickly to a new strategy that balances security and performance.

In combination with the time series prediction model of step four, if a potential attack peak or extreme fluctuation of critical load is expected in a future period, more redundant servers are configured in advance or stricter authentication and isolation measures are enabled, and a temporary security silence window is set if necessary to complete backup. The actual execution results are fed back again to the digital twin environment Ω, reserving an interface for subsequent higher-level security or fault tolerance extensions.

In use, when an abrupt change in the data center environment stems mainly from security or fault tolerance factors, the concept drift detection mechanism captures the risk quickly and triggers an online update, keeping strategy and environment evolving in step and adapting dynamically to security and fault tolerance requirements. By anticipating attack or fault peaks, key data isolation or multi-copy deployment can be completed before the main load overheats, improving the overall resilience of the system. Step 501 couples the security and fault tolerance dimensions into the original multidimensional execution cost model, so that the security and fault tolerance cost of every execution action is quantified within the same framework. Step 502 inserts security and fault tolerance related reward or penalty mechanisms into the offline pre-training and hierarchical scheduling links (corresponding to step two and step three), so that the agent optimizes energy consumption and performance while respecting security compliance and fault tolerance stability. Step 503, for the online adaptive scenario (corresponding to step four), integrates security and fault tolerance index weights into concept drift detection and small-scale online training, so that burst scenarios with a high-risk security situation or insufficient redundancy can be handled urgently and forward-looking deployment can be carried out together with load prediction.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.

In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the elements is merely a division of some logic functions, and there may be additional divisions in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. The PDU load distribution system based on reinforcement learning is characterized by comprising,

When load fluctuation of the data center is detected, a real-time monitoring component is called to acquire original monitoring data, the depth quantization of energy consumption and delay is carried out on virtual machine migration and server start-stop through a multidimensional execution cost model, and a dynamic cost sequence is output;

2. The reinforcement learning based PDU load distribution system of claim 1, wherein:

the multidimensional execution cost model sets independent dimensions for virtual machine migration, server start-stop and PDU switching respectively, weights and quantifies time delay, energy consumption expenditure and business performance influence of each dimension, and is used for dynamically measuring comprehensive cost of each scheduling action in a current state and writing the comprehensive cost into a monitoring database.

3. The reinforcement learning based PDU load distribution system of claim 2, wherein:

the deployed monitoring component collects the server utilization rate, PDU load, power supply starting duration and network bandwidth in a fixed time window, maps the collection result to the cost model in real time to dynamically update the cost data, and generates an available margin threshold value according to the bandwidth safety coefficient and the temperature safety coefficient to serve as a judgment basis for subsequent batch scheduling.

4. The reinforcement learning based PDU load distribution system of claim 3, wherein:

the constructed digital twin environment completely maps PDU distribution topology, server hardware configuration, network bandwidth limitation and typical service load curve of a real data center, and uses performance excitation coefficients and cost penalty coefficients in offline simulation to guide reinforcement learning agents to generate a pre-training strategy.

5. The reinforcement learning based PDU load distribution system of claim 4, wherein:

Multiple rounds of reinforcement learning simulation training are carried out on various load scenes in the digital twin environment, cost sensitivity is adjusted during offline training to amplify the negative return of high-cost actions, the difference between new and old strategies is constrained through a strategy stability coefficient, and a pre-training strategy that balances energy efficiency and performance is output together with a performance evaluation result.

6. The reinforcement learning based PDU load distribution system of claim 5, wherein:

After the upper reinforcement learning controller calls the pre-training strategy to generate an instruction set, the instruction set is screened according to the real-time cost and priority labels and then transmitted to the lower module, which executes it in batches according to the scheduling amplification coefficient and the flexibility control parameter;

Wherein instructions whose cost exceeds a threshold are filtered out based on the multidimensional execution cost model and execution priorities are assigned to the remaining instructions prior to generating the instruction set.

7. The reinforcement learning based PDU load distribution system of claim 6, wherein:

the lower-layer scheduling module executes the instructions in the instruction set in batches according to the bandwidth margin and the server temperature margin and, after each batch completes, returns scheduling feedback containing the actual execution duration, power consumption change and performance jitter, together with the updated cost vector, to the upper-layer controller for use in the online adaptation stage.
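By way of non-limiting illustration, admitting operations into a batch while respecting the bandwidth margin and the server temperature margin might look like the sketch below. The `PendingOp` type, its per-operation estimates, the greedy admission order and the simplification that temperature rises add up linearly are assumptions of this example only; the claims do not prescribe them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PendingOp:
    name: str
    bandwidth_mbps: float  # estimated network bandwidth the operation consumes
    temp_rise_c: float     # estimated temperature rise it causes (assumed additive)


def admit_batch(pending: List[PendingOp],
                bandwidth_margin: float,
                temp_margin: float) -> Tuple[List[PendingOp], List[PendingOp]]:
    """Greedily admit operations that still fit both margins; defer the rest."""
    admitted: List[PendingOp] = []
    deferred: List[PendingOp] = []
    used_bw = 0.0
    used_temp = 0.0
    for op in pending:
        fits = (used_bw + op.bandwidth_mbps <= bandwidth_margin
                and used_temp + op.temp_rise_c <= temp_margin)
        if fits:
            admitted.append(op)
            used_bw += op.bandwidth_mbps
            used_temp += op.temp_rise_c
        else:
            # Deferred operations are retried in a later batch.
            deferred.append(op)
    return admitted, deferred
```

Deferred operations would be offered again once the feedback from the completed batch has refreshed the margins.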

8. The reinforcement learning based PDU load distribution system of claim 7, wherein:

the difference between the actual system state and the expected state is monitored to calculate the degree of environment drift;
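By way of non-limiting illustration, one possible drift measure is the mean relative deviation between the observed state vector and the expected state vector. The claim does not specify the metric; the function name `drift_degree`, the relative-error form and the `eps` guard are assumptions of this sketch.

```python
from typing import Sequence


def drift_degree(actual: Sequence[float],
                 expected: Sequence[float],
                 eps: float = 1e-9) -> float:
    """Mean relative deviation of the actual state from the expected state."""
    if len(actual) != len(expected) or len(actual) == 0:
        raise ValueError("state vectors must be non-empty and of equal length")
    total = 0.0
    for a, e in zip(actual, expected):
        # eps avoids division by zero when an expected component is 0.
        total += abs(a - e) / max(abs(e), eps)
    return total / len(actual)
```

A value of 0 means the observed state matches the expected state exactly; larger values indicate stronger drift.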

9. The reinforcement learning based PDU load distribution system of claim 8, wherein:

the online training process selects several of the most recent scheduling periods to form a small training data set, incrementally updates the old strategy at the learning rate to obtain a new strategy, and redeploys the new strategy after a quick regression test has been completed in the digital twin environment;

10. The reinforcement learning based PDU load distribution system of claim 9, wherein:

a historical load sequence and recent monitoring data are retrieved, the load over a plurality of future time periods is predicted with a long short-term memory (LSTM) network, the result is fed back to the time-series prediction model to iteratively update its parameters when the mean absolute percentage error of the prediction exceeds a set threshold, and available computing and power supply resources are provisioned in advance accordingly.
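By way of non-limiting illustration, the error check that triggers a parameter update could be sketched as follows. Only the mean absolute percentage error test is shown; the LSTM predictor itself is outside this example. The function names, the choice to skip zero-valued observations and the threshold value used in the usage lines are assumptions.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent, over non-zero observations."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("need at least one non-zero observation")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def needs_model_update(actual: Sequence[float],
                       predicted: Sequence[float],
                       threshold_pct: float) -> bool:
    """True when the prediction error exceeds the set threshold."""
    return mape(actual, predicted) > threshold_pct


# Hypothetical observed and predicted PDU loads (kW) for two periods.
observed = [100.0, 200.0]
forecast = [90.0, 220.0]
retrain = needs_model_update(observed, forecast, threshold_pct=5.0)
```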

11. The reinforcement learning based PDU load distribution system of claim 10, wherein:

a safety cost parameter and a fault-tolerance overhead parameter are added to the original multidimensional execution cost model to form a comprehensive cost model, wherein the priorities of key service migration, redundant copy synchronization and sensitive node power management are controlled through a safety weight coefficient and a fault-tolerance weight coefficient.

12. The reinforcement learning based PDU load distribution system of claim 11, wherein:

a positive safety reward and a negative fault-tolerance penalty are added to the reward function during offline simulation, guiding the scheduling strategy to give priority to the safe isolation of sensitive services and the high-availability redundancy of key nodes;

if the host on which a sensitive service runs has not completed encryption isolation or copy backup, the shutdown operation is delayed and a log entry is recorded in the execution feedback.
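By way of non-limiting illustration, the shutdown guard recited in this claim could be sketched as below. The `Host` fields, the return values `"delay"` and `"shutdown"`, and the log message format are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    runs_sensitive_service: bool
    isolation_done: bool  # encryption isolation completed
    backup_done: bool     # copy backup completed


def decide_shutdown(host: Host, feedback_log: List[str]) -> str:
    """Delay shutdown of a sensitive host until isolation and backup are both done."""
    if host.runs_sensitive_service and not (host.isolation_done and host.backup_done):
        # The delay is recorded so that it reaches the execution feedback.
        feedback_log.append(
            f"{host.name}: shutdown delayed, isolation or backup incomplete")
        return "delay"
    return "shutdown"
```

Hosts that run no sensitive service pass straight through, so the guard adds no delay to ordinary start-stop operations.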

13. The reinforcement learning based PDU load distribution system of claim 12, wherein:

after an additional offset related to safety alarms and the redundancy state of key nodes is introduced, the safety and fault-tolerance weight coefficients are dynamically adjusted during online adaptation according to the safety alarm frequency and the number of fault triggers, so as to maintain a preset balance among energy consumption, performance and reliability, and the actual execution results are fed back to the digital twin environment.
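By way of non-limiting illustration, one simple way to adjust a weight coefficient from an event count is a clipped linear rule. The claim does not state the update rule; the function `adjust_weight`, the linear form, the `gain` parameter and the clipping bounds are assumptions of this sketch.

```python
def adjust_weight(base_weight: float,
                  event_count: int,
                  gain: float,
                  w_min: float,
                  w_max: float) -> float:
    """Scale a weight up with the observed event count, clipped to [w_min, w_max]."""
    raw = base_weight * (1.0 + gain * event_count)
    return min(max(raw, w_min), w_max)


# Hypothetical counts observed during one online adaptation window.
safety_weight = adjust_weight(1.0, event_count=3, gain=0.1, w_min=0.5, w_max=2.0)
fault_weight = adjust_weight(1.0, event_count=0, gain=0.1, w_min=0.5, w_max=2.0)
```

Here the same rule is applied to the safety weight (driven by the safety alarm frequency) and to the fault-tolerance weight (driven by the number of fault triggers); the clipping keeps either weight from overwhelming the energy and performance terms.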