WO2019169743A1 - Method and system for detecting server faults - Google Patents
Method and system for detecting server faults
- Publication number
- WO2019169743A1 (application PCT/CN2018/088240)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- fault
- feature
- monitoring data
- server
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- H04L41/0636—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis based on a decision tree analysis
- G06F11/3058—Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
- G06F11/008—Reliability or availability analysis
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/261—Functional testing by simulating additional hardware, e.g. fault simulation
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
- G06F11/3055—Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
- G06F11/3447—Performance evaluation by modeling
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/0677—Localisation of faults
- H04L41/142—Network analysis or design using statistical or mathematical methods
- H04L41/145—Network analysis or design involving simulating, designing, planning or modelling of a network
- H04L41/149—Network analysis or design for prediction of maintenance
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
- G06F11/3433—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment for load management
Definitions
- The present invention relates to the field of Internet technologies, and in particular to a method and system for detecting server faults.
- Currently, a server usually has a fault alarm mechanism.
- When the server is abnormal, it issues an alarm. The server administrator can then overhaul the server to find out which component is abnormal.
- The purpose of the present application is to provide a method and system for detecting server faults that can improve the efficiency of fault detection.
- In one aspect, the present application provides a method for detecting server faults, the method comprising: collecting sample monitoring data of a plurality of servers, wherein the sample monitoring data characterizes the operating state of the servers; training, based on the sample monitoring data, a fault detection model for the plurality of servers; and collecting current monitoring data of a target server and inputting the current monitoring data into the fault detection model to obtain the operational fault corresponding to the current monitoring data.
- In another aspect, the present application provides a server fault detection system comprising a data collection unit, a data processing unit, and a fault detection unit, wherein: the data collection unit is configured to collect sample monitoring data of a plurality of servers, the sample monitoring data characterizing the operating state of the servers;
- the data processing unit includes a big data platform and a model training module, wherein the big data platform is configured to receive the sample monitoring data sent by the data collection unit;
- the model training module is configured to train, based on the sample monitoring data, a fault detection model for the plurality of servers; and the fault detection unit is configured to collect current monitoring data of a target server and input the current monitoring data into the fault detection model to obtain the operational fault corresponding to the current monitoring data.
- The technical solution provided by the present application uses a machine learning method: based on sample monitoring data of multiple servers, a fault detection model for the servers is obtained by training.
- Specifically, the sample monitoring data may include power data, temperature data, fan data, port data, network link data, system event data, and system service data of the server.
- Subsequently, when determining the specific fault of a target server or predicting its faults, the current monitoring data of the target server may be collected and input into the trained fault detection model.
- The output of the fault detection model then characterizes the operational fault corresponding to the current monitoring data.
- In practice, a corresponding sub-model can be trained for each type of monitoring data.
- For the input monitoring data, a matching sub-model can then be selected for fault identification, which improves the accuracy of fault identification. It can be seen from the above that the technical solution provided by the present application can save a great deal of manpower and material resources and can improve the efficiency of fault detection.
- FIG. 1 is a flowchart of a method for detecting server faults in an embodiment of the present invention;
- FIG. 2 is a schematic diagram of an example of a system for detecting server faults in an embodiment of the present invention;
- FIG. 3 is a schematic structural diagram of a system for detecting server faults in an embodiment of the present invention;
- FIG. 4 is a schematic structural diagram of a computer terminal in an embodiment of the present invention.
- The present application provides a method for detecting server faults.
- The method may include the following steps.
- S1: Collect sample monitoring data of multiple servers, the sample monitoring data characterizing the operating state of the servers.
- Monitoring data that characterizes the operating state of the server may be collected from a plurality of online servers through a predefined collection probe.
- The monitoring data may include CDM monitoring data, power data, temperature data, fan data, port data, network link data, system event data, and system service data of the plurality of servers.
- The CDM monitoring data includes CPU (central processing unit) monitoring data, DISK (hard disk) monitoring data, and MEMORY (memory) monitoring data. These data reflect whether the server is running normally, and analyzing them makes it possible to determine the server's current operational fault.
- The predefined collection probe may be a preset collection device that reads the monitoring data from the server through a data transmission protocol agreed with the server.
- The monitoring data read in this way can be used as sample monitoring data for machine learning; by learning from a large amount of sample monitoring data, the characteristics of various fault types can be analyzed.
- The process of collecting sample monitoring data may be completed at the acquisition layer.
- The acquisition layer collects the sample monitoring data by reading the data recorded on the Baseboard Management Controller (BMC) through the Intelligent Platform Management Interface (IPMI). After acquisition, the data is formatted and uploaded to the big data platform.
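The formatting step between acquisition and upload can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the BMC readings arrive as pipe-delimited text lines (name, value, unit, status), and the function name `format_readings`, the record fields, and the sample readings are all hypothetical. The actual layout produced by a given IPMI tool may differ.

```python
import json


def format_readings(server_id, raw_text):
    """Turn pipe-delimited sensor lines into uniform records (assumed layout)."""
    records = []
    for line in raw_text.strip().splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip lines that do not look like sensor rows
        name, value, unit, status = fields[:4]
        try:
            value = float(value)
        except ValueError:
            value = None  # e.g. "na" for a sensor that reports no reading
        records.append({
            "server": server_id,
            "sensor": name,
            "value": value,
            "unit": unit,
            "status": status,
        })
    return records


# Hypothetical readings, for illustration only.
SAMPLE = """
CPU Temp | 45.000 | degrees C | ok
FAN1 | 5400.000 | RPM | ok
PSU1 Status | na | discrete | na
"""

records = format_readings("server-001", SAMPLE)
payload = json.dumps(records)  # the body that would be uploaded to the big data platform
```

Normalizing every reading into the same record shape before upload lets the later grouping and training steps treat all sensor types uniformly.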
- The big data platform may train the fault detection model from the sample monitoring data by using a machine learning method.
- The collected sample monitoring data usually includes the several different types of monitoring data described in step S1. Each type of monitoring data can be treated as one set of feature data, so the sample monitoring data can include multiple sets of feature data.
- For example, the sample monitoring data may be divided into power group feature data, fan group feature data, memory group feature data, and so on.
- The sample monitoring data may be grouped by feature data, and a sub-model may be trained separately for each set of feature data.
- For example, a power fault detection sub-model can be trained from the power group feature data;
- likewise, a memory fault detection sub-model can be trained from the memory group feature data.
- Each set of feature data may include multiple feature data items, which may be the operating data of the same server in different periods or the operating data of multiple servers.
- For example, the memory group feature data may include 1,000 memory data items collected from 100 servers.
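The grouping step described above can be sketched as follows. This is a minimal sketch under stated assumptions: each sample is a dictionary carrying a `group` label (power, fan, memory, and so on), and the field names and sample values are hypothetical rather than taken from the patent.

```python
from collections import defaultdict


def group_by_feature(samples):
    """Split sample monitoring data into one list per feature group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["group"]].append(sample)
    return dict(groups)


# Hypothetical samples from two servers.
samples = [
    {"server": "s1", "group": "power", "values": [12.1, 0.98]},
    {"server": "s1", "group": "memory", "values": [0.71, 3]},
    {"server": "s2", "group": "memory", "values": [0.93, 17]},
    {"server": "s2", "group": "fan", "values": [5400]},
]

groups = group_by_feature(samples)
# Each key now holds the training set for one sub-model,
# e.g. groups["memory"] feeds the memory fault detection sub-model.
```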
- When a sub-model is trained for a set of feature data, each feature data item may be associated in advance with a standard operational fault. The standard operational fault may be obtained by analyzing the feature data, so the associated standard operational fault is the operational fault that the feature data actually reflects.
- The feature data may be input into an initial detection sub-model to obtain a predicted operational fault for the feature data.
- The initial detection sub-model may include an initialized neural network whose neurons have initial parameter values. Because these initial parameter values are set by default, the predicted operational fault obtained by processing the input feature data with them may not be consistent with the standard operational fault that the feature data actually reflects.
- The result predicted by the initial detection sub-model may be a predicted probability group containing multiple probability values, each corresponding to one fault type.
- For example, the predicted probability group may include three probability values that correspond to three memory-related fault types. The higher a probability value, the more likely the corresponding fault type is present. For example, if the predicted probability group is (0.1, 0.6, 0.3), the fault type corresponding to 0.6 can be taken as the predicted operational fault.
- The standard probability group corresponding to the standard operational fault associated with the feature data may be, for example, (1, 0, 0), where the fault type corresponding to the probability value 1 is the standard operational fault.
- By comparing the predicted probability group with the standard probability group, an error between the predicted operational fault and the standard operational fault can be obtained.
- Using this error, the parameters in the initial detection sub-model can be corrected.
- The feature data may then be input again into the corrected detection sub-model, and the process of correcting the sub-model's parameters with the error may be repeated until the predicted operational fault is consistent with the standard operational fault.
- Training the sub-model repeatedly on the large amount of feature data in each set gives the final sub-model higher prediction accuracy.
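The predict, compare, and correct loop described above can be sketched with a deliberately small model. The text describes a neural network; the sketch below swaps in a single-layer softmax classifier trained by gradient descent so that it stays self-contained. The feature values, the three fault classes, and the hyperparameters are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability group that sums to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, features):
    """Return the predicted probability group for one feature data item."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train(samples, n_faults, epochs=2000, lr=0.5):
    """Repeatedly correct the parameters using the prediction error."""
    n_features = len(samples[0][0])
    weights = [[0.0] * n_features for _ in range(n_faults)]  # default initial values
    biases = [0.0] * n_faults
    for _ in range(epochs):
        for features, standard in samples:
            predicted = predict(weights, biases, features)
            for k in range(n_faults):
                # Error between predicted and standard probability for fault k.
                error = predicted[k] - standard[k]
                for j in range(n_features):
                    weights[k][j] -= lr * error * features[j]
                biases[k] -= lr * error
    return weights, biases


# Hypothetical memory features (usage ratio, scaled error count) paired with
# the standard probability group of the associated standard operational fault.
samples = [
    ([0.20, 0.00], [1, 0, 0]),
    ([0.30, 0.10], [1, 0, 0]),
    ([0.90, 0.10], [0, 1, 0]),
    ([0.95, 0.00], [0, 1, 0]),
    ([0.30, 0.90], [0, 0, 1]),
    ([0.20, 1.00], [0, 0, 1]),
]

weights, biases = train(samples, n_faults=3)
```

The update rule uses the difference between the predicted and standard probability groups, which is the gradient of the cross-entropy error for a softmax output. A multi-layer network would propagate the same error back through additional layers.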
- The feature data may characterize the operating state of a component in the server; for example, the CPU data may characterize the operating state of the CPU.
- The feature data may further include multiple feature sub-data items, each representing one aspect of the corresponding component's state at run time.
- For example, the CPU data may include feature sub-data such as CPU usage, CPU usage duration, and the number of threads using the CPU.
- For example, the decision order determined by a decision tree technique may be to evaluate CPU usage first, then the number of threads using the CPU, and finally the CPU usage duration. In each decision step, the value obtained by the decision can serve as the feature value of that feature sub-data.
- For example, the feature value for CPU usage may be 80%.
- Based on the feature values, a predicted probability array corresponding to the feature data may be calculated.
- The decision process may be performed by a neural network, whose neurons may obtain the final predicted probability array from the feature value of each decision step by weighted summation or other non-linear calculation.
- The predicted probability array may include at least one probability value, each probability value corresponding to a fault type.
- For example, the predicted probability array may include three probability values that correspond to three memory-related fault types.
- The fault type corresponding to the largest probability value in the predicted probability array may be used as the predicted operational fault. For example, if the predicted probability array is (0.1, 0.6, 0.3), the fault type corresponding to 0.6 can be taken as the predicted operational fault.
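Selecting the predicted operational fault from a probability array is an argmax over the fault types. A minimal sketch follows; the three memory fault names are hypothetical labels, since the text does not name the fault types.

```python
def predicted_fault(probabilities, fault_types):
    """Return the fault type whose probability value is largest."""
    if len(probabilities) != len(fault_types):
        raise ValueError("one probability value is required per fault type")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return fault_types[best]


# Hypothetical names for three memory-related fault types.
MEMORY_FAULTS = ("ecc_error", "capacity_exhausted", "module_missing")

result = predicted_fault((0.1, 0.6, 0.3), MEMORY_FAULTS)
# result is "capacity_exhausted", the type paired with 0.6
```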
- The training process of the fault detection model can be completed in the data layer.
- The data layer may include the big data platform described above, and may further include a feature grouping module and a model training module.
- The feature grouping module is configured to group the sample monitoring data in the big data platform according to the feature data.
- The grouped feature data can be used in the model training module to train the respective sub-models.
- S5: Collect current monitoring data of the target server, and input the current monitoring data into the fault detection model to obtain the operational fault corresponding to the current monitoring data.
- The current monitoring data of the target server may be collected, and the trained fault detection model may be used to perform fault detection on it.
- The target server may be a server to be checked.
- The current monitoring data of the target server may also be collected by using a preset collection probe.
- The current monitoring data may likewise contain multiple sets of feature data. After the current monitoring data of the target server is collected, the target feature data it contains may be identified and input into the sub-model that matches it,
- so that the operational fault corresponding to the target feature data is obtained. In this way, an operational fault can be obtained for each set of feature data, and the operational faults of the target server can finally be summarized.
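Routing each set of target feature data to its matching sub-model and summarizing the results can be sketched as follows. The sub-models here are stand-in threshold rules rather than trained models, and the group names, feature fields, and thresholds are assumptions made only to keep the example runnable.

```python
def detect_faults(current_data, sub_models):
    """Send each feature group to its matching sub-model and collect the results."""
    faults = {}
    for group, features in current_data.items():
        model = sub_models.get(group)
        if model is None:
            continue  # no sub-model was trained for this feature group
        faults[group] = model(features)
    return faults


# Stand-in sub-models; a real deployment would load trained models instead.
def power_model(features):
    return "undervoltage" if features["voltage"] < 11.4 else "none"


def fan_model(features):
    return "fan_stalled" if features["rpm"] < 500 else "none"


sub_models = {"power": power_model, "fan": fan_model}

# Hypothetical current monitoring data of the target server.
current = {
    "power": {"voltage": 11.0},
    "fan": {"rpm": 5200},
    "port": {"link_up": 1},
}

report = detect_faults(current, sub_models)
# report == {"power": "undervoltage", "fan": "none"}
```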
- The above fault detection process can be completed in the application layer.
- In addition to locating faults on a server that has already failed, the server can be checked periodically to predict possible faults so that maintenance can be performed in time.
- The current monitoring data of the target server may be collected when the target server itself issues a fault prompt message.
- The reason for this is that the fault prompt message sent by the target server is usually rather general: it may indicate only that the target server currently has a fault, without indicating the specific fault type.
- In that case, the current monitoring data can be collected and the trained fault detection model can be used to obtain detailed fault information.
- The current monitoring data of the target server can also be collected at a specified interval, and the trained fault detection model can be applied to each batch of collected monitoring data.
- The purpose of this is to check the target server for faults periodically, so that a tendency toward failure can be predicted and maintenance can be performed before the failure occurs.
- So as not to affect the normal network service of the target server, fault detection may be performed when the target server is idle.
- A load distribution of the target server may be collected, which may include the average load of the target server within each specified time period. For example, the average load of the target server can be computed for every 3-hour period in a day.
- A target time period may be determined based on the load distribution, and fault detection may be performed on the target server within the target time period.
- The average load within the target time period should be relatively low.
- A specified time period whose average load is less than or equal to a specified load threshold may be used as the target time period.
- The specified load threshold can be set, for example, to 50%.
- The specified load threshold can be adjusted flexibly according to actual conditions.
- If at least two specified time periods meet the threshold, one of them may be selected at random as the target time period, or the specified time period with the lowest average load may be used as the target time period. For example, if the average load of the target server is computed for every 3-hour period in a day and is found to be less than or equal to 50% from 0:00 to 3:00 and from 3:00 to 6:00, either of those periods can be used as the target time period.
- During the target time period the load of the target server is low, so the current operating parameters of the target server can be collected and fault detection can be performed without greatly affecting the target server.
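The target time period selection can be sketched as follows, using the 3-hour periods and 50% threshold from the example above. The hourly load values are invented for illustration, the function names are hypothetical, and the sketch picks the lowest-load period rather than a random one.

```python
def average_load_by_window(hourly_load, window_hours=3):
    """Average the load over consecutive windows of a 24-hour day."""
    windows = {}
    for start in range(0, 24, window_hours):
        chunk = hourly_load[start:start + window_hours]
        windows[(start, start + window_hours)] = sum(chunk) / len(chunk)
    return windows


def pick_target_window(windows, threshold=0.5):
    """Return the window with the lowest average load at or below the threshold."""
    candidates = {w: load for w, load in windows.items() if load <= threshold}
    if not candidates:
        return None  # no period is idle enough for detection
    return min(candidates, key=candidates.get)


# Hypothetical hourly load ratios for one day (24 values, hour 0 first).
hourly = [
    0.30, 0.25, 0.20,
    0.35, 0.40, 0.45,
    0.60, 0.70, 0.80,
    0.90, 0.90, 0.85,
    0.80, 0.75, 0.80,
    0.85, 0.90, 0.90,
    0.80, 0.70, 0.70,
    0.65, 0.60, 0.55,
]

windows = average_load_by_window(hourly)
target = pick_target_window(windows)
# Both (0, 3) and (3, 6) fall under the threshold; (0, 3) has the lower average.
```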
- After the operational fault is obtained, a diagnostic policy matching the operational fault may be invoked and used to diagnose the target server.
- The diagnostic policy may be a policy summarized from past diagnostic history, and such policies may be stored in association with the corresponding operational faults.
- Once an operational fault is detected, the associated diagnostic policy can be invoked for detailed diagnosis. For example, the severity of the operational fault and how frequently it occurs can be diagnosed. Based on the diagnosis result, a detection period for the target server can be determined, and the target server can be checked for faults periodically at that interval.
- The detection period can be set according to the severity and frequency of the operational fault: the more severe the fault and the more often it occurs, the shorter the detection period. This ensures that operational faults of the target server are discovered in time, so that they can be prevented or resolved before a failure occurs.
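The text states only the direction of the relationship (more severe and more frequent faults lead to a shorter detection period) and gives no formula. The sketch below is one possible mapping, assuming a severity score from 1 to 5 and a weekly occurrence count; the scale, the base period, and the lower bound are all illustrative assumptions.

```python
def detection_period_hours(severity, occurrences_per_week,
                           base_hours=24.0, min_hours=1.0):
    """Shrink the detection period as severity and frequency grow (assumed formula)."""
    if not 1 <= severity <= 5:
        raise ValueError("severity is expected on a 1 to 5 scale")
    if occurrences_per_week < 0:
        raise ValueError("occurrences_per_week cannot be negative")
    period = base_hours / (severity * (1 + occurrences_per_week))
    return max(min_hours, period)  # never check more often than min_hours


mild = detection_period_hours(severity=1, occurrences_per_week=0)
moderate = detection_period_hours(severity=3, occurrences_per_week=1)
critical = detection_period_hours(severity=5, occurrences_per_week=10)
```

Any monotonically decreasing function would satisfy the description; the floor at `min_hours` simply keeps frequent, severe faults from driving the check interval toward zero.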
- The present application also provides a server fault detection system.
- The system includes a data collection unit, a data processing unit, and a fault detection unit, wherein:
- the data collection unit is configured to collect sample monitoring data of multiple servers, the sample monitoring data characterizing the operating state of the servers;
- the data processing unit includes a big data platform and a model training module, wherein the big data platform is configured to receive the sample monitoring data sent by the data collection unit, and the model training module is configured to train, based on the sample monitoring data, a fault detection model for the plurality of servers;
- the fault detection unit is configured to collect current monitoring data of the target server and input the current monitoring data into the fault detection model to obtain the operational fault corresponding to the current monitoring data.
- The sample monitoring data includes multiple sets of feature data; correspondingly, the data processing unit further includes:
- a feature grouping module configured to group the sample monitoring data according to the feature data, so that the model training module trains a separate sub-model for each set of feature data.
- The model training module includes:
- an initial prediction module configured to input the feature data into an initial detection sub-model to obtain a predicted operational fault for the feature data;
- an error correction module configured to determine an error between the predicted operational fault and the standard operational fault, and to correct the parameters in the initial detection sub-model using the error, so that after the feature data is input again into the corrected detection sub-model, the resulting predicted operational fault is consistent with the standard operational fault.
- The feature data includes multiple feature sub-data items; correspondingly, the initial prediction module includes:
- a decision order determining module configured to determine a decision order for the feature sub-data in the feature data, and to determine the feature value corresponding to each feature sub-data item according to the decision order;
- a probability array calculation module configured to calculate, from the feature values, a predicted probability array corresponding to the feature data, where the predicted probability array includes at least one probability value and each probability value corresponds to a fault type;
- a fault determining module configured to use the fault type corresponding to the largest probability value in the predicted probability array as the predicted operational fault.
- The system further comprises:
- a load distribution statistics unit configured to collect a load distribution of the target server, where the load distribution includes the average load of the target server within each specified time period;
- a periodic detection module configured to determine a target time period based on the load distribution and to perform fault detection on the target server within the target time period.
- Computer terminal 10 may include one or more processors 102 (only one is shown; processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication.
- FIG. 4 is merely illustrative and does not limit the structure of the above electronic device.
- Computer terminal 10 may also include more or fewer components than shown in FIG. 4, or have a different configuration from that shown in FIG. 4.
- The memory 104 can be used to store software programs and modules of application software, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104.
- Memory 104 may include high-speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
- Memory 104 may further include memory located remotely from processor 102, which may be connected to computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
- The above-described method for detecting server faults may be stored as a computer program in the memory 104, which may be coupled to the processor 102; when the processor 102 executes the computer program in the memory 104, each of the above steps of the server fault detection method can be implemented.
- Transmission device 106 is used to receive or transmit data via a network.
- Specific examples of the network described above may include a wireless network provided by the communication provider of the computer terminal 10.
- In one example, the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station in order to communicate with the Internet.
- In another example, the transmission device 106 can be a Radio Frequency (RF) module for communicating with the Internet wirelessly.
- The BMC (Baseboard Management Controller) 108 functions as follows: when the acquisition layer collects sample monitoring data, the data recorded on the BMC can be read through the Intelligent Platform Management Interface (IPMI) and then uploaded to the big data platform.
Abstract
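The present invention discloses a method and system for detecting server faults. The method includes: collecting sample monitoring data from multiple servers, the sample monitoring data being used to characterize the operating state of the servers; training, based on the sample monitoring data, a fault detection model for the multiple servers; and collecting current monitoring data of a target server and inputting the current monitoring data into the fault detection model to obtain the operating fault corresponding to the current monitoring data. The technical solution provided by the present application can improve the efficiency of fault detection.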
Description
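The present invention relates to the field of Internet technology, and in particular to a method and system for detecting server faults.
With the continuous development of Internet technology, the number of servers in networks keeps growing. A server's performance directly affects the quality of the services it provides, so when a server fails, the cause must be found promptly so that it can be repaired in time.
At present, servers usually have a fault alarm mechanism: when a server behaves abnormally, it issues an alarm. The server's administrators can then inspect the server to find the component that is malfunctioning.
However, as the number of servers keeps increasing, detecting server faults by manual troubleshooting alone wastes a great deal of manpower and material resources, and the efficiency of fault detection is low.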
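Summary of the Invention
The purpose of the present application is to provide a method and system for detecting server faults that can improve the efficiency of fault detection.
To achieve the above purpose, one aspect of the present application provides a method for detecting server faults. The method includes: collecting sample monitoring data from multiple servers, the sample monitoring data being used to characterize the operating state of the servers; training, based on the sample monitoring data, a fault detection model for the multiple servers; and collecting current monitoring data of a target server and inputting the current monitoring data into the fault detection model to obtain the operating fault corresponding to the current monitoring data.
To achieve the above purpose, another aspect of the present application provides a system for detecting server faults. The system includes a data collection unit, a data processing unit, and a fault detection unit, wherein: the data collection unit is configured to collect sample monitoring data from multiple servers, the sample monitoring data being used to characterize the operating state of the servers; the data processing unit includes a big data platform and a model training module, wherein the big data platform is configured to receive the sample monitoring data sent by the data collection unit, and the model training module is configured to train, based on the sample monitoring data, a fault detection model for the multiple servers; and the fault detection unit is configured to collect current monitoring data of a target server and input the current monitoring data into the fault detection model to obtain the operating fault corresponding to the current monitoring data.
As can be seen from the above, the technical solution provided by the present application uses machine learning to train a fault detection model for servers based on sample monitoring data from multiple servers. Specifically, the sample monitoring data may include the servers' power data, temperature data, fan data, port data, network link data, system event data, and system service data. Later, when determining the specific fault that has occurred on a target server or predicting faults on the target server, the current monitoring data of the target server can be collected and input into the trained fault detection model. The result output by the fault detection model then characterizes the operating fault corresponding to the current monitoring data. In practice, a corresponding sub-model can be trained for each type of monitoring data. For given input monitoring data, the matching sub-model can then be selected for fault identification, which improves the accuracy of fault identification. The technical solution provided by the present application therefore saves a great deal of manpower and material resources and improves the efficiency of fault detection.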
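To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of the method for detecting server faults in an embodiment of the present invention;
FIG. 2 is a schematic diagram of an example of the system for detecting server faults in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the system for detecting server faults in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer terminal in an embodiment of the present invention.
To make the purposes, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.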
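Embodiment 1
The present application provides a method for detecting server faults. Referring to FIG. 1, the method may include the following steps.
S1: Collect sample monitoring data from multiple servers, the sample monitoring data being used to characterize the operating state of the servers.
In this embodiment, predefined collection probes can collect, from multiple online servers, monitoring data that characterizes the servers' operating state. The monitoring data may include these servers' CDM monitoring data, power data, temperature data, fan data, port data, network link data, system event data, and system service data. Here, the CDM monitoring data comprises CPU (central processing unit) monitoring data, DISK (hard disk) monitoring data, and MEMORY monitoring data. These data reflect whether a server is operating normally. After the data are analyzed, the operating faults currently present on the server can be determined.
In this embodiment, the predefined collection probe may be a preset collection device that reads monitoring data from a server over a data transmission protocol agreed with the server. The monitoring data read in this way serves as sample monitoring data for machine learning; by learning from a large volume of such samples, the characteristics of various types of faults can be identified.
Referring to FIG. 2, in this embodiment the sample monitoring data is collected in the collection layer. The collection layer reads the data recorded on the Baseboard Management Controller (BMC) through the Intelligent Platform Management Interface (IPMI), formats the collected data, and uploads it to the big data platform.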
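The formatting step in the collection layer can be illustrated with a minimal sketch. The patent does not specify the raw data layout or the upload format, so the pipe-delimited input lines, the field names, and the JSON payload below are all assumptions made for illustration only.

```python
import json

def format_reading(server_id, line):
    """Turn one 'name | value | unit | status' line into a structured record."""
    name, value, unit, status = [part.strip() for part in line.split("|")[:4]]
    try:
        numeric = float(value)
    except ValueError:
        numeric = None  # sensors without a numeric reading (assumed to print "na")
    return {"server": server_id, "sensor": name, "value": numeric,
            "unit": unit, "status": status}

# Hypothetical raw readings, standing in for data read from a BMC.
raw_lines = [
    "CPU Temp | 45.000 | degrees C | ok",
    "FAN1 | 5400.000 | RPM | ok",
    "PSU1 Status | na | discrete | na",
]
records = [format_reading("server-01", line) for line in raw_lines]
payload = json.dumps(records)  # the string a collector might upload
```

Keeping every record in one uniform shape is what lets a later stage group the data by type without knowing which sensor produced it.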
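S3: Train, based on the sample monitoring data, a fault detection model for the multiple servers.
In this embodiment, after receiving the sample monitoring data uploaded by the collection layer, the big data platform can train a fault detection model from that data using machine learning. In practice, the collected sample monitoring data usually contains the several different types of monitoring data described in step S1. Each type of monitoring data can be treated as one group of feature data, so the sample monitoring data may include multiple groups of feature data. For example, the sample monitoring data can be divided into power-group feature data, fan-group feature data, memory-group feature data, and so on.
In one embodiment, to locate server faults precisely, the sample monitoring data can be grouped by feature data, and a sub-model can be trained for each group of feature data. For example, a power fault detection sub-model can be trained from the power-group feature data, and a memory fault detection sub-model from the memory-group feature data. Note that, to make the trained sub-models reasonably accurate, each group of feature data may include multiple items of feature data, which may be the operating data of one server at different times or the operating data of multiple servers. For example, the memory-group feature data may include 1,000 memory data items collected from 100 servers.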
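The grouping step can be sketched as follows. This is only an illustration: the record layout and the `type` key are hypothetical, since the patent does not describe how a sample is tagged with its feature group.

```python
from collections import defaultdict

def group_by_feature(samples):
    """Split sample monitoring data into one list per feature group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["type"]].append(sample)
    return dict(groups)

# Hypothetical samples from two servers; the values are made up.
samples = [
    {"server": "s1", "type": "power", "values": [12.1, 0.8]},
    {"server": "s1", "type": "memory", "values": [0.72, 3]},
    {"server": "s2", "type": "memory", "values": [0.91, 7]},
    {"server": "s2", "type": "fan", "values": [5400]},
]
groups = group_by_feature(samples)
```

Each entry of `groups` would then be handed to its own training run, so that one sub-model is produced per feature group.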
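In this embodiment, when a sub-model is trained for a group of feature data, each item of feature data can be associated in advance with a standard operating fault. The standard operating fault may be obtained by analyzing the feature data, so the associated standard operating fault is the operating fault that the feature data actually reflects. At the start of training, the feature data is input into an initial detection sub-model to obtain a predicted operating fault for the feature data. The initial detection sub-model may include an initialized neural network whose neurons have initial parameter values. Because these initial parameter values are defaults, the predicted operating fault obtained by processing the input feature data with them may not match the standard operating fault that the feature data actually reflects. In that case, the error between the predicted operating fault and the standard operating fault can be determined. Specifically, the result predicted by the initial detection sub-model may be a predicted probability array containing multiple probability values, each corresponding to one fault type. For example, for memory data, the predicted probability array may contain three probability values corresponding to three memory-related fault types. The higher a probability value, the more likely the corresponding fault type is present. For example, if the predicted probability array is (0.1, 0.6, 0.3), the fault type corresponding to 0.6 is the predicted operating fault. The standard probability array corresponding to the standard operating fault associated with the feature data may be, for example, (1, 0, 0), where the fault type corresponding to the probability value 1 is the standard operating fault. Subtracting the corresponding probability values of the predicted probability array and the standard probability array gives the error between the predicted operating fault and the standard operating fault. This error is fed back into the initial detection sub-model to correct its parameters. After the correction, the feature data can be input again into the corrected detection sub-model, and the process of correcting the sub-model's parameters with the error can be repeated until the predicted operating fault matches the standard operating fault. By repeatedly training the sub-model in this way on the large number of feature data items in each group, the final trained sub-model can achieve high prediction accuracy.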
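The feedback loop described above can be sketched with a very small model. The patent speaks of an initialized neural network without giving its structure, so the single-layer softmax classifier below is a stand-in chosen for brevity, not the patent's network. The learning rate, epoch count, and training samples are hypothetical. The error is computed exactly as the text describes, by subtracting the standard probability array from the predicted one element by element.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, biases, features):
    """Weighted sum per fault type, followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

def train(samples, n_faults, n_features, rate=0.5, epochs=500):
    weights = [[0.0] * n_features for _ in range(n_faults)]  # default initial parameters
    biases = [0.0] * n_faults
    for _ in range(epochs):
        for features, standard in samples:
            predicted = predict(weights, biases, features)
            # Error: predicted probability minus standard probability, per fault type.
            errors = [p - s for p, s in zip(predicted, standard)]
            # Feed the error back to correct the parameters.
            for k, err in enumerate(errors):
                for j, x in enumerate(features):
                    weights[k][j] -= rate * err * x
                biases[k] -= rate * err
    return weights, biases

# Hypothetical feature data, each paired with its standard probability array.
samples = [
    ([0.9, 0.1], [1, 0, 0]),
    ([0.1, 0.9], [0, 1, 0]),
    ([0.1, 0.1], [0, 0, 1]),
]
weights, biases = train(samples, n_faults=3, n_features=2)
```

For a softmax output trained with cross-entropy loss, this element-wise difference happens to be the gradient with respect to the scores, which is why feeding it back directly corrects the parameters in the right direction.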
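In one embodiment, an item of feature data characterizes the operating state of one component of a server; for example, CPU data characterizes the operating state of the CPU. An item of feature data may in turn include multiple items of feature sub-data, each characterizing one aspect of the component's state at run time. For example, CPU data may contain feature sub-data such as CPU usage, CPU usage duration, and the number of threads using the CPU. When training on the feature data, a decision tree technique can be used to determine the decision order of the feature sub-data within the feature data, and the feature value corresponding to each item of feature sub-data is then determined in that order. The feature value is the specific value used in a decision step. For example, for CPU data, the decision order determined by the decision tree technique may be to decide on CPU usage first, then on the number of threads using the CPU, and finally on CPU usage duration. The value obtained in each decision step serves as the feature value. For example, in the CPU usage decision step, the feature value may be 80%.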
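The patent does not say how the decision tree technique arrives at a decision order. One common criterion is information gain, as in the ID3 family of algorithms, and the sketch below ranks feature sub-data that way. The criterion itself, the thresholds, the rows, and the labels are all assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, name, threshold):
    """Entropy reduction from splitting on rows[name] <= threshold."""
    low = [lab for row, lab in zip(rows, labels) if row[name] <= threshold]
    high = [lab for row, lab in zip(rows, labels) if row[name] > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

# Hypothetical CPU feature sub-data with associated standard faults.
rows = [
    {"usage": 0.95, "threads": 64, "hours": 10},
    {"usage": 0.90, "threads": 8, "hours": 200},
    {"usage": 0.30, "threads": 60, "hours": 15},
    {"usage": 0.20, "threads": 4, "hours": 220},
]
labels = ["overload", "overload", "normal", "normal"]
thresholds = {"usage": 0.8, "threads": 32, "hours": 100}  # candidate feature values

gains = {name: information_gain(rows, labels, name, t)
         for name, t in thresholds.items()}
# Sub-data with the highest gain is decided first.
decision_order = sorted(gains, key=gains.get, reverse=True)
```

In this made-up data set, CPU usage alone separates the two labels, so it is ranked first; the other two sub-features carry no information at the chosen thresholds.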
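In this embodiment, the predicted probability array corresponding to the feature data can be calculated from the feature values obtained in the decision steps. Specifically, the decision process may be carried out by a neural network, whose neurons take the feature value of each decision step and compute the final predicted probability array by weighted summation or another, nonlinear, calculation. The predicted probability array may include at least one probability value, and each probability value corresponds to a fault type. For example, for memory data, the predicted probability array may contain three probability values corresponding to three memory-related fault types. Finally, the fault type corresponding to the largest probability value in the predicted probability array is taken as the predicted operating fault. For example, if the predicted probability array is (0.1, 0.6, 0.3), the fault type corresponding to 0.6 is the predicted operating fault.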
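Selecting the predicted operating fault from a predicted probability array amounts to taking the position of the largest value. The sketch below uses the (0.1, 0.6, 0.3) example from the text; the three fault type names are hypothetical, since the patent does not name the memory-related fault types.

```python
def predicted_fault(probabilities, fault_types):
    """Return the fault type whose probability value is largest."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return fault_types[best]

# Hypothetical names for three memory-related fault types.
fault_types = ["ECC error", "capacity exhausted", "module missing"]
result = predicted_fault((0.1, 0.6, 0.3), fault_types)
```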
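As shown in FIG. 2, in this embodiment the fault detection model is trained in the data layer. The data layer may include the big data platform described above, as well as a feature grouping module and a model training module. The feature grouping module is configured to group the sample monitoring data in the big data platform by feature data. The grouped feature data is then used in the model training module to train the respective sub-models.
S5: Collect current monitoring data of a target server and input the current monitoring data into the fault detection model to obtain the operating fault corresponding to the current monitoring data.
In this embodiment, once the fault detection model has been trained, the current monitoring data of a target server can be collected and checked for faults with the trained model. The target server is the server to be checked; in this embodiment its current monitoring data can likewise be collected with the predefined collection probes. The current monitoring data may also contain multiple groups of feature data. After the current monitoring data of the target server is collected, the target feature data contained in it can therefore be identified and input into the matching sub-model to obtain the operating fault corresponding to the target feature data. In this way an operating fault is obtained for each group of feature data, and the results can be combined into a summary of all operating faults of the target server.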
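Routing each group of target feature data to its matching sub-model can be sketched as a lookup table keyed by group name. The sub-models here are simple threshold rules standing in for trained models, and the group names and field names are hypothetical.

```python
def detect_faults(current_data, submodels):
    """Run every feature group through its matching sub-model and collect results."""
    faults = {}
    for group, features in current_data.items():
        model = submodels.get(group)
        if model is None:
            continue  # no sub-model was trained for this group
        faults[group] = model(features)
    return faults

# Stand-in sub-models: threshold rules in place of trained models.
submodels = {
    "memory": lambda f: "memory fault" if f["usage"] > 0.9 else "normal",
    "fan": lambda f: "fan fault" if f["rpm"] < 1000 else "normal",
}
current = {"memory": {"usage": 0.95}, "fan": {"rpm": 5200}, "port": {"errors": 0}}
report = detect_faults(current, submodels)
```

The returned dictionary plays the role of the summary of operating faults: one entry per feature group for which a sub-model exists.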
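As shown in FIG. 2, in this embodiment the fault detection process described above is carried out in the application layer. Besides locating faults on servers that have already failed, the application layer can also check servers periodically, so that faults that may occur can be predicted and repaired in time.
In one embodiment, there are several options for when to collect the current monitoring data of the target server. On the one hand, the current monitoring data can be collected when the target server itself issues a fault notification. The reason is that such a notification is usually broad: it may only indicate that the target server has failed, without specifying the type of fault. To locate the fault quickly, the current monitoring data can be collected at that point and passed through the trained fault detection model to obtain detailed fault information. On the other hand, the current monitoring data of the target server can be collected at a specified interval, and fault detection is run with the trained model on each collected batch. This allows the target server to be checked periodically, so that a tendency toward failure can be predicted and the server repaired before the fault occurs.
In one embodiment, to avoid affecting the target server's normal network service, fault detection can be carried out while the target server is idle. Specifically, the load distribution of the target server can be measured; the load distribution may include the average load of the target server in specified periods. For example, the average load of the target server can be measured for every 3 hours of a day. A target period is then determined from the load distribution, and fault detection is carried out on the target server within the target period. The average load in the target period should be low. Specifically, a specified period whose average load is less than or equal to a specified load threshold is taken as the target period. The specified load threshold may be set to 50%, for example, and can be adjusted flexibly according to actual conditions. In practice, if at least two specified periods have an average load less than or equal to the specified load threshold, one of them can be selected at random as the target period, or the period with the lowest average load can be selected. For example, after measuring the average load of the target server for every 3 hours of a day, it may be found that the periods with an average load of 50% or less are 0:00 to 3:00 and 3:00 to 6:00; either of these can be taken as the target period. In the target period the load of the target server is low, so collecting its current operating parameters and running fault detection at that time does not affect the target server much.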
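The choice of target period can be sketched as follows, using the 3-hour periods and the 50% threshold from the example in the text. The hourly load figures are made up, and the function names are hypothetical.

```python
def slot_averages(hourly_load, slot_hours=3):
    """Average load (in percent) for each consecutive block of slot_hours hours."""
    return [sum(hourly_load[i:i + slot_hours]) / slot_hours
            for i in range(0, len(hourly_load), slot_hours)]

def pick_target_slot(averages, threshold=50):
    """Index of the lowest-load period at or below the threshold, or None."""
    candidates = [i for i, avg in enumerate(averages) if avg <= threshold]
    if not candidates:
        return None
    return min(candidates, key=lambda i: averages[i])

# Hypothetical load in percent for each of the 24 hours of one day.
hourly_load = [20, 20, 20, 40, 40, 40, 60, 60, 60, 80, 80, 80,
               90, 90, 90, 80, 80, 80, 70, 70, 70, 60, 60, 60]
averages = slot_averages(hourly_load)
target = pick_target_slot(averages)  # index 0 is the 0:00 to 3:00 period
```

This sketch implements the lowest-average-load option. The random option mentioned in the text would replace the final `min` call with a random choice among `candidates`.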
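In one embodiment, after the operating fault corresponding to the current monitoring data is obtained, a diagnostic strategy matching the operating fault can be invoked and used to diagnose the target server. The diagnostic strategies may be derived from past diagnostic history and stored in association with the corresponding operating faults. Once an operating fault has been detected, the associated diagnostic strategy can be invoked for a detailed diagnosis, for example to determine the severity of the operating fault and how often it occurs. Based on the diagnosis result, a detection cycle for the target server can be determined, and the target server is then checked for faults regularly according to that cycle. The detection cycle can be set according to the severity and frequency of the operating fault: the more severe the fault and the more often it occurs, the shorter the detection cycle. This ensures that operating faults of the target server are discovered in time, so that prevention and repair can take place before a fault occurs.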
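The text only states that a more severe and more frequent fault should lead to a shorter detection cycle; it gives no formula. The function below is one hypothetical way to satisfy that monotonic relationship. The severity scale, the base period, the minimum period, and the formula itself are all assumptions.

```python
def detection_period_hours(severity, faults_per_week,
                           base_hours=24.0, min_hours=1.0):
    """Shorter detection cycle for more severe or more frequent faults.

    severity: 1 (minor) to 5 (critical), a hypothetical scale.
    faults_per_week: how often the fault was observed, from the diagnosis.
    """
    period = base_hours / (severity * (1 + faults_per_week))
    return max(min_hours, period)  # never check more often than min_hours

mild = detection_period_hours(severity=2, faults_per_week=1)
critical = detection_period_hours(severity=5, faults_per_week=3)
```

A lower bound such as `min_hours` keeps a badly behaving server from being checked so often that detection itself adds load.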
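Embodiment 2
The present application also provides a system for detecting server faults. Referring to FIG. 3, the system includes a data collection unit, a data processing unit, and a fault detection unit, wherein:
the data collection unit is configured to collect sample monitoring data from multiple servers, the sample monitoring data being used to characterize the operating state of the servers;
the data processing unit includes a big data platform and a model training module, wherein the big data platform is configured to receive the sample monitoring data sent by the data collection unit, and the model training module is configured to train, based on the sample monitoring data, a fault detection model for the multiple servers;
the fault detection unit is configured to collect current monitoring data of a target server and input the current monitoring data into the fault detection model to obtain the operating fault corresponding to the current monitoring data.
In one embodiment, the sample monitoring data includes multiple groups of feature data; correspondingly, the data processing unit further includes:
a feature grouping module, configured to group the sample monitoring data by feature data, so that the model training module trains a sub-model for each group of feature data.
In one embodiment, the feature data is associated with a standard operating fault; correspondingly, the model training module includes:
an initial prediction module, configured to input the feature data into an initial detection sub-model to obtain a predicted operating fault for the feature data;
an error correction module, configured to determine the error between the predicted operating fault and the standard operating fault and to correct the parameters of the initial detection sub-model with the error, so that after the feature data is input again into the corrected detection sub-model, the resulting predicted operating fault matches the standard operating fault.
In one embodiment, the feature data includes multiple items of feature sub-data; correspondingly, the initial prediction module includes:
a decision order determination module, configured to determine the decision order of the feature sub-data within the feature data and to determine, in that order, the feature value corresponding to each item of feature sub-data;
a probability array calculation module, configured to calculate, from the feature values, the predicted probability array corresponding to the feature data, the predicted probability array including at least one probability value, each probability value corresponding to a fault type;
a fault determination module, configured to take the fault type corresponding to the largest probability value in the predicted probability array as the predicted operating fault.
In one embodiment, the system further includes:
a load distribution statistics unit, configured to measure the load distribution of the target server, the load distribution including the average load of the target server in specified periods;
a periodic detection module, configured to determine a target period based on the load distribution and to carry out fault detection on the target server within the target period.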
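Referring to FIG. 4, in the present application the technical solutions of the above embodiments can be applied to the computer terminal 10 shown in FIG. 4. The computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication. A person of ordinary skill in the art will understand that the structure shown in FIG. 4 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 4, or have a configuration different from that shown in FIG. 4.
The memory 104 can store the software programs and modules of application software. By running the software programs and modules stored in the memory 104, the processor 102 performs various functional applications and data processing. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory can be connected to the computer terminal 10 over a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
Specifically, in the present application the above method for detecting server faults can be stored as a computer program in the memory 104, and the memory 104 can be coupled to the processor 102. When the processor 102 executes the computer program in the memory 104, the steps of the above method for detecting server faults are carried out.
The transmission module 106 is used to receive or send data over a network. A specific example of such a network is a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station and thereby communicate with the Internet. In another example, the transmission module 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
The BMC (Baseboard Management Controller) 108 works as follows: when the collection layer collects sample monitoring data, it reads the data recorded on the BMC through the Intelligent Platform Management Interface (IPMI), formats the collected data, and uploads it to the big data platform.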
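As can be seen from the above, the technical solution provided by the present application uses machine learning to train a fault detection model for servers based on sample monitoring data from multiple servers. Specifically, the sample monitoring data may include the servers' power data, temperature data, fan data, port data, network link data, system event data, and system service data. Later, when determining the specific fault that has occurred on a target server or predicting faults on the target server, the current monitoring data of the target server can be collected and input into the trained fault detection model. The result output by the fault detection model then characterizes the operating fault corresponding to the current monitoring data. In practice, a corresponding sub-model can be trained for each type of monitoring data. For given input monitoring data, the matching sub-model can then be selected for fault identification, which improves the accuracy of fault identification. The technical solution provided by the present application therefore saves a great deal of manpower and material resources and improves the efficiency of fault detection.
From the description of the above embodiments, a person skilled in the art will clearly understand that each embodiment can be implemented by software together with the necessary general-purpose hardware platform, or by hardware alone. On this understanding, the essence of the above technical solutions, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in parts of the embodiments.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.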
Claims (14)
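- A method for detecting server faults, characterized in that the method includes: collecting sample monitoring data from multiple servers, the sample monitoring data being used to characterize the operating state of the servers; training, based on the sample monitoring data, a fault detection model for the multiple servers; and collecting current monitoring data of a target server and inputting the current monitoring data into the fault detection model to obtain the operating fault corresponding to the current monitoring data.
- The method according to claim 1, characterized in that the sample monitoring data includes multiple groups of feature data; correspondingly, training a fault detection model for the multiple servers includes: grouping the sample monitoring data by feature data, and training a sub-model for each group of feature data.
- The method according to claim 2, characterized in that, after the current monitoring data of the target server is collected, the method further includes: identifying the target feature data contained in the current monitoring data, and inputting the target feature data into the matching sub-model to obtain the operating fault corresponding to the target feature data.
- The method according to claim 2, characterized in that the feature data is associated with a standard operating fault; correspondingly, training a sub-model for each group of feature data includes: inputting the feature data into an initial detection sub-model to obtain a predicted operating fault for the feature data; and determining the error between the predicted operating fault and the standard operating fault and correcting the parameters of the initial detection sub-model with the error, so that after the feature data is input again into the corrected detection sub-model, the resulting predicted operating fault matches the standard operating fault.
- The method according to claim 4, characterized in that the feature data includes multiple items of feature sub-data; correspondingly, the predicted operating fault is determined as follows: determining the decision order of the feature sub-data within the feature data, and determining, in that order, the feature value corresponding to each item of feature sub-data; calculating, from the feature values, the predicted probability array corresponding to the feature data, the predicted probability array including at least one probability value, each probability value corresponding to a fault type; and taking the fault type corresponding to the largest probability value in the predicted probability array as the predicted operating fault.
- The method according to claim 1, characterized in that collecting current monitoring data of a target server includes: collecting the current monitoring data of the target server when the target server issues a fault notification; or collecting the current monitoring data of the target server at a specified interval.
- The method according to claim 1, characterized in that the method further includes: measuring the load distribution of the target server, the load distribution including the average load of the target server in specified periods; and determining a target period based on the load distribution and carrying out fault detection on the target server within the target period.
- The method according to claim 7, characterized in that determining a target period based on the load distribution includes: taking a specified period whose average load is less than or equal to a specified load threshold as the target period; wherein, if at least two specified periods have an average load less than or equal to the specified load threshold, one of them is selected at random as the target period, or the specified period with the lowest average load is taken as the target period.
- The method according to claim 1, characterized in that, after the operating fault corresponding to the current monitoring data is obtained, the method further includes: invoking a diagnostic strategy matching the operating fault and using the diagnostic strategy to diagnose the target server; and determining, according to the diagnosis result, a detection cycle for the target server and carrying out fault detection on the target server regularly based on the detection cycle.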
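- A system for detecting server faults, characterized in that the system includes a data collection unit, a data processing unit, and a fault detection unit, wherein: the data collection unit is configured to collect sample monitoring data from multiple servers, the sample monitoring data being used to characterize the operating state of the servers; the data processing unit includes a big data platform and a model training module, wherein the big data platform is configured to receive the sample monitoring data sent by the data collection unit, and the model training module is configured to train, based on the sample monitoring data, a fault detection model for the multiple servers; and the fault detection unit is configured to collect current monitoring data of a target server and input the current monitoring data into the fault detection model to obtain the operating fault corresponding to the current monitoring data.
- The system according to claim 10, characterized in that the sample monitoring data includes multiple groups of feature data; correspondingly, the data processing unit further includes: a feature grouping module, configured to group the sample monitoring data by feature data, so that the model training module trains a sub-model for each group of feature data.
- The system according to claim 11, characterized in that the feature data is associated with a standard operating fault; correspondingly, the model training module includes: an initial prediction module, configured to input the feature data into an initial detection sub-model to obtain a predicted operating fault for the feature data; and an error correction module, configured to determine the error between the predicted operating fault and the standard operating fault and to correct the parameters of the initial detection sub-model with the error, so that after the feature data is input again into the corrected detection sub-model, the resulting predicted operating fault matches the standard operating fault.
- The system according to claim 12, characterized in that the feature data includes multiple items of feature sub-data; correspondingly, the initial prediction module includes: a decision order determination module, configured to determine the decision order of the feature sub-data within the feature data and to determine, in that order, the feature value corresponding to each item of feature sub-data; a probability array calculation module, configured to calculate, from the feature values, the predicted probability array corresponding to the feature data, the predicted probability array including at least one probability value, each probability value corresponding to a fault type; and a fault determination module, configured to take the fault type corresponding to the largest probability value in the predicted probability array as the predicted operating fault.
- The system according to claim 10, characterized in that the system further includes: a load distribution statistics unit, configured to measure the load distribution of the target server, the load distribution including the average load of the target server in specified periods; and a periodic detection module, configured to determine a target period based on the load distribution and to carry out fault detection on the target server within the target period.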
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP18869459.0A EP3557819B1 (en) | 2018-03-09 | 2018-05-24 | Server failure detection method and system |
| US16/330,961 US20210377102A1 (en) | 2018-03-09 | 2018-05-24 | A method and system for detecting a server fault |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810193351.7 | 2018-03-09 | | |
| CN201810193351.7A CN108491305B (zh) | 2018-03-09 | 2018-03-09 | Method and system for detecting server faults |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019169743A1 (zh) | 2019-09-12 |
Family
ID=63338247
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2018/088240 Ceased WO2019169743A1 (zh) | 2018-03-09 | 2018-05-24 | Method and system for detecting server faults |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20210377102A1 (zh) |
| EP (1) | EP3557819B1 (zh) |
| CN (1) | CN108491305B (zh) |
| WO (1) | WO2019169743A1 (zh) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20100318837A1 (en) * | 2009-06-15 | 2010-12-16 | Microsoft Corporation | Failure-Model-Driven Repair and Backup |
| CN104935464A (zh) * | 2015-06-12 | 2015-09-23 | 北京奇虎科技有限公司 | Fault early-warning method and apparatus for a website system |
| US20160124787A1 (en) * | 2014-11-05 | 2016-05-05 | International Business Machines Corporation | Electronic system configuration management |
| CN107024915A (zh) * | 2016-02-02 | 2017-08-08 | 同济大学 | Power grid controller board fault detection system and detection method |
| CN107248927A (zh) * | 2017-05-02 | 2017-10-13 | 华为技术有限公司 | Method for generating a fault location model, fault location method, and apparatus |
| CN107392320A (zh) * | 2017-07-28 | 2017-11-24 | 郑州云海信息技术有限公司 | Method for predicting hard disk failures using machine learning |
| CN107479836A (zh) * | 2017-08-29 | 2017-12-15 | 郑州云海信息技术有限公司 | Disk failure monitoring method, apparatus, and storage system |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030126258A1 (en) * | 2000-02-22 | 2003-07-03 | Conkright Gary W. | Web based fault detection architecture |
| KR100900505B1 (ko) * | 2006-08-31 | 2009-06-03 | 영남대학교 산학협력단 | Fault management system and method using differentiated path protection based on Web-Based Enterprise Management for traffic engineering in an inter-autonomous-system environment |
| CN103116531A (zh) * | 2013-01-25 | 2013-05-22 | 浪潮(北京)电子信息产业有限公司 | Storage system fault prediction method and apparatus |
| WO2015091785A1 (en) * | 2013-12-19 | 2015-06-25 | Bae Systems Plc | Method and apparatus for detecting fault conditions in a network |
| CN106991502A (zh) * | 2017-04-27 | 2017-07-28 | 深圳大数点科技有限公司 | Equipment fault prediction system and method |
| CN107273273A (zh) * | 2017-06-27 | 2017-10-20 | 郑州云海信息技术有限公司 | Distributed cluster hardware fault early-warning method and system |
2018
- 2018-03-09 CN CN201810193351.7A patent/CN108491305B/zh not_active Expired - Fee Related
- 2018-05-24 WO PCT/CN2018/088240 patent/WO2019169743A1/zh not_active Ceased
- 2018-05-24 US US16/330,961 patent/US20210377102A1/en not_active Abandoned
- 2018-05-24 EP EP18869459.0A patent/EP3557819B1/en not_active Not-in-force
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3557819A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN108491305B (zh) | 2021-05-25 |
| CN108491305A (zh) | 2018-09-04 |
| EP3557819A4 (en) | 2019-12-11 |
| EP3557819A8 (en) | 2020-07-15 |
| EP3557819A1 (en) | 2019-10-23 |
| EP3557819B1 (en) | 2020-10-28 |
| US20210377102A1 (en) | 2021-12-02 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2018869459; Country of ref document: EP; Effective date: 20190430 |
| | NENP | Non-entry into the national phase | Ref country code: DE |