WO2017000536A1 - BFD detection method and apparatus - Google Patents

BFD detection method and apparatus

Info

Publication number
WO2017000536A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
application
value
interval
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2016/070063
Other languages
English (en)
French (fr)
Inventor
伍湘平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP16816903.5A (patent EP3316520B1)
Publication of WO2017000536A1
Priority to US15/837,442 (patent US10447561B2)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/40 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using virtualisation of network functions or resources, e.g. SDN or NFV entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/20 Arrangements for monitoring or testing data switching networks the monitoring system or the monitored elements being virtualised, abstracted or software-defined entities, e.g. SDN or NFV

Definitions

  • The present invention relates to the field of communications technologies, and in particular to a BFD detection method and apparatus.
  • Fault detection usually relies on a heartbeat mechanism between virtual machines (each of which can be treated as one communication node).
  • The basic principle of the heartbeat mechanism is as follows. Take node q monitoring node p as an example: node p sends a heartbeat message to node q at a fixed time interval Δi, and node q receives the heartbeat messages at a fixed time interval Δt.
  • If node q does not receive a heartbeat message from node p within a prescribed time (e.g., three Δi), node p is determined to have failed. In other words, if several consecutive packets are lost, that is, node q receives no response from node p for several intervals, node p is considered faulty.
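The fixed-timeout rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_suspected`, its parameters, and the default of three missed intervals (taken from the "three Δi" example) are hypothetical and are not part of the patent text.

```python
def is_suspected(last_heartbeat: float, now: float, delta_i: float,
                 missed_intervals: int = 3) -> bool:
    """Fixed-timeout heartbeat check (illustrative sketch).

    last_heartbeat: arrival time of the most recent heartbeat from node p
    now: current time at node q
    delta_i: sending interval of node p (same unit as the timestamps)
    missed_intervals: how many sending intervals node q tolerates
    """
    # Node p is suspected once the silence exceeds the prescribed time.
    return (now - last_heartbeat) > missed_intervals * delta_i


# Example: heartbeats are sent every 1.0 s, the last one arrived at t = 0.
print(is_suspected(0.0, 2.5, 1.0))  # silence of 2.5 s is within 3 x 1.0 s
print(is_suspected(0.0, 3.5, 1.0))  # silence of 3.5 s exceeds 3 x 1.0 s
```

The rule is binary and uses one timeout for every monitored node, which is the limitation the rest of the description addresses.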
  • BFD: Bidirectional Forwarding Detection.
  • BFD detection centers on determining the BFD detection time.
  • The BFD detection time is mainly determined by three parameters: DMTI (Desired Min Tx Interval), the shortest BFD detection packet sending period; RMRI (Required Min Rx Interval), the shortest BFD detection packet receiving period that the local node can support; and Detect Mult (detect time multiplier), the detection time multiple.
  • BFD includes an asynchronous mode and a demand (query) mode.
  • The two modes detect in different ways, so their detection times differ; in general this is implemented by using different Detect Mult values.
  • Asynchronous mode detection time = received remote Detect Mult × max(local RMRI, received DMTI);
  • Demand mode detection time = local Detect Mult × max(local RMRI, received DMTI);
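As a worked illustration, the two detection-time expressions can be written as small Python functions. The formulas are taken exactly as stated in the text above; the function names, parameter names, and the millisecond unit are assumptions made for this sketch.

```python
def async_detection_time(remote_detect_mult: int, local_rmri_ms: int,
                         received_dmti_ms: int) -> int:
    """Asynchronous mode: remote Detect Mult x max(local RMRI, received DMTI)."""
    return remote_detect_mult * max(local_rmri_ms, received_dmti_ms)


def demand_detection_time(local_detect_mult: int, local_rmri_ms: int,
                          received_dmti_ms: int) -> int:
    """Demand mode, as given in the text: local Detect Mult x max(local RMRI, received DMTI)."""
    return local_detect_mult * max(local_rmri_ms, received_dmti_ms)


# Hypothetical configuration: multiplier 3, local RMRI 100 ms, received DMTI 50 ms.
print(async_detection_time(3, 100, 50))   # 3 x max(100, 50) = 300 ms
print(demand_detection_time(5, 100, 50))  # 5 x max(100, 50) = 500 ms
```

Once the three parameters are configured, the result is a single fixed detection time for the whole peer node, regardless of which application is running on it.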
  • DMTI, RMRI, and Detect Mult can be configured independently on each node. Once they are configured, however, a node expects to receive BFD detection packets from the other node within a fixed detection time. If no detection packet from the peer is received within that time, the peer's application/service is determined to be faulty.
  • For example, an application may already be faulty after an interruption of 250 ms, yet the fault is not reported because 250 ms is less than the 300 ms BFD detection interval.
  • Conversely, a signaling application that tolerates an interruption of 500 ms may not yet be faulty after an interruption of 400 ms, yet it is mistaken for faulty because 400 ms exceeds the 300 ms BFD detection interval.
  • Embodiments of the present invention provide a BFD detection method, apparatus, and system.
  • A BFD (Bidirectional Forwarding Detection) detection method is provided. The method is applied to a process in which a first virtual machine receives BFD detection packets, and the BFD detection packets come from a second virtual machine.
  • The method includes: obtaining a prediction duration and a predetermined number of sample time intervals, where the prediction duration is the time interval between the current time and the most recent time at which the first virtual machine received a BFD detection packet, and a sample time interval is the interval between the arrival times of two adjacent BFD detection packets; obtaining a feature value according to the prediction duration and the predetermined number of sample time intervals, where the feature value indicates the possibility that an application in the second virtual machine has failed; and, for an application running in the second virtual machine, comparing the feature value with a preset fault determination criterion of that application and determining, according to the comparison result, whether that application has failed.
  • Obtaining the feature value according to the prediction duration and the predetermined number of sample time intervals includes: obtaining the mean and variance of the sample time intervals from the predetermined number of sample time intervals; obtaining a distribution function from that mean and variance; substituting the prediction duration into the distribution function to calculate a function value; and obtaining the feature value from the function value.
  • Obtaining the feature value from the function value includes: taking the negative logarithm of the function value to obtain the feature value.
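The four steps above (mean and variance, distribution function, function value, negative logarithm) can be sketched as follows. This is a minimal sketch under several assumptions that the text does not fix: the distribution is taken to be Gaussian (as the detailed description later states), the function value is taken to be the upper-tail probability of that Gaussian, the logarithm is taken to base 10, and the population variance is used. The function name `feature_value` is hypothetical.

```python
import math
from statistics import NormalDist, fmean, pstdev


def feature_value(prediction_duration: float, sample_intervals: list) -> float:
    """Negative log of the probability that the next packet still arrives."""
    # Step 1: mean and (population) standard deviation of the sample intervals.
    mu = fmean(sample_intervals)
    sigma = pstdev(sample_intervals, mu)  # must be > 0 for NormalDist.cdf
    # Step 2: distribution function built from mean and variance (assumed Gaussian).
    dist = NormalDist(mu, sigma)
    # Step 3: function value = probability that the interval exceeds the
    # prediction duration (assumed upper tail).
    tail = 1.0 - dist.cdf(prediction_duration)
    # Guard against log(0) when the prediction duration is far in the tail.
    tail = max(tail, 1e-300)
    # Step 4: negative logarithm (base 10 is an assumption).
    return -math.log10(tail)


intervals = [0.9, 1.1] * 11  # 22 sample intervals around 1.0 s (M > 20)
print(feature_value(1.0, intervals))  # short wait: small feature value
print(feature_value(1.3, intervals))  # long wait: larger feature value
```

With this choice the feature value grows as the wait since the last packet grows, so a larger value indicates a higher likelihood of failure.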
  • The preset fault determination criterion of the application includes a non-fault value interval and a fault value interval. Comparing the feature value with the preset fault determination criterion of the application and determining, according to the comparison result, whether the application has failed includes: determining which of the non-fault value interval and the fault value interval the feature value falls in; if the feature value falls within the fault value interval, determining that the application has failed; if the feature value falls within the non-fault value interval, determining that the application has not failed.
  • After it is determined that the application has failed, the method further includes: performing fault processing on the application.
  • Alternatively, the preset fault determination criterion of the application includes a non-fault value interval and at least two fault value intervals, each fault value interval corresponding to a different fault level. Comparing the feature value with the preset fault determination criterion of the application and determining, according to the comparison result, whether the application has failed includes: determining which of the non-fault value interval and the at least two fault value intervals the feature value falls in; if the feature value falls within one of the fault value intervals, determining that the fault level of the application is the fault level corresponding to that fault value interval; if the feature value falls within the non-fault value interval, determining that the application has not failed.
  • Each fault value interval corresponds to a different fault processing manner. After it is determined that the fault level of the application is the fault level corresponding to a fault value interval, the method further includes: performing, on the application, the fault processing corresponding to that fault value interval.
  • The predetermined number is M, and the M sample time intervals are obtained from M+1 consecutive BFD detection packets, where M is an integer greater than 20.
  • A BFD (Bidirectional Forwarding Detection) detection apparatus is provided.
  • The apparatus is applied to a process in which a first virtual machine receives BFD detection packets, and the BFD detection packets come from a second virtual machine.
  • The apparatus includes: an acquisition module, configured to acquire a prediction duration and a predetermined number of sample time intervals, where the prediction duration is the time interval between the current time and the most recent time at which the first virtual machine received a BFD detection packet, and a sample time interval is the interval between the arrival times of two adjacent BFD detection packets; a calculation module, configured to obtain a feature value according to the prediction duration and the predetermined number of sample time intervals acquired by the acquisition module; a comparison module, configured to compare, for an application running in the second virtual machine, the feature value obtained by the calculation module with a preset fault determination criterion of that application; and a determining module, configured to determine, according to the comparison result of the comparison module, whether that application has failed.
  • The calculation module includes: a first calculation unit, configured to obtain the mean and variance of the sample time intervals from the predetermined number of sample time intervals acquired by the acquisition module; a second calculation unit, configured to obtain a distribution function from the mean and variance obtained by the first calculation unit; a third calculation unit, configured to substitute the prediction duration acquired by the acquisition module into the distribution function obtained by the second calculation unit to calculate a function value; and a fourth calculation unit, configured to obtain the feature value from the function value obtained by the third calculation unit.
  • The fourth calculation unit is specifically configured to take the negative logarithm of the function value obtained by the third calculation unit to obtain the feature value.
  • The preset fault determination criterion of the application includes a non-fault value interval and a fault value interval. The comparison module is specifically configured to determine which of the non-fault value interval and the fault value interval the feature value obtained by the calculation module falls in. If the comparison result is that the feature value falls within the fault value interval, the determining module determines that the application has failed; if the comparison result is that the feature value falls within the non-fault value interval, the determining module determines that the application has not failed.
  • The apparatus further includes a fault processing module, configured to perform fault processing on the application after the determining module determines that the application has failed.
  • Alternatively, the preset fault determination criterion of the application includes a non-fault value interval and at least two fault value intervals, each fault value interval corresponding to a different fault level. The comparison module is specifically configured to determine which of the non-fault value interval and the at least two fault value intervals the feature value obtained by the calculation module falls in. If the comparison result is that the feature value falls within one of the fault value intervals, the determining module determines that the fault level of the application is the fault level corresponding to that fault value interval; if the comparison result is that the feature value falls within the non-fault value interval, the determining module determines that the application has not failed.
  • Each fault value interval corresponds to a different fault processing manner. The apparatus further includes a fault processing module, configured to perform, on the application, the fault processing corresponding to the fault value interval after the determining module determines that the fault level of the application is the fault level corresponding to that fault value interval.
  • The predetermined number is M, and the M sample time intervals are obtained from M+1 consecutive BFD detection packets, where M is an integer greater than 20.
  • A BFD (Bidirectional Forwarding Detection) detection system is provided, which includes the first virtual machine, the second virtual machine, and any one of the BFD detection apparatuses described above.
  • In the embodiments of the present invention, a BFD detection mechanism is introduced into a virtualized environment.
  • Rapid fault detection meets the reliability requirements of high-traffic telecom services.
  • The fault determination criterion of BFD detection can adapt to the requirements of different applications, so the fault determination for each application is more accurate. Compared with the uniform determination method of traditional BFD detection, misjudgment of faults is greatly reduced.
  • FIG. 1 is a schematic diagram of a BFD detection architecture in the prior art.
  • FIG. 2 is a schematic diagram of a BFD detection architecture according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a BFD detection method according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the time intervals of consecutive detection packets according to an embodiment of the present invention.
  • FIG. 5 is a diagram of the distribution function constructed in a BFD detection method according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a method for determining, in BFD detection, whether an application has a fault according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a BFD detection apparatus according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of the calculation module in a BFD detection apparatus according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a BFD detection device according to an embodiment of the present invention.
  • Embodiments of the present invention provide a BFD detection method, apparatus, and system.
  • FIG. 1 is a schematic diagram of a BFD detection architecture in the prior art.
  • The traditional BFD detection module outputs only a definite detection result, that is, a Boolean result of whether a fault has occurred. It then informs the fault management module of the Boolean result, and the fault management module performs the subsequent fault processing.
  • The traditional BFD detection mechanism does not distinguish between applications.
  • A single detection time is uniformly set as the detection criterion. BFD detection therefore judges faults for every application running on the peer node with the same detection time, which may lead to misjudgment for some applications and cannot suit different application types.
  • In the present invention, the fault detection and fault determination that are combined in the traditional BFD detection module are functionally separated, and different reliability detection criteria are adopted for different types of applications. That is, BFD no longer provides a Boolean fault result. It is only responsible for detecting packets and reporting the detection result to the upper-layer application or a dedicated fault management module, which then determines faults based on the detection result and the respective fault criteria of the different applications (hereinafter only the fault management module is used as an example; replacing the fault management module with an upper-layer application or the like based on the same principle shall fall within the protection scope of the present invention).
  • FIG. 2 is a schematic diagram of a BFD detection architecture according to an embodiment of the present invention.
  • The BFD detection module detects the detection packets and reads or directly records the arrival time of each packet.
  • The arrival time of a packet is the time at which a detection packet from the peer node reaches the local end.
  • The architecture also includes a fault management module. Each application determines a threshold (Φ) as its fault criterion according to its own fault determination requirements and stores it in the fault management module in advance. The BFD detection module detects the packets, processes their arrival time intervals into a feature value, and reports the feature value to the fault management module.
  • The fault management module compares the feature value with the threshold pre-stored by each application, determines from the comparison result whether each application running on the peer node is faulty, and then takes fault management measures for failed applications according to the determination result.
  • FIG. 3 is a flowchart of a method for detecting BFD according to an embodiment of the present invention.
  • The first virtual machine and the second virtual machine generally refer to any two virtual machines in a virtualized cloud environment that are communicating with each other.
  • The two virtual machines communicate bidirectionally: the first virtual machine receives and detects the BFD detection packets of the second virtual machine, and the second virtual machine can likewise receive and detect the BFD detection packets of the first virtual machine, so the BFD detection mechanism in each virtual machine is similar.
  • "The first virtual machine receives a BFD detection packet of the second virtual machine" may be equivalently expressed as "a BFD detection packet of the second virtual machine reaches the first virtual machine".
  • The specific method is as follows.
  • Step 101: When the first virtual machine receives BFD detection packets from the second virtual machine, the BFD detection module acquires the prediction duration and the predetermined number of sample time intervals.
  • The prediction duration is the time interval between the current time and the most recent time at which the first virtual machine received a BFD detection packet.
  • A sample time interval is the interval between the arrival times of two adjacent BFD detection packets.
  • The BFD detection mechanism in the present invention is more concerned with the relationship between the interval between two detection packets and the probability of successful packet reception. The method therefore obtains the arrival time intervals of adjacent detection packets, in particular of consecutive detection packets.
  • For example, if 100 detection packets numbered 1 to 100 are sent consecutively, the statistics cover the arrival time intervals between packets 1 and 2, 2 and 3, 3 and 4, ..., 99 and 100, that is, 99 intervals in total.
  • FIG. 4 is a schematic diagram of the time intervals of consecutive detection packets according to an embodiment of the present invention. Along the time axis, packets 1, 2, ... are a series of consecutive packets that originate from the second virtual machine and are received by the first virtual machine.
  • The sample time interval data to be acquired are (t_2 - t_1), (t_3 - t_2), ..., (t_N - t_(N-1)), where t_n denotes the arrival time of the n-th packet and n can be any positive integer; t_N is generally the arrival time of the most recently received packet before the current time.
  • In a specific implementation, every sample time interval may be used, or only some of them, for example M selected sample time intervals may be used for the statistical calculation.
  • The specific value of M usually depends on the requirements of the statistical method, and is sometimes related to the number of packets that the virtual machine can store or to the timeliness of the packets. M is a positive integer; to meet the empirical needs of sample statistics, M is generally greater than 20.
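The data gathered in step 101 can be illustrated with a short Python sketch. It derives the M most recent sample time intervals from M+1 consecutive arrival times and the prediction duration from the latest arrival. The function names and the choice to take the most recent M intervals are assumptions for illustration; the text only requires that M intervals be selected.

```python
def sample_intervals(arrival_times: list, m: int) -> list:
    """Return the M most recent intervals between adjacent arrival times."""
    if len(arrival_times) < m + 1:
        raise ValueError("M intervals require M + 1 arrival times")
    recent = arrival_times[-(m + 1):]
    # (t_2 - t_1), (t_3 - t_2), ..., (t_N - t_(N-1))
    return [later - earlier for earlier, later in zip(recent, recent[1:])]


def prediction_duration(arrival_times: list, now: float) -> float:
    """Time elapsed between `now` and the most recently received packet (t_N)."""
    return now - arrival_times[-1]


arrivals = [0.0, 1.0, 2.5, 3.0]        # hypothetical arrival times in seconds
print(sample_intervals(arrivals, 2))   # intervals between the last three packets
print(prediction_duration(arrivals, 3.75))
```

In practice M would be greater than 20, as stated above; the tiny value here only keeps the example readable.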
  • Step 102: The BFD detection module obtains a feature value according to the prediction duration and the predetermined number of sample time intervals, where the feature value indicates the possibility that an application in the second virtual machine is faulty.
  • The BFD detection module can apply a mathematical statistical method to the M arrival time intervals obtained in step 101 to obtain a feature value. The feature value describes the possibility that the first virtual machine successfully receives the next BFD detection packet sent by the second virtual machine after the prediction duration has elapsed.
  • The feature value is described in detail later. This scheme thus differs from traditional BFD detection, which simply outputs a detection time.
  • Step 103: For an application running in the second virtual machine, compare the feature value with the preset fault determination criterion of that application, and determine whether the application is faulty according to the comparison result.
  • Steps 101-102 of the present invention are based on the following principle: when the first virtual machine finds that a detection packet is missing, there are two possibilities.
  • First, the detection packet is late because of the network or similar causes.
  • Second, the application in the second virtual machine has really failed. A judgment is therefore needed at this point: is the packet merely late, or has the application in the second virtual machine failed? The judgment is based on statistics of the arrival time intervals of previous detection packets, together with the time elapsed between the current time and the most recently received packet,
  • which are used to predict the probability of successfully receiving the next packet. To achieve this, statistics must first be gathered on the arrival time intervals of the detection packets that have been successfully received in the past.
  • The BFD detection module acquires or records the arrival time of each detection packet and stores it in a storage queue.
  • The arrival times are usually stored in a fixed-length storage queue into which each new value is pushed as it arrives.
  • For example, the storage queue has a fixed length of 100 arrival times.
  • When a new arrival time is stored, the earliest stored arrival time overflows, which ensures that the storage queue always holds 100 values.
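A fixed-length queue with this overflow behaviour maps naturally onto Python's `collections.deque` with `maxlen`. The sketch below is illustrative only; the class name `ArrivalQueue` and its methods are hypothetical, and the capacity of 100 follows the example in the text.

```python
from collections import deque


class ArrivalQueue:
    """Fixed-length store of packet arrival times (illustrative sketch)."""

    def __init__(self, capacity: int = 100):
        # A deque with maxlen silently drops the oldest entry on overflow.
        self._times = deque(maxlen=capacity)

    def record(self, arrival_time: float) -> None:
        """Store a new arrival time; the earliest one overflows when full."""
        self._times.append(arrival_time)

    def snapshot(self) -> list:
        """Return the stored arrival times, oldest first."""
        return list(self._times)

    def __len__(self) -> int:
        return len(self._times)


queue = ArrivalQueue(capacity=3)   # small capacity to make the overflow visible
for t in [0.0, 1.0, 2.0, 3.0, 4.0]:
    queue.record(t)
print(queue.snapshot())            # only the three most recent arrival times remain
```

Because the oldest value is discarded automatically, statistics computed over the queue always reflect the most recent packets.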
  • The arrival time interval between two adjacent detection packets varies with factors such as node processing capability and network transmission capability.
  • The fluctuation of this time interval follows a Gaussian distribution. For example, when the communication modules in the network operate normally, two adjacent detection packets arrive at the first virtual machine 1 s apart. Because the performance of each communication module fluctuates, the arrival time interval of the detection packets fluctuates around 1 s, and the packet arrival intervals as a whole show a Gaussian distribution.
  • FIG. 5 shows the distribution function constructed in a BFD detection method according to an embodiment of the present invention.
  • The abscissa x represents the prediction duration. Because the arrival of the next detection packet is uncertain at the current moment, x is a variable.
  • G(x) represents the probability that the next detection packet can be successfully received after the current time.
  • f(x) represents the probability density. It is easy to infer from the figure that the shorter the prediction duration, the greater the probability of successfully receiving the next packet, and the longer the prediction duration, the smaller that probability.
  • The distribution function f(x) is determined by the predetermined number of sample time intervals, that is, by the arrival times of the detection packets in the fixed-length queue. Take the fixed-length queue of 100 arrival times as an example again. Because f(x) captures the influence of the arrival time interval of adjacent packets on the probability of successful packet reception, the arrival times of 100 consecutive detection packets yield 99 sample time intervals Δt1, Δt2, ..., Δt99, where a sample time interval is the arrival time interval between two adjacent detection packets. The mean μ and variance σ² of these 99 sample time intervals are calculated, and the distribution function is then determined by the mean μ and the variance σ².
  • The general formula of the distribution function is as follows, where n is the number of sample time intervals, Δt(j) is the j-th sample time interval, j is an integer, μ represents the mean, and σ² represents the variance.
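For reference, the standard Gaussian forms consistent with the symbols defined above are written out below. This is a reconstruction under stated assumptions rather than the formula as printed in the patent: the 1/n normalisation of the variance and the reading of G(x) as the upper-tail probability of f are both assumptions.

```latex
\[
\mu = \frac{1}{n}\sum_{j=1}^{n}\Delta t(j),
\qquad
\sigma^{2} = \frac{1}{n}\sum_{j=1}^{n}\bigl(\Delta t(j)-\mu\bigr)^{2}
\]
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
G(x) = \int_{x}^{+\infty} f(t)\,dt
\]
```

Under this reading G(x) decreases as x grows, which agrees with the observation that a longer prediction duration means a smaller probability of still receiving the next packet.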
  • Each time a new detection packet arrives, the time intervals of the detection packets in the storage queue change. Although a single new time interval does not greatly affect the sample as a whole,
  • the mean μ, the variance σ², and the distribution function f(x) computed over the 99 sample time intervals are updated in real time. This keeps the detection and prediction for new packets close to the current state of network communication, which ensures a degree of timeliness and accuracy.
  • The scheme considers the probability that the first virtual machine has not yet received the packet at the current moment, that is, the probability that the packet can still be received after the current moment.
  • If the latest detection packet arrived successfully at t_last and no packet has been received by the current time t_now, the probability of receiving the packet after t_now is G(t_now - t_last; μ_last, σ²_last). Here μ_last and σ²_last are the mean and variance obtained, after the latest detection packet has been stored in the fixed-length queue, from statistics over the arrival time intervals of all adjacent detection packets currently in the queue,
  • and t_now - t_last is the latest prediction duration.
  • G(t_now - t_last; μ_last, σ²_last) can be used as the feature value. Correspondingly, each application sets its own fault determination probability.
  • For example, for an application A running in the second virtual machine, the fault determination probability is 0.2, and it is stored in advance in the BFD detection module. If G(t_now - t_last; μ_last, σ²_last) is less than 0.2, the probability of receiving the next detection packet after t_now is considered very small, and application A is considered faulty. If G(t_now - t_last; μ_last, σ²_last) is greater than 0.2, there is still a good chance of receiving the next packet after t_now, and application A is considered not faulty.
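The decision for application A can be illustrated with Python's `statistics.NormalDist`. This sketch assumes that G is the upper-tail probability of a Gaussian and takes the standard deviation (the square root of the variance) as input; the function names and the sample numbers are hypothetical, while the 0.2 default follows the example in the text.

```python
from statistics import NormalDist


def receive_probability(t_now: float, t_last: float,
                        mu_last: float, sigma_last: float) -> float:
    """G(t_now - t_last): probability that the next packet still arrives.

    sigma_last is the standard deviation of the queued intervals (assumed > 0).
    """
    elapsed = t_now - t_last  # latest prediction duration
    return 1.0 - NormalDist(mu_last, sigma_last).cdf(elapsed)


def is_faulty(probability: float, fault_probability: float = 0.2) -> bool:
    """Application is judged faulty when G drops below its stored probability."""
    return probability < fault_probability


# Hypothetical statistics: intervals average 1.0 s with a spread of 0.1 s.
p_on_time = receive_probability(11.0, 10.0, 1.0, 0.1)   # waited exactly the mean
p_late = receive_probability(11.2, 10.0, 1.0, 0.1)      # waited two sigmas longer
print(p_on_time, is_faulty(p_on_time))
print(p_late, is_faulty(p_late))
```

Each application can pass its own `fault_probability`, so the same measurement leads to different verdicts for applications with different tolerance.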
  • The statistics and calculations on the sample time intervals in the present invention are not limited to the mathematical methods mentioned above; similar or derived statistical models related to Gaussian probability statistics can be used for the calculation accordingly.
  • These calculations are implemented with basic mathematical operations, so any statistical processing of the sample time intervals by conventional mathematical operations that finally yields the feature value shall fall within the protection scope of the present invention.
  • The fault management module compares the feature value with the fault determination criterion preset for the application running in the second virtual machine, thereby determining whether the application is faulty.
  • The determination process is shown in FIG. 6.
  • FIG. 6 is a schematic diagram of a method for determining, in BFD detection, whether an application has a fault according to an embodiment of the present invention. Because applications vary widely and each has different functional and reliability requirements, fault determination is diverse. Two common cases follow (application B and application C mentioned below are applications running in the second virtual machine and do not refer to any specific application; their fault determination values are pre-stored in the BFD detection module).
  • Case 1: If the fault determination value of application B is a single threshold Φ1, the preset fault determination criterion of application B includes a non-fault value interval (0, Φ1) and a fault value interval [Φ1, +∞). The fault management module determines which value interval the feature value falls in. If the feature value falls in [Φ1, +∞), the fault management module determines that application B has failed. If the feature value falls in (0, Φ1), the fault management module determines that application B has not failed, or does not respond.
  • If application B has failed, the fault management module also performs the fault handling measures corresponding to application B so that application B can run normally again. If application B has not failed, or has returned to normal after fault processing, the fault management module performs no fault processing or does not respond.
  • Case 2: If the fault determination value of application C consists of at least two threshold intervals, the preset fault determination criterion of application C includes one non-fault value interval and at least two fault value intervals. Fault value intervals correspond one-to-one with fault levels, and fault levels correspond one-to-one with fault handling measures, so different threshold intervals represent different fault severities.
  • The fault management module compares the feature value ψ with the non-fault value interval and the at least two fault value intervals to find the interval in which ψ lies, and thereby determines the fault level of application C.
  • As shown in the figure, when ψ ∈ (0, Φ1], the fault management module determines that application C has no fault, or makes no response; when ψ ∈ (Φ1, Φ2], application C is at fault level 1; when ψ ∈ (Φ2, Φ3], at fault level 2; when ψ ∈ (Φ3, +∞), at fault level 3. The severity of fault levels 1, 2, and 3 increases in that order.
  • Accordingly, if the fault management module determines that application C has failed, it performs the fault handling measures corresponding to the fault level so that application C can run normally; if application C has not failed, or has returned to normal after fault handling, the fault management module performs no fault handling.
  • As a supplementary note, some applications have different fault levels and corresponding fault handling manners mainly because of the amount of software and hardware resources that fault handling must invoke.
  • For example, handling a lower-level fault of application C requires fewer and simpler software and hardware resources than handling a higher-level fault, and the overall cost is correspondingly lower.
  • Therefore, when fault level 1 occurs in application C, the fault handling manner corresponding to fault level 1 is adopted and the corresponding software and hardware resources are invoked to complete the repair; matching the handling manner to the fault level in this way makes more rational use of resources.
  • In a specific implementation, the fault determination value Φ of each application is obtained through extensive empirical trials based on the application's reliability and real-time requirements. For example, applications with high real-time and reliability requirements need faster fault detection and recovery; however, as detection gets faster the false alarm rate rises accordingly, so obtaining both a fast detection time and a reasonable false alarm rate requires choosing an appropriate threshold Φ. An application developer can, for example, use extensive test statistics relating the threshold Φ to detection time and false alarm rate to pick a suitable threshold within the range of acceptable false alarm rates. This also illustrates the advantage of this solution: while preserving the fast detection capability of BFD, the false alarm rate can be kept at an acceptable level.
  • In addition, the choice of Φ depends on the algorithm used to obtain the feature value in step 102: if the feature value is a probability, Φ should also be a pre-selected probability value; if the feature value is the logarithm of the probability, Φ is the logarithm of the pre-selected probability value; if the feature value is the negative logarithm of the probability, Φ is the negative logarithm of the pre-selected probability value.
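The rule that Φ must be transformed in the same way as the feature value can be sketched in Python as follows. This is only an illustration: the function names, the `mode` strings, and the base-10 logarithm are assumptions (the text does not state a logarithm base). The sketch also makes explicit that the direction of the comparison flips: a small probability indicates a fault, whereas a large negative logarithm indicates a fault.

```python
import math


def threshold_from_probability(p_allowed: float, mode: str = "neg_log") -> float:
    """Apply to the allowed probability the same operation used for the feature value."""
    if mode == "probability":
        return p_allowed
    if mode == "log":
        return math.log10(p_allowed)       # base 10 is an assumption
    if mode == "neg_log":
        return -math.log10(p_allowed)
    raise ValueError(f"unknown mode: {mode}")


def is_fault(feature: float, threshold: float, mode: str = "neg_log") -> bool:
    """Compare a feature value with a threshold expressed in the same form."""
    if mode in ("probability", "log"):
        # Low probability of still receiving a packet means a fault.
        return feature < threshold
    if mode == "neg_log":
        # Single-threshold case: fault interval is [threshold, +inf).
        return feature >= threshold
    raise ValueError(f"unknown mode: {mode}")
```

For instance, an allowed probability of 0.01 (a hypothetical value) corresponds to a threshold of about 2 in negative-logarithm form.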
  • The technical solution of the present invention differs from the conventional BFD detection mechanism by introducing statistical analysis of the arrival intervals of adjacent detection packets. BFD detection no longer simply outputs fault/no-fault information for every application based on a single detection time. Instead, it first performs statistical processing on the arrival intervals of adjacent packets to compute a feature value, then compares that feature value with the fault determination value preset by each application to determine each application's fault condition specifically. BFD detection can thus adapt to the requirements of different applications without a uniform judgment, which reduces the resulting fault misjudgments.
  • The first virtual machine and the second virtual machine refer generally to two virtual machines communicating with each other in a virtualized cloud environment.
  • Two virtual machines normally communicate bidirectionally: the first virtual machine can receive packets from the second virtual machine, and the second virtual machine can receive packets from the first virtual machine, so the BFD detection mechanism is similar for each virtual machine.
  • FIG. 7 is a schematic diagram of a BFD detection apparatus according to an embodiment of the present invention.
  • In this embodiment, the apparatus is applied while the first virtual machine receives BFD detection packets that come from the second virtual machine. The apparatus may be located in the first virtual machine, or outside the virtual machines as an upper-layer function that supervises the communication status of the first and second virtual machines. The apparatus 200 includes:
  • The obtaining module 201 is configured to obtain a prediction duration and a predetermined number of sample time intervals, where the prediction duration is the interval between the current time and the time at which the first virtual machine most recently received a BFD detection packet, and a sample time interval is the arrival interval between two adjacent BFD detection packets.
  • The obtaining module 201 obtains the intervals between adjacent packets, in particular between consecutive packets: (t_2 − t_1), (t_3 − t_2), ..., (t_N − t_{N−1}), where t_m denotes the arrival time of the m-th packet and m may be any positive integer.
  • The predetermined number is usually determined by how many packets can be stored in the virtual machine or by the sample requirements of the statistics. For example, the obtaining module obtains (t_M − t_{M−1}), (t_{M+1} − t_M), ..., (t_N − t_{N−1}), where M is less than N, N − M is the predetermined number and is generally greater than 20, and the N-th packet is the last packet received by the first virtual machine before the current time.
  • The calculation module 202 is configured to obtain the feature value from the prediction duration and the predetermined number of sample time intervals obtained by the obtaining module 201, where the feature value indicates the likelihood that an application running in the second virtual machine is faulty.
  • Optionally, the calculation module 202 can include a first calculation unit 2021, a second calculation unit 2022, and a third calculation unit 2023.
  • FIG. 8 is a schematic diagram of the structure of the calculation module of a BFD detection apparatus according to an embodiment of the present invention.
  • According to a predetermined calculation rule, the first calculation unit 2021 computes statistics on the N − M sample time intervals to obtain their mean μ and variance σ, and the second calculation unit 2022 then determines the probability density function f(x) from μ and σ. The general formulas are as follows:
  • Here n is the predetermined number, that is, N − M; Δt(j) is a sample time interval; j is an integer with j ∈ [1, n].
  • G(x; μ, σ) expresses the relationship between the length of the interval and the probability that the first virtual machine can successfully receive a BFD detection packet. Because the loss of detection packets reflects an application failure, G(x; μ, σ) also expresses how likely an application failure is for a given interval length. If no packet has been received at the current time t_now, and t_last is the time at which the first virtual machine most recently received a detection packet before t_now, then the probability of receiving a packet after t_now is G(t_now − t_last; μ_last, σ_last), where t_now − t_last is the latest prediction duration.
  • Optionally, the calculation module 202 can also include several calculation units that, following a preset algorithm, perform a series of conventional mathematical operations to obtain a feature value that is convenient for subsequent use.
  • The comparison module 203 is configured to compare, for one application running in the second virtual machine, the feature value obtained by the calculation module 202 with the preset fault determination criterion of that application.
  • The preset fault determination criterion of every application in the second virtual machine is pre-stored in the comparison module.
  • The determining module 204 is configured to determine, from the comparison result of the comparison module 203, whether that application is faulty.
  • In a specific implementation, the fault management process of the comparison module 203 and the determining module 204 is similar to the method corresponding to FIG. 5. Because applications vary widely and each has different functional and reliability requirements, there is diversity in how a fault is determined. Two situations are common, as in the following two examples:
  • Example 1: Suppose the fault determination value preset for a voice service is a single threshold of 0.2. The comparison module 203 compares the feature value ψ obtained by the calculation module 202 with 0.2. If ψ is greater than or equal to 0.2, the determining module 204 determines that the voice service has failed; if ψ is less than 0.2, the determining module 204 determines that the voice service has not failed, or makes no response.
  • Optionally, the apparatus 200 may further include a fault processing module 205.
  • Accordingly, if the determining module 204 determines that the voice service has failed, the fault processing module 205 performs the fault handling measure corresponding to the voice service so that the voice service can run normally; if the voice service has not failed, or has returned to normal after fault handling, the determining module 204 determines that the voice service has not failed or makes no response, and the fault processing module 205 performs no fault handling or makes no response.
  • Example 2: Suppose the fault determination value preset for a picture transmission service consists of four value intervals: (0, 3] indicates no fault; (3, 5] indicates a level-1 fault; (5, 10] indicates a level-2 fault; (10, +∞) indicates a level-3 fault.
  • Level-1, level-2, and level-3 faults correspond to level-1, level-2, and level-3 fault handling measures respectively.
  • The comparison module 203 compares the feature value ψ with these intervals to find the interval in which ψ lies, and the determining module 204 determines the fault level of the picture transmission service from that interval. As shown in FIG. 6, when ψ ∈ (0, 3], the determining module 204 determines that the picture transmission service has no fault; when ψ ∈ (3, 5], a level-1 fault; when ψ ∈ (5, 10], a level-2 fault; when ψ ∈ (10, +∞), a level-3 fault.
  • The larger the values of the interval, the more serious the fault: the severity increases from level 1 to level 2 to level 3.
  • As in Example 1, the apparatus 200 may optionally include the fault processing module 205.
  • Accordingly, if the determining module 204 determines that the picture transmission service has failed, the fault processing module 205 performs on it the fault handling measures corresponding to the determined fault level so that it can run normally. If the service has not failed, or has returned to normal after fault handling, the determining module 204 determines that it has not failed or makes no response, and the fault processing module 205 performs no fault handling or makes no response.
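The interval lookup of Example 2 can be sketched in Python with a sorted list of upper bounds. This is a minimal sketch rather than the apparatus itself: the constant and function names are hypothetical, and only the bounds 3, 5, and 10 come from the example. Because every finite interval is closed on the right, `bisect_left` returns the fault level directly (0 meaning no fault).

```python
from bisect import bisect_left

# Upper bounds of (0, 3], (3, 5], (5, 10]; anything above 10 is level 3.
PICTURE_SERVICE_BOUNDS = (3.0, 5.0, 10.0)


def fault_level(psi: float, bounds=PICTURE_SERVICE_BOUNDS) -> int:
    """Return 0 for no fault, or the fault level 1..len(bounds)."""
    if psi <= 0:
        raise ValueError("the feature value is expected to be positive")
    # bisect_left counts the bounds strictly below psi, so a value equal
    # to a bound stays in the lower, right-closed interval.
    return bisect_left(bounds, psi)
```

A separate tuple of bounds per application would give each application its own criterion while the lookup stays the same.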
  • The embodiment of the present invention provides a BFD detection apparatus that, unlike a conventional BFD detection apparatus, does not use a single-value determination to simply output fault/no-fault information.
  • The BFD detection apparatus includes the obtaining module 201, the calculation module 202, the comparison module 203, and the determining module 204, and may further include a fault processing module 205.
  • The obtaining module 201 and the calculation module 202 process the intervals between detection packets mathematically to obtain a feature value that measures whether an application is faulty.
  • The comparison module 203 compares the feature value with the preset fault determination criteria of different applications, so that the determining module 204 can determine the fault condition of each application specifically. BFD detection can thus adapt to the requirements of different types of applications instead of applying the conventional uniform determination.
  • This reduces the resulting fault misjudgments and finally allows the fault processing module 205 to apply the appropriate handling measures to each faulty application.
  • Embodiments of the present invention provide a BFD detection system composed of the first virtual machine, the second virtual machine, and the BFD detection apparatus 200 described above.
  • Optionally, the BFD detection apparatus is located in the first virtual machine, directly receives the BFD detection packets from the second virtual machine, and performs steps 101 to 103 described above on these packets to determine whether an application running in the second virtual machine is faulty.
  • As another option, the BFD detection apparatus is located outside the virtual machines as an upper-layer management function for the first and second virtual machines. While the first virtual machine receives the BFD detection packets of the second virtual machine, the apparatus obtains the relevant information about these packets and performs steps 101 to 103 described above to determine whether an application running in the second virtual machine is faulty.
  • FIG. 9 is a schematic structural diagram of a BFD detection device according to an embodiment of the present invention.
  • The device 400 includes:
  • The processor 401 is configured to generate operation control signals and send them to the corresponding components of the computing device, and to read and process data in software, in particular the data and programs in the memory 402, so that each functional module performs its function and the corresponding components act as the instructions require.
  • The memory 402 is configured to store programs and various data, mainly software units such as the operating system, applications, and function instructions, or a subset or an extended set thereof. It may also include non-volatile random access memory (NVRAM), and provides the processor 401 with the hardware, software, and data resources for managing the computing device, supporting control software and applications.
  • The transceiver 403 is configured to collect, obtain, or send information, and can be used to pass information between modules.
  • The above hardware units can communicate over a bus connection.
  • In this way, by invoking the programs or instructions stored in the memory 402, when the transceiver 403 receives the BFD detection packets of the second virtual machine, the processor 401 obtains the prediction duration and a predetermined number of sample time intervals and processes the sample time intervals with the algorithm pre-stored in the memory 402 to obtain the feature value. The processor 401 then compares the feature value with the preset fault determination criterion of each application running in the detected node, determines from the comparison results whether any application has failed, and performs the corresponding fault handling on any failed application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Monitoring And Testing Of Exchanges (AREA)

Abstract

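Embodiments of the present invention provide a BFD detection method, apparatus, and system. When a first virtual machine receives BFD detection packets from a second virtual machine, a prediction duration and a predetermined number of sample time intervals are obtained; a feature value is derived from the prediction duration and the predetermined number of sample time intervals, where the feature value indicates the likelihood that an application in the second virtual machine has failed; for one application running in the second virtual machine, the feature value is compared against that application's preset fault determination criterion, and whether the application has failed is determined from the comparison result. By comparing the feature value with the preset fault determination criteria of different applications and then performing fault determination and handling, the invention allows BFD detection to adapt to the requirements of different types of applications/services instead of applying one uniform fault determination criterion to all of them, which reduces the resulting fault misjudgments.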

Description

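BFD Detection Method and Apparatus
Technical Field
The present invention relates to the field of communications technologies, and in particular to a BFD detection method and apparatus.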
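Background
In current virtualized cloud environments, fault detection usually relies on a heartbeat mechanism between virtual machines (each of which can be regarded as a communication node). The basic principle is as follows. Taking node q monitoring node p as an example, node p sends heartbeat packets to node q at a fixed interval Δi, and node q receives heartbeat packets at a fixed interval Δt. If no heartbeat packet from node p is received within a specified time (for example, three Δi), node p is determined to have failed; for example, if several consecutive packets are lost, that is, node q receives no response from node p for several intervals, node p is considered faulty.
The detection time of such heartbeat mechanisms is usually on the order of seconds or even minutes, which cannot meet the reliability requirements of telecom services with strict real-time demands. Especially when data rates reach the gigabit level, a long fault feedback time means the loss of a large amount of data. The need to quickly detect communication faults between adjacent nodes keeps growing and becoming more important. This led to a fast detection method that establishes a path between bidirectional routing engines: Bidirectional Forwarding Detection (BFD). By working together with upper-layer routing protocols, BFD enables fast route convergence and fast link detection, providing millisecond-level detection.
The key to BFD detection is determining the BFD detection time, which mainly depends on three parameters: Desired Min Tx Interval (DMTI), the shortest BFD detection packet transmission period the local node wishes to use; Required Min Rx Interval (RMRI), the shortest BFD detection packet reception period the local node can support; and Detect Mult (detect time multiplier), the multiplier of the detection time. First, after node B receives a BFD detection packet from peer node A, it compares the RMRI of node A carried in the packet with node B's local DMTI and takes the larger of the two as the rate at which node B sends BFD detection packets.
BFD has an asynchronous mode and a demand (query) mode. The two modes detect differently, so their detection times also differ, which is generally achieved by using different Detect Mult values.
Detection time in asynchronous mode = received remote Detect Mult × max(local RMRI, received DMTI);
Detection time in demand mode = local Detect Mult × max(local RMRI, received DMTI);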
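As an illustration only, the two detection-time formulas above can be written as a small Python sketch. The function names and the millisecond values in the usage note are hypothetical and are not taken from this document.

```python
def detection_time_ms(detect_mult: int, local_rmri_ms: float,
                      received_dmti_ms: float) -> float:
    """Detection time = Detect Mult x max(local RMRI, received DMTI)."""
    return detect_mult * max(local_rmri_ms, received_dmti_ms)


def async_detection_time_ms(remote_detect_mult: int, local_rmri_ms: float,
                            received_dmti_ms: float) -> float:
    # Asynchronous mode uses the Detect Mult received from the remote node.
    return detection_time_ms(remote_detect_mult, local_rmri_ms, received_dmti_ms)


def demand_detection_time_ms(local_detect_mult: int, local_rmri_ms: float,
                             received_dmti_ms: float) -> float:
    # Demand (query) mode uses the local Detect Mult.
    return detection_time_ms(local_detect_mult, local_rmri_ms, received_dmti_ms)
```

With an assumed remote Detect Mult of 3, a local RMRI of 100 ms, and a received DMTI of 50 ms, `async_detection_time_ms(3, 100, 50)` evaluates to 300 ms.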
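DMTI, RMRI, and Detect Mult can be configured independently on each node. However, once they are configured, a node receives the BFD detection packets of the other node at a fixed interval, and if no detection packet is received from the peer within the specified time, the peer's application/service is determined to have failed.
This poses a problem in practice: with a uniform, fixed detection interval, accurate fault determination cannot be made for telecom services of different application types. Applying a single determination method to all applications biases the fault determination results of different applications. For example, different applications have different requirements on interruption time and detection speed: a voice data stream requires the interruption not to exceed 200 ms and signaling requires it not to exceed 500 ms, while data services have lower real-time requirements than voice. A uniformly configured detection interval such as 300 ms cannot suit all of these application types and may cause misjudgments. For instance, a voice application that tolerates a 200 ms interruption may actually have failed with a 250 ms interruption, yet no fault is reported because 250 ms is less than the 300 ms BFD detection interval; conversely, a signaling application that tolerates a 500 ms interruption may not have failed with a 400 ms interruption, yet it is wrongly considered faulty because 400 ms exceeds the 300 ms BFD detection interval.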
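The two misjudgments in this example can be checked with a few lines of Python. The sketch only restates the arithmetic of the paragraph above; the function names are hypothetical, and comparing an outage directly with a per-application tolerance is a simplification used here to expose the mismatch, not the statistical method of the invention.

```python
UNIFORM_DETECTION_MS = 300  # the single detection interval from the example

# (application, tolerated interruption in ms, actual interruption in ms)
CASES = [("voice", 200, 250), ("signaling", 500, 400)]


def uniform_reports_fault(outage_ms: float) -> bool:
    # Conventional BFD: a fault is reported only past the shared interval.
    return outage_ms > UNIFORM_DETECTION_MS


def app_has_failed(outage_ms: float, tolerated_ms: float) -> bool:
    # What the application itself regards as a failure.
    return outage_ms > tolerated_ms


def misjudged():
    """Applications whose uniform verdict disagrees with their own tolerance."""
    return [name for name, tolerated, outage in CASES
            if uniform_reports_fault(outage) != app_has_failed(outage, tolerated)]
```

Here `misjudged()` returns both applications: the voice failure is missed and the signaling application is falsely reported.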
Summary
In view of this, embodiments of the present invention provide a BFD detection method, apparatus, and system.
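According to a first aspect of the embodiments of the present invention, a Bidirectional Forwarding Detection (BFD) detection method is provided. The method is applied while a first virtual machine receives BFD detection packets that come from a second virtual machine, and includes: obtaining a prediction duration and a predetermined number of sample time intervals, where the prediction duration is the interval between the current time and the time at which the first virtual machine most recently received a BFD detection packet, and a sample time interval is the arrival interval between two adjacent BFD detection packets; obtaining a feature value from the prediction duration and the predetermined number of sample time intervals, where the feature value indicates the likelihood that an application in the second virtual machine has failed; and, for one application running in the second virtual machine, comparing the feature value against the preset fault determination criterion of that application and determining from the comparison result whether that application has failed.
With reference to the first aspect, in a first possible implementation of the first aspect, obtaining the feature value from the prediction duration and the predetermined number of sample time intervals includes: obtaining the mean and variance of the sample time intervals from the predetermined number of sample time intervals; obtaining a distribution function from the mean and variance; substituting the prediction duration into the distribution function to compute a function value; and obtaining the feature value from the function value.
With reference to the first aspect or its first possible implementation, in a second possible implementation of the first aspect, obtaining the feature value from the function value includes: taking the negative logarithm of the function value to obtain the feature value.
With reference to the first aspect or any of its foregoing possible implementations, in a third possible implementation of the first aspect, the preset fault determination criterion of the application includes one non-fault value interval and one fault value interval; comparing the feature value against the criterion and determining whether the application has failed includes: determining which of the non-fault value interval and the fault value interval the feature value falls in; if the feature value falls in the fault value interval, determining that the application has failed; and if the feature value falls in the non-fault value interval, determining that the application has not failed.
With reference to the first aspect or any of its foregoing possible implementations, in a fourth possible implementation of the first aspect, after determining that the application has failed, the method further includes: performing fault handling on the application.
With reference to the first aspect or any of its foregoing possible implementations, in a fifth possible implementation of the first aspect, the preset fault determination criterion of the application includes one non-fault value interval and at least two fault value intervals, each fault value interval corresponding to a different fault level; comparing the feature value against the criterion and determining whether the application has failed includes: determining which of the non-fault value interval and the at least two fault value intervals the feature value falls in; if the feature value falls in one of the fault value intervals, determining that the fault level of the application is the fault level corresponding to that fault value interval; and if the feature value falls in the non-fault value interval, determining that the application has not failed.
With reference to the first aspect or any of its foregoing possible implementations, in a sixth possible implementation of the first aspect, each fault value interval corresponds to a different fault handling manner; after determining that the fault level of the application is the fault level corresponding to a fault value interval, the method further includes: performing on the application the fault handling corresponding to that fault value interval.
With reference to the first aspect or any of its foregoing possible implementations, in a seventh possible implementation of the first aspect, the predetermined number is M, and the M sample time intervals are obtained from M+1 consecutive BFD detection packets, where M is an integer greater than 20.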
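According to a second aspect of the embodiments of the present invention, a Bidirectional Forwarding Detection (BFD) detection apparatus is provided. The apparatus is applied while a first virtual machine receives BFD detection packets that come from a second virtual machine, and includes: an obtaining module, configured to obtain a prediction duration and a predetermined number of sample time intervals, where the prediction duration is the interval between the current time and the time at which the first virtual machine most recently received a BFD detection packet, and a sample time interval is the arrival interval between two adjacent BFD detection packets; a calculation module, configured to obtain a feature value from the prediction duration and the predetermined number of sample time intervals obtained by the obtaining module, where the feature value indicates the likelihood that an application running in the second virtual machine has failed; a comparison module, configured to compare, for one application running in the second virtual machine, the feature value obtained by the calculation module against the preset fault determination criterion of that application; and a determining module, configured to determine from the comparison result of the comparison module whether that application has failed.
With reference to the second aspect, in a first possible implementation of the second aspect, the calculation module includes: a first calculation unit, configured to obtain the mean and variance of the sample time intervals from the predetermined number of sample time intervals obtained by the obtaining module; a second calculation unit, configured to obtain a distribution function from the mean and variance obtained by the first calculation unit; a third calculation unit, configured to substitute the prediction duration obtained by the obtaining module into the distribution function obtained by the second calculation unit to compute a function value; and a fourth calculation unit, configured to obtain the feature value from the function value obtained by the third calculation unit.
With reference to the second aspect or its first possible implementation, in a second possible implementation of the second aspect, the fourth calculation unit is specifically configured to take the negative logarithm of the function value obtained by the third calculation unit to obtain the feature value.
With reference to the second aspect or any of its foregoing possible implementations, in a third possible implementation of the second aspect, the preset fault determination criterion of the application includes one non-fault value interval and one fault value interval; the comparison module is specifically configured to determine which of the non-fault value interval and the fault value interval the feature value obtained by the calculation module falls in; if the comparison result is that the feature value falls in the fault value interval, the determining module determines that the application has failed; and if the comparison result is that the feature value falls in the non-fault value interval, the determining module determines that the application has not failed.
With reference to the second aspect or any of its foregoing possible implementations, in a fourth possible implementation of the second aspect, the apparatus further includes a fault processing module, configured to perform fault handling on the application after the determining module determines that the application has failed.
With reference to the second aspect or any of its foregoing possible implementations, in a fifth possible implementation of the second aspect, the preset fault determination criterion of the application includes one non-fault value interval and at least two fault value intervals, each fault value interval corresponding to a different fault level; the comparison module is specifically configured to determine which of the non-fault value interval and the at least two fault value intervals the feature value obtained by the calculation module falls in; if the comparison result is that the feature value falls in one of the fault value intervals, the determining module determines that the fault level of the application is the fault level corresponding to that fault value interval; and if the comparison result is that the feature value falls in the non-fault value interval, the determining module determines that the application has not failed.
With reference to the second aspect or any of its foregoing possible implementations, in a sixth possible implementation of the second aspect, each fault value interval corresponds to a different fault handling manner; the apparatus further includes a fault processing module, configured to perform on the application the fault handling corresponding to a fault value interval after the determining module determines that the fault level of the application is the fault level corresponding to that fault value interval.
With reference to the second aspect or any of its foregoing possible implementations, in a seventh possible implementation of the second aspect, the predetermined number is M, and the M sample time intervals are obtained from M+1 consecutive BFD detection packets, where M is an integer greater than 20.
According to a third aspect of the embodiments of the present invention, a Bidirectional Forwarding Detection (BFD) detection system is provided, including the first virtual machine, the second virtual machine, and any one of the BFD detection apparatuses described above.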
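According to the technical solutions provided in the embodiments of the present invention, the BFD detection mechanism is introduced into a virtualized environment to achieve fast fault detection and meet the reliability requirements of highly real-time telecom services. A predetermined number of arrival intervals between adjacent BFD detection packets are obtained, a feature value is computed statistically from those intervals, and the feature value is compared with the fault determination values of different applications to perform the corresponding fault determination and handling. The fault determination criterion of BFD detection can thus adapt to the requirements of different applications, making fault determination for each application more accurate and, compared with the uniform determination of conventional BFD detection, greatly reducing fault misjudgments.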
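Brief Description of the Drawings
FIG. 1 is a schematic diagram of a BFD detection architecture in the prior art;
FIG. 2 is a schematic diagram of a BFD detection architecture according to an embodiment of the present invention;
FIG. 3 is a flowchart of a BFD detection method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the time intervals of consecutive packets according to an embodiment of the present invention;
FIG. 5 is a graph of the distribution function to be constructed in a BFD detection method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a method by which BFD determines whether an application has a fault according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a BFD detection apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the structure of the calculation module in a BFD detection apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a BFD detection device according to an embodiment of the present invention.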
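Detailed Description
Embodiments of the present invention provide a BFD detection method, apparatus, and system.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some preferred embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention. Note that, owing to usage habits, in some contexts "BFD" and "BFD detection" mean the same thing; "BFD detection packet" is sometimes shortened to "detection packet" or "packet"; and "application" may be equivalent to a communication "service".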
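In the detection module of conventional BFD, the two functions of fault detection and fault determination are combined. For example, if three consecutive BFD packets are lost during BFD detection (which may manifest as no packet being received within three packet periods), the communication connection with the peer node is judged to be interrupted, which triggers fault handling for the corresponding application/service. FIG. 1 is a schematic diagram of a BFD detection architecture in the prior art. The conventional BFD detection module outputs only one definite detection result, namely a Boolean fault/no-fault result, and reports it to the fault management module, which performs the subsequent fault handling. The conventional BFD detection mechanism does not distinguish between applications: whatever the application type, a single detection time is configured as the detection criterion. BFD detection therefore judges every application running on the peer node with that single detection time, may misjudge some applications, and cannot suit different application types.
To solve the foregoing technical problem, our solution applies "functional separation" to the fault detection and fault determination of the conventional BFD detection module, so that different reliability detection criteria can be used for different types of applications. That is, BFD no longer provides a Boolean fault result; it is only responsible for detecting packets and reporting the detection result to an upper-layer application or a dedicated fault management module, which decides whether a fault has occurred based on the detection result and each application's own fault criterion. (The following uses only the fault management module as an example; replacing the fault management module with an upper-layer application or the like on the same principle shall fall within the protection scope of the present invention.)
FIG. 2 is a schematic diagram of a BFD detection architecture according to an embodiment of the present invention. The BFD detection module detects the detection packets, reading each packet's arrival time from the information carried in the packet or recording the arrival time directly (the arrival time of a packet is the moment at which the peer node's detection packet reaches the local node). It then obtains the arrival intervals between packets and performs a series of mathematical operations on a number of these intervals to obtain a feature value ψ related to the packet arrival intervals. The BFD detection architecture also contains a fault management module. Each application determines in advance, according to its own fault determination requirements, a threshold (Φ) as the criterion for deciding whether a fault has occurred, and this threshold is stored in the fault management module in advance. The BFD detection module thus processes the detection packet intervals to obtain the feature value ψ and reports it to the fault management module. The fault management module compares ψ with the pre-stored Φ value of each application, determines from the comparison whether each application running on the peer node has failed, and then applies fault management measures to the failed applications.
FIG. 3 is a flowchart of a BFD detection method according to an embodiment of the present invention. In a specific implementation, the first virtual machine and the second virtual machine refer generally to two virtual machines communicating with each other in a virtualized cloud environment. Two virtual machines normally communicate bidirectionally: while the first virtual machine receives the second virtual machine's BFD detection packets, the second virtual machine can also receive and detect the first virtual machine's BFD detection packets, so the BFD detection mechanism in each virtual machine is similar. "The first virtual machine receives a BFD detection packet from the second virtual machine" may equivalently be expressed as "a BFD detection packet of the second virtual machine arrives at the first virtual machine". The specific method is illustrated below.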
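Step 101: When the first virtual machine receives BFD detection packets from the second virtual machine, the Bidirectional Forwarding Detection (BFD) detection module obtains a prediction duration and a predetermined number of sample time intervals, where the prediction duration is the interval between the current time and the time at which the first virtual machine most recently received a BFD detection packet, and a sample time interval is the arrival interval between two adjacent BFD detection packets.
Because the BFD detection mechanism in the present invention is chiefly concerned with the probabilistic relationship between the interval separating two detection packets and the successful reception of a packet, the method obtains the arrival intervals of adjacent detection packets, in particular of consecutive detection packets. For example, given 100 detection packets sent consecutively and numbered 1 to 100, the arrival intervals counted are those between packets 1 and 2, 2 and 3, 3 and 4, ..., 99 and 100, 99 in total. A more general case is shown in FIG. 4, a schematic diagram of the time intervals of consecutive detection packets according to an embodiment of the present invention: in order of time, packet 1, packet 2, ..., packet N are a series of consecutive packets received by the first virtual machine from the second virtual machine, and the arrival interval sample data to be obtained are (t_2 − t_1), (t_3 − t_2), ..., (t_N − t_{N−1}), where t_n denotes the arrival time of the n-th packet and n ranges over the positive integers; t_N is generally the arrival time of the newest packet received before the current time, that is, the most recently received packet.
In a specific implementation, every obtained sample time interval may be used, or only some of them: for example, M of the sample time intervals are selected for the statistical calculation. The value of M usually follows the needs of the statistical method, and sometimes also depends on how many packets can be stored in the virtual machine or on the timeliness of the packets. M is a positive integer and, to satisfy the empirical needs of sample statistics, is generally greater than 20.
Step 102: The BFD detection module obtains a feature value from the prediction duration and the predetermined number of sample time intervals, where the feature value indicates the likelihood that an application in the second virtual machine has failed.
In a specific implementation, the BFD detection module may apply a mathematical statistics method to the M packet arrival intervals obtained in step 101 to obtain the feature value. The feature value describes the likelihood that the first virtual machine, having successfully received the previous BFD detection packet sent by the second virtual machine, will successfully receive the next BFD detection packet from the second virtual machine after the prediction duration. The feature value is described in detail later. This solution thus differs from conventional BFD detection, which simply outputs a detection time.
Step 103: For one application running in the second virtual machine, the feature value is compared against the preset fault determination criterion of that application, and whether the application has failed is determined from the comparison result.
The principle underlying steps 101 and 102 is as follows. When the first virtual machine finds that a detection packet is missing, there are two possibilities: the detection packet is late because of the network or similar causes, or an application in the second virtual machine has really failed. A judgment is therefore needed as to which is the case. The judgment rests on statistics of the arrival intervals of past detection packets: the likelihood of successfully receiving the next packet is predicted from the interval between the current time and the most recent packet reception. To implement this, statistics must first be collected on the arrival intervals of the detection packets received successfully in the past.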
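In a specific implementation, when a detection packet reaches the first virtual machine, the BFD detection module obtains or records its arrival time and stores the arrival time in a storage queue. To keep the intervals current, arrival times are usually kept in a fixed-length storage queue using a stack-style storage scheme. For example, if the fixed length is 100 arrival times, then when the 101st arrival time is stored, the originally stored 1st arrival time overflows, so that the queue always holds 100 values. The arrival interval between two adjacent detection packets varies with factors such as node processing capability and network transmission capability, and this fluctuation is generally assumed to follow a Gaussian distribution. For example, when the communication modules in the network are working normally, two adjacent detection packets reach the first virtual machine 1 s apart; performance fluctuations of the communication modules cause the arrival interval to fluctuate around 1 s, and the arrival intervals as a whole exhibit a Gaussian distribution.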
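The fixed-length store of arrival times can be sketched in Python with a bounded `collections.deque`. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the eviction shown is first-in, first-out, which matches the described behaviour (the oldest arrival time overflows when a new one is stored) even though the text calls the scheme stack-style.

```python
from collections import deque


class ArrivalWindow:
    """Keeps the most recent `capacity` arrival times and derives intervals."""

    def __init__(self, capacity: int = 100):
        # A deque with maxlen silently drops the oldest entry when full.
        self._times = deque(maxlen=capacity)

    def record(self, arrival_time: float) -> None:
        self._times.append(arrival_time)

    def last_arrival(self) -> float:
        return self._times[-1]

    def sample_intervals(self) -> list:
        """Arrival intervals between adjacent stored packets."""
        times = list(self._times)
        return [later - earlier for earlier, later in zip(times, times[1:])]
```

A window of 100 arrival times yields 99 sample time intervals, as in the example above.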
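In a specific implementation, a Gaussian distribution function can be built for the statistics based on the Gaussian character of the detection packet arrival intervals. FIG. 5 shows the distribution function constructed in a BFD detection method according to an embodiment of the present invention. The abscissa x denotes the prediction duration; because the arrival of the next detection packet is uncertain at the current time, x is a variable. G(x) denotes the probability that the next detection packet can be successfully received after the current time, and f(x) denotes the probability density. It is easy to infer from the figure that the shorter the prediction duration, the greater the probability of successfully receiving the next packet, and the longer the prediction duration, the smaller that probability becomes. The distribution function f(x) is determined by the predetermined number of sample time intervals, that is, by the detection packet arrival times in the fixed-length queue. Taking a fixed-length queue of 100 arrival times again as an example, since f(x) captures the effect of the arrival interval between adjacent packets on the probability of successful reception, the arrival times of 100 consecutive detection packets yield 99 sample time intervals Δt_1, Δt_2, ..., Δt_99, where a sample time interval is the arrival interval between two adjacent detection packets. Statistics are computed on these 99 sample time intervals to obtain their mean μ and variance σ, and the distribution function is then determined from μ and σ. The general formulas are as follows: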
Figure PCTCN2016070063-appb-000001
Figure PCTCN2016070063-appb-000002
where n is the number of sample time intervals, Δt(j) is a time interval, j is an integer with j ∈ [1, n], μ is the mean, and σ is the variance.
Figure PCTCN2016070063-appb-000003
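Thus, each time a new detection packet is received, the detection packet intervals in the storage queue change. Although a new interval does not greatly affect the sample as a whole, the mean μ, the variance σ, and the distribution function f(x) computed from the 99 sample time intervals are all updated in real time. This keeps the detection and prediction for new packets close to the latest state of network communication, ensuring a degree of timeliness and accuracy.
From this, the probability that the first virtual machine should normally have received a packet by the current time is: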
Figure PCTCN2016070063-appb-000004
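The probability that the first virtual machine has not yet received a packet at the current time, that is, the probability that the packet will be received after the current time, is: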
Figure PCTCN2016070063-appb-000005
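The five equations referenced above appear only as image placeholders in this text. Assuming they take the standard Gaussian forms implied by the surrounding definitions (mean μ and spread σ over the n sample intervals Δt(j), density f(x), and tail probability G(x)), they would presumably read as follows. This is a reconstruction under that assumption, not a copy of the original figures. Note that the text calls σ the variance, whereas in the standard form below σ is the standard deviation and σ² the variance.

```latex
\mu = \frac{1}{n}\sum_{j=1}^{n}\Delta t(j), \qquad
\sigma^{2} = \frac{1}{n}\sum_{j=1}^{n}\bigl(\Delta t(j)-\mu\bigr)^{2}

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)

F(x) = \int_{-\infty}^{x} f(t)\,dt, \qquad
G(x) = 1 - F(x) = \int_{x}^{+\infty} f(t)\,dt
```

Here F(x) is the probability that a packet should already have been received by the current time, and G(x) is the probability that it will only be received afterwards.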
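What this solution is concerned with is the following: the newest detection packet arrived successfully at t_last, and no packet has been received at the current time t_now, so the probability that the next packet will only be received after t_now is G(t_now − t_last; μ_last, σ_last), where μ_last and σ_last are the values computed over the arrival intervals of all adjacent detection packets in the queue at the moment the newest detection packet was stored in the fixed-length queue, and t_now − t_last is the latest prediction duration.
Optionally, G(t_now − t_last; μ_last, σ_last) itself can be used as the feature value, in which case each application simply sets its own fault determination probability. For example, for application A running in the second virtual machine, the fault determination probability is 0.2 and is pre-stored in the BFD detection module. If G(t_now − t_last; μ_last, σ_last) is less than 0.2, the probability of receiving the next detection packet after t_now is considered very small, and application A is considered to have failed; if it is greater than 0.2, there is still a good chance of receiving the next packet after t_now, and application A is considered not to have failed.
As another option, G(t_now − t_last; μ_last, σ_last) expresses a probability and is therefore a decimal fraction, sometimes a very small one, and comparing such fractions directly may introduce some error and uncertainty. For convenience, its logarithm log[G(t_now − t_last; μ_last, σ_last)] can be taken; and because 0 < G(t_now − t_last; μ_last, σ_last) < 1, log[G(t_now − t_last; μ_last, σ_last)] < 0, so ψ = −log[G(t_now − t_last; μ_last, σ_last)] is defined. This ψ is the feature value mentioned in step 102. It is also easy to see that the feature value keeps being updated as packets keep arriving.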
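A minimal Python sketch of the feature value ψ follows, assuming a Gaussian model of the sample intervals. Several choices here are assumptions rather than statements from the text: the base-10 logarithm (the text writes only "log"), the use of the population standard deviation as σ in the Gaussian tail, the two guard constants, and all function names.

```python
import math
from statistics import fmean, pstdev

MIN_SIGMA = 1e-9   # assumption: avoids division by zero for identical samples
MIN_PROB = 1e-300  # assumption: avoids log(0) when the tail underflows


def tail_probability(x: float, mu: float, sigma: float) -> float:
    """G(x; mu, sigma): probability that a Gaussian interval exceeds x."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))


def feature_value(sample_intervals, t_now: float, t_last: float) -> float:
    """psi = -log10 G(t_now - t_last; mu, sigma) over the current samples."""
    mu = fmean(sample_intervals)
    sigma = max(pstdev(sample_intervals, mu), MIN_SIGMA)
    g = max(tail_probability(t_now - t_last, mu, sigma), MIN_PROB)
    return -math.log10(g)
```

When the elapsed time equals the mean interval, G is 0.5 and ψ is about 0.3; ψ grows quickly as the elapsed time moves past the mean, which is what allows a single ψ to be compared against different per-application thresholds.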
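It is worth noting that the statistics and calculations performed on the sample time intervals in the present invention are not limited to the mathematical methods mentioned above; similar or derived statistical models related to Gaussian probability statistics can be applied in the same way. Because these are all well-known statistical methods in mathematics, the embodiments of the present invention do not list them one by one. Subsequent transformations of the probability value can likewise be implemented with basic mathematical operations, so any statistical processing of the sample time intervals by conventional mathematical operations that finally yields the feature value should fall within the protection scope of the present invention.
In the specific implementation of step 103, the fault management module compares the value ψ with the fault determination criterion preset for the application running in the second virtual machine, thereby determining whether the application is faulty. The judging process is shown in FIG. 6, a schematic diagram of a method by which BFD determines whether an application has a fault according to an embodiment of the present invention. Because applications vary widely and each has different functional and reliability requirements, there is diversity in how a fault is determined. Two situations are common (applications B and C mentioned below are arbitrary applications running in the second virtual machine, not specific ones; their fault determination values are pre-stored in the BFD detection module).
Case 1: If the fault determination value of application B is a single threshold Φ1, the preset fault determination criterion of application B includes a non-fault value interval (0, Φ1) and a fault value interval [Φ1, +∞). The fault management module determines which interval the feature value ψ falls in. If ψ falls in [Φ1, +∞), the fault management module determines that application B has failed; if ψ falls in (0, Φ1), the fault management module determines that application B has not failed, or makes no response.
Accordingly, if application B is determined to have failed, the fault management module also performs the fault handling measures corresponding to application B so that application B can run normally; if application B has not failed, or has returned to normal after fault handling, the fault management module performs no fault handling or makes no response.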
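Case 2: If the fault determination value of application C consists of at least two threshold intervals, the preset fault determination criterion of application C includes one non-fault value interval and at least two fault value intervals. Each fault value interval corresponds to a different fault level and each fault level to a different fault handling measure; fault value intervals correspond one-to-one with fault levels, and fault levels correspond one-to-one with fault handling measures. Different threshold intervals thus represent different fault levels, that is, different fault severities. The fault management module compares the feature value ψ with the non-fault value interval and the at least two fault value intervals to find the interval in which ψ lies, and thereby determines the fault level of application C. As shown in FIG. 5, when ψ ∈ (0, Φ1], the fault management module determines that application C has no fault, or makes no response; when ψ ∈ (Φ1, Φ2], application C is at fault level 1; when ψ ∈ (Φ2, Φ3], at fault level 2; when ψ ∈ (Φ3, +∞), at fault level 3. The severity of fault levels 1, 2, and 3 increases in that order.
Accordingly, if the fault management module determines that application C has failed, it performs the fault handling measures corresponding to the fault level so that application C can run normally; if application C has not failed, or has returned to normal after fault handling, the fault management module performs no fault handling.
As a supplementary note, some applications have different fault levels and corresponding fault handling manners mainly because of the amount of software and hardware resources that fault handling must invoke. For example, handling a lower-level fault of application C requires fewer and simpler software and hardware resources than handling a higher-level fault, and the overall cost is correspondingly lower. Therefore, when fault level 1 occurs in application C, the fault handling manner corresponding to fault level 1 is adopted and the corresponding software and hardware resources are invoked to complete the repair; matching the handling manner to the fault level in this way makes more rational use of resources.
In a specific implementation, the fault determination value Φ of each application is obtained through extensive empirical trials based on the application's reliability and real-time requirements. For example, applications with high real-time and reliability requirements need faster fault detection and recovery; however, as detection gets faster the false alarm rate rises accordingly, so obtaining both a fast detection time and a reasonable false alarm rate requires choosing an appropriate threshold Φ. An application developer can, for example, use extensive test statistics relating the threshold Φ to detection time and false alarm rate to pick a suitable threshold within the range of acceptable false alarm rates. This also illustrates the advantage of this solution: while preserving the fast detection capability of BFD, the false alarm rate can be kept at an acceptable level. In addition, the choice of Φ depends on the algorithm used to obtain the feature value in step 102: if the feature value is a probability, Φ should also be a pre-selected probability value; if the feature value is the logarithm of the probability, Φ is the logarithm of the pre-selected probability value; if the feature value is the negative logarithm of the probability, Φ is the negative logarithm of the pre-selected probability value. In short, what the present invention ultimately needs is the probability G(x) that the next packet will still arrive, and the original value of Φ is an empirically acceptable probability G(x0) that application developers have evaluated from extensive past experience. Whatever operation is applied to G(x) to obtain the feature value, the same operation is applied to G(x0) to obtain the corresponding Φ.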
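The technical solution of the present invention differs from the conventional BFD detection mechanism by introducing statistical analysis of the arrival intervals of adjacent detection packets. BFD detection no longer simply outputs fault/no-fault information for every application based on a single detection time. Instead, it first performs statistical processing on the arrival intervals of adjacent packets to compute a feature value, then compares that feature value with the fault determination value preset by each application to determine each application's fault condition specifically. BFD detection can thus adapt to the requirements of different applications without a uniform judgment, which reduces the resulting fault misjudgments.
The first virtual machine and the second virtual machine refer generally to two virtual machines communicating with each other in a virtualized cloud environment. Two virtual machines normally communicate bidirectionally: the first virtual machine can receive packets from the second virtual machine, and the second virtual machine can receive packets from the first virtual machine, so the BFD detection mechanism is similar for each virtual machine. FIG. 7 is a schematic diagram of a BFD detection apparatus according to an embodiment of the present invention. In this embodiment, the apparatus is applied while the first virtual machine receives BFD detection packets that come from the second virtual machine. The apparatus may be located in the first virtual machine, or outside the virtual machines as an upper-layer function that supervises the communication status of the first and second virtual machines. The apparatus 200 includes: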
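The obtaining module 201 is configured to obtain a prediction duration and a predetermined number of sample time intervals, where the prediction duration is the interval between the current time and the time at which the first virtual machine most recently received a BFD detection packet, and a sample time interval is the arrival interval between two adjacent BFD detection packets.
The obtaining module 201 obtains the intervals between adjacent packets, in particular between consecutive packets: (t_2 − t_1), (t_3 − t_2), ..., (t_N − t_{N−1}), where t_m denotes the arrival time of the m-th packet and m may be any positive integer. The predetermined number is usually determined by how many packets can be stored in the virtual machine or by the sample requirements of the statistics. For example, the obtaining module obtains (t_M − t_{M−1}), (t_{M+1} − t_M), ..., (t_N − t_{N−1}), where M is less than N, N − M is the predetermined number and is generally greater than 20, and the N-th packet is the last packet received by the first virtual machine before the current time.
The calculation module 202 is configured to obtain the feature value from the prediction duration and the predetermined number of sample time intervals obtained by the obtaining module 201, where the feature value indicates the likelihood that an application running in the second virtual machine is faulty.
Optionally, the calculation module 202 can include a first calculation unit 2021, a second calculation unit 2022, and a third calculation unit 2023. FIG. 8 is a schematic diagram of the structure of the calculation module in a BFD detection apparatus according to an embodiment of the present invention.
According to a predetermined calculation rule, the first calculation unit 2021 computes statistics on the N − M sample time intervals to obtain their mean μ and variance σ, and the second calculation unit 2022 then determines the probability density function f(x) from μ and σ. The general formulas are as follows: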
Figure PCTCN2016070063-appb-000006
Figure PCTCN2016070063-appb-000007
where n is the predetermined number, that is, N − M; Δt(j) is a sample time interval; and j is an integer with j ∈ [1, n].
Figure PCTCN2016070063-appb-000008
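where x is the prediction duration. Next, the third calculation unit 2023 obtains the probability that no packet has yet been received at the current time: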
Figure PCTCN2016070063-appb-000009
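G(x; μ, σ) expresses the relationship between the length of the interval and the probability that the first virtual machine can successfully receive a BFD detection packet. Because the loss of detection packets reflects an application failure, G(x; μ, σ) also expresses how likely an application failure is for a given interval length. If no packet has been received at the current time t_now, and t_last is the time at which the first virtual machine most recently received a detection packet before t_now, then the probability of receiving a packet after t_now is G(t_now − t_last; μ_last, σ_last), where t_now − t_last is the latest prediction duration.
Optionally, because a probability is a decimal fraction, possibly a small one, for convenience the calculation module 202 may further include a fourth calculation unit 2024 that takes the negative logarithm of the function value obtained by the third calculation unit 2023, defining ψ = −log[G(t_now − t_last; μ_last, σ_last)]; this ψ is the feature value.
Optionally, the calculation module 202 can also include several calculation units that, following a preset algorithm, perform a series of conventional mathematical operations to obtain a feature value that is convenient for subsequent use.
The comparison module 203 is configured to compare, for one application running in the second virtual machine, the feature value obtained by the calculation module 202 with the preset fault determination criterion of that application. The preset fault determination criterion of every application in the second virtual machine is pre-stored in the comparison module.
The determining module 204 is configured to determine, from the comparison result of the comparison module 203, whether that application is faulty.
In a specific implementation, the fault management process of the comparison module 203 and the determining module 204 is similar to the method corresponding to FIG. 5. Because applications vary widely and each has different functional and reliability requirements, there is diversity in how a fault is determined. Two situations are common, as in the following two examples: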
例1:若语音业务预设的故障判定值为单一阈值0.2时;对照模块203将计算模块202得到的特征值ψ与0.2进行比较;若比较结果为特征值大于等于0.2,判定模块204判定第一应用发生故障,若比较结果为特征值小于0.2,判定模块204判定语音业务这一应用没有发生故障或不做响应。
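Optionally, the apparatus 200 may further include a fault handling module 205. Accordingly, if the determining module 204 determines that the voice service has a fault, the fault handling module 205 in the apparatus 200 performs the fault handling measure corresponding to the voice service, so that the voice service can run normally. If the voice service has no fault, or has returned to normal after fault handling, the determining module 204 determines that the voice service has no fault or makes no response, and the fault handling module 205 therefore performs no fault handling or makes no response.
Example 2: Suppose the preset fault determination values of a picture transmission service are four value intervals, where (0, 3] indicates that the picture transmission service has no fault, (3, 5] indicates a level-1 fault, (5, 10] indicates a level-2 fault, and (10, +∞) indicates a level-3 fault. Level-1, level-2, and level-3 faults correspond to level-1, level-2, and level-3 fault handling measures respectively. The comparison module 203 compares the characteristic value ψ with these value intervals and determines the interval in which the characteristic value falls; the determining module 204 determines the fault level of the picture transmission service according to the interval in which ψ falls. As shown in FIG. 6, when the comparison module 203 determines that ψ ∈ (0, 3], the determining module 204 determines that the picture transmission service has no fault; when ψ ∈ (3, 5], the determining module 204 determines that the picture transmission service has a level-1 fault; when ψ ∈ (5, 10], a level-2 fault; and when ψ ∈ (10, +∞), a level-3 fault. Larger values in a threshold interval indicate a more severe fault; that is, the severity increases from level-1 to level-2 to level-3 faults.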
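The interval lookup of Example 2 can be sketched with a sorted list of upper bounds. The bounds 3, 5 and 10 come from the example; the names `PICTURE_UPPER_BOUNDS`, `LEVEL_NAMES` and `fault_level` are hypothetical. Treating a value of exactly 0 as "no fault" is also an assumption, since the example's first interval is open at 0.

```python
import bisect

# Upper bounds of the right-closed intervals (0, 3], (3, 5], (5, 10];
# anything above 10 falls into (10, +inf).
PICTURE_UPPER_BOUNDS = [3, 5, 10]
LEVEL_NAMES = ["no fault", "level-1 fault", "level-2 fault", "level-3 fault"]


def fault_level(psi, upper_bounds=PICTURE_UPPER_BOUNDS):
    """Return 0 for no fault, or 1..3 for the fault level of Example 2.

    bisect_left places a value equal to a bound in the interval that ends
    at that bound, which matches the right-closed intervals of the example.
    """
    return bisect.bisect_left(upper_bounds, psi)
```

A fault handling module could then use the returned level as an index into a table of handling measures, for example `LEVEL_NAMES[fault_level(psi)]` for logging.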
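Optionally, the apparatus 200 may further include a fault handling module 205. Accordingly, if the determining module 204 determines that the picture transmission service has a fault, the fault handling module 205 in the apparatus 200 performs, according to the fault level determined by the determining module 204, the corresponding fault handling measure on the picture transmission service, so that the service can run normally. If the picture transmission service has no fault, or has returned to normal after fault handling, the determining module 204 determines that the service has no fault or makes no response, and the fault handling module 205 therefore performs no fault handling or makes no response.
This embodiment of the present invention provides a BFD detection apparatus. Unlike a conventional BFD detection apparatus, it no longer relies on a single-value decision that simply outputs whether a fault exists. The BFD detection apparatus of the present invention includes the obtaining module 201, the calculation module 202, the comparison module 203, and the determining module 204, and may further include the fault handling module 205. The obtaining module 201 and the calculation module 202 process the intervals between detection packets mathematically to obtain a characteristic value that measures whether an application has a fault. The comparison module 203 compares the characteristic value with the preset fault determination criteria of different applications, so that the determining module 204 determines the fault status of each application individually. BFD detection can thereby adapt to the requirements of different types of applications instead of applying the conventional uniform decision, which reduces the resulting fault misjudgments and ultimately allows the fault handling module 205 to apply the appropriate handling measure to each faulty application.
An embodiment of the present invention provides a BFD detection system, which consists of the first virtual machine, the second virtual machine, and the BFD detection apparatus 200 described in the foregoing paragraphs. As one option, the BFD detection apparatus may be located in the first virtual machine, directly receive BFD detection packets from the second virtual machine, and perform the method of steps 101 to 103 described above on these detection packets to determine whether an application running in the second virtual machine has a fault. As another option, the BFD detection apparatus is located outside the virtual machines as an upper-layer management functional apparatus for the first and second virtual machines; while the first virtual machine receives BFD detection packets from the second virtual machine, the apparatus obtains the relevant information about these packets and performs the method of steps 101 to 103 described above to determine whether an application running in the second virtual machine has a fault.
Refer to FIG. 9, which is a schematic structural diagram of a BFD detection device according to an embodiment of the present invention. The device 400 includes:
A processor 401, configured to generate corresponding operation control signals and send them to the corresponding components of the computing device, and to read and process data in software, in particular the data and programs in the memory 402, so that each functional module performs its corresponding function and the corresponding components act as the instructions require.
A memory 402, configured to store programs and various data, mainly software units such as an operating system, applications, and functional instructions, or subsets or extended sets thereof. The memory may further include a non-volatile random access memory (NVRAM), which provides the processor 401 with support for managing the hardware, software, and data resources of the computing device and for the control software and applications.
A transceiver 403, configured to collect, obtain, or send information, and usable for transferring information between modules.
The foregoing hardware units may communicate with each other over a bus.
In this way, by invoking the programs or instructions stored in the memory 402, when the transceiver 403 receives BFD detection packets from the second virtual machine, the processor 401 obtains the prediction duration and the predetermined quantity of sample time intervals. The processor 401 processes the prediction duration and the predetermined quantity of sample time intervals according to an algorithm pre-stored in the memory 402 to obtain the characteristic value, compares the characteristic value with the preset fault determination criterion of each application running on the detected node, determines from the comparison results whether any application has a fault, and performs the corresponding fault handling on any faulty application.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing methods may be completed by hardware under the direction of program instructions, and that the program may be stored in a computer-readable storage medium. From the foregoing description of the implementations, a person skilled in the art can clearly understand that the present invention may be implemented by hardware, by firmware, or by a combination thereof.
The foregoing embodiments are merely preferred embodiments of the technical solutions of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.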

Claims (16)

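  1. A bidirectional forwarding detection (BFD) detection method, wherein the method is applied in a process in which a first virtual machine receives BFD detection packets, and the BFD detection packets come from a second virtual machine; characterized in that the method comprises:
    obtaining a prediction duration and a predetermined quantity of sample time intervals, wherein the prediction duration is the time interval between the current time and the time at which the first virtual machine most recently received a BFD detection packet, and a sample time interval is the interval between the arrival times of two adjacent BFD detection packets;
    obtaining a characteristic value according to the prediction duration and the predetermined quantity of sample time intervals, wherein the characteristic value indicates the likelihood that an application running in the second virtual machine has a fault; and
    for one application running in the second virtual machine, comparing the characteristic value with a preset fault determination criterion of the one application, and determining, according to the comparison result, whether the one application has a fault.
  2. The method according to claim 1, characterized in that the obtaining a characteristic value according to the prediction duration and the predetermined quantity of sample time intervals comprises:
    obtaining a mean and a variance of the sample time intervals according to the predetermined quantity of sample time intervals;
    obtaining a distribution function according to the mean and the variance of the sample time intervals;
    substituting the prediction duration into the distribution function to calculate a function value; and
    obtaining the characteristic value according to the function value.
  3. The method according to claim 2, characterized in that the obtaining the characteristic value according to the function value comprises: taking the negative logarithm of the function value to obtain the characteristic value.
  4. The method according to any one of claims 1 to 3, characterized in that the preset fault determination criterion of the one application comprises one non-fault value interval and one fault value interval; and the comparing the characteristic value with the preset fault determination criterion of the one application, and determining, according to the comparison result, whether the one application has a fault comprises:
    determining in which of the non-fault value interval and the fault value interval the characteristic value falls; and
    if the comparison result is that the characteristic value falls in the fault value interval, determining that the one application has a fault; or if the comparison result is that the characteristic value falls in the non-fault value interval, determining that the one application has no fault.
  5. The method according to claim 4, characterized in that after the determining that the one application has a fault, the method further comprises: performing fault handling on the one application.
  6. The method according to any one of claims 1 to 3, characterized in that the preset fault determination criterion of the one application comprises one non-fault value interval and at least two fault value intervals, each fault value interval corresponding to a different fault level; and the comparing the characteristic value with the preset fault determination criterion of the one application, and determining, according to the comparison result, whether the one application has a fault comprises:
    determining in which of the non-fault value interval and the at least two fault value intervals the characteristic value falls;
    if the comparison result is that the characteristic value falls in one fault value interval, determining that the fault level of the one application is the fault level corresponding to the one fault value interval, wherein the one fault value interval is one of the at least two fault value intervals; or
    if the comparison result is that the characteristic value falls in the non-fault value interval, determining that the one application has no fault.
  7. The method according to claim 6, characterized in that each fault value interval corresponds to a different fault handling manner; and after the determining that the fault level of the one application is the fault level corresponding to the one fault value interval, the method further comprises: performing, on the one application, the fault handling corresponding to the one fault value interval.
  8. The method according to any one of claims 1 to 7, characterized in that the predetermined quantity is M, and the M sample time intervals are obtained from M+1 consecutive BFD detection packets, wherein M is an integer greater than 20.
  9. A bidirectional forwarding detection (BFD) detection apparatus, wherein the apparatus is applied in a process in which a first virtual machine receives BFD detection packets, and the BFD detection packets come from a second virtual machine; characterized in that the apparatus comprises:
    an obtaining module, configured to obtain a prediction duration and a predetermined quantity of sample time intervals, wherein the prediction duration is the time interval between the current time and the time at which the first virtual machine most recently received a BFD detection packet, and a sample time interval is the interval between the arrival times of two adjacent BFD detection packets;
    a calculation module, configured to obtain a characteristic value according to the prediction duration and the predetermined quantity of sample time intervals obtained by the obtaining module, wherein the characteristic value indicates the likelihood that an application running in the second virtual machine has a fault;
    a comparison module, configured to compare, for one application running in the second virtual machine, the characteristic value obtained by the calculation module with a preset fault determination criterion of the one application; and
    a determining module, configured to determine, according to the comparison result of the comparison module, whether the one application has a fault.
  10. The apparatus according to claim 9, characterized in that the calculation module comprises:
    a first calculation unit, configured to obtain a mean and a variance of the sample time intervals according to the predetermined quantity of sample time intervals obtained by the obtaining module;
    a second calculation unit, configured to obtain a distribution function according to the mean and the variance of the sample time intervals obtained by the first calculation unit;
    a third calculation unit, configured to substitute the prediction duration obtained by the obtaining module into the distribution function obtained by the second calculation unit to calculate a function value; and
    a fourth calculation unit, configured to obtain the characteristic value according to the function value obtained by the third calculation unit.
  11. The apparatus according to claim 10, characterized in that the fourth calculation unit is specifically configured to take the negative logarithm of the function value obtained by the third calculation unit to obtain the characteristic value.
  12. The apparatus according to any one of claims 9 to 11, characterized in that the preset fault determination criterion of the one application comprises one non-fault value interval and one fault value interval; the comparison module is specifically configured to determine in which of the non-fault value interval and the fault value interval the characteristic value obtained by the calculation module falls;
    if the comparison result of the comparison module is that the characteristic value falls in the fault value interval, the determining module determines that the one application has a fault; or
    if the comparison result of the comparison module is that the characteristic value falls in the non-fault value interval, the determining module determines that the one application has no fault.
  13. The apparatus according to claim 12, characterized in that the apparatus further comprises a fault handling module, and the fault handling module is configured to perform fault handling on the one application after the determining module determines that the one application has a fault.
  14. The apparatus according to any one of claims 9 to 11, characterized in that the preset fault determination criterion of the one application comprises one non-fault value interval and at least two fault value intervals, each fault value interval corresponding to a different fault level; the comparison module is specifically configured to determine in which of the non-fault value interval and the at least two fault value intervals the characteristic value obtained by the calculation module falls;
    if the comparison result of the comparison module is that the characteristic value falls in one fault value interval, the determining module determines that the fault level of the one application is the fault level corresponding to the one fault value interval, wherein the one fault value interval is one of the at least two fault value intervals; or
    if the comparison result of the comparison module is that the characteristic value falls in the non-fault value interval, the determining module determines that the one application has no fault.
  15. The apparatus according to claim 14, characterized in that each fault value interval corresponds to a different fault handling manner; the apparatus further comprises a fault handling module, and the fault handling module is configured to perform, on the one application, the fault handling corresponding to the one fault value interval after the determining module determines that the fault level of the one application is the fault level corresponding to the one fault value interval.
  16. The apparatus according to any one of claims 9 to 15, characterized in that the predetermined quantity is M, and the M sample time intervals are obtained from M+1 consecutive BFD detection packets, wherein M is an integer greater than 20.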
PCT/CN2016/070063 2015-06-29 2016-01-04 BFD detection method and apparatus Ceased WO2017000536A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16816903.5A EP3316520B1 (en) 2015-06-29 2016-01-04 Bfd method and apparatus
US15/837,442 US10447561B2 (en) 2015-06-29 2017-12-11 BFD method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510369358.6A CN106330588B (zh) 2015-06-29 2015-06-29 BFD detection method and apparatus
CN201510369358.6 2015-06-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/837,442 Continuation US10447561B2 (en) 2015-06-29 2017-12-11 BFD method and apparatus

Publications (1)

Publication Number Publication Date
WO2017000536A1 (zh) 2017-01-05

Family

ID=57607565

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/070063 Ceased WO2017000536A1 (zh) 2015-06-29 2016-01-04 BFD detection method and apparatus

Country Status (4)

Country Link
US (1) US10447561B2 (zh)
EP (1) EP3316520B1 (zh)
CN (1) CN106330588B (zh)
WO (1) WO2017000536A1 (zh)

Also Published As

Publication number Publication date
EP3316520B1 (en) 2019-07-24
CN106330588A (zh) 2017-01-11
EP3316520A1 (en) 2018-05-02
US20180102951A1 (en) 2018-04-12
EP3316520A4 (en) 2018-05-30
US10447561B2 (en) 2019-10-15
CN106330588B (zh) 2020-01-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16816903

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016816903

Country of ref document: EP