WO2017167099A1 - 一种分布式系统中节点的处理方法和装置 - Google Patents
一种分布式系统中节点的处理方法和装置 Download PDFInfo
- Publication number
- WO2017167099A1 WO2017167099A1 PCT/CN2017/077717 CN2017077717W WO2017167099A1 WO 2017167099 A1 WO2017167099 A1 WO 2017167099A1 CN 2017077717 W CN2017077717 W CN 2017077717W WO 2017167099 A1 WO2017167099 A1 WO 2017167099A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- service node
- node
- central
- abnormality
- state information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0817—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0659—Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0876—Network utilisation, e.g. volume of load or congestion level
- H04L43/0882—Utilisation of link capacity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/064—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0823—Errors, e.g. transmission errors
- H04L43/0829—Packet loss
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0876—Network utilisation, e.g. volume of load or congestion level
Definitions
- the present application relates to the field of data processing technologies, and in particular, to a method for processing a node in a distributed system and a processing device for a node in a distributed system.
- a distributed system is a system consisting of geographically and physically dispersed independent nodes or nodes, which include a service node and a central node.
- the central node can coordinate the provisioning of the service nodes. These nodes are connected together to share resources, and the distributed system is equivalent to a unified whole.
- each service node in the distributed system sends the survival status information to the central node at a predetermined period.
- the central node updates its status information table.
- the status information table records the latest update time and the next update time of the service node.
- the central node will check the status information table from time to time to confirm the survival status of the service node. If the next update time of the service node is found to be less than the current system time, the central node may determine that the service node status is abnormal.
- the central node of the system can control the service node, and the service node periodically reports the survival status information to the central node, and the central node
- the survival status of the service node is confirmed according to the survival status information, and the failure processing flow is performed if the service node with abnormal status is found.
- the central node may not receive the survival status information reported by the service node in time due to network delay, or may not be able to process the survival status information in time because the system resource load is too large, which may lead to the service node. If the survival status information is lost, or the next update time is invalid, then the central node may be misjudged. The situation in which the node is alive.
- embodiments of the present application have been made in order to provide a processing method for a node in a distributed system and a corresponding processing device for a node in a distributed system that overcomes the above problems or at least partially solves the above problems.
- the embodiment of the present application discloses a method for processing a node in a distributed system, where the node includes a service node and a central node, and the method includes:
- the service node having the abnormality is processed according to the central state information.
- the distributed system includes a state information table, and the obtaining the survival state information of the service node includes:
- the state information table is updated with the survival state information of the service node.
- the survival status information includes a next update time of the service node
- the current system information includes a current system time of the central node
- the survival status information and the current system information are used to determine the service node.
- the steps to see if there are exceptions include:
- the service node is determined to have an abnormality by using the next update time and the current system time.
- the step of determining whether the service node has an abnormality by using the next update time and the current system time comprises:
- the service node is determined to be absent.
- the central state information includes network busy condition data and/or system resource usage data
- the step of processing the abnormal service node according to the central state information includes:
- the survival status information of the service node with the abnormality in the status information table is updated.
- the network busy condition data includes a network throughput and a network packet loss rate
- the system resource usage data includes an average load of the system
- the determining, by using network busy condition data and/or system resource usage data, The steps of whether the central node is overloaded include:
- Determining the central node if the network throughput is greater than or equal to the network bandwidth, and/or the network packet loss rate is greater than a preset packet loss rate, and/or the average load of the system is greater than a preset load threshold. The load is too heavy.
- the step of updating, in the update status information table, the survival status information of the abnormal service node includes:
- the next update time of the service node with the abnormality described in the status information table is extended.
- the step of updating, in the update status information table, the survival status information of the abnormal service node includes:
- the new survival status information includes a new next update time
- the next update time of the service node with the abnormality in the status information table is updated by using the new next update time.
- the method further comprises:
- the service node If the service node does not have an abnormality, the service node is regarded as a failed service node.
- the method further includes:
- the embodiment of the present application further discloses a processing device for a node in a distributed system, where the node includes a service node and a central node, and the device includes:
- a survival state information obtaining module configured to acquire survival state information of the service node
- a current system information obtaining module configured to acquire current system information of the central node
- a service node abnormality determining module configured to determine whether the service node has an abnormality by using the survival state information and the current system information; and if the service node has an abnormality, calling a central state information acquiring module;
- a central state information obtaining module configured to acquire central state information of the central node
- the abnormal service node processing module is configured to process the service node with the abnormality according to the central state information.
- the distributed system includes a status information table, where the survival status information obtaining module includes:
- a survival status information receiving submodule configured to receive survival status information uploaded by the service node
- the first state information table update submodule is configured to update the state information table by using the survival state information of the service node.
- the survival state information includes a next update time of the service node
- the current system information includes a current system time of the central node
- the service node abnormality determining module includes:
- the status information table traverses the sub-module, and is configured to traverse the next update time in the status information table when the preset time is reached;
- the service node abnormality determining submodule is configured to determine whether the service node has an abnormality by using the next update time and the current system time.
- the service node abnormality determining submodule comprises:
- a time determining unit configured to determine whether the next update time is less than the current system time; if yes, the first determining unit is invoked, and if not, the second determining unit is invoked;
- a first determining unit configured to determine that the service node is abnormal
- a second determining unit configured to determine, by the serving node, that there is no abnormality.
- the central state information includes network busy condition data and/or system resource usage data
- the abnormal service node processing module includes:
- a central node state determining submodule configured to determine, by using the network busy condition data and/or system resource usage data, whether the central node is overloaded; if yes, calling the second state information table to update the submodule;
- the second state information table update submodule is configured to update the survival state information of the service node having the abnormality in the state information table.
- the network busy condition data includes a network throughput and a network packet loss rate
- the system resource usage data includes an average load of the system
- the central node status determining submodule includes:
- a first network busy condition determining unit configured to determine whether the network throughput is greater than Lose bandwidth on the network
- a second network busy condition determining unit configured to determine that the network packet loss rate is greater than a preset packet loss rate
- a system resource usage determining unit configured to determine whether an average load of the system is greater than a preset load threshold
- the central node load determining unit is configured to: when the network throughput is greater than or equal to the network bandwidth, and/or the network packet loss rate is greater than a preset packet loss rate, and/or the average load of the system is greater than a preset load At the threshold, it is determined that the central node is overloaded.
- the second state information table update submodule includes:
- the next update time extension unit is configured to extend the next update time of the service node having the abnormality in the status information table.
- the second state information table update submodule includes:
- An update request sending unit configured to send an update request to the service node
- a next update time receiving unit configured to receive new survival state information uploaded by the service node for the update request; the new survival state information includes a new next update time;
- the next update time update unit is configured to update the next update time of the service node with the abnormality in the status information table by using the new next update time.
- the device further comprises:
- the invalid service node determining module is configured to use the service node as a failed service node when there is no abnormality in the service node.
- the device further comprises:
- a failed service node deletion module configured to delete the failed service node in the central node
- a failure service node notification module configured to notify other service nodes in the distributed system The failed service node.
- the central node confirms whether the service node has an abnormality according to the survival state information reported by the service node and the current system information of the central node itself.
- the central node further determines Its own state information is processed for the service node with an exception.
- the state of the central node itself can be integrated, and the service node with abnormality can be adaptively processed, and the state of the service node is misjudged due to the problem of the central node itself, and the probability of the central node error is reduced.
- 1 is a schematic diagram of a workflow of a distributed system central node and a service node
- Embodiment 1 is a flow chart showing the steps of Embodiment 1 of a method for processing a node in a distributed system according to the present application;
- Embodiment 3 is a flow chart showing the steps of Embodiment 2 of a method for processing a node in a distributed system according to the present application;
- FIG. 4 is a flow chart showing the working steps of a distributed system central node and a service node according to the present application
- FIG. 5 is a schematic diagram of a working principle of a distributed system central node and a service node according to the present application
- FIG. 6 is a structural block diagram of an embodiment of a processing device for a node in a distributed system of the present application.
- the node may include a service node and a central node, and the method may specifically include the following steps:
- Step 101 Acquire survival state information of the service node.
- a service node refers to a node in a distributed system that has a storage function or a service processing function, and is usually a server and the like
- a central node refers to a node in a distributed system that has a function of coordinating a service node, usually for control.
- Equipment such as equipment.
- the distributed information system may include a status information table, and the step 101 may include the following sub-steps:
- Sub-step S11 receiving the survival status information uploaded by the service node
- Sub-step S12 updating the status information table by using the survival status information of the service node.
- the service node is coordinated by the central node, so the central node needs to know whether the service node works normally. It can be understood that the service node needs to perform many tasks as a device with functions such as storage and service, and in the process of executing the task, the task may be repeatedly executed due to excessive tasks and the remaining memory is too small, and the system is faulty.
- the node needs to report the survival status information to inform the central node whether there is an abnormality or failure, and the central node will perform corresponding processing according to whether the service node has an abnormality or failure.
- a central state information table is stored at the central node for storing survival state information capable of reflecting the survival state of the service node.
- the service node periodically reports its survival status information, and the central node saves the survival status information to the status information table, and updates the node status of the service node accordingly.
- the survival status information can also The central node sends a request to the service node when it is idle, so that it needs to report its survival status information, which is not limited in this embodiment of the present application.
- Step 102 Acquire current system information of the central node.
- Step 103 using the survival state information and the current system information to determine whether the service node is abnormal; if the service node is abnormal, step 104 is performed;
- the survival status information may include a next update time of the service node
- the current system information may include a current system time of the central node
- the step 103 may include the following step:
- Sub-step S21 when the preset time is reached, traversing the next update time in the state information table
- Sub-step S22 determining whether the service node has an abnormality by using the next update time and the current system time.
- the state information table stores the next update time of the service node, and the next update time is the time that the service node reports to the central node according to its own task scheduling situation, and the next time the survival state is updated.
- the service node determines that the next update time is 2016.02.24 according to its own task scheduling. If the service node has no abnormality, the survival status information should be reported to the central node before 2016.02.24.
- the current system information may include the current system time when the central node performs abnormal judgment on the service node. For example, the current system time may be 2016.02.25.
- the time unit of the next update time and the current system time may be accurate, minute, minute, or roughly as the month, this application.
- the embodiment does not limit this.
- the central node When the preset time is reached, it starts to detect whether there is an abnormality in the service node. Specifically, the central node starts to acquire its current system time, traverses the next update time in the state information table, and compares it with the current system time to determine whether the service node has an abnormality. Its The period of the traversal state information table may be set to a fixed period, for example, 30 seconds, 1 minute, 10 minutes, or 20 minutes, etc., and the time of traversal may also be determined by the business requirement.
- the sub-step S22 may include the following sub-steps:
- Sub-step S22-11 determining whether the next update time is less than the current system time; if yes, executing sub-step S22-12, and if not, executing sub-step S22-13;
- Sub-step S22-12 determining the service node as having an abnormality
- Sub-step S22-13 the service node is determined to be absent from the abnormality.
- next update time of the service node is less than the current system time of the central node. It can be understood that the next update time is the time when the service node reports the survival status information next time. If the time is less than the current system time, the service node has exceeded the time that should be reported. If the time is greater than or equal to the current system time, the service node has not exceeded the time that should be reported. It can be determined that there is no abnormality.
- Step 104 Acquire central state information of the central node.
- Step 105 Process the service node with an abnormality according to the central state information.
- the embodiment of the present application may affect the judgment of the abnormality of the service node in consideration of the state of the central node itself. Therefore, the central state information of the central node itself is also combined to further process the existence.
- An abnormal service node When determining the existence of an abnormal service node, the embodiment of the present application may affect the judgment of the abnormality of the service node in consideration of the state of the central node itself. Therefore, the central state information of the central node itself is also combined to further process the existence. An abnormal service node.
- the central node confirms whether the service node has an abnormality according to the survival state information reported by the service node and the current system information of the central node itself. When it is determined that the service node has an abnormality, the central node further determines The central state information of itself is processed for the service node with an abnormality.
- the state of the central node itself can be integrated, and the service node with abnormality is adaptively processed, and the state of the service node is reduced due to the problem of the central node itself. Judging the situation, reducing the probability of errors in the central node.
- the node may include a service node and a central node, and the method may specifically include the following steps:
- Step 201 Acquire survival state information of the service node.
- Step 202 Obtain current system information of the central node.
- Step 203 The survival state information and the current system information are used to determine whether the service node has an abnormality. If the service node is abnormal, step 204 is performed. If the service node does not have an abnormality, perform the step. 207;
- Step 204 Acquire central state information of the central node; the central state information may include network busy condition data and/or system resource usage data;
- Step 205 using the network busy condition data and / or system resource usage data to determine whether the central node is overloaded; if yes, proceed to step 206;
- the network busy condition data may be embodied as network throughput and network packet loss rate
- the system resource usage data may be embodied as an average load of the system.
- network throughput is simply referred to as throughput, which refers to the amount of data successfully transmitted through a network (or a certain channel, a node) at any given time. Throughput depends on the current available bandwidth of the central node network and is limited by the network bandwidth. Throughput is often an important indicator of network testing in actual network engineering, for example, to measure the performance of network devices.
- the network packet loss rate refers to the ratio of the amount of lost data to the amount of data sent. The packet loss rate is related to network load, data length, and data transmission frequency.
- the system average load (load average) is the average number of processes in the queue that the central node runs during a specific time interval.
- the step 205 may include the following sub-steps:
- Sub-step S31 determining whether the network throughput is greater than or equal to the network bandwidth
- Sub-step S32 determining that the network packet loss rate is greater than a preset packet loss rate
- Sub-step S33 determining whether the average load of the system is greater than a preset load threshold; if the network throughput is greater than or equal to the network bandwidth, and/or, the network packet loss rate is greater than a preset packet loss rate, and/or, If the average load of the system is greater than a preset threshold, perform sub-step S34;
- Sub-step S34 determining that the central node is overloaded.
- the formula for calculating the busy condition of the central node network is:
- N ranges from 1-100.
- the formula for calculating the system resource usage of the central node is:
- System resource usage system load average value > N;
- N is an integer, generally N>1.
- the network node busy situation data and the system resource usage data of the central node are used for judging. If some or all of the data reach certain critical values, indicating that the central node is overloaded, the The first service node that is determined to be abnormal is not necessarily the failed service node, then the next update time of the service node needs to be extended; otherwise, if the central node load is normal, the service node that is previously determined to be abnormal should be invalid. Service node. In this way, by combining the state of the central node itself, the misjudgment of the service node due to the central node itself can be reduced.
- Step 206 Update survival state information of the service node that has an abnormality in the state information table.
- the step 206 may include the following sub-steps:
- Sub-step S41 extending the service node of the abnormality in the state information table Update time.
- the central node combines its own node network busy condition and system resource usage to determine the failure of the service node. If the network is very busy or the system resources are busy, then the central node may determine the failure of the service node at this time. The reliability is low. For example, it is possible that the service node survival status update in the survival status information table is invalid due to busy resources. At this time, the judgment of the central node may not be adopted, and the central node processing failure is determined, and the status information table is extended accordingly. The previous update time of the abnormal service node is determined.
- the step 206 may include the following sub-steps:
- Sub-step S51 sending an update request to the service node
- Sub-step S52 receiving new survival state information uploaded by the service node for the update request; the new survival state information includes a new next update time;
- Sub-step S53 updating the next update time of the service node with the abnormality in the status information table by using the new next update time.
- the central node may automatically extend the next update time of the service node according to its own state, or actively initiate a request for status update to the service node to extend the next update time of the service node, and reduce the misjudgment of the service node status due to the problem of the central node itself. The situation arises.
- the central node may send an update request to the service node, and after receiving the request, the service node restarts according to its own task scheduling situation.
- the central node updates the status information table with the new next update time to extend the next update time of the service node.
- Step 207 the service node is regarded as a failed service node.
- the serving node is regarded as a failed service section After the step, it also includes:
- the related information of the failed service node in the central node may be deleted, such as a registry.
- the information about the failed service node of other service nodes in the distributed system may be notified, for example, the IP address of the failed service node, and the service node may locally clear the invalid service after receiving the notification. Information about the node.
- the service node reports the survival status information to the central node.
- the central node updates the status information table according to the survival status information reported by the service node, where the update content includes: the latest update time and the next update time;
- the central node scans the survival status information table.
- S5. Determine whether the next update time of a service node is less than the current system time. If yes, execute S6. If no, return to S4 to continue scanning the survival status information table.
- the central node determines the busyness of the node network and the usage of the system resources. If the network is very busy or the system resources are busy, the next update time of the service node in the survival state information table is extended;
- the central node combines its own state and loses the service node.
- the effect judgment can reduce the misjudgment caused by the network node congestion or system resource problem of the central node without updating the node state information table, and reduce the probability of the central node error.
- FIG. 6 is a structural block diagram of an embodiment of a processing device for a node in a distributed system of the present application.
- the node includes a service node and a central node, and the device may specifically include the following modules:
- a survival status information obtaining module 301 configured to acquire survival status information of the service node
- the distributed system includes a status information table, and the survival status information obtaining module 301 may include the following sub-modules:
- a survival status information receiving submodule configured to receive survival status information uploaded by the service node
- the first state information table update submodule is configured to update the state information table by using the survival state information of the service node.
- the current system information obtaining module 302 is configured to acquire current system information of the central node
- the service node abnormality determining module 303 is configured to determine, by using the survival state information and the current system information, whether the service node has an abnormality; if the service node is different Often, the central state information acquisition module is called;
- the survival status information includes a next update time of the service node
- the current system information includes a current system time of the central node
- the service node abnormality determining module 303 may include The following submodules:
- the status information table traverses the sub-module, and is configured to traverse the next update time in the status information table when the preset time is reached;
- the service node abnormality determining submodule is configured to determine whether the service node has an abnormality by using the next update time and the current system time.
- the service node abnormality determining submodule includes:
- a time determining unit configured to determine whether the next update time is less than the current system time; if yes, the first determining unit is invoked, and if not, the second determining unit is invoked;
- a first determining unit configured to determine that the service node is abnormal
- a second determining unit configured to determine, by the serving node, that there is no abnormality.
- the central state information obtaining module 304 is configured to acquire central state information of the central node
- the abnormal service node processing module 305 is configured to process the service node with an abnormality according to the central state information.
- the central state information includes network busy condition data and/or system resource usage data
- the abnormal service node processing module 305 includes:
- a central node state determining submodule configured to determine, by using the network busy condition data and/or system resource usage data, whether the central node is overloaded; if yes, calling the second state information table to update the submodule;
- a second status information table update submodule for updating the presence in the status information table The survival status information of the abnormal service node.
- the network busy condition data includes network throughput
- the system resource usage data includes an average load of the system
- the central node status determining submodule includes:
- a first network busy condition determining unit configured to determine whether the network throughput is greater than or equal to a network bandwidth
- a second network busy condition determining unit configured to determine that the network packet loss rate is greater than a preset packet loss rate
- a system resource usage determining unit configured to determine whether an average load of the system is greater than a preset load threshold
- the central node load determining unit is configured to: when the network throughput is greater than or equal to the network bandwidth, and/or the network packet loss rate is greater than a preset packet loss rate, and/or the average load of the system is greater than a preset load At the threshold, it is determined that the central node is overloaded.
- the second status information table update submodule includes:
- a next update time extension unit configured to extend a next update time of the service node having the abnormality in the status information table
- the second status information table update submodule includes:
- An update request sending unit configured to send an update request to the service node
- a next update time receiving unit configured to receive new survival state information uploaded by the service node for the update request; the new survival state information includes a new next update time;
- the next update time update unit is configured to update the next update time of the service node with the abnormality in the status information table by using the new next update time.
- the device further includes:
- the invalid service node determining module is configured to use the service node as a failed service node when there is no abnormality in the service node.
- the device further includes:
- a failed service node deletion module configured to delete the failed service node in the central node
- the invalid service node notification module is configured to notify the other service nodes in the distributed system of the failed service node.
- the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.
- embodiments of the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
- computer-usable storage media including but not limited to disk storage, CD-ROM, optical storage, etc.
- the computer device includes one or more processors (CPU), input/output interface, network interface, and memory.
- the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
- RAM random access memory
- ROM read only memory
- Memory is an example of a computer readable medium.
- Computer readable media includes both permanent and non-persistent, removable and non-removable media.
- Information storage can be implemented by any method or technology.
- the information can be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory.
- PRAM phase change memory
- SRAM static random access memory
- DRAM dynamic random access memory
- RAM random access memory
- ROM read only memory
- EEPROM electrically erasable programmable read only memory
- flash memory or other memory technology
- compact disk read only memory CD-ROM
- DVD digital versatile disk
- Magnetic tape cartridges magnetic tape storage or other magnetic storage devices or any other non-transportable media can be used to store information that can be accessed by a computing device.
- computer readable media does not include non-persistent computer readable media, such as modulated data signals and carrier waves.
- Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowchart illustrations and/or FIG. These computer programs are available Having a processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing terminal device to produce a machine such that instructions executed by a processor of a computer or other programmable data processing terminal device are generated for implementation Flowchart A process or a plurality of processes and/or a block diagram of a device of a function specified in a block or blocks.
- the computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device.
- the instruction device implements the functions specified in one or more blocks of the flowchart or in a flow or block of the flowchart.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- General Engineering & Computer Science (AREA)
- Cardiology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Computer And Data Communications (AREA)
- Hardware Redundancy (AREA)
Abstract
Description
Claims (20)
- 一种分布式系统中节点的处理方法,其特征在于,所述节点包括服务节点和中心节点,所述的方法包括:获取所述服务节点的存活状态信息;获取所述中心节点的当前系统信息;采用所述存活状态信息和所述当前系统信息,确定所述服务节点是否存在异常;若所述服务节点存在异常,则获取所述中心节点的中心状态信息;依据所述中心状态信息处理所述存在异常的服务节点。
- 根据权利要求1所述的方法,其特征在于,所述分布式系统中包括状态信息表,所述获取服务节点的存活状态信息包括:接收所述服务节点上传的存活状态信息;采用所述服务节点的存活状态信息更新所述状态信息表。
- 根据权利要求1或2所述的方法,其特征在于,所述存活状态信息包括服务节点的下次更新时间,所述当前系统信息包括所述中心节点的当前系统时间,所述采用存活状态信息和所述当前系统信息,确定所述服务节点是否存在异常的步骤包括:当达到预设时间时,遍历所述状态信息表中的下次更新时间;采用所述下次更新时间和所述当前系统时间确定所述服务节点是否存在异常。
- 根据权利要求3所述的方法,其特征在于,所述采用下次更新时间和所述当前系统时间确定所述服务节点是否存在异常的步骤包括:判断所述下次更新时间是否小于所述当前系统时间;若是,则将所述服务节点确定为存在异常;若否,则将所述服务节点确定为不存在异常。
- 根据权利要求1或2所述的方法,其特征在于,所述中心状态信息包括网络繁忙情况数据和/或系统资源使用情况数据,所述依据中心状态信息处理所述存在异常的服务节点的步骤包括:采用所述网络繁忙情况数据和/或系统资源使用情况数据确定所述中心节点是否负荷过重;若是,则更新所述状态信息表中所述存在异常的服务节点的存活状态信息。
- 根据权利要求5所述的方法,其特征在于,所述网络繁忙情况数据包括网络吞吐量和网络丢包率,所述系统资源使用情况数据包括系统的平均负荷,所述采用网络繁忙情况数据和/或系统资源使用情况数据确定所述中心节点是否负荷过重的步骤包括:判断所述网络吞吐量是否大于等于网络带宽;判断所述网络丢包率大于预设丢包率;判断所述系统的平均负荷是否大于预设负荷阈值;若所述网络吞吐量大于等于网络带宽,和/或,所述网络丢包率大于预设丢包率,和/或,所述系统的平均负荷大于预设负荷阈值,则确定所述中心节点负荷过重。
- 根据权利要求5所述的方法,其特征在于,所述更新状态信息表中所述存在异常的服务节点的存活状态信息的步骤包括:延长所述状态信息表中所述存在异常的服务节点的下次更新时间。
- 根据权利要求5所述的方法,其特征在于,所述更新状态信息表中所述存在异常的服务节点的存活状态信息的步骤包括:向所述服务节点发送更新请求;接收所述服务节点针对所述更新请求上传的新的存活状态信息;所述新的存活状态信息中包括新的下次更新时间;采用所述新的下次更新时间更新所述状态信息表中所述存在异常的服务节点的下次更新时间。
- 根据权利要求1所述的方法,其特征在于,还包括:若所述服务节点不存在异常,则将所述服务节点作为失效的服务节点。
- 根据权利要求1所述的方法,其特征在于,所述将服务节点作为失效的服务节点的步骤之后,还包括:在所述中心节点的中删除所述失效的服务节点;通知所述分布式系统中其他服务节点所述失效的服务节点。
- 一种分布式系统中节点的处理装置,其特征在于,所述节点包括服务节点和中心节点,所述的装置包括:存活状态信息获取模块,用于获取所述服务节点的存活状态信息;当前系统信息获取模块,用于获取所述中心节点的当前系统信息;服务节点异常确定模块,用于采用所述存活状态信息和所述当前系统信息,确定所述服务节点是否存在异常;若所述服务节点存在异常,则调用中心状态信息获取模块;中心状态信息获取模块,用于获取所述中心节点的中心状态信息;异常服务节点处理模块,用于依据所述中心状态信息处理所述存在异常的服务节点。
- 根据权利要求11所述的装置,其特征在于,所述分布式系统中包括状态信息表,所述存活状态信息获取模块包括:存活状态信息接收子模块,用于接收所述服务节点上传的存活状态信息;第一状态信息表更新子模块,用于采用所述服务节点的存活状态信息更新所述状态信息表。
- 根据权利要求11或12所述的装置,其特征在于,所述存活状态信息包括服务节点的下次更新时间,所述当前系统信息包括所述中心节点的当前系统时间,所述服务节点异常确定模块包括:状态信息表遍历子模块,用于当达到预设时间时,遍历所述状态信息表中的下次更新时间;服务节点异常确定子模块,用于采用所述下次更新时间和所述当前系统时间确定所述服务节点是否存在异常。
- 根据权利要求13所述的装置,其特征在于,所述服务节点异常确定子模块包括:时间判断单元,用于判断所述下次更新时间是否小于所述当前系统时间;若是,则调用第一确定单元,若否,则调用第二确定单元;第一确定单元,用于将所述服务节点确定为存在异常;第二确定单元,用于将所述服务节点确定为不存在异常。
- 根据权利要求11或12所述的装置,其特征在于,所述中心状态信息包括网络繁忙情况数据和/或系统资源使用情况数据,所述异常服务节点处理模块包括:中心节点状态确定子模块,用于采用所述网络繁忙情况数据和/或系统资源使用情况数据确定所述中心节点是否负荷过重;若是,则调用第二状态信息表更新子模块;第二状态信息表更新子模块,用于更新所述状态信息表中所述存在异常的服务节点的存活状态信息。
- 根据权利要求15所述的装置,其特征在于,所述网络繁忙情况数据包括网络吞吐量和网络丢包率,所述系统资源使用情况数据包括系统的平均负荷,所述中心节点状态确定子模块包括:第一网络繁忙情况判断单元,用于判断所述网络吞吐量是否大于等 于网络丢带宽;第二网络繁忙情况判断单元,用于判断所述网络丢包率大于预设丢包率;系统资源使用情况判断单元,用于判断所述系统的平均负荷是否大于预设负荷阈值;中心节点负荷确定单元,用于在所述网络吞吐量大于等于网络带宽,和/或,所述网络丢包率大于预设丢包率,和/或,所述系统的平均负荷大于预设负荷阈值时,确定所述中心节点负荷过重。
- 根据权利要求15所述的装置,其特征在于,所述第二状态信息表更新子模块包括:下次更新时间延长单元,用于延长所述状态信息表中所述存在异常的服务节点的下次更新时间。
- 根据权利要求15所述的装置,其特征在于,所述第二状态信息表更新子模块包括:更新请求发送单元,用于向所述服务节点发送更新请求;下次更新时间接收单元,用于接收所述服务节点针对所述更新请求上传的新的存活状态信息;所述新的存活状态信息中包括新的下次更新时间;下次更新时间更新单元,用于采用所述新的下次更新时间更新所述状态信息表中所述存在异常的服务节点的下次更新时间。
- 根据权利要求11所述的装置,其特征在于,还包括:失效服务节点确定模块,用于在所述服务节点不存在异常时,将所述服务节点作为失效的服务节点。
- 根据权利要求11所述的装置,其特征在于,还包括:失效服务节点删除模块,用于在所述中心节点的中删除所述失效的 服务节点;失效服务节点通知模块,用于通知所述分布式系统中其他服务节点所述失效的服务节点。
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP17773129.6A EP3439242B1 (en) | 2016-03-31 | 2017-03-22 | Method and apparatus for node processing in distributed system |
| SG11201808551UA SG11201808551UA (en) | 2016-03-31 | 2017-03-22 | Method and apparatus for node processing in distributed system |
| US16/146,130 US20190036798A1 (en) | 2016-03-31 | 2018-09-28 | Method and apparatus for node processing in distributed system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610201955.2A CN107294799B (zh) | 2016-03-31 | 2016-03-31 | 一种分布式系统中节点的处理方法和装置 |
| CN201610201955.2 | 2016-03-31 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/146,130 Continuation US20190036798A1 (en) | 2016-03-31 | 2018-09-28 | Method and apparatus for node processing in distributed system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017167099A1 true WO2017167099A1 (zh) | 2017-10-05 |
Family
ID=59963464
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2017/077717 Ceased WO2017167099A1 (zh) | 2016-03-31 | 2017-03-22 | 一种分布式系统中节点的处理方法和装置 |
Country Status (6)
| Country | Link |
|---|---|
| US (1) | US20190036798A1 (zh) |
| EP (1) | EP3439242B1 (zh) |
| CN (1) | CN107294799B (zh) |
| SG (1) | SG11201808551UA (zh) |
| TW (1) | TW201742403A (zh) |
| WO (1) | WO2017167099A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110708177A (zh) * | 2018-07-09 | 2020-01-17 | 阿里巴巴集团控股有限公司 | 分布式系统中的异常处理方法、系统和装置 |
Families Citing this family (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10659561B2 (en) * | 2017-06-09 | 2020-05-19 | Microsoft Technology Licensing, Llc | Service state preservation across nodes |
| CN108881407A (zh) * | 2018-05-30 | 2018-11-23 | 郑州云海信息技术有限公司 | 一种信息处理方法及装置 |
| CN108833205B (zh) * | 2018-06-05 | 2022-03-29 | 中国平安人寿保险股份有限公司 | 信息处理方法、装置、电子设备及存储介质 |
| CN111342986B (zh) * | 2018-12-19 | 2022-09-16 | 杭州海康威视系统技术有限公司 | 分布式节点管理方法及装置、分布式系统、存储介质 |
| CN110213106B (zh) * | 2019-06-06 | 2022-04-19 | 宁波三星医疗电气股份有限公司 | 一种设备信息管理方法、装置、系统及电子设备 |
| CN110716985B (zh) * | 2019-10-16 | 2022-09-09 | 北京小米移动软件有限公司 | 一种节点信息处理方法、装置及介质 |
| CN110730110A (zh) * | 2019-10-18 | 2020-01-24 | 深圳市网心科技有限公司 | 节点异常处理方法、电子设备、系统及介质 |
| CN113064732B (zh) * | 2020-01-02 | 2024-05-31 | 阿里巴巴集团控股有限公司 | 一种分布式系统及其管理方法 |
| CN114257495A (zh) * | 2021-11-16 | 2022-03-29 | 国家电网有限公司客户服务中心 | 一种云平台计算节点异常自动处置系统 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050144505A1 (en) * | 2003-11-28 | 2005-06-30 | Fujitsu Limited | Network monitoring program, network monitoring method, and network monitoring apparatus |
| CN101188527A (zh) * | 2007-12-24 | 2008-05-28 | 杭州华三通信技术有限公司 | 一种心跳检测方法和装置 |
| CN104618466A (zh) * | 2015-01-20 | 2015-05-13 | 上海交通大学 | 基于消息传递的负载均衡和过负荷控制系统及其控制方法 |
| CN105357069A (zh) * | 2015-11-04 | 2016-02-24 | 浪潮(北京)电子信息产业有限公司 | 分布式节点服务状态监测的方法、装置及系统 |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7003575B2 (en) * | 2001-10-15 | 2006-02-21 | First Hop Oy | Method for assisting load balancing in a server cluster by rerouting IP traffic, and a server cluster and a client, operating according to same |
| JP2005293101A (ja) * | 2004-03-31 | 2005-10-20 | Pacific Ind Co Ltd | 光lan装置 |
| WO2010052028A1 (en) * | 2008-11-07 | 2010-05-14 | Nokia Siemens Networks Oy | Inter-network carrier ethernet service protection |
| US8364775B2 (en) * | 2010-08-12 | 2013-01-29 | International Business Machines Corporation | High availability management system for stateless components in a distributed master-slave component topology |
| CN102231681B (zh) * | 2011-06-27 | 2014-07-30 | 中国建设银行股份有限公司 | 一种高可用集群计算机系统及其故障处理方法 |
| CN102387210B (zh) * | 2011-10-25 | 2014-04-23 | 曙光信息产业(北京)有限公司 | 一种基于快速同步网络的分布式文件系统监控方法 |
| WO2013145325A1 (ja) * | 2012-03-30 | 2013-10-03 | 富士通株式会社 | 情報処理システム、障害検知方法および情報処理装置 |
| CN103001809B (zh) * | 2012-12-25 | 2016-12-28 | 曙光信息产业(北京)有限公司 | 用于云存储系统的服务节点状态监控方法 |
| WO2016147281A1 (ja) * | 2015-03-16 | 2016-09-22 | 株式会社日立製作所 | 分散型ストレージシステム及び分散型ストレージシステムの制御方法 |
| CN104933132B (zh) * | 2015-06-12 | 2019-11-19 | 深圳巨杉数据库软件有限公司 | 基于操作序列号的分布式数据库有权重选举方法 |
-
2016
- 2016-03-31 CN CN201610201955.2A patent/CN107294799B/zh active Active
-
2017
- 2017-02-22 TW TW106105965A patent/TW201742403A/zh unknown
- 2017-03-22 SG SG11201808551UA patent/SG11201808551UA/en unknown
- 2017-03-22 WO PCT/CN2017/077717 patent/WO2017167099A1/zh not_active Ceased
- 2017-03-22 EP EP17773129.6A patent/EP3439242B1/en active Active
-
2018
- 2018-09-28 US US16/146,130 patent/US20190036798A1/en not_active Abandoned
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050144505A1 (en) * | 2003-11-28 | 2005-06-30 | Fujitsu Limited | Network monitoring program, network monitoring method, and network monitoring apparatus |
| CN101188527A (zh) * | 2007-12-24 | 2008-05-28 | 杭州华三通信技术有限公司 | 一种心跳检测方法和装置 |
| CN104618466A (zh) * | 2015-01-20 | 2015-05-13 | 上海交通大学 | 基于消息传递的负载均衡和过负荷控制系统及其控制方法 |
| CN105357069A (zh) * | 2015-11-04 | 2016-02-24 | 浪潮(北京)电子信息产业有限公司 | 分布式节点服务状态监测的方法、装置及系统 |
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3439242A4 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110708177A (zh) * | 2018-07-09 | 2020-01-17 | 阿里巴巴集团控股有限公司 | 分布式系统中的异常处理方法、系统和装置 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3439242A4 (en) | 2019-10-30 |
| CN107294799A (zh) | 2017-10-24 |
| SG11201808551UA (en) | 2018-10-30 |
| CN107294799B (zh) | 2020-09-01 |
| EP3439242B1 (en) | 2025-12-03 |
| US20190036798A1 (en) | 2019-01-31 |
| TW201742403A (zh) | 2017-12-01 |
| EP3439242A1 (en) | 2019-02-06 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2017167099A1 (zh) | 一种分布式系统中节点的处理方法和装置 | |
| CN108400904B (zh) | 一种基于微服务架构的健康检查方法和装置 | |
| CN110740072B (zh) | 一种故障检测方法、装置和相关设备 | |
| CN108737132B (zh) | 一种告警信息处理方法及装置 | |
| CN110727560B (zh) | 云服务报警方法及装置 | |
| CN106856489A (zh) | 一种分布式存储系统的服务节点切换方法和装置 | |
| US10545817B2 (en) | Detecting computer system anomaly events based on modified Z-scores generated for a window of performance metrics | |
| CN105635331A (zh) | 一种分布式环境下的服务寻址方法及装置 | |
| CN109274544B (zh) | 一种分布式存储系统的故障检测方法及装置 | |
| US20180262581A1 (en) | State Information For a Service | |
| JP7623364B2 (ja) | イベント通知方法、システム、サーバデバイス、コンピュータ記憶媒体 | |
| CN104239156A (zh) | 一种外部服务的调用方法及系统 | |
| CN112860505B (zh) | 一种分布式集群的调控方法及装置 | |
| CN109560976B (zh) | 一种消息延迟的监控方法及装置 | |
| CN107040566B (zh) | 业务处理方法及装置 | |
| CN112737945B (zh) | 服务器连接控制方法及装置 | |
| CN109324914B (zh) | 服务调用方法、服务调用装置及中心服务器 | |
| CN109560949B (zh) | 一种数据处理方法及管理服务器、业务设备 | |
| CN106713014A (zh) | 一种监控系统中的被监控主机、监控系统以及监控方法 | |
| CN108390770B (zh) | 一种信息生成方法、装置及服务器 | |
| CN110955579A (zh) | 一种基于Ambari的大数据平台的监测方法 | |
| TW201828087A (zh) | 分布式儲存系統的服務節點切換方法及裝置 | |
| CN113765686B (zh) | 设备管理方法、装置、业务获取设备及存储介质 | |
| CN113377627B (zh) | 一种业务服务器异常检测方法、系统、设备、存储介质 | |
| CN111984713A (zh) | 一种数据处理方法、装置、设备和存储介质 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| WWE | Wipo information: entry into national phase |
Ref document number: 11201808551U Country of ref document: SG |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2017773129 Country of ref document: EP |
|
| ENP | Entry into the national phase |
Ref document number: 2017773129 Country of ref document: EP Effective date: 20181031 |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17773129 Country of ref document: EP Kind code of ref document: A1 |
|
| WWG | Wipo information: grant in national office |
Ref document number: 2017773129 Country of ref document: EP |