WO2017118080A1 - Central processing unit (CPU) hot removal and hot addition method and apparatus - Google Patents

Central processing unit (CPU) hot removal and hot addition method and apparatus

Info

Publication number
WO2017118080A1
Authority
WO
WIPO (PCT)
Prior art keywords
cpu
topology
indication information
cpus
controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2016/098741
Other languages
English (en)
French (fr)
Inventor
张飞
廖德甫
马樟平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to ES16883210T (patent ES2793006T3)
Priority to EP20159979.2A (patent EP3767470B1)
Priority to EP16883210.3A (patent EP3306476B1)
Publication of WO2017118080A1
Priority to US15/863,350 (patent US10846186B2)
Anticipated expiration
Legal status: Ceased (current)

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operations
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/142Reconfiguring to eliminate the error
    • G06F11/1425Reconfiguring to eliminate the error by reconfiguration of node membership
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/2028Failover techniques eliminating a faulty processor or activating a spare
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2097Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operations
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/142Reconfiguring to eliminate the error
    • G06F11/1428Reconfiguring to eliminate the error with loss of hardware functionality
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2002Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2041Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with more than one idle spare processing component
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/40Bus structure
    • G06F13/4063Device-to-bus coupling
    • G06F13/4068Electrical coupling
    • G06F13/4081Live connection to bus, e.g. hot-plugging
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/42Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/805Real-time

Definitions

  • The present invention relates to multi-CPU interconnection technology, and in particular to a method and an apparatus for hot removal and hot addition of a central processing unit (CPU).
  • Multi-CPU interconnection technologies have been developed in which CPUs are connected to one another through high-speed interconnect channels, such as QPI (Quick Path Interconnect), so that multiple physical CPUs form a single server system that shares resources. Interconnecting multiple CPUs improves the processing performance of a single server but also brings additional risk: in a multi-CPU interconnected system, the failure of any one CPU may hang the entire system. When a CPU is faulty, the entire server system must be powered off before the CPU is replaced. Powering off inevitably interrupts system services and seriously affects the continuous service time of the system.
  • Embodiments of the present invention provide CPU hot removal and hot addition methods that allow a CPU to be replaced without powering off, so that the system keeps working normally and user experience is improved.
  • According to a first aspect, an embodiment of the present application provides a CPU hot removal method.
  • The method is applicable to a server having a first CPU topology that is not fully interconnected. The server includes a controller, and the currently running first CPU topology includes a plurality of CPUs. The method may include: the controller determines a first CPU of the plurality of CPUs, where the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface.
  • The controller determines at least one second CPU of the plurality of CPUs that meets a preset condition with respect to the first CPU.
  • The controller sends second indication information to the first CPU topology.
  • After receiving the second indication information, the first CPU topology removes the first CPU and the at least one second CPU, obtains a second CPU topology, and runs the second CPU topology.
  • This embodiment enables online removal of a CPU: the system works normally during and after the removal, which improves user experience.
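  • The controller-side flow described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `plan_hot_removal`, the modelling of a topology as a set of CPU identifiers, and the `meets_preset_condition` callback are all assumptions introduced for this example.

```python
from typing import Callable, Set, Tuple


def plan_hot_removal(
    first_topology: Set[str],
    first_cpu: str,
    meets_preset_condition: Callable[[str, str], bool],
) -> Tuple[Set[str], Set[str]]:
    """Return (CPUs to remove, second CPU topology) for a given first CPU.

    All names are hypothetical; a topology is modelled as a set of CPU ids.
    """
    if first_cpu not in first_topology:
        raise ValueError(f"{first_cpu} is not in the running topology")

    # Determine the second CPUs: every other CPU that meets the preset
    # condition with respect to the first CPU.
    second_cpus = {
        cpu
        for cpu in first_topology
        if cpu != first_cpu and meets_preset_condition(first_cpu, cpu)
    }
    if not second_cpus:
        raise ValueError(f"no second CPU found for {first_cpu}")

    # The second indication information would name these CPUs; the topology
    # that keeps running afterwards is whatever remains.
    to_remove = {first_cpu} | second_cpus
    return to_remove, first_topology - to_remove
```

  • The preset condition is passed in as a callback so that the same flow can be reused with a backup-CPU rule, a symmetry rule, a port-number rule, or a CPU-group rule.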
  • The plurality of CPUs of the first CPU topology may be connected through intermediate nodes, where an intermediate node comprises a CPU and/or an external node controller (XNC).
  • This embodiment allows a CPU in a CPU topology connected through intermediate nodes to be removed online while the system works normally, which improves user experience.
  • That the controller determines at least one second CPU of the plurality of CPUs that meets the preset condition with respect to the first CPU may include: each CPU in the server may have at least one backup CPU, and the controller may determine at least one backup second CPU of the first CPU.
  • The at least one backup second CPU is in the first CPU topology. In this way, when a CPU is removed, the CPU to be removed and its backup CPU can be removed together, so that the topology remaining after the removal is stable. This ensures that the CPU can be removed while the system operates normally, improving user experience.
  • That the controller determines at least one second CPU of the plurality of CPUs that meets the preset condition with respect to the first CPU may include: the controller determines the location of the first CPU in the first CPU topology, and determines, in the first CPU topology, the second CPUs in at least one position symmetric to the first CPU (for example, centrally symmetric or axially symmetric), or any one second CPU that is in at least one position symmetric to the first CPU and is directly connected to the first CPU.
  • This embodiment removes a CPU together with all, or any one, of the CPUs in symmetric positions, which yields a stable topology, ensures that the system works normally, and improves user experience.
  • Each CPU may have multiple ports, and the plurality of CPUs are connected through the ports. That the controller determines at least one second CPU of the plurality of CPUs that meets the preset condition with respect to the first CPU may specifically include: the controller determines at least one second CPU that is interconnected with the first CPU through ports with the same port number. For example, a CPU has three ports numbered 0, 1, and 2; if two CPUs are connected to each other through port 2, then when one of them needs to be removed, the other also needs to be removed.
  • This embodiment uses ports to determine the CPUs that need to be removed together, which yields a stable CPU topology, ensures that the system works normally, and improves user experience.
  • The first CPU topology includes multiple CPU groups, and information about the CPU groups may be pre-stored in the server. That the controller determines at least one second CPU of the plurality of CPUs that meets the preset condition with respect to the first CPU may include: the controller determines at least one second CPU that belongs to the same CPU group as the first CPU.
  • The first CPU topology reclaims the resources of the first CPU and the at least one second CPU, and disconnects the first CPU and the at least one second CPU from the CPUs in the second CPU topology. The settings of the CPUs in the second CPU topology may also be adjusted, so that after the first CPU and the at least one second CPU are removed, the remaining CPUs run as a stable second CPU topology.
  • This embodiment ensures that the CPU topology obtained after the CPUs are removed works normally, which improves user experience.
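  • The disconnection step can be illustrated with a small sketch. It assumes, purely for illustration, that a topology is stored as an adjacency map from each CPU identifier to the identifiers of its directly connected CPUs; the function name `remove_cpus` is hypothetical, and resource reclamation and CPU settings are not modelled.

```python
from typing import Dict, Set


def remove_cpus(
    links: Dict[str, Set[str]], to_remove: Set[str]
) -> Dict[str, Set[str]]:
    """Return the adjacency map of the topology left after the removal.

    `links` maps each CPU id to the ids of its directly connected CPUs.
    The input map is not modified.
    """
    return {
        cpu: peers - to_remove      # drop connections to the removed CPUs
        for cpu, peers in links.items()
        if cpu not in to_remove     # drop the removed CPUs themselves
    }
```

  • Returning a new map instead of mutating the running one keeps the first CPU topology intact until the second CPU topology is ready to run.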
  • According to a second aspect, an embodiment of the present application provides a CPU hot addition method.
  • The method is applicable to a server having a third CPU topology that is not fully interconnected, and the server includes a controller. The method may include: the controller determines first indication information, where the first indication information indicates that a third CPU is to be added and the third CPU is not in the currently running third CPU topology.
  • The controller determines whether at least one fourth CPU that meets a preset condition with respect to the third CPU has been installed; if so, the controller sends second indication information to the third CPU topology.
  • The third CPU topology adds the third CPU and the at least one fourth CPU, obtains a fourth CPU topology, and runs the fourth CPU topology.
  • This embodiment enables online addition of a CPU, and the system works normally during the addition, which improves user experience.
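  • The check performed before a hot addition can be sketched as follows. The sketch is illustrative only: `plan_hot_add`, the set-based topology model, and the `partners_of` callback (which stands in for whichever preset condition is used) are assumptions, not names from the application.

```python
from typing import Callable, Optional, Set


def plan_hot_add(
    third_topology: Set[str],
    installed_cpus: Set[str],
    third_cpu: str,
    partners_of: Callable[[str], Set[str]],
) -> Optional[Set[str]]:
    """Return the fourth CPU topology, or None if no partner is installed yet."""
    if third_cpu in third_topology:
        raise ValueError(f"{third_cpu} is already in the running topology")

    # Fourth CPUs: partners of the third CPU that are physically installed
    # but not yet part of the running topology.
    fourth_cpus = (partners_of(third_cpu) & installed_cpus) - third_topology
    if not fourth_cpus:
        return None  # keep waiting; no second indication information is sent

    return third_topology | {third_cpu} | fourth_cpus
```

  • Returning `None` models the case in which the third CPU has been installed but its partner has not, so the running topology is left unchanged.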
  • The first indication information may be received through a user interface, in which case it may carry the identifier of the CPU to be added. Alternatively, after the third CPU is installed, a sensor triggers a specific instruction, and the controller obtains the identifier of the third CPU according to that instruction.
  • This embodiment allows CPU addition to be triggered by a specific instruction or through a user interface while the system works normally, which improves user experience.
  • That the controller determines whether at least one fourth CPU that meets the preset condition with respect to the third CPU has been installed includes: determining whether a CPU that is, in the fourth CPU topology, in at least one position symmetric (centrally or axially) to the third CPU has been installed.
  • In this way, when a CPU is added, the CPUs in symmetric positions are added as well, so that the topology obtained after the addition is stable and the system operates normally during the addition, which improves user experience.
  • That the controller determines whether at least one fourth CPU that meets the preset condition with respect to the third CPU has been installed includes: the controller determines whether at least one backup CPU of the third CPU has been installed.
  • This embodiment installs a CPU and its backup CPU at the same time, so that the CPU topology can be expanded while the operating system works normally, which improves user experience.
  • The fourth CPU topology includes multiple CPU groups, and information about the CPU groups may be pre-stored in the server. That the controller determines whether at least one fourth CPU that meets the preset condition with respect to the third CPU has been installed may include: the controller determines whether at least one fourth CPU that belongs to the same CPU group as the third CPU has been installed.
  • In this way, CPUs are added in groups, which ensures that the topology obtained after the addition is still stable and that the system operates normally, improving user experience.
  • The third CPU topology allocates resources to the third CPU and the at least one fourth CPU, and establishes connections between these CPUs and the CPUs in the third CPU topology. The settings of the CPUs in the third CPU topology may also be adjusted. The fourth CPU topology is thereby obtained and run.
  • With this embodiment, the CPU topology obtained after the addition is stable, which ensures normal operation of the system and improves user experience.
  • An embodiment of the present application provides a CPU hot removal apparatus. The apparatus is applicable to a server having a first CPU topology that is not fully interconnected, and the currently running first CPU topology includes a plurality of CPUs. The apparatus includes: a processing unit, configured to determine a first CPU of the plurality of CPUs, where the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface; the processing unit is further configured to determine at least one second CPU of the plurality of CPUs that meets a preset condition with respect to the first CPU; and a sending unit, configured to send second indication information to the first CPU topology, where the second indication information indicates removing the first CPU and the at least one second CPU to obtain a second CPU topology, and running the second CPU topology.
  • The processing unit is further configured to determine the location of the first CPU in the first CPU topology, and to determine, in the first CPU topology, the second CPUs in at least one position symmetric to the first CPU, or any one second CPU that is in at least one such symmetric position and is directly connected to the first CPU.
  • Each CPU has a plurality of ports, and the plurality of CPUs are connected through the ports; the processing unit is further configured to determine at least one second CPU that is interconnected with the first CPU through ports with the same port number.
  • The first CPU topology includes a plurality of CPU groups, and information about the CPU groups is pre-stored in the server; the processing unit is further configured to determine at least one second CPU that belongs to the same CPU group as the first CPU.
  • That the second indication information indicates removing the first CPU and the at least one second CPU includes: the second indication information indicates reclaiming the resources of the first CPU and the at least one second CPU, and disconnecting the first CPU and the at least one second CPU from the CPUs in the second CPU topology.
  • An embodiment of the present application provides a CPU hot addition apparatus. The apparatus is applicable to a server having a third CPU topology that is not fully interconnected, and includes: a processing unit, configured to determine first indication information, where the first indication information indicates adding a third CPU and the third CPU is not in the currently running third CPU topology; the processing unit is further configured to determine whether at least one fourth CPU that meets a preset condition with respect to the third CPU has been installed; and a sending unit, configured to send second indication information to the third CPU topology when the at least one fourth CPU that meets the preset condition with respect to the third CPU has been installed, where the second indication information indicates adding the third CPU and the at least one fourth CPU to obtain a fourth CPU topology, and running the fourth CPU topology.
  • The apparatus further includes: a first receiving unit, configured to receive third indication information through a user interface, where the third indication information includes an identifier of the third CPU; or a second receiving unit, configured to receive, through a sensor, fourth indication information triggered by the third CPU, in which case the processing unit is further configured to determine the installed third CPU according to the fourth indication information.
  • The processing unit is further configured to determine whether a CPU that is, in the fourth CPU topology, in at least one position symmetric to the third CPU has been installed.
  • The fourth CPU topology includes a plurality of CPU groups, and information about the CPU groups is pre-stored in the server; the processing unit is further configured to determine whether at least one fourth CPU that belongs to the same CPU group as the third CPU has been installed.
  • That the second indication information indicates adding the third CPU and the at least one fourth CPU includes: the second indication information indicates allocating resources to the third CPU and the at least one fourth CPU, and establishing connections between these CPUs and the CPUs in the third CPU topology, to obtain the fourth CPU topology and run the fourth CPU topology.
  • An embodiment of the present application provides a server having a CPU topology. The server includes a first CPU topology that is not fully interconnected, a controller, and a memory, where the memory is configured to store instructions for the first aspect above, and the controller and the first CPU topology are configured to execute the instructions.
  • An embodiment of the present application provides a server having a CPU topology. The server includes a third CPU topology that is not fully interconnected, a controller, and a memory, where the memory is configured to store instructions for the second aspect above, and the controller and the third CPU topology are configured to execute the instructions.
  • An embodiment of the present application provides a server having a CPU topology, including a plurality of slots. Independently pluggable CPUs are installed in the slots, and the slots are connected by interconnect channels. The plurality of CPUs installed in the slots operate as a first CPU topology. The server further includes a controller configured to perform the steps of the foregoing first aspect.
  • An embodiment of the present application provides a multi-way server having a CPU topology, including a plurality of slots. Independently pluggable CPUs are installed in the slots, and the slots are connected by interconnect channels. The plurality of CPUs installed in the slots operate as a third CPU topology. The server further includes a controller configured to perform the steps of the second aspect above.
  • an embodiment of the present invention provides a computer storage medium for storing computer software instructions for use in the first aspect described above, including a program designed to perform the above aspects.
  • an embodiment of the present invention provides a computer storage medium for storing computer software instructions for use in the second aspect described above, including a program designed to perform the above aspects.
  • The CPU hot removal and hot addition methods and apparatuses provided by the embodiments of the present invention allow a CPU to be added or removed online. The topology obtained after the removal or addition is still stable, so the normal operation of the system is not affected and user experience is improved.
  • FIG. 1 is a schematic diagram of a CPU topology.
  • FIG. 2 is a schematic diagram of another CPU topology.
  • FIG. 3 is a schematic diagram of a CPU removal process according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of another CPU topology.
  • FIG. 6 is a schematic diagram of another CPU topology.
  • FIG. 7 is a schematic diagram of a CPU hot removal method according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of another CPU topology.
  • FIG. 9 is a schematic diagram of a CPU hot addition method according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a CPU hot removal apparatus according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a CPU hot addition apparatus according to an embodiment of the present invention.
  • FIG. 12 is a schematic structural diagram of a server having a CPU topology according to an embodiment of the present invention.
  • FIG. 13 is a schematic structural diagram of another server having a CPU topology according to an embodiment of the present invention.
  • FIG. 1 is a schematic diagram of a CPU topology.
  • The CPUs may be Intel Xeon processors. The topology includes 8 CPUs connected to one another through high-speed interconnect channels, and FIG. 1 shows a stable topology.
  • In this CPU topology, when one CPU fails, generally not only can that CPU no longer process data, but the channels connected to it may also fail. For example, when the CPU 101 shown in FIG. 1 fails, the connections between the CPU 101 and the CPU 102, between the CPU 101 and the CPU 103, and between the CPU 101 and the CPU 104 all fail. FIG. 2 shows the connections that remain when the CPU 101 fails.
  • The connection of the seven CPUs shown in FIG. 2 is an unstable topology that can cause a system failure or hang during operation.
  • The inventors of the present application recognized this problem and found through analysis that, as shown in FIG. 3, when the CPU 101 needs to be removed, the CPU 103 corresponding to the CPU 101 can be removed together with it, so that a stable 6-CPU topology is obtained.
  • A stable topology of fewer than 8 CPUs can thus be obtained by removing the group of CPUs in which the CPU is located.
  • The structure in FIG. 4 can be obtained by removing CPUs from the structure in FIG. 2, and the structure in FIG. 5 can be obtained by removing 4 CPUs from the structure in FIG. 1. That is, a stable topology can be obtained from a CPU topology by removing a group of CPUs. Correspondingly, a stable topology can also be obtained from a CPU topology by adding a group of CPUs.
  • FIG. 6 is a schematic diagram of a CPU topology.
  • The CPU topology consists of 8 CPUs, which are connected through high-speed interconnect channels or through XNCs (External Node Controllers).
  • FIG. 6 shows two connection methods that use XNCs. Regardless of the connection method, the aforementioned problem exists: when one CPU fails, the connection of the remaining 7 CPUs is an unstable topology. However, no matter which CPU fails, a CPU corresponding to it can be found, and removing the two CPUs yields a stable 6-CPU topology.
  • FIG. 7 is a schematic diagram of a CPU hot removal method according to an embodiment of the present invention.
  • The method may run on a server having a first CPU topology that is not fully interconnected. The steps below may be executed by a specific CPU of the first CPU topology, or by another CPU or a controller outside the first CPU topology, and the instructions required to perform the steps may be stored in a memory.
  • The CPU topology of the server includes a plurality of CPUs, and the method may include the following steps:
  • the first CPU is a CPU that is faulty or needs to be removed according to the first indication information, and the first indication information is from a first CPU topology or a user interface.
  • The server can run a service system and a control system, and the service system can detect and determine a CPU that is at risk of failing or has already failed.
  • The service system is the system that runs on the first CPU topology and mainly processes service tasks. The control system may run on a specific CPU of the CPU topology or on a controller, and is mainly used to control the CPU topology.
  • The first CPU topology determines the CPU that needs to stop working.
  • The first CPU topology sends first indication information to the controller to notify the controller of the identifier of the CPU that needs to be removed.
  • CPUs with poor durability or otherwise poor performance can be removed according to CPU performance.
  • The controller receives the first indication information through the user interface. For example, when a CPU needs to be replaced, the user can enter the identifier of that CPU through the user interface.
  • The controller can also find a failed CPU by checking the CPUs in the first CPU topology, for example, by checking whether a CPU can be powered on normally.
  • Different CPUs can be distinguished by their identifiers. The identifier of a CPU may be, for example, a Socket ID (the ID of the socket) or other information that can be used to identify the CPU.
  • controllers for simplicity of description.
  • The CPUs in one topology can be of the same type.
  • A CPU module generally has multiple ports. Each port on a CPU has a different port number, but CPUs of the same type use the same set of port numbers. CPUs that are interconnected through ports with the same port number can be determined to be a CPU group. When determining at least one second CPU of the plurality of CPUs that meets the preset condition with respect to the first CPU, the controller may therefore determine at least one second CPU that is interconnected with the first CPU through ports with the same port number.
  • The labels 0, 1, and 2 represent QPI port numbers.
  • The CPU groups connected through the same port number are as follows: S0 and S2 are connected through port 2, and S1 and S3, S4 and S6, and S5 and S7 are likewise connected through port 2, so the CPUs form groups in pairs.
  • For example, if S5 fails, the CPU connected to its port 2, namely S7, is found, and S5 and S7 are removed together; the remaining CPUs then form a stable topology.
  • the grouping of CPUs is grouped according to stable topology rules.
  • the controller determines the position of the first CPU in the first CPU topology and, in the first CPU topology, the second CPU in at least one position symmetric to the first CPU, or any one second CPU that is in a symmetric position and is directly connected to the first CPU.
  • the symmetry may be central symmetry or axial symmetry.
  • in the topology in FIG. 3, three CPUs are in positions symmetric to CPU 101: two by axial symmetry and one by central symmetry. All three can be removed, or only any one of them that is directly connected to CPU 101.
  • each CPU in the server may have at least one backup CPU, and the controller may determine at least one backup second CPU of the first CPU.
  • for example, the CPUs in the first CPU topology may be grouped, and the information about the CPU groups pre-stored in the server.
  • the controller may then determine at least one second CPU that belongs to the same CPU group as the first CPU.
  • as another example, the CPUs in the topology shown in FIG. 6 can be paired to form four CPU groups, and the identifiers of the CPUs in each group stored together; when a CPU needs to be removed, the other CPU stored with it is looked up and removed as well.
  • the at least one second CPU that meets the preset condition with the first CPU may be determined by the service system of the server.
  • alternatively, the service system on the server transmits the identifier of the CPU that needs to be removed to the control system (for example, an OS (Operating System), a BIOS (Basic Input Output System), a BMC (Baseboard Management Controller), or other software); the control system determines the second CPU topology that does not include the first CPU and transmits the identifiers of the CPUs that need to be removed to the service system, and the service system removes the corresponding CPUs to obtain the second CPU topology.
  • S730 Send second indication information to the first CPU topology.
  • the second indication information is used to indicate removing the first CPU and the at least one second CPU, obtaining a second CPU topology, and running the second CPU topology.
  • after the CPUs are removed, the server needs to work in the second CPU topology; for example, the service system can run on the second CPU topology.
  • removing a CPU includes reclaiming the resources allocated to that CPU, for example releasing them or moving them to another CPU or CPU topology such as the second CPU topology; the logical connections to the CPU being removed can also be deleted from the CPUs of the second CPU topology, that is, from the CPUs that remain after the removal.
  • the CPUs in the second CPU topology can also be reconfigured so that they operate as the second CPU topology. Further, the CPU being removed can be powered off.
  • the CPUs of the CPU topology may be connected through intermediate nodes, where an intermediate node may be a CPU and/or an external node controller (XNC), as in the topology shown in FIG. 1 or FIG. 6.
  • the CPU topology in the embodiment of the present invention may include an even number of CPUs (for example, 8 or 6); correspondingly, the CPU topology after the removal still has an even number of CPUs.
  • the first CPU topology and the second CPU topology are both stable topologies.
  • when a CPU is faulty or otherwise needs to be removed, it can be removed without affecting normal system operation, and the CPU topology after the removal remains stable, which improves the user experience.
  • FIG. 9 is a schematic diagram of a CPU hot-add method according to an embodiment of the present invention. As shown in FIG. 9, the method can run on a multi-socket server having a CPU topology that is not fully interconnected. The instructions for the following steps may be executed by a specific CPU of that CPU topology, or by another CPU or controller distinct from that CPU topology, and the instructions required to perform the steps may be stored in a memory.
  • the method can include the following steps:
  • S910 Determine first indication information.
  • the first indication information is used to indicate that a third CPU is added, and the third CPU is not in the third CPU topology that is currently running.
  • the user can input an instruction through the user interface, and the controller can receive the instruction, wherein the instruction can carry the identifier of the third CPU.
  • alternatively, after the CPU to be added has been installed, a sensor triggers a specific electrical signal; the controller receives the signal and obtains the identifier of the third CPU from it.
  • the identifier of the CPU may be a Socket ID (the ID of the socket) or any other information that identifies the CPU.
  • the electrical signals triggered by different slots can be different, so that the signal itself indicates which slot has had a CPU installed.
  • alternatively, the electrical signals triggered by different slots may be the same.
  • in that case, on receiving the signal the server knows that a new CPU has been installed, and the service system or the control system determines the identifier of the newly installed CPU.
  • the principle of this mode is the same as that of Mode 2 in step S720 shown in FIG. 7.
  • specifically, the controller determines whether the second CPU in at least one position symmetric to the third CPU in the fourth CPU topology has been installed.
  • the processor may determine whether at least one backup CPU of the first CPU is installed; for example, the fourth CPU topology includes multiple CPU groups,
  • the information about the multiple CPU groups may be pre-stored in the server, and the controller determines whether at least one fourth CPU belonging to the same CPU group as the third CPU has been installed.
  • the third CPU topology may need a whole group of CPUs to be added in order to obtain a stable topology.
  • the hot-add indication information may carry the identifier of only one CPU; the service system then needs to determine whether the other CPUs corresponding to that identifier are in position, and the following steps are performed only when the CPU and all of its corresponding CPUs have been installed.
  • the second indication information is used to indicate adding a third CPU and at least one fourth CPU, obtaining a fourth CPU topology, and running the fourth CPU topology.
  • after receiving the second indication information, the third CPU topology may allocate resources for the third CPU and the at least one fourth CPU, and establish connections between these CPUs and the CPUs of the third CPU topology. The settings of the CPUs in the third CPU topology may also be adjusted so that they, the third CPU, and the at least one fourth CPU can operate as the fourth CPU topology.
  • the CPU topology can thus be expanded without affecting normal system operation; combining the embodiments shown in FIG. 7 and FIG. 9 makes it possible to replace a CPU.
  • as a result, the system runs more stably and the user experience improves.
  • to implement the above functions, the server includes corresponding hardware structures and/or software modules for performing each function.
  • in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present invention can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered to go beyond the scope of the present invention.
  • FIG. 10 is a schematic structural diagram of a central processor CPU hot removal apparatus according to an embodiment of the present invention.
  • the device is applicable to a server having a first CPU topology that is not fully interconnected.
  • the currently running first CPU topology includes a plurality of CPUs, and the device includes:
  • the processing unit 1001 is configured to determine a first CPU of the plurality of CPUs, where the first CPU is a CPU that is faulty or that needs to be removed according to the first indication information, and the first indication information comes from the first CPU topology or a user interface;
  • the processing unit 1001 is further configured to determine at least one second CPU of the plurality of CPUs that meets a preset condition with the first CPU;
  • the sending unit 1002 is configured to send second indication information to the first CPU topology, where the second indication information is used to indicate removing the first CPU and the at least one second CPU, obtaining a second CPU topology, and running the second CPU topology.
  • optionally, the processing unit 1001 is further configured to determine the position of the first CPU in the first CPU topology and, in the first CPU topology, the second CPU in at least one position symmetric to the first CPU, or any one second CPU that is in a symmetric position and is directly connected to the first CPU.
  • optionally, each CPU has multiple ports and the multiple CPUs are connected through the ports; the processing unit 1001 is further configured to determine at least one second CPU that is interconnected with the first CPU through ports with the same port number.
  • optionally, the processing unit 1001 is further configured to determine at least one backup second CPU of the first CPU.
  • further, the first CPU topology includes a plurality of CPU groups whose information is pre-stored in the server; the processing unit 1001 is further configured to determine at least one second CPU that belongs to the same CPU group as the first CPU.
  • optionally, that the second indication information is used to indicate removing the first CPU and the at least one second CPU includes:
  • the second indication information is used to indicate reclaiming the resources in the first CPU and the at least one second CPU, and disconnecting the first CPU and the at least one second CPU from the CPUs in the second CPU topology.
  • FIG. 11 is a schematic structural diagram of a central processor CPU hot add device according to an embodiment of the present invention.
  • the apparatus is suitable for servers having a third CPU topology that is not fully interconnected, the apparatus comprising:
  • the processing unit 1101 is configured to determine first indication information, where the first indication information is used to indicate adding a third CPU, where the third CPU is not in the currently running third CPU topology;
  • the processing unit 1101 is further configured to: determine whether at least one fourth CPU that meets a preset condition with the third CPU is installed;
  • the sending unit 1102 is configured to send second indication information to the third CPU topology when the at least one fourth CPU that meets the preset condition with the third CPU has been installed, where the second indication information is used to indicate adding the third CPU and the at least one fourth CPU, obtaining a fourth CPU topology, and running the fourth CPU topology.
  • optionally, the device further includes:
  • a first receiving unit, configured to receive third indication information through a user interface, where the third indication information includes an identifier of the third CPU; or
  • a second receiving unit, configured to receive, through a sensor, fourth indication information triggered by installing the third CPU, where the processing unit 1101 is further configured to determine the installed third CPU according to the fourth indication information.
  • optionally, the processing unit 1101 is further configured to determine whether the second CPU in at least one position symmetric to the third CPU in the fourth CPU topology has been installed.
  • optionally, the processing unit 1101 is further configured to determine at least one backup second CPU of the first CPU.
  • further, the fourth CPU topology includes a plurality of CPU groups whose information is pre-stored in the server; the processing unit 1101 is further configured to determine whether at least one fourth CPU belonging to the same CPU group as the third CPU has been installed.
  • optionally, that the second indication information is used to indicate adding the third CPU and the fourth CPU includes: the second indication information is used to indicate allocating resources for the third CPU and the at least one fourth CPU, establishing connections between the third CPU and the fourth CPU and the CPUs in the third CPU topology, obtaining a fourth CPU topology, and running the fourth CPU topology.
  • FIG. 12 is a schematic structural diagram of a server with a CPU topology structure according to an embodiment of the present invention.
  • the server may include a CPU topology 1201 and an input/output interface 1202; the figure also shows a memory 1203 and a bus 1204, and the server may further include a controller 1205.
  • the CPU topology 1201, the input/output interface 1202, the memory 1203, and the controller 1205 are connected through the bus 1204 and communicate with one another over it.
  • the memory 1203 is used to store programs; the CPU topology 1201 and the controller 1205 read and execute the programs stored in the memory, and send and receive data and instructions for external devices through the input/output interface 1202.
  • the structure of the CPU topology 1201 includes a plurality of slots; each slot holds a CPU that can be plugged and unplugged independently, and the slots are connected through interconnect channels to form a stable topology.
  • the plurality of CPUs installed in the slots operate as the first CPU topology.
  • a CPU corresponding to the CPU to be removed generally exists in the first CPU topology, and the CPU to be removed and its corresponding CPU can be distinguished from the other CPUs by their slots; for example, if the CPU to be removed and its corresponding CPU are treated as one CPU group, the slots belonging to the same slot group can be marked with the same or similar identifiers.
  • the slots of the same group can also be enclosed in the same box on the motherboard, or marked with the same color.
  • the memory 1203 may be a single storage device or a collective name for a plurality of storage elements, and is used to store the executable program code for the above steps or the parameters, data, and the like required for the operation of the access network management device. The memory 1203 may include random access memory (RAM) and may also include non-volatile memory such as magnetic disk memory or flash memory.
  • the bus 1204 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus 1204 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in FIG. 12, but this does not mean that there is only one bus or one type of bus.
  • FIG. 13 is a schematic structural diagram of another server with a CPU topology according to an embodiment of the present invention.
  • the multi-socket server may include a CPU topology 1301 and an input/output interface 1302; the figure also shows a memory 1303 and a bus 1304, and the server may further include a controller 1305. The CPU topology 1301, the input/output interface 1302, the memory 1303, and the controller 1305 are connected through the bus 1304 and communicate with one another over it.
  • the structure of the CPU topology 1301 includes a plurality of slots; each slot holds a CPU that can be plugged and unplugged independently, and the slots are connected through interconnect channels to form a steady-state third CPU topology.
  • the third CPU topology may reserve a plurality of slots.
  • the CPU to be added and its corresponding CPU can be installed in the reserved slots.
  • to tell the reserved slots apart, if the CPU to be added and its corresponding CPU are treated as one CPU group, the slots belonging to the same slot group can be marked with the same or similar identifiers; the slots of the same group can also be enclosed in the same box on the motherboard, or marked with the same color.
  • the embodiment of the present invention can implement hot plugging of the CPU without affecting the stability of the CPU topology, so that the system can operate normally and improve the user experience.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, a software module executed by a processor, or a combination of both.
  • the software module can be placed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Hardware Redundancy (AREA)

Abstract

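A central processing unit (CPU) hot-remove method, hot-add method, and apparatus. The method applies to a server having a first CPU topology that is not fully interconnected, and includes: a controller determines a first CPU among a plurality of CPUs (S710), where the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface. The controller determines at least one second CPU, among the plurality of CPUs, that meets a preset condition with the first CPU (S720). The controller sends second indication information to the first CPU topology (S730); after receiving the second indication information, the first CPU topology removes the first CPU and the at least one second CPU, obtains a second CPU topology, and runs the second CPU topology. With this method a CPU can be removed online, and the system works normally during and after the removal, which improves the user experience.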

Description

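Central processing unit (CPU) hot-remove and hot-add method and apparatus
This application claims priority to Chinese patent application No. 201610016926.9, filed with the Chinese Patent Office on January 8, 2016 and entitled "Central processing unit (CPU) hot-remove and hot-add method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to multi-CPU interconnect technology, and in particular to a CPU hot-remove method, hot-add method, and apparatus.
Background
With the rapid development of IT (Internet Technology), the volume of data in IT systems of all kinds keeps growing. Take servers that run an enterprise's mission-critical services: because these services sit at the core of enterprise applications, the data and information they process are the users' core business data and information, and usually come in massive quantities. Consider the three most common classes of application in mission-critical fields today: online transactions, business analytics, and databases. Even in an ordinary enterprise the amount of data they handle can be striking, let alone in industries such as banking, telecommunications, and securities, where data volumes routinely reach the TB or PB scale. Data at this scale bears on the production, operation, and decision-making efficiency of business users, so the platform that carries it must offer outstanding processing performance. In addition, with the rise of large-scale in-memory database applications such as HANA (High-Performance Analytic Appliance, an analytics software product), high demands are also placed on the memory capacity of a single server system. More CPUs (Central Processing Units) and more memory therefore need to be integrated into a single server to meet the high-performance, large-capacity requirements of running these services.
This gave rise to multi-CPU interconnect technology: multiple CPUs are connected to one another through high-speed interconnect channels between CPUs (such as QPI, QuickPath Interconnect), so that multiple physical CPUs form a server system with shared resources. While interconnecting multiple CPUs raises the processing performance of a single server, it also brings additional risk: in such a system, the failure of any one CPU may hang the entire system. Repairing the CPU fault requires powering off the whole server system and then replacing the CPU, and this power-off replacement inevitably interrupts system services and severely reduces the continuous service time of the system.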
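Summary
Embodiments of the present invention provide a CPU hot-remove method, hot-add method, and apparatus that allow a CPU to be replaced without powering off while the system continues to work normally, which improves the user experience.
In one aspect, an embodiment of this application provides a CPU hot-remove method. The method applies to a server having a first CPU topology that is not fully interconnected; the server includes a controller, and the currently running first CPU topology includes a plurality of CPUs. The method may include: the controller determines a first CPU among the plurality of CPUs, where the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface. The controller determines at least one second CPU, among the plurality of CPUs, that meets a preset condition with the first CPU. The controller sends second indication information to the first CPU topology; after receiving it, the first CPU topology removes the first CPU and the at least one second CPU, obtains a second CPU topology, and runs the second CPU topology. This embodiment allows a CPU to be removed online, with the system working normally during and after the removal, which improves the user experience.
In one possible design, the plurality of CPUs of the first CPU topology may be connected through intermediate nodes, where an intermediate node includes a CPU and/or an external node controller (XNC). This embodiment allows a CPU in a CPU topology connected through intermediate nodes to be removed online while the system works normally, which improves the user experience.
In one possible design, the controller determining at least one second CPU, among the plurality of CPUs, that meets the preset condition with the first CPU may include: each CPU in the server may have at least one backup CPU, and the controller may determine at least one backup second CPU of the first CPU, where the at least one backup second CPU is in the first CPU topology. In this way, when a CPU is removed, the CPU to be removed and its backup CPU can be removed together, so that the remaining CPUs still form a stable topology; the removal is performed while the system keeps running normally, which improves the user experience.
In one possible design, the controller determining at least one second CPU, among the plurality of CPUs, that meets the preset condition with the first CPU may include: the controller determines the position of the first CPU in the first CPU topology and, in the first CPU topology, the second CPU in at least one position symmetric to the first CPU (for example, by central symmetry or axial symmetry), or any one second CPU that is in at least one symmetric position and is directly connected to the first CPU. In this way, after the CPU and all or any one of the CPUs in symmetric positions are removed, a stable topology is obtained, the system can work normally, and the user experience improves.
In one possible design, each CPU may have multiple ports and the plurality of CPUs are connected through the ports. The controller determining at least one second CPU that meets the preset condition with the first CPU may specifically include: the controller determines at least one second CPU that is interconnected with the first CPU through ports with the same port number (for example, a CPU has three ports numbered 0, 1, and 2; if two CPUs are connected to each other through port 2, then when one of them needs to be removed, the other must be removed with it). In this way, the CPUs that must be removed together are determined by port, a stable CPU topology is obtained, the system can work normally, and the user experience improves.
In one possible design, the first CPU topology includes a plurality of CPU groups whose information may be pre-stored in the server. The controller determining at least one second CPU that meets the preset condition with the first CPU may include: the controller determines at least one second CPU that belongs to the same CPU group as the first CPU. In this way, CPUs are removed group by group, a stable CPU topology is obtained, the system can work normally, and the user experience improves.
In one possible design, after receiving the second indication information, the first CPU topology reclaims the resources in the first CPU and the at least one second CPU and disconnects the first CPU and the at least one second CPU from the CPUs in the second CPU topology; the settings of the CPUs in the second CPU topology may also be adjusted so that, after the first CPU and the at least one second CPU are removed, the remaining CPUs can work as a stable fourth CPU topology. In this way, the CPU topology that remains after the removal works normally, which improves the user experience.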
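In another aspect, an embodiment of this application provides a CPU hot-add method. The method applies to a server having a third CPU topology that is not fully interconnected; the server includes a controller. The method may include: the controller determines first indication information, where the first indication information is used to indicate adding a third CPU, and the third CPU is not in the currently running third CPU topology. The controller determines whether at least one fourth CPU that meets a preset condition with the third CPU has been installed, and if so, sends second indication information to the third CPU topology. After receiving the second indication information, the third CPU topology adds the third CPU and the fourth CPU, obtains a fourth CPU topology, and runs the fourth CPU topology. This embodiment allows a CPU to be added online, with the system working normally during the addition, which improves the user experience.
In one possible design, the first indication information may be received through a user interface and may carry the identifier of the CPU to be added; alternatively, after the third CPU is installed, a sensor triggers a specific instruction, and the controller obtains the identifier of the third CPU according to that instruction. In this way, the addition of a CPU can be triggered by a specific instruction or through the user interface while the system works normally, which improves the user experience.
In one possible design, the controller determining whether at least one fourth CPU that meets the preset condition with the third CPU has been installed includes: the controller determines whether the second CPU in at least one position symmetric (by central symmetry or axial symmetry) to the third CPU in the fourth CPU topology has been installed. In this way, when a CPU is added, the CPUs in positions symmetric to it are added as well, so that a stable topology is obtained after the addition and the system works normally during the addition, which improves the user experience.
In one possible design, the controller determining whether at least one fourth CPU that meets the preset condition with the third CPU has been installed includes: the processor determines whether at least one backup CPU of the first CPU is installed. In this way, a CPU and its backup are installed together, so that the CPU topology is expanded while the operating system works normally, which improves the user experience.
In one possible design, the fourth CPU topology includes a plurality of CPU groups whose information may be pre-stored in the server. The controller determining whether at least one fourth CPU that meets the preset condition with the third CPU has been installed may include: the controller determines whether at least one fourth CPU belonging to the same CPU group as the third CPU has been installed. In this way, CPUs are added group by group, so that the topology after the addition remains stable and the system runs normally, which improves the user experience.
In one possible design, after receiving the second indication information, the third CPU topology allocates resources for the third CPU and the at least one fourth CPU and establishes connections between these CPUs and the CPUs in the third CPU topology; the settings of the CPUs in the third CPU topology may also be adjusted, so that the fourth CPU topology is obtained and run. In this way, the CPU topology after the addition is stable and the system runs normally, which improves the user experience.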
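In yet another aspect, an embodiment of this application provides a CPU hot-remove apparatus. The apparatus applies to a server having a first CPU topology that is not fully interconnected, and the currently running first CPU topology includes a plurality of CPUs. The apparatus includes: a processing unit, configured to determine a first CPU among the plurality of CPUs, where the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface; the processing unit is further configured to determine at least one second CPU, among the plurality of CPUs, that meets a preset condition with the first CPU; and a sending unit, configured to send second indication information to the first CPU topology, where the second indication information is used to indicate removing the first CPU and the at least one second CPU, obtaining a second CPU topology, and running the second CPU topology.
In one possible design, the processing unit is further configured to determine the position of the first CPU in the first CPU topology and, in the first CPU topology, the second CPU in at least one position symmetric to the first CPU, or any one second CPU that is among the CPUs in symmetric positions and is directly connected to the first CPU.
In one possible design, each CPU has multiple ports and the plurality of CPUs are connected through the ports; the processing unit is further configured to determine at least one second CPU that is interconnected with the first CPU through ports with the same port number.
In one possible design, the first CPU topology includes a plurality of CPU groups whose information is pre-stored in the server; the processing unit is further configured to determine at least one second CPU that belongs to the same CPU group as the first CPU.
In one possible design, that the second indication information is used to indicate removing the first CPU and the at least one second CPU includes: the second indication information is used to indicate reclaiming the resources in the first CPU and the at least one second CPU, and disconnecting the first CPU and the at least one second CPU from the CPUs in the second CPU topology.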
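In still another aspect, an embodiment of this application provides a CPU hot-add apparatus. The apparatus applies to a server having a third CPU topology that is not fully interconnected, and includes: a processing unit, configured to determine first indication information, where the first indication information is used to indicate adding a third CPU, and the third CPU is not in the currently running third CPU topology; the processing unit is further configured to determine whether at least one fourth CPU that meets a preset condition with the third CPU has been installed; and a sending unit, configured to send second indication information to the third CPU topology when the at least one fourth CPU that meets the preset condition with the third CPU has been installed, where the second indication information is used to indicate adding the third CPU and the fourth CPU, obtaining a fourth CPU topology, and running the fourth CPU topology.
In one possible design, the apparatus further includes: a first receiving unit, configured to receive third indication information through a user interface, where the third indication information includes the identifier of the third CPU; or a second receiving unit, configured to receive, through a sensor, fourth indication information triggered by installing the third CPU, where the processing unit is further configured to determine the installed third CPU according to the fourth indication information.
In one possible design, the processing unit is further configured to determine whether the second CPU in at least one position symmetric to the third CPU in the fourth CPU topology has been installed.
In one possible design, the fourth CPU topology includes a plurality of CPU groups whose information is pre-stored in the server; the processing unit is further configured to determine whether at least one fourth CPU belonging to the same CPU group as the third CPU has been installed.
In one possible design, that the second indication information is used to indicate adding the third CPU and the fourth CPU includes: the second indication information is used to indicate allocating resources for the third CPU and the at least one fourth CPU, establishing connections between the third CPU and the fourth CPU and the CPUs in the third CPU topology, obtaining a fourth CPU topology, and running the fourth CPU topology.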
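In still another aspect, an embodiment of this application provides a server having a CPU topology. The server includes a first CPU topology that is not fully interconnected, a controller, and a memory, where the memory is configured to store the instructions of the first aspect, and the controller and the first CPU topology are configured to execute the instructions.
In still another aspect, an embodiment of this application provides a server having a CPU topology. The server includes a third CPU topology that is not fully interconnected, a controller, and a memory, where the memory is configured to store the instructions of the second aspect, and the controller and the third CPU topology are configured to execute the instructions.
In still another aspect, an embodiment of this application provides a server having a CPU topology, including a plurality of slots. Each slot holds a CPU that can be plugged and unplugged independently, and the slots are connected through interconnect channels; the plurality of CPUs installed in the slots operate as a first CPU topology. The server further includes a controller configured to perform the steps of the first aspect.
In still another aspect, an embodiment of this application provides a multi-socket server having a CPU topology, including a plurality of slots. Each slot holds a CPU that can be plugged and unplugged independently, and the slots are connected through interconnect channels; the plurality of CPUs installed in the slots operate as a third CPU topology. The server further includes a controller configured to perform the steps of the second aspect.
In still another aspect, an embodiment of the present invention provides a computer storage medium for storing the computer software instructions used in the first aspect, including a program designed to execute that aspect.
In still another aspect, an embodiment of the present invention provides a computer storage medium for storing the computer software instructions used in the second aspect, including a program designed to execute that aspect.
The CPU hot-remove and hot-add methods and apparatus provided by the embodiments of the present invention allow CPUs to be added or removed online; the topology after the removal or addition remains stable and normal system operation is not affected, which improves the user experience.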
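Brief Description of the Drawings
FIG. 1 is a schematic diagram of a CPU topology;
FIG. 2 is a schematic diagram of another CPU topology;
FIG. 3 is a schematic diagram of a CPU removal process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of yet another CPU topology;
FIG. 5 is a schematic diagram of yet another CPU topology;
FIG. 6 is a schematic diagram of yet another CPU topology;
FIG. 7 is a schematic diagram of a CPU hot-remove method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of yet another CPU topology;
FIG. 9 is a schematic diagram of a CPU hot-add method according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a CPU hot-remove apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a CPU hot-add apparatus according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a server having a CPU topology according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of another server having a CPU topology according to an embodiment of the present invention.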
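Detailed Description
To facilitate understanding of the embodiments of the present invention, specific embodiments are further explained below with reference to the accompanying drawings; the embodiments do not limit the embodiments of the present invention.
FIG. 1 is a schematic diagram of a CPU topology. As shown in FIG. 1, the topology may use Intel processors (Intel Xeon Processor) and includes eight CPUs connected to one another through high-speed interconnect channels; FIG. 1 shows a stable topology.
While a CPU topology is running, when one of its CPUs fails, it is generally not only that CPU that can no longer process data: the channels connected to that CPU may fail as well. For example, when CPU 101 in FIG. 1 fails, the connections between CPU 101 and CPU 102, between CPU 101 and CPU 103, and between CPU 101 and CPU 104 all fail. FIG. 2 shows the connections that remain when CPU 101 fails. However, the seven-CPU arrangement shown in FIG. 2 is an unstable topology, which may itself cause the system to fail or hang at run time.
The inventors of this application recognized this problem and found through analysis that, as shown in FIG. 3, when CPU 101 needs to be removed, CPU 103, which corresponds to CPU 101, can be removed together with it, yielding a stable six-CPU topology.
A stable topology with fewer than eight CPUs can therefore be obtained by removing the group of CPUs to which the CPU belongs, for example the six-CPU topology shown in FIG. 4 or the four-CPU topology shown in FIG. 5. The structure in FIG. 4 can be obtained by removing two CPUs from the structure in FIG. 1, and the structure in FIG. 5 by removing four CPUs from the structure in FIG. 1. In other words, removing a group of CPUs from a CPU topology yields a stable topology, and correspondingly, adding a group of CPUs to a CPU topology also yields a stable topology.
FIG. 6 is a schematic diagram of a CPU topology. As shown in FIG. 6, the topology includes eight CPUs connected to one another through high-speed interconnect channels or through XNCs (External Node Controllers). FIG. 6 shows two ways of connecting through an XNC. Either way, the problem described above exists: when one CPU fails, the remaining seven CPUs form an unstable topology. However, whichever CPU fails, a corresponding CPU can always be found, and removing both yields a stable six-CPU topology.
Note that the stable eight-CPU topologies above are only examples; stable topologies with other numbers of CPUs have the same property. The common eight-CPU stable topology is used here for clarity.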
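FIG. 7 is a schematic diagram of a CPU hot-remove method according to an embodiment of the present invention. As shown in FIG. 7, the method may run on a server having a first CPU topology that is not fully interconnected. The instructions for the following steps may be executed by a specific CPU of the first CPU topology, or by another CPU or controller distinct from the first CPU topology, and the instructions required to perform the steps may be stored in a memory. The CPU topology of the server includes a plurality of CPUs, and the method may include the following steps:
S710: Determine a first CPU among the plurality of CPUs. The first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface.
The server can run a service system and a control system, and the service system can detect and identify a CPU that is at risk or has already failed. The service system is the system that runs on the first CPU topology and mainly processes service tasks; the control system may run on a specific CPU of the CPU topology or on a controller, and is mainly used to control the CPU topology.
Alternatively, when the workload is low while the first CPU topology is running and some CPUs should be stopped to save resources, the first CPU topology determines the CPUs that need to stop working. The first CPU topology sends first indication information to the controller to notify the controller of the identifier of the CPU that needs to be removed. CPUs with poor durability or otherwise poor performance can be selected for removal according to CPU performance.
Alternatively, the controller receives the first indication information through the user interface. For example, when a CPU needs to be replaced, the user can enter the identifier of that CPU through the user interface.
The controller can also identify a failed CPU by testing the CPUs in the first CPU topology, for example by checking whether a CPU powers on normally.
In addition, different CPUs can be distinguished by their identifiers; the identifier of a CPU can be a Socket ID (the ID of the socket) or any other information that identifies the CPU.
Note that other CPUs or controllers distinct from the first CPU topology are collectively referred to as the controller for simplicity of description.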
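S720: Determine at least one second CPU, among the plurality of CPUs, that meets a preset condition with the first CPU.
The at least one second CPU that meets the preset condition with the first CPU can be determined in the following ways:
Mode 1: CPUs in the same topology may be of the same type. A CPU module generally has multiple ports, and each port on a CPU may have a different port number, but CPUs of the same type use the same set of port numbers. CPUs interconnected through the same port number can be treated as one CPU group, so when determining the at least one second CPU that meets the preset condition with the first CPU, the controller may determine at least one second CPU that is interconnected with the first CPU through the same port number. For example, FIG. 8 shows a topology with eight CPUs, where X in SX (X = 0, 1, ..., 7) is the Socket ID and the 0, 1, and 2 at the two ends of each link are QPI port numbers. In FIG. 8, the CPU groups connected through the same port number are as follows: S0 and S2 are connected through port 2, and so are S1 and S3, S4 and S6, and S5 and S7; each pair forms a CPU group. When S5 fails, the CPU connected to it through port 2, namely S7, is found, and both S5 and S7 are removed; the remaining CPUs then form a stable topology. Note that CPUs are grouped according to the rules of a stable topology.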
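The port-number rule of Mode 1 can be sketched as a lookup over a link table. This is only an illustration: the `LINKS` table, the function names, and the `group_port` parameter are assumptions made for the sketch, and only the port-2 links named in the text for FIG. 8 are listed.

```python
# Hypothetical link table: (socket_a, port_a, socket_b, port_b).
# Only the port-2 links named in the text are listed; other links are omitted.
LINKS = [
    ("S0", 2, "S2", 2),
    ("S1", 2, "S3", 2),
    ("S4", 2, "S6", 2),
    ("S5", 2, "S7", 2),
]


def same_port_partners(links, cpu, group_port=2):
    """Return the CPUs linked to `cpu` where both link ends use `group_port`."""
    partners = []
    for a, port_a, b, port_b in links:
        if port_a != group_port or port_b != group_port:
            continue  # this link does not define a CPU group
        if a == cpu:
            partners.append(b)
        elif b == cpu:
            partners.append(a)
    return partners


def removal_set(links, failed_cpu, group_port=2):
    """The failed CPU plus every CPU grouped with it by port number."""
    return {failed_cpu, *same_port_partners(links, failed_cpu, group_port)}
```

With this table, `removal_set(LINKS, "S5")` gives the pair S5 and S7, matching the example in the text.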
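Mode 2: The controller determines the position of the first CPU in the first CPU topology and, in the first CPU topology, the second CPU in at least one position symmetric to the first CPU, or any one second CPU that is in at least one symmetric position and is directly connected to the first CPU. The symmetry may be central symmetry or axial symmetry. For example, in the topology in FIG. 3, three CPUs are in positions symmetric to CPU 101: two by axial symmetry and one by central symmetry. All three can be removed, or only any one of them that is directly connected to CPU 101.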
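A minimal sketch of the symmetry rule in Mode 2, assuming the CPUs are laid out on a rectangular grid and identified by `(row, column)` positions. The grid layout, the function names, and the `links` representation are assumptions for illustration; the source does not specify how positions are encoded.

```python
def symmetric_positions(pos, rows, cols):
    """Return the mirror images of `pos` on a rows x cols grid."""
    r, c = pos
    return {
        "axis_horizontal": (rows - 1 - r, c),    # mirror across the horizontal mid-line
        "axis_vertical": (r, cols - 1 - c),      # mirror across the vertical mid-line
        "center": (rows - 1 - r, cols - 1 - c),  # point reflection through the center
    }


def second_cpu_candidates(pos, rows, cols, links, directly_connected_only=False):
    """List positions symmetric to `pos`, optionally keeping only direct neighbours.

    `links` is a set of frozenset({pos_a, pos_b}) entries, one per interconnect link.
    """
    candidates = []
    for mirrored in symmetric_positions(pos, rows, cols).values():
        if mirrored != pos and mirrored not in candidates:
            candidates.append(mirrored)
    if directly_connected_only:
        candidates = [p for p in candidates if frozenset({pos, p}) in links]
    return candidates
```

On a 2 x 4 grid each CPU has three symmetric positions (two axial, one central), which is consistent with the count given for CPU 101 in the text.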
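Mode 3: Each CPU in the server may have at least one backup CPU, and the controller may determine at least one backup second CPU of the first CPU. For example, the CPUs in the first CPU topology may be grouped and the information about the CPU groups pre-stored in the server; the controller may then determine at least one second CPU that belongs to the same CPU group as the first CPU. As another example, the CPUs in the topology shown in FIG. 6 can be paired to form four CPU groups, and the identifiers of the CPUs in each group stored together; when a CPU needs to be removed, the other CPU stored with it is looked up and removed as well.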
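The pre-stored group information of Mode 3 amounts to a lookup table. The sketch below assumes the groups are stored as sets of Socket IDs; the `CPU_GROUPS` contents and the function name are hypothetical.

```python
# Hypothetical pre-stored group table: four pairs of Socket IDs.
CPU_GROUPS = [
    {"S0", "S2"},
    {"S1", "S3"},
    {"S4", "S6"},
    {"S5", "S7"},
]


def group_mates(groups, cpu):
    """Return the other members of the group that contains `cpu`."""
    for group in groups:
        if cpu in group:
            return group - {cpu}
    raise KeyError(f"{cpu} is not in any stored CPU group")
```

Storing whole groups rather than one-way pointers means the lookup works the same way whichever member of a pair is the one that fails.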
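The at least one second CPU that meets the preset condition with the first CPU may be determined by the service system of the server. Alternatively, the service system on the server transmits the identifier of the CPU that needs to be removed to the control system (for example, an OS (Operating System), a BIOS (Basic Input Output System), a BMC (Baseboard Management Controller), or other software); the control system determines the second CPU topology that does not include the first CPU and transmits the identifiers of the CPUs that need to be removed to the service system, and the service system removes the corresponding CPUs to obtain the second CPU topology.
S730: Send second indication information to the first CPU topology. The second indication information is used to indicate removing the first CPU and the at least one second CPU, obtaining a second CPU topology, and running the second CPU topology.
After the CPUs are removed, the server needs to work in the second CPU topology; for example, the service system can run on the second CPU topology.
Note that removing a CPU includes reclaiming the resources allocated to that CPU, for example releasing them or moving them to another CPU or CPU topology such as the second CPU topology. The logical connections to the CPU being removed can also be deleted from the CPUs of the second CPU topology, that is, from the CPUs that remain after the removal, and the CPUs in the second CPU topology can be reconfigured so that they operate as the second CPU topology. Further, the CPU being removed can be powered off. In this way, the CPUs of the second CPU topology hold no information pointing to the first CPU or the at least one second CPU: while the system runs, no task arises that needs to be executed by the first CPU or the at least one second CPU, and the channels to them have been disconnected, so the second CPU topology can run stably.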
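The removal sequence described above (reclaim resources, delete logical connections, power off) can be sketched on a toy model. The `Topology` class, its fields, and the round-robin task migration are assumptions made for illustration; the source does not prescribe how resources are redistributed.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    cpus: set                                        # Socket IDs currently in the topology
    links: set                                       # frozenset({a, b}) per interconnect link
    tasks: dict = field(default_factory=dict)        # cpu -> list of task names
    powered_off: set = field(default_factory=set)    # CPUs powered down after removal


def hot_remove(topo, targets):
    """Remove `targets` from `topo` in place and return it."""
    targets = set(targets)
    remaining = sorted(topo.cpus - targets)
    if not remaining:
        raise ValueError("cannot remove every CPU")
    # 1. Reclaim resources: move tasks from the removed CPUs to the remaining ones.
    moved = [task for cpu in sorted(targets) for task in topo.tasks.pop(cpu, [])]
    for i, task in enumerate(moved):
        topo.tasks.setdefault(remaining[i % len(remaining)], []).append(task)
    # 2. Delete every logical connection that touches a removed CPU.
    topo.links = {link for link in topo.links if not (link & targets)}
    # 3. Update membership, then power the removed CPUs off.
    topo.cpus -= targets
    topo.powered_off |= targets
    return topo
```

After `hot_remove` returns, no task and no link in the model refers to a removed CPU, which mirrors the condition the text gives for the second CPU topology running stably.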
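In a specific implementation of the embodiment of the present invention, the CPUs of the CPU topology may be connected through intermediate nodes, where an intermediate node may be a CPU and/or an external node controller (XNC), as in the topology shown in FIG. 1 or FIG. 6.
In addition, the CPU topology in the embodiment of the present invention may include an even number of CPUs (for example, 8 or 6); correspondingly, the CPU topology after the removal still has an even number of CPUs.
Note that the first CPU topology and the second CPU topology are both stable topologies.
With the embodiment of the present invention, when a CPU is faulty or otherwise needs to be removed, it can be removed without affecting normal system operation, and the CPU topology after the removal remains stable, which improves the user experience.
A server with a CPU topology that is not fully interconnected needs not only the ability to provide continuous service but also the ability to scale flexibly. One form of scaling is to add hardware resources to the server when hardware resources or performance are insufficient, thereby expanding system resources and improving server performance; this ability is called capacity expansion. A method for expanding a CPU topology is provided below.
FIG. 9 is a schematic diagram of a CPU hot-add method according to an embodiment of the present invention. As shown in FIG. 9, the method may run on a multi-socket server having a CPU topology that is not fully interconnected. The instructions for the following steps may be executed by a specific CPU of that CPU topology, or by another CPU or controller distinct from that CPU topology, and the instructions required to perform the steps may be stored in a memory. The method may include the following steps: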
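S910: Determine first indication information. The first indication information is used to indicate adding a third CPU, and the third CPU is not in the currently running third CPU topology.
After installing the third CPU, the user can enter an instruction through the user interface, and the controller receives the instruction, which may carry the identifier of the third CPU.
Alternatively, after the CPU to be added has been installed, a sensor triggers a specific electrical signal; the controller receives the signal and obtains the identifier of the third CPU from it. The identifier of the CPU may be a Socket ID (the ID of the socket) or any other information that identifies the CPU. For example, the electrical signals triggered by different slots can be different, so that the signal itself indicates which slot has had a CPU installed. Alternatively, the electrical signals triggered by different slots may be the same; on receiving the signal, the server knows that a new CPU has been installed, and the service system or the control system determines the identifier of the newly installed CPU.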
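S920: Determine whether at least one fourth CPU that meets a preset condition with the third CPU has been installed.
Whether the at least one fourth CPU that meets the preset condition with the third CPU has been installed can be determined in the following ways.
Mode 1: The principle of this mode is the same as that of Mode 2 in step S720 shown in FIG. 7, and the two can be understood with reference to each other. Specifically, the controller determines whether the second CPU in at least one position symmetric to the third CPU in the fourth CPU topology has been installed.
Mode 2: The principle of this mode is the same as that of Mode 3 in step S720 shown in FIG. 7, and the two can be understood with reference to each other. Specifically, the processor may determine whether at least one backup CPU of the first CPU is installed. For example, the fourth CPU topology includes a plurality of CPU groups whose information may be pre-stored in the server, and the controller determines whether at least one fourth CPU belonging to the same CPU group as the third CPU has been installed.
The third CPU topology may need a whole group of CPUs to be added in order to obtain a stable topology, whereas the hot-add indication information may carry the identifier of only one CPU. The service system then needs to determine whether the other CPUs corresponding to that identifier are in position, and the following steps are performed only when the CPU and all of its corresponding CPUs have been installed.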
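The in-position check before a hot add can be sketched as follows, assuming the group table is stored as sets of Socket IDs and the installed CPUs are known as a set. Both representations and the function name are assumptions for illustration.

```python
def check_hot_add(groups, installed, new_cpu):
    """Return (ready, missing) for the CPU group that contains `new_cpu`.

    `ready` is True only when every member of the group is installed.
    """
    for group in groups:
        if new_cpu in group:
            missing = set(group) - set(installed)
            return (not missing, missing)
    raise KeyError(f"{new_cpu} is not in any stored CPU group")
```

Returning the missing members alongside the flag lets a caller report which slot still needs a CPU instead of only refusing the hot add.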
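S930: If so, send second indication information to the third CPU topology. The second indication information is used to indicate adding the third CPU and the at least one fourth CPU, obtaining a fourth CPU topology, and running the fourth CPU topology.
After receiving the second indication information, the third CPU topology may allocate resources for the third CPU and the at least one fourth CPU, and establish connections between these CPUs and the CPUs in the third CPU topology. The settings of the CPUs in the third CPU topology may also be adjusted so that they, the third CPU, and the at least one fourth CPU can run as the fourth CPU topology.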
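The addition step can be sketched in the same toy style: allocate a resource record for each new CPU and add the new links. The function name, the tuple-based link input, and the empty `tasks` allocation record are assumptions; the source does not describe the data structures involved.

```python
def hot_add(cpus, links, new_cpus, new_links):
    """Return (cpus, links, allocations) for the enlarged topology.

    `cpus` is a set of Socket IDs, `links` a set of frozenset({a, b}) entries,
    and `new_links` an iterable of (a, b) pairs. The inputs are not mutated.
    """
    new_cpus = set(new_cpus)
    if new_cpus & cpus:
        raise ValueError("CPU is already part of the running topology")
    all_cpus = cpus | new_cpus
    added_links = set()
    for a, b in new_links:
        if a not in all_cpus or b not in all_cpus:
            raise ValueError(f"link {a}-{b} references an unknown CPU")
        added_links.add(frozenset({a, b}))
    # Placeholder resource allocation for each newly added CPU.
    allocations = {cpu: {"tasks": []} for cpu in sorted(new_cpus)}
    return all_cpus, links | added_links, allocations
```

Building new sets instead of mutating the running topology keeps the old topology usable until the caller switches over to the returned one.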
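Note that the third CPU topology and the fourth CPU topology are both stable topologies.
With the embodiment of the present invention, the CPU topology can be expanded without affecting normal system operation; combining the embodiments shown in FIG. 7 and FIG. 9 makes it possible to replace a CPU, so that the system runs more stably and the user experience improves.
The foregoing describes the solution provided by the embodiments of the present invention mainly from the perspective of the data processing flow of the multi-socket server. It can be understood that, to implement the above functions, the server includes corresponding hardware structures and/or software modules for performing each function. A person skilled in the art will readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present invention can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered to go beyond the scope of the present invention.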
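FIG. 10 is a schematic structural diagram of a CPU hot-remove apparatus according to an embodiment of the present invention. The apparatus applies to a server having a first CPU topology that is not fully interconnected, the currently running first CPU topology includes a plurality of CPUs, and the apparatus includes:
a processing unit 1001, configured to determine a first CPU among the plurality of CPUs, where the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface;
the processing unit 1001 is further configured to determine at least one second CPU, among the plurality of CPUs, that meets a preset condition with the first CPU;
a sending unit 1002, configured to send second indication information to the first CPU topology, where the second indication information is used to indicate removing the first CPU and the at least one second CPU, obtaining a second CPU topology, and running the second CPU topology.
Optionally, the processing unit 1001 is further configured to:
determine the position of the first CPU in the first CPU topology and, in the first CPU topology, the second CPU in at least one position symmetric to the first CPU, or any one second CPU that is among the CPUs in symmetric positions and is directly connected to the first CPU.
Optionally, each CPU has multiple ports and the plurality of CPUs are connected through the ports; the processing unit 1001 is further configured to:
determine at least one second CPU that is interconnected with the first CPU through ports with the same port number.
Optionally, the processing unit 1001 is further configured to determine at least one backup second CPU of the first CPU.
Further, the first CPU topology includes a plurality of CPU groups whose information is pre-stored in the server; the processing unit 1001 is further configured to:
determine at least one second CPU that belongs to the same CPU group as the first CPU.
Optionally, that the second indication information is used to indicate removing the first CPU and the at least one second CPU includes:
the second indication information is used to indicate reclaiming the resources in the first CPU and the at least one second CPU, and disconnecting the first CPU and the at least one second CPU from the CPUs in the second CPU topology.
Note that this embodiment corresponds to the method embodiment of FIG. 7; the two can be understood with reference to each other, and details are not repeated here.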
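FIG. 11 is a schematic structural diagram of a CPU hot-add apparatus according to an embodiment of the present invention. The apparatus applies to a server having a third CPU topology that is not fully interconnected, and includes:
a processing unit 1101, configured to determine first indication information, where the first indication information is used to indicate adding a third CPU, and the third CPU is not in the currently running third CPU topology;
the processing unit 1101 is further configured to determine whether at least one fourth CPU that meets a preset condition with the third CPU has been installed;
a sending unit 1102, configured to send second indication information to the third CPU topology when the at least one fourth CPU that meets the preset condition with the third CPU has been installed, where the second indication information is used to indicate adding the third CPU and the at least one fourth CPU, obtaining a fourth CPU topology, and running the fourth CPU topology.
Optionally, the apparatus further includes:
a first receiving unit, configured to receive third indication information through a user interface, where the third indication information includes the identifier of the third CPU;
or,
a second receiving unit, configured to receive, through a sensor, fourth indication information triggered by installing the third CPU; the processing unit 1101 is further configured to determine the installed third CPU according to the fourth indication information.
Optionally, the processing unit 1101 is further configured to:
determine whether the second CPU in at least one position symmetric to the third CPU in the fourth CPU topology has been installed.
Optionally, the processing unit 1101 is further configured to:
determine at least one backup second CPU of the first CPU.
Further, the fourth CPU topology includes a plurality of CPU groups whose information is pre-stored in the server; the processing unit 1101 is further configured to:
determine whether at least one fourth CPU belonging to the same CPU group as the third CPU has been installed.
Optionally, that the second indication information is used to indicate adding the third CPU and the fourth CPU includes: the second indication information is used to indicate allocating resources for the third CPU and the at least one fourth CPU, establishing connections between the third CPU and the fourth CPU and the CPUs in the third CPU topology, obtaining a fourth CPU topology, and running the fourth CPU topology.
Note that this embodiment corresponds to the method embodiment of FIG. 9; the two can be understood with reference to each other, and details are not repeated here.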
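FIG. 12 is a schematic structural diagram of a server having a CPU topology according to an embodiment of the present invention. The server may include a CPU topology 1201 and an input/output interface 1202; the figure also shows a memory 1203 and a bus 1204, and the server may further include a controller 1205. The CPU topology 1201, the input/output interface 1202, the memory 1203, and the controller 1205 are connected through the bus 1204 and communicate with one another over it. The memory 1203 is used to store programs; the CPU topology 1201 and the controller 1205 read and execute the programs stored in the memory, and send and receive data and instructions for external devices through the input/output interface 1202.
Note that the structure of the CPU topology 1201 includes a plurality of slots; each slot holds a CPU that can be plugged and unplugged independently, and the slots are connected through interconnect channels to form a steady-state topology. The plurality of CPUs installed in the slots operate as the first CPU topology.
A CPU corresponding to the CPU to be removed generally exists in the first CPU topology, and the CPU to be removed and its corresponding CPU can be distinguished from the other CPUs by their slots. For example, if the CPU to be removed and its corresponding CPU are treated as one CPU group, the slots belonging to the same slot group can be marked with the same or similar identifiers; the slots of the same group can also be enclosed in the same box on the motherboard, or marked with the same color.
The memory 1203 may be a single storage device or a collective name for a plurality of storage elements, and is used to store the executable program code for the above steps or the parameters, data, and the like required for the operation of the access network management device. The memory 1203 may include random access memory (RAM) and may also include non-volatile memory such as magnetic disk memory or flash memory.
The bus 1204 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 1204 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in FIG. 12, but this does not mean that there is only one bus or one type of bus.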
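FIG. 13 is a schematic structural diagram of another server having a CPU topology according to an embodiment of the present invention. The multi-socket server may include a CPU topology 1301 and an input/output interface 1302. The figure also shows a memory 1303 and a bus 1304, and the server may further include a controller 1305. The CPU topology 1301, the input/output interface 1302, the memory 1303, and the controller 1305 are connected to and communicate with one another through the bus 1304. It should be noted that the CPU topology 1301 here includes several slots, each of which holds an independently pluggable CPU. The slots are connected to one another through interconnect channels to form a stable third CPU topology.
In the fourth CPU topology, there is generally a CPU corresponding to the to-be-removed CPU, and several slots may be reserved in the third CPU topology. The to-be-added CPU and its corresponding CPU may be installed in the reserved slots. To distinguish the reserved slots from slots that do not belong to the fourth CPU topology, the slots may be marked. For example, if the to-be-added CPU and its corresponding CPU are regarded as one CPU group, the slots belonging to the same slot group may be marked with the same or similar identifiers, enclosed in the same box on the mainboard, or marked with the same color.
The foregoing modules are similar to those in FIG. 12. They may be understood with reference to each other, and details are not repeated here.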
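The embodiments of the present invention make it possible to hot-plug CPUs without affecting the stability of the CPU topology, so that the system can keep running normally, which improves user experience.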
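A person skilled in the art should further appreciate that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of the present invention.
The steps of the methods or algorithms described with reference to the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing specific implementations further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely specific implementations of the present invention and is not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.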

Claims (20)

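  1. A central processing unit (CPU) hot-remove method, wherein the method is applicable to a server having a non-fully-interconnected first CPU topology, the server comprises a controller, the currently running first CPU topology comprises multiple CPUs, and the method comprises:
    determining, by the controller, a first CPU among the multiple CPUs, wherein the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface;
    determining, by the controller, at least one second CPU among the multiple CPUs that meets a preset condition with respect to the first CPU; and
    sending, by the controller, second indication information to the first CPU topology, wherein the second indication information is used to instruct removal of the first CPU and the at least one second CPU, to obtain a second CPU topology, and to run the second CPU topology.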
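  2. The method according to claim 1, wherein the determining, by the controller, at least one second CPU among the multiple CPUs that meets a preset condition with respect to the first CPU comprises:
    determining, by the controller, at least one standby second CPU of the first CPU.
  3. The method according to claim 2, wherein the first CPU topology comprises multiple CPU groups, information about the multiple CPU groups is pre-stored in the server, and the determining, by the controller, at least one standby second CPU of the first CPU comprises:
    determining, by the controller, at least one second CPU that belongs to the same CPU group as the first CPU.
  4. The method according to claim 1, wherein each CPU has multiple ports, the multiple CPUs are connected through the ports, and the determining, by the controller, at least one second CPU among the multiple CPUs that meets a preset condition with respect to the first CPU comprises:
    determining, by the controller, at least one second CPU that is connected to the first CPU through ports having the same port number.
  5. The method according to claim 1, wherein that the second indication information is used to instruct removal of the first CPU and the at least one second CPU comprises:
    the second indication information is used to instruct the first CPU topology to reclaim the resources of the first CPU and the at least one second CPU, and to disconnect the first CPU and the at least one second CPU from the CPUs in the second CPU topology.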
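  6. A central processing unit (CPU) hot-add method, wherein the method is applicable to a server having a non-fully-interconnected third CPU topology, the server comprises a controller, and the method comprises:
    determining, by the controller, first indication information, wherein the first indication information is used to instruct addition of a third CPU, and the third CPU is not in the currently running third CPU topology;
    determining, by the controller, whether at least one fourth CPU that meets a preset condition with respect to the third CPU has been installed; and
    if so, sending, by the controller, second indication information to the third CPU topology, wherein the second indication information is used to instruct addition of the third CPU and the at least one fourth CPU, to obtain a fourth CPU topology, and to run the fourth CPU topology.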
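  7. The method according to claim 6, wherein the determining, by the controller, first indication information comprises:
    receiving, by the controller, third indication information through a user interface, wherein the third indication information comprises an identifier of the third CPU;
    or
    receiving, by the controller through a sensor, fourth indication information triggered by installation of the third CPU, and determining, according to the fourth indication information, the installed third CPU.
  8. The method according to claim 6 or 7, wherein the determining, by the controller, whether at least one fourth CPU that meets a preset condition with respect to the third CPU has been installed comprises:
    determining, by the controller, whether at least one standby second CPU of the third CPU has been installed.
  9. The method according to claim 8, wherein the fourth CPU topology comprises multiple CPU groups, information about the multiple CPU groups is pre-stored in the server, and the determining, by the controller, whether at least one fourth CPU that meets a preset condition with respect to the third CPU has been installed comprises:
    determining, by the controller, whether at least one fourth CPU that belongs to the same CPU group as the third CPU has been installed.
  10. The method according to claim 6 or 7, wherein that the second indication information is used to instruct addition of the third CPU and the fourth CPU comprises:
    the second indication information is used to instruct that resources be allocated to the third CPU and the at least one fourth CPU, and that connections be established between the third CPU and the at least one fourth CPU and the CPUs in the third CPU topology, to obtain the fourth CPU topology, and to run the fourth CPU topology.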
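  11. A central processing unit (CPU) hot-remove apparatus, wherein the apparatus is applicable to a server having a non-fully-interconnected first CPU topology, the currently running first CPU topology comprises multiple CPUs, and the apparatus comprises:
    a processing unit, configured to determine a first CPU among the multiple CPUs, wherein the first CPU is a CPU that is faulty or that needs to be removed according to first indication information, and the first indication information comes from the first CPU topology or a user interface;
    wherein the processing unit is further configured to determine at least one second CPU among the multiple CPUs that meets a preset condition with respect to the first CPU; and
    a sending unit, configured to send second indication information to the first CPU topology, wherein the second indication information is used to instruct removal of the first CPU and the at least one second CPU, to obtain a second CPU topology, and to run the second CPU topology.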
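  12. The apparatus according to claim 11, wherein the processing unit is further configured to:
    determine at least one standby second CPU of the first CPU.
  13. The apparatus according to claim 12, wherein the first CPU topology comprises multiple CPU groups, information about the multiple CPU groups is pre-stored in the server, and the processing unit is further configured to:
    determine at least one second CPU that belongs to the same CPU group as the first CPU.
  14. The apparatus according to claim 11, wherein each CPU has multiple ports, the multiple CPUs are connected through the ports, and the processing unit is further configured to:
    determine at least one second CPU that is connected to the first CPU through ports having the same port number.
  15. The apparatus according to claim 11, wherein that the second indication information is used to instruct removal of the first CPU and the at least one second CPU comprises:
    the second indication information is used to instruct the first CPU topology to reclaim the resources of the first CPU and the at least one second CPU, and to disconnect the first CPU and the at least one second CPU from the CPUs in the second CPU topology.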
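  16. A central processing unit (CPU) hot-add apparatus, wherein the apparatus is applicable to a server having a non-fully-interconnected third CPU topology, and the apparatus comprises:
    a processing unit, configured to determine first indication information, wherein the first indication information is used to instruct addition of a third CPU, and the third CPU is not in the currently running third CPU topology;
    wherein the processing unit is further configured to determine whether at least one fourth CPU that meets a preset condition with respect to the third CPU has been installed; and
    a sending unit, configured to: when the at least one fourth CPU that meets the preset condition with respect to the third CPU has been installed, send second indication information to the third CPU topology, wherein the second indication information is used to instruct addition of the third CPU and the at least one fourth CPU, to obtain a fourth CPU topology, and to run the fourth CPU topology.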
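  17. The apparatus according to claim 16, further comprising:
    a first receiving unit, configured to receive third indication information through a user interface, wherein the third indication information comprises an identifier of the third CPU;
    or
    a second receiving unit, configured to receive, through a sensor, fourth indication information triggered by installation of the third CPU, wherein the processing unit is further configured to determine, according to the fourth indication information, the installed third CPU.
  18. The apparatus according to claim 16 or 17, wherein the processing unit is further configured to:
    determine whether at least one standby second CPU of the third CPU has been installed.
  19. The apparatus according to claim 18, wherein the fourth CPU topology comprises multiple CPU groups, information about the multiple CPU groups is pre-stored in the server, and the processing unit is further configured to:
    determine whether at least one fourth CPU that belongs to the same CPU group as the third CPU has been installed.
  20. The apparatus according to claim 16 or 17, wherein that the second indication information is used to instruct addition of the third CPU and the fourth CPU comprises:
    the second indication information is used to instruct that resources be allocated to the third CPU and the at least one fourth CPU, and that connections be established between the third CPU and the fourth CPU and the CPUs in the third CPU topology, to obtain the fourth CPU topology, and to run the fourth CPU topology.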
PCT/CN2016/098741 2016-01-08 2016-09-12 Central processing unit (CPU) hot-remove and hot-add method and apparatus Ceased WO2017118080A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
ES16883210T ES2793006T3 (es) 2016-01-08 2016-09-12 Procedimiento y aparato para la eliminación y la adición de CPU en caliente durante el funcionamiento
EP20159979.2A EP3767470B1 (en) 2016-01-08 2016-09-12 Central processing unit cpu hot-remove method and apparatus, and central processing unit cpu hot-add method and apparatus
EP16883210.3A EP3306476B1 (en) 2016-01-08 2016-09-12 Method and apparatus for hot cpu removal and hot cpu adding during operation
US15/863,350 US10846186B2 (en) 2016-01-08 2018-01-05 Central processing unit CPU hot-remove method and apparatus, and central processing unit CPU hot-add method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610016926.9 2016-01-08
CN201610016926.9A CN105700975B (zh) 2016-01-08 2016-01-08 一种中央处理器cpu热移除、热添加方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/863,350 Continuation US10846186B2 (en) 2016-01-08 2018-01-05 Central processing unit CPU hot-remove method and apparatus, and central processing unit CPU hot-add method and apparatus

Publications (1)

Publication Number Publication Date
WO2017118080A1 true WO2017118080A1 (zh) 2017-07-13

Family

ID=56226220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/098741 Ceased WO2017118080A1 (zh) 2016-01-08 2016-09-12 Central processing unit (CPU) hot-remove and hot-add method and apparatus

Country Status (5)

Country Link
US (1) US10846186B2 (zh)
EP (2) EP3306476B1 (zh)
CN (1) CN105700975B (zh)
ES (1) ES2793006T3 (zh)
WO (1) WO2017118080A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105700975B (zh) 2016-01-08 2019-05-24 华为技术有限公司 一种中央处理器cpu热移除、热添加方法及装置
CN108616366A (zh) * 2016-12-09 2018-10-02 华为技术有限公司 业务处理单元管理方法及装置
CN106933575B (zh) * 2017-02-27 2020-08-14 苏州浪潮智能科技有限公司 一种带外识别服务器资产信息的系统及方法
CN107547451B (zh) * 2017-05-31 2020-04-03 新华三信息技术有限公司 一种多路服务器、cpu连接方法及装置
US10628338B2 (en) * 2018-03-21 2020-04-21 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Selection of a location for installation of a CPU in a compute node using predicted performance scores
WO2020000354A1 (en) * 2018-06-29 2020-01-02 Intel Corporation Cpu hot-swapping
CN109189699B (zh) * 2018-09-21 2022-03-22 郑州云海信息技术有限公司 多路服务器通信方法、系统、中间控制器及可读存储介质
CN109491947B (zh) * 2018-11-14 2021-12-03 郑州云海信息技术有限公司 一种pcie外接卡热移除信息的发送方法及相关装置
CN110764829B (zh) * 2019-09-21 2022-07-08 苏州浪潮智能科技有限公司 一种多路服务器cpu隔离方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1491386A (zh) * 2001-02-09 2004-04-21 在可修复的故障后使群集器系统自动投入运行
US20040215865A1 (en) * 2003-04-28 2004-10-28 International Business Machines Corporation Non-disruptive, dynamic hot-plug and hot-remove of server nodes in an SMP
US20060274372A1 (en) * 2005-06-02 2006-12-07 Avaya Technology Corp. Fault recovery in concurrent queue management systems
CN101216793A (zh) * 2008-01-18 2008-07-09 华为技术有限公司 一种多处理器系统故障恢复的方法及装置
CN103425545A (zh) * 2013-08-20 2013-12-04 浪潮电子信息产业股份有限公司 一种多处理器服务器的系统容错方法
CN105700975A (zh) * 2016-01-08 2016-06-22 华为技术有限公司 一种中央处理器cpu热移除、热添加方法及装置

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5367636A (en) * 1990-09-24 1994-11-22 Ncube Corporation Hypercube processor network in which the processor indentification numbers of two processors connected to each other through port number n, vary only in the nth bit
US5909558A (en) * 1997-07-31 1999-06-01 Linzmeier; Daniel Low power serial arbitration system
US6282596B1 (en) * 1999-03-25 2001-08-28 International Business Machines Corporation Method and system for hot-plugging a processor into a data processing system
US6948021B2 (en) * 2000-11-16 2005-09-20 Racemi Systems Cluster component network appliance system and method for enhancing fault tolerance and hot-swapping
US7739685B2 (en) * 2005-01-06 2010-06-15 International Business Machines Corporation Decoupling a central processing unit from its tasks
JP2007172334A (ja) * 2005-12-22 2007-07-05 Internatl Business Mach Corp <Ibm> 並列型演算システムの冗長性を確保するための方法、システム、およびプログラム
CN101878620A (zh) * 2007-11-29 2010-11-03 英特尔公司 在基于链路的系统中修改系统路由信息
US20090144476A1 (en) * 2007-12-04 2009-06-04 Xiaohua Cai Hot plug in a link based system
US20110179311A1 (en) * 2009-12-31 2011-07-21 Nachimuthu Murugasamy K Injecting error and/or migrating memory in a computing system
CN102232218B (zh) * 2011-06-24 2013-04-24 华为技术有限公司 计算机子系统和计算机系统
WO2012149714A1 (zh) * 2011-08-25 2012-11-08 华为技术有限公司 一种节点控制器链路的切换方法、处理器系统和节点
CN103188059A (zh) * 2011-12-28 2013-07-03 华为技术有限公司 快速通道互联系统中数据包重传方法、装置和系统
WO2013145255A1 (ja) * 2012-03-30 2013-10-03 富士通株式会社 電力供給制御装置、中継ノード装置、有線アドホックネットワークシステム、および電力供給制御方法
US9164809B2 (en) * 2012-09-04 2015-10-20 Red Hat Israel, Ltd. Virtual processor provisioning in virtualized computer systems
CN103412836B (zh) * 2013-06-26 2016-08-10 华为技术有限公司 热插拔处理方法、装置以及系统
JP6103060B2 (ja) * 2013-07-11 2017-03-29 富士通株式会社 管理装置、管理方法及びプログラム
CN103699444B (zh) * 2013-12-17 2017-03-15 华为技术有限公司 中央处理器热插拔的实现方法及装置
JP6337606B2 (ja) * 2014-05-15 2018-06-06 富士通株式会社 情報処理装置、経路決定方法及びプログラム
CN104375881B (zh) * 2014-10-28 2017-11-14 江苏中科梦兰电子科技有限公司 龙芯处理器的主核热插拔方法
CN106104505B (zh) * 2015-12-29 2020-02-21 华为技术有限公司 一种cpu及多cpu系统管理方法
US10503684B2 (en) * 2016-07-01 2019-12-10 Intel Corporation Multiple uplink port devices

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3306476A4 *

Also Published As

Publication number Publication date
US10846186B2 (en) 2020-11-24
EP3306476B1 (en) 2020-03-25
US20180129574A1 (en) 2018-05-10
EP3767470A1 (en) 2021-01-20
EP3306476A1 (en) 2018-04-11
EP3306476A4 (en) 2018-11-07
CN105700975A (zh) 2016-06-22
CN105700975B (zh) 2019-05-24
ES2793006T3 (es) 2020-11-12
EP3767470B1 (en) 2022-02-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16883210

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2016883210

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE