TW591376B - System and method for detecting server failure and the restoring of the same - Google Patents
Info
- Publication number
- TW591376B (application number TW90133427A)
- Authority
- TW
- Taiwan
- Prior art keywords
- patent application
- server
- scope
- item
- signal
- Prior art date
Landscapes
- Debugging And Monitoring (AREA)
- Hardware Redundancy (AREA)
Abstract
Description
V. Description of the Invention
0506-6768TWf; MRS90-003; yianhou.ptd

The present invention relates to a system and method for detecting server failure, and in particular to a system and method for detecting and recovering from server failure that can automatically restore a failed server.

In recent years, with the rapid growth of networks, system management (MIS) of the overall network system has become increasingly important. At the same time, how to maintain a stable, normally operating server at all times has become an important issue.

In addition, as the concept of the global village has taken shape, long-distance and cross-border services, as well as information services available anytime and anywhere, have become common ways of working. If a server fails unexpectedly in the middle of the night or while no system administrator is present, for example a crash caused by improper user operation or by system software, the server must wait for the network administrator to troubleshoot it. Service is interrupted in the meantime, and the overall service quality of the system suffers.

In many cases, however, a server crash is caused by detailed design flaws in the system software itself or by improper user access to the server, and normal service can be restored simply by powering the server off and on again. In such cases, where a reboot alone solves the problem, how to shorten the time between the crash and the reboot becomes another important issue.

In view of this, the main object of the present invention is to provide a system and method for detecting and recovering from server failure that, when a server crashes, automatically notifies the administrator and at the same time automatically reboots the server, so that the failed server can resume service, thereby reducing the losses incurred when the server stops providing service.

This object is achieved by the system and method for detecting and recovering from server failure provided by the present invention.

The system for detecting and recovering from server failure according to an embodiment of the present invention is applicable to a server and includes an event signal interception module and a power control circuit. The event signal interception module receives a system signal and, according to the system signal, sends a counter reset signal to the power control circuit.

The power control circuit may include a counter that counts down with each unit of time. When the power control circuit receives the counter reset signal, it resets the count value of the counter to a first predetermined value. When the count value equals a third predetermined value, the power control circuit turns off the power of the server and reboots the server.

In addition, when the count value equals the third predetermined value, a failure occurrence signal is output at the same time to notify the system administrator. The system signal may be an idle event (Idle Event) signal issued by an operating system module, and the power control circuit is an independent power control circuit that may be arranged on the motherboard of the server or on an interface card.

Brief Description of the Drawings

To make the above objects, features, and advantages of the present invention easier to understand, a specific embodiment is described in detail below with reference to the accompanying drawings:

FIG. 1 is a schematic diagram of the system architecture of the system for detecting and recovering from server failure according to an embodiment of the present invention.
FIG. 2 is a flowchart of the method for detecting and recovering from server failure according to an embodiment of the present invention.

Explanation of Symbols

10: operating system module;
20: event signal interception module;
30: power control circuit;
31: counter;
S100, ..., S108: operation steps.

Embodiment

An embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of the system architecture of the system for detecting and recovering from server failure according to an embodiment of the present invention. Referring to FIG. 1, the system includes an operating system module 10, an event signal interception module 20, and a power control circuit 30.

The power control circuit 30 is an independent power control circuit capable of powering the server on and off and of switching its power supply. The power control circuit 30 may be arranged on the motherboard of the server (not shown) or on an interface card.

The power control circuit 30 includes a counter 31. The counter 31 holds a count value that starts at a first predetermined value, and after each predetermined unit of time a second predetermined value is subtracted from the count value. The first predetermined value is larger than the second predetermined value. For example, the initial value (first predetermined value) of the counter 31 is 180, and the value 1 (second predetermined value) is subtracted every 1 second (predetermined unit of time).

The operating system module 10 may be the operating system installed on the server, such as Windows or Linux. When the central processing unit (CPU) of the server has nothing to process, the operating system module 10 issues a system signal, such as an idle event (Idle Event) signal, to the overall server system so that power management procedures can be carried out.

The event signal interception module 20 may be a circuit or a driver. It intercepts the system signal issued by the operating system module 10 and, according to the received system signal, outputs a counter reset signal to the power control circuit 30.

When the power control circuit 30 receives the counter reset signal, it resets the count value of the counter 31 to the initial value (first predetermined value).
If no reset arrives and the count value of the counter 31 decreases over time to a third predetermined value, for example 0, the server has not issued the system signal for some time. If the system signal is the idle event signal, this means that the central processing unit of the server has crashed or has been continuously busy (the latter being very unlikely). The power control circuit 30 therefore turns off the power of the server, reboots the server, and outputs a failure occurrence signal, such as a pager message, to notify the system administrator that the server has failed.

Next, FIG. 2 is a flowchart of the method for detecting and recovering from server failure according to an embodiment of the present invention. The operation flow of the embodiment is described below with reference to both FIG. 1 and FIG. 2.
First, in step S100, after each second (the predetermined unit of time), one (the second predetermined value) is subtracted from the count value of the counter 31 in the power control circuit 30. Then, in step S102, it is determined whether the count value equals zero (the third predetermined value). If the count value does not equal zero, step S104 determines whether a counter reset signal sent by the event signal interception module 20 has been received.

If the power control circuit 30 has not received a counter reset signal, the flow returns directly to step S100. If the power control circuit 30 has received a counter reset signal, then in step S106 the count value of the counter 31 is reset to 180 (the first predetermined value), and the flow returns to step S100.

If, on the other hand, the count value is found to equal zero in step S102, then in step S108 the power control circuit 30 turns off the power of the server, reboots the server, and outputs a failure occurrence signal (not shown in FIG. 2), such as a pager message, to notify the system administrator.
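Steps S100 to S108 can be traced with a small simulation. The following Python sketch is only an illustration under stated assumptions: the function `run_watchdog` and its parameters are hypothetical names, reset signals are modelled as a set of second indices, and the power cycling and paging of step S108 are represented simply by returning the second at which the step would fire.

```python
from typing import Optional, Set


def run_watchdog(reset_at: Set[int],
                 first_value: int = 180,
                 second_value: int = 1,
                 third_value: int = 0,
                 max_seconds: int = 1000) -> Optional[int]:
    """Simulate the flow of FIG. 2.

    reset_at holds the seconds (1-based) at which a counter reset signal
    arrives. Returns the second at which step S108 would be reached, or
    None if it is not reached within max_seconds.
    """
    count = first_value
    for second in range(1, max_seconds + 1):
        count -= second_value          # S100: subtract after each unit time
        if count == third_value:       # S102: has the count reached the limit?
            return second              # S108: power off, reboot, notify
        if second in reset_at:         # S104: reset signal received?
            count = first_value        # S106: restore the initial value
    return None


print(run_watchdog(set()))                       # 180
print(run_watchdog({100}))                       # 280
print(run_watchdog(set(range(60, 1001, 60))))    # None
```

With the example values, a server that never signals is rebooted after 180 seconds, a single reset at second 100 postpones the reboot to second 280, and a reset every 60 seconds keeps step S108 from being reached within the simulated window. Because step S102 is evaluated before step S104, a reset that arrives in the same second the count reaches zero does not prevent the reboot in this sketch.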
Therefore, with the system and method for detecting and recovering from server failure provided by the present invention, when a server crashes the administrator is notified automatically and the server is rebooted automatically at the same time, so that the failed server restarts and resumes service in the shortest possible time, thereby reducing the losses incurred when the server stops providing service.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the appended claims.
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW90133427A TW591376B (en) | 2001-12-31 | 2001-12-31 | System and method for detecting server failure and the restoring of the same |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW90133427A TW591376B (en) | 2001-12-31 | 2001-12-31 | System and method for detecting server failure and the restoring of the same |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| TW591376B (en) | 2004-06-11 |
Family
ID=34057361
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW90133427A TW591376B (en) | 2001-12-31 | 2001-12-31 | System and method for detecting server failure and the restoring of the same |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TW591376B (en) |
- 2001-12-31: TW application TW90133427A, patent TW591376B (en), not active (IP right cessation)
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US7756048B2 (en) | Method and apparatus for customizable surveillance of network interfaces | |
| CN101271413A (en) | Computer operating state detection and processing method and system | |
| WO2018095107A1 (en) | Bios program abnormal processing method and apparatus | |
| WO2017215441A1 (en) | Self-recovery method and apparatus for board configuration in distributed system | |
| US7434085B2 (en) | Architecture for high availability using system management mode driven monitoring and communications | |
| TW591376B (en) | System and method for detecting server failure and the restoring of the same | |
| JP2735514B2 (en) | Process status management method | |
| CN106133699A (en) | Malfunction informing device, failure notification method and program | |
| US6622257B1 (en) | Computer network with swappable components | |
| CN101047564A (en) | Network communication equipment platform and method for implementing high reliability on it | |
| CN104394003B (en) | Power supply trouble processing method, device and power supply unit | |
| JP3325785B2 (en) | Computer failure detection and recovery method | |
| JPH0553846A (en) | Fault detecting system | |
| JP6654662B2 (en) | Server device and server system | |
| CN100413261C (en) | Method and system for data recovery | |
| JP2004013723A (en) | Device and method for fault recovery of information processing system adopted cluster configuration using shared memory | |
| JP2008152552A (en) | Computer system and failure information management method | |
| CN111654434B (en) | Flow switching method and device and forwarding equipment | |
| CN112084049B (en) | Method for monitoring resident program of baseboard management controller | |
| JP2001175545A (en) | Server system, fault diagnosing method, and recording medium | |
| CN1322705C (en) | A method of datum plane reset for forwarding equipment | |
| KR20020065188A (en) | Method for managing fault in computer system | |
| JPH0271336A (en) | Monitor system for fault state of processor | |
| JPH09212388A (en) | CPU operation monitoring method | |
| JP3107104B2 (en) | Standby redundancy method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | MM4A | Annulment or lapse of patent due to non-payment of fees | |