JPS63291132A

JPS63291132A - Fault deciding device for composite computer system

Info

Publication number: JPS63291132A
Application number: JP62125308A
Authority: JP
Inventors: Masashi Kudo; 工藤　雅司
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1987-05-22
Filing date: 1987-05-22
Publication date: 1988-11-29

Abstract

PURPOSE: To judge a fault location correctly, enable suitable fault processing, and improve the reliability of a composite computer system by having plural computers supervise one another and exchange the supervisory information. CONSTITUTION: A computer holding the transmission right for a health signal transmits the health signal to the other computers in turn and stores the fault supervisory information that each of them has already collected. When the information collection is complete, the computer decides which computer is faulty and reports the stored fault information to the other computers. This fault deciding processing is executed at constant time intervals on the computer holding the health signal transmission right, and the transmission right is transferred cyclically among the computers. Because the plural computers thus supervise one another and exchange the supervisory information, the fault location is judged correctly, suitable fault processing is executed, and the reliability of the composite computer system is improved.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of industrial application]

The present invention relates to an information processing system, and in particular to a failure determination device for a compound computer system.

[Overview]

In a failure determination device for a compound computer system, multiple computers monitor one another and exchange the resulting monitoring information, so that the location of a failure is determined accurately, appropriate fault handling is performed, and the reliability of the compound computer system is improved.

[Conventional technology]

Conventionally, failures of the computers constituting a compound computer system, or of the communication paths connecting them, have been determined by a single computer acting on its own judgment.

[Problem that the invention seeks to solve]

In the conventional method described above, a misjudgment by the single computer performing failure determination can cause a normal computer to be treated as faulty, or a communication path failure to be judged a computer failure. System fault handling, such as disconnecting the faulty computer, continuing job operation, or switching communication paths, is then carried out on that wrong basis. This causes confusion among the computers, puts the whole system into an unstable state, and makes normal system operation impossible.

The present invention solves these conventional problems, and its object is to provide a device that determines failures accurately.

[Means for solving problems]

The present invention is characterized in that, in a compound computer system in which a plurality of computers are interconnected by communication paths, each computer is provided with a fault monitoring means for monitoring failures of the other computers.

The fault monitoring means may include means for sending a fault monitoring signal to the other computers, means for storing the fault monitoring information returned by the other computers in response to this signal, and means for determining a faulty computer from the information stored by the storing means. The plurality is three or more.

[Effect]

The computer holding the health signal transmission right transmits health signals to the other computers in turn and stores the fault monitoring information that each of those computers has already collected.

When this information collection is complete, the computer performs the failure determination and reports the stored fault information to the other computers.

The failure determination processing is performed at regular time intervals on the computer holding the health signal transmission right, and the transmission right is delegated cyclically among the computers.

Because the multiple computers thus monitor one another and exchange their monitoring information, the location of a failure can be determined accurately, appropriate fault handling can be performed, and the reliability of the compound computer system can be improved.

[Example]

Next, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention that uses three computers, and FIG. 2 is a block diagram showing the fault monitoring function and the overall configuration of the embodiment. The embodiment consists of computers 1, 2, and 3, a bus 4 serving as the main communication path, and a bus 5 serving as the auxiliary communication path.

Computer 1 includes a processor 6, a main storage device 7, and bus control units 8 and 9. The main storage device 7 holds a fault monitoring function 11 and a fault monitoring information table 10. The fault monitoring function 11 comprises: a fault monitoring signal sending means 31 that monitors the other computers and the communication paths; a fault monitoring information storage means 32 that stores the information obtained by the sending means 31 in the fault monitoring information table 10; a fault monitoring information notification means 33 that notifies other computers of the fault information stored by the storage means 32; and a faulty computer determining means 34 that determines the faulty computer from the fault monitoring information notified by the notification means 33.

Computers 2 and 3 likewise include processors 12 and 18, main storage devices 13 and 19, and bus control units 14, 15, 20, and 21. As in computer 1, the main storage devices 13 and 19 hold fault monitoring functions 17 and 23 and fault monitoring information tables 16 and 22, respectively.

In the embodiment configured as described above, the computer holding the health signal transmission right uses the fault monitoring signal sending means 31 to transmit health signals to the other computers in turn, and uses the fault monitoring information storage means 32 to gather the fault monitoring information that each of those computers has already collected. When this information collection is complete, the faulty computer determining means 34 performs the failure determination.

This failure determination processing is performed at regular time intervals on the computer holding the health signal transmission right, and the transmission right is in turn delegated cyclically among the computers.
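The cyclic delegation of the health signal transmission right can be pictured with a short sketch. This is only an illustration under stated assumptions: the function name `transmission_right_schedule`, the numeric times, and the fixed order 1, 2, 3 are hypothetical (the description only states that the right passes cyclically at a regular interval).

```python
from itertools import cycle


def transmission_right_schedule(computers, start_time, interval, rounds):
    """Return (time, holder) pairs for `rounds` consecutive polling rounds.

    Assumption: the right is handed on in the fixed order given by
    `computers`, wrapping around after the last one.
    """
    holders = cycle(computers)
    return [(start_time + k * interval, next(holders)) for k in range(rounds)]


# Three computers, a hypothetical interval of 10 time units, four rounds.
schedule = transmission_right_schedule([1, 2, 3], start_time=0, interval=10, rounds=4)
print(schedule)  # [(0, 1), (10, 2), (20, 3), (30, 1)]
```

After computer 3 has held the right, it returns to computer 1, so every computer periodically acts as the determining computer.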

FIG. 3 shows an example of the contents of the fault monitoring information table 10 of computer 1. Here $N_{ij}$ ($i \neq j$, $1 \le i, j \le 3$, $N_{ij} = 0$ or $1$) indicates the state of computer $j$ as detected by computer $i$: $N_{ij} = 0$ means normal and $N_{ij} = 1$ means abnormal. When computer $i$ receives a health signal from another computer, it returns these values $N_{ij}$ to that computer together with the response signal, using the fault monitoring information notification means 33.
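One possible in-memory layout for such a table is sketched below. The names `new_table` and `build_response` and the nested-dictionary layout are assumptions made for illustration; the description does not specify how the table is stored.

```python
def new_table(n=3):
    """Build a table with table[i][j] = 0 (normal) for every pair i != j."""
    ids = range(1, n + 1)
    return {i: {j: 0 for j in ids if j != i} for i in ids}


def build_response(table, me):
    """Return a copy of the row that computer `me` sends back with its
    response signal, i.e. the states it has detected for the other computers."""
    return dict(table[me])


table_16 = new_table()      # hypothetical copy of the table held by computer 2
table_16[2][3] = 1          # computer 2 has detected computer 3 as abnormal
print(build_response(table_16, 2))  # {1: 0, 3: 1}
```

Leaving out the diagonal entries mirrors the condition $i \neq j$: a computer does not record a state for itself.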

Then, once this information ($N_{ij}$, $i \neq j$, $1 \le i, j \le 3$) has been collected from all computers, the condition

$$\sum_{i \neq j} N_{ij} \ge 2 \qquad \text{(A)}$$

is checked, and the faulty computer determining means 34 determines that any computer $j$ satisfying it is faulty.

An example will now be described. As shown in FIG. 4, suppose that a failure occurs in computer 3 at time $T_0$ and that it can no longer respond to health signals from the other computers. At time $T_1$, computer 1 obtains the health signal transmission right and sends health signals to the other computers 2 and 3 in turn. Computer 2 returns its response signal together with $N_{21}$ and $N_{23}$ from the fault monitoring information table 16 in computer 2.

On receiving this information, computer 1 updates $N_{21}$ and $N_{23}$ in its own fault monitoring information table 10 with the received values, then judges computer 2 to be normal and sets $N_{12}$ to 0.

Computer 3, on the other hand, has failed and cannot return a response signal to the health signal from computer 1. Computer 1 therefore cannot update computer 3's fault monitoring information $N_{31}$ and $N_{32}$ in table 10, which stays as it was, but computer 1 itself ultimately detects the abnormality of computer 3 and sets $N_{13}$ to 1. As a result, the contents of the fault monitoring information table 10 become as shown in FIG. 5.

In this fault monitoring information table 10, $N_{13}$ is 1, which indicates that computer 1 has not been able to confirm the state of computer 3. Since a communication path failure is also a possible cause, computer 1 temporarily uses the auxiliary bus 5 to send the health signal to computer 3 again and performs the same failure determination.

If there is still no response from computer 3, computer 3 is determined to be abnormal and $N_{13} = 1$ is retained in the fault monitoring information table 10 on computer 1. If computer 3 does respond, a path failure is determined: every computer is notified that all subsequent communication with all computers is to use bus 5, and $N_{13}$ is changed to 0 in table 10 of computer 1.
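The retry over the auxiliary bus separates a path failure from a computer failure. The sketch below illustrates that decision only; `check_peer` and the callable `send_health(peer, bus)`, which is assumed to return True when a response arrives, are hypothetical names, and the bus numbers 4 and 5 are taken from the embodiment.

```python
def check_peer(send_health, peer, main_bus=4, aux_bus=5):
    """Return (n_value, diagnosis) for one polled computer.

    n_value is the entry the polling computer would record for `peer`:
    0 for normal, 1 for abnormal.
    """
    if send_health(peer, main_bus):
        return 0, "normal"
    # No response on the main bus: retry once over the auxiliary bus.
    if send_health(peer, aux_bus):
        # The peer is alive, so the main path is at fault; the caller would
        # then tell every computer to use the auxiliary bus from now on.
        return 0, "path failure"
    return 1, "computer failure"


# Hypothetical transports: one where only bus 5 works, one where nothing answers.
print(check_peer(lambda peer, bus: bus == 5, 3))  # (0, 'path failure')
print(check_peer(lambda peer, bus: False, 3))     # (1, 'computer failure')
```

Passing the transport in as a callable keeps the decision logic independent of how health signals are actually carried.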

Furthermore, at time $T_2$ ($= T_1 + T$, where $T$ is the interval at which failure processing is delegated between computers), computer 2 obtains the health signal transmission right, collects fault monitoring information from each computer, and builds the fault monitoring information table 16 on computer 2 in the same way as computer 1 did. If computer 3 has still not recovered at this time, table 16 becomes as shown in FIG. 6 and condition (A) is satisfied for computer 3, so both computers 1 and 2 determine that computer 3 is faulty.
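The two polling rounds of this example can be traced with a small simulation. It is a sketch under several assumptions: every table starts with all entries 0 (the actual contents of FIGS. 5 and 6 are not reproduced here), the auxiliary bus retry is omitted, and the names `new_table` and `poll_round` are hypothetical.

```python
def new_table(n=3):
    """table[i][j] = 0 (normal) for every pair i != j."""
    ids = range(1, n + 1)
    return {i: {j: 0 for j in ids if j != i} for i in ids}


def poll_round(holder, tables, alive):
    """Computer `holder` polls every other computer, updates its own table,
    and returns the computers that satisfy condition (A) in that table."""
    mine = tables[holder]
    for peer in sorted(mine):
        if peer == holder:
            continue
        if peer in alive:
            # The peer answers with its own row; copy it and mark the peer normal.
            mine[peer] = dict(tables[peer][peer])
            mine[holder][peer] = 0
        else:
            # No response: the peer's row stays stale and the peer is marked abnormal.
            mine[holder][peer] = 1
    ids = sorted(mine)
    return [j for j in ids if sum(mine[i][j] for i in ids if i != j) >= 2]


tables = {c: new_table() for c in (1, 2, 3)}  # each computer holds its own table
alive = {1, 2}                                 # computer 3 failed at T0

at_t1 = poll_round(1, tables, alive)  # computer 1 holds the right at T1
at_t2 = poll_round(2, tables, alive)  # computer 2 holds the right at T2
print(at_t1, at_t2)  # [] [3]
```

Under these assumptions only one abnormal report for computer 3 exists after the round at $T_1$, so condition (A) is first satisfied in computer 2's table at $T_2$, once computer 1's report has been collected and computer 2 has added its own.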

[Effect of the invention]

As described in detail above, according to the present invention, a plurality of computers monitor one another and exchange their monitoring information, so that the location of a failure can be determined accurately. Appropriate fault handling can therefore be performed, which has the effect of improving the reliability of the compound computer system.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.
FIG. 2 is a block diagram showing the fault monitoring function and the overall configuration of the embodiment.
FIG. 3 is a diagram showing an example of the contents of a fault monitoring information table in the embodiment.
FIG. 4 is a time chart for the embodiment, assuming that a failure occurs in computer 3.
FIG. 5 is a diagram showing the contents of the fault monitoring information table of computer 1 obtained as a result of the failure processing that computer 1 starts at time $T_1$.
FIG. 6 is a diagram showing the contents of the fault monitoring information table of computer 2 obtained as a result of the failure processing that computer 2 starts at time $T_2$.

1, 2, 3: computers; 4, 5: buses; 6, 12, 18: processors; 7, 13, 19: main storage devices; 8, 9, 14, 15, 20, 21: bus control units; 10, 16, 22: fault monitoring information tables; 11, 17, 23: fault monitoring functions; 31: fault monitoring signal sending means; 32: fault monitoring information storage means; 33: fault monitoring information notification means; 34: faulty computer determining means.

Claims

[Claims]

(1) A failure determination device for a compound computer system in which a plurality of computers are interconnected by communication paths, wherein each computer is provided with a fault monitoring means for monitoring failures of the other computers.

(2) The device according to claim (1), wherein the fault monitoring means includes means for sending a fault monitoring signal to the other computers, means for storing the fault monitoring information returned by the other computers in response to this signal, and means for determining a faulty computer from the information stored by the storing means.

(3) The device according to claim (1), wherein the plurality is three or more.