JPS63291132A - Fault deciding device for composite computer system - Google Patents

Fault deciding device for composite computer system

Info

Publication number
JPS63291132A
Authority
JP
Japan
Prior art keywords
computer
fault
computers
failure
monitoring information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP62125308A
Other languages
Japanese (ja)
Inventor
Masashi Kudo
工藤 雅司
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to JP62125308A
Publication of JPS63291132A
Legal status: Pending

Landscapes

  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

PURPOSE: To judge a fault location correctly, carry out suitable fault processing, and improve the reliability of a composite computer system by having plural computers monitor one another and exchange the monitoring information. CONSTITUTION: A computer holding the transmission right for a health signal sends the health signal to the other computers in turn and stores the fault monitoring information that each peer computer has already collected. When the information collection is completed, it decides which computer is faulty and reports the stored fault information to the other computers. The fault deciding processing is executed at a constant time interval on the computer holding the health signal transmission right, and the transmission right is transferred cyclically between the computers. Because plural computers thus monitor one another and exchange the monitoring information, the fault location is judged correctly, suitable fault processing is executed, and the reliability of the composite computer system is improved.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to an information processing system, and in particular to a fault determination device for a composite computer system.

[Overview]

In a fault determination device for a composite computer system, plural computers monitor one another and exchange the resulting monitoring information, so that the fault location is determined accurately, appropriate fault handling is carried out, and the reliability of the composite computer system is improved.

[Conventional Technology]

Conventionally, faults in the computers that make up a composite computer system, or in the communication paths connecting them, have been determined by the independent judgment of a single computer.

[Problem to Be Solved by the Invention]

In the conventional method described above, a misjudgment by the single computer that performs fault determination may cause a normal computer to be treated as faulty, or a communication path fault to be judged as a computer fault. The system then carries out fault handling such as disconnecting the supposedly faulty computer, continuing job operation, or switching communication paths. This causes confusion among the computers, the system as a whole falls into an unstable state, and normal system operation becomes impossible.

The present invention solves these conventional problems, and its object is to provide a device that performs accurate fault determination.

[Means for Solving the Problem]

The present invention is characterized in that, in a composite computer system in which a plurality of computers are interconnected by communication paths, each computer is provided with fault monitoring means for monitoring faults in the other computers.

The fault monitoring means may include means for sending a fault monitoring signal to the other computers, means for storing the fault monitoring information returned by the other computers in response to this signal, and means for determining a faulty computer from the information stored by that means. The plurality is three or more.

[Operation]

A computer holding the right to transmit health signals sends health signals to the other computers in turn, and stores the fault monitoring information that each peer computer has already collected.

When this information collection is finished, the computer performs fault determination and reports the stored fault information to the other computers.

The fault determination processing is performed at fixed time intervals on the computer that holds the health signal transmission right, and the transmission right is handed over cyclically among the computers.
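The cyclic hand-over of the transmission right can be pictured as a simple round-robin schedule. The following is a minimal sketch, not the patented implementation: the function name `transmission_right_schedule`, the numeric time values, and the use of a precomputed list are all assumptions made for illustration.

```python
from itertools import cycle, islice


def transmission_right_schedule(computers, rounds, start_time=0, interval=1):
    """Return (time, holder) pairs for a round-robin hand-over of the
    health-signal transmission right (hypothetical helper)."""
    holders = islice(cycle(computers), rounds)
    return [(start_time + k * interval, holder)
            for k, holder in enumerate(holders)]


# Three computers, hand-over every 5 time units starting at t = 10.
schedule = transmission_right_schedule([1, 2, 3], rounds=5,
                                       start_time=10, interval=5)
```

Here `schedule` comes out as `[(10, 1), (15, 2), (20, 3), (25, 1), (30, 2)]`: each computer in turn holds the right for one interval, and the sequence wraps around after the last computer.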

Because plural computers thus monitor one another and exchange the monitoring information, the fault location can be determined accurately, appropriate fault handling can be carried out, and the reliability of the composite computer system can be improved.

[Embodiment]

Next, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention that uses three computers, and FIG. 2 is a block diagram showing the fault monitoring function and the overall configuration of the embodiment. The embodiment consists of computers 1, 2, and 3, a bus 4 serving as the main communication path, and a bus 5 serving as the auxiliary communication path.

Computer 1 is equipped with a processor 6, a main storage device 7, and bus control units 8 and 9. The main storage device 7 holds a fault monitoring function 11 and a fault monitoring information table 10. The fault monitoring function 11 includes: fault monitoring signal sending means 31, which monitors the other computers and the communication paths; fault monitoring information storage means 32, which stores the information obtained by the fault monitoring signal sending means 31 in the fault monitoring information table 10; fault monitoring information notifying means 33, which notifies the other computers of the fault information stored by the fault monitoring information storage means 32; and faulty computer determining means 34, which determines a faulty computer based on the fault monitoring information notified by the fault monitoring information notifying means 33.

Computers 2 and 3 likewise include processors 12 and 18, main storage devices 13 and 19, and bus control units 14, 15, 20, and 21, respectively. As in computer 1, the main storage devices 13 and 19 contain fault monitoring functions 17 and 23 and fault monitoring information tables 16 and 22, respectively.

In the embodiment configured in this way, the computer holding the health signal transmission right uses the fault monitoring signal sending means 31 to send health signals to the other computers in turn, and uses the fault monitoring information storage means 32 to gather the fault monitoring information that each peer computer has already collected. Once this collection is finished, the faulty computer determining means 34 performs the fault determination for the computers.

This fault determination processing is performed at fixed time intervals on the computer that holds the health signal transmission right, and the transmission right is handed over cyclically among the computers.

FIG. 3 shows an example of the contents of the fault monitoring information table 10 of computer 1. Here $N_{ij}$ ($i \neq j$, $1 \leq i, j \leq 3$, $N_{ij} = 0, 1$) denotes the state of computer j as detected by computer i: $N_{ij} = 0$ indicates normal and $N_{ij} = 1$ indicates abnormal. When computer i receives a health signal from another computer, it returns these $N_{ij}$ ($1 \leq i, j \leq 3$) to that computer together with the response signal, using the fault monitoring information notifying means 33.
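One way to picture the table of $N_{ij}$ values and the reply to a health signal is sketched below. The nested-dictionary layout and the names `make_table` and `health_response` are assumptions for illustration only; the patent does not specify a data layout.

```python
def make_table(computers):
    """Create an all-normal table: table[i][j] is N_ij, the state of
    computer j as detected by computer i (0 = normal, 1 = abnormal)."""
    return {i: {j: 0 for j in computers if j != i} for i in computers}


def health_response(table, i):
    """Row that computer i returns with its response signal
    (hypothetical helper): its own observations N_ij."""
    return dict(table[i])


table_2 = make_table([1, 2, 3])
table_2[2][3] = 1                    # computer 2 has seen computer 3 as abnormal
reply = health_response(table_2, 2)  # N_21 = 0, N_23 = 1
```

Returning a copy of the row, rather than the row itself, keeps the sender's stored table from being modified through the reply.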

Then, once this information $N_{ij}$ ($i \neq j$, $1 \leq i, j \leq 3$) has been collected from all computers, the condition

$\sum_{i \neq j} N_{ij} \geq 2$   (A)

is checked, and the faulty computer determining means 34 judges any computer j that satisfies it to be faulty.
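Condition (A) amounts to counting, for each computer j, how many other computers report it as abnormal. A minimal sketch follows, assuming the $N_{ij}$ values are held in a dictionary keyed by `(i, j)`; the function name `faulty_computers` and the `threshold` parameter are illustrative and not taken from the patent.

```python
def faulty_computers(n, computers, threshold=2):
    """Return every computer j for which the sum over i != j of
    n[(i, j)] is at least `threshold` (condition (A) when threshold is 2)."""
    result = []
    for j in computers:
        reports = sum(n.get((i, j), 0) for i in computers if i != j)
        if reports >= threshold:
            result.append(j)
    return result


# N_13 = N_23 = 1: both peers report computer 3 as abnormal.
n = {(1, 2): 0, (1, 3): 1, (2, 1): 0, (2, 3): 1, (3, 1): 0, (3, 2): 0}
verdict = faulty_computers(n, [1, 2, 3])
# Only computer 1 reports computer 3: condition (A) is not met.
single = faulty_computers({(1, 3): 1}, [1, 2, 3])
```

With these inputs `verdict` is `[3]` and `single` is `[]`, which reflects the point of the scheme: one computer's report alone is not enough to declare a peer faulty.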

Consider an example. As shown in FIG. 4, suppose that a fault occurs in computer 3 at time $T_0$ and that it can no longer respond to health signals from the other computers. At time $T_1$, computer 1 obtains the health signal transmission right and sends health signals to the other computers 2 and 3 in turn. Computer 2 returns, together with its response signal, $N_{21}$ and $N_{23}$ from the fault monitoring information table 16 in computer 2.

On receiving this information, computer 1 updates $N_{21}$ and $N_{23}$ in its own fault monitoring information table 10 with it, then judges computer 2 to be normal and sets $N_{12}$ to 0.

Computer 3, on the other hand, has failed and cannot return a response signal to the health signal from computer 1. Computer 1 therefore cannot update the fault monitoring information $N_{31}$ and $N_{32}$ of computer 3 in the fault monitoring information table 10, and these values stay as they were. Computer 1 ultimately detects the abnormality of computer 3 and sets $N_{13}$ to 1. As a result, the contents of the fault monitoring information table 10 become as shown in FIG. 5.

In this fault monitoring information table 10, $N_{13}$ is 1, which indicates that computer 1 has been unable to confirm the state of computer 3. Because a communication path fault is also a possible cause, computer 1 temporarily uses the auxiliary bus 5 to send the health signal to computer 3 again and performs the same fault determination.

If there is still no response from computer 3, computer 3 is judged to be abnormal and $N_{13}$ is retained in the fault monitoring information table 10 on computer 1. If computer 3 does respond, the fault is judged to be a path fault: every computer is notified that all subsequent inter-computer communication will use bus 5, and $N_{13}$ is changed to 0 in the fault monitoring information table 10 of computer 1.
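The retry over the auxiliary bus separates a path fault from a computer fault. The sketch below illustrates that decision under stated assumptions: `send_health_signal` is a hypothetical caller-supplied function that returns `True` when the peer responds on the given bus, and the labels `"main"` and `"aux"` stand in for bus 4 and bus 5.

```python
def probe_peer(send_health_signal, peer):
    """Return (verdict, n_value) for one peer.

    send_health_signal(peer, bus) is a caller-supplied function (an
    assumption of this sketch) that returns True if the peer responds.
    """
    if send_health_signal(peer, "main"):
        return "normal", 0
    # No response on the main bus: retry once over the auxiliary bus.
    if send_health_signal(peer, "aux"):
        return "path_fault", 0      # peer is alive; the main path is suspect
    return "computer_fault", 1      # no response on either bus


# Stub transports, for illustration only.
def main_bus_down(peer, bus):
    return bus == "aux"


def peer_dead(peer, bus):
    return False


result_path = probe_peer(main_bus_down, 3)
result_dead = probe_peer(peer_dead, 3)
```

In this sketch `result_path` is `("path_fault", 0)` and `result_dead` is `("computer_fault", 1)`. Notifying the other computers to switch to bus 5 after a path fault is left out.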

Further, at time $T_2$ ($= T_1 + T$, where $T$ is the interval at which fault processing is handed over between computers), computer 2 obtains the health signal transmission right, collects fault monitoring information from each computer, and builds up the fault monitoring information table 16 on computer 2 in the same way that computer 1 did. If the fault in computer 3 has not been recovered by then, the fault monitoring information table 16 becomes as shown in FIG. 6, condition (A) is satisfied for computer 3, and both computers 1 and 2 judge computer 3 to be faulty.
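The two rounds at $T_1$ and $T_2$ can be traced with a small simulation. This is a sketch under several assumptions: FIGS. 5 and 6 are not reproduced here, so every table is assumed to start with all entries 0; the auxiliary-bus retry is omitted; and the names `make_tables` and `run_round` are illustrative.

```python
def make_tables(ids):
    """tables[k][i][j] is computer k's copy of N_ij, all initially 0 (normal)."""
    return {k: {i: {j: 0 for j in ids if j != i} for i in ids} for k in ids}


def run_round(holder, alive, tables):
    """One fault-determination round on the computer holding the
    transmission right. Returns the computers meeting condition (A)."""
    table = tables[holder]
    for peer in sorted(tables):
        if peer == holder:
            continue
        if peer in alive:
            table[peer] = dict(tables[peer][peer])  # store the peer's reported row
            table[holder][peer] = 0                 # peer answered: normal
        else:
            table[holder][peer] = 1                 # no answer: abnormal
    # Condition (A): at least two computers report j as abnormal.
    return [j for j in sorted(table)
            if sum(table[i][j] for i in table if i != j) >= 2]


tables = make_tables([1, 2, 3])
alive = {1, 2}                       # computer 3 failed at T0
at_t1 = run_round(1, alive, tables)  # computer 1 holds the right at T1
at_t2 = run_round(2, alive, tables)  # computer 2 holds the right at T2
```

Under these assumptions `at_t1` is `[]`, because only $N_{13}$ is 1 after the first round, and `at_t2` is `[3]`, because computer 2's table then holds both $N_{13} = 1$ (reported by computer 1) and $N_{23} = 1$ (its own observation).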

[Effect of the Invention]

As described in detail above, according to the present invention, plural computers monitor one another and exchange the monitoring information, so that the fault location can be determined accurately. Appropriate fault handling can therefore be carried out, and the reliability of the composite computer system can be improved.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.
FIG. 2 is a block diagram showing the fault monitoring function and the overall configuration of the embodiment.
FIG. 3 is a diagram showing an example of the contents of the fault monitoring information table of the embodiment.
FIG. 4 is a time chart for the embodiment, assuming that a fault has occurred in computer 3.
FIG. 5 is a diagram showing the contents of the fault monitoring information table of computer 1 obtained as a result of the fault processing of computer 1 that starts at time T1.
FIG. 6 is a diagram showing the contents of the fault monitoring information table of computer 2 obtained as a result of the fault processing of computer 2 that starts at time T2.

1, 2, 3: computers; 4, 5: buses; 6, 12, 18: processors; 7, 13, 19: main storage devices; 8, 9, 14, 15, 20, 21: bus control units; 10, 16, 22: fault monitoring information tables; 11, 17, 23: fault monitoring functions; 31: fault monitoring signal sending means; 32: fault monitoring information storage means; 33: fault monitoring information notifying means; 34: faulty computer determining means.

Claims (3)

[Claims]
(1) A fault determination device for a composite computer system in which a plurality of computers are interconnected by communication paths, characterized in that each computer is provided with fault monitoring means for monitoring faults in the other computers.
(2) The device according to claim (1), wherein the fault monitoring means includes: means for sending a fault monitoring signal to the other computers; means for storing the fault monitoring information returned by the other computers in response to this signal; and means for determining a faulty computer from the information stored by that means.
(3) The device according to claim (1), wherein the plurality is three or more.
JP62125308A 1987-05-22 1987-05-22 Fault deciding device for composite computer system Pending JPS63291132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP62125308A JPS63291132A (en) 1987-05-22 1987-05-22 Fault deciding device for composite computer system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP62125308A JPS63291132A (en) 1987-05-22 1987-05-22 Fault deciding device for composite computer system

Publications (1)

Publication Number Publication Date
JPS63291132A true JPS63291132A (en) 1988-11-29

Family

ID=14906886

Family Applications (1)

Application Number Title Priority Date Filing Date
JP62125308A Pending JPS63291132A (en) 1987-05-22 1987-05-22 Fault deciding device for composite computer system

Country Status (1)

Country Link
JP (1) JPS63291132A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008234117A (en) * 2007-03-19 2008-10-02 Fujitsu Ltd Multiprocessor system and recovery method in multiprocessor system

Similar Documents

Publication Publication Date Title
US7941810B2 (en) Extensible and flexible firmware architecture for reliability, availability, serviceability features
JPS63291132A (en) Fault deciding device for composite computer system
JPH02132529A (en) Automatic monitoring switching control device
JP4679334B2 (en) Wide area alarm monitoring system
JPS6136663B2 (en)
JPH0934852A (en) Cluster system
JPH11338724A (en) Standby system, standby method and recording medium
JPS634366A (en) Mutual monitor system for multicomputer
JPH03123230A (en) Early alarm detector relating to network monitor system
JP2575943B2 (en) Data transmission equipment
JPS58201155A (en) Dual system monitoring system
JPH04158449A (en) Multicomputer system
JPS6314542B2 (en)
JPH11250026A (en) Fault recovery method and its system for parallel multiprocessor system
JPH02281368A (en) Trouble detecting mechanism for controller
JPS5850372B2 (en) Data collection and distribution processing system
CN117743008A (en) Multi-core processor fault diagnosis and exception handling method, main control board card and equipment
JPH03184154A (en) Network control system
JPS5870670A (en) Failure information transfer system for exchange of duplex system
JPS6224354A (en) Duplex computer system
JPS58211268A (en) Multi-processor system
JPH02165357A (en) Information transfer device
JPS62264796A (en) Information supervising system
JPS62190536A (en) Redundant constitution control system
JPH08181738A (en) Fault location method and fault location device