JPS598064A

JPS598064A - Fault diagnosing system for multiplex computer system

Info

Publication number: JPS598064A
Application number: JP57115477A
Authority: JP
Inventors: Sei Ogiwara; 荻原　聖; Eiji Hasegawa; 栄司長谷川
Original assignee: Toshiba Corp; Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1982-07-05
Filing date: 1982-07-05
Publication date: 1984-01-17

Abstract

PURPOSE: To prevent the loss of important clues for fault diagnosis of a multiplex computer system by having the remaining non-defective series collect the information held in the main memory of the series in which a fault has occurred. CONSTITUTION: If a fault arises in the computer of series (a), the occurrence of the fault is reported to CPU 1b of series (b). On receiving notice that series (a) has stopped because of the fault, the main memory information saving program 2b-2 of series (b) saves the information in main memory 2a of series (a) to the main memory information storage area 3b-2 of the external storage device (bulk memory) 3b of series (b), via signal system C and devices 5a and 5b. CPU 1a can access main memory 2b of the other system via devices 5a and 5b, and CPU 1b can access main memory 2a of the other system via devices 5b and 5a.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical Field of the Invention]

The present invention relates to a fault diagnosis method for a multi-system computer system, and in particular to a fault diagnosis method that can preserve the information held in the main memory of the faulty computer up to the moment just before the fault occurred.

[Technical Background of the Invention]

In general, the serious faults that bring a computer system to a halt are caused either by the failure of a critical part of its hardware or by a program running out of control because of a bug.

The most useful clues for diagnosing such faults and identifying their cause are found in the main memory of the computer system at the moment it halted. This is because the information held in main memory at the halt includes the execution state of the programs up to the halt and the state of input/output with external storage devices and peripheral devices. For this reason, the conventional practice is to save the information in main memory to an external storage device as the system halts and, after the computer system has been restarted, to output that information to a line printer or the like for fault diagnosis.

The conventional fault diagnosis method is described with reference to FIG. 1. The computer system shown in FIG. 1 comprises a central processing unit (hereinafter CPU) 1, a main memory 2, an external storage device (hereinafter bulk memory) 3, and a line printer (hereinafter LP) 4. Reference numeral 6 denotes a bus.

When a fault caused by hardware or software, as described above, occurs in this computer system, CPU 1 is normally notified in the form of an interrupt (hereinafter, fault interrupt). On receiving the fault interrupt, CPU 1 suspends the program it had been executing and immediately transfers control to the main memory information saving program 2-1. Because this program must run in exactly this situation, that is, just before the computer system halts, it cannot take the form of an ordinary program that normally resides in bulk memory 3 and is loaded into main memory 2 only when it is executed. It is instead a program that resides permanently in main memory (a main-memory-resident program).

The operation of the main memory information saving program 2-1 is well known, so a detailed description is omitted, but it has the following functions.

That is, it transfers all or a selected part of the information in main memory 2 through signal system A to the main memory information storage area 3-1 of bulk memory 3, saves it there, and then halts the computer system. After the computer system has been restarted, a program (not shown) outputs the information saved in the main memory information storage area 3-1 of bulk memory 3 through signal system B to LP 4 or a similar device, where it is used for fault diagnosis.
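The conventional save-then-print sequence described above can be pictured with a small model. This is only an illustrative sketch: the class and method names are hypothetical, memory is modelled as a byte array, and the assumption that a restart clears main memory is ours, not the patent's.

```python
class SingleSystem:
    """Toy model of the FIG. 1 system; every name here is illustrative."""

    def __init__(self, memory_size: int = 8) -> None:
        self.main_memory = bytearray(memory_size)  # main memory 2
        self.bulk_save_area = None                 # storage area 3-1 in bulk memory 3
        self.halted = False

    def on_fault_interrupt(self) -> None:
        # Resident saving program 2-1: copy main memory (signal system A), then halt.
        self.bulk_save_area = bytes(self.main_memory)
        self.halted = True

    def restart_and_print(self) -> str:
        # After restart, a separate program sends the saved image to LP 4
        # (signal system B). Assumption: restarting clears main memory.
        self.halted = False
        self.main_memory = bytearray(len(self.main_memory))
        if self.bulk_save_area is None:
            return ""
        return self.bulk_save_area.hex(" ")
```

Note that everything in this model depends on `on_fault_interrupt` actually being called and completing on the faulty machine itself.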

[Problems with the Background Art]

The above is a representative example of a fault diagnosis method for a computer system, but it has the following drawback. If the cause of the fault lies in the hardware, so that the fault interrupt can no longer be generated or the transfer to bulk memory 3 over signal system A becomes impossible, this method does not function at all.

The same is true when the cause of the fault lies in the software and a runaway program destroys the main memory information saving program 2-1 itself.

Multi-system computer systems have also obtained their fault diagnosis information by the same method.

[Purpose of the invention]

The present invention was made to overcome the above drawbacks. Its object is to provide a fault diagnosis method for a multi-system computer system that can prevent the loss of important clues for fault diagnosis whether the fault arises in hardware or in software.

[Summary of the invention]

In the present invention, when a fault occurs in any one series of the computers making up a multi-system configuration, the information in the main memory of the faulty series is collected by the remaining, normally operating series, thereby preventing the loss of important clues for fault diagnosis.

[Embodiment of the Invention]

An embodiment is described below with reference to the drawings. FIG. 2 is a configuration diagram of one embodiment of the fault diagnosis method for a multi-system computer system according to the present invention.

FIG. 2 shows a duplex computer system. As in FIG. 1, the computers are provided with CPUs 1a and 1b, main memories 2a and 2b, bulk memories 3a and 3b, and LPs 4a and 4b. The computer whose reference numerals carry the suffix a is called the first series, and the computer whose reference numerals carry the suffix b is called the second series.

Devices 5a and 5b allow each system to access the main memory of the other. That is, CPU 1a can access main memory 2b of the other system via devices 5a and 5b, and CPU 1b can access main memory 2a of the other system via devices 5b and 5a. These devices are referred to as computer system linkage devices (hereinafter CSL).

Next, the operation of the embodiment shown in FIG. 2 is described with reference to the flowchart of FIG. 3.

Consider the case where a fault occurs in the first-series computer. Under the OR condition of steps A and B, CPU 1b of the second series is notified of the fault, as shown at step C. That is, the healthy second series learns of the fault in the first series (step C) by one of two routes: by hardware means (step A), such as feeding the output of a first-series halt detection device (not shown) into the interrupt input device of the second series, or by software means (step B), such as detection by a program in the second series that monitors the status of the other system.
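The OR condition of steps A and B can be written out as a short sketch. The function names are hypothetical, and the timeout-based check used for step B is purely an assumption for illustration; the text says only that a monitoring program in the second series detects the stop.

```python
def monitor_reports_down(last_seen: float, now: float, timeout: float) -> bool:
    """Step B (software), assumed here to be a timeout on the peer's last sign of life."""
    return now - last_seen > timeout


def peer_fault_detected(halt_signal_raised: bool, monitor_says_down: bool) -> bool:
    """Step C is reached when step A (hardware signal) OR step B (software check) fires."""
    return halt_signal_raised or monitor_says_down
```

Because either route is sufficient, detection does not rely on the faulty series still being able to raise its own fault interrupt.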

On being notified that the first series has stopped because of a fault, the main memory information saving program 2b-2 of the second series saves the information in main memory 2a of the halted first series to the main memory information storage area 3b-2 of the second-series bulk memory 3b, via signal system C and by way of CSLs 5a and 5b (step D).
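Step D can be sketched as follows. This is a minimal model under stated assumptions: `Series`, `Linkage`, and `save_peer_memory` are hypothetical names, the linkage object stands in for devices 5a and 5b, and memory is modelled as a byte array.

```python
class Series:
    """One computer of the duplex system (illustrative model only)."""

    def __init__(self, name: str, memory_size: int = 8) -> None:
        self.name = name
        self.main_memory = bytearray(memory_size)  # main memory 2a or 2b
        self.bulk_save_area = {}                   # storage area in this series' bulk memory
        self.halted = False


class Linkage:
    """Stands in for devices 5a/5b: lets one series read the other's main memory."""

    def __init__(self, peer: Series) -> None:
        self.peer = peer

    def read_peer_memory(self) -> bytes:
        # No code runs on the peer; its memory is read from outside.
        return bytes(self.peer.main_memory)


def save_peer_memory(healthy: Series, linkage: Linkage) -> None:
    # Saving program of the healthy series: store the peer's image in its own bulk memory.
    healthy.bulk_save_area[linkage.peer.name] = linkage.read_peer_memory()
```

The point of the design is visible in `read_peer_memory`: the copy is driven entirely by the healthy series, so it still works when the faulty series is halted or its own saving program has been destroyed.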

Note that this operation of the main memory information saving program 2b-2 in the second-series computer system can be carried out in parallel with the execution of other application programs.

The computers making up a multi-system computer system do not operate independently of one another; they operate in close combination. Therefore, if the second-series main memory information saving program 2b-2 described in the above embodiment not only collects the information in the first-series main memory but also saves the information in the second series' own main memory 2b to the main memory information storage area 3b-2 in bulk memory 3b, information for a broader fault diagnosis can be provided.
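Saving both memory images at the same moment might look like the sketch below. The function name and dictionary keys are hypothetical, and memory is again modelled as byte arrays.

```python
def collect_snapshot(faulty_memory: bytearray, own_memory: bytearray) -> dict:
    """Copy both images at once so that they describe the same point in time."""
    return {
        "faulty": bytes(faulty_memory),  # image of the halted series' main memory
        "own": bytes(own_memory),        # image of the healthy series' own main memory
    }
```

Since the healthy series keeps running after the save, the sketch takes immutable copies: later writes to its own memory do not alter the saved image.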

[Effect of the Invention]

As described above, according to the present invention, when a fault occurs in any one series of the computers making up a multi-system computer system, the remaining, normally operating series collects the information in the main memory of the faulty series. If necessary, it can also collect the information held at the same moment in the main memory of the healthy series that relates to the state of the faulty series. The invention can therefore provide a fault diagnosis method for a multi-system computer system that is more accurate and does not lose the information needed for wide-ranging fault diagnosis.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram for explaining a conventional fault diagnosis method, FIG. 2 is a configuration diagram for explaining the fault diagnosis method for a multi-system computer system according to the present invention, and FIG. 3 is a flowchart for explaining the operation.

1: central processing unit; 2: main memory; 2-1: main memory information saving program; 3: external storage device; 3-1: main memory information storage area; 4: line printer; 5a, 5b: devices for accessing the main memory of the other system.

Patent applicant: Tokyo Shibaura Electric Co., Ltd.

Claims

[Claims]

A fault diagnosis method for a multi-system computer system which, when a fault occurs in a multi-system computer system made up of a plurality of computers, can preserve the information in the main memory of the faulty computer without loss, characterized in that a main memory saving program that operates when the fault occurs collects the information in the main memory of the faulty computer into the main memory information storage area of a normally operating computer.