JPS598064A - Fault diagnosing system for multiplex computer system - Google Patents

Fault diagnosing system for multiplex computer system

Info

Publication number
JPS598064A
JPS598064A JP57115477A JP11547782A JPS598064A JP S598064 A JPS598064 A JP S598064A JP 57115477 A JP57115477 A JP 57115477A JP 11547782 A JP11547782 A JP 11547782A JP S598064 A JPS598064 A JP S598064A
Authority
JP
Japan
Prior art keywords
main memory
series
information
fault
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP57115477A
Other languages
Japanese (ja)
Inventor
Sei Ogiwara
荻原 聖
Eiji Hasegawa
栄司 長谷川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Tokyo Shibaura Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp, Tokyo Shibaura Electric Co Ltd
Priority to JP57115477A
Publication of JPS598064A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a memory management context, e.g. virtual memory or cache management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

PURPOSE: To prevent the loss of important clues for fault diagnosis in a multiplex computer system by having the remaining non-defective series collect the information held in the main memory of the faulted series. CONSTITUTION: If a fault arises in the computer of series (a), the fault is reported to CPU 1b of series (b). On being notified that series (a) has stopped because of the fault, the main memory information saving program 2b-2 of series (b) saves the information in main memory 2a of series (a) to the main memory information preservation area 3b-2 of the external storage device (bulk memory) 3b of series (b), via signal system C and devices 5a and 5b. CPU 1a can access main memory 2b of the other system via devices 5a and 5b, while CPU 1b can access main memory 2a of the other system via devices 5b and 5a.

Description

[Detailed Description of the Invention]

[Technical Field of the Invention]

The present invention relates to a fault diagnosis method for a multi-system computer system, and in particular to one that can preserve the information held in the main memory of a faulted computer up to the moment just before the fault occurred.

[Technical Background of the Invention]

In general, the likely causes of a serious fault that brings a computer system to a halt include the failure of a critical part of its hardware and a runaway program caused by a bug.

The information that serves as the most useful clue for diagnosing such faults and identifying their cause resides in the main memory of the computer system at the moment it stops because of the fault. This is because the information held in main memory at the time of the stop includes the execution state of the programs up to the stop and the input/output state with external storage devices and peripheral equipment. For this reason, the conventional practice is to save the information in main memory to an external storage device as the system comes to a stop and, after the computer system has been restarted, to output that information to a line printer or the like for fault diagnosis.

A conventional fault diagnosis method is explained with reference to FIG. 1. The computer system shown in FIG. 1 comprises a central processing unit (hereinafter CPU) 1, a main memory 2, an external storage device (hereinafter bulk memory) 3, and a line printer (hereinafter LP) 4. Reference numeral 6 denotes a bus.

When a fault of the kind described above, caused by hardware or software, occurs in this computer system, CPU 1 is normally notified in the form of an interrupt (hereinafter the fault interrupt). On receiving the fault interrupt, CPU 1 suspends the program it had been executing and immediately transfers control to the main memory information saving program 2-1. Because the main memory information saving program 2-1 must run under exactly these circumstances, that is, just as the computer system is about to stop, it cannot take the form of an ordinary program that normally resides on bulk memory 3 and is loaded into main memory 2 only at execution time. It is instead a program that resides permanently in main memory (a main-memory-resident program).

The operation of the main memory information saving program 2-1 is well known, so a detailed description is omitted, but it has the following function.

That is, it transfers all or a selected part of the information in main memory 2 through signal system A to the main memory information preservation area 3-1 of bulk memory 3, saves it there, and then stops the computer system. After the computer system has been restarted, the information saved in the main memory information preservation area 3-1 of bulk memory 3 is output through signal system B to LP 4 or the like by a program not shown, and is used for fault diagnosis.
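
The conventional flow described above can be sketched roughly as follows. This is a minimal illustrative model, not the patent's implementation: the class and method names (`Computer`, `BulkMemory`, `on_fault_interrupt`, `restart_and_print`) are hypothetical, and memory is modeled as a plain byte array.

```python
# Hypothetical model of the conventional single-system dump (FIG. 1).
# All names are illustrative assumptions; none come from the patent.

class BulkMemory:
    """Stands in for bulk memory 3 with its preservation area 3-1."""

    def __init__(self):
        self.preservation_area = None  # empty until a dump has been taken


class Computer:
    def __init__(self, memory_size=16):
        self.main_memory = bytearray(memory_size)  # main memory 2
        self.bulk = BulkMemory()
        self.running = True

    def on_fault_interrupt(self):
        """Resident saving program 2-1: copy main memory out, then halt."""
        # Transfer over "signal system A" is modeled as a simple copy.
        self.bulk.preservation_area = bytes(self.main_memory)
        self.running = False

    def restart_and_print(self):
        """After restart, format the saved image as LP 4 might print it."""
        self.running = True
        image = self.bulk.preservation_area
        return image.hex() if image is not None else "<no dump saved>"


machine = Computer(memory_size=4)
machine.main_memory[:] = b"\xde\xad\xbe\xef"  # pretend program state
machine.on_fault_interrupt()
print(machine.restart_and_print())
```

Note that every step here runs on the faulted machine itself, which is the dependency the following section identifies as the weakness of this approach.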

[Problems with the Background Art]

The above is a representative example of a fault diagnosis method in a computer system, but it has the following drawback. If the cause of the fault lies in hardware, so that the fault interrupt can no longer be generated or the transfer to bulk memory 3 over signal system A becomes impossible, this method does not function at all.

The same is true when the cause of the fault lies in software and a runaway program has destroyed the main memory information saving program 2-1.

Multi-system computer systems have also obtained information for fault diagnosis by the same method as above.

[Object of the Invention]

The present invention has been made to overcome the above drawbacks. Its object is to provide a fault diagnosis method for a multi-system computer system that can prevent the loss of important clues for fault diagnosis whether the fault arises in hardware or in software.

[Summary of the Invention]

In the present invention, when a fault occurs in any one series of the computers that make up a multi-system, the remaining normally operating series collects the information held in the main memory of the faulted series, thereby preventing the loss of important clues for fault diagnosis.

[Embodiment]

An embodiment is described below with reference to the drawings. FIG. 2 is a configuration diagram of one embodiment of the fault diagnosis method for a multi-system computer system according to the present invention.

FIG. 2 shows a duplex computer system. As in FIG. 1, each computer is provided with a CPU (1a, 1b), a main memory (2a, 2b), a bulk memory (3a, 3b), and an LP (4a, 4b). The computer whose reference numerals carry the suffix a is called the first series, and the computer whose reference numerals carry the suffix b is called the second series.

5a and 5b are devices that allow each system to access the main memory of the other. That is, CPU 1a can access main memory 2b in the other system via devices 5a and 5b, and CPU 1b can access main memory 2a in the other system via devices 5b and 5a. These devices are called computer system linkage devices (hereinafter CSL).

Next, the operation of the embodiment shown in FIG. 2 is explained with reference to the flowchart of FIG. 3.

Consider the case where a fault occurs in the computer of the first series. Through the OR condition of steps A and B, the occurrence of the fault is reported to CPU 1b of the second series, as shown in step C. That is, the normal second series learns of the fault in the first series (step C) by either of two means: a hardware means (step A), such as feeding the output of a stop detection device of the first series (not shown) into the interrupt device of the second series, or a software means (step B), such as detection by an other-system status monitoring program in the second series.
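
The OR condition of steps A and B might be modeled as below. This is only a sketch under stated assumptions: the patent does not say how the monitoring program detects a stopped peer, so the heartbeat timeout used for step B, the function name `peer_fault_detected`, and its parameters are all hypothetical.

```python
def peer_fault_detected(hw_stop_signal, last_heartbeat, now, timeout_s=1.0):
    """Return True when the healthy series should treat its peer as faulted.

    hw_stop_signal: output of the peer's stop detection device (step A).
    last_heartbeat, now: times in seconds; a stale heartbeat is an ASSUMED
        stand-in for the other-system status monitoring program (step B).
    """
    step_a = bool(hw_stop_signal)                # hardware path
    step_b = (now - last_heartbeat) > timeout_s  # software path (assumption)
    return step_a or step_b                      # OR condition leading to step C


# Either path alone is enough to trigger step C.
print(peer_fault_detected(True, last_heartbeat=0.0, now=0.1))
print(peer_fault_detected(False, last_heartbeat=0.0, now=2.0))
print(peer_fault_detected(False, last_heartbeat=0.0, now=0.5))
```

Having two independent detection paths means a fault that silences one of them, for example hardware that can no longer raise a stop signal, can still be noticed through the other.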

On being notified that the first series has stopped because of a fault, the main memory information saving program 2b-2 of the second series saves the information held in main memory 2a of the stopped first series, via signal system C and through the linkage devices (CSL) 5a and 5b, to the main memory information preservation area 3b-2 of bulk memory 3b of the second series (step D).
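
Step D can be illustrated with a small model in which the healthy series pulls the faulted series' memory image through a linkage object. The names `Series`, `LinkageDevice`, and `save_peer_memory` are hypothetical, and the linkage devices 5a and 5b are collapsed into a single object for brevity; this is a sketch of the idea, not the patent's design.

```python
# Hypothetical model of step D: the healthy series dumps its peer's memory.

class Series:
    def __init__(self, name, memory_size=8):
        self.name = name
        self.main_memory = bytearray(memory_size)  # main memory 2a or 2b
        self.preservation_area = {}                # area in own bulk memory
        self.halted = False


class LinkageDevice:
    """Stands in for the pair 5a/5b: read access to the other series' memory."""

    def __init__(self, first, second):
        self._peer_of = {first.name: second, second.name: first}

    def read_peer_memory(self, requester):
        # The read is served without running any code on the peer's CPU.
        return bytes(self._peer_of[requester.name].main_memory)


def save_peer_memory(healthy, link):
    """Saving program on the healthy series: copy the peer image into own storage."""
    image = link.read_peer_memory(healthy)
    healthy.preservation_area["peer"] = image
    return image


series_a, series_b = Series("a"), Series("b")
link = LinkageDevice(series_a, series_b)
series_a.main_memory[0:3] = b"\x0a\x0b\x0c"  # state just before the fault
series_a.halted = True                       # series a can no longer run code
print(save_peer_memory(series_b, link).hex())
```

The point of the arrangement is that nothing in `save_peer_memory` depends on the faulted series still being able to execute its own saving program.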

Note that this operation of the main memory information saving program 2b-2 in the second-series computer system can be carried out in parallel with the execution of other application programs.

The computers that make up a multi-system computer system do not each operate independently; they operate in organic combination with one another. Therefore, if the second-series main memory saving program 2b-2 described in the above embodiment is made to save, together with the information collected from the main memory of the first series, the information held in the second series' own main memory 2b to the main memory information preservation area 3b-2 in bulk memory 3b, information for a broader fault diagnosis can be provided.
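
One way to picture this extension is a single record that holds both images taken at the same point in time. The `DumpRecord` type and `collect_both` function are hypothetical names used only for illustration; the patent does not specify a record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DumpRecord:
    """One entry in the preservation area (layout is an assumption)."""
    faulted_image: bytes  # copy of the faulted series' main memory
    own_image: bytes      # copy of the healthy series' own main memory


def collect_both(faulted_memory: bytearray, own_memory: bytearray) -> DumpRecord:
    # bytes(...) takes immutable snapshots, so later writes by the still
    # running healthy series do not alter what was recorded.
    return DumpRecord(bytes(faulted_memory), bytes(own_memory))


memory_a = bytearray(b"\x01\x02")
memory_b = bytearray(b"\x10\x20")
record = collect_both(memory_a, memory_b)
memory_b[0] = 0xFF  # the healthy series keeps running after the dump
print(record.faulted_image.hex(), record.own_image.hex())
```

Keeping the two images in one record makes it easier to correlate the state of the healthy series with that of the faulted one during later diagnosis.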

[Effect of the Invention]

As explained above, according to the present invention, when a fault occurs in any series of the computers that make up a multi-system computer system, the remaining normally operating series collects the information held in the main memory of the faulted series. Where necessary, it can also collect the information held at the same moment in the main memory of the normal series that relates to the state of the faulted series. It is therefore possible to provide a fault diagnosis method for a multi-system computer system that does not lose the information needed for a more accurate and wide-ranging fault diagnosis.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram for explaining a conventional fault diagnosis method, FIG. 2 is a configuration diagram for explaining the fault diagnosis method for a multi-system computer system according to the present invention, and FIG. 3 is a flowchart for explaining the operation.

1: central processing unit
2: main memory
2-1: main memory information saving program
3: external storage device
3-1: main memory information preservation area
4: line printer
5a, 5b: devices for accessing the main memory of the other system

Patent applicant: Tokyo Shibaura Electric Co., Ltd.

Claims (1)

[Claims]

A fault diagnosis method for a multi-system computer system composed of a plurality of computers, capable of preserving without loss the information held in the main memory of a faulted computer when a fault occurs within the system, characterized in that a main memory saving program that operates when a fault occurs collects the information held in the main memory of the faulted computer into the main memory information preservation area of a normally operating computer.
JP57115477A 1982-07-05 1982-07-05 Fault diagnosing system for multiplex computer system Pending JPS598064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP57115477A JPS598064A (en) 1982-07-05 1982-07-05 Fault diagnosing system for multiplex computer system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP57115477A JPS598064A (en) 1982-07-05 1982-07-05 Fault diagnosing system for multiplex computer system

Publications (1)

Publication Number Publication Date
JPS598064A true JPS598064A (en) 1984-01-17

Family

ID=14663490

Family Applications (1)

Application Number Title Priority Date Filing Date
JP57115477A Pending JPS598064A (en) 1982-07-05 1982-07-05 Fault diagnosing system for multiplex computer system

Country Status (1)

Country Link
JP (1) JPS598064A (en)

Similar Documents

Publication Publication Date Title
JP7351933B2 (en) Error recovery method and device
JPS6375963A (en) System recovery method
JPH0375834A (en) Apparatus and method of sequentially correcting parity
JP3211878B2 (en) Communication processing control means and information processing apparatus having the same
JP2956849B2 (en) Data processing system
JPS598064A (en) Fault diagnosing system for multiplex computer system
JPH07183891A (en) Computer system
CN114416436A (en) Reliability method for single event upset effect based on SoC chip
JP2937857B2 (en) Lock flag release method and method for common storage
JP2002229811A (en) Control method of logical partition system
KR20020065188A (en) Method for managing fault in computer system
JPS6112580B2 (en)
Comfort A fault-tolerant system architecture for navy applications
JP3311704B2 (en) Failure processing method of multiprocessor communication mechanism
JP3019409B2 (en) Machine check test method for multiprocessor system
CN115080211A (en) A task scheduling method, system and related components of a virtualized platform system
JP3340284B2 (en) Redundant system
JPH0224731A (en) Error processing method
JPH03111962A (en) Multiprocessor system
JPH0916425A (en) Information processing system
JPH0268634A (en) Spare system for electronic computer
JPH0227449A (en) Information collecting system at time of software fault
JPH0527994A (en) Preventing erroneous output of digital equipment
JPH1020968A (en) Selective hardware reset circuit
JPH0713792A (en) Error control system in hot standby system