JPS598064A - Fault diagnosing system for multiplex computer system - Google Patents
Fault diagnosing system for multiplex computer system
- Publication number
- JPS598064A
- Authority
- JP
- Japan
- Prior art keywords
- main memory
- series
- information
- fault
- computer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/073—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Hardware Redundancy (AREA)
- Multi Processors (AREA)
Description
DETAILED DESCRIPTION OF THE INVENTION
[Technical Field of the Invention]
The present invention relates to a fault diagnosis method for a multiplex computer system, and in particular to a fault diagnosis method that can preserve the information held in the main memory of a faulted computer up to the moment just before the fault occurred.
In general, the serious faults that bring a computer system to a halt are caused either by the failure of a critical part of its hardware or by a program running out of control because of a bug.
The most useful clues for diagnosing such faults and identifying their cause are found in the main memory of the computer system at the time it halted. This is because the information held in main memory at that moment includes the execution state of the programs leading up to the halt and the input/output state with external storage devices and peripheral equipment. For this reason, the conventional practice is to save the information in main memory to an external storage device as the system is about to halt, and then, after the computer system has been restarted, to output that information to a line printer or the like and use it for fault diagnosis.
A conventional fault diagnosis method is described with reference to FIG. 1. The computer system shown in FIG. 1 comprises a central processing unit (hereinafter CPU) 1, a main memory 2, an external storage device (hereinafter bulk memory) 3, and a line printer (hereinafter LP) 4. Reference numeral 6 denotes a bus.
When a fault of the kind described above, attributable to hardware or software, occurs in this computer system, CPU 1 is normally notified by an interrupt (hereinafter called a fault interrupt). On receiving the fault interrupt, CPU 1 suspends the program it had been executing and immediately transfers control to the main memory information saving program 2-1. This program has to run under exactly these circumstances, that is, just before the computer system halts. It therefore cannot take the usual form of a program that normally resides on bulk memory 3 and is loaded into main memory 2 only when executed; it is a program that resides permanently in main memory (a main-memory-resident program).
The operation of the main memory information saving program 2-1 is well known, so a detailed description is omitted; in outline, it has the following function.
The program transfers all of the information in main memory 2, or a selected part of it, through signal path A to the main memory information storage area 3-1 of bulk memory 3, and then halts the computer system. After the computer system has been restarted, a program (not shown) outputs the information saved in the main memory information storage area 3-1 through signal path B to LP 4 or a similar device, where it is used for fault diagnosis.
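The conventional sequence described above (save on fault interrupt, halt, print after restart) can be sketched as a small simulation. This is only an illustrative model: the class `Computer`, its fields, and its method names are assumptions made for the example and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Computer:
    """Toy model of the single computer in FIG. 1 (all names are illustrative)."""
    main_memory: bytearray                        # main memory 2
    bulk_save_area: bytes = b""                   # storage area 3-1 in bulk memory 3
    running: bool = True
    printed: list = field(default_factory=list)   # stands in for LP 4

    def on_fault_interrupt(self, selected: slice = slice(None)) -> None:
        """Resident saving program 2-1: copy memory (all or part) via path A, then halt."""
        self.bulk_save_area = bytes(self.main_memory[selected])
        self.running = False

    def restart_and_print(self) -> None:
        """After restart, output the saved image via path B for diagnosis."""
        self.running = True
        self.printed.append(self.bulk_save_area.hex())


machine = Computer(main_memory=bytearray(b"\x01\x02\x03\x04"))
machine.on_fault_interrupt()      # fault interrupt arrives: save, then halt
machine.restart_and_print()       # operator restarts and prints the dump
print(machine.printed[0])
```

Passing a `slice` to `on_fault_interrupt` models the "selected part" case; the default saves the whole memory image.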
The above is a typical example of fault diagnosis in a computer system, but it has the following drawback. If the cause of the fault lies in hardware, so that the fault interrupt can no longer be generated or the transfer to bulk memory 3 over signal path A becomes impossible, this method does not function at all.
The same is true when the cause of the fault lies in software and the main memory information saving program 2-1 itself has been destroyed by a runaway program.
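The failure conditions just listed can be summarized in a short sketch: the self-dump succeeds only if the interrupt is raised, the transfer path works, and the saving program is intact. The function name and parameters below are hypothetical and chosen only for illustration.

```python
from typing import Optional


def conventional_dump(memory: bytes, interrupt_raised: bool,
                      path_a_ok: bool, saver_intact: bool) -> Optional[bytes]:
    """Return the saved image, or None when the self-dump cannot run."""
    if not (interrupt_raised and path_a_ok and saver_intact):
        return None  # the diagnostic information is lost
    return bytes(memory)


image = b"\xde\xad"
print(conventional_dump(image, True, True, True))     # all three conditions hold
print(conventional_dump(image, False, True, True))    # hardware cannot raise the interrupt
print(conventional_dump(image, True, True, False))    # runaway program destroyed the saver
```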
Multiplex systems have so far obtained information for fault diagnosis by the same method.
The present invention was made to solve the above drawback. Its object is to provide a fault diagnosis method for a multiplex computer system that can prevent the loss of important clues for fault diagnosis whether the fault originates in hardware or in software.
In the present invention, when a fault occurs in any one of the series of computers that make up the multiplex system, the information in the main memory of the faulted series is collected by the remaining, normally operating series, thereby preventing the loss of important clues for fault diagnosis.
Embodiment
An embodiment is described below with reference to the drawings. FIG. 2 is a block diagram of one embodiment of the fault diagnosis method for a multiplex computer system according to the present invention.
FIG. 2 shows a duplex computer system. As in FIG. 1, the computers have CPUs 1a and 1b, main memories 2a and 2b, bulk memories 3a and 3b, and LPs 4a and 4b. The computer whose reference numerals carry the suffix a is called the first series, and the computer with the suffix b is called the second series.
Devices 5a and 5b allow each system to access the main memory of the other. CPU 1a can access main memory 2b in the partner system via devices 5a and 5b, and CPU 1b can access main memory 2a in the partner system via devices 5b and 5a. These devices are called computer system linkage devices (hereinafter CSL).
Next, the operation of the embodiment shown in FIG. 2 is described with reference to the flowchart of FIG. 3.
Suppose a fault occurs in the first-series computer. Through the OR condition of steps A and B, the second-series CPU 1b is notified of the fault, as shown in step C. That is, the normal second series learns of the fault in the first series (step C) by one of two means: a hardware means, such as feeding the output of a first-series halt detection device (not shown) into the interrupt detection device of the second series (step A), or a software means, such as detection by a program in the second series that monitors the state of the other system (step B).
On being notified that the first series has halted because of a fault, the main memory information saving program 2b-2 of the second series reads the information in main memory 2a of the halted first series over signal path C, through CSLs 5a and 5b, and saves it in the main memory information storage area 3b-2 of the second-series bulk memory 3b (step D).
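Steps A to D can be sketched as follows. This is a minimal simulation under stated assumptions: the `Series` class, `fault_detected`, and `collect_peer_memory` are hypothetical names introduced for illustration, and the cross-system read that the linkage devices perform in hardware is modelled here as a plain memory copy.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """One computer (series) of the duplex system in FIG. 2; names are illustrative."""
    name: str
    main_memory: bytearray
    halted: bool = False
    bulk_save_area: dict = field(default_factory=dict)  # e.g. storage area 3b-2


def fault_detected(hw_halt_signal: bool, monitor_reports_fault: bool) -> bool:
    """Step C: the OR of hardware detection (step A) and software monitoring (step B)."""
    return hw_halt_signal or monitor_reports_fault


def collect_peer_memory(healthy: Series, faulted: Series) -> None:
    """Step D: read the peer's memory through the linkage devices and save it locally."""
    image = bytes(faulted.main_memory)            # read across the CSLs (signal path C)
    healthy.bulk_save_area[faulted.name] = image  # save in the healthy series' bulk memory


first = Series("first", bytearray(b"state-before-fault"), halted=True)
second = Series("second", bytearray(b"normal-work"))

if fault_detected(hw_halt_signal=first.halted, monitor_reports_fault=False):
    collect_peer_memory(healthy=second, faulted=first)

print(second.bulk_save_area["first"])
```

The point of the design is that nothing in the collection path depends on the faulted series still being able to execute code: the surviving series does both the detection and the copy.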
This operation of the main memory information saving program 2b-2 in the second-series computer system can be carried out in parallel with the execution of other application programs.
The computers that make up a multiplex computer system do not operate independently of one another; they operate in close coordination. Therefore, if the second-series main memory saving program 2b-2 described in the above embodiment is arranged to save, together with the information collected from the first-series main memory, the information in the second series' own main memory 2b into the main memory information storage area 3b-2 in bulk memory 3b, information for a wider range of fault diagnosis becomes available.
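The extension described above, in which the surviving series also saves its own memory image, might look like the following sketch. The function name and dictionary keys are assumptions for illustration only.

```python
from typing import Dict


def collect_snapshot(faulted_memory: bytes, own_memory: bytes) -> Dict[str, bytes]:
    """Save the faulted series' image together with the collector's own image."""
    return {
        "faulted_series": bytes(faulted_memory),  # read via the linkage devices
        "own_series": bytes(own_memory),          # the collector's own main memory
    }


snapshot = collect_snapshot(b"first-series-state", bytearray(b"second-series-state"))
print(sorted(snapshot))
```

Taking both copies in one call keeps the two images close in time, which matters because the state of one series may only make sense alongside the state of the other.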
As described above, according to the present invention, when a fault occurs in any one of the series of computers that make up a multiplex computer system, the remaining normally operating series collects the information in the main memory of the faulted series. If necessary, it can also collect the information, held in the main memory of the normal series at the same point in time, that relates to the state of the faulted series. The invention thus provides a fault diagnosis method for a multiplex computer system that does not lose the information needed for more accurate and more extensive fault diagnosis.
FIG. 1 is a block diagram for explaining the conventional fault diagnosis method, FIG. 2 is a block diagram for explaining the fault diagnosis method for a multiplex computer system according to the present invention, and FIG. 3 is a flowchart for explaining the operation.
1: central processing unit
2: main memory
2-1: main memory information saving program
3: external storage device
3-1: main memory information storage area
4: line printer
5a, 5b: devices for accessing the main memory of the other system
Patent applicant: Tokyo Shibaura Electric Co., Ltd.
Claims (1)
A fault diagnosis method for a multiplex computer system made up of a plurality of computers, the method being capable of preserving, without loss, the information in the main memory of a faulted computer when a fault occurs, characterized in that a main memory saving program that operates when a fault occurs collects the information in the main memory of the faulted computer into the main memory information storage area of a normally operating computer.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP57115477A JPS598064A (en) | 1982-07-05 | 1982-07-05 | Fault diagnosing system for multiplex computer system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP57115477A JPS598064A (en) | 1982-07-05 | 1982-07-05 | Fault diagnosing system for multiplex computer system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| JPS598064A (en) | 1984-01-17 |
Family
ID=14663490
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| JP57115477A Pending JPS598064A (en) | 1982-07-05 | 1982-07-05 | Fault diagnosing system for multiplex computer system |
Country Status (1)
| Country | Link |
|---|---|
| JP (1) | JPS598064A (en) |
-
1982
- 1982-07-05 JP JP57115477A patent/JPS598064A/en active Pending
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7351933B2 (en) | Error recovery method and device | |
| JPS6375963A (en) | System recovery method | |
| JPH0375834A (en) | Apparatus and method of sequentially correcting parity | |
| JP3211878B2 (en) | Communication processing control means and information processing apparatus having the same | |
| JP2956849B2 (en) | Data processing system | |
| JPS598064A (en) | Fault diagnosing system for multiplex computer system | |
| JPH07183891A (en) | Computer system | |
| CN114416436A (en) | Reliability method for single event upset effect based on SoC chip | |
| JP2937857B2 (en) | Lock flag release method and method for common storage | |
| JP2002229811A (en) | Control method of logical partition system | |
| KR20020065188A (en) | Method for managing fault in computer system | |
| JPS6112580B2 (en) | ||
| Comfort | A fault-tolerant system architecture for navy applications | |
| JP3311704B2 (en) | Failure processing method of multiprocessor communication mechanism | |
| JP3019409B2 (en) | Machine check test method for multiprocessor system | |
| CN115080211A (en) | A task scheduling method, system and related components of a virtualized platform system | |
| JP3340284B2 (en) | Redundant system | |
| JPH0224731A (en) | Error processing method | |
| JPH03111962A (en) | Multiprocessor system | |
| JPH0916425A (en) | Information processing system | |
| JPH0268634A (en) | Spare system for electronic computer | |
| JPH0227449A (en) | Information collecting system at time of software fault | |
| JPH0527994A (en) | Preventing erroneous output of digital equipment | |
| JPH1020968A (en) | Selective hardware reset circuit | |
| JPH0713792A (en) | Error control system in hot standby system |