JPH08249196A

JPH08249196A - Redundant execution method of tasks

Info

Publication number: JPH08249196A
Application number: JP7052422A
Authority: JP
Inventors: Yoshiyuki Baba; 儀之馬場; Atsushi Ishiguro; 淳石黒; Tatsuji Munaka; 達司撫中; Kaoru Abe; 薫阿部
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1995-03-13
Filing date: 1995-03-13
Publication date: 1996-09-27

Abstract

(57) [Summary]
[Purpose] To ensure the reliability of tasks in a multicomputer system.
[Constitution] In a multicomputer system formed of a plurality of nodes, the task control means 6 on each node allocates tasks within the system and also starts and manages tasks on its own node. The event notification unit 7 and the execution error detection means 10 notify the task control means 6 of events that occur in the system and of the results of comparing task execution results. The user task I/F unit 8 is the interface between a user task and the task control means 6 on its own node. This configuration realizes the redundant execution method.
[Effect] Because tasks can continue to run with redundancy, a highly reliable system can be constructed. In addition, because tasks running on a node can be safely moved off that node, the system does not need to be stopped for node failures or for H/W or S/W maintenance work.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a technique for constructing a system with fault tolerance in a multicomputer system composed of a plurality of computers and an external storage device shared among them. It does so by masking failures caused by node disappearance and task errors, and by providing a means of generating checkpoints that are mutually synchronized between tasks.

[0002]

[Prior Art] FIG. 18 shows the configuration of a fault-tolerant computer (hereinafter, FT computer) introduced in Daniel P. Siewiorek, "Fault Tolerance in Commercial Computers," COMPUTER, July 1990, pp. 26-37. In the FT computer shown in FIG. 18, the main H/W modules are made redundant, which is effective in preventing system malfunctions caused by H/W failures.

[0003] In the conventional FT computer shown in FIG. 18, for example, execution results are compared among redundant processor elements, and the data of the processor element pair whose execution results match is adopted as the correct value. If a permanent fault is detected after a comparison error in the execution results, the faulty H/W module is disconnected from the system. The faulty H/W module is then replaced while the system remains online and is reintegrated into the system. For computers that have no fault-tolerant function of their own, on the other hand, system reliability is generally raised by building a duplex system. In such a duplex system, processing that saves the execution history of the main system and processing that passes it to the standby system are performed in anticipation of an unexpected stop of the main system. In addition, FT computer concepts known as SIFT (Software Implemented Fault Tolerance), a fault-tolerant function implemented in S/W, and MAFT (Multicomputer Architecture for Fault Tolerance) have been introduced, and an implementation of a device in this lineage is disclosed in, for example, Japanese Patent Publication (Tokuhyo-Hei) No. 2-503122. FIG. 19 is a block diagram of the multicomputer architecture in Tokuhyo-Hei 2-503122. This architecture ensures that failures of multiple nodes do not prevent the execution of any operating task.

[0004] FIG. 20 is a block diagram of the conventional checkpoint mechanism disclosed in Japanese Patent Laid-Open (Tokkai-Hei) No. 2-287858. Reference numerals 12 and 17 denote processing device A and processing device B, respectively. They are nodes of a distributed system and are connected to each other by a communication line 22. Processing device A12 is composed of a communication control unit A13, a checkpoint task execution unit A14, and a restart task execution unit A15. A checkpoint file A16 is connected to processing device A12 as an external storage device. Similarly, processing device B17 is composed of a communication control unit B18, a checkpoint task execution unit B19, and a restart task execution unit B20. A checkpoint file B21 is connected to processing device B17 as an external storage device.

[0005] In FIG. 20, when the communication control unit A13 of processing device A12 issues a data transmission request over the communication line 22 to the communication control unit B18 of processing device B17, the checkpoint task execution unit A14 stores the information of the executing program in checkpoint file A16 as new checkpoint data. The new checkpoint data is stored in an area separate from the old checkpoint data already held in checkpoint file A16. Next, the checkpoint task execution unit A14 requests the checkpoint task execution unit B19 in processing device B17 to capture the execution information of the program being executed (receiving) in the communication control unit B18. In response to this request, the checkpoint task execution unit B19 stores the corresponding program information in checkpoint file B21 as new checkpoint data. Here too, the new checkpoint data is stored in an area separate from the old checkpoint data already held in checkpoint file B21.

[0006] When the checkpoint is captured successfully, the checkpoint task execution unit B19 returns an acknowledgment (ANK) to the checkpoint task execution unit A14 of processing device A12. If the capture fails, no acknowledgment is returned. When the acknowledgment is returned, the checkpoint task execution unit A14 discards the old checkpoint data in checkpoint file A16 and treats the new checkpoint data as the old checkpoint data. Likewise, the checkpoint task execution unit B19 discards the old checkpoint data in checkpoint file B21 and treats the new checkpoint data as the old checkpoint data. If no acknowledgment is returned, the old checkpoint data remains stored as it is.

[0007] In restart processing, the restart task execution unit A15 retrieves the new checkpoint data stored in checkpoint file A16 and issues a restart request to the restart task execution unit B20 of processing device B17 so that it retrieves the corresponding checkpoint data. On receiving the restart request, the restart task execution unit B20 retrieves the corresponding new checkpoint data from checkpoint file B21. If the retrieval succeeds, the restart task execution unit B20 returns an acknowledgment to the restart task execution unit A15. If it fails, no acknowledgment is returned. When the acknowledgment is returned, the restart task execution unit A15 resumes task execution from the new checkpoint data. When no acknowledgment is returned, the restart task execution unit A15 resumes task execution from the old checkpoint data.

[0008] By confirming mutual synchronization in this way, both when a checkpoint is taken and at restart, synchronized checkpoint/restart processing is achieved.
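The conventional handshake of paragraphs [0005] to [0008] can be sketched in Python as follows. This is only an illustration under stated assumptions: in-memory dictionaries stand in for checkpoint files A16 and B21, the `b_succeeds` flag stands in for whether unit B19 manages to capture its checkpoint, and the names `CheckpointFile`, `take_checkpoint`, and `choose_restart_state` are hypothetical and do not appear in the cited publication.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointFile:
    """Keeps old and new checkpoint data in separate areas (hypothetical model)."""
    old: Optional[dict] = None
    new: Optional[dict] = None

    def store_new(self, state: dict) -> None:
        # New data never overwrites the old data.
        self.new = dict(state)

    def commit(self) -> None:
        # Discard the old data; the new data becomes the old data.
        self.old, self.new = self.new, None


def take_checkpoint(file_a, state_a, file_b, state_b, b_succeeds=True) -> bool:
    """Return True when B acknowledged and both sides committed."""
    file_a.store_new(state_a)
    if not b_succeeds:
        # No acknowledgment: the old data stays in place on both sides.
        return False
    file_b.store_new(state_b)
    file_a.commit()
    file_b.commit()
    return True


def choose_restart_state(file_a, file_b) -> Optional[dict]:
    """A resumes from its new data only if B can also retrieve its new data."""
    acknowledged = file_b.new is not None
    if acknowledged and file_a.new is not None:
        return file_a.new
    return file_a.old
```

For example, if a second checkpoint attempt fails on side B, `choose_restart_state` falls back to the state committed by the first attempt, which is the behavior [0007] describes for a missing acknowledgment.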

[0009]

[Problems to Be Solved by the Invention] The conventional FT computer has the following problems.
(1) While it tolerates H/W failures, it is defenseless against failures caused by the operation timing of I/O devices or by S/W errors. Moreover, if a secondary failure occurs in an H/W module that has already lost its redundancy, the system stops.
(2) There is no mechanism for detecting a task that has made an execution error. It is difficult to know up to what point the task's processing is correct and from where it is wrong, which makes recovery after a failure difficult.
(3) To avoid stopping the system even during maintenance work such as expanding or replacing H/W modules, online expansion of H/W modules, hot-swap technology, and an I/O reconfiguration capability in the supervisor are required.
(4) Ultimately, every H/W module must be multiplexed.
(5) The system has to be stopped when a supervisor or task module is replaced.

[0010] When the system is configured as a duplex system, a system stop can be avoided for the above problems by continuing operation on one side alone. However, the node-level redundancy seen in such duplex systems has the following problems.
(6) The system must be configured symmetrically from multiple computers of equal performance, which raises cost. It is of course difficult to configure a symmetric system from different models.
(7) A duplex system with a master-slave relationship must always save execution history and checkpoints in preparation for unexpected accidents. This entails a recovery time that depends on the amount of recovery data, as well as complex recovery processing.
(8) Furthermore, spatial majority logic does not hold in a duplex system, so it is unsuited to building systems that secure high reliability by giving task execution spatial redundancy.
(9) In addition, H/W diagnostic processing is generally performed with online operations stopped, so redundancy in system processing is reduced during diagnosis.

[0011] In the computer architecture disclosed in Tokuhyo-Hei 2-503122, faults are detected and the faulty part is then isolated, but
(10) recovery processing after a fault occurs is not addressed. A system using this method therefore does not guarantee permanently continuous operation. Furthermore, the failure of a single device on a node may result in the failure of the whole node. Consequently,
(11) it does not accommodate systems built from complex combinations of devices.

[0012] Because the conventional checkpoint mechanism is configured as described above, it has the following problems.
(12) There is no way to take synchronized checkpoints for inter-task communication within the same processing device.
(13) A checkpoint is taken on every communication, so the overhead is large.
(14) Because one checkpoint file is attached to each individual processing device, a restart from a checkpoint cannot be performed on another processing device.

[0013] The present invention has been made to solve the problems described above, and its object is to obtain the following functions in a multicomputer system.
(1) To obtain a function that makes task execution spatially and/or temporally redundant, so that an H/W or S/W error does not stop the system.
(2) To place on each node an execution error detection means that monitors tasks for execution errors, with the criterion for judging an execution error definable per task.
(3) After an execution error occurs in a redundantly executed task, to create a checkpoint from a task that executed normally, stop the task that made the error, and then transfer it to another node on which it can run. Subsequent task processing takes over the checkpoint data, which enables recovery with little downtime. The conventional approach uses for recovery an execution history that is recorded continuously in preparation for a sudden system stop. By contrast, this method takes a checkpoint only when a redundantly executed task makes an execution error and task execution loses redundancy, which simplifies checkpoint save/recovery processing.
(4) Maintenance work on a node is carried out by waiting for, or bringing about, the termination of the tasks running on the node and then detaching/attaching the node using the node's online degradation/expansion functions. Nodes in a system to which the invention is applied therefore do not need a hot-swap mechanism for H/W modules, an online expansion function for H/W modules, or an I/O reconfiguration capability in the supervisor.
(5) To allow the node that executes a task to be specified, realizing load distribution and role sharing among nodes, so that nodes can be incorporated into the same system even if they do not have identical functions.
(6) Tasks are submitted to the system through a programming interface based on indirect invocation. By changing a task definition that is kept separate from the program text of the task module, the node on which a task runs can be changed without recompiling the program. In addition, each node manages a state that governs the allocation of each task. Together, these functions make it possible to guarantee that a specific task is temporarily not allocated to a node, which supports S/W maintenance work.
(7) When a node suddenly disappears, to make it possible to restart on other nodes the tasks without spatial redundancy that were running on that node, which eases recovery of tasks that were not redundantly executed.
(8) After a task makes an execution error, a specific diagnostic procedure can be run on the faulty node. A redundant task that made an execution error can also run the online work in the background without taking part in voting on task execution results, so that the faulty node can be diagnosed online.
(9) To make it possible to take synchronized checkpoints even when the communication partner is on the same node.
(10) To make it possible to eliminate checkpointing overhead during normal processing.
(11) To make it possible, after a checkpoint is taken, to restart on a node different from the previous one.

[0014] By realizing the above functions, the present invention makes it possible to construct a multicomputer system that provides fault tolerance.

[0015]

[Means for Solving the Problems] The redundant task execution method according to the present invention is a redundant task execution method for a multicomputer system composed of a plurality of computers, characterized in that each computer has the following elements:
(a) a task definition unit that stores in advance information for managing the tasks executed on the computer; and
(b) a task control means that controls task execution based on the information stored in the task definition unit.
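As a rough illustration of elements (a) and (b), the Python sketch below keeps per-task settings in a definition record and lets a controller derive a placement from it. The field names (`spatial_copies`, `temporal_runs`, `allowed_nodes`) and the class names are assumptions made for this example; the text only states that the task definition unit stores information for managing tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDefinition:
    """Hypothetical task definition entry, kept separate from the task code."""
    name: str
    spatial_copies: int = 1    # assumed: copies run on distinct nodes
    temporal_runs: int = 1     # assumed: repeated runs on each node
    allowed_nodes: tuple = ()  # assumed: empty tuple means any node


class TaskController:
    """Hypothetical task control means driven only by the definitions."""

    def __init__(self, definitions):
        self.definitions = {d.name: d for d in definitions}

    def plan(self, task_name, nodes):
        """Return a list of (node, runs) pairs for the named task."""
        d = self.definitions[task_name]
        candidates = [
            n for n in nodes if not d.allowed_nodes or n in d.allowed_nodes
        ]
        if len(candidates) < d.spatial_copies:
            raise RuntimeError("not enough nodes for the requested redundancy")
        return [(n, d.temporal_runs) for n in candidates[: d.spatial_copies]]
```

Changing the `TaskDefinition` entry changes where and how often the task runs without touching the task's own code, which is the separation the invention relies on.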

[0016] The redundant task execution method according to the present invention is further characterized in that the task definition unit stores information for executing a task with temporal redundancy.

[0017] The redundant task execution method according to the present invention is further characterized in that the task definition unit stores information for executing a task with spatial redundancy.
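The difference between temporal and spatial redundancy can be shown with a small Python sketch. It is a simplification under assumptions: each node is modeled as a plain callable, execution is sequential rather than truly parallel, and the function names are hypothetical.

```python
from typing import Callable, Dict, List


def run_temporally(task: Callable[[int], int], arg: int, runs: int) -> List[int]:
    """Temporal redundancy: repeat the same task several times on one node."""
    return [task(arg) for _ in range(runs)]


def run_spatially(node_tasks: Dict[str, Callable[[int], int]], arg: int) -> Dict[str, int]:
    """Spatial redundancy: run one copy of the task on each node."""
    return {node: task(arg) for node, task in node_tasks.items()}
```

Either form yields several results for the same input, which can then be compared to decide whether an execution error has occurred.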

[0018] The redundant task execution method according to the present invention is further characterized in that the computer further comprises an execution error detection means that detects an error occurring during task execution and notifies the task control means of the detected error; the task definition unit defines in advance the control to be performed when an error is detected; and the task control means, based on the notified content, controls task execution in accordance with the definitions in the task definition unit.

[0019] The redundant task execution method according to the present invention is further characterized in that the task control means comprises a task redundancy execution means that, when the execution error detection means reports an error in a task running on a particular computer, restarts the task so as to maintain the task's redundancy.

[0020] The redundant task execution method according to the present invention is further characterized in that the redundancy execution means stops execution of the task on the computer where the error occurred in the running task and restarts the task on another computer.
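The relocation step of [0019] and [0020] might look like the following Python sketch. It assumes a placement is simply a list of node names and that any node not already running a copy is an acceptable spare; `relocate_failed_copy` is a hypothetical name, and the real selection policy is not specified in this part of the text.

```python
def relocate_failed_copy(placement, failed_node, all_nodes):
    """Return a new placement with the failed node's copy moved to a spare node.

    The number of copies is unchanged, so the task keeps its redundancy.
    """
    if failed_node not in placement:
        raise ValueError("task has no copy on the failed node")
    # Assumed policy: first node that does not already hold a copy.
    spares = [n for n in all_nodes if n not in placement]
    if not spares:
        raise RuntimeError("no other computer available")
    return [spares[0] if n == failed_node else n for n in placement]
```

For a task running on `n1` and `n2` in a three-node system, an error on `n2` moves that copy to `n3` while the copy on `n1` continues.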

[0021] The redundant task execution method according to the present invention is further characterized in that the computer is composed of redundant hardware, and the task control means comprises a task transfer means that, when a failure occurs in the hardware while the computer is executing a task that has no spatial redundancy, stops the task that is being processed normally on the computer and restarts it on another computer.

[0022] The redundant task execution method according to the present invention is further characterized in that it further comprises an external storage unit that stores the information needed to resume a task partway through its execution; and the task control means comprises an information collecting means that collects, from a task being processed normally, the information needed to resume the task partway and stores the collected information in the external storage unit, and a re-execution means that retrieves that information from the external storage unit and re-executes the task from partway based on the retrieved information.
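A minimal Python sketch of the collect-and-resume idea in [0022] follows. It assumes a directory reachable from every node stands in for the external storage unit, JSON is the storage format, and the task is a simple running sum whose resumable state is `total` and `next_index`. All of these, including the function names, are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def collect(store: Path, task_name: str, state: dict) -> None:
    """Information collecting means (sketch): save resumable state."""
    tmp = store / f"{task_name}.json.tmp"
    tmp.write_text(json.dumps(state))
    # Rename over the final name so readers never see a half-written file.
    tmp.replace(store / f"{task_name}.json")


def resume_sum(store: Path, task_name: str, data: list) -> int:
    """Re-execution means (sketch): continue a running sum from saved state."""
    state = json.loads((store / f"{task_name}.json").read_text())
    total, start = state["total"], state["next_index"]
    for value in data[start:]:
        total += value
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        shared = Path(d)
        collect(shared, "sum", {"total": 3, "next_index": 2})
        print(resume_sum(shared, "sum", [1, 2, 3, 4]))
```

Because the state lives in storage shared by all nodes rather than on the node that produced it, any other node can call `resume_sum` and pick up where the original left off.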

[0023] The redundant task execution method according to the present invention is further characterized in that it further comprises a plurality of input/output devices, and the task control means comprises an allocation prohibition means that, when the execution error detection means detects an error in a running task, prohibits allocating to that computer any other task that uses an input/output device the erroring task was scheduled to use.
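The allocation prohibition of [0023] can be pictured as a per-computer block list of input/output devices. The Python sketch below is one possible reading; the class `AllocationTable` and its methods are hypothetical names, and how a block is later lifted is left out because this part of the text does not describe it.

```python
class AllocationTable:
    """Hypothetical record of which I/O devices are suspect on which computer."""

    def __init__(self):
        self._blocked = {}  # computer name -> set of suspect device names

    def on_task_error(self, computer, devices_planned):
        """Record the devices the erroring task was scheduled to use."""
        self._blocked.setdefault(computer, set()).update(devices_planned)

    def may_allocate(self, computer, task_devices):
        """A task may run on the computer only if it uses no suspect device."""
        return not (self._blocked.get(computer, set()) & set(task_devices))
```

Tasks that do not touch the suspect device can still be allocated to the same computer, and tasks that need the device can still run on other computers.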

[0024] The redundant task execution method according to the present invention is further characterized in that the task definition unit comprises a diagnostic procedure that diagnoses which of the input/output devices has the error, and the task control means comprises an online diagnosis means that executes the diagnostic procedure.

[0025] The redundant task execution method according to the present invention is further characterized in that the external storage unit further stores modules, and the task control means comprises a task maintenance means that guarantees that a specific task is not allocated to a specific computer for a specific period, retrieves a module stored in the external storage unit, and replaces the specific task with the retrieved module.

[0026] The redundant task execution method according to the present invention is further characterized in that the task control means comprises an online degradation means that stops a task being processed normally on a computer, detaches the computer from the system, and restarts the task on another computer.

[0027] The redundant task execution method according to the present invention is further characterized in that the task control means further comprises an online expansion means that reintroduces the detached computer into the system.

[0028] The redundant task execution method according to the present invention is further characterized in that the plurality of computers further comprise a communication means for exchanging messages between computers and an event notification unit that detects that a computer engaged in communication has disappeared and notifies the task control means of the disappearance; and the task control means comprises a message management means that, based on the notification from the event notification unit, prevents messages from being lost when a computer disappears.

[0029] The redundant task execution method according to the present invention is further characterized in that the task control means further comprises a task resubmission means that prevents a task from being lost when a computer disappears.

[0030] The redundant task execution method according to the present invention is further characterized in that the information collecting means collects the information needed to resume a task partway when the task is performing any of the following: communication processing of inter-task communication, reception processing of inter-task communication, response-wait processing in inter-task communication, or task input/output processing.

[0031] The redundant task execution method according to the present invention is further characterized in that, when the partner in inter-task communication is on another computer, the information collecting means sends a signal to the information collecting means of that other computer and causes it to collect the information needed to resume the task partway.

[0032]

[Operation] In the redundant task execution method according to the first invention, the task control means controls task execution based on the information stored in the task definition unit. Therefore, if information for ensuring a task's reliability is defined in the task definition unit, the task control means can refer to it and execute the task redundantly.

[0033] In the redundant task execution method according to the second invention, the task definition unit stores information for executing a task with temporal redundancy. By referring to the task definition unit, the task control means can therefore give the running task temporal redundancy.

[0034] In the redundant task execution method according to the present invention, the task definition unit stores information for executing a task with spatial redundancy. By referring to the task definition unit, the task control means can therefore give the running task spatial redundancy.

[0035] In the redundant task execution method according to the present invention, the execution error detection means judges, based on the execution results of the redundantly executed task, whether or not an error has occurred and, if so, notifies the task control means of the error. On receiving the error notification, the task control means controls task execution according to the definitions in the task definition unit. Consequently, even if an error occurs while a redundantly executed task is running, the task control means can readily take the action defined for error detection.
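One way to picture the judgment in [0035] is a strict majority vote over the results of the redundant copies. The Python sketch below assumes this voting rule and assumes results are hashable values; the text itself leaves the judgment criterion definable per task, so `detect_execution_errors` is only a hypothetical example of such a criterion.

```python
from collections import Counter


def detect_execution_errors(results):
    """Return (majority_value, nodes_in_error) for a {node: result} mapping.

    If no value has a strict majority, return (None, all nodes) because
    no copy can be trusted under this (assumed) voting rule.
    """
    if not results:
        raise ValueError("no execution results to compare")
    value, votes = Counter(results.values()).most_common(1)[0]
    if votes * 2 <= len(results):
        return None, sorted(results)
    return value, sorted(n for n, r in results.items() if r != value)
```

The list of nodes in error is what would be passed on to the task control means, which then applies whatever handling the task definition prescribes.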

[0036] In the redundant task execution method according to the present invention, the task redundancy execution means restarts the task so as to maintain the task's redundancy. Therefore, even when the execution error detection means notifies the task control means of a task execution error, execution of the task can continue without loss of redundancy.

[0037] In the redundant task execution method of the present invention, the redundancy execution means stops the task on the computer where the error occurred and restarts it on another computer. For example, if the task error was caused by an input/output device connected to that computer, restarting the task on a different computer lowers the probability that an error involving the same input/output device will recur.

[0038] In the redundant task execution method of the present invention, the re-execution means retrieves from the external storage unit the information needed to resume a task partway through, and re-executes the task from that point on the basis of the retrieved information. Consequently, even when the task is restarted on another computer, the results obtained up to the point of the error are reliably carried over.

[0039] In the redundant task execution method of the present invention, when an error occurs during execution of a task, the allocation prohibiting means prohibits other tasks that use the input/output device which the failed task was using, or was scheduled to use, from being allocated to the computer on which the failed task was running. Consequently, if the task error was related to an input/output device, tasks that use the same device are not allocated to that computer, which improves the availability of the system.
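The allocation prohibition described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the `banned` table, the function names, the node names, and the representation of devices as bit positions in an integer are all hypothetical.

```python
from collections import defaultdict

# Hypothetical table: node name -> bitmask of devices suspected faulty on that node.
banned = defaultdict(int)

def record_task_error(node: str, task_devices: int) -> None:
    """After a task fails on `node`, ban every device it used or planned to use."""
    banned[node] |= task_devices

def may_allocate(node: str, task_devices: int) -> bool:
    """A task may be allocated to `node` only if it shares no device with the banned set."""
    return (banned[node] & task_devices) == 0

# Assumed bit assignments, for illustration only.
DISK0 = 1 << 0
DISK1 = 1 << 1

# A task that used DISK1 failed on node-a.
record_task_error("node-a", DISK1)
```

With this state, a task that needs `DISK1` is refused on `node-a` but can still be placed on any other node, while tasks that do not touch `DISK1` may continue to use `node-a`.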

[0040] In the redundant task execution method of the present invention, the online diagnostic means diagnoses which of the input/output devices that make up a computer is producing the error. The faulty input/output device can therefore be identified easily.

[0041] In the redundant task execution method of the present invention, the task maintenance means guarantees that a specific task is not allocated to a specific computer for a specific period, retrieves a module stored in the external storage unit, and replaces the specific task with the retrieved module. Consequently, when a module is upgraded, for example, only the specific task can easily be replaced with the new module without stopping the whole system.

[0042] In the redundant task execution method of the present invention, the online degeneration means stops a task that is running normally on a computer, disconnects that computer from the system, and restarts the stopped task on another computer. Consequently, when maintenance is performed on a computer, for example, another computer can take over the processing without the running task being halted, and the computer to be maintained can be disconnected from the system.

[0043] In the redundant task execution method of the present invention, the online expansion means returns a disconnected computer to the system. A computer that was disconnected from the system by the online degeneration means can therefore be reconnected easily.

[0044] In the redundant task execution method of the present invention, the event notification unit detects that a computer has disappeared and notifies the task control means. The message management means then prevents messages that were being exchanged from being lost when the computer disappears. A system that uses the redundant task execution method of the present invention can therefore guarantee safe operation.

[0045] In the redundant task execution method of the present invention, the task resubmission means prevents tasks from being lost when a computer disappears. Safe operation of a system that uses the method can therefore be guaranteed.

[0046] In the redundant task execution method of the present invention, the information collecting means collects the information needed to resume a task partway through when the task is in the middle of inter-task communication processing, or when it performs any of the following: transmission processing for inter-task communication, reception processing for inter-task communication, response-wait processing in inter-task communication, or input/output processing. Consequently, even if execution of the task is interrupted, the task can easily be re-executed from partway through on the basis of the collected information.

[0047] In the redundant task execution method of the present invention, even when the partner in inter-task communication is on another computer, the information collecting means sends a signal to the information collecting means of that computer so that the information needed to resume the task is collected. Consequently, when an error occurs in one of several spatially redundant tasks and its execution is interrupted, for example, the information needed for resumption can easily be collected from the copies running on the other computers, and a stable redundant task execution method can be realized.

【００４８】[0048]

DESCRIPTION OF THE PREFERRED EMBODIMENTS Each node of a multi-computer system according to the present invention contains: a task control means that controls execution of tasks on the node; an execution error detection means that compares task execution results and notifies the task control means; an event notification unit that notifies the task control means of events occurring in the system; and an I/F unit for user tasks, which is the means of communication between user tasks and the task control means.

[0049] The task control means has the function of starting tasks within its own node, and it keeps track of every task it has started so that it always knows whether the task exists. That is, the task control means has a task termination detection mechanism and can identify the tasks currently running. Termination of a task on the own node may be detected by the task control means itself or reported by the execution error detection means or the event notification unit. In either case, whenever a task is created or terminates on the own node, the task control means notifies the task control means on the other nodes so that the state of running tasks is kept consistent across all nodes. The task control means on any one node therefore knows the tasks running on every node. Messages between task control means are exchanged one-to-one with the task control means on another node, using the reliable communication function that the supervisor on each node provides at a high layer.
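The state synchronization described in this paragraph can be illustrated with a small in-memory sketch. Everything here is an assumption made for the example: the `TaskControl` class, its method names, and the use of direct method calls as a stand-in for the reliable one-to-one messaging that the supervisor would provide.

```python
class TaskControl:
    """Hypothetical per-node task controller that mirrors the running tasks of every node."""

    def __init__(self, node):
        self.node = node
        self.peers = []        # task controllers on the other nodes
        self.running = set()   # (task_id, node) pairs known to be running

    def connect(self, controllers):
        self.peers = [c for c in controllers if c is not self]

    def _apply(self, event, task_id, node):
        # Update the local view of the system-wide task table.
        if event == "start":
            self.running.add((task_id, node))
        else:
            self.running.discard((task_id, node))

    def _announce(self, event, task_id):
        # Apply locally, then notify each peer individually (one-to-one).
        self._apply(event, task_id, self.node)
        for peer in self.peers:
            peer._apply(event, task_id, self.node)

    def start_task(self, task_id):
        self._announce("start", task_id)

    def end_task(self, task_id):
        self._announce("end", task_id)


a, b, c = TaskControl("a"), TaskControl("b"), TaskControl("c")
for controller in (a, b, c):
    controller.connect([a, b, c])

a.start_task("t1")   # first copy of t1 on node a
b.start_task("t1")   # second copy on node b
a.end_task("t1")     # the copy on node a terminates
```

After these three events every controller holds the same table, so any node can answer the question "which copies of t1 are still running, and where?" without asking its peers.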

[0050] The task control means also allocates submitted tasks to nodes of the system. Each node has a task allocation state, which reflects conditions such as the occurrence of faults on the node and is managed by the task control means. Each time the state of the own node changes, the new state is notified to the task control means on the other nodes so that all nodes agree. The task control means decides the node to which a task is allocated on the basis of the dynamically changing node states and the task definition, which describes the attributes of the task. The task definition can be changed while the system is running and is managed so that all nodes see the same contents.

[0051] As described above, the task control means on each node has a redundancy management function that provides spatial and temporal redundancy for task execution in order to ensure task reliability. Because it knows the tasks running on all nodes, the task allocation state of every node, and the attributes of the tasks submitted to the system, the task control means that receives a task submission request from a user task on its own node can itself determine the most suitable node for the task.

[0052] The execution error detection means detects a task whose execution is erroneous, basically by voting on task execution results in software (S/W). It operates, for example, according to a voting algorithm in which the error criteria have been registered in advance. The execution error detection means has the function of notifying the task control means of the result of comparing the task execution results.
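As a rough illustration of software voting on execution results, the sketch below applies a strict-majority rule. The patent leaves the error criteria to a pre-registered, task-specific algorithm, so exact equality of results, the function name `vote`, and its return shape are all assumptions made for this example.

```python
from collections import Counter

def vote(results):
    """Vote on results given as {node: result}.

    Returns (majority_value, faulty_nodes). If no value has a strict
    majority, returns (None, all nodes) because no copy can be trusted.
    """
    counts = Counter(results.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(results):
        return None, sorted(results)
    faulty = sorted(node for node, r in results.items() if r != value)
    return value, faulty
```

For three copies of a task, `vote({"a": 42, "b": 42, "c": 41})` singles out node `c` as the one whose result disagrees with the majority, which is the information the task control means would need in order to stop and restart that copy.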

[0053] The event notification unit has the function of detecting events such as the disappearance of another node or a hardware (H/W) abnormality on the own node and reporting them to the task control means. The kinds of abnormality that can occur on the own node and the means of detecting them depend on how the node is implemented, so they are not described in detail in the embodiments.

[0054] The I/F unit for user tasks has the function of instructing the task control means to submit a task, change a task definition, or change the state of a node, and it is provided in a form that can be incorporated into a user task.

[0055] In the present invention, a checkpoint is acquired as follows. The task control means issues a checkpoint collection request, and the task that receives the request continues processing until it reaches an inter-task communication or input/output instruction. When such an instruction is reached, one of the following is performed.

[0056] When inter-task communication (transmission or reception) processing is reached, the checkpoint data is collected and stored in the shared external storage device, and the task is stopped before the communication processing.

[0057] When inter-task communication (rendezvous) processing is reached, the checkpoint data is collected and stored in the shared external storage device, and the task is stopped before the rendezvous processing.

[0058] When input/output processing is reached, the checkpoint data is collected and stored in the shared external storage device, and the task is stopped before the input/output processing. If a checkpoint collection request is received during communication processing, the checkpoint data at that moment is collected and the task is stopped. Furthermore, when the partner in inter-task communication is on another node, a checkpoint collection signal is sent to the partner task. The node components described above are common to all of the embodiments.
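The checkpoint rule described above (keep running until the next communication or input/output step, save the state, and stop before that step) can be sketched as follows. This is a simplified model under stated assumptions: a task is represented as a list of `(kind, function)` steps, a JSON file in a temporary directory stands in for the shared external storage device, and the names `run_task` and `resume` are hypothetical.

```python
import json
import os
import tempfile

def run_task(steps, state, checkpoint_requested, store_path):
    """Run steps in order; return the index where the task stopped, or None if it finished.

    Each step is (kind, func) with kind in {"compute", "comm", "io"}.
    """
    for i, (kind, func) in enumerate(steps):
        if checkpoint_requested() and kind in ("comm", "io"):
            # Save the state before the communication or I/O step, then stop.
            with open(store_path, "w") as f:
                json.dump({"next_step": i, "state": state}, f)
            return i
        func(state)
    return None

def resume(steps, store_path):
    """Reload the checkpoint and continue from the step that was about to run."""
    with open(store_path) as f:
        saved = json.load(f)
    state = saved["state"]
    for _kind, func in steps[saved["next_step"]:]:
        func(state)
    return state

steps = [
    ("compute", lambda s: s.update(x=s["x"] + 1)),
    ("compute", lambda s: s.update(x=s["x"] * 10)),
    ("io", lambda s: s.update(written=True)),
]
state = {"x": 0}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# A checkpoint request is pending from the start; the task stops before the I/O step.
stopped_at = run_task(steps, state, lambda: True, path)
resumed_state = resume(steps, path)
```

Stopping before the communication or I/O step means the saved state never reflects a half-completed exchange, so a copy restarted on another node can simply perform that step itself.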

[0059] Embodiment 1. Embodiment 1 describes the task control means of the present invention. FIG. 1 shows a multi-computer system to which an embodiment of the present invention is applied. Reference numeral 1 is a block diagram of the portion that implements the redundant task execution method of the present invention in a node (a computer, hereinafter called a "node") that is a component of the multi-computer system. Each node has a Front End Processor 2 that performs I/O control and controls I/O devices 4 and the like connected to an I/O network 3. The nodes share a secondary storage device 5, which is an external storage device and is available to the tasks running on the nodes. The task control means 6 on a node is implemented as a task running on the supervisor and exchanges messages with the task control means on other nodes by means of the inter-node communication function 7. The I/F unit 8 for user tasks can be incorporated into a user task; using the intra-node communication means, it sends the task control means on the own node messages that start a task, change the definitions held in the task definition unit 11, or change the state of the node. The event notification unit 9 and the execution error detection means 10 on a node are, like the task control means 6, each implemented as a task running on the supervisor. They have the function of notifying the task control means 6 of events, and they may be split into several tasks according to the kind of information notified.

[0060] In this system, the task definition shown in FIG. 2 is made for each task that is executed and managed. The task definitions are stored in the task definition unit 11 and managed by the task control means so that there is exactly one consistent definition across all nodes. The members of a task definition have the following meanings.

Task ID 101 is the identifier that distinguishes the task while the system is running.

Task processing type 102 specifies how the task control means 6 treats the task when execution on a node has to be abandoned, for example because the node has failed. The system classifies tasks as natural termination type, forced termination type, or forced termination and restart type. When the executing node is disconnected from the system, a task of each type is, respectively, left alone, forcibly terminated, or forcibly terminated and then restarted on another node.

Spatial redundancy 103 specifies the number of copies of the task submitted to the system when a start request is made. Spatial redundancy means executing one task on several computers.

Spatial redundancy during degraded operation 104 specifies the spatial redundancy of the task when the number of nodes has decreased, that is, the spatial redundancy required when the number of nodes falls below the threshold.

Preferred execution nodes 105 explicitly specifies, in priority order, the nodes to which the task is to be allocated. If no allocation node is to be specified explicitly, a node name that tells the task control means 6 so (for example "DONTMIND") is written, and the task is allocated according to load balancing.

Number of repeated executions on a node 106 specifies the temporal redundancy of task execution. Temporal redundancy means processing one task several times on one computer. In temporally redundant execution, each time one run of the task finishes, the same task is submitted to the same node again until the number of repeated executions 106 is reached.

Signal number for notification of own-node failure 107 specifies the number of the signal that the task control means 6 sends to the task when the task is to be brought to termination because a failure has occurred on its node.

Signal number for checkpoint save notification 108 specifies the number of the signal that the task control means 6 sends, at the moment task redundancy is lost, to a task that is running normally in order to instruct it to save a checkpoint.

A task registers, as needed, the processing to be performed after it receives signal 107 or 108, both of which it can detect. As a concrete example, when the system is implemented with UNIX as the OS (supervisor), a task program is provided with a signal detection mechanism, and the processing to run when a signal is detected can be defined as a function in the task program. In other words, the processing to be applied to a detected signal can be written as part of the program and carried out at a convenient point after the signal is detected. The program uses a supervisor call to register what to do (which function to execute) for which kind of signal. In the UNIX world, the part of a program that runs on receipt of a signal is called a signal handler or a signal-catching function.

Device bitmap 109 is bitmap data that registers the devices used by the task. FIG. 3 shows an example. According to FIG. 3, the device bitmap holds 32 bits of device connection information: bit 0 is the fixed disk and bit 1 is expansion disk 1, and a bit value of "1" indicates that the device is connected while "0" indicates that it is not. By associating the bits so that each device in the system is uniquely identified, the failure of one device can be prevented from becoming the failure of a whole node, because the task control means invalidates, throughout the system, only the execution paths of tasks that use the failed device.

Command name used at task start 110 gives the procedure by which the task control means asks the supervisor to start the task.

Procedure on task execution error 111 gives the procedure that the task control means asks the supervisor to execute when an execution error occurs in the task.

Procedure used for task execution error detection 112 is the start procedure of the task that detects execution errors of this task; it is started on the same node when this task is started.

Task grouping expression 113 indicates a group of tasks that should be executed on the same node. There is one grouping expression 113 per group, and it is definition information separate from members 101 (task ID) through 112 (procedure used for task execution error detection).
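To make the shape of a task definition concrete, the sketch below models members 101 through 112 as a Python dataclass. The field names, the types, the example values, and the `threshold` parameter (the node count below which degraded operation applies) are assumptions for illustration; the patent states that definitions are exchanged as character-string data but does not give a concrete format.

```python
from dataclasses import dataclass
from enum import Enum

class ProcessingType(Enum):
    NATURAL_END = "natural termination"
    FORCED_END = "forced termination"
    FORCED_END_RESTART = "forced termination and restart"

@dataclass
class TaskDefinition:
    task_id: str                       # 101
    processing_type: ProcessingType    # 102
    spatial_redundancy: int            # 103
    degraded_spatial_redundancy: int   # 104
    preferred_nodes: list              # 105, ["DONTMIND"] means no preference
    repeat_count: int                  # 106, temporal redundancy
    node_failure_signal: int           # 107
    checkpoint_signal: int             # 108
    device_bitmap: int                 # 109, 32 bits, bit 0 = fixed disk
    start_command: str                 # 110
    on_error_procedure: str            # 111
    error_detector_procedure: str      # 112

    def uses_device(self, bit: int) -> bool:
        """True if the bit for the given device is set in the device bitmap."""
        if not 0 <= bit < 32:
            raise ValueError("device bit must be in 0..31")
        return bool((self.device_bitmap >> bit) & 1)

    def copies_required(self, live_nodes: int, threshold: int) -> int:
        """Spatial redundancy to apply for the current number of live nodes."""
        if live_nodes < threshold:
            return self.degraded_spatial_redundancy
        return self.spatial_redundancy

# Example values are invented for illustration.
example = TaskDefinition(
    task_id="t1",
    processing_type=ProcessingType.FORCED_END_RESTART,
    spatial_redundancy=3,
    degraded_spatial_redundancy=2,
    preferred_nodes=["DONTMIND"],
    repeat_count=1,
    node_failure_signal=10,
    checkpoint_signal=12,
    device_bitmap=0b01,
    start_command="run_t1",
    on_error_procedure="recover_t1",
    error_detector_procedure="vote_t1",
)
```

Grouping expression 113 is deliberately left out of the class because the text describes it as definition information separate from members 101 through 112.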

[0061] The task definitions above can be added or deleted in registration units of one task or one grouping expression. This is done by using the I/F unit 8 for user tasks to send the task definition, which is character-string data, to the task control means 6 as a message. The task control means 6 that receives a task definition change message on its own node performs the change and then notifies the task control means 6 on the other nodes of the change by message.

[0062] When a task is submitted to the system through the I/F unit 8 for user tasks, the local task control means 6 that receives the submission request decides the node to which the task is allocated according to the contents of the task definition unit 11, the state of the node resources to which tasks can be allocated, and the tasks running on all nodes. If the allocation node is another node, the task control means 6 uses the inter-node communication function to send a task start request message to the task control means 6 on that node.

[0063] A task control means 6 that has started a task on its own node notifies the task control means 6 on the other nodes of the task start event by message. When the task terminates, it likewise notifies them of the task termination event. This allows the task control means 6 on every node to know the tasks running on all nodes. In this embodiment, termination of a task on the own node is detected by a signal sent by the supervisor.

[0064] The operation of the task control means 6 in Embodiment 1 is described in detail below with reference to the flowcharts of FIG. 4 and FIG. 5. First, the initialization performed when a node starts, shown in FIG. 4, is described.

Step 121 notifies the other nodes that the own node has started. The newly started node sends its own start time as additional information. In this system, the node that started earliest is responsible for distributing the task definitions (step 122) when another node starts and for reconfiguring tasks (step 818, described in a later embodiment) when another node disappears.

Step 122 obtains from another node the task definitions, which are managed so that there is exactly one consistent set in the system. In this embodiment, among the nodes that received the start message of step 121, the task control means 6 on the node that started earliest sends the task definitions to the new node.

In step 123, the task control means 6 of the newly started node asks the other nodes for their task allocation states. In step 124, it receives those states as the replies to step 123. In step 125, it asks the other nodes for information on their running tasks. In step 126, it registers the information on tasks running on other nodes in the task management mechanism of its own node. In step 127, it reports the task allocation state of its own node to the task control means 6 on the other nodes.

[0065] When this initialization finishes, the newly started node has obtained the task definitions, the tasks running on the other nodes, and the task allocation states of the other nodes, and the nodes that were already running have obtained the task allocation state of the new node. The newly started node then begins to function as one node of the multi-computer system.
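The start-up sequence of steps 121 through 127 can be sketched in a few lines. This is a simplified single-process model under stated assumptions: the dictionary layout of `cluster`, the function names, the `"ready"` state string, and the tie-break by node name when two start times are equal are all hypothetical and do not come from the patent.

```python
def definition_sender(start_times):
    """Pick the running node with the earliest start time (ties broken by name)."""
    return min(start_times, key=lambda n: (start_times[n], n))

def join(new_node, start_time, cluster):
    """Model a node joining; `cluster` maps node name -> that node's information."""
    # Steps 121-122: announce the start; the earliest-started node sends the definitions.
    sender = definition_sender({n: c["start_time"] for n, c in cluster.items()})
    view = {
        "definitions": dict(cluster[sender]["definitions"]),
        # Steps 123-124: collect the task allocation state of every other node.
        "alloc_states": {n: c["alloc_state"] for n, c in cluster.items()},
        # Steps 125-126: collect and register the tasks running on other nodes.
        "tasks": {n: list(c["tasks"]) for n, c in cluster.items()},
    }
    # Step 127: publish the new node's own allocation state to the others.
    cluster[new_node] = {
        "start_time": start_time,
        "definitions": dict(view["definitions"]),
        "alloc_state": "ready",
        "tasks": [],
    }
    return sender, view

cluster = {
    "n1": {"start_time": 100, "definitions": {"t1": "..."}, "alloc_state": "ok", "tasks": ["t1"]},
    "n2": {"start_time": 50, "definitions": {"t1": "..."}, "alloc_state": "ok", "tasks": []},
}
sender, view = join("n3", 200, cluster)
```

Choosing the sender by start time needs no election protocol, since every node already receives each newcomer's start time in step 121.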

【００６６】次に、本システムにおけるタスクの投入処
理を、図５に示した処理の流れ図を用いて説明する。ス
テップ１３１では、システム上に新たにタスクを投入す
る処理を示している。システムへのタスクの投入は、ユ
ーザタスクとのＩ／Ｆ部８を用いて行う。ユーザタスク
とのＩ／Ｆ部８は、ユーザタスクから呼び出されると、
ローカルノード上のタスク制御手段６に対して、タスク
起動メッセージを発信する。ステップ１３２では、タス
ク制御手段６における、タスク割り当て処理を示してい
る。タスク割り当て処理では、タスク定義と、タスクを
割り当てるための各ノードの状態と、全ノードで実行中
のタスクの情報に応じて、タスクを割り当てるノードを
決定する。ステップ１３３では、タスクを割り当てるノ
ードが、自ノードであるか他ノードであるかで処理を分
岐している。ステップ１３４では、自ノード上のスーパ
ーバイザに、タスク起動を要請する処理を示している。
Next, the task submission processing in this system is described with reference to the flowchart shown in FIG. 5. Step 131 is the process of submitting a new task to the system. A task is submitted through the user task I/F unit 8: when a user task calls the I/F unit 8, a task activation message is sent to the task control means 6 on the local node. Step 132 is the task assignment processing in the task control means 6, which determines the node to which the task is assigned from the task definition, the assignment-related state of each node, and information on the tasks being executed on all nodes. Step 133 branches depending on whether the assigned node is the own node or another node. Step 134 requests the supervisor on the own node to start the task. Step 135 registers information on the task started on the own node in the task management mechanism of the task control means 6. Step 136 transmits that information to the other nodes. Step 137, the other branch from step 133, assigns the task to another node by issuing a task activation request to the task control means 6 on that node. In the task assignment processing of the task control means 6, a task activation requested on the local node through the user task I/F unit 8 goes through the assignment-node selection process, whereas an activation request received from the task control means 6 on another node is executed immediately. This is because each node holds the information needed for task assignment and performs assignment autonomously. Step 138 branches depending on whether the number of tasks submitted to the system satisfies the number described in the task definition.
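
The branch structure of steps 132 to 138 can be sketched as follows. This is only an illustrative sketch in Python, not the implementation described in the patent: the names `NodeState`, `choose_node`, and `submit_task`, and the least-loaded selection rule, are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class NodeState:
    name: str
    accepts_tasks: bool = True                    # node-level assignment state
    running: list = field(default_factory=list)   # tasks started on this node

def choose_node(nodes, task):
    """Step 132 (assumed rule): least-loaded node that accepts tasks
    and does not already run an instance of this task."""
    candidates = [n for n in nodes if n.accepts_tasks and task not in n.running]
    if not candidates:
        raise RuntimeError("no node can accept the task")
    return min(candidates, key=lambda n: len(n.running))

def submit_task(local, nodes, task, copies):
    """Steps 132-138: place `copies` instances, locally or by remote request."""
    log = []
    for _ in range(copies):                       # step 138: until the defined count is met
        target = choose_node(nodes, task)
        # step 133: own node or another node?
        kind = "local-start" if target is local else "remote-request"
        target.running.append(task)               # steps 134-135, or step 137
        log.append((kind, target.name))
    return log
```

With three nodes of which one refuses new tasks, `submit_task(a, [a, b, c], "t1", 2)` places one instance on the local node and requests the second on the remaining eligible node.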

[0067] Step 139 is the process of obtaining information on a task started on another node. Because it is invoked after a message received from the task control means 6 on another node has been analyzed, it is asynchronous with respect to the task submission sequence. Step 140 is the process of loading the information on the task started on the other node into the task management mechanism of the task control means 6.

[0068] As described above, using the task definition, the assignment-related state of all nodes, and information on the tasks being executed on all nodes, the task control means 6 can respond to a task activation request issued from the local node through the user task I/F unit 8 by redundantly assigning the task across all nodes.

[0069] Embodiment 2. Embodiment 2 describes the mechanism for detecting task execution errors. The execution error detection means 10 in this system is implemented as a task program and, by voting on the execution results of the task executed on each node, can determine which task on which node produced an error. Because what counts as an error differs from task to task, and because nodes of different architectures may be incorporated into the same system, the execution error detection means 10 is written for each task to be executed. The execution error detection means 10 also has a function of reporting the result of the vote to the task control means 6 on its own node through the user task I/F unit 8.

[0070] The operation of the execution error detection means is described in detail below with reference to the flowchart shown in FIG. 6. Step 135 is the process in which the task control means 6 starts the execution task on a node, and is the same as step 135 of FIG. 5 described in Embodiment 1. Step 201 is the process in which the task control means 6 starts the execution error detection means 10 on the same node as the execution task submitted in step 135. Starting the execution error detection means 10 means executing the procedure indicated by 112 in the task definition of FIG. 2. Step 202 is the process in which the execution error detection means 10 collects the execution results of the tasks running on the system, compares them, and reports the comparison result to the task control means 6. How the execution results are collected depends on how each task provides them and is outside the scope of the present invention. The comparison result is sent as a message to the task control means 6 through the user task I/F unit 8. Step 203 branches depending on whether the execution task on the own node has finished. For tasks whose execution results must be checked for validity partway through processing, step 202 is repeated while the task is running.
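
The comparison in step 202 can be illustrated with a simple majority vote. The text leaves the comparison criteria to each task, so the exact-equality rule, the function name `vote`, and the `{node: result}` input shape below are assumptions made only for this sketch.

```python
from collections import Counter

def vote(results):
    """Majority vote over a non-empty {node: result} mapping.

    Returns (majority_value, nodes_in_error). If no value is held by a
    strict majority, returns (None, all nodes), since no instance can be
    trusted. Results must be hashable (assumption of this sketch).
    """
    counts = Counter(results.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(results):                 # no strict majority
        return None, sorted(results)
    return value, sorted(node for node, r in results.items() if r != value)
```

For example, with results from three nodes of which two agree, the dissenting node is reported as the one in error; with only two disagreeing nodes the vote cannot decide.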

[0071] As described above, starting an execution error detection means dedicated to the task together with the task itself makes it easy to monitor task-specific operating states.

[0072] Embodiment 3. Embodiment 3 describes how a task that has caused an execution error is handled. As shown in Embodiment 2, the result of comparing the execution results of redundantly executed tasks is reported to the task control means 6 by the execution error detection means 10 on each node. On receiving the comparison result, the task control means 6 treats a task (instance) that processed normally differently from a task (instance) that caused an execution error. To a task (instance) that processed normally, the task control means 6 sends a checkpoint-save notification. A checkpoint save secures the information needed to re-execute the task. To a task (instance) that caused an execution error, the redundant execution means 6a of the task control means 6 sends a notification prompting it to terminate. In this embodiment, signals registered in advance in the task definition are used to deliver these notifications: signal number 108 in the task definition of FIG. 4, used for checkpoint-save notification, is sent to the task that processed normally, and signal number 107, used for notification of a failure of the own node, is sent to the task that caused the execution error. The checkpoint data of a task is written as needed to a secondary storage device shared among the nodes (the shared secondary storage device 5 in FIG. 1) or the like, so that a task restarted later can read it back and continue processing. The task control means 6 on the faulty node, which was running the task (instance) that processed abnormally, transitions its own node to a state that prohibits new assignment of this kind of task and notifies the other nodes of the transition. Under this state transition, other tasks that use the devices used by the faulty task may also fail on this node, so their assignment is prohibited as well. Furthermore, because the redundancy of the task has decreased, the task control means 6 on the faulty node uses the task redundant execution means 6a to submit an additional instance of the task to the system. Thereafter, the task (instance) that had been processing normally and the newly submitted task (instance) resume execution from the checkpoint set by the task (instance) that had been processing normally.

[0073] The operation of the execution error detection means 10 and of the task redundant execution means 6a of the task control means 6 is described in detail below with reference to the flowcharts shown in FIG. 7 and FIG. 8. First, using the flowchart of FIG. 7, the stopping of a task that has caused an execution error and the checkpoint acquisition processing, carried out by the redundant execution means 6a and the information collection means 6j of the task control means 6, are described. Step 301 receives the comparison result of the task execution results from the execution error detection means 10. Step 302 analyzes the message obtained from the execution error detection means 10 and branches depending on whether an execution result error has occurred. Step 303 branches depending on whether the task in which the error occurred was running on the own node. Step 304 stops the task that caused the execution error by sending it signal number 107, which is registered in the task definition for notifying a failure of the own node. Step 305 uses the assignment prohibition means 6c of the task control means 6 to transition the state of the own node so that tasks performing the same kind of processing as the erroneous task are not newly assigned, and notifies the task control means 6 on the other nodes of this transition. Whether a task performs the same kind of processing as the erroneous task is determined from the device bitmap 109 in the task definition, because other tasks that use the devices used by the erroneous task may also cause execution errors on this node. The subsequent processing is asynchronous with the sequence performed since the comparison result was received, and continues from the start of procedure 2 in the flowchart of FIG. 8.
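
The decision made in steps 302 to 306 can be summarized in a few lines. The signal numbers 107 and 108 come from the text; everything else in this sketch (the function name `handle_comparison`, its arguments, and representing the node's prohibited devices as an integer bitmap) is an assumption made for illustration.

```python
SIG_NODE_FAILURE = 107   # signal in the task definition: own-node failure, terminate
SIG_CHECKPOINT = 108     # signal in the task definition: save a checkpoint

def handle_comparison(own_node, faulty_nodes, task_device_bitmap, prohibited):
    """Return (signal_for_local_instance, new_prohibited_bitmap).

    own_node:           name of the node running this handler
    faulty_nodes:       nodes whose instance failed the comparison
    task_device_bitmap: devices used by the task (device bitmap 109)
    prohibited:         devices whose users may not be assigned to this node
    """
    if not faulty_nodes:                         # step 302: no error reported
        return None, prohibited
    if own_node in faulty_nodes:                 # step 303: the error is local
        # step 304: stop the instance; step 305: prohibit tasks sharing its devices
        return SIG_NODE_FAILURE, prohibited | task_device_bitmap
    return SIG_CHECKPOINT, prohibited            # step 306: healthy local instance
```

The sketch omits the notification of other nodes in step 305 and the asynchronous continuation into the procedures of FIG. 8.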

[0074] Step 306 handles the case in which the task that caused an execution error on another node is also being executed with spatial redundancy on the own node and the local instance processed normally. The information collection means 6j of the task control means 6 sends signal number 108, described in the task definition for checkpoint-save notification, to the task running on the own node, which has lost its redundancy, instructing it to save a checkpoint. The subsequent processing is asynchronous with the sequence performed since the execution error was reported, and continues from the start of procedure 3 in the flowchart of FIG. 8.

[0075] The redundancy recovery processing for a task that has lost its redundancy is described below with reference to the flowchart shown in FIG. 8. First, procedure 2 in FIG. 8 is described. Step 310 is invoked after the termination of a task on another node has become known, and branches depending on whether the task that was running on the other node has finished setting its checkpoint data. In this embodiment, a task that has performed a checkpoint save terminates, and the task control means 6 detects that event. Step 311 notifies the other nodes that the task forcibly terminated in step 304 has ended. Step 312 uses the task redundant execution means 6a to restore the task redundancy lost because of the erroneous task. In this embodiment, the task that caused the execution error has already been terminated, so the task redundant execution means 6a notifies the task control means 6 of the node that is to re-execute the task. On receiving the notification, that task control means 6 uses the re-execution means 6k to submit an additional task to the system.

[0076] Next, procedure 3 in FIG. 8 is described. Step 320 is a process of the task control means 6 on the node that was running the task that processed normally; it is invoked after the termination of the task on the node that was running the erroneous task has become known. It branches depending on whether the task that caused the execution error has terminated. Step 321 makes the task that had been processing normally continue its processing. In this embodiment, the task that had been processing normally on the own node was made to set a checkpoint and then terminated, so the re-execution means 6k resubmits the task to the system.

[0077] Through the above processing, when an error occurs in a redundantly executed task, the task that processed normally is made to save a checkpoint and stop, the task that caused the error is stopped, and the system then re-executes the task, with its redundancy secured, from the checkpoint set by the task that processed normally. A task that lost redundancy because of an execution error therefore recovers its original redundancy and can continue processing.
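
The checkpoint hand-over through shared secondary storage can be sketched as below. The text does not specify a checkpoint format or a write protocol, so the JSON encoding, the `<task>.ckpt` file name, the write-then-rename step, and the function names are all assumptions of this sketch; `shared_dir` stands in for a directory on the shared secondary storage device 5.

```python
import json
import os
import tempfile

def save_checkpoint(shared_dir, task, state):
    """Write the task state to shared storage.

    The data is written to a temporary file and then renamed, so that a
    restarted instance never reads a partially written checkpoint.
    """
    path = os.path.join(shared_dir, f"{task}.ckpt")
    fd, tmp = tempfile.mkstemp(dir=shared_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)        # replaces any older checkpoint in one step
    return path

def resume(shared_dir, task, initial):
    """Return the saved state, or `initial` if the task has no checkpoint."""
    path = os.path.join(shared_dir, f"{task}.ckpt")
    if not os.path.exists(path):
        return initial
    with open(path) as f:
        return json.load(f)
```

A resubmitted instance would call `resume` at start-up; falling back to `initial` corresponds to the case where the task is simply executed again from the beginning.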

[0078] In this embodiment the faulty task is terminated, but if the checkpoint data of the task that processed normally can be reliably passed to the erroneous task, suspending and resuming it is sufficient. Also, in the above description the task that had been processing normally and the newly submitted task both resume from the checkpoint set by the former, but the newly submitted task may instead be executed again from the beginning. In that case, however, the task that had been processing normally and the newly submitted task run out of step with each other. Furthermore, both tasks may be executed again from the beginning. In that case, no checkpoint save is needed, and the results produced partway through execution must be discarded.

[0079] Embodiment 4. Embodiment 4 describes the processing for moving a running task that is not spatially redundant to a different node and continuing its execution there. Consider, for example, a node whose H/W is redundant and in which a component of the redundant H/W has failed. On such a node the H/W failure is masked, so its effect has not reached task execution. However, on a node without an online H/W repair function, the H/W remains without redundancy, and a failure that affects the tasks running on the node can be expected sooner or later. If the task has no spatial redundancy in this situation, the data needed to take over processing cannot be obtained from another node by the method of Embodiment 3, which is a problem. The task migration means 6b is therefore needed: it makes the task running on the own node set a checkpoint and stop, restarts the task on another node, and has it take over the checkpoint data.

[0080] Whether to run a task without spatial redundancy is partly at the system designer's discretion, but when the task control means 6 is running a task without spatial redundancy because nodes in the system have progressively degraded, this task migration processing is essential. In this embodiment, the checkpoint-save notification to a task uses a signal, registered in the task definition described in Embodiment 1, that the task can detect. The task to be migrated is sent signal number 108, used for checkpoint-save notification in the task definition shown in FIG. 2, so that it sets a checkpoint and then stops. The same task is then resubmitted to the system.

[0081] The operation of the task migration means 6b of the task control means 6 is described in detail below with reference to the flowchart shown in FIG. 9. Step 401 represents the occurrence, on the own node, of an event that jeopardizes the continued execution of a task without spatial redundancy, for example a permanent failure in a processor element made redundant within the node. The occurrence of a permanent failure or the like in redundant H/W, detected by the event notification unit 9, is conveyed to the task control means 6 through the user task I/F unit 8. Step 402 transitions the state of the own node so that no new tasks can be assigned to it and notifies the task control means 6 on the other nodes of this transition. Step 403 branches depending on whether the processing type of the task in question is the forced-termination-and-restart type. In step 404, the task migration means 6b instructs the information collection means 6j to send the task signal number 108, described in the task definition for checkpoint-save notification. On receiving this signal, the task writes its checkpoint to the secondary storage device 5 shared among the nodes and stops. In this embodiment, a task that has set a checkpoint terminates. Step 405 follows step 404 but is executed when the termination of the checkpointed task is detected, so it is not contiguous with the preceding sequence. In step 405 of this embodiment, after the task control means 6 learns from the supervisor's task termination notification that the checkpoint has been set, the task migration means 6b instructs the resubmission means 6i to resubmit the same task to the system. Because of step 402, this resubmission never assigns the task to the own node. The task is migrated in this way.
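
The order of operations in steps 402 to 405 can be sketched as follows. The sketch collapses signalling, termination detection, and resubmission into one synchronous function, whereas the text describes step 405 as running later, when the task's termination is detected. The dictionary layout of a node, the `checkpoints` store standing in for the shared secondary storage device 5, and the first-eligible-node rule are assumptions made for illustration.

```python
def migrate(own, nodes, task, checkpoints):
    """Move one non-redundant, restartable task off `own`.

    own / nodes: dicts with keys "name", "accepts_tasks", "running"
                 ("running" maps a task name to its in-memory state)
    checkpoints: stand-in for shared secondary storage
    Returns the name of the node that now runs the task.
    """
    own["accepts_tasks"] = False                    # step 402: refuse new tasks
    # step 404: signal 108 -> the task saves its state and terminates
    checkpoints[task] = own["running"].pop(task)
    # step 405: resubmit; step 402 guarantees the own node is not chosen
    target = next(n for n in nodes if n["accepts_tasks"])
    target["running"][task] = checkpoints[task]     # restart from the checkpoint
    return target["name"]
```

If no other node accepts tasks, `next` raises `StopIteration`; a real system would need a policy for that case, which the text does not describe.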

[0082] Step 406 branches depending on whether the processing type of the task is forced termination. A task that is not of the forced-termination type is a natural-termination task and is left alone. A natural-termination task is one that is expected to finish by itself before a secondary failure occurs. Step 407 sends signal number 107, described in the task definition for notification of a failure of the own node, to forcibly terminate the task. In this embodiment the restart of the task is left to the task control means 6, but the task itself may, after receiving signal number 107 and completing its post-processing, resubmit itself to the system through the user task I/F unit 8. Also, in the above description the task is resubmitted on another node and resumes from the checkpoint, but it may instead be re-executed from the beginning. In that case, no checkpoint save is needed, and the results of execution up to that point are discarded.

[0083] As described above, on a node running a task without spatial redundancy, after a failure occurs in redundant H/W, the running task is notified to save a checkpoint and is stopped, and the same task is restarted on another node so that it can execute from the checkpoint. Execution of the task can therefore continue without being interrupted by an H/W failure or the like.

[0084] Embodiment 5. Embodiment 5 describes online diagnosis in this system. The system uses nodes as the resource for securing the spatial redundancy of each task, but not all nodes in the system have the same devices. In addition, to increase system availability, a failure of a device attached to one node must not become a failure of the whole node. A function that diagnoses whether the devices of a node are usable is therefore required.

[0085] The operation of the online diagnosis means is described in detail below with reference to the flowchart shown in FIG. 10. In FIG. 10, the recovery processing of steps 301 to 305, performed to secure task redundancy after a task running on the own node has caused an execution error, is the same as in Embodiment 3 and is not described again here. This embodiment describes the diagnostic processing carried out on the faulty node after the redundancy recovery processing. Step 501 is the process in which the task control means 6 executes a diagnostic procedure after a task execution error has occurred. Executing this diagnostic procedure means executing procedure 111, described in the task definition and used for detecting execution errors of the task. For example, step 501 may resubmit the task (instance) that caused the execution error on another node. If reports from the execution error detection means then confirm that normal execution continues, the earlier execution error of the task (instance) is regarded as transient, the own node is transitioned back to a state in which this kind of task can be assigned to it, and the other nodes are notified of the transition.

[0086] In this system, after a task has caused an execution error, the devices used by the task are suspected of having failed, and the assignment prohibition means 6c of the task control means 6 can prevent tasks that use these devices from being newly assigned to the node. The faulty part can be identified by the diagnostic procedure that the online diagnosis means 6d of the task control means 6 executes after the task has produced an error. Executing the diagnostic procedure makes it possible to distinguish whether the execution error was caused by a transient or a permanent failure of a device.

[0087] To register the devices used by a task, the system designer numbers the devices so that each is unique in the system and records them in the device bitmap 109 of the task definition using the bit values '0' and '1', as shown in FIG. 3. In FIG. 3, the system defines in advance bit 0 as the fixed disk, bit 1 as expansion disk 1, bit 2 as expansion disk 2, bit 3 as expansion disk 3, and bit 4 as the printer, and the bit values '0' and '1' represent whether each device is connected or not connected. The task control means 6 holds, for each node, a device bitmap indicating the devices that have failed, and refers to it during task assignment processing.
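
The bit layout of FIG. 3 maps naturally onto a flag type. In the sketch below the bit positions follow the text, but the names `Device` and `can_assign`, and the convention that a set bit means "the task uses this device" (or, for a node, "this device has failed"), are assumptions made here; the text does not state which bit value denotes which case.

```python
from enum import IntFlag

class Device(IntFlag):
    FIXED_DISK = 1 << 0        # bit 0
    EXPANSION_DISK_1 = 1 << 1  # bit 1
    EXPANSION_DISK_2 = 1 << 2  # bit 2
    EXPANSION_DISK_3 = 1 << 3  # bit 3
    PRINTER = 1 << 4           # bit 4

def can_assign(task_devices, failed_on_node):
    """A task may be assigned to a node only if none of the devices it
    uses is marked as failed in that node's bitmap."""
    return not (task_devices & failed_on_node)
```

A task that uses the fixed disk and the printer can still be assigned to a node whose only failed device is expansion disk 1, but not to a node whose printer has failed.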

[0088] Step 502 is the process in which the result of the diagnostic procedure executed in step 501 is reported to the task control means 6 through the user task I/F unit 8. Step 503 branches depending on whether the diagnosis result indicates a transient failure. Step 504 is performed when the earlier failure is judged to have been transient: it corrects the own-node state changed in step 305 (the bitmap data of failed devices) and conveys the correction to the other nodes.
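
Steps 503 and 504 amount to clearing the suspected bits again when the diagnosis reports a transient failure. The sketch below assumes, as a convention not stated in the text, that the node's failed-device bitmap is an integer in which a set bit marks a failed device; the function name `apply_diagnosis` is likewise hypothetical.

```python
def apply_diagnosis(failed_bitmap, suspect_bitmap, transient):
    """Return the node's failed-device bitmap after diagnosis.

    failed_bitmap:  devices currently marked as failed on the node
    suspect_bitmap: devices marked in step 305 for the erroneous task
    transient:      True if the diagnostic procedure found a transient failure
    """
    if transient:                                  # step 503
        return failed_bitmap & ~suspect_bitmap     # step 504: lift the prohibition
    return failed_bitmap                           # permanent failure: keep it
```

The corrected bitmap would then be sent to the other nodes so that their assignment decisions take it into account.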

[0089] As described above, after a task has caused an execution error, assignment of tasks that will use the same input/output devices as the erroneous task is prohibited on the node where the error occurred, and a diagnostic procedure is then executed, which makes it possible to automate online diagnosis of the devices attached to the system. The diagnosis result can also be reflected in the task assignment processing.

[0090] Embodiment 6. Embodiment 6 describes a method of replacing a task module using the function of the task maintenance means for changing a task's designated execution destination. The task control means 6 on a node holds a state indicating whether new tasks can be assigned to each node, and information for judging, task by task, whether a given task can be assigned (the device bitmap 109 described in Embodiments 1 and 5). In this system, a task module installed on a node is replaced only while the task to be replaced is prevented from running on the work node.

【００９１】以下に、タスクのメンテナンス手段の作
用、動作の詳細な説明を図１１に示す処理の流れ図を用
いて説明する。ステップ６０１では、入れ換え作業を行
うタスクの属性が、実行先ノードの明示指定型かどうか
で処理を変えている。実行先ノードが明示指定型かどう
かは、タスク定義中の優先実行先ノードの指定があるか
どうかで判断する。指定がある場合を明示指定型とす
る。ステップ６０２では、ユーザタスクとのＩ／Ｆ部８
を用いて、タスク定義を変更する処理を示している。１
ノード上で実行したタスク定義の変更は、タスク制御手
段６によって全ノードに通知される。ステップ６０２で
の処理により、タスクモジュールの入れ換え作業を行う
ノードにおいて、入れ換えを行うタスクが新たに割り当
てられなくなる。これは、タスクが実行されるノードを
明示的に指定するタイプのタスクに有効である。The operation of the task maintenance means is described in detail below with reference to the flow chart in FIG. 11. Step 601 branches depending on whether the task to be replaced has the attribute of explicitly designating its execution node. A task is of the explicit designation type if its task definition specifies a preferred execution node. Step 602 changes the task definition using the user task I/F unit 8. A change to a task definition made on one node is notified to all nodes by the task control means 6. As a result of step 602, the task to be replaced is no longer newly assigned to the node on which the task module is being replaced. This is effective for tasks of the type that explicitly designates the node on which the task runs.

【００９２】ステップ６０３では、タスクモジュールの
入れ換えを行う作業ノードにおいて、入れ換えの対称と
なるタスクが実行中かどうかで処理を分岐している。ス
テップ６０４では、入れ換えの対象となるタスクが実行
先ノードの明示指定型かどうかで処理を変えている。ス
テップ６０５では、ユーザタスクとのＩ／Ｆ部８の機能
を用いてノードの状態を遷移させ、自ノード上で実行し
ているタスクを終了に導く処理を示している。遷移させ
たノードの状態は、タスク制御手段６によって全ノード
に通知される。ステップ６０５での処理によって、実行
先を明示的に指定しないタイプのタスクも、タスクモジ
ュールの入れ換え作業を行うノードに新たに割り当てら
れなくなる。この処理では、更に、ノード上でタスクの
実行を継続するのが危ぶまれる障害が起きた場合と同様
の状態を作り、既に実行中の冗長化されていないタスク
を他ノードに移送する。尚、タスク移送処理のメカニズ
ムは、上記実施例４に記述済である。Step 603 branches depending on whether the task to be replaced is currently running on the work node where the task module is to be replaced. Step 604 branches depending on whether the task to be replaced is of the explicit designation type. Step 605 uses the user task I/F unit 8 to change the state of the node and bring the tasks running on the own node to termination. The changed node state is notified to all nodes by the task control means 6. As a result of step 605, tasks of the type that does not explicitly designate an execution destination are also no longer newly assigned to the node where the replacement is carried out. This step additionally creates the same state as when a failure that endangers continued task execution has occurred on the node, so that non-redundant tasks already running are transferred to other nodes. The task transfer mechanism is described in the fourth embodiment.

【００９３】ステップ６０６では、タスクモジュールの
入れ換え作業を行うノード上で入れ換えの対象となるタ
スクが動作していないことを確認して、タスクのメンテ
ナンス手段６ｅによりタスクモジュールの入れ換え作業
を行う。ステップ６０７では、タスクを割り当てるため
のノードの状態を、必要に応じて元に戻している。これ
は、ステップ６０５において、状態を変化させている場
合に必要な操作である。このノードの状態遷移は、全ノ
ードに対して通知される。ステップ６０８では、タスク
定義を必要に応じて元に戻している。これは、ステップ
６０２において、タスク定義を変化させている場合に必
要な操作である。このタスク定義の修復は、全ノードに
通知される。In step 606, after confirming that the task to be replaced is not running on the node where the replacement is performed, the task maintenance means 6e replaces the task module. Step 607 restores, if necessary, the node state that governs task assignment. This is required when the state was changed in step 605. This node state transition is notified to all nodes. Step 608 restores the task definition if necessary. This is required when the task definition was changed in step 602. The restored task definition is notified to all nodes.
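
The order of steps 601 to 608 can be illustrated with a small sketch. FIG. 11 is not reproduced in this text, so the branch structure below (step 605 runs when the task is running on the work node or is not of the explicit designation type) is an assumption inferred from the description, and the names Node, TaskDef, replace_module and the migrate callback are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    accepts_new_tasks: bool = True                 # node state checked at assignment time
    running: set = field(default_factory=set)      # names of tasks running on this node
    modules: dict = field(default_factory=dict)    # task name -> installed module version

@dataclass
class TaskDef:
    name: str
    preferred_nodes: list = field(default_factory=list)  # non-empty: explicit designation type

def replace_module(node, task, new_version, migrate):
    """Run steps 601-608 for one task on one node and return the steps taken."""
    steps = []
    explicit = bool(task.preferred_nodes)                      # step 601
    saved_preferred = list(task.preferred_nodes)
    if explicit:                                               # step 602: edit the definition
        task.preferred_nodes = [n for n in saved_preferred if n != node.name]
        steps.append(602)
    running_here = task.name in node.running                   # step 603
    state_changed = False
    if running_here or not explicit:                           # step 604 covers the idle case
        node.accepts_new_tasks = False                         # step 605: change node state
        for name in sorted(node.running):
            migrate(node, name)                                # hypothetical transfer hook
        state_changed = True
        steps.append(605)
    if task.name in node.running:                              # step 606 precondition
        raise RuntimeError("task still running on work node")
    node.modules[task.name] = new_version                      # step 606: swap the module
    steps.append(606)
    if state_changed:                                          # step 607: restore node state
        node.accepts_new_tasks = True
        steps.append(607)
    if explicit:                                               # step 608: restore definition
        task.preferred_nodes = saved_preferred
        steps.append(608)
    return steps
```

In this sketch the notification of each change to all nodes is left out; only the local ordering and the "restore only what was changed" rule of steps 607 and 608 are shown.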

【００９４】以上の方法のように、タスクの実行先指定
変更によって、特定ノードに特定タスクを特定の期間割
り当てないことを保証し、ノード上にインストール済み
のタスクモジュールを、安全に、かつ、容易に入れ換え
ることが可能である。As described above, changing a task's execution-destination designation guarantees that a specific task is not assigned to a specific node for a specific period, so that a task module already installed on a node can be replaced safely and easily.

【００９５】実施例７．実施例７では、オンラインの縮
退手段及びオンラインの拡張手段による、活線挿抜機能
のないＨ／Ｗモジュールの交換やスーパーバイザの入れ
換え作業等、ノードを停止せざるを得ないメンテナンス
作業の手順を説明する。本発明を適用したシステムで
は、ノードを停止する際に、ノード上で実行中の処理を
他ノードに引き継がせるため、システムとしては連続処
理が可能である。Example 7. The seventh embodiment describes the procedure, using the online degeneration means and the online expansion means, for maintenance work that requires stopping a node, such as replacing an H/W module that has no hot-swap function or replacing the supervisor. In a system to which the present invention is applied, the processing running on a node is handed over to other nodes when the node is stopped, so the system as a whole can continue processing.

【００９６】以下に、オンラインの縮退手段及びオンラ
インの拡張手段について、作用、動作の詳細の説明を図
１２に示す処理の流れ図を用いて説明する。ステップ７
０１は、自ノードに新たにタスクを割り当てられないよ
うにするための処理と、既に割り当てられているタスク
を終了に導く処理を示している。ここでの状態遷移は、
タスク制御手段６により全ノードに通知され、メンテナ
ンス作業を行うノードに対して、新たにタスクが割り当
てられなくなる。また、ステップ７０１の処理では同時
に、自ノード上でタスクの継続実行が危ぶまれる障害が
起きたのと同様の状態を作り、既に実行中の冗長度のな
いタスクを他ノードに移送する。タスク移送のメカニズ
ムは、上記実施例４に記述済である。ステップ７０２で
は、メンテナンス作業を行うノードをオンライン縮退手
段６ｆにより、マルチコンピュータシステムから取り除
く処理を示している。ここでの処理は、自ノード上で実
行中のタスクが全て終了した時点で、自ノードのタスク
制御手段６が自ノードが正常にシャットダウンすること
を、他ノード上のタスク制御手段６に伝える。ステップ
７０３で、メンテナンス作業を実施する。ステップ７０
４では、メンテナンス作業が終了したノードをオンライ
ン拡張手段６ｇにより、元のマルチコンピュータシステ
ムに組み込む処理を示している。ステップ７０４の処理
は、例えば、ノードのリセット処理における一連処理と
なるように起動スクリプトを作成して登録しておくこと
で簡易化できる。The operation of the online degeneration means and the online expansion means is described in detail below with reference to the flow chart in FIG. 12. Step 701 prevents new tasks from being assigned to the own node and brings already-assigned tasks to termination. The state transition here is notified to all nodes by the task control means 6, so that no new task is assigned to the node undergoing maintenance. At the same time, step 701 creates the same state as when a failure that endangers continued task execution has occurred on the own node, so that tasks without redundancy that are already running are transferred to other nodes. The task transfer mechanism is described in the fourth embodiment. Step 702 removes the node to be maintained from the multi-computer system using the online degeneration means 6f. Here, once all tasks running on the own node have terminated, the task control means 6 of the own node informs the task control means 6 on the other nodes that the own node is shutting down normally. Step 703 carries out the maintenance work. Step 704 uses the online expansion means 6g to reincorporate the node into the original multi-computer system after the maintenance work is complete. Step 704 can be simplified by, for example, creating and registering a startup script so that it runs as part of the node's reset processing.
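
As an illustration only, the sequence of steps 701, 702 and 704 can be modelled with a toy membership table. The class Cluster and its methods are hypothetical, every task is treated as non-redundant (so all of them are moved), and round-robin placement stands in for the real assignment policy, which this passage does not specify.

```python
class Cluster:
    """Toy model of membership and task placement; every name is hypothetical."""

    def __init__(self, names):
        self.members = set(names)
        self.assignable = {n: True for n in names}   # may new tasks be assigned?
        self.tasks = {n: [] for n in names}          # node -> tasks running there

    def drain(self, name):
        """Step 701: block new assignments, then move running tasks elsewhere."""
        self.assignable[name] = False
        targets = sorted(n for n in self.members
                         if n != name and self.assignable[n])
        if not targets:
            raise RuntimeError("no node left to take over the tasks")
        for i, task in enumerate(self.tasks[name]):
            self.tasks[targets[i % len(targets)]].append(task)
        self.tasks[name] = []

    def leave(self, name):
        """Step 702: leave the system only once nothing runs on the node."""
        if self.tasks[name]:
            raise RuntimeError("tasks still running")
        self.members.discard(name)

    def join(self, name):
        """Step 704: rejoin after maintenance and accept tasks again."""
        self.members.add(name)
        self.tasks.setdefault(name, [])
        self.assignable[name] = True
```

A maintenance session would then call drain, leave, perform the work of step 703 offline, and finally call join.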

【００９７】以上のように、ノードをシステム上から切
り離した後に、ノードにおけるＨ／Ｗモジュールの保守
作業及びスーパーバイザの改版等の作業を行い、作業終
了後にノードをシステムに再加入することを実現する、
ノードのオンライン縮退／拡張機能を用いることによ
り、ノードの処理をシステム上の他ノードに引き継がせ
ることが可能になり、システムを動作させた状態で、シ
ステムの構成要素であるノードを停止させてメンテナン
ス作業を実施できる。As described above, the online degeneration/expansion function of a node makes it possible to disconnect the node from the system, carry out work on it such as H/W module maintenance or a supervisor revision, and rejoin the node to the system after the work is complete. Using this function, the node's processing can be handed over to other nodes in the system, so that a node that is a component of the system can be stopped for maintenance while the system keeps operating.

【００９８】実施例８．実施例８では、タスク制御手段
の処理における、ノードの異常消滅時にノード間メッセ
ージの紛失を防止するためのメッセージ管理手段と、異
常消滅ノードで実行予定／実行中であったタスクの再投
入手段について説明する。本発明におけるタスク制御手
段６のメッセージ管理手段６ｈでは、通信手段７による
メッセージの送信時と受信時の処理において特徴が有
る。ただし、タスク制御手段６間のメッセージは、相手
先指定で送信するものである。本システムでのメッセー
ジ交換は、スーパーバイザが提供する高信頼なノード間
コミュニケーション手段を用いるので、スーパーバイザ
が伝送中にデータを紛失させることないとしている。つ
まり、障害は、スーパーバイザの関与できない境遇に起
こることを想定している。本システムでノード間メッセ
ージの伝送時に起こり得る障害は、１．メッセージ送信ノードが同一メッセージ（状態合わ
せのために全てのノードに送信する、内容の同じメッセ
ージ）を、他ノードに配布している最中に送信ノードが
異常消滅する。２．メッセージ送信ノードが特定メッセージ（１つのノ
ードに特定の処理を行わすためのメッセージ）を送信し
た後、受信ノードの消滅によってメッセージに期待され
た処理を失う。ことである。Example 8. The eighth embodiment describes, within the processing of the task control means, a message management means for preventing the loss of inter-node messages when a node disappears abnormally, and a re-input means for tasks that were scheduled to run or were running on the abnormally disappeared node. The message management means 6h of the task control means 6 in the present invention is characterized by its processing when messages are sent and received through the communication means 7. Messages between task control means 6 are sent to explicitly designated destinations. Message exchange in this system uses the highly reliable inter-node communication means provided by the supervisor, so the supervisor is assumed not to lose data in transit. In other words, failures are assumed to occur in circumstances in which the supervisor cannot be involved. The failures that can occur when inter-node messages are transmitted in this system are the following. 1. A message sending node disappears abnormally while distributing the same message (a message with identical content sent to all nodes for state matching) to the other nodes. 2. After a message sending node sends a specific message (a message that makes one node perform specific processing), the processing expected of the message is lost because the receiving node disappears.

【００９９】まず、ノード間で交換するメッセージの形
式を図１３に示す。図１３に示したノード間メッセージ
の形式において、各フィールドの情報は、以下の通りで
ある。オペレーションコード８０１は、メッセージの命
令種別を表す情報を格納する。メッセージを受信したタ
スク制御手段６は、このフィールドのデータ値によって
メッセージの解析処理を変える。メッセージの連番８０
２は、メッセージ発行ノードのタスク制御手段６が、メ
ッセージ毎に採番した連番を格納する。本フィールド
は、メッセージの新旧を表すものである。ここで採番さ
れた値は、メッセージ発行元ノードが生成したメッセー
ジ毎にカウントアップした値を適用するが、発信するメ
ッセージが全ノードを対象にする場合、これらのメッセ
ージでは同番にする。メッセージの長さ８０３は、メッ
セージの総バイト数を格納する。転送メッセージの情報
８０４には、メッセージ発信元ノードが異常消滅後、他
ノードによって消滅ノードの最終メッセージが再送（代
理転送）される場合に、本来のメッセージ発信ノードを
示すための情報を格納する。ここでの情報は、本来のメ
ッセージ発信ノードを示す情報、本来のメッセージにつ
けられた連番、本来のオペレーションコードである。本
フィールドは、代理転送ノードの消滅時の代理転送デー
タの更なる代理転送に備えて、複数ノードによる転送履
歴を格納できる分確保する。オペレーションコード所望
のデータ８０５は、オペレーションコード８０１が要求
するデータであり、例えば、オペレーションコードに基
づいたメッセージ解析時に必要なデータである。メッセ
ージを生んだユーザの識別子８０６は、システムがマル
チユーザシステムであった場合に、メッセージ発行者を
特定するための情報を格納する。メッセージのボディ８
０７は、他ノード上のタスク制御手段６に伝えるデータ
であり、オペレーションコード８０１によって内容は異
なる。タスク制御手段６における、メッセージ送受信時
の処理を以下に説明する。First, FIG. 13 shows the format of the messages exchanged between nodes. The fields of the inter-node message in FIG. 13 are as follows. The operation code 801 stores information indicating the command type of the message. The task control means 6 that receives a message changes how it parses the message according to the value of this field. The message serial number 802 stores a serial number assigned to each message by the task control means 6 of the issuing node. This field indicates how recent a message is. The value is incremented for each message generated by the issuing node, but when a message is addressed to all nodes, every copy carries the same number. The message length 803 stores the total number of bytes in the message. The forwarded-message information 804 stores information identifying the original sender when, after the sending node has disappeared abnormally, another node retransmits (proxy-forwards) the disappeared node's last message. This information consists of the original sending node, the serial number given to the original message, and the original operation code. In case a proxy-forwarding node itself disappears and the forwarded data must be forwarded again, this field is sized to hold a forwarding history spanning multiple nodes. The operation-code-specific data 805 is data required by the operation code 801, for example data needed when the message is parsed according to the operation code. The identifier 806 of the user who originated the message stores information for identifying the message issuer when the system is a multi-user system. The message body 807 is the data conveyed to the task control means 6 on another node, and its content depends on the operation code 801. The processing performed by the task control means 6 when sending and receiving messages is described below.
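
The fields 801 to 807 can be pictured as a record. The following is a minimal sketch, not the layout of FIG. 13: the field types, the OP_PROXY value and the helper make_proxy are assumptions introduced for illustration, and the length field 803 is left out because the wire encoding is not given in this text.

```python
from dataclasses import dataclass, field
from typing import List

OP_PROXY = 0xFF  # hypothetical opcode marking a proxy-forwarded message

@dataclass(frozen=True)
class ForwardEntry:
    origin_node: str     # node that issued the message being forwarded
    origin_serial: int   # serial number given to that message
    origin_opcode: int   # operation code of that message

@dataclass
class NodeMessage:
    opcode: int                                                      # 801
    serial: int                                                      # 802
    forward_info: List[ForwardEntry] = field(default_factory=list)   # 804
    opcode_data: bytes = b""                                         # 805
    user_id: str = ""                                                # 806
    body: bytes = b""                                                # 807
    # 803 (total length in bytes) depends on the wire encoding and is omitted.

def make_proxy(last: NodeMessage, sender: str, proxy_serial: int) -> NodeMessage:
    """Wrap the last message received from `sender` so another node can forward it."""
    history = last.forward_info + [ForwardEntry(sender, last.serial, last.opcode)]
    return NodeMessage(OP_PROXY, proxy_serial, history,
                       last.opcode_data, last.user_id, last.body)
```

Keeping 804 as a list mirrors the statement that the field must hold a forwarding history across several nodes: forwarding an already forwarded message simply appends one more entry.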

【０１００】（ａ）メッセージ送信時新たなタスクの起動要求等のシステム状態を変化させる
メッセージは、送信手続き後に、送信先ノード毎に区別
して保存する。これらの保存メッセージは、送信先ノー
ドがメッセージを受理したことを示すＡＣＫを返してき
た後に処分する。他ノードへのタスク起動要求に対する
他ノードからのＡＣＫは、例えば、タスク起動処理の結
果で代用することが可能であり、そのメッセージの内容
は、起動したタスクの情報かタスク起動に失敗した理由
を伝えるものである。このように、システムの状態を変
化させるノード間のメッセージでは、受信ノードでの処
理の終了後に、依頼ノードにＡＣＫを返す。(a) When sending a message: A message that changes the system state, such as a request to start a new task, is saved separately for each destination node after it has been sent. A saved message is discarded after the destination node returns an ACK indicating that it has accepted the message. The ACK for a task start request sent to another node can be substituted by, for example, the result of the task start processing; that message carries either information about the started task or the reason the start failed. Thus, for inter-node messages that change the system state, the receiving node returns an ACK to the requesting node after it has finished its processing.

【０１０１】（ｂ）メッセージの受信時送信元ノードを区別して、最後に受信したメッセージを
保存しておく。ここで保存するメッセージは、ノードが
異常消滅した場合に、消滅ノードが最後に送信してきた
メッセージを他ノードに配送する状態合わせ処理に用い
る。(b) When receiving a message: The last message received is saved separately for each source node. If a node disappears abnormally, the saved message is used in the state-matching process that delivers the last message sent by the disappeared node to the other nodes.
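
Rules (a) and (b) amount to two small tables kept on every node. The sketch below is one possible reading under stated assumptions: the class MessageStore and its method names are hypothetical, messages are opaque values, and serial numbers are used as keys for the unacknowledged messages.

```python
class MessageStore:
    """Per-node bookkeeping for rules (a) and (b); all names are hypothetical."""

    def __init__(self):
        self.unacked = {}    # destination node -> {serial: message} awaiting ACK
        self.last_from = {}  # source node -> last message received from it

    def record_send(self, dest, serial, message):
        """Rule (a): keep a state-changing message until its ACK arrives."""
        self.unacked.setdefault(dest, {})[serial] = message

    def on_ack(self, dest, serial):
        """Rule (a): the destination accepted the message, so discard the copy."""
        self.unacked.get(dest, {}).pop(serial, None)

    def record_receive(self, source, message):
        """Rule (b): remember only the most recent message per source node."""
        self.last_from[source] = message

    def on_node_lost(self, node):
        """Return (message to proxy-forward, messages to reprocess) for a lost node."""
        to_forward = self.last_from.pop(node, None)
        pending = self.unacked.pop(node, {})
        to_reprocess = [pending[s] for s in sorted(pending)]
        return to_forward, to_reprocess
```

The first element of the result corresponds to the state-matching use of rule (b); the second element is the set of requests whose processing may have been lost with the destination node.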

【０１０２】本実施例での他ノードの消滅検知は、事象
通知部が検知して、タスク制御手段６に通知する。In this embodiment, the disappearance of another node is detected by the event notification unit, which notifies the task control means 6.

【０１０３】以下に、メッセージ管理手段とタスクの再
投入手段の作用、動作の詳細な説明を図１４に示す処理
の流れ図を用いて説明する。ステップ８１１では、シス
テム上のノードの消滅を検知する処理を示している。各
ノード上に配置された事象通知部９は、ノード間コミュ
ニケーション部での生存チェックによって他ノードの消
滅を検知した後、ノード消滅の事象をユーザタスクとの
Ｉ／Ｆ部８を用いてタスク制御手段６に通知する。ステ
ップ８１１ａでは、タスク制御手段６が異常消滅したノ
ードに対して、新たにタスクを割り当てないように制御
する。ステップ８１２では、消滅したノードが正常終了
したかどうかで、タスク制御手段６が処理を分岐してい
る。ノードが正常に切り離されてシャットダウンした場
合、タスク制御手段間で状態合わせをしているので、正
常終了の判別が可能である。ステップ８１３では、メッ
セージ管理手段６ｈは、異常消滅したノードが最後に送
信してきたメッセージを、異常消滅したノード（と自ノ
ード）を除いた全ノードに対して行う。メッセージの受
け手となった場合の処理では、メッセージのヘッダ部分
中の代理送信を告げる識別子とメッセージ発行ノードで
のメッセージ通番から、異常消滅ノードが自ノードにも
送り付けてくる予定であった未処理メッセージのみを、
選択的に取り出すことが可能になっている。タスク制御
手段６におけるメッセージ送信が、スーパーバイザによ
って提供される高信頼な相手先指定方法で行なわれる一
方で、全ノードに対するメッセージの送信処理途中にメ
ッセージ送信ノードが消滅した場合、システム上のノー
ド間で状態の不一致が生じる。本処理は、この不一致を
補正するために行う。この補正処理は、タスク制御手段
６の処理が受信済みメッセージによる処理を可逆的に無
効化するのではなく、メッセージ未受信ノードの状態を
受信済みノードと同じ状態に合わせるといったポリシィ
によるものである。The operation of the message management means and the task re-input means is described in detail below with reference to the flow chart in FIG. 14. Step 811 detects the disappearance of a node in the system. The event notification unit 9 on each node detects the disappearance of another node through the liveness check in the inter-node communication unit, and then notifies the task control means 6 of the node disappearance event using the user task I/F unit 8. In step 811a, the task control means 6 ensures that no new task is assigned to the abnormally disappeared node. In step 812, the task control means 6 branches depending on whether the disappeared node terminated normally. When a node is disconnected and shut down normally, the task control means have matched their states with one another, so a normal termination can be recognized. In step 813, the message management means 6h forwards the last message sent by the abnormally disappeared node to all nodes except the disappeared node (and the own node). A node that receives such a message can use the identifier in the message header that announces proxy transmission, together with the serial number assigned at the issuing node, to pick out only the unprocessed messages that the disappeared node had been about to send to it. Although message transmission by the task control means 6 uses the highly reliable destination-designated method provided by the supervisor, a state mismatch arises among the nodes of the system if the sending node disappears partway through sending a message to all nodes. This processing is performed to correct that mismatch. The correction follows the policy of bringing the nodes that have not received the message into the same state as the nodes that have, rather than reversibly undoing the processing already performed for the received message.

【０１０４】ステップ８１４では、メッセージ交換手段
６ｈが、他ノードが代送してきた消滅ノードの最終メッ
セージを、消滅ノードが送信してきたものとして処理し
ている。メッセージの種類とメッセージ中の８０４の転
送メッセージの情報内に埋め込まれたメッセージの連番
により、代送されたメッセージを処理するかどうかの判
断を行う。ここでは、明らかに自分向けでないメッセー
ジと、自分が既に処理したメッセージと同じか古いメッ
セージは、無視する。In step 814, the message exchanging means 6h processes the last message of the disappeared node, forwarded by another node, as if the disappeared node itself had sent it. Whether to process a forwarded message is decided from the message type and the serial number embedded in the forwarded-message information 804 of the message. Messages that are clearly not addressed to the own node, and messages that are the same as or older than one the node has already processed, are ignored.
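
The filter applied in step 814 can be sketched as a single predicate. This assumes that serial numbers only increase and never wrap around, and that each node records the highest serial it has processed per issuing node; the function and parameter names are hypothetical.

```python
def should_process_proxy(origin, origin_serial, last_processed, addressed_to_me=True):
    """Decide whether a proxy-forwarded message still needs to be processed.

    origin / origin_serial come from the forwarded-message information (804);
    last_processed maps an issuing node to the highest serial already handled.
    """
    if not addressed_to_me:
        return False                      # clearly not meant for this node
    # Same or older than what was already processed: ignore it.
    return origin_serial > last_processed.get(origin, -1)
```

Because several surviving nodes may forward the same last message, the comparison also makes repeated deliveries harmless, provided the caller updates last_processed after handling a message.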

【０１０５】ステップ８１５では、メッセージ交換手段
６ｈが、他ノードによるメッセージの代送フェーズ終了
後、ＡＣＫのないメッセージを再処理して、必要ならば
他ノードに配信する処理を表している。ここでの処理
は、消滅ノードに対してタスクの実行依頼を出していた
が、上記までの処理結果、消滅ノードでタスクを起動し
た形跡がないというものを、代わりに他ノードに割り当
てる。In step 815, after the phase in which other nodes forward messages has ended, the message exchanging means 6h reprocesses messages for which no ACK was received and, if necessary, delivers them to other nodes. Specifically, a task whose execution had been requested of the disappeared node, but for which the processing so far shows no evidence of having been started there, is assigned to another node instead.

【０１０６】ステップ８１６では、タスクの再投入手段
６ｉが、消滅したノードで実行中であった空間的冗長度
のないタスクを再構成するノードを決めている。本シス
テムでは、各ノード上のタスク制御手段６がシステム上
で実行中のタスクを把握していることから、どのノード
においてもこの再構成処理は可能である。むしろ、再構
成処理を行う１ノードを決定する調停作業が必要とな
る。本実施例では、上記実施例１のステップ１２１の処
理に示している、ノードが起動時に発行したメッセージ
中の、ノードの起動開始時刻を元に選定する。この起動
開始時刻は、ノードにローカルな時計から得た時刻を用
いて良く、再構成処理担当ノードは、生存ノードの中で
一番古い起動時刻を示すノードが選ばれる。In step 816, the task re-input means 6i decides which node will reconstruct the tasks without spatial redundancy that were running on the disappeared node. In this system, the task control means 6 on every node keeps track of the tasks running in the system, so any node could perform this reconstruction. What is needed, rather, is arbitration to decide on the single node that performs it. In this embodiment, the node is selected based on the node start time contained in the message that each node issues at startup, as described for step 121 of the first embodiment. The start time may be taken from the node's local clock, and the surviving node with the oldest start time is chosen to perform the reconstruction.
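
Since every node holds the same announced start times, each can run the same selection locally and reach the same answer. A minimal sketch follows; the tie-break on node name is an assumption, as the text does not say what happens when two start times are equal, and the function name is hypothetical.

```python
def elect_rebuilder(start_times, alive):
    """Pick the surviving node with the oldest announced start time.

    start_times maps node name -> start time announced at node startup;
    alive is the collection of nodes that are still members of the system.
    """
    candidates = [(start_times[n], n) for n in alive]
    if not candidates:
        raise ValueError("no surviving node")
    # Tuples compare by start time first, then by node name (assumed tie-break).
    return min(candidates)[1]
```

In step 817 a node would compare the result with its own name to decide whether it is the one that performs the reconstruction.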

【０１０７】ステップ８１７では、消滅ノードが実行中
であった空間的冗長度のないタスクの再構成処理を、自
ノードが行うかどうかで処理を変えている。Step 817 branches depending on whether the own node is the one that performs the reconstruction of the tasks without spatial redundancy that were running on the disappeared node.

【０１０８】ステップ８１８では、タスクの再投入手段
６ｉが、消滅したノードで実行中の空間的冗長度のない
タスクを、必要に応じて他ノード上で再起動するための
処理を行なっている。尚、上記で説明したステップ８１
１〜ステップ８１５がメッセージ管理手段６ｈによる処
理であり、ステップ８１６〜ステップ８１８がタスクの
再投入手段６ｉによる処理である。In step 818, the task re-input means 6i restarts on other nodes, as necessary, the tasks without spatial redundancy that were running on the disappeared node. Steps 811 to 815 described above are performed by the message management means 6h, and steps 816 to 818 by the task re-input means 6i.

【０１０９】以上のように、ノードの異常消滅時にノー
ド間メッセージの紛失を防止するメッセージ管理手段
と、消滅ノードで実行を予定していたタスク、もしく
は、実行中であった空間的に冗長化されていないタスク
を、再投入する再投入手段により、ノードの異常消滅時
にもノード間で状態合わせを保証し、更に、システム上
に投入済みの空間的冗長度のないタスクを失うことを防
げる。As described above, the message management means, which prevents the loss of inter-node messages when a node disappears abnormally, and the re-input means, which resubmits tasks that were scheduled to run on the disappeared node or spatially non-redundant tasks that were running there, guarantee state matching among the nodes even when a node disappears abnormally and prevent the loss of tasks without spatial redundancy that have already been submitted to the system.

【０１１０】従来より、フォールトトレラントシステム
では、障害時にデータの紛失を防ぐことが大切である。
一方、本システムのように、複数ノードで構成するフォ
ールトトレラントシステムで交換するメッセージは、内
容によって送信先に伝わらなかった場合に、捨ててしま
って良いものと、代理ノードが処理する必要が生じるも
のに分けられる。本発明では、上記１の障害に対して
は、他ノードが行う代理送信で対処し（ステップ８１
３，８１４）、２の障害に対しては送信済みメッセージ
を保持しておいて、送信先ノードが異常消滅した後にメ
ッセージが処理されていない場合に、他ノードに処理を
再度割り当て直して、システム全体では発行したメッセ
ージを失うことを防ぐ（ステップ８１５）。ここで、
“消滅ノードで実行を予定していたタスク”が正常ノー
ドに再割当される訳である。上記処理によって、障害を
起こしたノードが伝えようとしていた全メッセージと、
障害を起こしたノードに伝えようとした全メッセージが
システム全体に反映された。残りの作業は、システムの
再構成（障害ノードで実行中であったタスクの再割り当
て処理）になる。障害を起こしたノードで実行中であっ
た処理の引き継ぎとは、障害を起こしたノードで実行中
であったタスクを他ノードで実行することを意味する。
この後の回復処理は、タスク固有のものとなり、重複し
た命令を実行することを避ける必要があるならば、タス
クが正常実行中にチェックポイントを設定しておく等の
引き継ぎ処理のための工夫が必要になる。In fault-tolerant systems it has always been important to prevent data loss when a failure occurs. In a fault-tolerant system composed of multiple nodes, such as this one, messages that fail to reach their destination fall into two groups depending on their content: those that may simply be discarded, and those that a proxy node must process. In the present invention, failure 1 above is handled by proxy transmission performed by other nodes (steps 813 and 814). For failure 2, sent messages are retained, and if a message has not been processed when the destination node disappears abnormally, the processing is reassigned to another node so that the system as a whole does not lose an issued message (step 815). This is how a "task that was scheduled to run on the disappeared node" is reassigned to a healthy node. Through this processing, all messages that the failed node was trying to convey, and all messages that other nodes were trying to convey to the failed node, are reflected across the whole system. The remaining work is system reconfiguration (reassignment of the tasks that were running on the failed node). Taking over the processing that was running on the failed node means running those tasks on other nodes. The subsequent recovery is task-specific; if executing duplicate instructions must be avoided, measures for the takeover are needed, such as having the task set checkpoints during normal execution.

【０１１１】実施例９．実施例９では、本発明の情報採
取手段のチェックポイント機構の動作を図を用いて説明
する。本実施例では、上記実施例３に述べたチェックポ
イントセーブの処理を示すものである。図１５は、タス
クがチェックポイント採取要求を受けた時の処理を示す
流れ図である。タスクがタスク制御手段６の情報採取手
段６ｊから、チェックポイント採取要求を受けると、タ
スクのチェックポイント処理（１０００）を開始する。
最初に、当該タスクがチェックポイント採取要求を受け
た時に、タスク間通信を行なっていたかを調べる（１０
０１）。当該タスクがタスク間通信を行っていない時に
は、そのままタスク間通信又は入出力処理までタスクの
処理を継続する（１００２）。タスク間通信又は入出力
処理にたどりついたら、当該処理がタスク間通信（送受
信）かを調べる（１００３）。当該処理がタスク間通信
（送受信）でなければ、タスク間通信（ランデブー）か
どうかを調べる（１００４）。上記ランデブーとは、メ
ッセージを送信したタスクが送信先のタスクからの応答
待ちの状態を示している。当該処理がタスク間通信（ラ
ンデブー）でなければ、当該処理は入出力処理であるの
で、入出力でのチェックポイント処理（１０４０）を行
なう。上記１００１の処理において、タスクがタスク間
通信を行っていた時には、タスク間通信中でのチェック
ポイント処理（１０１０）を行なう。上記１００３の処
理において、到達した処理がタスク間通信（送受信）で
あった時は、タスク間通信（送受信）でのチェックポイ
ント処理（１０２０）を行なう。上記１００４の処理に
おいて、到達した処理がタスク間通信（ランデブー）の
時は、タスク間通信（ランデブー）でのチェックポイン
ト処理（１０３０）を行なう。Example 9. The ninth embodiment describes, with reference to the drawings, the operation of the checkpoint mechanism of the information collecting means of the present invention. This embodiment shows the checkpoint save processing mentioned in the third embodiment. FIG. 15 is a flow chart of the processing performed when a task receives a checkpoint collection request. When a task receives a checkpoint collection request from the information collecting means 6j of the task control means 6, it starts the task checkpoint processing (1000). First, it checks whether the task was performing inter-task communication when it received the request (1001). If the task was not performing inter-task communication, it continues its processing until it reaches inter-task communication or input/output processing (1002). When it reaches one of these, it checks whether the processing is inter-task communication (send/receive) (1003). If it is not, it checks whether the processing is inter-task communication (rendezvous) (1004). A rendezvous is the state in which a task that has sent a message is waiting for a response from the destination task. If the processing is not inter-task communication (rendezvous) either, it is input/output processing, so the checkpoint processing for input/output (1040) is performed. If the task was performing inter-task communication at 1001, the checkpoint processing during inter-task communication (1010) is performed. If the processing reached at 1003 is inter-task communication (send/receive), the checkpoint processing for inter-task communication (send/receive) (1020) is performed. If the processing reached at 1004 is inter-task communication (rendezvous), the checkpoint processing for inter-task communication (rendezvous) (1030) is performed.
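
The decision sequence 1001 to 1004 described for FIG. 15 reduces to a small dispatch function. The sketch below returns the reference numeral of the routine that would run; the enum NextOp and the function name are hypothetical, and the routines themselves are not modelled.

```python
from enum import Enum

class NextOp(Enum):
    """Kind of operation the task reaches after the request (hypothetical names)."""
    SEND_RECV = "send/receive"
    RENDEZVOUS = "rendezvous"
    IO = "input/output"

def select_checkpoint_routine(communicating_now, next_op):
    """Return the reference numeral of the checkpoint routine chosen in FIG. 15."""
    if communicating_now:                 # 1001: already in inter-task communication
        return 1010
    # 1002: the task runs on until it reaches next_op
    if next_op is NextOp.SEND_RECV:       # 1003
        return 1020
    if next_op is NextOp.RENDEZVOUS:      # 1004
        return 1030
    return 1040                           # otherwise it is input/output processing
```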

【０１１２】図１６に、タスク間通信中でのチェックポ
イント処理（１０１０）の流れ図を示す。タスク間通信
中でのチェックポイント処理では、まず、タスク間通信
の相手が同じノード上にあるかをチェックする（１０１
１）。この時、通信の相手が異なるノード上にある時
は、相手が動作しているノードのタスク管理手段６の情
報採取手段６ｊに、相手タスクのチェックポイントを採
取するようにリクエストを出す（１０１２）。次に、チ
ェックポイントの採取をリクエストされたノードの情報
採取手段６ｊは、そこでのチェックポイントを採取し
（１０１３）、採取したチェックポイントデータを共有
２次記憶装置５に保存する（１０１４）。タスク制御手
段６は、チェックポイントデータを共有２次記憶装置５
に保存した後、当該タスクの実行を停止する（１０１
５）。FIG. 16 shows a flow chart of the checkpoint processing during inter-task communication (1010). This processing first checks whether the partner of the inter-task communication is on the same node (1011). If the partner is on a different node, a request to collect a checkpoint of the partner task is issued to the information collecting means 6j of the task management means 6 of the node on which the partner is running (1012). Next, the information collecting means 6j of the node that was requested to collect the checkpoint collects the checkpoint there (1013) and saves the collected checkpoint data in the shared secondary storage device 5 (1014). After the checkpoint data has been saved in the shared secondary storage device 5, the task control means 6 stops execution of the task (1015).

【０１１３】次に、タスク間通信（送受信）に到達した
時のチェックポイント処理（１０２０）の手順について
説明を行うが、処理の手順は、図１６のタスク間通信中
でのチェックポイント処理と同じであるので、図１６の
流れ図に従い説明する。タスク間通信（送受信）に到達
した時のチェックポイント処理では、まず、タスク間通
信の相手が同じノード上にあるかをチェックする（１０
１１）。この時、通信の相手が異なるノード上にある時
は、相手が動作しているノードのタスク管理手段６の情
報採取手段６ｊに、相手タスクのチェックポイントを採
取するようにリクエストを出す（１０１２）。次に、チ
ェックポイントの採取をリクエストされたノードの情報
採取手段６ｊは、そこでのチェックポイントを採取し
（１０１３）、採取したチェックポイントデータを共有
２次記憶装置５に保存する（１０１４）。タスク制御手
段６は、チェックポイントデータを共有２次記憶装置５
に保存した後、当該タスクの実行を停止する（１０１
５）。Next, the checkpoint processing performed on reaching inter-task communication (send/receive) (1020) is described. Its procedure is the same as that of the checkpoint processing during inter-task communication in FIG. 16, so it is described with reference to the flow chart in FIG. 16. This processing first checks whether the partner of the inter-task communication is on the same node (1011). If the partner is on a different node, a request to collect a checkpoint of the partner task is issued to the information collecting means 6j of the task management means 6 of the node on which the partner is running (1012). Next, the information collecting means 6j of the node that was requested to collect the checkpoint collects the checkpoint there (1013) and saves the collected checkpoint data in the shared secondary storage device 5 (1014). After the checkpoint data has been saved in the shared secondary storage device 5, the task control means 6 stops execution of the task (1015).

【０１１４】次に、タスク間通信（ランデブー）に到達
した時のチェックポイント処理（１０３０）の手順につ
いて説明を行うが、処理の手順は、図１６のタスク間通
信中でのチェックポイント処理と同じであるので、図１
６の流れ図に従い説明する。タスク間通信（ランデブ
ー）に到達した時のチェックポイント処理では、まず、
タスク間通信の相手が同じノード上にあるかをチェック
する（１０１１）。この時、通信の相手が異なるノード
上にある時は、相手が動作しているノードのタスク管理
手段６の情報採取手段６ｊに、相手タスクのチェックポ
イントを採取するようにリクエストを出す（１０１
２）。次に、チェックポイントの採取をリクエストされ
たノードの情報採取情報６ｊは、そこでのチェックポイ
ントを採取し（１０１３）、採取したチェックポイント
データを共有２次記憶装置５に保存する（１０１４）。
タスク制御手段６は、チェックポイントデータを共有２
次記憶装置５に保存した後、当該タスクの実行を停止す
る（１０１５）。Next, the checkpoint processing performed on reaching inter-task communication (rendezvous) (1030) is described. Its procedure is the same as that of the checkpoint processing during inter-task communication in FIG. 16, so it is described with reference to the flow chart in FIG. 16. This processing first checks whether the partner of the inter-task communication is on the same node (1011). If the partner is on a different node, a request to collect a checkpoint of the partner task is issued to the information collecting means 6j of the task management means 6 of the node on which the partner is running (1012). Next, the information collecting means 6j of the node that was requested to collect the checkpoint collects the checkpoint there (1013) and saves the collected checkpoint data in the shared secondary storage device 5 (1014). After the checkpoint data has been saved in the shared secondary storage device 5, the task control means 6 stops execution of the task (1015).

【０１１５】図１７に、入出力処理に到達した時のチェ
ックポイント処理（１０４０）の流れ図を示す。入出力
処理に到達した時のチェックポイント処理では、まず、
情報採取手段６ｊにより、そこでのチェックポイントを
採取する（１０４１）とともに、採取したチェックポイ
ントデータを共有２次記憶装置５に保存する(１０４
２)。そして、タスク制御手段６は、チェックポイント
データを共有２次記憶装置５に保存した後、当該タスク
の実行を停止する（１０４３）。FIG. 17 shows a flow chart of the checkpoint processing performed on reaching input/output processing (1040). In this processing, the information collecting means 6j first collects the checkpoint at that point (1041) and saves the collected checkpoint data in the shared secondary storage device 5 (1042). Then, after the checkpoint data has been saved in the shared secondary storage device 5, the task control means 6 stops execution of the task (1043).

【０１１６】また、上記図１６の流れ図では、チェック
ポイントデータを共有２次記憶装置５に保存した後、当
該タスクの実行を停止していたが、停止せずにそのまま
実行しても構わない。この場合、場外タスクは、チェッ
クポイントセーブより途中実行させたタスクとは、非同
期になる。In the flow chart of FIG. 16 above, execution of the task is stopped after the checkpoint data has been saved in the shared secondary storage device 5, but the task may instead continue running without being stopped. In this case, the task concerned becomes asynchronous with a task that is resumed partway from the checkpoint save.

【０１１７】以上のように、この実施例では、各タスク
は、システムから、或いは、他のタスクからのチェック
ポイント採取要求を受けてから、タスク間通信処理があ
るまで処理を継続する。タスク間通信処理に到達する
と、それが送信処理である時は、チェックポイントデー
タを共有２次記憶装置５に格納し、タスクの実行を停止
する。従って、チェックポイント採取要求を受けてか
ら、新たなタスク間通信（送信）は発生しない。タスク
間通信の受信処理に到達した時は、チェックポイントデ
ータを共有２次記憶装置５に格納し、受信処理を行う。
受信処理に成功した時は、更に処理を継続する。受信処
理に失敗した時は、タスクの処理を停止する。これによ
り当該タスクは、チェックポイント採取要求を受け取っ
た時に、既に送られていたタスク間通信のデータを、漏
らさず受け取り、未解決のタスク間通信がないように
し、まだ行われていない送信処理に対する受信処理で停
止する。タスク間通信のランデブー処理に到達したとき
は、チェックポイントデータを共有２次記憶装置５に格
納し、ランデブー処理を行う。ランデブー処理に成功し
たときは、再度チェックポイントデータを共有２次記憶
装置５に格納する。そして、タスクの実行を停止する。As described above, in this embodiment, after receiving a checkpoint collection request from the system or from another task, each task continues processing until it reaches inter-task communication processing. When inter-task communication processing is reached and it is a transmission process, the checkpoint data is stored in the shared secondary storage device 5 and the execution of the task is stopped. Consequently, no new inter-task communication (transmission) occurs after the checkpoint collection request is received. When a reception process of inter-task communication is reached, the checkpoint data is stored in the shared secondary storage device 5 and the reception process is performed. If the reception succeeds, processing continues; if the reception fails, the task stops. As a result, the task receives, without loss, all inter-task communication data that had already been sent when it received the checkpoint collection request, leaves no outstanding inter-task communication, and stops at a reception process whose corresponding transmission has not yet been performed. When a rendezvous process of inter-task communication is reached, the checkpoint data is stored in the shared secondary storage device 5 and the rendezvous process is performed. If the rendezvous succeeds, the checkpoint data is stored again in the shared secondary storage device 5. The execution of the task is then stopped.
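The three cases summarized above (transmission, reception, rendezvous) can be read as a small decision procedure that is applied once a checkpoint collection request is pending. The following is a minimal sketch under assumptions: the names `Op` and `handle_after_request` are hypothetical, `save_checkpoint` is a caller-supplied function standing in for storing checkpoint data in the shared secondary storage device 5, and the behavior for a failed rendezvous (stop without a second save) is an assumption, since the text only describes the successful case.

```python
from enum import Enum


class Op(Enum):
    SEND = "send"
    RECEIVE = "receive"
    RENDEZVOUS = "rendezvous"


def handle_after_request(op, succeeded, save_checkpoint):
    """Decide what a task does at its next inter-task communication.

    Returns True if the task keeps running and False if it stops.
    `succeeded` tells whether the reception or rendezvous completed;
    it is ignored for a transmission, which is never performed.
    """
    save_checkpoint()                 # every case saves a checkpoint first
    if op is Op.SEND:
        return False                  # stop before sending: no new transmission
    if op is Op.RECEIVE:
        return succeeded              # continue on success, stop on failure
    # Rendezvous: save again after a successful rendezvous, then stop.
    if succeeded:
        save_checkpoint()
    return False
```

A task that keeps running after a successful reception would call the same function again at its next communication point, so it eventually stops at a transmission, a failed reception, or a rendezvous.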

【０１１８】また、以上のように、上記実施例１〜実施
例９によれば、この発明におけるタスクの冗長化実行方
式では、冗長化実行しているタスクに障害が起きた時点
で、正常処理をしているタスクからチェックポイントセ
ーブを得て、失った冗長度を回復させた後、チェックポ
イントからの処理を継続するので、障害による異常の切
り離しと回復の処理において優れている。本発明を適用
したシステムは、ユーザタスクとのＩ／Ｆ部８を用い
て、タスクの属性を記憶したタスク定義部１１より、制
御情報をタスク制御手段６に与えることが可能である。
更に、タスクを実行するノードの状態遷移を制御でき
る。このため、ノード上から特定タスクを取り除く処理
と、ノードに特定タスクを割り当てる処理を制御でき、
ノードのオンライン縮退／拡張が可能となる。本システ
ムでは、ノード上の事象通知部９を各ノードのＨ／Ｗ実
現方式に応じたカスタマイズを可能にしてタスク制御手
段６から分離したことで、異機種ノードの接続を容易に
している。更に、タスクの実行誤り検出手段１０をタス
ク毎に定義させたことで、システムに投入するタスクに
応じた適用が行える。As described above, according to the first to ninth embodiments, when a failure occurs in a redundantly executed task, the task redundancy execution method of the present invention obtains a checkpoint save from a task that is processing normally, restores the lost redundancy, and then continues processing from the checkpoint. The method is therefore well suited to isolating and recovering from abnormalities caused by failures. In a system to which the present invention is applied, control information can be given to the task control means 6 from the task definition unit 11, which stores the task attributes, through the I/F unit 8 with user tasks. Furthermore, the state transitions of the nodes that execute tasks can be controlled. This makes it possible to control both the process of removing a specific task from a node and the process of assigning a specific task to a node, which enables online degeneration and expansion of nodes. In the present system, the event notification unit 9 on each node is separated from the task control means 6 so that it can be customized to the H/W implementation of each node, which makes it easy to connect heterogeneous nodes. Furthermore, because the task execution error detection means 10 is defined for each task, error detection can be tailored to the tasks submitted to the system.

【０１１９】[0119]

【発明の効果】以上のように、この発明によれば、タス
ク制御手段がタスク定義部に記憶した情報に基づいて、
タスクの実行を制御する。このため、タスク定義部にタ
スクの冗長化実行を実現するような情報を記憶しておけ
ば、容易にタスク制御手段によってタスクを冗長化実行
することができる効果がある。As described above, according to the present invention, the task control means controls the execution of tasks based on the information stored in the task definition unit. Therefore, if information for realizing redundant execution of tasks is stored in the task definition unit, the task control means can easily execute tasks redundantly.

【０１２０】また、第２の発明によれば、タスク定義部
は、タスクを時間的に冗長化し、実行する情報を記憶し
ている。このため、タスク定義部を参照し、タスクの実
行を制御するタスク制御手段によって、時間的に冗長化
したタスクの実行を実現することができる効果がある。Further, according to the second invention, the task definition unit stores information for executing a task with temporal redundancy. Therefore, the task control means, which refers to the task definition unit and controls task execution, can execute temporally redundant tasks.

【０１２１】また、第３の発明によれば、タスク定義部
は、タスクを空間的に冗長化し、実行する情報を記憶し
ている。このため、タスク定義部を参照し、タスクの実
行を制御するタスク制御手段によって、空間的に冗長化
したタスクの実行を実現することができる効果がある。Further, according to the third invention, the task definition unit stores information for executing a task with spatial redundancy. Therefore, the task control means, which refers to the task definition unit and controls task execution, can execute spatially redundant tasks.
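As an illustration of how a task definition could drive redundant execution, the sketch below derives an execution plan from a definition record. Everything in it is an assumption made for illustration: the `TaskDefinition` fields and the `plan_executions` function are hypothetical, temporal redundancy is interpreted as repeated runs on one node, and spatial redundancy as runs on distinct nodes. The actual attributes stored in the task definition unit are those shown in FIG. 2 and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDefinition:
    name: str
    temporal_copies: int = 1   # hypothetical: number of repeated runs per node
    spatial_copies: int = 1    # hypothetical: number of distinct nodes used


def plan_executions(defn, nodes):
    """Return (node, run_index) pairs that a task controller would start."""
    if defn.spatial_copies > len(nodes):
        raise ValueError("not enough nodes for the requested spatial redundancy")
    return [
        (node, run)
        for node in nodes[:defn.spatial_copies]
        for run in range(defn.temporal_copies)
    ]
```

Keeping the redundancy settings in the definition record, rather than in the controller, mirrors the point made above: changing how a task is made redundant only requires changing its stored definition.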

【０１２２】また、第４の発明では、タスク制御手段
は、実行誤り検出手段によってタスクのエラー通知を受
信し、受信した内容とタスク定義部の定義内容に従い、
タスクの実行を制御する。また、上記タスク定義部に
は、エラー検知時の制御をあらかじめ定義する。このた
め、タスク実行中にエラーが発生しても、タスク制御手
段が適切な対応を行い、タスクの実行を制御するのでシ
ステムの多様性を高める効果がある。Further, in the fourth invention, the task control means receives a task error notification from the execution error detection means and controls the execution of the task according to the received content and the definitions in the task definition unit. The control to be performed when an error is detected is defined in advance in the task definition unit. Therefore, even if an error occurs during task execution, the task control means responds appropriately and controls the execution of the task, which has the effect of increasing the diversity of the system.

【０１２３】また、第５の発明によれば、タスクの冗長
化実行手段が、タスクの冗長度を確保するようにタスク
を再起動する。このため、タスクの実行中にエラーが発
生しても、タスクの冗長度は確保されるので、システム
の多様性を高める効果がある。According to the fifth invention, the task redundancy executing means restarts the task so as to secure the redundancy of the task. Therefore, even if an error occurs during execution of the task, the redundancy of the task is ensured, which is effective in increasing the diversity of the system.

【０１２４】また、第６の発明によれば、冗長化実行手
段は、他の計算機にエラーが発生したタスクを再起動す
る。このため、タスクの冗長度は確保されるので、シス
テムの多様性を高める効果がある。According to the sixth invention, the redundancy execution means restarts, on another computer, the task in which the error occurred. Therefore, the redundancy of the task is secured, which has the effect of increasing the diversity of the system.

【０１２５】また、第７の発明によれば、タスク移送手
段が、計算機を構成している冗長化されたハードウェア
に障害が発生した場合、上記計算機で正常処理中のタス
クを一度停止し、他の計算機に上記タスクを再起動す
る。このため、既に計算機上で実行中のタスクを計算機
外に安全に追い出すことが可能となり、システムの多様
性を高める効果がある。According to the seventh invention, when a failure occurs in the redundant hardware constituting a computer, the task transfer means temporarily stops the task that is being processed normally on that computer and restarts it on another computer. This makes it possible to safely move a task that is already running on a computer off that computer, which has the effect of increasing the diversity of the system.

【０１２６】また、第８の発明では、情報採取手段が、
タスクの途中実行に必要な情報を採取し、外部記憶部に
採取した情報を格納する。そして、再実行手段が上記外
部記憶部より情報を取り出し、取り出した情報に基づい
て、タスクを途中から再実行する。このため、タスクの
エラーによって、タスクの実行の中断を余儀なくされて
も再実行を容易に行うことができ、タスクの冗長化を確
実に実現することができる効果がある。Further, in the eighth invention, the information collecting means collects the information necessary for mid-execution of a task and stores the collected information in the external storage unit. The re-execution means then retrieves the information from the external storage unit and re-executes the task from the middle based on the retrieved information. Therefore, even if a task error forces the execution of the task to be interrupted, the task can easily be re-executed, and task redundancy can be reliably realized.
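The collect-then-re-execute idea can be illustrated with a toy task that saves its progress after every step and is later resumed from the saved state. This is a sketch under assumptions rather than the patent's mechanism: a local file stands in for the external storage unit, `pickle` is used only for convenience, and the function names and the `fail_at` parameter (which simulates a task error) are hypothetical.

```python
import pickle
from pathlib import Path


def save_checkpoint(path, state):
    """Store the information needed to resume the task (external storage stand-in)."""
    Path(path).write_bytes(pickle.dumps(state))


def load_checkpoint(path):
    return pickle.loads(Path(path).read_bytes())


def run_task(items, path, state=None, fail_at=None):
    """Sum `items`, saving a checkpoint after each processed item.

    Pass a previously loaded `state` to resume from the middle.
    `fail_at` simulates a task error just before processing that index.
    """
    state = dict(state) if state else {"next": 0, "total": 0}
    while state["next"] < len(items):
        if fail_at is not None and state["next"] == fail_at:
            raise RuntimeError("simulated task error")
        state["total"] += items[state["next"]]
        state["next"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

After a failure, calling `run_task(items, path, state=load_checkpoint(path))` continues from the last saved position instead of starting over.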

【０１２７】また、第９の発明では、割り当て禁止手段
が、計算機を構成している入出力装置にエラーが検知さ
れた場合、上記入出力装置を使用する他のタスクを上記
計算機へ割り当てることを禁止する。このため、再度入
出力装置によるタスクのエラーが発生することを防ぐこ
とができ、システムの多様性を高める効果がある。Further, in the ninth invention, when an error is detected in an input/output device constituting a computer, the allocation prohibition means prohibits allocating to that computer other tasks that use the input/output device. This prevents task errors caused by the input/output device from recurring, which has the effect of increasing the diversity of the system.
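One way to picture the allocation prohibition is with per-node device bitmaps, in the spirit of the device bitmap of FIG. 3. The sketch below is an assumption-laden illustration: the bit assignments, the `Allocator` class, and its method names are hypothetical, and the real bitmap layout is not reproduced here.

```python
# Hypothetical bit assignments for three input/output devices.
DISK, PRINTER, TAPE = 0b001, 0b010, 0b100


class Allocator:
    def __init__(self):
        self.prohibited = {}     # node name -> bitmap of suspect devices

    def report_error(self, node, task_devices):
        """Mark every device the failed task was using as suspect on `node`."""
        self.prohibited[node] = self.prohibited.get(node, 0) | task_devices

    def can_allocate(self, node, task_devices):
        """A task may run on `node` only if it uses no suspect device there."""
        return (self.prohibited.get(node, 0) & task_devices) == 0
```

For example, after `report_error("node1", DISK | TAPE)`, a task that needs `DISK` is refused on `node1` but can still be placed on another node, while a task that only needs `PRINTER` can still run on `node1`.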

【０１２８】また、第１０の発明によれば、オンライン
診断手段が、どの入出力装置でエラーが発生しているの
か診断を行う。このため、エラーが発生している入出力
装置を容易に特定できるので、上記割り当て禁止手段に
よって診断結果を反映したタスク割り当て処理が行うこ
とができ、システムの多様性が高めることができる効果
がある。According to the tenth invention, the online diagnostic means diagnoses which input/output device the error has occurred in. Because the faulty input/output device can be easily identified, the allocation prohibition means can perform task allocation that reflects the diagnosis result, which has the effect of increasing the diversity of the system.

【０１２９】また、第１１の発明では、タスクのメンテ
ナンス手段が特定の計算機に特定のタスクを特定の期間
割り当てないことを保証し、外部記憶部に記憶されたモ
ジュールを取り出して、上記特定のタスクと入れ換え
る。このため、システムを停止することなく、ソフトウ
ェアのメンテナンス作業が行え、システムの多様性を高
める効果がある。Further, in the eleventh invention, the task maintenance means guarantees that a specific task is not allocated to a specific computer for a specific period, takes out a module stored in the external storage unit, and replaces the specific task with that module. Software maintenance can therefore be performed without stopping the system, which has the effect of increasing the diversity of the system.

【０１３０】また、第１２の発明では、オンライン縮退
手段は、計算機上で正常に処理しているタスクを停止
し、上記計算機をシステムから切り離し、別の計算機上
で上記タスクを再起動する。このため、ハードウェアの
メンテナンス作業において、システムを停止することな
く、メンテナンスを行うことができ、システムの多様性
を高める効果がある。In the twelfth invention, the online degeneration means stops the tasks being processed normally on a computer, disconnects the computer from the system, and restarts the tasks on another computer. Hardware maintenance can therefore be performed without stopping the system, which has the effect of increasing the diversity of the system.

【０１３１】また、第１３の発明では、オンライン拡張
手段が、切り離した計算機をシステムに再投入する。こ
のため、上記第１２の発明において、メンテナンス作業
を終えたハードウェアを容易にシステムを停止すること
なく、再投入することができ、システムの多様性を高め
る効果がある。Further, in the thirteenth invention, the online expansion means reconnects the disconnected computer to the system. Therefore, hardware whose maintenance under the twelfth invention has been completed can easily be returned to service without stopping the system, which has the effect of increasing the diversity of the system.
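Online degeneration and expansion can be sketched together as two operations on a table of active nodes. This is a simplified illustration under assumptions: the `Cluster` class and its methods are hypothetical, tasks are represented only by name, and the choice of the least-loaded remaining node as the restart target is an assumption, since the text does not state how the destination computer is chosen.

```python
class Cluster:
    def __init__(self, nodes):
        self.tasks = {node: [] for node in nodes}   # active node -> task names

    def degenerate(self, node):
        """Detach `node` and restart its tasks on the remaining nodes."""
        if len(self.tasks) < 2:
            raise RuntimeError("cannot detach the last active node")
        moved = self.tasks.pop(node)                # stop tasks, detach the node
        for task in moved:
            # Assumption: restart each task on the least-loaded remaining node.
            target = min(self.tasks, key=lambda n: len(self.tasks[n]))
            self.tasks[target].append(task)

    def expand(self, node):
        """Return a previously detached node to the system with no tasks."""
        self.tasks.setdefault(node, [])
```

Maintenance on a node then corresponds to calling `degenerate`, performing the work offline, and calling `expand`, while the tasks keep running elsewhere the whole time.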

【０１３２】また、第１４の発明では、メッセージ管理
手段が、事象通知部から計算機の消滅を受信し、上記計
算機の消滅に伴い、メッセージが消滅することを防ぐ。
このため、タスク間通信を安全に保障する効果がある。Further, in the fourteenth invention, the message management means receives notification of the disappearance of a computer from the event notification unit and prevents messages from being lost when the computer disappears. This has the effect of safely guaranteeing inter-task communication.

【０１３３】また、第１５の発明では、タスクの再投入
手段が、計算機の消滅に伴い、タスクが消滅することを
防ぐ。このため、上記第１４の発明と同様に、タスク間
通信を安全に行うことを保証する効果がある。In the fifteenth invention, the task re-input means prevents the task from disappearing as the computer disappears. Therefore, similar to the fourteenth invention, there is an effect of guaranteeing that the inter-task communication is safely performed.

【０１３４】また、第１６の発明では、情報採取手段
が、タスクの途中実行に必要な情報を採取する。そし
て、採取するタイミングは、タスク間通信の通信処理
中、及び、タスク間通信の送信処理、及び、タスク間通
信の受信処理、及び、タスク間通信処理における応答待
ち処理、及び、タスクの入出力処理のいずれかの処理を
行う場合である。このため、タスク間で同期したタスク
の途中実行に必要な情報を採取することができるので、
効率的、且つ、安全に情報採取を行うことができる効果
がある。Further, in the sixteenth invention, the information collecting means collects the information necessary for mid-execution of a task. The information is collected when the task performs any of the following: communication processing of inter-task communication, transmission processing of inter-task communication, reception processing of inter-task communication, response waiting processing in inter-task communication, or task input/output processing. Because the information necessary for mid-execution can thus be collected in synchronization among tasks, information can be collected efficiently and safely.

【０１３５】更に、第１７の発明では、情報採取手段
が、タスク間通信の通信相手が他の計算機上であって
も、他の計算機に対しタスクの途中実行に必要な情報を
採取するよう信号を送る。このため、空間的に冗長化実
行されているタスクである場合、容易に途中実行に必要
な情報を採取できる効果がある。Furthermore, in the seventeenth invention, even when the communication partner of the inter-task communication is on another computer, the information collecting means signals that computer to collect the information necessary for mid-execution of the task. Therefore, the information necessary for mid-execution can easily be collected even for a task that is being executed with spatial redundancy.

[Brief description of drawings]

【図１】この発明の実施例１を表す図である。FIG. 1 is a diagram showing a first embodiment of the present invention.

【図２】本システムにおいて実行／管理するタスク属
性の定義データを表す図である。FIG. 2 is a diagram showing definition data of task attributes executed / managed in this system.

【図３】デバイスビットマップの一例を示す図であ
る。FIG. 3 is a diagram showing an example of a device bitmap.

【図４】ノード起動時の初期化処理の処理手順を示す
流れ図である。FIG. 4 is a flowchart showing a processing procedure of initialization processing at node startup.

【図５】本システムにタスクが投入された時の処理手
順を示す流れ図である。FIG. 5 is a flowchart showing a processing procedure when a task is input to the present system.

【図６】本システムにおけるタスクの実行誤り検出方
法の処理手順を示す流れ図である。FIG. 6 is a flowchart showing a processing procedure of a task execution error detection method in the present system.

【図７】冗長化実行されているタスクが実行誤りを起
こした際の処理手順の前半を示す流れ図である。FIG. 7 is a flowchart showing the first half of the processing procedure when a task that is being redundantly executed causes an execution error.

【図８】冗長化実行されているタスクが実行誤りを起
こした際の処理手順の前半を示す流れ図である。FIG. 8 is a flowchart showing the first half of the processing procedure when a task that is being redundantly executed causes an execution error.

【図９】空間的に冗長化実行されていないタスクを他
ノードに移送する処理手順を示す流れ図である。FIG. 9 is a flowchart showing a processing procedure of transferring a task that is not spatially redundantly executed to another node.

【図１０】ノード診断の処理手順を示す流れ図であ
る。FIG. 10 is a flowchart showing a processing procedure of node diagnosis.

【図１１】本システムにおいてノード上のタスクモジ
ュールを入れ換える際の処理手順を示す流れ図である。FIG. 11 is a flowchart showing a processing procedure when replacing task modules on a node in the present system.

【図１２】本システムにおいてノードのメンテナンス
作業を行う際の処理手順を示す流れ図である。FIG. 12 is a flowchart showing a processing procedure when performing maintenance work on a node in the present system.

【図１３】本システムにおいて交換されるメッセージ
の形式を表す図である。FIG. 13 is a diagram showing a format of a message exchanged in the present system.

【図１４】異常消滅したノードに関する処理のリカバ
リ方式を示す流れ図である。FIG. 14 is a flowchart showing a recovery method of processing relating to an abnormally disappeared node.

【図１５】この発明において、タスクがチェックポイ
ント採取要求を受けた時処理の手順を示す流れ図であ
る。FIG. 15 is a flowchart showing a procedure of processing when a task receives a checkpoint collection request in the present invention.

【図１６】この発明において、タスクがタスク間通信
処理(送受信)に到達した時の処理手順を示す流れ図であ
る。FIG. 16 is a flowchart showing a processing procedure when a task reaches inter-task communication processing (transmission / reception) in the present invention.

【図１７】この発明において、タスクが入出力処理に
到達した時の処理を示す流れ図である。FIG. 17 is a flowchart showing a process when a task reaches an input / output process in the present invention.

【図１８】従来のＦＴ計算機の構成を表す図である。FIG. 18 is a diagram showing a configuration of a conventional FT computer.

【図１９】従来の多重計算機アーキテクチャのブロッ
ク図である。FIG. 19 is a block diagram of a conventional multiple computer architecture.

【図２０】従来のチェックポイント機構を示した構成
図である。FIG. 20 is a configuration diagram showing a conventional checkpoint mechanism.

[Explanation of symbols]

１ノード、２Ｉ／Ｏ制御を行うＦｒｏｎｔＥｎｄ
Ｐｒｏｃｅｓｓｏｒ、３Ｉ／Ｏネットワーク、４
Ｉ／Ｏ機器、５共有２次記憶装置、６タスク制御手
段、７ノード間通信手段、８ユーザタスクとのＩ／
Ｆ部、９事象通知部、１０実行誤り検出手段、１１
タスク定義部。1 node, 2 Front End Processor that performs I/O control, 3 I/O network, 4 I/O device, 5 shared secondary storage device, 6 task control means, 7 inter-node communication means, 8 I/F unit with user tasks, 9 event notification unit, 10 execution error detection means, 11 task definition unit.

フロントページの続き (72)発明者阿部薫鎌倉市大船五丁目１番１号三菱電機株式会社情報システム研究所内Front Page Continuation (72) Inventor Kaoru Abe 5-1-1 Ofuna, Kamakura City Mitsubishi Electric Corporation

Claims


1. A task redundancy execution method for a multi-computer system composed of a plurality of computers, wherein each computer has the following elements: (a) a task definition unit that stores in advance information for managing tasks; and (b) a task control means that controls the execution of tasks based on the information stored in the task definition unit.

2. The task redundancy execution method according to claim 1, wherein the task definition unit stores information that makes a task redundant in time and executes the task.

3. The task redundancy execution method according to claim 1, wherein the task definition unit stores information that makes the task spatially redundant and executes the task.

4. The task redundancy execution method according to claim 1, wherein the computer further comprises execution error detection means for detecting an error that occurs during task execution and notifying the task control means of the detected error, the task definition unit defines in advance the control to be performed when an error is detected, and the task control means controls the execution of the task based on the notified content and according to the definitions in the task definition unit.

5. The task redundancy execution method according to claim 4, wherein the task control means comprises redundancy execution means that restarts the task so as to secure the redundancy of the task when the execution error detection means reports an error in a task being executed on a specific computer.

6. The redundancy execution means stops the execution of a task on a computer in which an error of the task being executed has occurred, and restarts the task on another computer. Redundant execution method for the listed tasks.

7. The task redundancy execution method according to claim 1, 2, 3, 4 or 5, wherein the computer is composed of redundant hardware, and the task control means comprises task transfer means that, when the hardware fails while the computer is executing a task that is not spatially redundant, temporarily stops the task being processed normally on the computer and restarts the task on another computer.

8. The task redundancy execution method according to claim 1, further comprising an external storage unit that stores information necessary for mid-execution of a task, wherein the task control means comprises: information collecting means that collects, from a task being processed normally, the information necessary for mid-execution of the task and stores the collected information in the external storage unit; and re-execution means that retrieves the information necessary for mid-execution of the task from the external storage unit and re-executes the task from the middle based on the retrieved information.

9. The task redundancy execution system further comprises a plurality of input / output devices, and the task control means performs input / output operation when an error is detected in a task being executed by the execution error detection means. 5. The task according to claim 1, further comprising an allocation prohibition means for prohibiting allocation of another task using an input / output device, which is scheduled to be used by the task, to the computer. Redundant execution method.

10. The task redundancy execution method according to claim 9, wherein the task definition unit includes a diagnostic procedure for diagnosing which of the input/output devices has an error, and the task control means comprises online diagnostic means that executes the diagnostic procedure.

11. The external storage unit further stores a module, and the task control unit ensures that a specific task is not allocated to a specific computer for a specific period of time, and is stored in the external storage unit. 2. A task maintenance means for taking out a module and replacing the taken-out module with the specific task is provided.
~ Redundant execution method of the tasks described in 4 above.

12. The task redundancy execution method according to claim 1, 2, 3, 4, or 5, wherein the task control means comprises online degeneration means that stops a task being processed normally on the computer, disconnects the computer from the system, and restarts the task on another computer.

13. The task redundancy execution system according to claim 12, wherein said task control means further comprises an online expansion means for re-inputting the separated computer into the system.

14. The task redundancy execution method according to claim 1, wherein the plurality of computers further comprise communication means for exchanging messages between the computers and an event notification unit that detects the disappearance of a computer with which communication is in progress and notifies the task control means of the disappearance, and the task control means comprises message management means that, based on the notification from the event notification unit, prevents messages from being lost when the computer disappears.

15. The task redundancy execution method according to claim 14, wherein the task control means further comprises task re-input means for preventing the task from disappearing as the computer disappears.

16. The task redundancy execution method according to claim 8, wherein the information collecting means collects the information necessary for mid-execution of a task when the task performs any one of: communication processing of inter-task communication, reception processing of inter-task communication, response waiting processing in inter-task communication, and task input/output processing.

17. The task redundancy execution method according to claim 16, wherein, when the communication partner of the inter-task communication is on another computer, the information collecting means sends a signal to the information collecting means provided in the other computer so that the information necessary for mid-execution of the task is collected.