JPH0635874A

JPH0635874A - Parallel processor

Info

Publication number: JPH0635874A
Application number: JP3352657A
Authority: JP
Inventors: Thomas Kelly; ケリー・トーマス; Maclean Mackenzie Louis; ルイス・マクリーン・マッケンジー; John Sutherland Robert; ロバート・ジョン・サザーランド
Original assignee: Motorola Ltd
Current assignee: Motorola Solutions UK Ltd
Priority date: 1990-12-20
Filing date: 1991-12-16
Publication date: 1994-02-10
Also published as: EP0492174B1; EP0492174A3; DE69130857D1; EP0492174A2; GB9027633D0; GB2251320A

Abstract

PURPOSE: To provide a generalized hypercube topology with superior connectivity, high bandwidth and low latency for a processor whose parallel architecture achieves high-throughput processing by using multiple central processing units (CPUs). CONSTITUTION: The processor has a plurality of processing elements arranged in D dimensions and divided into subsets 11. All the processing elements in a subset share a bus 13 over which they communicate with one another. Each processing element is a member of one subset 11 in each dimension. Each processing element of a subset is connected to the bus of that subset by output means, through which it transmits messages to the other processing elements of the subset. It also has a separate input means for each other processing element of the subset, on which it receives messages from that element.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a processor having a parallel architecture that achieves high-throughput processing by using a large number of central processing units (CPUs).

[0002]

[Description of the Related Art] Increasing processing performance through the use of multiple CPUs, possibly by several orders of magnitude, has been widely discussed in recent years. This demand is now spreading through both academic and commercial fields, and is being met on a small scale, at the system level, with VLSI microprocessors. However, there are at least two major obstacles to wide acceptance and to large-scale systems.

[0003] First, it is very difficult to design and build the versatile and powerful communications required by a highly parallel general-purpose computer. Second, it is far from clear how such a computer should be programmed once it has been built. Significant research has already been carried out on new programming paradigms, such as those based on functional, object-oriented and dataflow models (see: Bronnenberg WJHJ, Nijman L, Odjik EAM, van Twist RAH: "Doom; a decentralised object-oriented machine", IEEE Micro Vol 7 No 5 (Oct 1987), pp 547-553; Watson I, et al: "Flagship: a parallel architecture for declarative programming", in Proceedings of 15th Annual Symposium on Computer Architecture, IEEE Comp Soc Press (1988), pp 124-130; Veen A H: "Dataflow machine architecture", ACM Computing Surveys, Vol 18 No 4 (Dec 1986), pp 365-396).

[0004] The purpose of the subnet is to provide communication connections between host elements (processing nodes and their associated memory). Ideally, these connections should have the following properties:
a) High bandwidth. This allows large amounts of data to be transferred between host elements whenever required.
b) Low latency. This ensures that a process which sends a message and requires a reply does not have to wait for an excessive period.

[0005] The subnet can be the largest contributor to latency, particularly when there are many switching levels to negotiate.

[0006] Furthermore, the subnet must be able to hold the bandwidth and round-trip latency of a connection to uniform, acceptable values, regardless of the relative physical positions of the correspondents (metric symmetry) and of activity elsewhere in the network. Finally, if the interconnection topology is to serve in medium-sized and large multicomputers, it must retain its favourable architectural properties (latency, bandwidth, symmetry, independence) over a wide range of possible network sizes.

[0007] Conventional architectures for parallel processors are described in Larry D Wittie: "Communication Structures for Large Networks of Microcomputers", IEEE, 1981.

[0008]

[Problem to be Solved by the Invention] The binary hypercube has low metric symmetry, because some processors take much longer to reach from a given point than others. Its latency is therefore inherently highly variable. Furthermore, although the hypercube offers good bandwidth, every doubling of the number of processors increases the diameter by one, so the worst-case delay grows in larger assemblies.
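
The growth of the diameter described above can be illustrated with a short Python sketch. It assumes the usual labelling of a binary hypercube, in which nodes are numbered with bit strings and two nodes are neighbours when their labels differ in exactly one bit; the function names are illustrative only and do not come from the patent.

```python
def binary_hypercube_diameter(num_nodes: int) -> int:
    """Worst-case hop count of a binary hypercube with num_nodes nodes."""
    if num_nodes < 1 or num_nodes & (num_nodes - 1):
        raise ValueError("a binary hypercube needs a power-of-two node count")
    return num_nodes.bit_length() - 1  # equals log2(num_nodes)


def hop_distance(a: int, b: int) -> int:
    """Hops between two nodes: the Hamming distance of their labels."""
    return bin(a ^ b).count("1")


# Doubling the node count adds one to the diameter.
for n in (16, 32, 64, 1024):
    print(n, binary_hypercube_diameter(n))
```

Within one machine the hop count between two nodes ranges from 1 up to the diameter, which is the variation in latency (low metric symmetry) that the paragraph refers to.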

[0009] A general aim in this field is to achieve an architecture with high interconnectivity between nodes, so that a message reaches its destination by passing through a minimum number of nodes. The ultimate limit on interconnectivity is the wiring density that can be physically supported, or the corresponding limit of whatever other means of communication is used between nodes (for example optical buses, free-standing optical transfusers or other means).

[0010]

[Means for Solving the Problem and Operation] According to the present invention, there is provided a parallel processor comprising a plurality of processing elements arranged in D dimensions and divided into a plurality of subsets, all the processing elements in a subset having one bus for communication among them, and each processing element being a member of one subset in each dimension, characterized in that each processing element of a subset is connected to the bus of that subset by output means for sending messages to the other processing elements of the subset, and has separate input means, one corresponding to each other processing element of the subset, for receiving messages from that other processing element on the corresponding input means.

[0011] A processing element cannot send messages to several other elements at the same time, but it has been found that this does not degrade performance relative to the theoretically optimal arrangement, in which every element is connected to every other element by its own input and output lines. Performance is therefore almost identical to the theoretical optimum, while the interconnect wiring density is substantially reduced.

[0012] In a preferred embodiment, the processing elements of a subset are arranged in a line, and a processing element located between the ends of the line has one output means for sending messages to the other processing elements on one side of it along the line, and a separate output means for sending messages to the other processing elements on the other side. A processing element located between the ends of the line can thus send messages in both directions along the line simultaneously, yet the critical wiring density does not increase, because the wiring density is greatest at the crossover points between the bus of one line and the bus of an orthogonal line. (In any case, a processing element can simultaneously send messages to other elements within one of its subsets and to other elements of another subset of which it is a member.)

[0013] In the simplest arrangement, the processor comprises a two-dimensional array of processing elements in which each row forms a subset and each column forms a subset. The processing element at the intersection of a row and a column performs the task of communicating between the two subsets. Hereafter the term "cluster" is used for a subset.

[0014] The advantage of the invention is an architecture that is fundamentally scalable and modular, that can support an unlimited number of processing elements (PEs) linked by a highly connected, symmetrical, low-latency network, and that has a cost-performance comparable to that of a binary hypercube.

[0015] A degree of metric asymmetry is accepted by clustering the processing elements (PEs) into tightly connected groups and joining these groups with high-bandwidth links, with the option of repeating the process.

[0016]

[Embodiment] Referring to FIG. 1, each cluster contains up to w nodes. The value w, the net width, is a fixed characteristic of a given machine (although it is variable within the architecture). Each node has its own unidirectional bus, which connects the node to selected other elements in the same cluster; each of those elements can in turn select among w-1 identical incoming links. Each line is electrically driven by exactly one output device. This avoids the speed-limiting phenomenon associated with shared lines, the so-called wired-OR glitch (Gustavason D B, Theus J: "Wire-Or logic on transmission lines", IEEE Micro Vol 3 No. 3 (June 1983), pp 51-55), which arises when bus ownership, or even signal direction, changes. Data can be transferred to an individual recipient, or by global or cluster-wide broadcast. Strictly speaking, the architecture depends on the cluster interconnection method. The complete system is a D-dimensional lattice formed by taking the D-th Cartesian product of the cluster graph topology. This has the effect of imposing the cluster organization in every dimension, giving a recursively generated structure known as a generalized hypercube. FIG. 1 shows a two-dimensional example forming a 2-D hyperplane, in which each node belongs to two independent orthogonal clusters. The approach extends to higher dimensions: an N-dimensional hypercube, in which each node belongs equally to N clusters, is composed of w hyperplanes of dimension N-1 connected by w^(N-1) cluster links. This superposition of orthogonal clusters provides the high-bandwidth interconnect essential for global message passing.
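
The Cartesian-product construction can be made concrete with a small Python sketch. It assumes that nodes are labelled with D coordinates, each in the range 0 to w-1, and that a cluster is the set of w nodes that agree on every coordinate except one; the function name `clusters` and this labelling are illustrative choices, not taken from the patent.

```python
from itertools import product


def clusters(w: int, d: int) -> list[frozenset[tuple[int, ...]]]:
    """All clusters of a d-dimensional generalized hypercube of width w.

    A cluster is the set of w nodes that share every coordinate except
    the one belonging to the cluster's own dimension.
    """
    result = []
    for dim in range(d):
        # Fix the other d-1 coordinates and vary coordinate `dim`.
        for fixed in product(range(w), repeat=d - 1):
            members = frozenset(
                fixed[:dim] + (i,) + fixed[dim:] for i in range(w)
            )
            result.append(members)
    return result


w, d = 6, 2                          # the 6 x 6 array of FIG. 1
all_clusters = clusters(w, d)
print(len(all_clusters))             # 12: six rows plus six columns
node = (2, 4)
print(sum(node in c for c in all_clusters))  # 2: one cluster per dimension
```

In general the construction yields D * w^(D-1) clusters, of which w^(N-1) run along any one dimension of an N-dimensional system, matching the count of cluster links given in the text.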

[0017] Unlike a simple binary hypercube system, D takes only low values (for example 2 or 3) even in large machines. For example, with w = 32, which is an achievable figure, a three-dimensional structure contains 32K processing elements (PEs).
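
The arithmetic behind this figure can be checked with a few lines of Python. The sketch assumes that a generalized hypercube of width w and dimension D holds w^D nodes and that a message needs at most one hop per dimension; the helper names are hypothetical.

```python
def generalized_hypercube(w: int, d: int) -> tuple[int, int]:
    """Return (node_count, diameter) for width w and dimension d (w >= 2)."""
    return w ** d, d  # at most one hop per dimension


def binary_hypercube_diameter_for(node_count: int) -> int:
    """Dimension (and diameter) of the smallest binary hypercube with node_count nodes."""
    return (node_count - 1).bit_length()


nodes, diameter = generalized_hypercube(32, 3)
print(nodes, diameter)                       # 32768 3
print(binary_hypercube_diameter_for(nodes))  # 15
```

Under these assumptions, a binary hypercube of the same size would need 15 dimensions, against 3 for the structure with w = 32.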

[0018] Because of the low latency of its connections, the hardware can support shared memory as well as message passing. A processor wishing to access a shared location sends a short request message to the node that holds the location and, once the request has been serviced, receives a reply (containing the data if the request was a read). To minimize latency, the remote memory management hardware can read and write the whole memory independently of the processor. In many cases the penalty for such uncontended accesses to memory will be more than twice that for local RAM.
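
The request and reply exchange can be sketched in Python as follows. The message layout (a dictionary with `op`, `addr` and `data` keys) and the class name `MemoryNode` are purely hypothetical; the patent does not specify a packet format at this point, and real hardware would service the request without involving the processor.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """A node that owns part of the shared memory (illustrative only)."""
    memory: dict[int, int] = field(default_factory=dict)

    def service(self, request: dict) -> dict:
        """Service one short request message and build the reply."""
        if request["op"] == "read":
            # A read reply carries the requested data.
            return {"ok": True, "data": self.memory.get(request["addr"], 0)}
        # A write reply is only an acknowledgement.
        self.memory[request["addr"]] = request["data"]
        return {"ok": True}


home = MemoryNode()
home.service({"op": "write", "addr": 0x10, "data": 99})
reply = home.service({"op": "read", "addr": 0x10})
print(reply["data"])  # 99
```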

[0019] FIG. 1 shows a processor comprising a 6 × 6 array of processing elements 10, each element constituting a node of the array. Each row of elements forms a cluster 11, and each column of elements forms a cluster 12. The array can be extended to three or more dimensions. In the three-dimensional case the array is built up of six layers, each identical to that shown in FIG. 1, with each column forming a cluster. The same principle of symmetry applies to further dimensions of the system. For example, a fourth dimension can be generated by replacing each element with a cluster of six elements.

[0020] The six elements of a cluster are connected by a bus 13, which is described below with reference to FIG. 3.

[0021] FIG. 2 shows the structure of an element 10. The element comprises a host element 20 containing one or more microprocessors 21, for example Motorola M8800 microprocessors. The host element also includes a memory 22 and a communication element 23. Associated with the host element 20 is a network element 24, which has one cluster interface unit (CIU) 25 per dimension for interfacing the node to its clusters; in FIG. 2, three CIUs are shown as 26, 27 and 28. A single host interface unit (HIU) 29 is provided for exchanging information with the host element 20. The network element 24 also includes a network element management unit 30.

[0022] Referring to FIG. 3, a cluster of four elements 10 is shown, the elements being given the reference numerals 10a, 10b, 10c and 10d. These elements are connected to the bus 13, which itself consists of four buses of 16 lines each. One bus is connected from each network element for output. The output from one network element is connected to an input of each of the other network elements in the cluster, so each network element has one output and three inputs. For each further dimension of the array, each element has one more output and an input for each other bus connected to it. A three-dimensional array of width w = 4 thus has three outputs and 12 inputs per network element. Each line of the bus 13 is electrically driven by a single output device, which prevents wired-OR glitches. This arrangement has the drawback that a host element can send only one message at a time to the other elements in its cluster. This is not a serious drawback, however, because the host element is in any case a serial device (or a limited number of serial devices), and because it can in any case send messages to elements in its orthogonal clusters at the same time.

[0023] One of the limiting factors on the density and interconnectivity of an array as described above is wiring density. The wiring density limit takes different forms depending on whether the bus 13 is hard-wired, a microwave link, an optical link, a radio link, and so on. The region of maximum wiring density is at the crossover points between orthogonal buses. FIG. 4 shows an arrangement that increases the interconnectivity of the elements within a cluster without increasing the wiring density at the crossover points. In this arrangement, each element 10b, 10c between the ends of the bus has two outputs, one extending to the left and one extending to the right. Each such element can send messages to the left and to the right along the bus simultaneously. The only increase in wiring density is at the outputs of the network elements, which is not a critical region.

[0024] The apparatus operates as follows. When one network element is to send data or a command to another network element, the data or command is formed into a packet in the communication element 23, addressed to the destination element. The packet is sent out through the host interface unit 29 and then through whichever of the cluster interface units 26, 27 or 28 corresponds to the destination element. The destination element may, of course, not be in the same cluster as the source element, in which case the packet must be sent to the node at the intersection of the source cluster and the destination cluster. Further intermediate steps may also be needed. For a D-dimensional array, the maximum number of steps is D. The network element management unit 30 determines the route by which the packet is sent to its destination. For example, under restricted routing, a message that has to travel south-west is sent first west and then south, which avoids collisions with messages arriving from the south-west, since those messages are sent first east and then north. Other routing protocols can also be devised. When the message reaches the bus 13 of the destination cluster, it is recognized by the address of the destination element and is received by that element's cluster interface unit.
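
The restricted routing described above can be sketched as dimension-order routing: the packet corrects one coordinate per hop, always in the same order of dimensions. This is one possible reading of "first west, then south", offered as an assumption rather than as the patent's own algorithm; the coordinate labelling and the function name are illustrative.

```python
Node = tuple[int, ...]


def dimension_order_route(src: Node, dst: Node) -> list[Node]:
    """Nodes visited from src to dst, fixing one coordinate per hop."""
    path = [src]
    current = list(src)
    for dim, target in enumerate(dst):
        if current[dim] != target:
            # One hop along this dimension's cluster bus reaches any
            # other member of the cluster directly.
            current[dim] = target
            path.append(tuple(current))
    return path


print(dimension_order_route((0, 0), (3, 5)))  # [(0, 0), (3, 0), (3, 5)]
```

Because each hop settles one coordinate for good, the number of hops never exceeds D, in line with the maximum number of steps stated in the text.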

[0025] When a packet is received, it is buffered in the buffer memory of the network element 24. The network element 24 can receive several packets simultaneously, and either passes them to its host element for processing or sends them on to an orthogonal cluster. Arbitration circuits are provided in the cluster interface units 26, 27 and 28 so that packets arriving simultaneously are buffered and handled in a time-efficient manner. If a packet is addressed to the element itself, it is passed through the HIU 29 to the communication element 23 and is either processed by the processing element 21 or stored in the memory element 22.

[0026] By way of example, the operation (a + b) × (c + d) can be executed by parallel processing as follows. Assuming that all the parameters are variables, the operation a + b is executed in a first element of the array and the operation c + d is executed in a second element. The first and second elements send packets containing the results of these operations to a third element, which performs the multiplication on those results.
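
The same decomposition can be mimicked on an ordinary computer with Python's standard `concurrent.futures` module. In this sketch two worker threads stand in for the first and second elements and the calling thread stands in for the third; this mapping is an analogy chosen for illustration, not a model of the hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def add(x, y):
    return x + y


def multiply(x, y):
    return x * y


def parallel_expression(a, b, c, d):
    """Evaluate (a + b) * (c + d) with the two sums computed concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(add, a, b)    # "first element": a + b
        second = pool.submit(add, c, d)   # "second element": c + d
        # "Third element": waits for both results, then multiplies them.
        return multiply(first.result(), second.result())


print(parallel_expression(1, 2, 3, 4))  # 21
```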

[0027] Network

The preferred method of routing messages is a variant of wormhole routing (Dally W J, Seitz C L: "Multicomputers: message-passing concurrent computers", IEEE Computer, Vol 21, No. 8 (Aug 1988), pp 9-23). Each worm consists of a head only, and is hereafter referred to as a packet.

[0028] The processing elements described above are connected into clusters of width w in each available dimension. In each cluster to which a processing element belongs, the node has a unidirectional line over which it can send a packet to any one of the other (w-1) nodes making up the cluster, or broadcast the packet simultaneously to a subgroup of them. Packets arriving at a node are of two basic types:
a) Intra-cluster: these are on the last link (possibly the only link) of their journey and are delivered locally.
b) Inter-cluster: after being received by the current node, these proceed to an orthogonal cluster.

[0029] Arriving packets are buffered and await selection by the CIU. The CIU selects from its (w-1) sources using a high-speed round-robin algorithm. The selected packet is handled differently depending on its type. Intra-cluster packets pass through the HIU to the host element and are eventually written directly into a predetermined area of the local buffer memory. Inter-cluster packets pass directly to the CIU of the next orthogonal cluster on their route. Packets follow a minimum-distance path to their destination except in the presence of faults. Since even large systems (> 10^4 nodes) have a diameter of at most 3, the routing overhead is essentially negligible over the whole spectrum of feasible sizes. Deadlock is easily avoided by restricting the routing or by introducing structured buffer management. The small diameter minimizes the performance penalty of these strategies.
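
The round-robin selection among the (w-1) sources can be sketched as follows. The sketch assumes one first-in, first-out buffer per source and a pointer that moves on past the source served most recently; the class and method names are hypothetical, and the patent does not describe the arbiter at this level of detail.

```python
from collections import deque


class RoundRobinArbiter:
    """Picks buffered packets from several sources in rotating order."""

    def __init__(self, num_sources: int):
        self.queues = [deque() for _ in range(num_sources)]
        self.next_source = 0

    def enqueue(self, source: int, packet) -> None:
        self.queues[source].append(packet)

    def select(self):
        """Return (source, packet) from the next non-empty buffer, or None."""
        n = len(self.queues)
        for offset in range(n):
            source = (self.next_source + offset) % n
            if self.queues[source]:
                # Resume the scan after this source next time.
                self.next_source = (source + 1) % n
                return source, self.queues[source].popleft()
        return None


arbiter = RoundRobinArbiter(num_sources=3)   # w = 4 gives w - 1 = 3 sources
arbiter.enqueue(0, "a1")
arbiter.enqueue(0, "a2")
arbiter.enqueue(2, "c1")
print(arbiter.select())  # (0, 'a1')
print(arbiter.select())  # (2, 'c1')
print(arbiter.select())  # (0, 'a2')
```

With this policy a busy source cannot starve the others: after source 0 is served, source 2 is served before the second packet from source 0.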

[0030] In addition to packets carrying data, the network layer recognizes special packets called network control packets (NC packets), which pass control information between network elements. This control information can include housekeeping information for tasks such as buffer management, an automatic function performed by the CIUs. Control packets are, however, also used by the network element management unit, which supervises network activity, load balancing (where appropriate) and strategic congestion control.

[0031] Links

Communication between network elements within a cluster is by high-bandwidth link. There is considerable advantage in separating the link and network functions; in particular, features that depend heavily on the link implementation technology can be isolated. In practice, a cluster link can be realized in several ways: for example as a short broad active backplane fitted with the interfaces, as demultiplexed point-to-point links, as a multi-stage optical star configuration, or as a set of ULSI devices. Transfer rates of up to 1 GByte/s can be achieved using an active backplane or a demultiplexed star distributor. Different link protocols are appropriate to these different technologies. For example, the protocol used on a parallel bus implementation differs from that required by a serial optical cable.

[0032] A 16-bit unidirectional bus implementation has been adopted, together with an associated parallel-bus link protocol. Unlike typical LAN or WAN protocols, the link layer is not intended to provide error control or flow control (both of which are performed by the network layer). It is concerned with delimiting packets on the physical link, preserving transparency, and multi-drop addressing.

[0033] Global Memory

The packet structure is designed to support shared-access global memory. Because of the possible size of the system, it is clear that 32-bit addressing cannot give uniform access to the whole global address space. A basic process uses 32-bit virtual addresses, which are translated into 32-bit physical addresses by the local MMU. Global memory is organized in logical addressing units called superpages. Each superpage is interleaved across a hyperplane of dimension H (≤ D). In a two-dimensional system, therefore, the whole global memory can be fully interleaved across all host elements (H = 2), or interleaved only across a local cluster (H = 1), or divided into regions of both kinds used by different cooperating processes. This gives a great deal of flexibility across a wide range of shared-memory applications.
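
One way to picture the interleaving is the Python sketch below. It assumes word-granular, round-robin placement of a superpage over the w^H nodes of a hyperplane, with the hyperplane identified by an "anchor" node whose first H coordinates are replaced by the interleave position. None of these details is given in the patent; the granularity, the mapping and the function name are assumptions made only to illustrate the roles of H = 1 and H = 2.

```python
def interleaved_home(word_index: int, w: int, h: int,
                     anchor: tuple[int, ...]) -> tuple[int, ...]:
    """Node holding `word_index` of a superpage interleaved over h dimensions.

    `anchor` fixes the coordinates outside the hyperplane; its first h
    coordinates are overwritten by the interleave position (assumed layout).
    """
    position = word_index % (w ** h)
    coords = list(anchor)
    for dim in range(h):
        coords[dim] = position % w
        position //= w
    return tuple(coords)


w = 4
# H = 2: sixteen consecutive words land on sixteen different nodes.
print(len({interleaved_home(i, w, 2, (0, 0)) for i in range(16)}))  # 16
# H = 1: the words stay within the four nodes of one cluster.
print(sorted({interleaved_home(i, w, 1, (0, 3)) for i in range(16)}))
```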

[0034] Short accesses to global memory are transferred with a unique CLP priority and are handled as unobstructed packets, so as to minimize latency.

[0035] Host Element

The host element configuration is shown in FIG. 5. The architecture does not preclude the use of special nodes designed for paradigms such as dataflow. Requests to global memory and replies from it are handled directly by the memory element (ME) 22. The communication element (CE) 23 also contains a direct memory access controller (DMAC) 50, which transfers data directly to and from the local memory 51 of the processing element (PE) 21. The DMAC 50 can optimize message transfer for the software paradigm in use, and maintains message descriptors indicating the length and the location in memory of the messages to be transferred.

【００３６】The functions of the shared memory element (ME) 22 are coordinated by a built-in hardware module called the global memory controller (GMC). The GMC consists of the following two sub-modules.
a) Global memory management unit (GMMU) 54. Under kernel control, it traps a range of physical addresses and generates packets with a special global memory qualifier. The GMMU can interpret the interleaving scheme currently in use on addresses generated by the active process.
b) Global memory access unit (GMAU) 55. It receives the memory access packets passed on by the communication element (CE). The GMAU 55 can issue single or block DMA accesses to the global memory segment. In addition, the GMAU 55 can be used to perform higher-level operations, including synchronization, list handling, and garbage collection.
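As a rough sketch of what trapping an address range and interpreting an interleaving scheme could look like, the following maps a physical address to a memory module and an offset within it. The stripe-based interleave, the `GlobalWindow` fields, and the function name `trap` are assumptions for illustration only; the specification does not define the interleaving format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalWindow:
    """Hypothetical description of the trapped physical address range."""
    base: int        # first trapped physical address
    size: int        # number of trapped addresses
    modules: int     # number of global memory modules (assumed)
    interleave: int  # consecutive words held by one module before moving on (assumed)


def trap(window, physical_address):
    """Return (module, offset) for a trapped address, or None for a local address."""
    if not (window.base <= physical_address < window.base + window.size):
        return None  # outside the window: the access stays local
    relative = physical_address - window.base
    stripe, within = divmod(relative, window.interleave)
    module = stripe % window.modules
    offset = (stripe // window.modules) * window.interleave + within
    return module, offset


window = GlobalWindow(base=0x8000, size=0x1000, modules=4, interleave=16)
local_access = trap(window, 0x7FFF)
first_word = trap(window, 0x8000)
next_stripe = trap(window, 0x8010)
wrapped = trap(window, 0x8045)
```

In this sketch a result of `None` corresponds to an ordinary local access, and any other result would become the address field of a global memory packet.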

【００３７】Shared memory accesses can be made at either the system level or the user level. In the former case, the user process views global memory as a shared structure that is not necessarily directly addressable (cf. a RAM disk). When access is required, the process issues a blocking or non-blocking system call, upon which the local kernel passes the request to the GMMU. A process can use this approach to prefetch data before it is actually needed. True user-level accesses are generated directly by the user process, translated by the local memory management unit (MMU), and trapped by the GMMU. The accessing process can be stalled if the latency is short enough, or suspended if necessary.
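The difference between a blocking call and a non-blocking call used for prefetching can be sketched as follows. The dictionary standing in for global memory, the function names, and the use of a thread pool in place of the kernel and GMMU are assumptions for illustration, not part of the described hardware.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for global memory contents (illustrative values only).
GLOBAL_MEMORY = {0x10: 111, 0x20: 222}


def _kernel_fetch(address):
    """Hypothetical kernel path that forwards the request and returns the data."""
    return GLOBAL_MEMORY[address]


def read_blocking(address):
    """Blocking call: the caller waits until the data has arrived."""
    return _kernel_fetch(address)


def read_nonblocking(executor, address):
    """Non-blocking call: returns a future at once, so the caller can keep working."""
    return executor.submit(_kernel_fetch, address)


with ThreadPoolExecutor(max_workers=1) as kernel:
    pending = read_nonblocking(kernel, 0x20)  # prefetch before the data is needed
    local_work = sum(range(10))               # useful work overlaps the access
    prefetched = pending.result()             # collect the data when it is needed

immediate = read_blocking(0x10)
```

The non-blocking form is what makes prefetching worthwhile: the access latency is hidden behind work the process had to do anyway.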

【００３８】Network Element
As shown in FIG. 6, each CIU consists of three distinct functional units (this functional division should not be construed as restricting how the network element is partitioned into packages in a VLSI implementation).
a) Transmitting unit 60. It sends messages to the other nodes in the common cluster.
b) Receiving unit 61. It selects messages arriving from the other nodes in the common cluster, using an appropriate arbitration mechanism, and either delivers them locally or routes them to another CIU.
c) CIU control unit 60. It monitors the buffer usage of each potential recipient, exchanging NC packets with the corresponding CIU control units as necessary.

【００３９】The transmitting unit 60 transfers packets onto the cluster link using an integral automaton (that is, a finite state machine) called the link control unit. Such packets may come from the HIU, if they were generated locally, or directly from one of the NE receiving units (intra-cluster messages).

【００４０】The receiving unit 61 comprises several buffered input multiplexers, each with a round-robin arbitration circuit for source selection. Packets are buffered separately according to their type and according to whether they are for local or non-local delivery. For any given category, however, each multiplexer input has space for only one packet, so the total buffer requirement stays reasonable. The CIU control unit monitors the space in the receive buffers: independently of any data currently in transit, it generates NC packets, which are passed directly to the local transmitting unit and sent immediately as control frames. When a packet arrives at an input, the arbitration circuit associated with that input receives a transmission request, provided that transfer to the next node is possible. A packet selected by the arbitration circuit is sent immediately to its recipient. There are four main cases:
a) The packet is sent to the PE via the HIU and CE, and is written directly into the buffer memory associated with the receiving process.
b) The packet is sent to the transmitting unit of another CIU (intra-cluster traffic).
c) The packet is sent via the HIU to the global memory access unit. A global memory request packet is sent to the GMAU, which issues a DMA access to its associated global memory module.
d) The packet is sent, again via the HIU, to the global memory management unit. For example, a global memory request packet carrying read data is sent to the GMMU, where it is matched with the pending request.
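The round-robin selection among inputs that each hold at most one packet can be sketched as below. The class name `RoundRobinArbiter` and its methods are hypothetical, and the sketch models only the selection policy, not the hardware handshake or the separate buffering by packet category.

```python
class RoundRobinArbiter:
    """Illustrative arbiter over inputs that each buffer at most one packet."""

    def __init__(self, n_inputs):
        self.slots = [None] * n_inputs  # one-packet buffer per multiplexer input
        self.last = n_inputs - 1        # start so that input 0 is considered first

    def offer(self, source, packet):
        """Buffer a packet from `source`; return False if that slot is occupied."""
        if self.slots[source] is not None:
            return False
        self.slots[source] = packet
        return True

    def grant(self):
        """Pick the next waiting input after the last one served, or None if idle."""
        n = len(self.slots)
        for step in range(1, n + 1):
            index = (self.last + step) % n
            if self.slots[index] is not None:
                packet, self.slots[index] = self.slots[index], None
                self.last = index
                return index, packet
        return None


arbiter = RoundRobinArbiter(3)
accepted = [arbiter.offer(0, "a"), arbiter.offer(2, "c"), arbiter.offer(0, "x")]
first = arbiter.grant()
arbiter.offer(0, "a2")
second = arbiter.grant()
third = arbiter.grant()
idle = arbiter.grant()
```

Because the search always starts just after the input served last, an input that keeps refilling its slot cannot starve the others.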

【００４１】Signal definitions are not standardized across implementations of the architecture, because they are changed as appropriate to the technique used. These physical definitions are decoupled from the higher layers of the architecture and can be changed without affecting the design of any module other than the transmitting and receiving units.

【００４２】The required bus lines are as follows.
a) D15-D0: data.
b) DELIM (worm delimiter): indicates that a worm is starting or ending (frame flags are not used, to avoid transparency complications).
c) CS (control strobe): indicates a control frame.
d) DS (data strobe): indicates the presence of a valid data word on the link.
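How these lines might frame a worm can be sketched as a sequence of link cycles. This is only an illustration under stated assumptions: that DELIM is asserted on the first and last word of a worm, that a worm has at least two words, that DS is low on control cycles, and that a control frame may appear between the words of a worm. The names `LinkCycle`, `frame_worm`, and `unframe` are hypothetical.

```python
from typing import NamedTuple


class LinkCycle(NamedTuple):
    """One cycle on the link, using the four signal groups listed above."""
    data: int    # D15-D0
    delim: bool  # DELIM: worm starts or ends on this word (assumed meaning)
    cs: bool     # CS: this cycle carries a control frame word
    ds: bool     # DS: a valid data word is present


def frame_worm(words):
    """Turn a list of 16-bit words into link cycles with DELIM on both ends."""
    if len(words) < 2:
        raise ValueError("this sketch assumes worms of at least two words")
    last = len(words) - 1
    cycles = []
    for i, word in enumerate(words):
        if not 0 <= word <= 0xFFFF:
            raise ValueError("data words must fit in D15-D0")
        cycles.append(LinkCycle(word, delim=(i == 0 or i == last), cs=False, ds=True))
    return cycles


def unframe(cycles):
    """Recover worms from a cycle stream, skipping control and idle cycles."""
    worms, current = [], None
    for cycle in cycles:
        if cycle.cs or not cycle.ds:
            continue
        if current is None:
            if cycle.delim:
                current = [cycle.data]
        else:
            current.append(cycle.data)
            if cycle.delim:
                worms.append(current)
                current = None
    return worms


control = LinkCycle(0x00AA, delim=False, cs=True, ds=False)
stream = frame_worm([1, 2, 3])
stream.insert(1, control)  # a control frame arriving in the middle of a worm
stream += frame_worm([7, 8])
recovered = unframe(stream)
```

Using a dedicated delimiter line rather than an in-band flag word means any 16-bit value can appear as data without escaping.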

【００４３】Control frames are used to control the level of communication between the transmitting unit and the receiving unit. A control frame is, for example, one or two words long and carries information such as the receive buffer status. The first word of a packet contains cluster addressing information, together with information that enables the receiving NC to interpret the routing information. At the receiving unit this word is stripped and discarded, so that new routing information is then at the head of the packet.
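The stripping of the leading word at each receiving unit can be sketched as follows. The packet is modelled as a plain list of words and the function name `strip_routing_word` is hypothetical; the example assumes one routing word per hop and does not model the actual encoding of the addressing information.

```python
def strip_routing_word(packet):
    """Remove the leading routing word and return it with the rest of the packet."""
    if len(packet) < 2:
        raise ValueError("a packet needs a routing word and at least one more word")
    return packet[0], packet[1:]


# Illustrative packet: two routing words (one per hop, assumed) followed by a payload word.
packet = [0x0003, 0x0001, 0xBEEF]
first_hop_word, after_first_hop = strip_routing_word(packet)
second_hop_word, after_second_hop = strip_routing_word(after_first_hop)
```

After each hop the next routing word is at the head of the packet, so every receiving unit only ever inspects the first word.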

【００４４】The HIU receives packets for local delivery directly from the CIUs and passes them on; performing the inverse function, it demultiplexes packets from the HE to the appropriate CIU. Within the management unit/HIU assembly there is a system clock register (not shown) that maintains the system time. By carefully minimizing skew, a single global clock signal can be distributed throughout the system. Where appropriate, global time can be used to timestamp packets as well as to synchronize HE operations.

【００４５】One of the most difficult problems in multicomputer design is the choice of interconnection strategy, which must be flexible enough to provide truly general-purpose support for parallel applications on machines based on large numbers of powerful processing elements.

【００４６】[0046]

[Effects of the Invention]
According to the present invention, a generalized hypercube topology with excellent connectivity, high bandwidth, and low latency can be provided. Importantly, the architecture scales to very large numbers of processors, making it possible to build MIMD machines with 10^3 to 10^5 processing elements that suit both fine-grained and coarse-grained programming techniques.
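The scaling of such a topology can be illustrated with a short sketch. It assumes that every dimension has the same number of elements `k`, that a node is identified by a tuple of D coordinates, and that a message is routed by correcting one coordinate per hop in dimension order; the function names and the routing order are assumptions, not taken from the specification.

```python
def node_count(k, d):
    """Number of processing elements with k elements along each of d dimensions."""
    return k ** d


def clusters_of(node):
    """One cluster per dimension: (dimension, coordinates shared by its members)."""
    return [(i, node[:i] + node[i + 1:]) for i in range(len(node))]


def hops(a, b):
    """Cluster links crossed between nodes a and b: one per differing coordinate."""
    return sum(x != y for x, y in zip(a, b))


def route(a, b):
    """Dimension-order route from a to b, listing the node reached after each hop."""
    path, current = [], list(a)
    for i, target in enumerate(b):
        if current[i] != target:
            current[i] = target
            path.append(tuple(current))
    return path


machine_size = node_count(10, 3)        # ten elements per cluster, three dimensions
memberships = clusters_of((1, 2))
distance = hops((0, 0), (3, 5))
path = route((0, 0), (3, 5))
```

With ten elements per cluster, three dimensions already give a thousand processing elements while any two of them stay at most three hops apart.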

[Brief description of drawings]

FIG. 1 is a block circuit diagram showing a two-dimensional parallel processor according to the present invention.

FIG. 2 is a block circuit diagram showing the configuration of a node of FIG. 1.

FIG. 3 is a block circuit diagram showing the internal configuration of a cluster of FIG. 1.

FIG. 4 is a block circuit diagram showing a modification of FIG. 3.

FIG. 5 is a block circuit diagram showing details of the host element of FIG. 1.

FIG. 6 is a block circuit diagram showing details of the CIU of FIG. 1.

[Explanation of symbols]

10 Processing element (PE)
11, 12 Cluster (subset)
13 Bus
20 Host element (HE)
21 Processing element
22 Memory element
23 Communication element
24 Network element (NE)
25, 26, 27, 28 Cluster interface element (CIU)
29 Host interface unit (HIU)
30 Network element management unit
54 Global memory management unit (GMMU)
55 Global memory access unit (GMAU)
61 Receiving unit
62 CIU control unit
63 Transmitting unit

Continued from the front page
(72) Inventor Louis Maclean Mackenzie, 18 Munten Street, Glasgow G4 9HX, United Kingdom
(72) Inventor Robert John Sutherland, 2 Lancaster Terrace, Glasgow G12 0UT, United Kingdom

Claims

1. A parallel processor comprising a plurality of processing elements (10) arranged in D dimensions and divided into a plurality of subsets, all the processing elements in a subset having one bus (13) for communication between them, wherein each processing element is a member of one subset in each dimension, characterized in that each processing element of a subset is connected to the bus of that subset by output means for sending messages to the other processing elements of the subset, and has separate input means corresponding to each other processing element of the subset, for receiving messages from that other processing element on the corresponding input means.

2. A processor according to claim 1, wherein the processing elements of a subset are arranged in a line, and a processing element located between the ends of the line comprises one output means for sending messages along the line to the other processing elements on one side and a separate output means for sending messages along the line to the other processing elements on the other side.

3. The processor according to claim 1, wherein each processing element further comprises communication means for generating a message packet and sending it to another element, said communication means generating an address indicating whether the source of the packet is in the same subset or in another subset.

4. Each processing element comprises means for receiving a message packet from another element in a cluster on one bus, means for recognizing whether the packet source is in the same cluster or in a different orthogonal cluster, means for writing the packet into memory within the element when the packet source is in the same cluster, and means for retransmitting the packet on a different orthogonal bus when the packet source is in a different cluster. Processor described in.

5. The processing element further comprises means for receiving a message packet, means for changing an address in the message packet, and means for retransmitting the message packet. The processor according to any one of 1.

6. A processor according to any one of claims 1 to 5, wherein the processing element further comprises means for generating a message packet and sending it to another element, each message packet comprising a unique address part and at least one of a unique command part and a unique data part.

7. A processor as claimed in any one of the preceding claims, in which a bus of a subset is electrically driven by an output device of only one element connected to the subset.

8. A processor as claimed in any one of the preceding claims, having between two and three dimensions.