JPH11250094A - Two-phase data cluster method and apparatus, and recording medium recording two-phase data cluster program

Info

Publication number: JPH11250094A
Application number: JP10052340A
Authority: JP
Inventors: Takeshi Maruyama (丸山 猛); Seiji Isobe (磯部 成二); Toshiko Shiobara (塩原 寿子); Tetsuya Iizuka (飯塚 哲也)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1998-03-04
Filing date: 1998-03-04
Publication date: 1999-09-17
Anticipated expiration: 2018-03-04
Also published as: JP3478967B2

Abstract

(57) [Summary]

[Problem] To provide a two-phase data cluster method and apparatus, and a recording medium recording a two-phase data cluster program, that determine the positions of the initial cluster nuclei used to aggregate the clumps of data in a data space into a specified number of clusters, aggregate the data in the data space optimally, and make the distribution tendency of a large amount of data easy to grasp.

[Solution] Initial cluster nuclei are set in the data space, randomly or by statistical inference, in a quantity K times the user-specified number. Aggregation processing is performed on the data with these initial cluster nuclei, and the positions of the resulting cluster nuclei are output. An evaluation function is then used to extract, from the resulting cluster nuclei, those that allow optimal aggregation; the extracted nuclei are set as initial cluster nuclei and aggregation processing is performed again. The first aggregation pass generates cluster nuclei close to the clumps of data, and the second pass starts from cluster nuclei extracted from among them.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a two-phase data cluster method and apparatus that generate clusters by aggregating, from a large number of data items, closely related data in a multidimensional data space into a desired number of clusters. More specifically, it relates to a two-phase data cluster method and apparatus, and a recording medium recording a two-phase data cluster program, that are effective for implementing statistical data aggregation, in which the distribution of all the data in a data space is summarized and discriminated by a specified number of representatives smaller than the number of data items, and that can also eliminate overlap between figures when information in a storage device is displayed two-dimensionally on a display device as a collection of figures.

[0002]

[Prior Art] One conventional data aggregation method sets the specified number of so-called initial cluster nuclei at random positions in the data space and then moves each cluster nucleus so as to minimize its distance to the data, thereby aggregating the data. Because randomly set initial cluster nuclei can land anywhere in the data space, this method may fail to aggregate each clump of data in the data space separately.
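The conventional procedure described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and uniform random seeding inside the bounding box of the data; the function names `assign`, `kmeans`, and `random_nuclei` are hypothetical and do not come from the source.

```python
import math
import random


def assign(points, centers):
    """Group each point with its nearest cluster nucleus (Euclidean distance)."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[nearest].append(p)
    return groups


def kmeans(points, centers, max_iter=100):
    """Plain K-means: alternate assignment and mean update until nothing moves."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = assign(points, centers)
        # Move each nucleus to the mean of its points; a nucleus that
        # attracted no points stays where it is.
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def random_nuclei(points, n, rng):
    """Draw n initial nuclei uniformly inside the bounding box of the data."""
    lows = [min(axis) for axis in zip(*points)]
    highs = [max(axis) for axis in zip(*points)]
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in range(n)]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    start = random_nuclei(data, 2, random.Random(0))
    nuclei, clusters = kmeans(data, start)
    print(nuclei, [len(c) for c in clusters])
```

Since the final nuclei depend entirely on where `random_nuclei` happens to place the starting points, two runs with different seeds can split the same data differently, which is the weakness the text describes.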

[0003] To resolve this, there are methods that set the specified number of initial cluster nuclei by existing statistical inference, such as the nearest centroid sorting method. Compared with random setting, this approach places the initial cluster nuclei in correspondence with the individual clumps of data. In what follows, an aggregation is called optimal when the clumps of data formed by the distance relationships among the data in the data space are separated one by one and each clump is aggregated on its own.

[0004] The result of applying the prior art is shown for a simple case. FIGS. 9 and 10 show the case of a two-dimensional data space. The following describes the case in which the data indicated by black circles "●" in the data space shown in FIG. 9 are aggregated with the specified number set to 4.

[0005] FIG. 10 shows the result of setting the initial cluster nuclei at random and applying the K-means method, an existing aggregation technique. In FIG. 10, black circles "●" are data, white triangles "△" are the randomly set initial cluster nuclei, black triangles "▲" are the cluster nuclei after aggregation processing, the circular areas outlined in white indicate the range of data aggregated by each cluster nucleus, and the arrows indicate the movement of the cluster nuclei through the aggregation processing. As noted above, because an initial cluster nucleus is not placed at each individual clump of data, the data cannot be aggregated optimally.

[0006] FIG. 11 shows the result of setting the initial cluster nuclei by statistical estimation and applying the K-means method, an existing aggregation technique. In FIG. 11, black circles "●" are data, white triangles "△" are the initial cluster nuclei set by statistical inference, black triangles "▲" are the cluster nuclei after aggregation processing, the circular areas outlined in white indicate the range of data aggregated by each cluster nucleus, and the arrows indicate the movement of the cluster nuclei through the aggregation processing.

[0007] With the method shown in FIG. 11, an initial cluster nucleus is set for each individual clump of data, and the aggregation result is closer to optimal than with the random setting of FIG. 10. However, looking at results C2 and C4, data 3, 19, and 2 would optimally be aggregated into C4, yet they are aggregated into C2, so in this region the method is not optimal.

[0008] Both of the problems above depend heavily on where the initial cluster points are placed. The C2 portion of FIG. 11, obtained by setting the initial cluster nuclei by statistical inference and then performing aggregation, arises because data 3, 19, 12, and 2 were inferred to be a single clump, so an initial cluster nucleus was set at a point close to those data. As a result, initial cluster nucleus C2 moved so as to aggregate data 8 and 10 together with data 3, 19, and 2, producing the result shown in FIG. 10.

[0009]

[Problem to Be Solved by the Invention] As described above, when the initial cluster nuclei are set in the data space at random or by statistical inference alone, the clumps of data may be computed differently from the clumps actually present in the region, and the aggregation result may be output as clumps that differ from those that exist in the data space.

[0010] Furthermore, when an apparatus implementing the conventional aggregation method described above is used as a preprocessor for an apparatus that displays information in a storage device as a collection of figures, the same problem prevents it from producing a display result that eliminates the overlap among the displayed figures while preserving the tendency of the original figures' positions.

[0011] The present invention has been made in view of the above. Its object is to provide a two-phase data cluster method and apparatus, and a recording medium recording a two-phase data cluster program, that determine the positions of the initial cluster nuclei so as to aggregate the clumps of data in a data space into a specified number of clusters, aggregate the data in the data space optimally, and make the distribution tendency of a large amount of data easy to grasp.

[0012]

[Means for Solving the Problem] To achieve the above object, the invention according to claim 1 is a two-phase data cluster method that generates clusters by aggregating, from a large number of data items, closely related data in a multidimensional data space into a desired number of clusters. The method specifies the number of clusters to aggregate into; a weighting parameter α, in the range 0 to 1, that determines whether the aggregation is weighted toward the distance between cluster nuclei or toward the spread of the data within a cluster; and an initial nucleus count parameter K for deriving the number of initial cluster nuclei from the number of clusters. Subject to the resulting number of initial cluster nuclei, the initial cluster nuclei that serve as the starting values of the aggregation are set, randomly or by statistical estimation, within the minimum-to-maximum range of each dimension. A clustering technique such as the K-means method is applied as the aggregation processing: the distance between each cluster nucleus and each datum is computed over every dimension, and data that are close overall are aggregated into the same cluster. As the index for extracting cluster nuclei, the following evaluation function Ci is used, in which bringing α toward 1 raises the weight on the inter-nucleus distance and bringing α toward 0 raises the weight on the quantity related to the dispersion of the data within each cluster:

[Equation 4] Ci = α × (quantity related to the distance matrix between cluster nuclei) + (1 − α) × (quantity related to the distance between cluster nucleus i and each datum aggregated into it)

The cluster nuclei are sorted in ascending order of this value, and only as many as the specified number of clusters are extracted from the top as the initial cluster nuclei for re-aggregation. Aggregation processing is then performed with the extracted nuclei as initial cluster nuclei, generating the number of clusters specified by the user.
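The scoring and extraction step of claim 1 can be sketched as follows. The source names the two quantities in Ci without giving formulas, so the concrete forms here are assumptions: the first term is taken as the mean of (largest inter-nucleus distance minus each inter-nucleus distance), and the second as the mean distance from a nucleus to its members. The names `evaluation_scores` and `select_priority_nuclei` are hypothetical.

```python
import math


def evaluation_scores(centers, groups, alpha):
    """Score each candidate nucleus; a lower score is better.

    Assumed concrete form (not given in the source):
      separation_i = mean over j != i of (d_max - d(c_i, c_j))
      spread_i     = mean distance from the members of cluster i to c_i
      C_i          = alpha * separation_i + (1 - alpha) * spread_i
    """
    n = len(centers)
    dist = [[math.dist(a, b) for b in centers] for a in centers]
    d_max = max(max(row) for row in dist)
    scores = []
    for i in range(n):
        members = groups[i]
        if not members:
            # A nucleus that captured no data is ranked last.
            scores.append(math.inf)
            continue
        separation = sum(d_max - dist[i][j] for j in range(n) if j != i) / max(n - 1, 1)
        spread = sum(math.dist(p, centers[i]) for p in members) / len(members)
        scores.append(alpha * separation + (1 - alpha) * spread)
    return scores


def select_priority_nuclei(centers, groups, alpha, n_clusters):
    """Sort nuclei by ascending score and keep the first n_clusters."""
    scores = evaluation_scores(centers, groups, alpha)
    order = sorted(range(len(centers)), key=scores.__getitem__)
    return [centers[i] for i in order[:n_clusters]]


if __name__ == "__main__":
    centers = [(0, 0), (1, 0), (10, 0)]
    groups = [[(0, 1), (0, -1)], [(1, 0.1), (1, -0.1)], [(10, 2), (10, -2)]]
    print(select_priority_nuclei(centers, groups, alpha=1.0, n_clusters=2))
    print(select_priority_nuclei(centers, groups, alpha=0.0, n_clusters=2))
```

With α = 1 the isolated nucleus at (10, 0) is chosen first; with α = 0 the tight cluster around (1, 0) is chosen first, matching the roles the claim assigns to the two ends of the α range.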

[0013] The invention according to claim 2 is a two-phase data cluster method that generates clusters by aggregating, from a large number of data items, closely related data in a multidimensional data space into a desired number of clusters. The method specifies the number of clusters to aggregate into, the initial cluster nucleus count parameter, and the data to be aggregated; detects, for the specified number of clusters, initial cluster nucleus count parameter, and data, the number of dimensions of the data and the maximum and minimum values of each dimension; sets the initial cluster nuclei either randomly within that range or from the data to be aggregated using a statistical technique; performs aggregation processing with the set initial cluster nuclei and the specified data, generating the positions of the resulting cluster nuclei in the multidimensional data; extracts from the resulting cluster nuclei, using the weighting parameter α of the evaluation function, the specified number of cluster nuclei according to the aggregation behavior the user wants; performs aggregation processing based on the preferentially extracted cluster nuclei and the specified original data, outputting the positions of the final cluster nuclei in the data space; converts the final cluster nucleus data into figure information so that it can be displayed as a collection of figures; and outputs the generated figure display information to a display device.

[0014] The invention according to claim 3 is a two-phase data cluster apparatus that generates clusters by aggregating, from a large number of data items, closely related data in a multidimensional data space into a desired number of clusters. The apparatus comprises: aggregation parameter specifying means for specifying the number of clusters to aggregate into, a weighting parameter α, in the range 0 to 1, that determines whether the aggregation is weighted toward the distance between cluster nuclei or toward the spread of the data within a cluster, and an initial nucleus count parameter K for deriving the number of initial cluster nuclei from the number of clusters; initial cluster nucleus setting means for setting, subject to the number of initial cluster nuclei specified by the aggregation parameter specifying means, the initial cluster nuclei that serve as the starting values of the aggregation, randomly or by statistical estimation, within the minimum-to-maximum range of each dimension; aggregation processing means for applying a clustering technique such as the K-means method as the aggregation processing, computing the distance between each cluster nucleus and each datum over every dimension, and aggregating data that are close overall into the same cluster; priority cluster nucleus extracting means for sorting the cluster nuclei in ascending order of the following evaluation function Ci, in which bringing α toward 1 raises the weight on the inter-nucleus distance and bringing α toward 0 raises the weight on the quantity related to the dispersion of the data within each cluster:

[Equation 5] Ci = α × (quantity related to the distance matrix between cluster nuclei) + (1 − α) × (quantity related to the distance between cluster nucleus i and each datum aggregated into it)

and extracting from the top only as many as the specified number of clusters as the initial cluster nuclei for re-aggregation; and aggregation re-execution means for performing aggregation processing with the cluster nuclei extracted by the priority cluster nucleus extracting means as initial cluster nuclei, generating the number of clusters specified by the user.

[0015] The invention according to claim 4 is a two-phase data cluster apparatus that generates clusters by aggregating, from a large number of data items, closely related data in a multidimensional data space into a desired number of clusters. The apparatus comprises: aggregation parameter specifying means for specifying the number of clusters to aggregate into, the initial cluster nucleus count parameter, and the data to be aggregated; initial cluster nucleus setting means for detecting, for the specified number of clusters, initial cluster nucleus count parameter, and data, the number of dimensions of the data and the maximum and minimum values of each dimension, and setting the initial cluster nuclei either randomly within that range or from the data to be aggregated using a statistical technique; aggregation processing means for performing aggregation processing with the set initial cluster nuclei and the specified data, generating the positions of the resulting cluster nuclei in the multidimensional data; priority cluster nucleus extracting means for extracting from the resulting cluster nuclei, using the weighting parameter α of the evaluation function, the specified number of cluster nuclei according to the aggregation behavior the user wants; aggregation re-execution means for performing aggregation processing based on the preferentially extracted cluster nuclei and the specified original data, outputting the positions of the final cluster nuclei in the data space; display information generating means for converting the final cluster nucleus data into figure information so that it can be displayed as a collection of figures; and a figure information display device that outputs the generated figure display information to a display device.

[0016] According to the invention of claims 1 to 4, when the clumps of data, formed by data items that are close to one another in a data space containing a large amount of data, are aggregated into a specified number without losing the tendency of the data distribution, initial cluster nuclei are set in the data space, randomly or by statistical inference, in a quantity K times the user-specified number. Aggregation processing including the K-means method is performed on the data with these initial cluster nuclei, and the positions of the resulting cluster nuclei are output. The evaluation function Ci is then used to extract, from the resulting cluster nuclei, those that allow optimal aggregation; these are set as initial cluster nuclei and aggregation is performed again. Because the first aggregation pass generates cluster nuclei close to the clumps of data and the second pass starts from nuclei extracted from among them, the clumps of data in the data space can be separated one by one and aggregated.
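The two passes described above can be put together in a compact sketch. It assumes random seeding, Euclidean K-means for both passes, and mean-based forms for the two terms of Ci; none of these specifics are fixed by the source, and all function names (`assign`, `kmeans`, `score`, `two_phase_cluster`) are hypothetical.

```python
import math
import random


def assign(points, nuclei):
    """Group each point with its nearest nucleus."""
    groups = [[] for _ in nuclei]
    for p in points:
        groups[min(range(len(nuclei)), key=lambda i: math.dist(p, nuclei[i]))].append(p)
    return groups


def kmeans(points, nuclei, max_iter=100):
    """K-means from the given starting nuclei; empty nuclei stay in place."""
    for _ in range(max_iter):
        groups = assign(points, nuclei)
        moved = [tuple(sum(ax) / len(g) for ax in zip(*g)) if g else c
                 for g, c in zip(groups, nuclei)]
        if moved == nuclei:
            break
        nuclei = moved
    return nuclei, assign(points, nuclei)


def score(i, nuclei, groups, alpha, d_max):
    """Assumed form of C_i: alpha * separation + (1 - alpha) * spread."""
    if not groups[i]:
        return math.inf  # a nucleus with no data is never preferred
    others = [c for j, c in enumerate(nuclei) if j != i]
    separation = sum(d_max - math.dist(nuclei[i], c) for c in others) / max(len(others), 1)
    spread = sum(math.dist(p, nuclei[i]) for p in groups[i]) / len(groups[i])
    return alpha * separation + (1 - alpha) * spread


def two_phase_cluster(points, n_clusters, k=3, alpha=0.5, seed=0):
    rng = random.Random(seed)
    lows = [min(ax) for ax in zip(*points)]
    highs = [max(ax) for ax in zip(*points)]
    # Phase 1: seed K * n_clusters nuclei inside the data's bounding box
    # and aggregate once to pull them toward the clumps of data.
    seeds = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
             for _ in range(k * n_clusters)]
    candidates, groups = kmeans(points, seeds)
    # Extraction: keep the n_clusters candidates with the lowest score.
    d_max = max(math.dist(a, b) for a in candidates for b in candidates)
    order = sorted(range(len(candidates)),
                   key=lambda i: score(i, candidates, groups, alpha, d_max))
    chosen = [candidates[i] for i in order[:n_clusters]]
    # Phase 2: aggregate again, starting from the extracted nuclei.
    return kmeans(points, chosen)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    nuclei, clusters = two_phase_cluster(data, n_clusters=2)
    print(nuclei, clusters)
```

The first pass deliberately over-seeds the space so that at least one candidate is likely to settle on each clump; the second pass then only has to refine nuclei that already sit near the data.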

[0017] The operation of the evaluation function is as follows. The first term represents a quantity related to the distance between cluster nuclei. Because it takes the difference between the maximum inter-nucleus distance and each inter-nucleus distance, a smaller value means that the cluster nucleus is farther from the other cluster nuclei. The second term represents a quantity related to the distance between a cluster nucleus and the data aggregated into it. Because it is based on the distance between each aggregated datum and the cluster nucleus, a cluster nucleus with a small value can be regarded as belonging to a cluster of densely packed data. Consequently, a cluster nucleus with a small evaluation function value satisfies the optimal condition of having dense data and being far from the other cluster nuclei.
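One concrete form consistent with this description is given below. It is offered as an assumption, since the source names the two quantities without stating formulas. Here c_i is cluster nucleus i, S_i is the set of data aggregated into it, M is the number of candidate nuclei, and d is the Euclidean distance; the averaging in each term is an illustrative choice.

```latex
\[
C_i \;=\; \alpha \cdot \frac{1}{M-1}\sum_{j \neq i}\bigl(D_{\max} - d(c_i, c_j)\bigr)
\;+\; (1-\alpha)\cdot \frac{1}{|S_i|}\sum_{x \in S_i} d(x, c_i),
\qquad
D_{\max} = \max_{p,q}\, d(c_p, c_q).
\]
```

Both terms are non-negative, and each shrinks in the direction the text describes: the first as nucleus i moves away from the other nuclei, the second as its data tighten around it.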

[0018] The weighting parameter α can be set to any value between 0 and 1. When α is close to 0, the second term dominates the evaluation function; when α is close to 1, the first term dominates. In other words, setting α close to 0 extracts, from among the candidate priority cluster nuclei, those whose data are densely packed. Likewise, setting α close to 1 extracts those that are far from the other cluster nuclei. By choosing α freely, aggregation can be adapted to data with various distributions. Thus, to separate and aggregate one by one the densely packed clumps of data that could not previously be detected, α is set appropriately and aggregation is performed with the cluster nuclei whose sum of the first and second terms is small as candidates. Because the specified number of cluster nuclei with small evaluation function values are extracted preferentially, the clumps of data in the data space can be separated one by one and aggregated.
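A small numeric illustration of how α changes the ranking follows. The two candidates and their term values are invented for the example, and `rank` is a hypothetical helper; only the weighted-sum form comes from the source.

```python
def rank(candidates, alpha):
    """Order candidate names by C = alpha * first_term + (1 - alpha) * second_term."""
    return sorted(
        candidates,
        key=lambda name: alpha * candidates[name][0] + (1 - alpha) * candidates[name][1],
    )


# (first term: closeness to other nuclei, second term: spread of own data)
# Hypothetical values: one nucleus is tight but crowded, the other isolated but loose.
candidates = {
    "dense_but_crowded": (5.0, 0.1),
    "isolated_but_loose": (0.5, 2.0),
}

print(rank(candidates, 0.1))  # ['dense_but_crowded', 'isolated_but_loose']
print(rank(candidates, 0.9))  # ['isolated_but_loose', 'dense_but_crowded']
```

At α = 0.1 the scores are 0.59 and 1.85, so the dense candidate wins; at α = 0.9 they are 4.51 and 0.65, so the isolated one wins.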

[0019] The invention according to claim 5 is a recording medium recording a two-phase data cluster program that generates clusters by aggregating, from a large number of data items, closely related data in a multidimensional data space into a desired number of clusters. The program specifies the number of clusters to aggregate into; a weighting parameter α, in the range 0 to 1, that determines whether the aggregation is weighted toward the distance between cluster nuclei or toward the spread of the data within a cluster; and an initial nucleus count parameter K for deriving the number of initial cluster nuclei from the number of clusters. Subject to the resulting number of initial cluster nuclei, the initial cluster nuclei that serve as the starting values of the aggregation are set, randomly or by statistical estimation, within the minimum-to-maximum range of each dimension. A clustering technique such as the K-means method is applied as the aggregation processing: the distance between each cluster nucleus and each datum is computed over every dimension, and data that are close overall are aggregated into the same cluster. As the index for extracting cluster nuclei, the following evaluation function Ci is used, in which bringing α toward 1 raises the weight on the inter-nucleus distance and bringing α toward 0 raises the weight on the quantity related to the dispersion of the data within each cluster:

[Equation 6] Ci = α × (quantity related to the distance matrix between cluster nuclei) + (1 − α) × (quantity related to the distance between cluster nucleus i and each datum aggregated into it)

The cluster nuclei are sorted in ascending order of this value, and only as many as the specified number of clusters are extracted from the top as the initial cluster nuclei for re-aggregation. Aggregation processing is then performed with the extracted nuclei as initial cluster nuclei, generating the number of clusters specified by the user. The two-phase data cluster program that does this is recorded on the recording medium.

[0020] According to the invention of claim 5, the recording medium records a two-phase data cluster program that operates as follows. When the clumps of data, formed by data items that are close to one another in a data space containing a large amount of data, are aggregated into a specified number without losing the tendency of the data distribution, initial cluster nuclei are set in the data space, randomly or by statistical inference, in a quantity K times the user-specified number. Aggregation processing is performed on the data with these initial cluster nuclei, and the positions of the resulting cluster nuclei are output. The evaluation function Ci is used to extract, from the resulting cluster nuclei, those that allow optimal aggregation; these are set as initial cluster nuclei and aggregation is performed again, so that the first pass generates cluster nuclei close to the clumps of data and the second pass starts from nuclei extracted from among them. Because this program is recorded on a recording medium, it can be distributed more readily by means of that medium.

[0021]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a two-phase data cluster apparatus according to one embodiment of the present invention.

[0022] The two-phase data cluster apparatus of this embodiment, shown in FIG. 1, comprises: an information storage device 401 that stores the data to be aggregated; an aggregation condition input device 402 through which the user enters the aggregation conditions (data to be aggregated, number of clusters, weighting parameter α) from an external input device 406, using the GUI (Graphical User Interface) component 402-2 on the setting condition setting processing display device 402-1; an aggregated data generation device 403 that performs aggregation processing based on the aggregation conditions entered through the aggregation condition input device 402; a display information generation device 404 that generates, converts, and composes the display information for showing the aggregated data as a collection of figures in the layout the user intends under those aggregation conditions; and a figure information conversion display device 405 that displays the display information generated, converted, and composed by the display information generation device 404 on a display device 405-1.

[0023] The aggregated data generation device 403 consists of: a first-phase initial point setting device 403-1 that sets, randomly or by statistical estimation, more initial cluster nuclei than the number of aggregations specified through the aggregation condition input device 402, in the data space containing the data to be aggregated; a first-phase aggregation processing device 403-2 that performs aggregation processing on the initial cluster nuclei set by the first-phase initial point setting device 403-1 and the data to be aggregated, and generates the cluster nuclei after processing; a second-phase initial point extraction device 403-3 that extracts the specified number of initial cluster nuclei, based on the evaluation function Ci, from the second-phase initial cluster nucleus candidates output as the first-phase aggregation result; and a second-phase aggregation processing device 403-4 that performs aggregation processing on the initial cluster nuclei extracted by the second-phase initial point extraction device 403-3 and the data in the information storage device 401 specified through the aggregation condition input device. In addition, the aggregation condition input device 402 consists of the GUI component 402-2, the setting condition setting processing display device 402-1, a GUI unit 402-3, and a setting screen control device 402-4.

[0024] The aggregation condition input device 402, which sets the information required for aggregation processing, constitutes the aggregation parameter designating means. Within the aggregated data generation device 403, the first-phase initial point setting device 403-1 constitutes the initial cluster nucleus setting means, the first-phase aggregation processing device 403-2 constitutes the aggregation processing means, the second-phase initial point extraction device 403-3 constitutes the priority cluster nucleus extraction means, and the second-phase aggregation processing device 403-4 constitutes the aggregation processing re-execution means. The display information generation device 404 constitutes the display information generation means, and the graphic information display device 405 constitutes the graphic information display means.

[0025] The aggregation parameter designating means constituting the aggregation condition input device 402 specifies three things: the number of clusters to aggregate into; a weighting parameter α, in the range 0 to 1, that determines whether the aggregation is weighted toward the distance between cluster nuclei or toward the spread of the data within a cluster; and an initial nucleus cluster count determination parameter K, which determines the number of initial nucleus clusters from the number of aggregated clusters. The initial cluster nucleus setting means constituting the first-phase initial point setting device 403-1 sets the initial cluster nuclei needed for the first aggregation pass in the multidimensional data space, randomly or by statistical estimation, within the minimum-to-maximum range of each dimension. The number of nuclei set is the product of the number of aggregations specified by the aggregation parameter designating means and the parameter K.

[0026] The aggregation processing means constituting the first-phase aggregation processing device 403-2 performs the first aggregation pass using the distance between each data point and the cluster nuclei set by the initial cluster nucleus setting means. The priority cluster nucleus extraction means constituting the second-phase initial point extraction device 403-3 takes the cluster nuclei that the aggregation processing has moved from their initial positions and extracts the specified number of them, in ascending order of the value of the evaluation function Ci given by

[Equation 7]

where N' = K × (specified number of aggregations); r_ik is the distance between a cluster nucleus i (i = 1, ..., N') moved by the aggregation processing and another cluster nucleus j (j = 1, ..., N'); r_Max is the maximum distance between a cluster nucleus moved by the aggregation processing and the other cluster nuclei; distance here means Euclidean distance; M_i is the number of data points aggregated into cluster nucleus i (i = 1, ..., N'); r'_ik is the distance between cluster nucleus i (i = 1, ..., N') and each of the M_i data points aggregated into it; and α is the weighting parameter, a real number between 0 and 1.

By choosing the weighting parameter α in the evaluation function Ci freely between 0 and 1, the user decides how much the first term and the second term each contribute to the value of Ci. When α is close to 0, the second term carries more weight, and cluster nuclei whose clusters hold densely packed data are extracted. When α is close to 1, the first term carries more weight, and cluster nuclei that are far from the other clusters are extracted. The way cluster nuclei are extracted can be changed in this manner.

[0027] The aggregation processing re-execution means constituting the second-phase aggregation processing device 403-4 takes the cluster nuclei extracted by the priority cluster nucleus extraction means as the initial cluster nuclei and performs aggregation processing again using the distance to each data point.
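A single aggregation pass of the kind used in both phases could be sketched as below. This is only an illustration that assumes a plain K-means style nearest-centroid update; the function name `aggregate`, its arguments, and the convergence test are hypothetical and not taken from the patent, which states only that a clustering method including the K-means method is applied.

```python
import math

def aggregate(data, initial_nuclei, max_iter=100):
    """Nearest-centroid (K-means style) pass starting from the given nuclei.

    Returns the final nuclei and, for each nucleus, the data points assigned to it.
    """
    nuclei = [tuple(c) for c in initial_nuclei]
    members = [[] for _ in nuclei]
    for _ in range(max_iter):
        # Assign every data point to its nearest nucleus (Euclidean distance).
        members = [[] for _ in nuclei]
        for p in data:
            nearest = min(range(len(nuclei)), key=lambda i: math.dist(p, nuclei[i]))
            members[nearest].append(p)
        # Move each nucleus to the mean of its members; an empty nucleus stays put.
        moved = [
            tuple(sum(c) / len(m) for c in zip(*m)) if m else nuclei[i]
            for i, m in enumerate(members)
        ]
        if moved == nuclei:  # no nucleus moved: converged
            break
        nuclei = moved
    return nuclei, members

data = [(0, 0), (0, 1), (10, 0), (10, 1)]
final_nuclei, final_members = aggregate(data, [(1, 0), (9, 0)])
```

In the second phase the same routine would simply be called again with the extracted nuclei as `initial_nuclei`.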

[0028] In the two-phase data cluster device configured in this way, the user enters the data in the information storage device 401, the number of aggregations, and the weighting parameter α from the external input device 406 using the GUI component 402-2. These conditions are passed through the GUI unit 402-3 to the setting screen control device 402-4 as parameters with appropriate numerical values. Based on this numerical information, the first-phase initial point setting device 403-1 sets more initial cluster nuclei than the number of aggregations, and the first-phase aggregation processing device 403-2 performs the aggregation calculation using the data in the information storage device 401 and these initial cluster nuclei.

[0029] The cluster nuclei calculated by the first-phase aggregation processing device 403-2 are fed into the second-phase initial point extraction device 403-3, which determines the initial points to extract based on the evaluation function Ci. The second-phase aggregation processing device 403-4 then performs aggregation processing using these second-phase initial cluster nuclei and the data to be aggregated entered through the aggregation condition input device 402, and passes the resulting cluster nuclei to the display information generation device 404.

[0030] Based on the result passed to it, the display information generation device 404 generates, converts, and synthesizes the display information needed to display the information as a collection of figures in the layout the user intends. The generated, converted, and synthesized display information is sent to the graphic information display device 405 and displayed on the display device 405-1.

[0031] As described above, this embodiment realizes aggregation processing that splits the processing into two phases and extracts the initial cluster nuclei for the second-phase aggregation from the cluster nuclei calculated by the first-phase aggregation.

[0032] Next, the operation of the two-phase data cluster device of this embodiment, configured as described above, is explained with reference to the flowcharts shown in FIGS. 6 to 8.

[0033] In FIG. 6, the aggregation parameter designating means, consisting of the aggregation condition input device 402, first specifies the data to be aggregated, the number of aggregations N, the weighting parameter α used as an extraction parameter, the initial nucleus cluster count determination parameter K, and so on (steps S11, S12, S13). The initial cluster nucleus designating means, consisting of the first-phase initial point setting device 403-1, then generates the first-phase initial points, which are initial cluster nuclei greater in number than the number of aggregations (steps S15, S16). Using these initial cluster nuclei, the aggregation processing means, consisting of the first-phase aggregation processing device 403-2, performs the first-phase aggregation processing and generates the first-phase cluster nuclei (steps S19, S21).

[0034] Next, the priority cluster nucleus extraction means, consisting of the second-phase initial point extraction device 403-3, extracts the specified number of initial cluster nuclei, based on the evaluation function described above, from the second-phase cluster nucleus candidates output as the result of the first-phase aggregation processing (steps S23, S25). The aggregation processing re-execution means, consisting of the second-phase aggregation processing device 403-4, then performs the second-phase aggregation processing and generates the second-phase cluster nuclei, which are the final aggregated cluster nuclei (steps S27, S29).

[0035] Next, the first-phase initial point generation performed by the first-phase initial point setting device 403-1 is described with reference to the flowchart in FIG. 7. First, N' = (initial nucleus cluster count determination parameter K) × (specified number of aggregations N) is calculated, the maximum and minimum values of each dimension are obtained, and a counter I is set to 1 (steps S41, S42, S43). A processing point is then set (step S45), a random number between the minimum and maximum is generated for each dimension (step S47), and the generated random numbers are combined into one point (step S51). The counter I is then incremented by 1, and the initial point setting is repeated until I reaches N' (step S53).
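The random variant of the FIG. 7 procedure can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `first_phase_initial_points`, the `seed` argument, and the use of a uniform distribution are assumptions, since the text says only that a random number between the minimum and maximum is generated for each dimension.

```python
import random

def first_phase_initial_points(data, n_clusters, k, seed=None):
    """Return K * N random points inside the per-dimension bounding box of data."""
    rng = random.Random(seed)
    n_prime = k * n_clusters  # N' = K * N
    dims = range(len(data[0]))
    lows = [min(p[d] for p in data) for d in dims]   # minimum of each dimension
    highs = [max(p[d] for p in data) for d in dims]  # maximum of each dimension
    points = []
    for _ in range(n_prime):  # repeat until N' points have been generated
        # One random coordinate per dimension (S47), combined into a point (S51).
        # Uniform sampling is an assumption made for this sketch.
        points.append(tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)))
    return points

data = [(0.0, 0.0), (1.0, 2.0), (4.0, 1.0), (3.0, 5.0)]
nuclei = first_phase_initial_points(data, n_clusters=2, k=2, seed=0)
```

With N = 2 and K = 2 the call yields four candidate nuclei, all inside the bounding box of the data.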

[0036] Next, the second-phase initial point extraction performed by the second-phase initial point extraction device 403-3 is described with reference to the flowchart in FIG. 8. First, each cluster nucleus is numbered (step S61). On one branch, the distance r_ij between each pair of cluster nuclei is calculated (step S63). On the other branch, the intra-cluster distances are calculated (step S65): the distance r'_ij between cluster nucleus i and each original data point aggregated into it is calculated (step S67), and this intra-cluster distance calculation is repeated until the parameter i reaches N' (step S69).

[0037] The evaluation function Ci is then calculated (step S71), and the cluster nuclei are set as second-phase initial points in ascending order of the value of Ci (step S73).
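The extraction flow of FIG. 8 might be sketched in Python as follows. Equation 7 is not reproduced in this text, so the two terms below are stand-ins chosen only to match the described behaviour (a lower Ci is better, α near 1 favours nuclei far from the others, α near 0 favours tightly packed clusters). The function name, its arguments, and both term formulas are assumptions and should not be read as the patent's definition.

```python
import math

def select_second_phase_nuclei(nuclei, members, n_clusters, alpha):
    """nuclei: list of points; members[i]: data points aggregated into nucleus i.

    Requires at least two distinct nuclei.
    """
    n = len(nuclei)
    # Distance between every pair of nuclei (step S63).
    r = [[math.dist(nuclei[i], nuclei[j]) for j in range(n)] for i in range(n)]
    r_max = max(max(row) for row in r)
    scores = []
    for i in range(n):
        m = members[i]
        if not m:  # a nucleus with no data is never preferred
            scores.append(float("inf"))
            continue
        # ASSUMED first term: small when nucleus i is far from the other nuclei.
        separation = 1.0 - sum(r[i][j] for j in range(n) if j != i) / ((n - 1) * r_max)
        # ASSUMED second term: mean distance from nucleus i to its own data (S65-S69).
        spread = sum(math.dist(nuclei[i], p) for p in m) / len(m)
        scores.append(alpha * separation + (1.0 - alpha) * spread)  # Ci (S71)
    # Keep the n_clusters nuclei with the lowest Ci (S73).
    order = sorted(range(n), key=lambda i: scores[i])
    return [nuclei[i] for i in order[:n_clusters]], scores

nuclei = [(0, 0), (0.5, 0), (10, 0)]
members = [[(0, 0.1), (0, -0.1)], [(0.5, 1), (0.5, -1)], [(10, 0.2), (10, -0.2)]]
selected, scores = select_second_phase_nuclei(nuclei, members, n_clusters=2, alpha=0.5)
```

In this toy input the middle nucleus sits close to the first one and has the widest spread, so it receives the highest score and is the one dropped.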

[0038] Next, the effects of this embodiment are described with reference to FIGS. 2 to 4. For simplicity, the example applies the method to the data in the two-dimensional data space described in the prior art section. First, the data in the target data space shown in FIG. 2 are aggregated into four clusters, as in the case described earlier. The parameter K used in this example is 2; that is, the number of initial cluster nuclei in the first aggregation pass is doubled to eight.

[0039] FIG. 3 shows the state after the first aggregation pass. The initial cluster nuclei used in the first aggregation pass are set with the existing nearest centroid sorting method. In FIG. 3 there are eight (= 4 × 2) cluster nuclei in the data space. The relationship between these cluster nuclei and the data is calculated using the evaluation function described above, and the initial cluster nuclei for the second aggregation pass are extracted. This calculation is shown in Tables 1, 2, 3, and 4 below.

[0040]

[Table 1]

Table 1 is a matrix of the distances between cluster nuclei needed to calculate the first term of the evaluation function Ci. For example, the distance between cluster nucleus 1 and cluster nucleus 2 is calculated as 2.132.

[0041]

[Table 2]

Table 2 shows, for the second term of the evaluation function Ci, which cluster nucleus each data point belongs to and the distance between them. For example, data point 1 is aggregated into cluster nucleus 7, and its distance from that nucleus is calculated as 0.249 (and likewise for the others). Table 3 shows the evaluation values obtained by calculating each term of the evaluation function Ci from these two sets of values while increasing the weighting parameter α from 0 to 1 in steps of 0.1.
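The sweep over α described above can be illustrated with a short Python sketch. It assumes the two terms of Ci have already been computed for each nucleus and uses only the weighted-sum form given in the claims, Ci = α × (first term) + (1 − α) × (second term). The function name and the term values are hypothetical and are not taken from Tables 1 to 3.

```python
def sweep_alpha(term1, term2, n_select):
    """term1[i], term2[i]: the two precomputed terms of Ci for nucleus i."""
    result = {}
    for step in range(11):  # alpha = 0.0, 0.1, ..., 1.0
        alpha = step / 10
        ci = [alpha * a + (1 - alpha) * b for a, b in zip(term1, term2)]
        ranked = sorted(range(len(ci)), key=lambda i: ci[i])
        # Record the 1-based ids of the n_select nuclei with the lowest Ci.
        result[alpha] = sorted(i + 1 for i in ranked[:n_select])
    return result

# Hypothetical term values for four nuclei (illustration only).
term1 = [0.1, 0.9, 0.5, 0.2]
term2 = [0.2, 0.1, 0.9, 0.8]
picks = sweep_alpha(term1, term2, n_select=2)
```

With these made-up values, nucleus 1 is selected at every α, while its partner changes from nucleus 2 at α = 0 to nucleus 4 at α = 1, which is the kind of α-dependent selection the tables are meant to show.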

[0042]

[Table 3]

The cluster nuclei with small evaluation values differ for each value of the weighting parameter. Table 4 therefore shows the priority cluster nuclei that are extracted, that is, the cluster nuclei with small evaluation values.

[Table 4]

Following the results in Table 4 and taking α = 0.5 as an example, FIG. 4 shows the result of extracting the cluster nuclei, setting them as initial cluster nuclei, and performing aggregation again. As this result shows, C4 and C2, which the existing technique could not handle, are now split and aggregated optimally. Table 5 demonstrates this optimality with concrete figures.

[0043]

[Table 5]

Table 5 compares two cases: applying the K-means method to the above data as the existing technique, and applying this method. For each case it shows the average distance between cluster nuclei and the average distance between the data aggregated into each cluster and its cluster nucleus. For the expression given as an example, the former value indicates whether the clusters are separated one by one, and smaller is better. The latter value indicates how densely the data aggregated into each cluster are packed, and smaller is better. As the figures show, this method produces a more suitable result than the existing technique.
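The two summary quantities reported in Table 5 could be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, the inter-nucleus average is taken over all unordered pairs, and the data-to-nucleus average is pooled over all data points, whereas the patent does not state how the averages are formed.

```python
import math
from itertools import combinations

def mean_nucleus_distance(nuclei):
    """Average Euclidean distance over all unordered pairs of cluster nuclei."""
    pairs = list(combinations(nuclei, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_member_distance(nuclei, members):
    """Average distance between each data point and the nucleus it belongs to."""
    dists = [math.dist(c, p) for c, m in zip(nuclei, members) for p in m]
    return sum(dists) / len(dists)

nuclei = [(0, 0), (3, 4)]
members = [[(0, 1), (0, -1)], [(3, 5)]]
between = mean_nucleus_distance(nuclei)
within = mean_member_distance(nuclei, members)
```

Running both functions on the result of the existing technique and on the two-phase result gives the pair of figures that the table compares.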

[0044] The above is the aggregation result for one particular data set, but with this method the cluster nuclei that are extracted can be changed, as seen in Table 4, by changing the value of the weighting parameter α. In the example above, which of the cluster nuclei C2, C4, C5, and C6 output in FIG. 3 is extracted differs between α = 0.1 and α = 0.9. As with existing methods, the clusters into which such data are aggregated change with the policy used to form clusters. By contrast, the cluster nuclei C1 and C8, which are extracted whether α is 0.1 or 0.9, should always be extracted independently, as the distances between the data also indicate, and this method achieves that. For the remaining data, which can be regarded as unstable in the sense that it is unclear which cluster they will fall into, the cluster they are placed in is controlled by varying the weighting parameter α freely between 0 and 1. This cannot be achieved with the conventional approach of performing a single aggregation pass.

[0045] FIG. 5 shows an example of the results of the two-phase data cluster device of this embodiment. FIG. 5(a) is a conceptual image of the result obtained without the two-phase data cluster device, and FIG. 5(b) is a conceptual image of the result when the device is applied to the data that produced FIG. 5(a). When the device is applied, the data in the data space corresponding to each figure are aggregated, and the position of each cluster nucleus output as an aggregation result is converted into a figure and output. For example, the set of small triangles near the center of FIG. 5(a) and the set of larger triangles are each aggregated. Looking at this part alone, the magnitude of the data corresponding to triangle size is examined, and the data corresponding to large figures and the data corresponding to small figures are aggregated separately. This shows that aggregation can be performed while preserving the trends in the data, which existing methods cannot do.

[0046]

[Effects of the invention]

As described above, the present invention addresses the case where clumps of data, formed by data points that lie close together in a data space holding a large amount of data, are aggregated into a specified number of clusters without distorting the distribution of the data. Initial cluster nuclei are set in the data space, randomly or by statistical estimation, in a quantity K times the number specified by the user; aggregation processing is performed with the data and these initial cluster nuclei; the positions of the resulting cluster nuclei are output; the cluster nuclei that allow the best aggregation are extracted from these results using the evaluation function; the extracted nuclei are set as initial cluster nuclei; and aggregation processing is performed again. Because the first aggregation pass produces cluster nuclei close to the clumps of data, and the nuclei for the second pass are extracted from among them, each clump of data in the data space can be aggregated separately.

[0047] Further, according to the present invention, clumps of data formed by the distances between data points in the data space can be aggregated optimally, making it easy to grasp the distribution trends of a large amount of data.

[0048] Furthermore, according to the present invention, a large number of data records stored in a storage device are aggregated and the clustering result is displayed on the screen as a collection of figures, so that a result without overlapping figures can be output.

[0049] In addition, according to the present invention, the initial cluster nuclei are set so that aggregation is performed optimally. Instead of setting initial cluster nuclei randomly or by statistical estimation and running the aggregation processing directly, as in the conventional approach, more cluster nuclei than the specified number are first set randomly or by statistical estimation, one aggregation pass is performed, and the initial cluster nuclei needed for the second aggregation pass are extracted from the resulting cluster nuclei before aggregation is performed again.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a two-phase data cluster device according to one embodiment of the present invention.

FIG. 2 is a diagram showing the data space before this embodiment is applied.

FIG. 3 is a diagram showing the result of applying the first aggregation pass of this embodiment to the data space shown in FIG. 2.

FIG. 4 is a diagram showing the aggregation result obtained after priority cluster nuclei have been extracted, using the evaluation function, from the cluster nuclei that outnumber the specified number of aggregations so that aggregation can be run again; that is, the result of running aggregation again on the first-pass result shown in FIG. 3.

FIG. 5 shows an example of the results of the two-phase data cluster device of this embodiment: (a) is a conceptual image of the result obtained without the device, and (b) is a conceptual image of the result when the device is applied to the data that produced (a).

FIG. 6 is a flowchart showing the operation of the two-phase data cluster device of the embodiment shown in FIG. 1.

FIG. 7 is a flowchart showing the first-phase initial point generation performed by the first-phase initial point setting device used in the embodiment shown in FIG. 1.

FIG. 8 is a flowchart showing the second-phase initial point extraction performed by the second-phase initial point extraction device used in the embodiment shown in FIG. 1.

FIG. 9 is a diagram showing the data space before application, used to explain conventional processing.

FIG. 10 is a diagram showing the result of setting initial cluster nuclei randomly in the data space of FIG. 9 and applying the K-means method, an existing aggregation technique.

FIG. 11 is a diagram showing the result of setting initial cluster nuclei by statistical estimation in the data space of FIG. 9 and applying the K-means method, an existing aggregation technique.

[Explanation of symbols]

401: Information storage device
402: Aggregation condition input device
403: Aggregated data generation device
403-1: First-phase initial point setting device
403-2: First-phase aggregation processing device
403-3: Second-phase initial point extraction device
403-4: Second-phase aggregation processing device
404: Display information generation device
405: Graphic information conversion display device
405-1: Display device
406: External input device

Continued from the front page: (72) Inventor: Tetsuya Iizuka, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo

Claims

1. A two-phase data cluster method for generating, from a large amount of data, clusters in which closely related data in a multidimensional data space are aggregated into a desired number of clusters, the method comprising: designating the number of clusters to aggregate into, a weighting parameter α that specifies, in the range 0 to 1, whether the aggregation is weighted toward the distance between cluster nuclei or toward the spread of the data within a cluster, and an initial nucleus cluster count determination parameter K for determining the number of initial nucleus clusters from the number of aggregated clusters; setting initial cluster nuclei, which are the initial values for the aggregation processing, randomly or by statistical estimation within the minimum-to-maximum range of each dimension, subject to the designated number of initial nucleus clusters; applying a clustering method, including the K-means method, to the aggregation processing, calculating the distance between each cluster nucleus and each data point over the dimensions, and aggregating data that are close overall into the same cluster; as an index for extracting cluster nuclei, arranging the cluster nuclei in ascending order of the value of the following evaluation function Ci, in which bringing the weighting parameter α close to 1 gives more weight to the quantity relating to the distance between cluster nuclei and bringing α close to 0 gives more weight to the quantity relating to the degree of dispersion of the data within each cluster:

Ci = (weighting parameter α) × (quantity relating to the distance matrix between cluster nuclei) + (1 − weighting parameter α) × (quantity relating to the distance between cluster nucleus i and each data point aggregated into it)

and extracting only the designated number of cluster nuclei from the top as initial cluster nuclei for re-aggregation processing; and performing aggregation processing with the extracted cluster nuclei as the initial cluster nuclei of the re-aggregation processing, thereby generating the number of clusters designated by the user.

2. A two-phase data cluster method for generating, from a large amount of data, clusters in which closely related data in a multidimensional data space are aggregated into a desired number of clusters, the method comprising: designating a number of aggregations, a parameter for determining the number of initial cluster nuclei, and the data to be aggregated; detecting the designated number of aggregations, the parameter for determining the number of initial cluster nuclei, the number of dimensions of the data to be aggregated, and the maximum and minimum values of each dimension; setting initial cluster nuclei either randomly or from the data to be aggregated using a statistical method; performing aggregation processing using the set initial cluster nuclei and the designated data, and generating the positions, in the multidimensional data, of the cluster nuclei resulting from the aggregation; extracting, from the cluster nuclei of the generated aggregation result, the number of cluster nuclei designated by the user, using the aggregation approach the user desires; performing aggregation processing based on the extracted cluster nuclei and the designated original data, and generating the positions of the final cluster nuclei in the data space; converting the generated final cluster nucleus data into graphic information so that it can be displayed as a collection of figures; and outputting the generated graphic display information to a display device.

3. A two-phase data cluster apparatus for generating, from a large number of data, clusters in which data closely related in a multidimensional data space are aggregated into a desired number of aggregated clusters, the apparatus comprising: aggregation parameter designating means for designating the number of clusters to aggregate, a weighting parameter α that specifies, in a range from 0 to 1, whether the aggregation processing is weighted toward the distance between cluster nuclei or toward the spread of the data within a cluster, and an initial cluster nucleus number determination parameter K for determining the number of initial cluster nuclei from the number of aggregated clusters; initial cluster nucleus setting means for setting the initial cluster nuclei, which are the initial values of the aggregation processing, randomly or by statistical estimation within the range between the maximum and minimum of each dimension, subject to the number of initial cluster nuclei designated by the aggregation parameter designating means; aggregation processing means for applying a clustering method, including the K-means method, to the aggregation processing, calculating the distance between each cluster nucleus and each datum for each dimension, and aggregating data whose overall distance is short into the same cluster; priority cluster nucleus extraction means for using, as an index for extracting cluster nuclei, the following evaluation function Ci, which weights the quantity related to the distance between cluster nuclei more heavily as the weighting parameter α approaches 1, and weights the quantity related to the dispersion of the data within each cluster more heavily as α approaches 0:
Ci = α × (quantity related to the distance matrix between cluster nuclei) + (1 − α) × (quantity related to the distance between cluster nucleus i and each datum aggregated into it),
arranging the initial cluster nuclei in ascending order of the value of Ci, and extracting only the specified number of aggregated cluster nuclei, counted from the top, as the initial cluster nuclei for re-aggregation processing; and aggregation processing re-executing means for performing aggregation processing using the cluster nuclei extracted by the priority cluster nucleus extraction means as the initial cluster nuclei of the re-aggregation processing, and generating the number of clusters designated by the user.
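The role of the aggregation processing means can be sketched as one pass of the K-means method: the gap between a nucleus and a datum is computed dimension by dimension, the gaps are combined into one overall distance, each datum joins its nearest nucleus, and each nucleus moves to the mean of its members. The Euclidean combination, the function names, and the choice to leave an empty nucleus where it is are assumptions made for illustration.

```python
import math

def per_dimension_gaps(nucleus, datum):
    """Absolute difference between a nucleus and a datum, one value per dimension."""
    return [abs(c - x) for c, x in zip(nucleus, datum)]

def overall_distance(nucleus, datum):
    """Combine the per-dimension gaps into one distance (assumed Euclidean)."""
    return math.sqrt(sum(g * g for g in per_dimension_gaps(nucleus, datum)))

def aggregate_once(data, nuclei):
    """One K-means pass: assign data to nearest nuclei, then recompute nuclei."""
    members = [[] for _ in nuclei]
    for datum in data:
        nearest = min(range(len(nuclei)),
                      key=lambda i: overall_distance(nuclei[i], datum))
        members[nearest].append(datum)
    moved = [
        # Mean of the members per dimension; an empty nucleus stays in place.
        tuple(sum(axis) / len(pts) for axis in zip(*pts)) if pts else tuple(nucleus)
        for nucleus, pts in zip(nuclei, members)
    ]
    return moved, members

if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(aggregate_once(data, [(1, 1), (9, 1)]))
```

Repeating `aggregate_once` until the nuclei stop moving gives the usual K-means iteration; the claim itself only requires that a clustering method including K-means be applied.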

4. A two-phase data cluster apparatus for generating, from a large number of data, clusters in which data closely related in a multidimensional data space are aggregated into a desired number of aggregated clusters, the apparatus comprising: aggregation parameter specifying means for specifying the number of clusters to aggregate, a cluster nucleus number determination parameter, and the data to be aggregated; initial cluster nucleus setting means for detecting the set number of aggregated clusters, the initial cluster nucleus number determination parameter, the number of dimensions of the data to be aggregated, and the maximum and minimum values of each dimension, and for either setting the initial cluster nuclei randomly within that range or setting the initial cluster nuclei from the data to be aggregated by a statistical method; aggregation processing means for performing aggregation processing with the set initial cluster nuclei and the specified data, and generating the positions, in the multidimensional data, of the cluster nuclei resulting from the processing; priority cluster nucleus extraction means for extracting, from the cluster nuclei resulting from the processing, the specified number of cluster nuclei according to the aggregation method desired by the user, using the weighting parameter α of the evaluation function; aggregation processing re-executing means for performing aggregation processing based on the preferentially extracted cluster nuclei and the specified original data, and outputting the positions of the final cluster nuclei in the data space; display information generating means for converting the generated final cluster nucleus data into graphic information so that it can be displayed as a set of graphics; and a graphic information display device for outputting the display information of the generated graphics to a display device.

5. A recording medium on which is recorded a two-phase data cluster program for generating, from a large number of data, clusters in which data closely related in a multidimensional data space are aggregated into a desired number of aggregated clusters, the program comprising: specifying the number of clusters to aggregate, a weighting parameter α that specifies, in a range from 0 to 1, whether the aggregation processing is weighted toward the distance between cluster nuclei or toward the spread of the data within a cluster, and an initial cluster nucleus number determination parameter K for determining the number of initial cluster nuclei from the number of aggregated clusters; setting the initial cluster nuclei, which are the initial values of the aggregation processing, randomly or by statistical estimation within the range between the maximum and minimum of each dimension, subject to the specified number of initial cluster nuclei; applying a clustering method, including the K-means method, to the aggregation processing, calculating the distance between each cluster nucleus and each datum for each dimension, and aggregating data separated by a short distance into the same cluster; using, as an index for extracting cluster nuclei, the following evaluation function Ci, which weights the quantity related to the distance between cluster nuclei more heavily as the weighting parameter α approaches 1, and weights the quantity related to the dispersion of the data within each cluster more heavily as α approaches 0:
Ci = α × (quantity related to the distance matrix between cluster nuclei) + (1 − α) × (quantity related to the distance between cluster nucleus i and each datum aggregated into it);
arranging the initial cluster nuclei in ascending order of the value of Ci, and extracting only the specified number of aggregated cluster nuclei, counted from the top, as the initial cluster nuclei for re-aggregation processing; and performing aggregation processing using the extracted cluster nuclei as the initial cluster nuclei of the re-aggregation processing, thereby generating the number of clusters specified by the user.
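Taken together, the steps of the recorded program form a two-phase procedure, which the following self-contained Python sketch strings into one driver. It is an illustration under stated assumptions rather than the patented implementation: Euclidean distance, uniform random initialization, a separation term of 1 / (1 + mean distance to the other nuclei), a dispersion term equal to the mean member distance, and all function and parameter names are hypothetical.

```python
import math
import random

def assign(data, nuclei):
    """Group each datum under the index of its nearest nucleus."""
    members = [[] for _ in nuclei]
    for p in data:
        nearest = min(range(len(nuclei)), key=lambda i: math.dist(p, nuclei[i]))
        members[nearest].append(p)
    return members

def kmeans(data, nuclei, max_iter=100):
    """Iterate assignment and mean update until the nuclei stop moving."""
    for _ in range(max_iter):
        members = assign(data, nuclei)
        moved = [tuple(sum(a) / len(m) for a in zip(*m)) if m else c
                 for c, m in zip(nuclei, members)]
        if moved == nuclei:
            break
        nuclei = moved
    return nuclei, assign(data, nuclei)

def score(i, nuclei, members, alpha):
    """Assumed form of C_i; lower values are extracted first."""
    if not members[i]:
        return math.inf  # assumption: empty nuclei rank last
    gaps = [math.dist(nuclei[i], c) for j, c in enumerate(nuclei) if j != i]
    separation = 1.0 / (1.0 + sum(gaps) / len(gaps)) if gaps else 0.0
    spread = sum(math.dist(p, nuclei[i]) for p in members[i]) / len(members[i])
    return alpha * separation + (1.0 - alpha) * spread

def two_phase_cluster(data, n_clusters, k_factor, alpha, seed=None):
    """Phase 1 with n_clusters * k_factor nuclei, extraction, then phase 2."""
    rng = random.Random(seed)
    lows = [min(axis) for axis in zip(*data)]
    highs = [max(axis) for axis in zip(*data)]
    start = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
             for _ in range(n_clusters * k_factor)]
    nuclei, members = kmeans(data, start)  # phase 1: oversampled nuclei
    order = sorted(range(len(nuclei)),
                   key=lambda i: score(i, nuclei, members, alpha))
    chosen = [nuclei[i] for i in order[:n_clusters]]
    return kmeans(data, chosen)  # phase 2: re-aggregation

if __name__ == "__main__":
    blobs = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
    final_nuclei, final_members = two_phase_cluster(blobs, 2, 3, alpha=0.5, seed=1)
    print(final_nuclei)
```

The driver returns the final nucleus positions together with the data grouped under each nucleus. Because the first phase starts from random positions, the result depends on the seed.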