JPH1027125A

JPH1027125A - Document classifying device

Info

Publication number: JPH1027125A
Application number: JP8199543A
Authority: JP
Inventors: Hiroshi Masuichi; 博増市
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1996-07-11
Filing date: 1996-07-11
Publication date: 1998-01-27
Anticipated expiration: 2016-07-11
Also published as: JP3772401B2

Abstract

PROBLEM TO BE SOLVED: To appropriately classify a large number of documents that are linked in a complex manner, such as hypertext, by generating initial document clusters from link relations and inter-document distances and then performing a cluster analysis based on the inter-document distances. SOLUTION: A document storage part 11 stores electronic documents, and a link relation storage part 12 stores the link relations among the documents stored in the document storage part 11. A distance calculating processing part 13 calculates inter-document distances from the appearance frequencies of the words contained in each document stored in the document storage part 11. A document classifying processing part 14 then generates initial document clusters on the basis of the stored link relations and the calculated inter-document distances, and performs a cluster analysis based on the inter-document distances to classify the documents stored in the document storage part 11. An output processing part 15 outputs the classification result of the document classifying processing part 14.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document classification device that classifies a large number of electronic documents existing on a network system, and in particular to a document classification device that classifies a large number of documents linked in a complex manner, such as hypertext.

[0002]

[Prior Art] With the spread of the Internet, electronic documents on computer systems at physically distant locations can now be accessed easily over a network. Link information for referring to other electronic documents can be embedded in such a document, and by following that link information a reader can easily reach other electronic documents related to it. An electronic document with embedded link information of this kind is generally called hypertext.

[0003] In a network system such as the Internet, as the number of accessible electronic documents grows very large, it is becoming difficult to find a desired document among them by following link information alone.

[0004] As one way of addressing this problem, a growing number of systems provide search services for electronic documents published on the Internet. These search systems can run a keyword search across a large number of documents at once. They make this possible by exploring the electronic documents published on the Internet in advance, as exhaustively as possible, and acquiring the content of each document.

[0005] Some of these search systems further improve search efficiency by classifying each document into one of several categories according to its content. A user of such a system can focus a keyword search on the categories likely to contain the desired document, so an improvement in search efficiency can be expected.

[0006] Documents can be classified either manually or automatically, by computation based on the distance between documents. For classifying a large number of documents, the latter is more advantageous in terms of efficiency.

[0007] (Prior Art 1) One technique used for such classification is to weight words on the basis of the frequency with which each word appears in a document, as discussed in Luhn, H. P., 'A statistical approach to mechanised encoding and searching of library information', IBM Journal of Research and Development, 1, 309-17 (1957). In particular, words with high weights can be regarded as keywords that represent the document.

[0008] (Prior Art 2) A technique for obtaining the inter-document distance from word weights has been proposed in, for example, Salton, G. and McGill, N. J., 'Introduction to Modern Information Retrieval', New York, McGraw-Hill (1983), and has been adopted in several document classification systems.

[0009] In such a document classification system, suppose that a weight $W_{iu}$ is assigned to each word $R_u$ for a document $P_i$. If the word $R_u$ does not occur in the document $P_i$, the weight $W_{iu}$ is set to 0; if it does occur, $W_{iu}$ is a real value equal to or greater than 0. The document vector $V_{pi}$ of the document $P_i$ is then defined as follows:

$$\Omega^{*}_{iu} = W_{iu} / (\max W_{iu}) \qquad (1\text{-}1)$$

$$V^{*}_{pi} = (\Omega^{*}_{i1}, \Omega^{*}_{i2}, \ldots, \Omega^{*}_{im}) \qquad (1\text{-}2)$$

$$V_{pi} = V^{*}_{pi} / |V^{*}_{pi}| = (\Omega_{i1}, \Omega_{i2}, \ldots, \Omega_{im}) \qquad (1\text{-}3)$$

Here $m$ is the total number of distinct words, and $\Omega_{iu}$ ($0 \le \Omega_{iu} \le 1$) is redefined as the weight of each word $R_u$ for the document $P_i$. The distance $d(P_i, P_j)$ ($0 \le d(P_i, P_j) \le 1$) between a document $P_i$ and a document $P_j$ is then defined in terms of the angle between the two document vectors:

$$d(P_i, P_j) = 2 \arccos(V_{pi} \cdot V_{pj}) / \pi \qquad (1\text{-}4)$$
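As an illustration only, equations (1-1) to (1-4) can be sketched in Python as follows. The function names are hypothetical, and the maximum in (1-1) is assumed to be taken over the words of the single document P_i, which the text does not state explicitly.

```python
import math

def document_vector(weights):
    """Turn raw word weights W_iu into a unit vector, per (1-1) to (1-3)."""
    peak = max(weights)  # assumption: maximum over the words of this document
    if peak == 0:
        raise ValueError("document has no weighted words")
    scaled = [w / peak for w in weights]                # (1-1), (1-2)
    norm = math.sqrt(sum(x * x for x in scaled))
    return [x / norm for x in scaled]                   # (1-3)

def document_distance(weights_i, weights_j):
    """Angle between two document vectors scaled to [0, 1], per (1-4)."""
    vi = document_vector(weights_i)
    vj = document_vector(weights_j)
    dot = sum(a * b for a, b in zip(vi, vj))
    dot = max(-1.0, min(1.0, dot))  # guard against floating-point drift
    return 2.0 * math.acos(dot) / math.pi
```

Because all weights are non-negative, the dot product lies between 0 and 1, so the angle lies between 0 and π/2 and the distance between 0 and 1. Documents with no word in common are at distance 1, and documents with proportional weight vectors are at distance 0.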

[0010] (Prior Art 3) Applying a cluster analysis technique to the inter-document distances obtained as described above makes it possible to classify documents. For cluster analysis techniques, see, for example, Tanaka, Tarumi, and Wakimoto, "Statistical Analysis Handbook II, Multivariate Analysis", pp. 226-257, Kyoritsu Shuppan (1984). Cluster analysis is a well-known technique, so its description is omitted here.

[0011]

[Problems to Be Solved by the Invention] Document classification systems based on the conventional techniques described above still have the following problem. In a document classification system according to (Prior Art 1) or (Prior Art 2), the mechanically obtained inter-document distance is not set with any deep consideration of the semantic content of the documents. A classification based on such inter-document distances therefore can hardly be said to reflect the semantic content of the documents sufficiently. As a result, it remains difficult for a user to find a desired document among a large number of electronic documents.

[0012] The present invention has been made to solve this problem, and its object is to provide a document classification device that can appropriately classify a large number of documents linked in a complex manner, such as hypertext.

[0013]

[Means for Solving the Problems] To achieve the above object, the document classification device according to the present invention comprises: document storage means (11) for storing a plurality of electronic documents; link relation storage means (12) for storing link relations among the plurality of documents stored in the document storage means; distance calculation means (13) for calculating inter-document distances from the appearance frequencies of words contained in each document stored in the document storage means; document classification means (14) for generating initial document clusters on the basis of the link relations stored in the link relation storage means and the inter-document distances obtained from the distance calculation means, performing a cluster analysis based on the inter-document distances, and classifying the plurality of documents stored in the document storage means; and output means (15) for outputting the result of classification by the document classification means.

[0014] In a document classification device with these features, the document storage means (11) stores a plurality of electronic documents, and the link relation storage means (12) stores the link relations among the documents stored in the document storage means. When the distance calculation means (13) has calculated the inter-document distances from the appearance frequencies of the words contained in each document stored in the document storage means, the document classification means (14) generates initial document clusters on the basis of the link relations stored in the link relation storage means and the inter-document distances obtained from the distance calculation means, performs a cluster analysis based on the inter-document distances, and classifies the plurality of documents stored in the document storage means. The output means (15) then outputs the result of classification by the document classification means.

[0015] In this way, the document classification device of the present invention uses the link information written in the documents when it classifies hypertext documents by cluster analysis. Links between documents are basically set by a document's author to documents that are semantically close (at a small distance) to the author's own document, so the cluster analysis is performed using both the link relation information and the inter-document distance. This achieves a classification that reflects the intent of the documents' authors, that is, a classification in line with the semantic content of the documents.

[0016]

[Embodiments of the Invention] An embodiment of the present invention is described below in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the main part of a document classification device according to one embodiment of the present invention. In FIG. 1, reference numeral 11 denotes a document storage unit, 12 a link relation storage unit, 13 a distance calculation processing unit, 14 a document classification processing unit, and 15 an output processing unit.

[0017] In the document classification device of this embodiment, the document storage unit 11 stores a large number of electronic documents, and the link relation storage unit 12 stores, for each stored document, the link relation information between documents (the location of each referenced document and its document identifier). The distance calculation processing unit 13 analyzes each document stored in the document storage unit 11 and calculates the inter-document distances from the appearance frequencies of the words contained in the documents. The inter-document distance is calculated, for example, by the same method (algorithm) as in the document classification system of (Prior Art 2) described above.

[0018] Once the inter-document distances have been calculated, the document classification processing unit 14 generates initial document clusters on the basis of the link relation information stored in the link relation storage unit 12 and the inter-document distances obtained from the distance calculation processing unit 13, performs a cluster analysis based on the inter-document distances, and thereby classifies the documents stored in the document storage unit 11. The output processing unit 15 displays the classification result to the user in an easy-to-read form through a graphical user interface. For example, only the documents that belong to the same group according to the cluster analysis are displayed out of the large number of documents, which makes it easier for the user to find a desired document.

[0019] FIG. 2 is a block diagram showing the configuration of the main part of a document classification system connected to a wide area network, which is another embodiment of the present invention. In FIG. 2, reference numeral 20 denotes a wide area network, 21 a document acquisition processing unit, 22 a document storage unit, 23 a link relation storage unit, 24 an independent word extraction processing unit, 25 a word weight setting processing unit, 26 an inter-document distance calculation processing unit, 27 a document classification processing unit, and 28 an output processing unit. The document classification system shown in FIG. 2 acquires hypertext documents that are distributed over the wide area network 20 and, using the link information embedded in them, classifies these electronic documents on the basis of their content.

[0020] The wide area network 20 is, for example, the Internet, in which a plurality of network systems are connected to one another. The document acquisition processing unit 21 consists of a program module that acquires the large number of documents accessible on the wide area network 20. When one of the electronic documents stored on a computer system connected to the wide area network 20 is specified, this program module recursively repeats the operation of acquiring the content of the specified electronic document, identifying the link information embedded in it that points to other documents, and acquiring the documents that the link information points to. In this way it acquires the electronic documents distributed over the computer systems connected to the wide area network 20.

[0021] The large number of documents acquired by the document acquisition processing unit 21 are stored in the document storage unit 22. The document storage unit 22 stores each document acquired by the document acquisition processing unit 21 paired with the link information that identifies that document. The link relation storage unit 23 stores whether or not a link relation exists between each pair of documents stored in the document storage unit 22.

[0022] The independent word extraction processing unit 24 uses a morphological analysis algorithm to extract independent words (words) from the documents stored in the document storage unit 22, which segments each document into words. On the basis of this extraction result, the word weight setting processing unit 25 sets a weight (importance) for every independent word in each document. The inter-document distance calculation processing unit 26 then uses the weights set by the word weight setting processing unit 25 to calculate the distance for every pair of documents stored in the document storage unit 22.

[0023] Once the distances between documents have been calculated, the document classification processing unit 27 classifies the documents by cluster analysis on the basis of the presence or absence of link relations stored in the link relation storage unit 23 and the inter-document distances calculated by the inter-document distance calculation processing unit 26. The output processing unit 28 displays the classification result of the document classification processing unit 27. Using a graphical user interface, it presents the result to the user in an easy-to-read form, for example with the documents that belong to the same group shown together.

[0024] In an electronic document in hypertext form, link information (the network location and document identifier of another document) is generally marked with a tag that identifies it as link information, so that it can be distinguished from the content of the document. The document acquisition processing unit 21 therefore identifies the link information in a document by detecting the character strings that match this tag.
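A minimal sketch of this tag matching, assuming the hypertext is HTML and that link information is carried in the `href` attribute of `<a>` tags. The source does not name a markup language, so both are assumptions, as are the names `LinkExtractor` and `extract_links`.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the targets of anchor tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(document_text):
    """Return every link target found in one hypertext document."""
    parser = LinkExtractor()
    parser.feed(document_text)
    parser.close()
    return parser.links
```

The sketch returns the link targets exactly as written in the document; resolving relative targets against the document's own location is left out.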

[0025] FIG. 3 is a flowchart showing the algorithm of the document acquisition processing performed by the document acquisition processing unit 21. The operation of the document acquisition processing unit is described with reference to this flowchart. Processing starts when the document acquisition processing is invoked with the link information of one document on the wide area network specified as the initial condition. First, in step 31, the document whose link information (network location and document identifier) was specified as the initial condition is taken as document D. In step 32, the link information of document D is added to the head of a list S, and the head of list S is taken as the current list position P. In step 33, it is determined whether link information exists at list position P of list S. If no link information exists there, document acquisition by this list operation is complete, and the processing ends.
In the next step 32, the link information of the document D is added to the head of the list S, and the head of the list S is set as the current list position P. Next, in the next step 33, it is determined whether or not there is link information corresponding to the list position P of the list S. If the link information does not exist in this determination, it means that the document acquisition processing by the list operation here has ended, and the processing ends.

[0026] If the determination in step 33 finds that link information exists, the processing proceeds to step 34, where the content of the document D corresponding to that link information is acquired on the basis of the link information. In step 35, the link information of document D and its content are stored as a pair in the document storage unit 22 (FIG. 4). In step 36, all of the link information (D1, D2, ..., Dn) written in document D is identified.

[0027] In step 37, any of the link information (D1, D2, ..., Dn) that is not already in list S is appended to list S. In step 38, the fact that a link relation exists between document D and each piece of link information (D1, D2, ..., Dn) is stored in the link relation storage unit 23. Then, to process the next document, step 39 advances the current list position P to the position following P in list S, and the processing returns to step 33. As described above, step 33 determines whether link information exists at list position P of list S. If it does, the processing from step 34 onward is repeated; if it does not, document acquisition by this list operation is complete, and the processing ends.
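The list-driven loop of steps 31 to 39 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `fetch` and `extract_links` are hypothetical callables supplied by the caller, and a plain dictionary and set stand in for the document storage unit 22 and the link relation storage unit 23.

```python
def acquire_documents(start_link, fetch, extract_links):
    """Follow links outward from start_link, mirroring steps 31-39 of FIG. 3.

    fetch(link) returns the content of the document at that link, and
    extract_links(content) returns the link information found in it.
    """
    pending = [start_link]       # list S, seeded in step 32
    position = 0                 # current list position P
    document_store = {}          # stand-in for document storage unit 22
    link_relations = set()       # stand-in for link relation storage unit 23

    while position < len(pending):                  # step 33
        link = pending[position]
        content = fetch(link)                       # step 34
        document_store[link] = content              # step 35
        for target in extract_links(content):       # step 36
            if target not in pending:               # step 37
                pending.append(target)
            link_relations.add((link, target))      # step 38
        position += 1                               # step 39
    return document_store, link_relations
```

Because new link information is appended to the end of list S and P only moves forward, each document is fetched once and the loop ends when P runs past the end of the list. A real crawler would also need error handling and a limit on how far it follows links, neither of which the flowchart covers.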

[0028] Through this processing by the document acquisition processing unit 21, the other documents linked from within a document are acquired recursively. The content of each acquired document is stored in the document storage unit 22 together with the link information of that document, and the link relation information between documents is stored in the link relation storage unit 23.

[0029] FIG. 4 is a diagram illustrating the relationship between the document content and the link information stored in the document storage unit 22. As shown in FIG. 4, the document storage unit stores the document content 42 of each acquired document in association with its link information (D1, D2, ..., Dn) 41.

[0030] FIG. 5 is a diagram illustrating the link relation information stored in the link relation storage unit 23. As shown in FIG. 5, the link relation storage unit 23 stores the link relations in the form of a two-dimensional matrix table. The row and column headings of the table correspond to the link information (D1, D2, ..., Dn) stored in the document storage unit 22. A circle (○) marks a pair of documents, identified by their link information, between which a link relation exists, and a cross (×) marks a pair between which there is none.
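One way to hold such a table in memory is a square boolean matrix, sketched below. The function name `link_matrix` is hypothetical. The source does not say whether a link relation is directed, so the sketch assumes that a link in either direction marks both cells of the pair.

```python
def link_matrix(links, relations):
    """Build a square table of link relations.

    links is the ordered list of link information (row and column headings);
    relations is an iterable of (source, target) pairs.
    True plays the role of the circle in FIG. 5 and False the cross.
    """
    index = {link: i for i, link in enumerate(links)}
    size = len(links)
    matrix = [[False] * size for _ in range(size)]
    for source, target in relations:
        i, j = index[source], index[target]
        matrix[i][j] = True
        matrix[j][i] = True  # assumption: the relation is treated as undirected
    return matrix
```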

[0031] As described above, the independent word extraction processing unit 24 segments the content of each document stored in the document storage unit 22 into words using a known morphological analysis algorithm and extracts the independent words in each document. For the extracted independent words, the word weight setting processing unit 25 sets the weight to 1 for an independent word that is contained in the content of a given document and to 0 for an independent word that is not.

[0032] FIG. 6 is a diagram showing an example of the weighting result produced by the word weight setting processing unit 25. As described above, the content of each document is associated with its link information (D1, D2, ..., Dn). As shown in FIG. 6, for the independent words (WORD1, WORD2, WORD3, ..., WORDn) contained in the documents, the weight is set to 1 if the independent word is contained in the content of a given document and to 0 if it is not, and these weights are associated with the content of each document through the link information (D1, D2, ..., Dn).
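The binary weighting can be sketched as below, assuming the independent words have already been extracted from each document (morphological analysis itself is outside the sketch). The function name and the sorted vocabulary order are choices made here for illustration.

```python
def binary_weight_table(documents):
    """Assign weight 1 to words a document contains and 0 to words it lacks.

    documents maps link information to the independent words extracted
    from that document. Returns the shared vocabulary and one weight row
    per document, keyed by link information.
    """
    vocabulary = sorted({word for words in documents.values() for word in words})
    table = {}
    for link, words in documents.items():
        present = set(words)
        table[link] = [1 if word in present else 0 for word in vocabulary]
    return vocabulary, table
```

Each row of the table is a weight vector over the same vocabulary, so any two rows can be passed directly to a distance computation based on equations (1-1) to (1-4).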

[0033] The inter-document distance calculation processing unit 26 calculates the distance for every pair of documents stored in the document storage unit 22 on the basis of equations (1-1) to (1-4) above. The calculated inter-document distances are stored as distances between the link information (D1, D2, ..., Dn) associated with the content of each document. FIG. 7 shows an example of the inter-document distances calculated by the inter-document distance calculation processing unit 26.

[0034] Once the inter-document distances of the documents acquired via the link information have been calculated, the document classification processing unit 27 generates initial document clusters on the basis of the link relation information and the calculated inter-document distances, performs a cluster analysis based on the inter-document distances, and classifies the documents stored in the document storage unit 22.

[0035] FIG. 8 is a flowchart showing the algorithm of the document classification processing performed by the document classification processing unit 27. The document classification processing is described with reference to FIG. 8. When the processing starts, initial document clusters are created in step 81. Specifically, referring to the presence or absence of link relations in the link relation storage unit 23 and to the calculation results of the inter-document distance calculation processing unit 26, each pair of documents that has a link relation and whose inter-document distance is equal to or less than a predetermined constant K (0 ≤ K ≤ 1) is made into one cluster. If three or more documents are chained together by pairs that satisfy this condition, they are combined into a single cluster.
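Step 81 amounts to grouping documents that are connected by chains of qualifying pairs. The sketch below does this with a union-find structure, which is a choice made here and not something the source specifies; the function and parameter names are hypothetical.

```python
def initial_clusters(documents, linked_pairs, distance, k):
    """Group documents joined by pairs that are linked and within distance k.

    documents is a list of document identifiers, linked_pairs an iterable of
    (a, b) pairs that have a link relation, and distance(a, b) the
    inter-document distance. Returns (clusters, unclustered documents).
    """
    parent = {doc: doc for doc in documents}

    def find(doc):
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path halving
            doc = parent[doc]
        return doc

    for a, b in linked_pairs:
        if distance(a, b) <= k:
            parent[find(a)] = find(b)

    groups = {}
    for doc in documents:
        groups.setdefault(find(doc), []).append(doc)
    clusters = [g for g in groups.values() if len(g) > 1]
    singles = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, singles
```

A linked pair whose distance exceeds K is ignored, so a link alone is not enough to put two documents in the same initial cluster.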

[0036] Next, in step 82, the pairwise distances among all of the clusters obtained and all of the documents that do not belong to a cluster are recalculated. In step 83, the two clusters or documents with the smallest of these pairwise distances are merged into one cluster. In step 84, it is determined whether the total number of clusters and unclustered documents is equal to or less than a predetermined number N (1 ≤ N ≤ n, where n is the total number of documents). If the total is not equal to or less than N, documents that have not yet been classified remain, so the processing returns to step 82 and the classification by cluster analysis in steps 82 and 83 is repeated. When the determination in step 84 confirms that the total number of clusters and documents is equal to or less than N, the classification of the documents is complete and the series of processing ends. The classification result is then displayed by the output processing unit 28, as described below.
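The loop of steps 82 to 84 can be sketched as follows. The names are hypothetical, and the distance between groups is passed in as a function so that the sketch does not commit to a particular definition. An unclustered document is represented as a one-element list.

```python
from itertools import combinations

def agglomerate(groups, group_distance, n):
    """Merge the closest pair of groups until at most n groups remain.

    groups is a list of lists of documents; group_distance(a, b) returns
    the distance between two such lists.
    """
    groups = [list(g) for g in groups]
    while len(groups) > 1:
        # Step 82: recompute every pairwise distance and find the smallest.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: group_distance(groups[pair[0]], groups[pair[1]]),
        )
        # Step 83: merge the closest pair into one cluster.
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
        # Step 84: stop once the number of clusters and documents is at most n.
        if len(groups) <= n:
            break
    return groups
```

As in the flowchart, the count is checked after a merge, so at least one merge takes place whenever two or more groups are passed in. Recomputing all pairwise distances on every pass is simple but slow for large collections.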

[0037] In step 82, the distances between clusters and documents not belonging to any cluster are recalculated. The distance between two documents is calculated with Expressions (1-1) to (1-4), as described above. The distance between a cluster C and a document D is obtained by calculating, with Expressions (1-1) to (1-4), the distance between document D and every document belonging to cluster C, and taking the average of those values. The distance between a cluster C1 and a cluster C2 is obtained by calculating the distances between the documents belonging to cluster C1 and the documents belonging to cluster C2, and taking the average of those values.
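The averaging rules above can be written compactly as follows. The notation is an assumption introduced here: d(·,·) denotes the inter-document distance of Expressions (1-1) to (1-4), and |C| denotes the number of documents in cluster C. The cluster-to-cluster formula reads the text as an average over all cross pairs, which is one plausible interpretation.

```latex
d(C, D) = \frac{1}{|C|} \sum_{D_i \in C} d(D_i, D),
\qquad
d(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{D_i \in C_1} \sum_{D_j \in C_2} d(D_i, D_j)
```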

[0038] The document classification algorithm of the document classification processing unit 27 uses both inter-document distances and link relations to set the initial clusters of an ordinary cluster analysis. That is, documents that have a link relation and are also close in inter-document distance are grouped into initial clusters. Using inter-document distance together with link relations makes it possible to use selectively those link relations that reflect a strong semantic relationship. In addition, using link relations enables more reliable classification than conventional cluster analysis based on inter-document distance information alone. As a result, cluster analysis (classification) that better reflects the semantic content of the documents becomes possible.

[0039] To give a concrete example, for the numerical example of FIGS. 4, 5, 6, and 7 with K = 0.6, the smallest inter-document distance is 0.09, between documents D1 and D4; the next smallest is 0.12, between documents D4 and D5; and the next is 0.27, between documents D2 and D3. The initial clusters are therefore (D1, D4, D5) and (D2, D3).
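The initial clustering of step 81 can be checked against these figures with a short sketch. It is an illustration under stated assumptions, not the device's actual implementation: the function name `initial_clusters` and the union-find bookkeeping are choices made here, and the three pairs are assumed to be linked (the link table of FIG. 5 is not reproduced in this passage), which is consistent with the clusters reported above.

```python
from collections import defaultdict


def initial_clusters(documents, links, distance, k):
    """Group documents chained by linked pairs whose distance is <= k."""
    parent = {d: d for d in documents}

    def find(d):
        # Walk up to the group representative, shortening the path on the way.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in links:
        # A pair joins a cluster only if it is linked AND close enough.
        if distance[frozenset((a, b))] <= k:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for d in documents:
        groups[find(d)].append(d)
    # Documents left alone are not clusters; they stay unclustered.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


documents = ["D1", "D2", "D3", "D4", "D5"]
# Distances quoted in the text; the links themselves are an assumption.
distance = {
    frozenset(("D1", "D4")): 0.09,
    frozenset(("D4", "D5")): 0.12,
    frozenset(("D2", "D3")): 0.27,
}
links = [("D1", "D4"), ("D4", "D5"), ("D2", "D3")]

clusters = initial_clusters(documents, links, distance, k=0.6)
print(clusters)  # [['D1', 'D4', 'D5'], ['D2', 'D3']]
```

D1 and D5 end up in the same cluster through D4 even though no direct D1-D5 pair is tested, which corresponds to the rule that three or more chained documents form a single cluster.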

[0040] Next, the processing of the output processing unit 28 is described. As noted above, the output processing unit 28 uses a graphical user interface to present the document classification result to the user in an easy-to-read form, for example with documents belonging to the same group displayed together. This display form is explained through a concrete operation example. FIGS. 9 to 13 show a series of changes in the operation screen when the user starts the document search device incorporated in the document classifying device, searches for papers, and then classifies documents. When the document search device is started, a literature search window 90 is displayed, as shown in FIG. 9. The literature search window 90 provides a keyword input field 91 for entering search keywords, together with a search operation guide.

[0041] In the literature search window 90, suppose, for example, that the user enters the keywords "artificial brain", "qualitative reasoning", and "immune network" for a paper search, as shown in FIG. 10. The literature search window 90 then shows the search keywords entered in the keyword input field 91. When the search button 92 is clicked with the pointer cursor 93 in this state, search processing starts and the results are displayed in the search result display field 94. As shown in FIG. 11, the search result display field 94 then shows, for example, the titles of the three documents that were hit.

[0042] Next, the user starts the document classifying device of this embodiment in order to display further documents closely related to a retrieved document. To do so, as shown in FIG. 12, the user designates (highlights) one document 95 among those displayed in the search result display field 94 by operating the pointer cursor 93 and then, as shown in FIG. 13, operates the related literature display button 96, that is, clicks it with the pointer cursor 93 using the mouse. This starts the document classifying device of this embodiment. The device acquires related documents from the designated document through the link information embedded in it, executes document classification by cluster analysis based on the inter-document distances, and displays the documents belonging to the same group in the related document display field 97. In this way, when performing a literature search, the user can efficiently search related documents as well.

[0043]

[Effect of the Invention] As described above, according to the document classifying device of the present invention, using the link information written in the documents when performing cluster analysis on documents in hypertext form makes it possible to classify the documents in a way that reflects the intent of their authors. In other words, documents can be classified in accordance with their semantic content.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the main part of a document classifying device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the main part of a document classification system connected to a wide area network according to another embodiment of the present invention.

FIG. 3 is a flowchart showing the algorithm of the document acquisition processing of the document acquisition processing unit 21.

FIG. 4 is a diagram explaining the relationship between the document contents stored in the document storage unit 22 and the link information.

FIG. 5 is a diagram explaining the link relation information stored in the link relation storage unit 23.

FIG. 6 is a diagram showing an example of the weighting results produced by the word weight setting processing unit 25.

FIG. 7 is a diagram showing an example of the inter-document distances calculated by the inter-document distance calculation processing unit 26.

FIG. 8 is a flowchart showing the algorithm of the document classification processing performed by the document classification processing unit 27.

FIG. 9 is a diagram showing the first state in the series of operation screen changes when a paper search is performed and documents are then classified.

FIG. 10 is a diagram showing the second state in the series of operation screen changes when a paper search is performed and documents are then classified.

FIG. 11 is a diagram showing the third state in the series of operation screen changes when a paper search is performed and documents are then classified.

FIG. 12 is a diagram showing the fourth state in the series of operation screen changes when a paper search is performed and documents are then classified.

FIG. 13 is a diagram showing the fifth state in the series of operation screen changes when a paper search is performed and documents are then classified.

[Explanation of symbols]

11: document storage unit, 12: link relation storage unit, 13: distance calculation processing unit, 14: document classification processing unit, 15: output processing unit, 20: wide area network, 21: document acquisition processing unit, 22: document storage unit, 23: link relation storage unit, 24: independent word extraction processing unit, 25: word weight setting processing unit, 26: inter-document distance calculation processing unit, 27: document classification processing unit, 28: output processing unit.

Claims

[Claims]

1. A document classifying device comprising: document storage means for storing a plurality of digitized documents; link relation storage means for storing link relations among the plurality of documents stored in the document storage means; distance calculating means for calculating inter-document distances from the appearance frequencies of words contained in each of the documents stored in the document storage means; document classifying means for generating initial document clusters on the basis of the link relations stored in the link relation storage means and the inter-document distances obtained by the distance calculating means, and for performing cluster analysis based on the inter-document distances to classify the plurality of documents stored in the document storage means; and output means for outputting the result of classification by the document classifying means.