JPH11296552A - Document classification device, document classification method, and computer-readable recording medium recording a program for causing a computer to execute the method

Info

Publication number: JPH11296552A
Application number: JP10115907A
Authority: JP
Inventors: Eiji Kenmochi; 栄治剣持; Tatsuo Miyaji; 達生宮地; Atsuo Shimada; 敦夫嶋田; Kazuhisa Takeya; 一寿武谷; Akiko Nakajima; 明子中島; Tetsuo Nagatsuka; 哲郎長束; Makoto Yamazaki; 真湖人山崎; Katsuhiko Fujita; 克彦藤田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-04-13
Filing date: 1998-04-13
Publication date: 1999-10-29

Abstract

(57) [Abstract]

[Problem] When classifying documents based on the similarity between documents, to enable document classification that reflects the operator's intention to be repeated efficiently in a short time.

[Solution] The device comprises: an input unit 401 that inputs document data; an analysis unit 402 that analyzes the input document data and obtains analysis information; a vector generation unit 403 that generates a document feature vector for the document data based on the obtained analysis information; a conversion function calculation unit 404 that calculates an expression space conversion function for projecting the generated document feature vectors into a space that reflects the similarity between document feature vectors; a vector conversion unit 405 that converts the document feature vectors generated by the vector generation unit 403 using the calculated expression space conversion function; a classification unit 406 that classifies documents based on the similarity between the converted document feature vectors; and a classification result storage unit 407 that stores the result of the document classification.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document classification device that classifies documents based on the similarity between documents, a document classification method, and a computer-readable recording medium recording a program for causing a computer to execute the method.

[0002]

[Prior Art] In recent years, the spread of the Internet and similar networks has made large volumes of document information accessible, and intellectual tasks such as classifying large collections of gathered documents into meaningful categories and grasping the structure of a document set are increasingly being performed. When an operator classifies a large volume of document information manually, the labor and time costs become enormous. Moreover, because the knowledge used for classification resides only with the operator who performs it, the classification criteria change whenever the operator in charge changes.

[0003] Consequently, an important issue is how to classify a document set automatically according to criteria like those a human would apply. That is, a document classification device is desired in which semantically similar documents are classified into the same category, and in which each category generated during the classification process matches the categories the operator intended before running the classification.

[0004] As prior art for automatic document classification devices, there is, for example, the approach described in Japanese Patent Application Laid-Open No. H7-36897, which regards each document as a document vector whose features are words, groups these document vectors using a clustering method, and automatically classifies the documents based on the grouped document vectors.

[0005]

[Problems to Be Solved by the Invention] However, the prior-art document classification device described above uses document feature vectors whose features are the words contained in the documents to be classified, and classifies by applying a clustering method to those vectors. Because of the polysemy and synonymy of words, it is therefore difficult to obtain classification results that reflect the semantic relatedness of the documents.
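
The synonymy problem can be illustrated with a minimal sketch. The sample sentences and the helper names `term_vector` and `inner_product` are assumptions made for illustration and do not come from the source; the point is only that two documents with the same meaning but no shared words receive a similarity of zero under a word-count representation.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def inner_product(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical documents: the first two share a meaning but no words.
docs = ["car engine repair", "automobile motor maintenance", "car engine tuning"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

synonym_score = inner_product(vectors[0], vectors[1])  # same topic, different words
overlap_score = inner_product(vectors[0], vectors[2])  # two words in common
```

A clustering method applied directly to such vectors would treat the first two documents as unrelated, which is the difficulty the paragraph describes.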

[0006] One approach to this polysemy/synonymy problem, described in U.S. Pat. No. 4,839,853, applies singular value decomposition to the inner product matrix between documents. That is, documents and words are projected into a space called the latent semantic space, which is generated from the co-occurrence of words across documents, so that document retrieval reflects semantic relatedness.
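
The projection into a latent semantic space can be written compactly. The following is a sketch in commonly used latent-semantic-indexing notation; the symbols X, G, U, Σ, V, and k are assumptions introduced here for illustration and do not appear in the source.

```latex
% X: term-by-document matrix whose columns are document vectors (assumed notation)
% G: inner product matrix between documents
G = X^{\top} X, \qquad X = U \Sigma V^{\top} \;\Longrightarrow\; G = V \Sigma^{2} V^{\top}

% Keeping only the k largest singular values, a document vector x is projected as
\hat{x} = \Sigma_{k}^{-1} U_{k}^{\top} x
```

Under these assumptions, the projection of the j-th column of X is the j-th row of V_k, so documents that share co-occurring words end up close together even when they have few words in common.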

[0007] In addition, "Projections for Efficient Document Clustering" (authors: Hinrich Schutze and Craig Silverstein; society: ACM; publication: Proceedings of SIGIR; pages: 74-81; year: 1997) performs document classification in the latent semantic space described above. Furthermore, "Representing Documents Using an Explicit Model of Their Similarities" (authors: Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew; publication: Journal of the American Society for Information Science; society: the American Society for Information Science; pages: 254-271; Vol. 46, No. 4; year: 1995) generalizes the method of conversion to the latent semantic space. Using a matrix formed by adding, to the inner product matrix between documents, information such as co-reference information generated from the references that documents make to other documents, it derives an expression space conversion function for projecting documents and words into a space that reflects these similarities.

[0008] Each dimension of the projection space generated by these prior-art methods is a conceptual one in which multiple words are semantically combined. However, which feature dimensions are used for document classification or document retrieval is decided solely by the magnitude of the singular values computed when the singular value decomposition is applied. It is therefore difficult for the operator's intention to be reflected in the selection of the feature dimensions used when classification is executed, and as a result the classification result can differ from the result the operator intended.
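
The selection criterion criticized here can be sketched in a few lines. The function name `top_k_dimensions` and the sample singular values are hypothetical; the sketch only shows that the choice depends on magnitude alone, with no input from the operator.

```python
def top_k_dimensions(singular_values, k):
    """Return the indices of the k largest singular values, in index order."""
    order = sorted(range(len(singular_values)),
                   key=lambda i: singular_values[i],
                   reverse=True)
    return sorted(order[:k])

# Illustrative values only; a real decomposition would supply these.
singular_values = [9.1, 0.4, 5.3, 2.2]
kept = top_k_dimensions(singular_values, 2)
```

Dimension 3 might matter to the operator for a particular classification, but under this criterion it is discarded because its singular value is small.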

[0009] Furthermore, in other conventional document classification methods, in order to perform classification that reflects the semantic relatedness of documents, the part that calculates the expression space conversion function for converting documents and the part that actually classifies the documents converted with that function are processed in one continuous sequence. Because calculating the expression space conversion function is very time-consuming, a single classification run consequently takes an enormous amount of time.

[0010] To solve the problems of the conventional examples described above, an object of the present invention is to provide a document classification device, a document classification method, and a computer-readable recording medium recording a program for causing a computer to execute the method, which enable document classification reflecting the operator's intention to be repeated efficiently in a short time.

[0011]

[Means for Solving the Problems] To solve the problems described above and achieve the object, the document classification device according to the invention of claim 1 comprises: input means for inputting document data; analysis means for analyzing the document data input by the input means to obtain analysis information; vector generation means for generating a document feature vector for the document data based on the analysis information obtained by the analysis means; conversion function calculation means for calculating an expression space conversion function for projecting the document feature vectors generated by the vector generation means into a space that reflects the similarity between document feature vectors; vector conversion means for converting the document feature vectors generated by the vector generation means using the expression space conversion function calculated by the conversion function calculation means; classification means for classifying documents based on the similarity between the document feature vectors converted by the vector conversion means; and classification result storage means for storing the result of the document classification performed by the classification means.

[0012] According to the invention of claim 1, an expression space conversion function for converting each document into an expression space that can reflect the semantic relatedness among the documents is calculated based on the similarity between documents in the document set to be classified, and classification is performed in that expression space. This makes it possible to realize document classification that can reflect the operator's intention.

[0013] The document classification device according to claim 2 is characterized in that, in the invention of claim 1, it comprises inner product calculation means for calculating the inner products between the document feature vectors generated by the vector generation means, and the conversion function calculation means calculates the expression space conversion function using the inner products calculated by the inner product calculation means.

[0014] According to the invention of claim 2, using the inner products between document feature vectors as the inter-document similarity needed to derive the expression space conversion function makes it possible to perform document classification that reflects the semantic relatedness between documents.
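
The inner product calculation can be sketched as follows. The function name `inner_product_matrix` and the sample vectors are assumptions for illustration; the source specifies only that inner products between document feature vectors are calculated.

```python
def inner_product_matrix(vectors):
    """Return G where G[i][j] is the inner product of document vectors i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors]
            for u in vectors]

# Three hypothetical documents over a three-word vocabulary.
doc_vectors = [[1, 1, 0],
               [0, 1, 1],
               [1, 0, 0]]
G = inner_product_matrix(doc_vectors)
```

The resulting matrix is symmetric, and each entry grows with the number of words two documents share, which is why it can serve as the similarity from which a conversion function is derived.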

[0015] The document classification device according to claim 3 is characterized in that, in the invention of claim 2, it comprises inter-document similarity information setting means for setting inter-document similarity information of the document data, such as the author and creation date of the documents input by the input means, and the conversion function calculation means calculates the expression space conversion function using the inner products calculated by the inner product calculation means and the inter-document similarity information set by the inter-document similarity information setting means.

[0016] According to the invention of claim 3, in addition to the inner products between document feature vectors, inter-document similarity information such as the document author and creation date is also used as the inter-document similarity needed to derive the expression space conversion function. This makes it possible to perform document classification that reflects the semantic relatedness between documents.
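
One way to combine the two kinds of similarity is sketched below. The additive bonus for a shared author, the `weight` parameter, and the function name `combined_similarity` are all assumptions; the source states only that inter-document similarity information is used together with the inner products, not how the two are combined.

```python
def combined_similarity(inner_products, metadata, weight=1.0):
    """Add a bonus to the similarity of two distinct documents by the same author."""
    n = len(metadata)
    combined = [row[:] for row in inner_products]  # leave the input untouched
    for i in range(n):
        for j in range(n):
            if i != j and metadata[i]["author"] == metadata[j]["author"]:
                combined[i][j] += weight
    return combined

G = [[2, 1, 1], [1, 2, 0], [1, 0, 1]]            # inner products (illustrative)
metadata = [{"author": "A"}, {"author": "B"}, {"author": "A"}]
S = combined_similarity(G, metadata, weight=0.5)
```

Documents 0 and 2 share an author, so their similarity rises from 1 to 1.5 while every other entry is unchanged.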

[0017] The document classification device according to claim 4 is characterized in that, in the invention of any one of claims 1 to 3, it further comprises vector storage means for storing the document feature vectors generated by the vector generation means, and conversion function storage means for storing the expression space conversion function calculated by the conversion function calculation means.

[0018] According to the invention of claim 4, the calculated document feature vectors and expression space conversion function are stored, so that the part that calculates the expression space conversion function and the part that classifies the documents converted with that function are processed separately. Document classification can therefore be executed without recalculating the expression space conversion function each time. In addition, an expression space conversion function generated in advance from other document feature vectors can be used as the function applied by the document feature vector conversion unit. Repeated document classification can thus be performed efficiently in a short time.
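
The benefit of storing the conversion function can be sketched as a small cache. The class `ConversionFunctionStore` and the stand-in `placeholder_conversion` are hypothetical; the placeholder returns an identity matrix purely so the sketch runs, and is not the calculation the source describes.

```python
class ConversionFunctionStore:
    """Keeps a computed conversion function so later runs can reuse it."""

    def __init__(self, compute):
        self._compute = compute
        self._stored = None
        self.compute_calls = 0  # how many times the costly step actually ran

    def get(self, vectors):
        if self._stored is None:
            self._stored = self._compute(vectors)
            self.compute_calls += 1
        return self._stored

def placeholder_conversion(vectors):
    """Stand-in for the costly calculation; returns an identity matrix."""
    dim = len(vectors[0])
    return [[1.0 if r == c else 0.0 for c in range(dim)] for r in range(dim)]

store = ConversionFunctionStore(placeholder_conversion)
vectors = [[1, 0], [0, 1]]
for _ in range(3):  # three classification runs reuse one stored function
    matrix = store.get(vectors)
```

Only the first run pays for the calculation; later runs read the stored result, which is what allows classification to be repeated quickly.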

[0019] The document classification device according to claim 5 is characterized in that, in the invention of any one of claims 1 to 4, it further comprises vector modification means for modifying the document feature vectors, before the vector conversion means converts them, by operating on the document feature vectors and/or the feature dimensions constituting the document feature vectors using rules composed of the properties of the words extracted by the analysis means.

[0020] According to the invention of claim 5, when document classification is executed repeatedly, the document feature vectors and the feature dimensions constituting them can be operated on for each individual run. This makes it possible to change the range of documents to be classified or the space in which classification is performed, for example by deleting different words for each run.
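
Deleting words for a given run amounts to removing feature dimensions. The sketch below assumes a rule expressed as a predicate on words; the stop-word rule, the sample data, and the function name `drop_dimensions` are illustrative assumptions, since the source does not list concrete rules.

```python
def drop_dimensions(vectors, vocabulary, should_drop):
    """Remove every feature dimension whose word satisfies the rule."""
    keep = [i for i, word in enumerate(vocabulary) if not should_drop(word)]
    new_vocabulary = [vocabulary[i] for i in keep]
    new_vectors = [[vec[i] for i in keep] for vec in vectors]
    return new_vectors, new_vocabulary

vocabulary = ["the", "patent", "vector", "of"]
vectors = [[3, 1, 0, 2],
           [1, 0, 2, 1]]
stop_words = {"the", "of"}  # hypothetical rule for this run
modified, kept_vocabulary = drop_dimensions(
    vectors, vocabulary, lambda word: word in stop_words)
```

A different run could pass a different predicate, for example one based on part of speech, without regenerating the original vectors.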

[0021] The document classification device according to claim 6 is characterized in that, in the invention of claim 5, it comprises conversion function modification means for modifying the expression space conversion function calculated by the conversion function calculation means when the feature dimensions have been changed by the vector modification means modifying the document feature vectors, so that the vector conversion means can properly convert the document feature vectors with the changed feature dimensions.

[0022] According to the invention of claim 6, when the expression space conversion function is calculated based on the inner products of document feature vectors, any inconsistency in the expression space conversion function that arises when the document feature vectors or their feature dimensions are operated on, in the part that classifies the documents converted with that function, can be corrected easily. The document feature vectors can therefore be converted more appropriately.
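
The kind of inconsistency meant here can be pictured under one simplifying assumption: that the conversion function is a matrix with one row per word dimension. If a dimension is removed from the vectors, the matrix no longer lines up until the matching row is removed as well. The source does not say how the correction is performed, so the row deletion below, along with the names `apply_conversion` and `drop_rows`, is an assumption.

```python
def apply_conversion(vector, matrix):
    """Project a row vector: result[c] = sum over r of vector[r] * matrix[r][c]."""
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(len(matrix[0]))]

def drop_rows(matrix, removed):
    """Delete the rows that correspond to removed feature dimensions."""
    return [row for r, row in enumerate(matrix) if r not in removed]

W = [[1.0, 0.0],   # row per word dimension, column per projected dimension
     [0.5, 0.5],
     [0.0, 1.0]]
removed = {1}      # word dimension 1 was deleted from the vectors
vector = [2, 4]    # originally [2, 9, 4] before the deletion
W_fixed = drop_rows(W, removed)
projected = apply_conversion(vector, W_fixed)
```

After the correction the matrix has as many rows as the vector has components, so the conversion can be applied without recalculating the function from scratch.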

[0023] The document classification device according to claim 7 is characterized in that, in the invention of any one of claims 1 to 5, it further comprises conversion function modification instruction means for giving instructions regarding operations on the feature dimensions of the expression space conversion function, and conversion function modification means for modifying the expression space conversion function based on the content of the instructions given by the conversion function modification instruction means regarding operations on the feature dimensions.

[0024] According to the invention of claim 7, the operator performs simple operations on the feature dimensions of the space constructed with the expression space conversion function, which makes it possible to perform document classification that can reflect the operator's intention.

[0025] The document classification device according to claim 8 is characterized in that, in the invention of claim 7, the content of the instruction given by the conversion function modification instruction means regarding operations on the feature dimensions is to operate on the feature dimensions of the expression space conversion function using arbitrary document vector data.

[0026] According to the invention of claim 8, simple operations are performed on the feature dimensions of the space constructed with the expression space conversion function, using arbitrary document vector data, other than the classification targets, designated by the operator. This makes it possible to perform document classification that can reflect the operator's intention.

[0027] The document classification device according to claim 9 is characterized in that, in the invention of claim 7, the content of the instruction given by the conversion function modification instruction means regarding operations on the feature dimensions is to operate on the feature dimensions of the expression space conversion function using document feature vectors.

[0028] According to the invention of claim 9, simple operations are performed on the feature dimensions of the space constructed with the expression space conversion function, using document feature vectors designated by the operator. This makes it possible to perform document classification that can reflect the operator's intention.

[0029] The document classification device according to claim 10 is characterized in that, in the invention of claim 7, the content of the instruction given by the conversion function modification instruction means regarding operations on the feature dimensions is to operate on the feature dimensions of the expression space conversion function using the analysis information obtained by the analysis means.

[0030] According to the invention of claim 10, simple operations are performed on the feature dimensions of the space constructed with the expression space conversion function, using analysis information designated by the operator. This makes it possible to perform document classification that can reflect the operator's intention.

[0031] The document classification device according to claim 11 is characterized in that, in the invention of claim 7, the content of the instruction given by the conversion function modification instruction means regarding operations on the feature dimensions is to operate on the feature dimensions of the expression space conversion function using the classification result stored by the classification result storage means.

[0032] According to the invention of claim 11, simple operations are performed on the feature dimensions of the space constructed with the expression space conversion function, using a previously obtained classification result designated by the operator. This makes it possible to perform document classification that can reflect the operator's intention.

[0033] The document classification device according to claim 12 is characterized in that, in the invention of any one of claims 1 to 11, it comprises initial centroid designation means for designating initial cluster centroids and initial centroid registration means for registering the initial cluster centroids designated by the initial centroid designation means, and the classification means classifies documents according to the initial cluster centroids registered by the initial centroid registration means.

[0034] According to the invention of claim 12, a non-hierarchical clustering method is used as the document classification method, the operator can freely designate the initial cluster centroids that the method requires, and documents are classified according to the designated initial cluster centroids. This makes it possible to perform document classification that reflects the operator's intention.
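
A minimal sketch of non-hierarchical clustering with operator-supplied starting points follows. It uses k-means as one well-known non-hierarchical method; the source does not name a specific algorithm, so that choice, the function names, and the sample data are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, initial_centroids, iterations=10):
    """Cluster vectors, starting from centroids chosen by the operator."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(v, centroids[k]))
                  for v in vectors]
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

vectors = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
labels, centroids = kmeans(vectors, initial_centroids=[[0.0, 0.0], [1.0, 1.0]])
```

Because the starting centroids determine which grouping the iteration converges to, letting the operator choose them is a direct way to steer the result toward the intended categories.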

[0035] The document classification device according to claim 13 is characterized in that, in the invention of claim 12, arbitrary document vector data is designated as the initial cluster centroids designated by the initial centroid designation means.

[0036] According to the invention of claim 13, a non-hierarchical clustering method is used as the document classification method, and arbitrary documents other than the classification targets can be used as the initial cluster centroids that the method requires. This makes it possible to perform document classification that reflects the operator's intention.

[0037] The document classification device according to claim 14 is characterized in that, in the invention of claim 12, document feature vectors are designated as the initial cluster centroids designated by the initial centroid designation means.

[0038] According to the invention of claim 14, a non-hierarchical clustering method is used as the document classification method, and document feature vectors can be used as the initial cluster centroids that the method requires. This makes it possible to perform document classification that reflects the operator's intention.

[0039] The document classification device according to claim 15 is characterized in that, in the invention of claim 12, the analysis information obtained by the analysis means is designated as the initial cluster centroids designated by the initial centroid designation means.

[0040] According to the invention of claim 15, a non-hierarchical clustering method is used as the document classification method, and analysis information, such as the words obtained by passing the documents to be classified through the document analysis unit, can be used as the initial cluster centroids that the method requires. This makes it possible to perform document classification that reflects the operator's intention.
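
Turning words into a starting centroid can be sketched as building an indicator vector over the vocabulary. The function name `centroid_from_words`, the vocabulary, and the seed words are hypothetical, and the source does not specify how analysis information is mapped to a centroid.

```python
def centroid_from_words(seed_words, vocabulary):
    """Build a centroid with weight 1.0 on each seed word's dimension."""
    return [1.0 if word in seed_words else 0.0 for word in vocabulary]

vocabulary = ["engine", "motor", "stock", "bond"]
initial_centroids = [
    centroid_from_words({"engine", "motor"}, vocabulary),  # a "machinery" category
    centroid_from_words({"stock", "bond"}, vocabulary),    # a "finance" category
]
```

These centroids live in the word space; in a device that classifies in the converted space, they would presumably be passed through the same conversion as the document vectors before use.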

[0041] The document classification device according to claim 16 is characterized in that, in the invention of claim 12, the classification result stored by the classification result storage means is designated as the initial cluster centroids designated by the initial centroid designation means.

[0042] According to the invention of claim 16, a non-hierarchical clustering method is used as the document classification method, and a previously obtained classification result can be used as the initial cluster centroids that the method requires. This makes it possible to perform document classification that reflects the operator's intention.
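
Reusing an earlier result as a starting point can be sketched by averaging the vectors in each stored cluster. Taking the mean of each cluster is an assumption, as are the function name `centroids_from_result` and the sample data; the source says only that a stored classification result is designated as the initial cluster centroids.

```python
def centroids_from_result(vectors, labels):
    """Compute one centroid per cluster label as the mean of its member vectors."""
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in clusters.items()}

vectors = [[0.0, 2.0], [2.0, 2.0], [4.0, 0.0]]
previous_labels = ["a", "a", "b"]  # stored result of an earlier run
initial = centroids_from_result(vectors, previous_labels)
```

Starting the next run from these centroids lets the operator refine an earlier classification instead of beginning again from arbitrary positions.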

[0043] The document classification method according to claim 17 is characterized by including: a first step of inputting document data; a second step of analyzing the document data input in the first step to obtain analysis information; a third step of generating a document feature vector for the document data based on the analysis information obtained in the second step; a fourth step of calculating an expression space conversion function for projecting the document feature vectors generated in the third step into a space that reflects the similarity between document feature vectors; a fifth step of converting the document feature vectors generated in the third step using the expression space conversion function calculated in the fourth step; a sixth step of classifying documents based on the similarity between the document feature vectors converted in the fifth step; and a seventh step of storing the result of the document classification performed in the sixth step.

[0044] According to the invention of claim 17, an expression space conversion function for converting each document into an expression space that can reflect the semantic relatedness among the documents is calculated based on the similarity between documents in the document set to be classified, and classification is performed in that expression space. This makes it possible to realize document classification that can reflect the operator's intention.

[0045] The document classification method according to claim 18 is characterized in that, in the invention of claim 17, it includes an eighth step of calculating the inner products between the document feature vectors generated in the third step, and the fourth step calculates the expression space conversion function using the inner products calculated in the eighth step.

【００４６】この請求項１８の発明によれば、表現空間
変換関数を導出する際に必要となる文書間の類似性とし
て文書特徴ベクトル間の内積をもちいることにより、文
書間の意味的な関連性を反映した文書分類をおこなうこ
とが可能である。According to the invention of claim 18, the inner products between document feature vectors are used as the inter-document similarity required for deriving the expression space conversion function, which makes it possible to perform document classification that reflects the semantic relevance between documents.
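
As a minimal illustrative sketch (not the patented implementation), the inner products of the eighth step can be computed pairwise over the document feature vectors. The function name `inner_products` and the sample word counts are assumptions introduced only for this example.

```python
def inner_products(vectors):
    """Return the matrix G with G[i][j] = inner product of vectors[i] and vectors[j]."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


# Hypothetical feature vectors: word occurrence counts for word IDs 0, 1, 2.
docs = [
    [2, 0, 1],  # document 0
    [1, 1, 0],  # document 1
    [0, 3, 1],  # document 2
]

# gram[i][j] is large when documents i and j share frequently occurring words.
gram = inner_products(docs)
```

The resulting matrix is symmetric, so an implementation could compute only the upper triangle; the full matrix is kept here for readability.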

【００４７】また、請求項１９に係る文書分類方法は、
請求項１８の発明において、前記第１工程により入力さ
れた文書の作成者、作成日等の文書データの文書間類似
情報を設定する第９工程を含み、前記第４工程は、前記
第８工程により算出された内積および前記第９工程によ
り設定された文書間類似情報をもちいて表現空間変換関
数を算出することを特徴とする。The document classification method according to claim 19 is the invention of claim 18, further including a ninth step of setting inter-document similarity information of the document data, such as the creator and creation date of the documents input in the first step, wherein the fourth step calculates the expression space conversion function using the inner products calculated in the eighth step and the inter-document similarity information set in the ninth step.

【００４８】この請求項１９の発明によれば、表現空間
変換関数を導出する際に必要となる文書間の類似性とし
て文書特徴ベクトル間の内積に加え、文書の作成者や作
成日などの文書間類似情報ももちいることにより、文書
間の意味的な関連性を反映した文書分類をおこなうこと
が可能である。According to the invention of claim 19, inter-document similarity information such as the creator and creation date of each document is used, in addition to the inner products between document feature vectors, as the inter-document similarity required for deriving the expression space conversion function, which makes it possible to perform document classification that reflects the semantic relevance between documents.

【００４９】また、請求項２０に係る文書分類方法は、
請求項１７〜１９のいずれか一つの発明において、さら
に、前記第３工程により生成された文書特徴ベクトルを
記憶する第１０工程と、前記第４工程により算出された
表現空間変換関数を記憶する第１１工程と、を含んだこ
とを特徴とする。The document classification method according to claim 20 is the invention of any one of claims 17 to 19, further including a tenth step of storing the document feature vectors generated in the third step and an eleventh step of storing the expression space conversion function calculated in the fourth step.

【００５０】この請求項２０の発明によれば、算出する
文書特徴ベクトルと表現空間変換関数を記憶することに
より、表現空間変換関数を算出する部分と実際に前記表
現空間変換関数をもちいて変換された文書をもちいて文
書分類をおこなう部分とを分離して処理するので、その
都度、表現空間変換関数を算出することなしに文書分類
を実行でき、さらに、前記文書特徴ベクトル変換部でも
ちいる表現空間変換関数として、事前に他の文書特徴ベ
クトルに基づいて生成された表現空間変換関数をもちい
ることもできるため、文書分類の繰り返し実行を短時間
で効率良くおこなうことが可能である。According to the invention of claim 20, storing the calculated document feature vectors and the expression space conversion function separates the part that calculates the expression space conversion function from the part that classifies documents using the vectors converted by that function. Document classification can therefore be executed without recalculating the expression space conversion function each time. In addition, an expression space conversion function generated in advance from other document feature vectors can be used as the function applied in the vector conversion, so repeated document classification can be performed efficiently in a short time.
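
The separation described above can be sketched as follows. This is only an illustration under stated assumptions: the conversion function is assumed to be a linear map stored as a list of matrix rows, JSON is an arbitrary choice of storage format, and the names `save_model`, `load_model`, and `apply_transform` are hypothetical.

```python
import json
import os
import tempfile


def save_model(path, doc_vectors, transform):
    """Store the feature vectors (tenth step) and the conversion function (eleventh step)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"doc_vectors": doc_vectors, "transform": transform}, f)


def load_model(path):
    """Reload the stored feature vectors and conversion function."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["doc_vectors"], data["transform"]


def apply_transform(transform, vector):
    """Apply a linear conversion function given as a list of matrix rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in transform]


# Hypothetical data: 2 documents over 3 word dimensions, and a 2x3 conversion matrix.
path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(path, [[2, 0, 1], [1, 1, 0]], [[1, 0, 1], [0, 1, 0]])

# A later classification run reloads both instead of recalculating the function.
vectors, transform = load_model(path)
converted = [apply_transform(transform, v) for v in vectors]
```

Because the stored function is independent of the run that produced it, the same file could also supply a function computed from a different document set, as the paragraph above notes.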

【００５１】また、請求項２１に係る文書分類方法は、
請求項１７〜２０のいずれか一つの発明において、さら
に、前記第５工程により文書特徴ベクトルを変更する前
に、前記第２工程により抽出される単語が有する特性に
より構成される規則をもちいて前記文書特徴ベクトルお
よび／または文書特徴ベクトルを構成する特徴次元を操
作することにより前記文書特徴ベクトルを修正する第１
２工程を含んだことを特徴とする。The document classification method according to claim 21 is the invention of any one of claims 17 to 20, further including a twelfth step of modifying the document feature vectors, before they are converted in the fifth step, by manipulating the document feature vectors and/or the feature dimensions constituting them using rules based on the characteristics of the words extracted in the second step.

【００５２】この請求項２１の発明によれば、文書分類
の繰り返し実行をおこなう際、個々の分類実行ごとに、
文書特徴ベクトルやそれらを構成する特徴次元を操作す
ることで、各分類ごとに異なる単語を削除して文書分類
を実行する等の分類対象文書の範囲の変更や分類をおこ
なう空間の変更をおこなうことが可能である。According to the invention of claim 21, when document classification is executed repeatedly, manipulating the document feature vectors and the feature dimensions constituting them for each individual classification run makes it possible to change the range of documents to be classified or the space in which classification is performed, for example by deleting different words for each classification run.
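
A rule of this kind can be sketched as a filter over feature dimensions. The example below assumes, purely for illustration, that the rule excludes words by part of speech; the function name `drop_dimensions` and the sample data are hypothetical.

```python
def drop_dimensions(doc_vectors, word_pos, excluded_pos):
    """Remove every feature dimension whose word has an excluded part of speech.

    doc_vectors: one list of word counts per document (index = word ID).
    word_pos:    part of speech for each word ID.
    Returns the reduced vectors and the list of word IDs that were kept.
    """
    keep = [i for i, pos in enumerate(word_pos) if pos not in excluded_pos]
    reduced = [[vec[i] for i in keep] for vec in doc_vectors]
    return reduced, keep


# Hypothetical analysis information for three word IDs.
word_pos = ["noun", "particle", "verb"]
doc_vectors = [[2, 5, 1], [0, 3, 4]]

# For this classification run, particles are excluded from the feature space.
reduced, keep = drop_dimensions(doc_vectors, word_pos, {"particle"})
```

A different run could pass a different `excluded_pos` set, which corresponds to deleting different words for each classification run.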

【００５３】また、請求項２２に係る文書分類方法は、
請求項２１の発明において、前記第１２工程において文
書特徴ベクトルを修正することにより特徴次元が変更さ
れた場合に、前記変更された特徴次元により第５工程に
おいて前記文書特徴ベクトルが適切に変換できるよう
に、前記第４工程により算出された表現空間変換関数を
修正する第１３工程を含んだことを特徴とする。The document classification method according to claim 22 is the invention of claim 21, further including a thirteenth step of correcting the expression space conversion function calculated in the fourth step so that, when the feature dimensions have been changed by modifying the document feature vectors in the twelfth step, the document feature vectors can still be converted appropriately in the fifth step with the changed feature dimensions.

【００５４】この請求項２２の発明によれば、表現空間
変換関数が文書特徴ベクトルの内積をに基づいて算出さ
れる場合、表現空間変換関数をもちいて変換された文書
をもちいて文書分類をおこなう部分において、文書特徴
ベクトルやその特徴次元が操作された場合に生じる表現
空間変換関数の不整合を簡便に修正することができるの
で、より適正な文書特徴ベクトルの変換をおこなうこと
が可能となる。According to the invention of claim 22, when the expression space conversion function is calculated based on the inner products of the document feature vectors, any inconsistency in the expression space conversion function that arises when the document feature vectors or their feature dimensions are manipulated in the classification part can be corrected simply, so the document feature vectors can be converted more appropriately.
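
One simple way to picture such a correction is shown below. It assumes the conversion function is a linear map stored as matrix rows, so that removing feature dimensions from the vectors corresponds to removing the matching columns of the matrix. This is an assumption made for illustration; the text here does not state how the correction is actually carried out, and the names `restrict_transform` and `apply_transform` are hypothetical.

```python
def restrict_transform(transform, keep):
    """Keep only the matrix columns whose feature dimensions survive the modification."""
    return [[row[i] for i in keep] for row in transform]


def apply_transform(transform, vector):
    """Apply a linear conversion function given as a list of matrix rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in transform]


# Hypothetical 2x3 conversion matrix computed over word IDs 0, 1, 2.
transform = [[1, 2, 3], [4, 5, 6]]

# Word ID 1 was deleted from the feature vectors, so only IDs 0 and 2 remain.
keep = [0, 2]
fixed = restrict_transform(transform, keep)

# The corrected function now matches the 2-dimensional modified vectors.
converted = apply_transform(fixed, [2, 1])
```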

【００５５】また、請求項２３に係る文書分類方法は、
請求項１７〜２１のいずれか一つの発明において、さら
に、前記表現空間変換関数の特徴次元の操作に関する指
示をする第１４工程と、前記第１４工程により指示され
た特徴次元の操作に関する指示内容に基づいて、前記表
現空間変換関数を修正する第１５工程と、を含んだこと
を特徴とする。The document classification method according to claim 23 is the invention of any one of claims 17 to 21, further including a fourteenth step of giving an instruction regarding manipulation of the feature dimensions of the expression space conversion function, and a fifteenth step of modifying the expression space conversion function based on the content of the instruction given in the fourteenth step.

【００５６】この請求項２３の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元につ
いて操作者が簡便な操作をすることにより、操作者の意
図を反映しうる文書分類をおこなうことが可能である。According to the invention of claim 23, the operator can perform a simple operation on the feature dimensions of the space constructed using the expression space conversion function, which makes it possible to perform document classification that can reflect the operator's intention.

【００５７】また、請求項２４に係る文書分類方法は、
請求項２３の発明において、前記第１５工程により指示
された特徴次元の操作に関する指示内容が、任意の文書
ベクトルデータをもちいて前記表現空間変換関数の特徴
次元を操作するものであることを特徴とする。The document classification method according to claim 24 is the invention of claim 23, wherein the instruction regarding manipulation of the feature dimensions specified in the fifteenth step manipulates the feature dimensions of the expression space conversion function using arbitrary document vector data.

【００５８】この請求項２４の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元につ
いて、操作者により指示された分類対象以外の任意の文
書ベクトルデータをもちいての簡便な操作をすることに
より、操作者の意図を反映しうる文書分類をおこなうこ
とが可能である。According to the invention of claim 24, the feature dimensions of the space constructed using the expression space conversion function can be manipulated simply, using arbitrary document vector data other than the classification targets specified by the operator, which makes it possible to perform document classification that can reflect the operator's intention.

【００５９】また、請求項２５に係る文書分類方法は、
請求項２３の発明において、前記第１５工程により指示
された特徴次元の操作に関する指示内容が、文書特徴ベ
クトルをもちいて前記表現空間変換関数の特徴次元を操
作するものであることを特徴とする。The document classification method according to claim 25 is the invention of claim 23, wherein the instruction regarding manipulation of the feature dimensions specified in the fifteenth step manipulates the feature dimensions of the expression space conversion function using document feature vectors.

【００６０】この請求項２５の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元につ
いて、操作者により指示された文書特徴ベクトルをもち
いての簡便な操作をすることにより、操作者の意図を反
映しうる文書分類をおこなうことが可能である。According to the twenty-fifth aspect of the present invention, a simple operation using a document feature vector specified by an operator is performed on a feature dimension of a space formed by using the expression space conversion function. It is possible to classify documents that can reflect the intention of the operator.

【００６１】また、請求項２６に係る文書分類方法は、
請求項２３の発明において、前記第１５工程により指示
された特徴次元の操作に関する指示内容が、前記第２工
程により得られた解析情報をもちいて前記表現空間変換
関数の特徴次元を操作するものであることを特徴とす
る。The document classification method according to claim 26 is the invention of claim 23, wherein the instruction regarding manipulation of the feature dimensions specified in the fifteenth step manipulates the feature dimensions of the expression space conversion function using the analysis information obtained in the second step.

【００６２】この請求項２６の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元を、
操作者により指示された解析情報をもちいての簡便な操
作をすることにより、操作者の意図を反映しうる文書分
類をおこなうことが可能である。According to the invention of claim 26, the feature dimensions of the space constructed using the expression space conversion function can be manipulated simply, using analysis information specified by the operator, which makes it possible to perform document classification that can reflect the operator's intention.

【００６３】また、請求項２７に係る文書分類方法は、
請求項２３の発明において、前記第１５工程により指示
された特徴次元の操作に関する指示内容が、前記第７工
程により記憶された分類結果をもちいて前記表現空間変
換関数の特徴次元を操作するものであることを特徴とす
る。The document classification method according to claim 27 is the invention of claim 23, wherein the instruction regarding manipulation of the feature dimensions specified in the fifteenth step manipulates the feature dimensions of the expression space conversion function using the classification result stored in the seventh step.

【００６４】この請求項２７の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元を、
操作者により指示された事前に分類された分類結果をも
ちいての簡便な操作をすることにより、操作者の意図を
反映しうる文書分類をおこなうことが可能である。According to the invention of claim 27, the feature dimensions of the space constructed using the expression space conversion function can be manipulated simply, using previously obtained classification results specified by the operator, which makes it possible to perform document classification that can reflect the operator's intention.

【００６５】また、請求項２８に係る文書分類方法は、
請求項１７〜２７のいずれか一つの発明において、初期
クラスタ重心を指定する第１６工程と、前記第１６工程
により指定された初期クラスタ重心を登録する第１７工
程とを含み、前記第６工程は、前記第１７工程により登
録された初期クラスタ重心にしたがって文書を分類する
ことを特徴とする。The document classification method according to claim 28 is the invention of any one of claims 17 to 27, further including a sixteenth step of specifying initial cluster centroids and a seventeenth step of registering the initial cluster centroids specified in the sixteenth step, wherein the sixth step classifies the documents according to the initial cluster centroids registered in the seventeenth step.

【００６６】この請求項２８の発明によれば、文書分類
手法として、非階層型クラスタリング手法をもちいて、
その際に必要となる初期クラスタ重心を、操作者が任意
に指定することができ、その指定された初期クラスタ重
心にしたがって文書分類をおこなうので、操作者の意図
を反映する文書分類をおこなうことが可能である。According to the invention of claim 28, a non-hierarchical clustering method is used as the document classification method, and the operator can freely specify the initial cluster centroids that it requires. Because the documents are classified according to the specified initial cluster centroids, document classification that reflects the operator's intention is possible.
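
The effect of operator-specified initial centroids can be illustrated with k-means, a common non-hierarchical clustering method. The patent text here does not name a specific algorithm, so k-means is an assumption, and the function names and sample vectors below are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, initial_centroids, max_iter=100):
    """Cluster vectors, starting from centroids supplied by the operator."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assign each vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Move each centroid to the mean of its members (empty clusters stay put).
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical converted document vectors and operator-chosen starting points.
vectors = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels, centroids = kmeans(vectors, initial_centroids=[[0, 0], [5, 5]])
```

Since k-means converges to a result that depends on where it starts, choosing the initial centroids is one concrete way for an operator to steer the grouping.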

【００６７】また、請求項２９に係る文書分類方法は、
請求項２８の発明において、前記第１６工程により指定
される初期クラスタ重心として任意の文書ベクトルデー
タを指定することを特徴とする。The document classification method according to claim 29 is the invention of claim 28, wherein arbitrary document vector data is specified as the initial cluster centroids specified in the sixteenth step.

【００６８】この請求項２９の発明によれば、文書分類
手法として、非階層型クラスタリング手法をもちいて、
その際に必要となる初期クラスタ重心として、分類対象
以外の任意の文書をもちいることができるので、操作者
の意図を反映する文書分類をおこなうことが可能であ
る。According to the invention of claim 29, a non-hierarchical clustering method is used as the document classification method, and any document other than the classification targets can be used as the initial cluster centroids that it requires, so document classification that reflects the operator's intention is possible.

【００６９】また、請求項３０に係る文書分類方法は、
請求項２８の発明において、前記第１６工程により指定
される初期クラスタ重心として文書特徴ベクトルを指定
することを特徴とする。The document classification method according to claim 30 is the invention of claim 28, wherein document feature vectors are specified as the initial cluster centroids specified in the sixteenth step.

【００７０】この請求項３０の発明によれば、文書分類
手法として、非階層型クラスタリング手法をもちいて、
その際に必要となる初期クラスタ重心として、文書特徴
ベクトルをもちいることができるので、操作者の意図を
反映する文書分類をおこなうことが可能である。According to the invention of claim 30, a non-hierarchical clustering method is used as the document classification method, and document feature vectors can be used as the initial cluster centroids that it requires, so document classification that reflects the operator's intention is possible.

【００７１】また、請求項３１に係る文書分類方法は、
請求項２８の発明において、前記第１６工程により指定
される初期クラスタ重心として前記第２工程により得ら
れた解析情報を指定することを特徴とする。The document classification method according to claim 31 is the invention of claim 28, wherein the analysis information obtained in the second step is specified as the initial cluster centroids specified in the sixteenth step.

【００７２】この請求項３１の発明によれば、文書分類
手法として、非階層型クラスタリング手法をもちいて、
その際に必要となる初期クラスタ重心として、分類対象
文書を文書解析部に作用させた結果得られる単語等の解
析情報をもちいることができるので、操作者の意図を反
映する文書分類をおこなうことが可能である。According to the invention of claim 31, a non-hierarchical clustering method is used as the document classification method, and analysis information such as the words obtained by applying the document analysis unit to the documents to be classified can be used as the initial cluster centroids that it requires, so document classification that reflects the operator's intention is possible.

【００７３】また、請求項３２に係る文書分類方法は、
請求項２８の発明において、前記第１６工程により指定
される初期クラスタ重心として前記第７工程により記憶
された分類結果を指定することを特徴とする。The document classification method according to claim 32 is the invention of claim 28, wherein the classification result stored in the seventh step is specified as the initial cluster centroids specified in the sixteenth step.

【００７４】この請求項３２の発明によれば、文書分類
手法として、非階層型クラスタリング手法をもちいて、
その際に必要となる初期クラスタ重心として、事前に分
類された分類結果をもちいることができるので、操作者
の意図を反映する文書分類をおこなうことが可能であ
る。According to the invention of claim 32, a non-hierarchical clustering method is used as the document classification method, and previously obtained classification results can be used as the initial cluster centroids that it requires, so document classification that reflects the operator's intention is possible.

【００７５】また、請求項３３の発明に係る記憶媒体
は、請求項１７〜３２に記載された方法をコンピュータ
に実行させるプログラムを記録したことで、そのプログ
ラムを機械読み取り可能となり、これによって、請求項
１７〜３２の動作をコンピュータによって実現すること
が可能である。The storage medium according to the invention of claim 33 records a program that causes a computer to execute the method according to any of claims 17 to 32, so that the program is machine-readable, and the operations of claims 17 to 32 can thereby be realized by a computer.

【００７６】[0076]

【発明の実施の形態】以下に添付図面を参照して、この
発明に係る文書分類装置、文書分類方法およびその方法
をコンピュータに実行させるプログラムを記録したコン
ピュータ読み取り可能な記録媒体の好適な実施の形態を
詳細に説明する。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Preferred embodiments of a document classification device, a document classification method, and a computer-readable recording medium storing a program for causing a computer to execute the method according to the present invention are described in detail below with reference to the accompanying drawings.

【００７７】（実施の形態１）まず、この発明の実施の
形態１による文書分類装置を構成する情報処理システム
全体のハードウエア構成を説明する。図１は、実施の形
態１による文書分類装置を構成する情報処理システム全
体のハードウエア構成を示す説明図である。(Embodiment 1) First, the hardware configuration of the entire information processing system constituting the document classification device according to Embodiment 1 of the present invention will be described. FIG. 1 is an explanatory diagram showing the hardware configuration of the entire information processing system that constitutes the document classification device according to the first embodiment.

【００７８】図１において、実施の形態１による文書分
類装置を構成する情報処理システムは、サーバー／クラ
イアント方式で構成されている。すなわち、サーバー１
０１と複数のクライアント１０２がネットワーク１０３
によって接続されている。クライアント１０２は、分類
データの生成、サーバー１０１への指示、分類結果の表
示などをおこなう。一方、クライアント１０２からの指
示にしたがって、サーバー１０１は文書（テキスト）分
類に関する処理を膨大な数値演算によりおこない、その
処理の結果をクライアント１０２へ送る。In FIG. 1, the information processing system constituting the document classification device according to the first embodiment has a server/client configuration: a server 101 and a plurality of clients 102 are connected by a network 103. The client 102 generates classification data, issues instructions to the server 101, displays classification results, and so on. The server 101, in accordance with instructions from the client 102, performs the processing for document (text) classification, which involves a very large amount of numerical computation, and sends the results to the client 102.

【００７９】より具体的には、サーバー１０１において
は、テキスト分類処理（前処理、クラスタリング処理）
がおこなわれ、クライアント１０２においては、分類デ
ータ生成、処理実行指示、テキスト分類結果表示等がお
こなわれる。サーバー１０１における処理は、上述のよ
うに、「前処理」と「分類処理」の２つに分かれてお
り、その処理はデータによっては非常に負荷が大きくな
る。したがって、サーバー１０１は「前処理」と「分類
処理」がそれぞれ一つずつしか処理をおこなわないよう
にマネージャプロセスが処理受付リストを作成して管理
する。「前処理」および「分類処理」の詳細については
後述する。More specifically, the server 101 performs the text classification processing (preprocessing and clustering processing), while the client 102 performs classification data generation, processing execution instructions, display of text classification results, and the like. As noted above, the processing on the server 101 is divided into "preprocessing" and "classification processing", and depending on the data the load can be very heavy. The server 101 therefore has a manager process that creates and manages a processing reception list so that only one "preprocessing" job and one "classification processing" job run at a time. Details of "preprocessing" and "classification processing" are given later.
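
The scheduling rule in this paragraph (at most one preprocessing job and one classification job running at a time, with the rest waiting on a reception list) can be sketched as follows. The class and method names are hypothetical, and the sketch models only the bookkeeping, not the actual processes on the server 101.

```python
from collections import deque


class Manager:
    """Keeps one reception list per job kind and runs one job of each kind at a time."""

    KINDS = ("preprocess", "classify")

    def __init__(self):
        self.pending = {kind: deque() for kind in self.KINDS}
        self.running = {kind: None for kind in self.KINDS}

    def submit(self, kind, job_id):
        """Accept a request from a client and start it if that kind is idle."""
        self.pending[kind].append(job_id)
        self._dispatch(kind)

    def finish(self, kind):
        """Mark the running job of this kind as done and start the next one, if any."""
        done = self.running[kind]
        self.running[kind] = None
        self._dispatch(kind)
        return done

    def _dispatch(self, kind):
        if self.running[kind] is None and self.pending[kind]:
            self.running[kind] = self.pending[kind].popleft()


manager = Manager()
manager.submit("preprocess", "job-a")
manager.submit("preprocess", "job-b")  # waits: a preprocessing job is already running
manager.submit("classify", "job-c")    # starts: classification is a separate slot
```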

【００８０】また、サーバー１０１とクライアント１０
２との間のデータのやりとりはファイル共有という方法
をもちいる。すなわち、分類処理にもちいるファイルを
サーバー１０１上の共有フォルダに作成することにより
両者はデータのやりとりをおこなう。したがって、クラ
イアント１０２からはサーバー１０１の共有フォルダを
ネットワーク共有して利用することが可能である。Data is exchanged between the server 101 and the client 102 by file sharing. That is, the two exchange data by creating the files used for classification processing in a shared folder on the server 101. The client 102 can therefore use the shared folder of the server 101 as a network share.

【００８１】つぎに、サーバー１０１およびクライアン
ト１０２のハードウエア構成について説明する。図２
は、実施の形態１による文書分類装置を構成する情報処
理システムにおけるサーバー１０１をハードウエア的に
示す説明図である。サーバー１０１は、たとえばワーク
ステーション（ＷＳ）等がもちいられる。Next, the hardware configuration of the server 101 and the client 102 is described. FIG. 2 is an explanatory diagram showing the hardware of the server 101 in the information processing system constituting the document classification device according to the first embodiment. A workstation (WS) or the like is used as the server 101, for example.

【００８２】図２において、２０１はサーバー１０１全
体を制御するＣＰＵを、２０２はブートプログラム等を
記憶したＲＯＭを、２０３はＣＰＵ２０１のワークエリ
アとして使用されるＲＡＭ２０３を、２０４は通信回線
２０５を介してネットワーク１０３に接続され、そのネ
ットワーク１０３と内部のインターフェイスを司るイン
ターフェイス（Ｉ／Ｆ）を、２０６はデータを記憶する
ディスク装置を示している。２００は上記各部を結合さ
せるためのバスを示している。In FIG. 2, 201 denotes a CPU that controls the entire server 101; 202, a ROM that stores a boot program and the like; 203, a RAM used as a work area of the CPU 201; 204, an interface (I/F) that is connected to the network 103 via a communication line 205 and serves as the internal interface to the network 103; and 206, a disk device that stores data. 200 denotes a bus that connects these components.

【００８３】そのほか、文書情報、画像情報、機能情報
等を表示するディスプレイ２０８や，データを入力する
ためのキーボード２０９およびマウス２１０等が同様に
接続されていてもよい。さらに、ディスク装置２０６に
は、クライアント１０２との間のデータのやりとりをす
るための共有フォルダ２０７が設けられている。In addition, a display 208 for displaying document information, image information, function information, and the like, a keyboard 209 and a mouse 210 for inputting data, and the like may be similarly connected. Further, the disk device 206 is provided with a shared folder 207 for exchanging data with the client 102.

【００８４】また、図３は、実施の形態１による文書分
類装置を構成する情報処理システムにおけるクライアン
ト１０２をハードウエア的に示す説明図である。クライ
アント１０２は、たとえばパーソナルコンピュータ（Ｐ
Ｃ）等がもちいられる。FIG. 3 is an explanatory diagram showing the hardware of the client 102 in the information processing system constituting the document classification device according to the first embodiment. A personal computer (PC) or the like is used as the client 102, for example.

【００８５】図３において、３０１はシステム全体を制
御するＣＰＵを、３０２はブートプログラム等を記憶し
たＲＯＭを、３０３はＣＰＵ３０１のワークエリアとし
て使用されるＲＡＭを、３０４はＣＰＵ３０１の制御に
したがってＨＤ（ハードディスク）３０５に対するデー
タのリード／ライトを制御するＨＤＤ（ハードディスク
ドライブ）を、３０５はＨＤＤ３０４の制御で書き込ま
れたデータを記憶するＨＤを、３０６はＣＰＵ３０１の
制御にしたがってＦＤ（フロッピーディスク）３０７に
対するデータのリード／ライトを制御するＦＤＤ（フロ
ッピーディスクドライブ）を、３０７はＦＤＤ３０６の
制御で書き込まれたデータを記憶する着脱自在のＦＤ
を、３０８はドキュメント、画像、機能情報等を表示す
るディスプレイをそれぞれ示している。In FIG. 3, 301 denotes a CPU that controls the entire system; 302, a ROM that stores a boot program and the like; 303, a RAM used as a work area of the CPU 301; 304, an HDD (hard disk drive) that controls reading and writing of data to and from the HD (hard disk) 305 under the control of the CPU 301; 305, the HD that stores data written under the control of the HDD 304; 306, an FDD (floppy disk drive) that controls reading and writing of data to and from the FD (floppy disk) 307 under the control of the CPU 301; 307, the removable FD that stores data written under the control of the FDD 306; and 308, a display that displays documents, images, function information, and the like.

【００８６】また、３０９は通信回線３１０を介してネ
ットワーク１０３に接続され、そのネットワーク１０３
と内部のインターフェイスを司るインターフェイス（Ｉ
／Ｆ）を、３１１は文字、数値、各種指示等の入力のた
めのキーを備えたキーボードを、３１２はカーソルの移
動や範囲選択、あるいは表示画面に表示されたアイコン
やボタンの押下やウインドウの移動やサイズの変更等を
おこなうマウスを、３１３はＯＣＲ（Ｏｐｔｉｃａｌ
ＣｈａｒａｃｔｅｒＲｅａｄｅｒ）機能を備えた画像
を光学的に読み取るスキャナを、３１４は分類結果を含
むデータの内容等を印刷するプリンタを、３１５は上記
各部を結合するためのバスをそれぞれ示している。ま
た、ＨＤ３０５にはワープロソフトや表計算ソフト等の
アプリケーションソフト３１６が記憶されている。Further, 309 denotes an interface (I/F) that is connected to the network 103 via a communication line 310 and serves as the internal interface to the network 103; 311, a keyboard with keys for entering characters, numerical values, various instructions, and the like; 312, a mouse used to move the cursor, select ranges, press icons and buttons shown on the display screen, and move or resize windows; 313, a scanner that optically reads images and has an OCR (Optical Character Reader) function; 314, a printer that prints the contents of data, including classification results; and 315, a bus that connects these components. The HD 305 stores application software 316 such as word processing software and spreadsheet software.

【００８７】つぎに、実施の形態１による文書分類装置
の機能的構成について説明する。図４〜図６は、実施の
形態１による文書分類装置の構成を機能的に示すブロッ
ク図である。図４において、文書分類装置は、入力部４
０１と、解析部４０２と、ベクトル生成部４０３と、変
換関数算出部４０４と、ベクトル変換部４０５と、分類
部４０６と、分類結果記憶部４０７を含む構成である。Next, the functional configuration of the document classification device according to the first embodiment is described. FIGS. 4 to 6 are block diagrams functionally showing the configuration of the document classification device according to the first embodiment. In FIG. 4, the document classification device includes an input unit 401, an analysis unit 402, a vector generation unit 403, a conversion function calculation unit 404, a vector conversion unit 405, a classification unit 406, and a classification result storage unit 407.

【００８８】さらに、入力部４０１と解析部４０２との
間には、文書データ中の表記の揺れ等を吸収する図示し
ない第１フィルタ部を含めるようにしてもよい。また、
解析部４０２とベクトル生成部４０３との間には、解析
情報から不要な単語や品詞を除去する図示しない第２フ
ィルタ部を含めるようにしてもよい。さらに、変換関数
算出部４０４とベクトル変換部との間には、文書特徴ベ
クトルから分類時に不要な単語や品詞を除去する図示し
ない第３フィルタ部を含めるようにしてもよい。A first filter unit (not shown) that absorbs notational variations and the like in the document data may be included between the input unit 401 and the analysis unit 402. A second filter unit (not shown) that removes unnecessary words and parts of speech from the analysis information may be included between the analysis unit 402 and the vector generation unit 403. A third filter unit (not shown) that removes words and parts of speech not needed for classification from the document feature vectors may be included between the conversion function calculation unit 404 and the vector conversion unit 405.

【００８９】また、図５においては、さらに内積算出部
４２１を含む構成となっている。また、図６において
は、さらに文書間類似情報設定部４３１を含む構成とな
っている。FIG. 5 further includes an inner product calculation section 421. In FIG. 6, the configuration further includes an inter-document similarity information setting unit 431.

【００９０】入力部４０１は、文書データを入力するも
のであり、たとえば、キーボード２０９または３１１、
スキャナ３１３、ＯＣＲ機能を備えたスキャナ３１３、
またはネットワーク１０３を経由して文書や文書群を得
ることができるＩ／Ｆ２０４または３０９等である。ま
た、入力部４０１は、上記以外に、文書データを取得す
ることができるものであれば、それらのすべてを含む。
たとえば、文書データがデータベース化されている場合
に、そのデータベースが記録された媒体を本実施の形態
の文書分類装置に組み入れた場合も文書データの入力と
する。さらに、入力した文書データを記憶する図示しな
い文書データ記憶部を含んでいてもよい。The input unit 401 inputs document data and is, for example, the keyboard 209 or 311, the scanner 313, the scanner 313 with an OCR function, or the I/F 204 or 309, which can obtain a document or a group of documents via the network 103. Beyond these, the input unit 401 includes anything that can acquire document data. For example, when the document data is held in a database, incorporating the medium on which that database is recorded into the document classification device of this embodiment also counts as input of document data. The input unit may further include a document data storage unit (not shown) that stores the input document data.

【００９１】ここで、文書とは、自然言語で記述された
一つ以上の文の集まりであり、それが分類対象となる場
合はこれを文書という。具体的には、公開特許公報や特
定の新聞記事も文書であり、また、請求項や特定の一文
を取り出したものであっても、これを文書とみなすもの
である。Here, a document is a group of one or more sentences described in a natural language, and when it is to be classified, it is called a document. Specifically, a published patent publication and a specific newspaper article are also documents, and even if a claim or a specific sentence is extracted, it is regarded as a document.

【００９２】解析部４０２は、入力部４０１により入力
された文書データの単語を解析し解析情報を得る。具体
的には、入力部４０１により入力された文書データそれ
ぞれに対して、形態素解析等の自然言語解析をおこな
い、単語やその品詞などを抽出する。さらに、文書群で
出現した単語に対し一意な単語ＩＤを付与し、文書内お
よび文書群に対する単語出現回数を計数するものであ
る。The analysis unit 402 analyzes words of the document data input by the input unit 401 to obtain analysis information. Specifically, a natural language analysis such as a morphological analysis is performed on each of the document data input by the input unit 401 to extract a word and its part of speech. Furthermore, a unique word ID is assigned to a word that has appeared in a document group, and the number of word appearances in the document and for the document group is counted.
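
The bookkeeping performed by the analysis unit can be sketched as below. Whitespace splitting stands in for morphological analysis, which is an assumption made only to keep the example self-contained (Japanese text would need a real morphological analyzer), and the function name `analyze` is hypothetical.

```python
from collections import Counter


def analyze(documents):
    """Assign a unique ID to each word and count occurrences per document and overall."""
    word_ids = {}      # word -> unique word ID, in order of first appearance
    per_doc = []       # one Counter of word ID -> count for each document
    total = Counter()  # word ID -> count over the whole document group
    for text in documents:
        counts = Counter()
        for word in text.split():  # placeholder for morphological analysis
            word_id = word_ids.setdefault(word, len(word_ids))
            counts[word_id] += 1
        per_doc.append(counts)
        total.update(counts)
    return word_ids, per_doc, total


# Hypothetical two-document group.
word_ids, per_doc, total = analyze(["patent claim patent", "claim method"])
```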

【００９３】また、ベクトル生成部４０３は、解析部４
０３により得られた解析情報に基づいて文書データに対
する文書特徴ベクトルを生成するものである。変換関数
算出部４０４は、文書特徴ベクトル間の類似性を反映す
る空間に前記ベクトル生成部４０３により生成された文
書特徴ベクトルを射影するための表現空間変換関数を算
出するものである。ベクトル変換部４０５は、変換関数
算出部４０４により算出された表現空間変換関数をもち
いて文書特徴ベクトルを変換するものである。The vector generation unit 403 generates document feature vectors for the document data based on the analysis information obtained by the analysis unit 402. The conversion function calculation unit 404 calculates an expression space conversion function for projecting the document feature vectors generated by the vector generation unit 403 onto a space that reflects the similarity between the document feature vectors. The vector conversion unit 405 converts the document feature vectors using the expression space conversion function calculated by the conversion function calculation unit 404.

【００９４】ベクトル生成部４０３，変換関数算出部４
０４，ベクトル変換部４０５の各処理の詳細は後述す
る。The processing of the vector generation unit 403, the conversion function calculation unit 404, and the vector conversion unit 405 is described in detail later.

【００９５】分類部４０６は、ベクトル変換部４０５に
より変換された新たな文書特徴ベクトル間の類似度に基
づいて文書を分類するものである。具体的には、生成さ
れた分類対象データに対して、カイ自乗法の手法、判別
分析の手法、およびクラスタ分析の手法等の分類手法を
適用することで、文書分類をおこなうことができる。分
類部４０６においては、ベクトルデータが適用できる分
類手法であれば、その手法は問わない。The classification unit 406 classifies documents based on the similarity between new document feature vectors converted by the vector conversion unit 405. Specifically, document classification can be performed by applying a classification method such as a chi-square method, a discriminant analysis method, or a cluster analysis method to the generated classification target data. The classification unit 406 may use any classification method as long as the vector data can be applied.

【００９６】さらに、分類結果記憶部４０７は、分類部
４０６により分類された結果を適切な形式で記憶する記
憶部である。たとえば、ディスク装置３０６またはハー
ドディスク３１６の所定の領域のほか、ＲＡＭ２０３ま
たは３０３、その他データを記憶可能なところであれば
いずれでもよい。Further, the classification result storage unit 407 is a storage unit that stores the results classified by the classification unit 406 in an appropriate format. For example, in addition to a predetermined area of the disk device 306 or the hard disk 316, any of the RAM 203 or 303 and any other data storage location may be used.

【００９７】内積算出部５０１は、ベクトル生成部４０
２手段により生成された文書特徴ベクトル間の内積を算
出する算出部である。内積算出部５０１の処理の内容は
後述する。The inner product calculation unit 501 calculates the inner products between the document feature vectors generated by the vector generation unit 403. The processing of the inner product calculation unit 501 is described later.

【００９８】また、文書間類似情報設定部６０１は、入
力部４０１により入力された文書の作成者、作成日等の
文書データの文書間類似情報を設定する設定部である。
文書間類似情報には、文書内での単語の出現順序や、文
書の作成日、修正日、作成者、修正者、参照文書、引用
文書などの文書間での一致情報を含む。操作者は、こら
らの文書間類似情報の中から所望の情報を指定し任意に
設定することができる。The inter-document similarity information setting unit 601 is a setting unit for setting the inter-document similarity information of the document data such as the creator of the document input by the input unit 401 and the date of creation.
The inter-document similarity information includes the order in which words appear in the document, and the matching information between documents such as the date of creation and modification of the document, the creator, the modifier, the reference document, and the cited document. The operator can specify desired information from these inter-document similarity information and set it arbitrarily.
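
Matching information of this kind can be pictured as a count of agreeing metadata fields per document pair. The sketch below is illustrative only: the field names, the dictionary representation, and the function name `metadata_matches` are assumptions, and the text does not specify how such information is encoded.

```python
def metadata_matches(docs, fields):
    """Return M with M[i][j] = number of selected fields on which documents i and j agree."""
    n = len(docs)
    return [
        [
            sum(
                1
                for f in fields
                if docs[i].get(f) is not None and docs[i].get(f) == docs[j].get(f)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical metadata for three documents.
docs = [
    {"author": "A", "created": "1998-04-01"},
    {"author": "A", "created": "1998-04-02"},
    {"author": "B", "created": "1998-04-01"},
]

# The operator chooses which kinds of matching information to use.
matches = metadata_matches(docs, fields=("author", "created"))
```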

【００９９】入力部４０１、解析部４０２、ベクトル生
成部４０３、変換関数算出部４０４、ベクトル変換部４
０５、分類部４０６、内積算出部４２０、文書間類似情
報設定部４３０は、ＲＯＭ２０２または３０２、ＲＡＭ
２０３または３０３、あるいはディスク装置３０６また
はハードディスク３１６等の記録媒体に記録されたプロ
グラムに記載された命令にしたがってＣＰＵ２０１また
は３０１等が命令処理を実行することにより、各部の機
能を実現する。The functions of the input unit 401, analysis unit 402, vector generation unit 403, conversion function calculation unit 404, vector conversion unit 405, classification unit 406, inner product calculation unit 420, and inter-document similarity information setting unit 430 are realized by the CPU 201 or 301 or the like executing processing in accordance with the instructions of a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.

【０１００】 Next, the document feature vector generation processing performed by the vector generation unit 403 is described. The vector generation unit 403 generates a document feature vector for the document data based on the analysis information obtained by the analysis unit 402. Here, the analysis information is, for example, words, word IDs, word occurrence counts, and part-of-speech information.

【０１０１】 FIG. 7 is an explanatory diagram showing an example of document-word matrix data and document feature vectors. In FIG. 7, the row component 701 is the word ID and the column component 702 is the document ID. Based on the analysis information, document-word matrix data is generated in which the column number is the document ID, the row number is the ID of a word contained in the document, and each matrix element is the number of times that word occurs in that document. Each column vector of this document-word matrix is a document feature vector. Document feature vectors are generated in this way.
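The construction of the document-word matrix described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the vector generation unit 403: documents are assumed to be already split into words by the analysis step, word IDs are assigned in order of first appearance, and the names `build_document_word_matrix` and `document_feature_vector` are hypothetical.

```python
from collections import Counter


def build_document_word_matrix(documents):
    """Return (matrix, word_to_id); matrix[word_id][doc_id] is an occurrence count.

    `documents` is a list of token lists (assumed output of the analysis step).
    """
    word_to_id = {}  # word -> row number (assumption: first-seen order)
    counts = []
    for tokens in documents:
        doc_counts = Counter(tokens)
        for word in doc_counts:
            word_to_id.setdefault(word, len(word_to_id))
        counts.append(doc_counts)

    # Rows are word IDs, columns are document IDs, as in the FIG. 7 description.
    matrix = [[0] * len(documents) for _ in word_to_id]
    for doc_id, doc_counts in enumerate(counts):
        for word, n in doc_counts.items():
            matrix[word_to_id[word]][doc_id] = n
    return matrix, word_to_id


def document_feature_vector(matrix, doc_id):
    """A document feature vector is one column of the document-word matrix."""
    return [row[doc_id] for row in matrix]


docs = [["document", "classification", "document"], ["vector", "classification"]]
X, word_to_id = build_document_word_matrix(docs)
```

The returned `word_to_id` dictionary plays the role of the word-to-word-ID correspondence map mentioned in the text.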

【０１０２】 Processing such as normalization can also be applied to the document feature vectors at the same time. In doing so, additional information accompanying the document-word matrix data is generated as well, for example, word-to-word-ID correspondence map data describing the correspondence between each word ID (a row component of the document-word matrix data) and its word, and word-ID-to-part-of-speech correspondence map data describing, for each word, the correspondence between its word ID and its part-of-speech information.

【０１０３】 Next, the conversion function calculation processing performed by the conversion function calculation unit 404 is described. The vector generation unit 403 normally generates document feature vectors based on the number of times each word occurs in the document. In doing so, the words are assumed to be semantically independent of one another and are treated as mutually orthogonal. In reality, however, words exhibit polysemy and synonymy, so the validity of this assumption is not guaranteed, and treating the words as mutually orthogonal affects the accuracy and validity of the classification.

【０１０４】 One conceivable way to reduce this effect is to regard the problem as a multidimensional scaling problem and apply statistical methods. That is, the conversion function calculation unit 404 calculates, based on the document feature vectors generated by the vector generation unit 403, an expression space conversion function for converting each document feature vector into a space that reflects the relationships among feature dimensions across the document feature vectors, that is, the co-occurrence of words. A thesaurus or the like may also be used to reduce the effect of synonymy between words.

【０１０５】 In this embodiment, the expression space conversion function is calculated using the method described in the previously cited "Representing Documents Using an Explicit Model of Their Similarities", but it may instead be calculated using other methods such as factor analysis or quantification methods.

【０１０６】 That is, the expression space conversion function is calculated based on the document feature vectors and an inter-document similarity matrix, which is obtained by adding the inter-document similarity information set by the inter-document similarity information setting means to the inner products between document feature vectors calculated by the inner product calculation unit 501. Using this expression space conversion function, documents can be classified in an expression space that strongly reflects the semantic similarity between documents. In addition, as described above, the operator can freely select additional inter-document similarity information, so document classification that reflects the operator's intention can be performed.

【０１０７】 Specifically, let $d$ be the number of documents and $t$ the number of words, let $X$ be the document-word matrix (the document feature vectors) of size $t \times d$, let $S$ be the inter-document inner product matrix of size $d \times d$, and let $S_a$ be the additional inter-document similarity information matrix of size $d \times d$. The expression space conversion function $W$ is then given by Equation 1.

【０１０８】 $W = M^{T} C X^{+}$ (Equation 1)

【０１０９】 Here, $^{T}$ denotes the transpose of a matrix.

【０１１０】 Letting $\mathrm{svd}()$ denote the operator that applies singular value decomposition to a matrix, the matrices $C$, $M$, and $X^{+}$ in Equation 1 are as follows.

【０１１１】 $X = \mathrm{svd}(X) = U L A^{T}$ (Equation 2)

【０１１２】 $S = X^{T} X$ (Equation 3)

【０１１３】 $S + S_a = \mathrm{svd}(S + S_a) = C^{T} C$ (Equation 4)

【０１１４】 $C A A^{T} = \mathrm{svd}(C A A^{T}) = M Z N^{T}$ (Equation 5)

【０１１５】 $X^{+} = A L^{-1} U^{T}$ (Equation 6)

【０１１６】 To calculate the expression space conversion function using only the inner products of the vectors, the additional inter-document similarity matrix $S_a$ is set to an empty matrix. In that case, the expression space conversion function is given by Equation 7.

【０１１７】 $W = U^{T}$ (Equation 7)
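The step from Equation 1 to Equation 7 can be checked with a short worked derivation. The sketch below rests on assumptions not stated in the text: the "empty matrix" is read as the zero matrix ($S_a = 0$), Equation 2 is a thin singular value decomposition with $U^{T}U = I$ and $A^{T}A = I$, $L$ is diagonal and invertible, and $C = LA^{T}$ is chosen as the factor in Equation 4 (the factorization is not unique).

```latex
\begin{align*}
S &= X^{T}X = (U L A^{T})^{T}(U L A^{T}) = A L\,U^{T}U\,L A^{T}
   = A L^{2} A^{T} = (L A^{T})^{T}(L A^{T})
   && \Rightarrow\; C = L A^{T} \\
C A A^{T} &= L A^{T} A A^{T} = L A^{T} = I \cdot L \cdot A^{T}
   && \Rightarrow\; M = I,\; Z = L,\; N = A \\
W &= M^{T} C X^{+} = I \cdot L A^{T} \cdot A L^{-1} U^{T} = U^{T}
\end{align*}
```

Under these assumptions the converted vectors $WX = U^{T}X = LA^{T}$ have the same pairwise inner products as the original document feature vectors, since $(LA^{T})^{T}(LA^{T}) = S$.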

【０１１８】 To calculate the expression space conversion function using the inter-document similarity information that has been set, $S_a$ is set to a symmetric matrix other than the empty matrix.

【０１１９】 Furthermore, in this document classification device, the conversion function generation unit 404 can be bypassed by setting the expression space conversion function $W$ to an identity matrix of size $t \times t$.

【０１２０】 Furthermore, the number of feature dimensions of the document feature vectors generated by the vector generation unit 403 equals the number of words appearing in the document group, so the vectors are usually of very high dimension, and performing classification on them as they are makes the computation cost and storage space enormous. Words with extremely few or extremely many occurrences can therefore be excluded from the dimensions constituting the document feature vectors, but doing so may reduce classification accuracy and validity.

【０１２１】 The expression space conversion function used in the present invention realizes a conversion into a space that takes into account the co-occurrence of words across document feature vectors. As is clear from Equation 1, each feature dimension of the expression space generated by the expression space conversion function is therefore expressed as a linear combination of a plurality of words. Consequently, the meanings of many words can be handled even with few feature dimensions, which suppresses the computation cost and storage space required for classification and similar processing.

【０１２２】 Next, the document feature vector conversion processing performed by the vector conversion unit 405 is described. The vector conversion unit 405 converts the document feature vectors using the expression space conversion function generated by the conversion function generation unit 404 and derives the data to be classified. In addition, each word can also be converted using the expression space conversion function. That is, when the matrix $W$ is used as the expression space conversion function and the converted document feature vectors are denoted $D_p$, Equation 8 holds.

【０１２３】 $D_p = W X$ (Equation 8)

【０１２４】 Likewise, when the matrix representation of the converted words is denoted $T_p$, Equation 9 holds.

【０１２５】 $T_p = W^{T} I = W$ (Equation 9)

【０１２６】 Here, $I$ denotes the identity matrix.
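The conversion of Equation 8 can be illustrated with a small numeric sketch. The helper names `matmul` and `identity` and the example values are hypothetical; the one-row matrix `W` below is chosen by hand to show a single feature dimension formed as a linear combination of two words, and is not the result of Equation 1.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def identity(n):
    """Identity matrix of size n x n."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# Document-word matrix X: t = 3 words (rows), d = 2 documents (columns).
X = [[2, 0],
     [1, 1],
     [0, 1]]

# Hypothetical conversion matrix with one feature dimension that
# averages word 0 and word 1 and ignores word 2.
W = [[0.5, 0.5, 0.0]]

D_p = matmul(W, X)                  # Equation 8: converted document vectors
D_bypass = matmul(identity(3), X)   # W = I leaves the vectors unchanged
```

With these values each document is reduced from three word dimensions to one feature dimension, and setting `W` to the identity matrix reproduces `X`, which corresponds to the bypass described for the conversion function generation unit 404.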

【０１２７】 Next, the sequence of processing performed by the document classification device according to the first embodiment is described. FIG. 8 is a flowchart showing this sequence. In the flowchart of FIG. 8, the input unit 401 first inputs document data (step S810). Next, the analysis unit 402 analyzes the document data input in step S810 and obtains analysis information (step S820).

【０１２８】 Next, the vector generation unit 403 generates document feature vectors based on the analysis information obtained in step S820 (step S830). The conversion function calculation unit 404 then calculates an expression space conversion function for projecting the document feature vectors generated in step S830 into a space that reflects the similarity between document feature vectors (step S840).

【０１２９】 Next, the vector conversion unit 405 converts the document feature vectors generated in step S830 using the expression space conversion function calculated in step S840 (step S850). The classification unit 406 then classifies the documents based on the similarity between the document feature vectors converted in step S850 (step S860). Finally, the classification result obtained in step S860 is stored (step S870), and all processing ends.

【０１３０】 FIG. 9 is a flowchart showing another sequence of processing performed by the document classification device according to the first embodiment. In the flowchart of FIG. 9, steps that perform the same processing as steps in FIG. 8 are given the same numbers, and their description is omitted.

【０１３１】 Following step S830, the inner products between the document feature vectors generated in that step are calculated (step S835). It is then determined whether an instruction to use inter-document similarity information has been given (step S836).

【０１３２】 If no such instruction was given in step S836 (step S836: No), the expression space conversion function is calculated using the inner products calculated in step S835 (step S840). If an instruction was given in step S836 (step S836: Yes), the inter-document similarity information of the document data input by the input unit 401 is set (step S837). Processing then proceeds to step S840, where the expression space conversion function is calculated using the inner products calculated in step S835 and the inter-document similarity information set in step S837. Thereafter, the same processing as in FIG. 8 is performed.
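The branch in steps S835 to S837 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and encoding "same author" as a fixed weight added to the inner product is an assumption, since the text does not specify how inter-document similarity information is turned into matrix entries. The diagonal is included in the author-match matrix so that it remains positive semidefinite, which keeps the sum factorable in the form $C^{T}C$ used in Equation 4.

```python
def inner_product_matrix(X):
    """S = X^T X for a t x d document-word matrix X given as a list of rows (step S835)."""
    columns = list(zip(*X))  # one tuple per document
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def author_match_matrix(authors, weight=1.0):
    """Hypothetical S_a: `weight` where two documents share an author, else 0 (step S837)."""
    return [[weight if a == b else 0.0 for b in authors] for a in authors]


def similarity_matrix(X, authors=None, weight=1.0):
    """Return S when no similarity information is requested, otherwise S + S_a."""
    S = inner_product_matrix(X)
    if authors is None:  # step S836: No
        return S
    S_a = author_match_matrix(authors, weight)  # step S836: Yes
    return [[s + sa for s, sa in zip(row_s, row_a)] for row_s, row_a in zip(S, S_a)]


# t = 2 words, d = 3 documents; documents 0 and 2 share an author.
X = [[2, 0, 1],
     [1, 1, 0]]
S_only = similarity_matrix(X)
S_plus = similarity_matrix(X, authors=["a", "b", "a"], weight=0.5)
```

Either result would then be passed to the conversion function calculation of step S840.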

【０１３３】 As described above, according to the first embodiment, an expression space conversion function for converting each document into an expression space that can reflect the semantic relationships between documents is calculated based on the similarity between documents in the document group to be classified, and document classification is performed in that expression space. Document classification that can reflect the operator's intention can thereby be realized.

【０１３４】 (Embodiment 2) The first embodiment described above makes no mention of storing the generated document feature vectors and the calculated expression space conversion function. As in the second embodiment described below, the configuration may further include a vector storage unit and a conversion function storage unit.

【０１３５】 The functional configuration of the document classification device according to the second embodiment is described. FIG. 10 is a block diagram functionally showing the configuration of the document classification device according to the second embodiment. In FIG. 10, components identical to those in FIG. 4 of the first embodiment are given the same numbers, and their description is omitted.

【０１３６】 The vector storage unit 1001 stores the document feature vectors generated by the vector generation unit 403. It can also store the additional information accompanying the document-word matrix data that the vector generation unit 403 generates at the same time, for example, word-to-word-ID correspondence map data describing the correspondence between each word ID (a row component of the document-word matrix data) and its word, word-ID-to-part-of-speech correspondence map data describing, for each word, the correspondence between its word ID and its part-of-speech information, and syntax information data.

【０１３７】 The conversion function storage unit 1002 stores the expression space conversion function generated by the conversion function generation unit 404.

【０１３８】 The vector storage unit 1001 and the conversion function storage unit 1002 each realize their functions when the CPU 201 or 301 or the like executes instructions described in a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.

【０１３９】 Storing the document feature vectors and the expression space conversion function makes it possible to convert the stored document feature vectors using the stored expression space. The processing of the vector storage unit 1001 and the conversion function storage unit 1002 therefore no longer needs to be carried out together with that of the vector conversion unit 405 as one continuous sequence, and the two can be functionally separated.

【０１４０】 Next, the sequence of processing performed by the document classification device according to the second embodiment is described. FIG. 11 is a flowchart showing this sequence. In the flowchart of FIG. 11, steps that perform the same processing as steps in FIG. 8 of the first embodiment are given the same numbers, and their description is omitted.

【０１４１】 Following step S830, the document feature vectors generated in that step are stored (step S831). Processing then proceeds to step S840 and continues as in the first embodiment. Following step S840, the expression space conversion function calculated in that step is stored (step S841). Thereafter, the same processing as in the first embodiment is performed.

【０１４２】 As described above, when classification is performed with a different number of classes or a different classification method, the document classification device according to the second embodiment can execute document classification without recalculating the expression space conversion function each time, so multiple classification results can be obtained in a short time.

【０１４３】 Furthermore, an expression space conversion function generated in advance from other document feature vectors can also be used as the expression space conversion function in the document feature vector conversion unit.

【０１４４】 (Embodiment 3) A vector correction unit 1201 may further be added to the first and second embodiments, as in the third embodiment described below.

【０１４５】 First, the functional configuration of the document classification device according to the third embodiment is described. FIG. 12 is a block diagram functionally showing the configuration of the document classification device according to the third embodiment. In FIG. 12, components identical to those in FIG. 4 of the first embodiment are given the same numbers, and their description is omitted.

【０１４６】 Before the vector conversion unit 405 converts the document feature vectors, the vector correction unit 1201 corrects them by operating on the document feature vectors, on the feature dimensions that constitute them, or on both, using rules composed from the characteristics of the words extracted by the analysis unit 402.

【０１４７】 FIG. 13 is a flowchart showing the processing procedure of the vector correction unit 1201. In the flowchart of FIG. 13, the vector correction unit 1201 first reads the document feature vectors (step S1301). Next, words extracted by the analysis unit 402, their part-of-speech information, or the like are specified (step S1302), which determines the feature dimensions of the document feature vectors on which operations such as deletion are to be performed, that is, the word IDs of words appearing in the document group (step S1303).

【０１４８】 Then, for the document feature vectors generated by the vector generation unit 403 or stored by the vector storage unit 1001, correction operations such as deletion or merging are applied to the target feature dimensions (step S1304), and the document feature vectors are composed from the result.

【０１４９】 The vector correction unit 1201 realizes its function when the CPU 201 or 301 or the like executes instructions described in a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.

【０１５０】 An example of a procedure for deleting $t'$ feature dimensions (that is, word IDs) from the document feature vectors is shown in FIG. 14. Let $d$ be the number of documents and $t$ the number of words, let the document feature vectors (the document-word frequency matrix) be a matrix $X$ of size $t \times d$, and let $P_t$ be the matrix of size $t' \times t$ obtained by deleting, from an identity matrix of size $t \times t$ whose rows (columns) each correspond to a word ID, the rows corresponding to the word IDs to be deleted. The document feature vectors $X'$ corrected by the correction unit 1201 are then given by Equation 10.

【０１５１】 $X' = P_t X$ (Equation 10)
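Equation 10 can be illustrated with a small sketch. The names `deletion_matrix` and `matmul` are hypothetical, and the sketch assumes that $P_t$ has one row for each word ID that is kept, which is what deleting rows from an identity matrix produces.

```python
def deletion_matrix(t, delete_ids):
    """Identity matrix of size t x t with the rows for `delete_ids` removed."""
    delete = set(delete_ids)
    return [[1 if col == row else 0 for col in range(t)]
            for row in range(t) if row not in delete]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# t = 3 words, d = 2 documents; word ID 1 is to be deleted.
X = [[2, 0],
     [1, 1],
     [0, 1]]
P_t = deletion_matrix(3, [1])
X_corrected = matmul(P_t, X)  # Equation 10: X' = P_t X
```

Multiplying by `P_t` simply drops the row of `X` that belongs to the deleted word ID, so the same correction could also be done by filtering rows directly; the matrix form is what allows it to be combined with the other equations.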

【０１５２】 Next, the sequence of processing performed by the document classification device according to the third embodiment is described. FIG. 15 is a flowchart showing the sequence of processing performed by the document classification device according to the third embodiment. In the flowchart of FIG. 15, steps that perform the same processing as steps in FIG. 8 of the first embodiment are given the same numbers, and their description is omitted.

【０１５３】 Following steps S830 and S831, the document feature vectors are corrected (step S832). Processing then proceeds to step S840 and continues as in the first embodiment.

【０１５４】 As described above, with the vector correction unit 1201, the document classification device according to the third embodiment can delete words and the like that turn out to be unnecessary at classification time, even after the vector generation unit 403 has generated the document feature vectors. Furthermore, while classification of the same document feature vectors is performed efficiently, the document feature vector correction unit 1201 makes it possible to delete different words for each classification and then execute document classification.

【０１５５】 (Embodiment 4) In the third embodiment, the vector correction unit 1201 was added. As in the fourth embodiment described below, a conversion function correction unit 1601 may be added together with the vector correction unit.

【０１５６】 First, the functional configuration of the document classification device according to the fourth embodiment is described. FIG. 16 is a block diagram functionally showing the configuration of the document classification device according to the fourth embodiment. In FIG. 16, components identical to those in FIG. 12 of the third embodiment are given the same numbers, and their description is omitted.

【０１５７】 In the third embodiment, when the vector correction unit 1201 corrects the document feature vectors, the expression space conversion function has already been calculated from the document feature vectors before correction. Unless the effect of the correction is also reflected in the expression space conversion function, the benefit of correcting the document feature vectors may be halved. The expression space conversion function is therefore corrected based on the corrected document feature vectors.

【０１５８】 That is, the conversion function correction unit 1601 in FIG. 16 corrects the expression space conversion function to $W'$. When the expression space conversion function is calculated based on the inner products of the document feature vectors, it is given by Equation 7. In this case, the corrected expression space conversion function $W'$ is expressed as Equation 11, using Equations 2, 7, and 10.

【０１５９】 $W' = L^{-1} U^{T} P_t X (P_t X)$ (Equation 11)

【０１６０】 The conversion function correction unit 1601 realizes its function when the CPU 201 or 301 or the like executes instructions described in a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.

【０１６１】 FIG. 17 shows a flowchart describing the sequence of processing performed by the document classification device according to the fourth embodiment. In the flowchart of FIG. 17, when the document feature vectors have been corrected, the expression space conversion function is also corrected (step S841). The rest is the same as the processing of the third embodiment.

【０１６２】 As described above, in the document classification device according to the fourth embodiment, the expression space conversion function can be corrected along with the correction of the document feature vectors, so the document feature vectors can be converted more appropriately.

【０１６３】 (Embodiment 5) In the fourth embodiment, the conversion function correction unit 1601 was added. As in the fifth embodiment described below, a conversion function correction instruction unit 1801 that issues correction instructions to the conversion function correction unit 1601 may further be added.

【０１６４】 First, the functional configuration of the document classification device according to the fifth embodiment is described. FIG. 18 is a block diagram functionally showing the configuration of the document classification device according to the fifth embodiment. In FIG. 18, components identical to those in FIG. 4 of the first embodiment are given the same numbers, and their description is omitted.

【０１６５】 The conversion function correction instruction unit 1801 gives instructions concerning operations on the feature dimensions of the expression space conversion function. The conversion function correction unit 1802 corrects the expression space conversion function by operating on its feature dimensions based on the instructions from the conversion function correction instruction unit 1801.

【０１６６】 The conversion function correction instruction unit 1801 and the conversion function correction unit 1802 each realize their functions when the CPU 201 or 301 or the like executes instructions described in a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.

【０１６７】 In the conversion function correction instruction unit 1801, one conceivable way to perform document classification that reflects the operator's intention is to delete or merge feature dimensions of the space formed by the expression space conversion function that are unnecessary or that have an adverse effect, or conversely to perform operations that emphasize certain feature dimensions.

【０１６８】 However, each feature dimension of the space generated by the expression space conversion function can be regarded as a combination of several semantically similar words among those extracted by the analysis unit 402, so the semantic interpretation of each feature dimension is extremely complex and ambiguous. It is therefore extremely difficult to present the meaning of each feature dimension to the operator.

【０１６９】そこで、操作者に分類に反映させたくない
内容や強調したい内容をもつ文書や単語などの情報を指
定させ、それらを前記表現空間辺関数により構成される
空間に適切に射影し、それらと類似度の高い特徴次元や
低い特徴次元を判別することで、操作をおこなう特徴次
元を選択することができる。Therefore, the operator is asked to specify information, such as documents or words, whose content should not be reflected in the classification or should be emphasized. That information is appropriately projected into the space formed by the expression space conversion function, and the feature dimensions with high or low similarity to it are identified. In this way the feature dimensions to be operated on can be selected.

【０１７０】本実施の形態では、表現空間変換関数の特
徴次元を操作する例として、操作者が指定するある文書
と類似度の高い特徴次元の削除をおこなう例を示す。操
作者により指定された文書を前記文書特徴ベクトルと同
じ次元数をもつベクトルで表現し、その文書ベクトルに
表現空間変換関数を適用し文書ベクトルを表現空間変換
関数により構成される空間へ射影する。そして、この射
影された文書ベクトルと各特徴次元との類似度を算出す
ることで、類似度の高い特徴次元を判別する。In this embodiment, as an example of operating the feature dimension of the expression space conversion function, an example is shown in which a feature dimension having a high similarity to a certain document specified by the operator is deleted. A document specified by the operator is represented by a vector having the same dimension as the document feature vector, and an expression space conversion function is applied to the document vector to project the document vector into a space formed by the expression space conversion function. Then, by calculating the similarity between the projected document vector and each feature dimension, a feature dimension having a high similarity is determined.

【０１７１】このとき、類似度を測るための尺度として
は、余弦尺度、内積尺度、ユークリッド距離尺度などを
もちいることができる。また、判別に関しては、ある類
似度以上を削除対象として採用するような閾値処理によ
る判別や、類似度の高い順にある一定数を削除対象とし
て採用する定数処理もしくは判別分析などももちいるこ
とができる。At this time, a cosine measure, an inner product measure, a Euclidean distance measure, or the like can be used as the measure of similarity. For the determination, it is possible to use threshold processing that adopts feature dimensions at or above a certain similarity as deletion targets, fixed-count processing that adopts a fixed number of feature dimensions in descending order of similarity, discriminant analysis, or the like.
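The selection and deletion of feature dimensions described in paragraphs 0170 to 0172 can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the source: the expression space conversion function is treated as a linear map given by a list of rows, one row per feature dimension; the similarity between the projected document and a feature dimension is taken as the magnitude of the cosine between the projected vector and that axis; and all function names and sample values are hypothetical.

```python
import math


def project(conversion, vector):
    """Apply a linear conversion (one row per feature dimension) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in conversion]


def axis_similarities(projected):
    """Magnitude of the cosine between the projected vector and each axis."""
    norm = math.sqrt(sum(v * v for v in projected))
    if norm == 0.0:
        return [0.0 for _ in projected]
    return [abs(v) / norm for v in projected]


def select_by_threshold(similarities, threshold):
    """Threshold processing: every dimension at or above the threshold."""
    return [i for i, s in enumerate(similarities) if s >= threshold]


def select_top_n(similarities, n):
    """Fixed-count processing: the n dimensions with the highest similarity."""
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return sorted(ranked[:n])


def delete_dimensions(conversion, targets):
    """Remove the selected feature dimensions (rows) from the conversion."""
    drop = set(targets)
    return [row for i, row in enumerate(conversion) if i not in drop]


# Hypothetical 3-dimensional expression space over a 3-term vocabulary.
conversion = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]]
operator_doc = [0.0, 2.0, 2.0]      # document specified by the operator

projected = project(conversion, operator_doc)
similarities = axis_similarities(projected)
targets = select_by_threshold(similarities, 0.8)
modified = delete_dimensions(conversion, targets)
```

Either selection rule yields a list of dimension indices, so the deletion step is the same whichever rule the operator prefers.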

【０１７２】このようにして、採用された特徴次元を表
現空間変換関数から削除することで表現空間変換関数を
修正することができる。この際、操作者が指示（指定）
する情報としては、前記文書特徴ベクトルと同じ次元数
を有するベクトル形式であれば、どのようなものでも適
用可能である。In this way, the expression space conversion function can be corrected by deleting the adopted feature dimensions from it. The information that the operator instructs (specifies) at this time can be anything in a vector format having the same number of dimensions as the document feature vector.

【０１７３】操作者が指示する情報としては、そのほか
に、より操作者にとって理解しやすいものとして、分類
対象文書群以外の文書を文書特徴ベクトルと同じ次元を
もつベクトルに表現したものをもちいることができる。
また、操作者が指示する情報としてそのほかには、文書
特徴べクトルをもちいることができる。As information specified by the operator, something easier for the operator to understand can also be used: a document outside the group of documents to be classified, expressed as a vector having the same dimensions as the document feature vector. A document feature vector can also be used as information specified by the operator.

【０１７４】また、操作者が指示する情報としてそのほ
かには、解析部４０２によって抽出されまたは操作者が
手動で入力した単語や単語品詞情報をもちいることがで
きる。また、操作者が指示する情報としてそのほかに
は、分類結果記憶部４０７によって記憶されている事前
におこなわれた分類結果である分類代表値をもちいるこ
とができる。As other information specified by the operator, words or word part-of-speech information extracted by the analysis unit 402 or manually input by the operator can be used. A classification representative value, which is the result of a classification performed in advance and stored in the classification result storage unit 407, can also be used.

【０１７５】上記の指示情報は、それぞれ単独でもちい
るほか、それらを適切に組み合わせたものをもちいるよ
うにしてもよい。The above-mentioned instruction information may be used alone or may be used by appropriately combining them.

【０１７６】図１９に実施の形態５による文書分類装置
の一連の処理の一部の手順を説明するフローチャートを
示す。図１９のフローチャートにおいて、まず、変換関
数の修正の指示があるのを待って（ステップＳ１９０１
肯定）、つぎに、指示の内容、すなわち、操作者が指示
（指定）した指示情報をインプットする（ステップＳ１
９０２）。複数の指示がある場合は、すべての指示が終
了するまで同様のステップを繰り返し、指示が終了した
場合（ステップＳ１９０３肯定）に、インプットされた
指示情報に基づいて変換関数の修正を実行し（ステップ
Ｓ１９０４）、すべての処理を終了する。FIG. 19 is a flowchart illustrating part of the series of processes of the document classification device according to the fifth embodiment. In the flowchart of FIG. 19, the device first waits for an instruction to correct the conversion function (Yes at step S1901), and then inputs the content of the instruction, that is, the instruction information instructed (specified) by the operator (step S1902). When there are a plurality of instructions, the same steps are repeated until all of them have been given. When the instructions are complete (Yes at step S1903), the conversion function is corrected based on the input instruction information (step S1904), and all processing ends.

【０１７７】以上説明したように、この実施の形態５に
よれば、表現空間変換関数をもちいて構成される空間の
特徴次元について操作者が簡便な操作をすることによ
り、操作者の意図を反映しうる文書分類をおこなうこと
ができる。As described above, according to the fifth embodiment, the operator can perform a simple operation on the feature dimensions of the space formed by using the expression space conversion function, and document classification that can reflect the operator's intention can thereby be performed.

【０１７８】（実施の形態６）さて、実施の形態１〜５
に対して、以下に説明する実施の形態６のように、初期
重心指定部２００１および初期重心登録部２００２をさ
らに追加するようにしてもよい。(Embodiment 6) In addition to Embodiments 1 to 5, an initial center of gravity specifying unit 2001 and an initial center of gravity registration unit 2002 may be further added, as in the sixth embodiment described below.

【０１７９】まず、実施の形態６による文書分類装置の
機能的構成について説明する。図２０は、実施の形態６
による文書分類装置の構成を機能的に示すブロック図で
ある。図２０において、実施の形態１の図４と同一のも
のに関しては同じ番号を付して、その説明を省略する。First, the functional configuration of the document classification device according to the sixth embodiment will be described. FIG. 20 is a block diagram functionally showing the configuration of the document classification device according to the sixth embodiment. In FIG. 20, elements identical to those in FIG. 4 of the first embodiment are denoted by the same reference numerals, and their description is omitted.

【０１８０】初期重心指定部２００１は、初期クラスタ
重心を指定する指定部である。初期重心登録部２００２
は、初期重心指定部２００１により指定された初期クラ
スタ重心を登録するする登録部である。また、分類部４
０５は、初期重心登録部２００２により登録された初期
クラスタ重心にしたがって文書を分類するものである。The initial center of gravity specifying unit 2001 is a specifying unit for specifying initial cluster centroids. The initial center of gravity registration unit 2002 is a registration unit for registering the initial cluster centroids specified by the initial center of gravity specifying unit 2001. The classification unit 405 classifies documents according to the initial cluster centroids registered by the initial center of gravity registration unit 2002.

【０１８１】初期重心指定部２００１、初期重心登録部
２００２は、ＲＯＭ２０２または３０２、ＲＡＭ２０３
または３０３、あるいはディスク装置３０６またはハー
ドディスク３１６等の記録媒体に記録されたプログラム
に記載された命令にしたがってＣＰＵ２０１または３０
１等が命令処理を実行することにより、各部の機能を実
現する。The functions of the initial center of gravity specifying unit 2001 and the initial center of gravity registration unit 2002 are realized by the CPU 201 or 301 or the like executing instruction processing according to instructions described in a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.

【０１８２】通常、カイ自乗法の手法、判別分析の手
法、およびクラスタ分析の手法等をもちいて文書分類を
おこなう場合にもちいられる分類基準が統計的な理論を
元にして構成されている。しかしながら、本実施の形態
においては、文書分類をおこなった際の最終的な分類の
質の評価は、統計的な数値評価ではなく、その分類結果
を分析する操作者による主観評価となる。したがって、
前記文書分類をおこなうための諸手法において、操作者
が介入できうる余地を設けることで分類結果に操作者の
意図を反映することができ、結果として分類結果の質的
な向上が見込まれる。Normally, the classification criteria used when classifying documents with the chi-square method, discriminant analysis, cluster analysis, and the like are built on statistical theory. In the present embodiment, however, the final quality of a classification is evaluated not by a statistical numerical measure but subjectively, by the operator who analyzes the classification result. Therefore, by leaving room for the operator to intervene in these classification methods, the operator's intention can be reflected in the classification result, and the quality of the result can be expected to improve.

【０１８３】つぎに、図２１に実施の形態６による文書
分類装置の一連の処理の一部の手順を説明するフローチ
ャートを示す。非階層型のクラスタリング手法は一般的
に図１９のフローチャートのような処理の手順となる。
図２１のフローチャートにおいて、まず、初期クラスタ
重心が指定され（ステップＳ２１０１）、その初期クラ
スタ重心が登録される（ステップＳ２１０２）。つぎ
に、初期クラスタ重心を決定し（ステップＳ２１０
３）、そのクラスタ重心と各分類対象データとの類似度
を計算する（ステップＳ２１０４）。Next, FIG. 21 is a flowchart illustrating part of the series of processes of the document classification device according to the sixth embodiment. A non-hierarchical clustering method generally follows a processing procedure like that of the flowchart of FIG. 19. In the flowchart of FIG. 21, initial cluster centroids are first specified (step S2101) and registered (step S2102). Next, the initial cluster centroids are determined (step S2103), and the similarity between each cluster centroid and each item of classification target data is calculated (step S2104).

【０１８４】つぎに、各分類対象データを一番類似度の
高いクラスタに割り当てて（ステップＳ２１０５）、各
クラスタごとに割り当てられた分類対象データを基にそ
のクラスタ重心を計算する（ステップＳ２１０６）。Next, each classification target data is assigned to the cluster having the highest similarity (step S2105), and the cluster centroid is calculated based on the classification target data assigned to each cluster (step S2106).

【０１８５】この時点で、反復停止基準を満たすか否か
を判断し（ステップＳ２１０７）、反復停止基準を満た
さない場合（ステップＳ２１０７否定）は、ステップＳ
２１０４へ移行し、以後、ステップＳ２１０４〜Ｓ２１
０６の各ステップを繰り返し実行する。ステップＳ２１
０７において、反復停止基準を満たす場合（ステップＳ
２１０７肯定）は、すべての処理終了する。At this point, it is determined whether the iteration stop criterion is satisfied (step S2107). If the criterion is not satisfied (No at step S2107), the process returns to step S2104, and steps S2104 to S2106 are executed repeatedly. If the criterion is satisfied at step S2107 (Yes at step S2107), all processing ends.

【０１８６】分類結果はどのような初期クラスタ重心を
選択するかに強く依存するといわれている。したがっ
て、分類実行部での分類手法として、ｋ−ｍｅａｎｓ法
などの非階層型クラスタリング手法をもちいて、その初
期クラスタ重心を操作者が指定することで、操作者の分
類手続きへの介入を可能にし、操作者の意図を反映した
文書分類が実現できる。It is said that the classification result strongly depends on what initial cluster centroid is selected. Therefore, a non-hierarchical clustering method such as the k-means method is used as a classification method in the classification execution unit, and the operator specifies the initial cluster centroid, thereby enabling the operator to intervene in the classification procedure. Thus, document classification reflecting the intention of the operator can be realized.

【０１８７】なお、各文書特徴ベクトルとクラスタの重
心ベクトルとの類似度を算出し、各特徴ベクトルで最も
類似度の高い分類代表値にその文書特徴ベクトルを帰属
させる形式の分類手法であれば、非階層型クラスタリン
グ以外の手法でも利用可能である。また、クラスタの重
心ベクトルと文書ベクトルとの類似度を測るための類似
測度としては、余弦測度、内積測度、ユークリッド距離
測度、マハラノビス距離測度などが利用可能である。Note that methods other than non-hierarchical clustering can also be used, provided that the classification method calculates the similarity between each document feature vector and the centroid vector of each cluster and assigns each document feature vector to the classification representative value with which it has the highest similarity. As the similarity measure between a cluster centroid vector and a document vector, a cosine measure, an inner product measure, a Euclidean distance measure, a Mahalanobis distance measure, or the like can be used.
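The four measures named above can be written compactly as in the sketch below. Two points are assumptions made for illustration only: the Mahalanobis distance is shown in a simplified form with a diagonal covariance (one variance per dimension), whereas the general form uses the inverse of a full covariance matrix; and the helper `most_similar` with its `larger_is_closer` flag is a hypothetical name, added because the two distances rank in the opposite direction from the cosine and inner product measures.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_measure(a, b):
    na = math.sqrt(inner_product(a, a))
    nb = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (na * nb) if na and nb else 0.0


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mahalanobis_distance_diag(a, b, variances):
    """Mahalanobis distance assuming a diagonal covariance matrix."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))


def most_similar(doc, centroids, measure, larger_is_closer=True):
    """Index of the centroid closest to doc under the given measure."""
    scores = [measure(doc, c) for c in centroids]
    pick = max if larger_is_closer else min
    return scores.index(pick(scores))


centroids = [[0.0, 1.0], [1.0, 0.0]]
doc = [1.0, 0.2]
by_cosine = most_similar(doc, centroids, cosine_measure)
by_distance = most_similar(doc, centroids, euclidean_distance,
                           larger_is_closer=False)
```

For the cosine and inner product measures a larger value means a closer match, while for the two distances a smaller value does, so an assignment step has to know which direction applies.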

【０１８８】初期重心指定部２００１によって、前記分
類対象データと同一の特徴次元数をもつ任意の複数の文
書ベクトルがクラスタリングの初期重心として入力され
る。前記任意の文書ベクトルは操作者により指定するこ
ともできるし、また分類対象の文書特徴ベクトルなどに
基づいて構築した規則を操作者が選択することにより間
接的に文書ベクトルを指定することもできる。Through the initial center of gravity specifying unit 2001, a plurality of arbitrary document vectors having the same number of feature dimensions as the classification target data are input as the initial centroids for clustering. These arbitrary document vectors can be specified directly by the operator, or indirectly, by having the operator select a rule constructed from the document feature vectors to be classified or the like.

【０１８９】また、前記任意の文書ベクトルとしては、
前記文書特徴ベクトルと同じ次元数を有するベクトル形
式であれば、どのようなものでも適用可能である。ま
た、任意の文書ベクトルとしては、そのほかに、より操
作者にとって理解しやすいものとして、分類対象文書群
以外の文書を文書特徴ベクトルと同じ次元をもつベクト
ルに表現したものをもちいることができる。The arbitrary document vector can be anything in a vector format having the same number of dimensions as the document feature vector. As something easier for the operator to understand, a document outside the group of documents to be classified, expressed as a vector having the same dimensions as the document feature vector, can also be used as an arbitrary document vector.

【０１９０】また、任意の文書ベクトルとしてそのほか
には、文書特徴べクトルをもちいることができる。任意
の文書ベクトルとしてそのほかには、解析部４０２によ
って抽出される単語や単語品詞情報をもちいることがで
きる。また、任意の文書ベクトルとしてそのほかには、
分類結果記憶部４０７によって記憶されている事前にお
こなわれた分類結果である分類代表値をもちいることが
できる。A document feature vector can also be used as an arbitrary document vector, as can words or word part-of-speech information extracted by the analysis unit 402. A classification representative value, which is the result of a classification performed in advance and stored in the classification result storage unit 407, can also be used as an arbitrary document vector.

【０１９１】上記の指示情報は、それぞれ単独でもちい
るほか、それらを適切に組み合わせたものをもちいるよ
うにしてもよい。The above-mentioned instruction information may be used alone or may be used by appropriately combining them.

【０１９２】２つの任意の文書ベクトル、３つの文書特
徴ベクトル、一つの単語、一つの分類代表値とそれらの
組み合わせ規則を指定することで、５つの初期クラスタ
重心を求める例を図２２に示す。図２２に示すとおり、
本実施の形態では、初期クラスタ重心１として文書１
を、初期クラスタ重心２として文書２と文書３の平均
を、初期クラスタ重心３として文書４と単語１の平均
を、初期クラスタ重心４として文書５を、初期クラスタ
重心５として分類代表値１を各々指定している。FIG. 22 shows an example in which five initial cluster centroids are obtained by specifying two arbitrary document vectors, three document feature vectors, one word, one classification representative value, and rules for combining them. As shown in FIG. 22, in the present embodiment document 1 is specified as initial cluster centroid 1, the average of documents 2 and 3 as initial cluster centroid 2, the average of document 4 and word 1 as initial cluster centroid 3, document 5 as initial cluster centroid 4, and classification representative value 1 as initial cluster centroid 5.

【０１９３】また、指定された文書ベクトルが、操作者
が指定したクラスタ数に満たない場合には、ｋ−ｍｅａ
ｎｓ法などでもちいれられている一般的な自動初期重心
選出法をもちいて残りのクラスタ重心を求めることがで
きる。このようにして求めた初期重心に基づいてｋ−ｍ
ｅａｎｓ法等をもちいて、クラスタの精練化をおこなう
ことで文書分類を実行する。If the number of specified document vectors is less than the number of clusters specified by the operator, the remaining cluster centroids can be obtained with a general automatic initial centroid selection method such as those used with the k-means method. Document classification is then executed by refining the clusters with the k-means method or the like, starting from the initial centroids obtained in this way.

【０１９４】以上説明したように、この実施の形態６に
よれば、文書分類手法として、非階層型クラスタリング
手法をもちいて、その際に必要となる初期クラスタ重心
を、操作者が任意に指定することができ、その指定され
た初期クラスタ重心にしたがって文書分類をおこなうの
で、操作者の意図を反映する文書分類をおこなうことが
できる。As described above, according to the sixth embodiment, the non-hierarchical clustering method is used as the document classification method, and the operator can arbitrarily specify the initial cluster center of gravity required at that time. Since the document classification is performed according to the designated initial cluster center of gravity, the document classification reflecting the intention of the operator can be performed.

【０１９５】[0195]

【発明の効果】以上説明したように、請求項１の発明に
よれば、分類対象である文書群での文書間の類似性に基
づいて、各文書をそれら文書間の意味的な関連性を反映
しうる表現空間へ変換するための表現空間変換関数を算
出し、その表現空間で文書分類をおこなうことにより、
操作者の意図を反映しうる文書分類を実現することが可
能な文書分類装置が得られるという効果を奏する。As described above, according to the invention of claim 1, an expression space conversion function for converting each document into an expression space that can reflect the semantic relevance between documents is calculated based on the similarity between documents in the group of documents to be classified, and document classification is performed in that expression space. This provides a document classification device capable of realizing document classification that can reflect the operator's intention.

【０１９６】また、請求項２の発明によれば、表現空間
変換関数を導出する際に必要となる文書間の類似性とし
て文書特徴ベクトル間の内積をもちいることにより、文
書間の意味的な関連性を反映した文書分類をおこなうこ
とが可能な文書分類装置が得られるという効果を奏す
る。According to the invention of claim 2, the inner product between document feature vectors is used as the similarity between documents required for deriving the expression space conversion function, which provides a document classification device capable of performing document classification that reflects the semantic relevance between documents.

【０１９７】また、請求項３の発明によれば、表現空間
変換関数を導出する際に必要となる文書間の類似性とし
て文書特徴ベクトル間の内積に加え、文書の作成者や作
成日などの文書間類似情報ももちいることにより、文書
間の意味的な関連性を反映した文書分類をおこなうこと
が可能な文書分類装置が得られるという効果を奏する。According to the invention of claim 3, inter-document similarity information such as the author and creation date of each document is used, in addition to the inner product between document feature vectors, as the similarity between documents required for deriving the expression space conversion function. This provides a document classification device capable of performing document classification that reflects the semantic relevance between documents.

【０１９８】また、請求項４の発明によれば、算出する
文書特徴ベクトルと表現空間変換関数を記憶することに
より、表現空間変換関数を算出する部分と実際に前記表
現空間変換関数をもちいて変換された文書をもちいて文
書分類をおこなう部分とを分離して処理するので、その
都度、表現空間変換関数を算出することなしに文書分類
を実行でき、さらに、前記文書特徴ベクトル変換部でも
ちいる表現空間変換関数として、事前に他の文書特徴ベ
クトルに基づいて生成された表現空間変換関数をもちい
ることもできるため、文書分類の繰り返し実行を短時間
で効率良くおこなうことが可能な文書分類装置が得られ
るという効果を奏する。According to the invention of claim 4, the calculated document feature vectors and expression space conversion function are stored, so the part that calculates the expression space conversion function is processed separately from the part that actually classifies documents using the documents converted with that function. Document classification can therefore be executed without recalculating the expression space conversion function each time. Furthermore, an expression space conversion function generated in advance from other document feature vectors can be used as the expression space conversion function in the document feature vector conversion unit. This provides a document classification device capable of repeating document classification efficiently in a short time.

【０１９９】また、請求項５の発明によれば、文書分類
の繰り返し実行をおこなう際、個々の分類実行ごとに、
文書特徴ベクトルやそれらを構成する特徴次元を操作す
ることで、各分類ごとに異なる単語を削除して文書分類
を実行する等の分類対象文書の範囲の変更や分類をおこ
なう空間の変更をおこなうことが可能な文書分類装置が
得られるという効果を奏する。According to the invention of claim 5, when document classification is executed repeatedly, the document feature vectors and the feature dimensions that constitute them are manipulated for each individual classification run. This provides a document classification device that can change the range of documents to be classified and the space in which classification is performed, for example by deleting different words for each classification run.

【０２００】また、請求項６の発明によれば、表現空間
変換関数が文書特徴ベクトルの内積をに基づいて算出さ
れる場合、表現空間変換関数をもちいて変換された文書
をもちいて文書分類をおこなう部分において、文書特徴
ベクトルやその特徴次元が操作された場合に生じる表現
空間変換関数の不整合を簡便に修正することができるの
で、より適正な文書特徴ベクトルの変換をおこなうこと
が可能な文書分類装置が得られるという効果を奏する。Further, according to the invention of claim 6, when the expression space conversion function is calculated based on the inner product of document feature vectors, any inconsistency in the expression space conversion function that arises when document feature vectors or their feature dimensions are manipulated can be corrected easily in the part that classifies documents using the converted documents. This provides a document classification device capable of converting document feature vectors more appropriately.

【０２０１】また、請求項７の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元につ
いて操作者が簡便な操作をすることにより、操作者の意
図を反映しうる文書分類をおこなうことが可能な文書分
類装置が得られるという効果を奏する。According to the invention of claim 7, the operator performs a simple operation on the feature dimensions of the space formed by using the expression space conversion function, which provides a document classification device capable of performing document classification that can reflect the operator's intention.

【０２０２】また、請求項８の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元につ
いて、操作者により指示された分類対象以外の任意の文
書ベクトルデータをもちいての簡便な操作をすることに
より、操作者の意図を反映しうる文書分類をおこなうこ
とが可能な文書分類装置が得られるという効果を奏す
る。Further, according to the invention of claim 8, regarding the feature dimension of the space formed by using the expression space conversion function, any document vector data other than the classification target specified by the operator is used. By performing a simple operation, it is possible to obtain a document classification device capable of performing document classification that can reflect the intention of the operator.

【０２０３】また、請求項９の発明によれば、前記表現
空間変換関数をもちいて構成される空間の特徴次元につ
いて、操作者により指示された文書特徴ベクトルをもち
いての簡便な操作をすることにより、操作者の意図を反
映しうる文書分類をおこなうことが可能な文書分類装置
が得られるという効果を奏する。According to the ninth aspect of the present invention, a simple operation can be performed on a feature dimension of a space formed by using the expression space conversion function using a document feature vector specified by an operator. As a result, there is an effect that a document classification device capable of performing document classification that can reflect the intention of the operator can be obtained.

【０２０４】また、請求項１０の発明によれば、前記表
現空間変換関数をもちいて構成される空間の特徴次元
を、操作者により指示された解析情報をもちいての簡便
な操作をすることにより、操作者の意図を反映しうる文
書分類をおこなうことが可能な文書分類装置が得られる
という効果を奏する。According to the tenth aspect of the present invention, the feature dimension of the space formed by using the expression space conversion function can be easily operated by using the analysis information specified by the operator. Thus, there is an effect that a document classification device capable of performing document classification that can reflect the intention of the operator can be obtained.

【０２０５】また、請求項１１の発明によれば、前記表
現空間変換関数をもちいて構成される空間の特徴次元
を、操作者により指示された事前に分類された分類結果
をもちいての簡便な操作をすることにより、操作者の意
図を反映しうる文書分類をおこなうことが可能な文書分
類装置が得られるという効果を奏する。According to the eleventh aspect of the present invention, a feature dimension of a space formed by using the expression space conversion function can be easily determined by using a classification result specified in advance by an operator. By performing the operation, it is possible to obtain a document classification device capable of performing document classification that can reflect the intention of the operator.

【０２０６】また、請求項１２の発明によれば、文書分
類手法として、非階層型クラスタリング手法をもちい
て、その際に必要となる初期クラスタ重心を、操作者が
任意に指定することができ、その指定された初期クラス
タ重心にしたがって文書分類をおこなうので、操作者の
意図を反映する文書分類をおこなうことが可能な文書分
類装置が得られるという効果を奏する。According to the twelfth aspect of the present invention, a non-hierarchical clustering method is used as a document classification method, and an operator can arbitrarily designate an initial cluster centroid required at that time. Since the document classification is performed according to the designated initial cluster centroid, an effect is obtained that a document classification device capable of performing the document classification reflecting the intention of the operator is obtained.

【０２０７】また、請求項１３の発明によれば、文書分
類手法として、非階層型クラスタリング手法をもちい
て、その際に必要となる初期クラスタ重心として、分類
対象以外の任意の文書をもちいることができるので、操
作者の意図を反映する文書分類をおこなうことが可能な
文書分類装置が得られるという効果を奏する。According to the thirteenth aspect of the present invention, a non-hierarchical clustering method is used as a document classification method, and any document other than a classification target is used as an initial cluster centroid required at that time. Therefore, there is an effect that a document classifying device capable of classifying documents reflecting the intention of the operator can be obtained.

【０２０８】また、請求項１４の発明によれば、文書分
類手法として、非階層型クラスタリング手法をもちい
て、その際に必要となる初期クラスタ重心として、文書
特徴ベクトルをもちいることができるので、操作者の意
図を反映する文書分類をおこなうことが可能な文書分類
装置が得られるという効果を奏する。According to the fourteenth aspect of the present invention, a non-hierarchical clustering method can be used as a document classification method, and a document feature vector can be used as an initial cluster centroid required at that time. This has the effect of providing a document classification device that can perform document classification reflecting the intention of the operator.

【０２０９】また、請求項１５の発明によれば、文書分
類手法として、非階層型クラスタリング手法をもちい
て、その際に必要となる初期クラスタ重心として、分類
対象文書を文書解析部に作用させた結果得られる単語等
の解析情報をもちいることができるので、操作者の意図
を反映する文書分類をおこなうことが可能な文書分類装
置が得られるという効果を奏する。According to the invention of claim 15, a non-hierarchical clustering method is used as the document classification method, and analysis information such as words, obtained by applying the document analysis unit to the documents to be classified, can be used as the initial cluster centroids required at that time. This provides a document classification device capable of performing document classification that reflects the operator's intention.

【０２１０】また、請求項１６の発明によれば、文書分
類手法として、非階層型クラスタリング手法をもちい
て、その際に必要となる初期クラスタ重心として、事前
に分類された分類結果をもちいることができるので、操
作者の意図を反映する文書分類をおこなうことが可能な
文書分類方法が得られるという効果を奏する。According to the sixteenth aspect of the present invention, a non-hierarchical clustering method is used as a document classification method, and a classification result preliminarily classified is used as an initial cluster centroid required at that time. Therefore, there is an effect that a document classification method capable of performing document classification reflecting the intention of the operator can be obtained.

【０２１１】また、請求項１７の発明によれば、分類対
象である文書群での文書間の類似性に基づいて、各文書
をそれら文書間の意味的な関連性を反映しうる表現空間
へ変換するための表現空間変換関数を算出し、その表現
空間で文書分類をおこなうことにより、操作者の意図を
反映しうる文書分類を実現することが可能な文書分類方
法が得られるという効果を奏する。According to the seventeenth aspect, based on the similarity between documents in a group of documents to be classified, each document is converted into an expression space that can reflect the semantic relevance between the documents. By calculating an expression space conversion function for conversion and performing document classification in the expression space, it is possible to obtain a document classification method capable of realizing a document classification that can reflect the intention of the operator. .

【０２１２】また、請求項１８の発明によれば、表現空
間変換関数を導出する際に必要となる文書間の類似性と
して文書特徴ベクトル間の内積をもちいることにより、
文書間の意味的な関連性を反映した文書分類をおこなう
ことが可能な文書分類方法が得られるという効果を奏す
る。According to the invention of claim 18, the inner product between document feature vectors is used as the similarity between documents required for deriving the expression space conversion function, which provides a document classification method capable of performing document classification that reflects the semantic relevance between documents.

【０２１３】また、請求項１９の発明によれば、表現空
間変換関数を導出する際に必要となる文書間の類似性と
して文書特徴ベクトル間の内積に加え、文書の作成者や
作成日などの文書間類似情報ももちいることにより、文
書間の意味的な関連性を反映した文書分類をおこなうこ
とが可能な文書分類方法が得られるという効果を奏す
る。According to the invention of claim 19, inter-document similarity information such as the author and creation date of each document is used, in addition to the inner product between document feature vectors, as the similarity between documents required for deriving the expression space conversion function. This provides a document classification method capable of performing document classification that reflects the semantic relevance between documents.

【０２１４】また、請求項２０の発明によれば、算出す
る文書特徴ベクトルと表現空間変換関数を記憶すること
により、表現空間変換関数を算出する部分と実際に前記
表現空間変換関数をもちいて変換された文書をもちいて
文書分類をおこなう部分とを分離して処理するので、そ
の都度、表現空間変換関数を算出することなしに文書分
類を実行でき、さらに、前記文書特徴ベクトル変換部で
もちいる表現空間変換関数として、事前に他の文書特徴
ベクトルに基づいて生成された表現空間変換関数をもち
いることもできるため、文書分類の繰り返し実行を短時
間で効率良くおこなうことが可能な文書分類方法が得ら
れるという効果を奏する。According to the invention of claim 20, the calculated document feature vectors and expression space conversion function are stored, so the part that calculates the expression space conversion function is processed separately from the part that actually classifies documents using the documents converted with that function. Document classification can therefore be executed without recalculating the expression space conversion function each time. Furthermore, an expression space conversion function generated in advance from other document feature vectors can be used as the expression space conversion function in the document feature vector conversion unit. This provides a document classification method capable of repeating document classification efficiently in a short time.

【０２１５】また、請求項２１の発明によれば、文書分
類の繰り返し実行をおこなう際、個々の分類実行ごと
に、文書特徴ベクトルやそれらを構成する特徴次元を操
作することで、各分類ごとに異なる単語を削除して文書
分類を実行する等の分類対象文書の範囲の変更や分類を
おこなう空間の変更をおこなうことが可能な文書分類方
法が得られるという効果を奏する。According to the twenty-first aspect of the present invention, when the document classification is repeatedly performed, the document feature vectors and the feature dimensions constituting them are manipulated for each individual classification, so that each classification is performed. This has the effect of providing a document classification method that can change the range of the document to be classified and change the space in which the classification is performed, such as executing the document classification by deleting different words.

【０２１６】また、請求項２２の発明によれば、表現空
間変換関数が文書特徴ベクトルの内積をに基づいて算出
される場合、表現空間変換関数をもちいて変換された文
書をもちいて文書分類をおこなう部分において、文書特
徴ベクトルやその特徴次元が操作された場合に生じる表
現空間変換関数の不整合を簡便に修正することができる
ので、より適正な文書特徴ベクトルの変換をおこなうこ
とが可能な文書分類方法が得られるという効果を奏す
る。According to the invention of claim 22, when the expression space conversion function is calculated based on the inner product of document feature vectors, any inconsistency in the expression space conversion function that arises when document feature vectors or their feature dimensions are manipulated can be corrected easily in the part that classifies documents using the converted documents. This provides a document classification method capable of converting document feature vectors more appropriately.

【０２１７】また、請求項２３の発明によれば、前記表
現空間変換関数をもちいて構成される空間の特徴次元に
ついて操作者が簡便な操作をすることにより、操作者の
意図を反映しうる文書分類をおこなうことが可能な文書
分類方法が得られるという効果を奏する。According to the invention of claim 23, the operator performs a simple operation on the feature dimensions of the space formed by using the expression space conversion function, which provides a document classification method capable of performing document classification that can reflect the operator's intention.

【０２１８】また、請求項２４の発明によれば、前記表
現空間変換関数をもちいて構成される空間の特徴次元に
ついて、操作者により指示された分類対象以外の任意の
文書ベクトルデータをもちいての簡便な操作をすること
により、操作者の意図を反映しうる文書分類をおこなう
ことが可能な文書分類方法が得られるという効果を奏す
る。According to the twenty-fourth aspect of the present invention, for the feature dimension of a space formed by using the expression space conversion function, any document vector data other than the classification target specified by the operator is used. By performing a simple operation, it is possible to obtain a document classification method capable of performing a document classification that can reflect the intention of the operator.

【０２１９】また、請求項２５の発明によれば、前記表
現空間変換関数をもちいて構成される空間の特徴次元に
ついて、操作者により指示された文書特徴ベクトルをも
ちいての簡便な操作をすることにより、操作者の意図を
反映しうる文書分類をおこなうことが可能な文書分類方
法が得られるという効果を奏する。According to the twenty-fifth aspect of the present invention, a simple operation can be performed on a feature dimension of a space formed by using the expression space conversion function using a document feature vector specified by an operator. Accordingly, there is an effect that a document classification method capable of performing document classification that can reflect the intention of the operator can be obtained.

[0220] According to the invention of claim 26, the feature dimensions of the space constructed with the expression space conversion function can be manipulated through a simple operation that uses analysis information specified by the operator. This yields a document classification method capable of classifying documents in a way that reflects the operator's intention.

[0221] According to the invention of claim 27, the feature dimensions of the space constructed with the expression space conversion function can be manipulated through a simple operation that uses a previously obtained classification result specified by the operator. This yields a document classification method capable of classifying documents in a way that reflects the operator's intention.

[0222] According to the invention of claim 28, a non-hierarchical clustering technique is used to classify the documents, and the operator can freely specify the initial cluster centroids that the technique requires. Because documents are classified according to the specified initial cluster centroids, this yields a document classification method capable of classifying documents in a way that reflects the operator's intention.
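
As a minimal sketch of this idea, the following uses a k-means style procedure as the non-hierarchical clustering technique and cosine similarity as the similarity measure. Both choices are assumptions made for illustration, as is the function name `cluster_with_initial_centroids`; the paragraph above only requires that the operator supply the initial cluster centroids.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_with_initial_centroids(vectors, initial_centroids, max_iter=100):
    """Assign each vector to its most similar centroid, then recompute centroids,
    starting from the centroids supplied by the operator."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical converted document feature vectors.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, _ = cluster_with_initial_centroids(docs, [[1.0, 0.0], [0.0, 1.0]])
swapped, _ = cluster_with_initial_centroids(docs, [[0.0, 1.0], [1.0, 0.0]])
```

Swapping the two initial centroids swaps the cluster numbering, which shows how the operator's choice of starting points determines which cluster each group of documents ends up in.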

[0223] According to the invention of claim 29, a non-hierarchical clustering technique is used to classify the documents, and any document other than those to be classified can be used as an initial cluster centroid required by the technique. This yields a document classification method capable of classifying documents in a way that reflects the operator's intention.

[0224] According to the invention of claim 30, a non-hierarchical clustering technique is used to classify the documents, and a document feature vector can be used as an initial cluster centroid required by the technique. This yields a document classification method capable of classifying documents in a way that reflects the operator's intention.

[0225] According to the invention of claim 31, a non-hierarchical clustering technique is used to classify the documents, and analysis information, such as words, obtained by applying the document analysis unit to the documents to be classified can be used as an initial cluster centroid required by the technique. This yields a document classification method capable of classifying documents in a way that reflects the operator's intention.

[0226] According to the invention of claim 32, a non-hierarchical clustering technique is used to classify the documents, and a previously obtained classification result can be used as the initial cluster centroids required by the technique. This yields a document classification method capable of classifying documents in a way that reflects the operator's intention.
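
One plausible way to turn an earlier classification result into initial cluster centroids is to average the vectors of each previously formed class. The sketch below assumes exactly that; the function name `centroids_from_previous_result` and the sample data are hypothetical, and this section does not state how the centroids are actually derived.

```python
from collections import defaultdict


def centroids_from_previous_result(vectors, previous_labels):
    """Average the vectors of each previously assigned class to obtain
    one initial cluster centroid per class."""
    groups = defaultdict(list)
    for vec, label in zip(vectors, previous_labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }


# Hypothetical document feature vectors and the classes stored from an earlier run.
vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
previous_labels = ["a", "a", "b"]
initial_centroids = centroids_from_previous_result(vectors, previous_labels)
```

The resulting centroids can then seed a new clustering run, so that a reclassification starts from the grouping the operator already accepted.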

[0227] According to the invention of claim 33, recording a program that causes a computer to execute the method of any one of claims 17 to 32 makes the program machine-readable. This yields a recording medium with which the operations of claims 17 to 32 can be realized by a computer.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram showing the hardware configuration of the entire information processing system constituting the document classification device according to Embodiment 1 of the present invention.

FIG. 2 is an explanatory diagram showing the hardware configuration of the server in the information processing system constituting the document classification device according to Embodiment 1.

FIG. 3 is an explanatory diagram showing the hardware configuration of the client in the information processing system constituting the document classification device according to Embodiment 1.

FIG. 4 is a block diagram showing the functional configuration of the document classification device according to Embodiment 1.

FIG. 5 is another block diagram showing the functional configuration of the document classification device according to Embodiment 1.

FIG. 6 is another block diagram showing the functional configuration of the document classification device according to Embodiment 1.

FIG. 7 is an explanatory diagram showing an example of document-word matrix data and document feature vectors in the document classification device according to Embodiment 1.

FIG. 8 is a flowchart showing the procedure of a series of processes performed by the document classification device according to Embodiment 1.

FIG. 9 is a flowchart showing another procedure of a series of processes performed by the document classification device according to Embodiment 1.

FIG. 10 is a block diagram showing the functional configuration of the document classification device according to Embodiment 2 of the present invention.

FIG. 11 is a flowchart showing the procedure of a series of processes performed by the document classification device according to Embodiment 2.

FIG. 12 is a block diagram showing the functional configuration of the document classification device according to Embodiment 3 of the present invention.

FIG. 13 is a flowchart showing the processing procedure of the vector correction unit of the document classification device according to Embodiment 3.

FIG. 14 is an explanatory diagram showing an example of a procedure for deleting feature dimensions from a document feature vector in the document classification device according to Embodiment 3.

FIG. 15 is a flowchart showing the procedure of a series of processes performed by the document classification device according to Embodiment 3.

FIG. 16 is a block diagram showing the functional configuration of the document classification device according to Embodiment 4 of the present invention.

FIG. 17 is a flowchart showing the procedure of a series of processes performed by the document classification device according to Embodiment 4.

FIG. 18 is a block diagram showing the functional configuration of the document classification device according to Embodiment 5 of the present invention.

FIG. 19 is a flowchart showing part of the procedure of a series of processes performed by the document classification device according to Embodiment 5.

FIG. 20 is a block diagram showing the functional configuration of the document classification device according to Embodiment 6 of the present invention.

FIG. 21 is a flowchart showing part of the procedure of a series of processes performed by the document classification device according to Embodiment 6.

FIG. 22 is an explanatory diagram showing an example of the process of obtaining initial cluster centroids in the document classification device according to Embodiment 6.

[Explanation of symbols]

101 server
102 client
103 network
201 CPU
204 I/F
206 disk device
301 CPU
306 hard disk
308 display
309 I/F
311 keyboard
312 mouse
313 scanner
401 input unit
402 analysis unit
403 vector generation unit
404 conversion function calculation unit
405 vector conversion unit
406 classification unit
407 classification result storage unit
421 inner product calculation unit
431 inter-document similarity information setting unit
1001 vector storage unit
1002 conversion function storage unit
1201 vector correction unit
1601 conversion function correction unit
1801 conversion function correction instruction unit
2001 initial centroid specification unit
2002 initial centroid registration unit

Continued from the front page
(72) Inventor: Kazuhisa Takeya, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Akiko Nakajima, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Tetsuo Nagatsuka, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Makoto Yamazaki, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Katsuhiko Fujita, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo

Claims


1. A document classification device comprising: input means for inputting document data; analysis means for analyzing the document data input by the input means to obtain analysis information; vector generation means for generating a document feature vector for the document data based on the analysis information obtained by the analysis means; conversion function calculation means for calculating an expression space conversion function for projecting the document feature vector generated by the vector generation means onto a space reflecting the similarity between document feature vectors; vector conversion means for converting the document feature vector generated by the vector generation means using the expression space conversion function calculated by the conversion function calculation means; classification means for classifying documents based on the similarity between the document feature vectors converted by the vector conversion means; and classification result storage means for storing the result of the document classification performed by the classification means.

2. The document classification device according to claim 1, further comprising inner product calculation means for calculating inner products between the document feature vectors generated by the vector generation means, wherein the conversion function calculation means calculates the expression space conversion function using the inner products calculated by the inner product calculation means.

3. The document classification device according to claim 2, further comprising inter-document similarity information setting means for setting inter-document similarity information, such as the creator and creation date, of the document data input by the input means, wherein the conversion function calculation means calculates the expression space conversion function using the inner products calculated by the inner product calculation means and the inter-document similarity information set by the inter-document similarity information setting means.

4. The document classification device according to any one of claims 1 to 3, further comprising: vector storage means for storing the document feature vector generated by the vector generation means; and conversion function storage means for storing the expression space conversion function calculated by the conversion function calculation means.

5. The document classification device according to any one of claims 1 to 4, further comprising vector correction means for correcting the document feature vector, before the document feature vector is converted by the vector conversion means, by manipulating the feature dimensions of the document feature vector using a rule constituted by the characteristics of the words extracted by the analysis means.

6. The document classification device according to claim 5, further comprising conversion function correction means for correcting, when the feature dimensions have been changed by the correction of the document feature vector by the vector correction means, the expression space conversion function calculated by the conversion function calculation means so that the vector conversion means can appropriately convert the document feature vector with the changed feature dimensions.

7. The document classification device according to claim 1, further comprising: conversion function correction instructing means for giving an instruction regarding an operation on a feature dimension of the expression space conversion function; and conversion function correction means for correcting the expression space conversion function based on the content of the instruction given by the conversion function correction instructing means.

8. The document classification device according to claim 7, wherein the content of the instruction regarding the operation on the feature dimension given by the conversion function correction instructing means is to operate the feature dimension of the expression space conversion function using arbitrary document vector data.

9. The document classification device according to claim 7, wherein the content of the instruction regarding the operation on the feature dimension given by the conversion function correction instructing means is to operate the feature dimension of the expression space conversion function using a document feature vector.

10. The document classification device according to claim 7, wherein the content of the instruction regarding the operation on the feature dimension given by the conversion function correction instructing means is to operate the feature dimension of the expression space conversion function using the analysis information obtained by the analysis means.

11. The document classification device according to claim 7, wherein the content of the instruction regarding the operation on the feature dimension given by the conversion function correction instructing means is to operate the feature dimension of the expression space conversion function using the classification result stored by the classification result storage means.

12. The document classification device according to any one of claims 1 to 11, further comprising: initial centroid specifying means for specifying an initial cluster centroid; and initial centroid registering means for registering the initial cluster centroid specified by the initial centroid specifying means, wherein the classification means classifies the documents according to the initial cluster centroid registered by the initial centroid registering means.

13. The document classification apparatus according to claim 12, wherein any document vector data is specified as an initial cluster centroid specified by said initial centroid specifying means.

14. The document classification device according to claim 12, wherein a document feature vector is specified as an initial cluster centroid specified by said initial centroid specifying means.

15. The document classification device according to claim 12, wherein the analysis information obtained by the analysis means is specified as the initial cluster centroid specified by the initial centroid specifying means.

16. The document classification apparatus according to claim 12, wherein the classification result stored by said classification result storage means is specified as an initial cluster centroid specified by said initial centroid specification means.

17. A document classification method comprising: a first step of inputting document data; a second step of analyzing the document data input in the first step to obtain analysis information; a third step of generating a document feature vector for the document data based on the analysis information obtained in the second step; a fourth step of calculating an expression space conversion function for projecting the document feature vector generated in the third step onto a space reflecting the similarity between document feature vectors; a fifth step of converting the document feature vector generated in the third step using the expression space conversion function calculated in the fourth step; a sixth step of classifying documents based on the similarity between the document feature vectors converted in the fifth step; and a seventh step of storing the result of the document classification performed in the sixth step.

18. The document classification method according to claim 17, further comprising an eighth step of calculating inner products between the document feature vectors generated in the third step, wherein the fourth step calculates the expression space conversion function using the inner products calculated in the eighth step.

19. The document classification method according to claim 18, further comprising a ninth step of setting inter-document similarity information, such as the creator and creation date, of the document data input in the first step, wherein the fourth step calculates the expression space conversion function using the inner products calculated in the eighth step and the inter-document similarity information set in the ninth step.

20. The document classification method according to any one of claims 17 to 19, further comprising: a tenth step of storing the document feature vector generated in the third step; and an eleventh step of storing the expression space conversion function calculated in the fourth step.

21. The document classification method according to claim 17, further comprising a twelfth step of correcting the document feature vector, before the document feature vector is converted in the fifth step, by manipulating the document feature vector and/or the feature dimensions constituting the document feature vector using a rule constituted by the characteristics of the words extracted in the second step.

22. The document classification method according to claim 21, further comprising a thirteenth step of correcting, when the feature dimensions have been changed by the correction of the document feature vector in the twelfth step, the expression space conversion function calculated in the fourth step so that the document feature vector with the changed feature dimensions can be appropriately converted in the fifth step.

23. The document classification method according to any one of claims 17 to 21, further comprising: a fourteenth step of giving an instruction regarding an operation on a feature dimension of the expression space conversion function; and a fifteenth step of correcting the expression space conversion function based on the content of the instruction given in the fourteenth step.

24. The document classification method according to claim 23, wherein the content of the instruction regarding the operation on the feature dimension specified in the fifteenth step is to operate the feature dimension of the expression space conversion function using arbitrary document vector data.

25. The document classification method according to claim 23, wherein the content of the instruction regarding the operation on the feature dimension specified in the fifteenth step is to operate the feature dimension of the expression space conversion function using a document feature vector.

26. The document classification method according to claim 23, wherein the content of the instruction regarding the operation on the feature dimension specified in the fifteenth step is to operate the feature dimension of the expression space conversion function using the analysis information obtained in the second step.

27. The document classification method according to claim 23, wherein the content of the instruction regarding the operation on the feature dimension specified in the fifteenth step is to operate the feature dimension of the expression space conversion function using the classification result stored in the seventh step.

28. The document classification method according to claim 17, further comprising: a sixteenth step of specifying an initial cluster centroid; and a seventeenth step of registering the initial cluster centroid specified in the sixteenth step, wherein the sixth step classifies the documents according to the initial cluster centroid registered in the seventeenth step.

29. The document classification method according to claim 28, wherein any document vector data is specified as the initial cluster centroid specified in the sixteenth step.

30. The document classification method according to claim 28, wherein a document feature vector is specified as an initial cluster centroid specified in the sixteenth step.

31. The document classification method according to claim 28, wherein the analysis information obtained in the second step is specified as the initial cluster centroid specified in the sixteenth step.

32. The method according to claim 28, wherein the classification result stored in the seventh step is specified as the initial cluster centroid specified in the sixteenth step.

33. A computer-readable recording medium on which a program for causing a computer to execute the method according to claim 17 is recorded.
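
For readers who want a concrete picture of the seven steps of claim 17 together with the inner product of claim 18, the following is a minimal sketch under several loudly stated assumptions. The analysis step is taken to be simple word extraction, the feature vectors are term counts, the expression space conversion function is taken to be a linear projection onto the leading axes obtained from the matrix of document inner products (an approach in the style of latent semantic indexing), and classification is a greedy grouping by cosine similarity. None of these specific choices is stated in the claims, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def analyze(text):
    """Second step (assumed form): extract lowercase words as analysis information."""
    return re.findall(r"[a-z]+", text.lower())


def make_vectors(token_lists):
    """Third step: term-count feature vectors over a shared, sorted vocabulary."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([float(counts[t]) for t in vocab])
    return vectors


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def top_eigenpairs(matrix, k, iters=500):
    """Leading eigenpairs of a symmetric positive semidefinite matrix by power
    iteration with deflation (adequate only for a small illustration)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # arbitrary non-uniform start
        for _ in range(iters):
            w = [dot(row, v) for row in work]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in work])
        if lam < 1e-9:  # no significant component left
            break
        pairs.append((lam, v))
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


def conversion_function(vectors, k):
    """Fourth step (assumed form): rows are axes derived from the document
    inner products (the Gram matrix plays the role of the eighth step)."""
    gram = [[dot(a, b) for b in vectors] for a in vectors]
    axes = []
    for lam, u in top_eigenpairs(gram, k):
        axes.append([sum(u[d] * vec[t] for d, vec in enumerate(vectors)) / math.sqrt(lam)
                     for t in range(len(vectors[0]))])
    return axes


def convert(axes, vector):
    """Fifth step: project a document feature vector into the converted space."""
    return [dot(axis, vector) for axis in axes]


def classify(converted, threshold=0.9):
    """Sixth step (assumed form): join the first group whose representative is
    similar enough, otherwise start a new group."""
    representatives, labels = [], []
    for vec in converted:
        for label, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                labels.append(label)
                break
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


def classify_documents(documents, k=2):
    """First to seventh steps; the returned dict stands in for the stored result."""
    vectors = make_vectors([analyze(d) for d in documents])
    axes = conversion_function(vectors, k)
    converted = [convert(axes, v) for v in vectors]
    return dict(enumerate(classify(converted)))
```

With the hypothetical input `["cat dog cat", "dog cat", "stock market stock", "market stock price"]`, the two animal documents share no words with the two finance documents, so the sketch places them in separate groups.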