JPH10320403A - Method and device for generating retrieval expression, and record medium

Info

Publication number: JPH10320403A
Application number: JP9124562A
Authority: JP
Inventors: Hiroyuki Nakajima (中島 浩之); Tsuyoshi Kitani (木谷 強)
Original assignee: N T T DATA KK; NTT Data Corp
Current assignee: N T T DATA KK; NTT Data Group Corp
Priority date: 1997-05-14
Filing date: 1997-05-14
Publication date: 1998-12-04

Abstract

(57) [Abstract]

[Problem] To provide a search expression creation device that focuses on keyword importance and, by preferentially selecting keywords with a low frequency of occurrence in the document data as target keywords, can keep search precision at or above a fixed level.

[Solution] The device comprises a keyword extraction unit 31, a document set division unit 11, a keyword document frequency dictionary 12, and a search expression creation unit 33. When there are several candidate search keywords, the number of documents containing each keyword (its document frequency) is read from the keyword document frequency dictionary 12 and compared, and the keyword with the lower document frequency is preferentially selected as the search keyword. The search expression creation unit 33 joins the search keywords with the logical operators "and" and "or" to create the search expression.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to document retrieval technology applied to, for example, document databases used to locate specific items among large stores of electronic documents such as conference papers and technical documents, and to various support systems that use previously stored example documents to assist document writing and idea development. In particular, it relates to a technique for creating, by trial and error, a search expression that uses keywords extracted from electronic documents so that a searcher can efficiently locate documents of interest.

[0002]

[Prior art] A search expression creation device is known that extracts keywords from a document database storing the electronic documents to be searched and, working together with the searcher, builds the required search expression by trial and error from logical AND and logical OR combinations of those keywords.

[0003] FIG. 3 is a functional block diagram of a conventional search expression creation device of this kind. The search expression creation device 30 comprises the functional blocks of a keyword extraction unit 31, a document set division unit 32, and a search expression creation unit 33, which are formed when a computer device reads and executes a predetermined program. Each document is assumed to carry necessary/unnecessary designation information indicating whether it is a necessary document that interests the searcher or an unnecessary document that does not.

[0004] The keyword extraction unit 31 extracts multiple keywords from each of a plurality of documents by known morphological analysis. It also outputs, as a document set, discrimination information indicating whether each keyword appears in each document and identification information indicating whether the document is a necessary or an unnecessary document, together with a document identifier such as a document name or document number. Reference sign 31B shows an example of the content of the document set output by the keyword extraction unit 31.
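The document set described above can be pictured as one record per document. The following is a minimal sketch only: the `DocRecord` type, the field names, and the use of a lowercase whitespace split in place of morphological analysis are all assumptions made for illustration and do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocRecord:
    doc_id: str           # document identifier (name or number)
    keywords: frozenset   # keywords judged present in the document
    necessary: bool       # True = necessary document, False = unnecessary


def build_document_set(raw_docs, vocabulary):
    """Build the document set from (doc_id, text, necessary) tuples.

    A lowercase whitespace split stands in for morphological analysis here.
    """
    vocab = set(vocabulary)
    records = []
    for doc_id, text, necessary in raw_docs:
        present = frozenset(tok for tok in text.lower().split() if tok in vocab)
        records.append(DocRecord(doc_id, present, necessary))
    return records


# Hypothetical sample data.
docs = build_document_set(
    [("d1", "neural retrieval model", True),
     ("d2", "cooking recipe model", False)],
    vocabulary=["retrieval", "model", "recipe"],
)
```

Storing the keywords as a set per document keeps only presence or absence, which is all the division step uses.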

[0005] The document set division unit 32 divides the document set step by step on the basis of the discrimination information and determines the search keywords that form the basis of the search expression used for document retrieval. Dividing the set at each step with the discrimination information of as few keywords as possible (ideally one) makes it possible to extract the intent of the searcher who distinguished the necessary documents from the unnecessary ones. The search keywords determined by the document set division unit 32 are joined by the logical operator "and" or "or" in the search expression creation unit 33 and output to subsequent processing as a search expression.

[0006] The division of a document set in the document set division unit 32 is performed on the basis of, for example, "ID3", a known learning algorithm for decision trees (logical expressions represented as tree structures). The division of a document set by ID3 is outlined below with reference to FIG. 4. First, the document set sent from the keyword extraction unit 31 is taken as the initial document set Set₀ (step S101). Next, the "undivided" flag of Set₀ is turned on (step S102), and this set is taken as Set_i (step S103). Then, for each keyword t_j (1 ≤ j ≤ N) contained in the necessary and unnecessary documents of Set_i, the mutual information I(t_j) is calculated, which expresses the relationship of the information content of individual documents to that of the documents as a whole (step S104). Specifically, I(t_j) is the value obtained by subtracting the information content H(t_j) of the document sets that do and do not contain keyword t_j from the information content H of the undivided document set. H and H(t_j) are given by equations (1) and (2) below.

[0007]

[Equation 1]

[0008] The parameters in equations (1) and (2) are as follows.

p_i: number of necessary documents in Set_i
n_i: number of unnecessary documents in Set_i
s_i: p_i + n_i
p_i(t_j): number of necessary documents in Set_i that contain keyword t_j
n_i(t_j): number of unnecessary documents in Set_i that contain keyword t_j
s_i(t_j): p_i(t_j) + n_i(t_j)
p_i not(t_j): number of necessary documents in Set_i that do not contain keyword t_j
n_i not(t_j): number of unnecessary documents in Set_i that do not contain keyword t_j
s_i not(t_j): p_i not(t_j) + n_i not(t_j)
h(a, b, c): -{(a/c)·log₂(a/c) + (b/c)·log₂(b/c)}
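The body of equations (1) and (2) is not reproduced in this text. Given the parameters defined in [0008] and the statement in [0006] that I(t_j) is H minus H(t_j), the equations plausibly take the usual ID3 form sketched below. The weighting of each subset by its share of s_i is an assumption taken from standard ID3, not something the surviving text states.

```latex
\begin{align}
H &= h\bigl(p_i,\; n_i,\; s_i\bigr) \tag{1}\\
H(t_j) &= \frac{s_i(t_j)}{s_i}\,
          h\bigl(p_i(t_j),\; n_i(t_j),\; s_i(t_j)\bigr)
        + \frac{s_i^{\mathrm{not}}(t_j)}{s_i}\,
          h\bigl(p_i^{\mathrm{not}}(t_j),\; n_i^{\mathrm{not}}(t_j),\; s_i^{\mathrm{not}}(t_j)\bigr) \tag{2}\\
I(t_j) &= H - H(t_j)
\end{align}
```

Here a term of h with a zero count is taken as zero, the usual convention for entropy.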

[0009] Next, from the keywords t_j, the keyword t_k that maximizes the mutual information I(t_k) is selected and made a search keyword (step S105). If I(t_k) is a positive finite value (> 0) (step S106), the set is divided into Set_i′, the document set consisting of the numbers of the documents that contain search keyword t_k, and Set_i″, the document set consisting of the numbers of the documents that do not contain t_k, and the "undivided" flag of each resulting set is turned on (steps S107 to S110). i′ and i″ may be any values, provided that no document sets Set_i′ and Set_i″ already exist. If I(t_k) is zero (= 0), the document set is not divided (step S106).

[0010] The "undivided" flag of Set_i is then turned off (step S111). If any document set still has its "undivided" flag on, processing returns to step S103 (step S112, Yes) and repeats until no such document set remains. Processing ends when the "undivided" flags of all document sets have been turned off (step S112, No).
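Steps S101 to S112 can be sketched as a worklist loop. This is an illustrative sketch rather than the device's implementation: documents are reduced to hypothetical `(keyword_set, necessary)` pairs, a list of pending sets stands in for the "undivided" flags, the subset weighting follows standard ID3, and the small tolerance `eps` is an added guard against floating-point noise.

```python
import math


def h(a, b, c):
    """h(a, b, c) from [0008]; a zero count contributes nothing."""
    return -sum(x / c * math.log2(x / c) for x in (a, b) if x > 0)


def mutual_information(docs, kw):
    """I(t_j) = H - H(t_j) for a list of (keyword_set, necessary) pairs."""
    with_kw = [d for d in docs if kw in d[0]]
    without = [d for d in docs if kw not in d[0]]
    p_all = sum(d[1] for d in docs)
    total = h(p_all, len(docs) - p_all, len(docs))
    cond = 0.0
    for part in (with_kw, without):
        if part:
            p = sum(d[1] for d in part)
            cond += len(part) / len(docs) * h(p, len(part) - p, len(part))
    return total - cond


def split_document_set(docs, keywords, eps=1e-12):
    """Repeat steps S103 to S112 until no undivided set is left."""
    undivided = [docs]          # sets whose "undivided" flag is on
    leaves, chosen = [], []
    while undivided:
        current = undivided.pop()                     # S103: take Set_i
        best = max(keywords, key=lambda k: mutual_information(current, k))
        if mutual_information(current, best) > eps:   # S106: I(t_k) > 0
            chosen.append(best)                       # S105: search keyword
            undivided.append([d for d in current if best in d[0]])
            undivided.append([d for d in current if best not in d[0]])
        else:
            leaves.append(current)                    # S111: flag turned off
    return chosen, leaves


# Hypothetical sample data: keyword "a" separates the two classes exactly.
docs = [({"a", "b"}, True), ({"a"}, True), ({"b"}, False), ({"c"}, False)]
chosen, leaves = split_document_set(docs, ["a", "b", "c"])
```

A split with positive mutual information always leaves both subsets non-empty and smaller than the parent, so the loop terminates.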

[0011] For details of the decision tree learning algorithm ID3, see "Knowledge Acquisition and Learning Series 1: Introduction to Knowledge Acquisition" (ed. Michalski, R. S. et al., Kyoritsu Shuppan). The processing by ID3 described above can also be replaced by, for example, the known algorithm "C4.5", or by another algorithm that divides a document set using only the presence or absence of keywords in documents. For details of C4.5, see "C4.5: Programs for Machine Learning" (Quinlan, J. R., Morgan Kaufmann Publishers).

[0012] FIG. 5 is an explanatory diagram showing how, in the search expression creation device 30, one document set is divided into several document sets and the search expression is built up by trial and error. The conventional procedure for creating a search expression is described below with reference to FIG. 5. First, from the initial document set Set₀ output by the keyword extraction unit 31, the keyword with the largest mutual information is determined with the ID3 decision tree learning algorithm described above and is made a search keyword. Suppose here that search keyword kwd3 is determined. Using kwd3, Set₀ is divided into Set₁, the set of necessary documents that contain kwd3, and Set₂, the set of necessary and unnecessary documents that do not contain kwd3.

[0013] Set₁ cannot be divided further, but Set₂ can. The search keyword kwd2 with the largest mutual information in Set₂ is therefore determined, and Set₂ is divided into Set₃, the set of unnecessary documents that do not contain kwd2, and Set₄, the set of necessary and unnecessary documents that contain kwd2. Set₄ can be divided further, so the keyword kwd1 with the largest mutual information in Set₄ is determined as a search keyword, and Set₄ is divided into Set₅, the set of necessary documents that contain kwd1, and Set₆, the set of documents that do not contain kwd1. Neither Set₅ nor Set₆ can be divided further, so the division processing ends.

[0014] The search keywords kwd1 to kwd3 determined in the division processing are held successively in storage means (not shown) and passed to the search expression creation unit 33 when the division processing finishes. The search expression creation unit 33 joins the search keywords produced by the document set division unit 32 with the logical operators "and" and "or" to create the search expression query. Reference sign 33B shows an example of the search expression output by the search expression creation unit 33.

[0016]

[Problem to be solved by the invention] In the conventional search expression creation device 30, the search keywords used to divide the document set are determined on the basis of mutual information and the necessary/unnecessary discrimination information, without considering whether a keyword is truly important to the searcher. The resulting search expression may therefore contain unimportant search keywords, and when it is used in actual document retrieval it may not give sufficient search precision.

[0017] An object of the present invention is therefore to provide an improved method, carried out with a computer device, for creating a search expression that reflects keyword importance and can keep the search precision of document retrieval at or above a fixed level. Another object of the present invention is to provide a search expression creation device suited to carrying out that method, and a recording medium for implementing the method on a general-purpose computer device.

[0018]

[Means for solving the problem] The search expression creation method of the present invention that solves the above problem comprises: a step of extracting a plurality of keywords by morphological analysis of a designated document group in a document database; a step of detecting, for each keyword, the number of documents in which the extracted keyword appears; a step of determining as a search keyword the keyword for which the mutual information, obtained by subtracting the information content of the document groups that do and do not contain the keyword from the total information content of the designated document group, is largest and the number of documents containing the keyword is smallest; and a step of joining the determined search keywords with a logical expression to create the search expression used to search the document database.

[0019] Another search expression creation method of the present invention comprises: a step of extracting a plurality of keywords by morphological analysis of a designated document group in a document database; a step of detecting, for each keyword, the number of documents in which the extracted keyword appears; a step of adding, at a fixed ratio, the importance of the keyword, calculated from a monotonically decreasing function of that number of documents, and the mutual information, obtained by subtracting the information content of the document groups that do and do not contain the keyword from the total information content of the designated document group, and determining the keyword with the largest sum as a search keyword; and a step of joining the determined search keywords with a logical expression to create the search expression used for document retrieval.

[0020] The search expression creation device of the present invention that solves the other problem above is a device for creating a search expression used to locate specific documents in a document database, and comprises: a keyword extraction unit that extracts a plurality of keywords by morphological analysis from a designated document group in the document database and generates a document set gathering, together with the identification information of each designated document, discrimination information indicating whether each extracted keyword is contained in the document and designation information indicating whether the document is a necessary or an unnecessary document; a document set division unit that determines a single keyword as a search keyword on the basis of both the number of documents in which the keyword appears and the mutual information obtained by subtracting the information content of the document groups that do and do not contain the keyword from the total information content of the designated document group, and that uses the determined search keyword to divide one document set into a plurality of document sets; and a search expression creation unit that joins the search keywords used in dividing the document set with a logical expression to create the search expression.

[0021] A dictionary may also be provided that holds, for each keyword, a count made in advance of the number of documents in which the extracted keyword appears. In that case, the document set division unit is configured to look up the relevant document count in the dictionary when dividing a document set and to determine the search keyword from it.
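A dictionary of this kind amounts to a per-keyword count of containing documents. The sketch below assumes the keywords of each document are already available as a list; the function name and the sample keyword lists are hypothetical.

```python
from collections import Counter


def build_document_frequency(keyword_lists):
    """Count, for each keyword, the number of documents that contain it."""
    df = Counter()
    for kws in keyword_lists:
        df.update(set(kws))   # set(): a document counts once per keyword
    return dict(df)


# Hypothetical sample data: one keyword list per document.
doc_freq = build_document_frequency([
    ["kwd1", "kwd2", "kwd2"],
    ["kwd2", "kwd4"],
    ["kwd2"],
])
```

Because the counts are computed once, the division step only needs a dictionary lookup per candidate keyword.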

[0022] The document set division unit is configured to determine as the search keyword either the keyword for which the mutual information is largest and the number of documents is smallest, or the keyword for which the value obtained by calculating the importance of the keyword with a monotonically decreasing function of the number of documents and adding that importance and the mutual information, each at a fixed ratio, is largest.

[0023] The recording medium of the present invention that solves the other problem above records, in a form readable by a computer device, a program that causes the computer device to execute: morphological analysis processing that extracts a plurality of keywords from a designated document group in a document database; keyword extraction processing that generates a document set gathering, together with the identification information of each designated document, discrimination information indicating whether each keyword is contained in the document and discrimination information indicating whether the document is a necessary or an unnecessary document; document set division processing that determines a single keyword as a search keyword on the basis of both the number of documents in which the keyword appears and the mutual information obtained by subtracting the information content of the document groups that do and do not contain the keyword from the total information content of all the document groups, and that uses the determined search keyword to divide one document set into a plurality of document sets; and search expression creation processing that joins the search keywords used in dividing the document set with a logical expression to create the search expression.

[0024]

[Embodiments of the invention] An embodiment of the present invention is described in detail below. FIG. 1 is a functional block diagram showing an embodiment of a search expression creation device to which the present invention is applied. Functions identical to those of the conventional search expression creation device 30 described with FIG. 3 carry the same reference signs and are not described again.

[0025] The search expression creation device 10 of this embodiment comprises a keyword extraction unit 31, an improved document set division unit 11, a keyword document frequency dictionary 12, and a search expression creation unit 33, which are formed when a computer device reads and executes a predetermined program. The program is normally stored in storage means built into the computer device and read out as needed by the main control unit (CPU) of the computer device. It may instead be stored on a recording medium distributed separately from the computer device, such as a CD-ROM, and installed in the storage means at the time of use.

[0026] The document set division unit 11 divides the initial document set output by the keyword extraction unit 31 using the ID3 decision tree learning algorithm described above. In general, a keyword that appears infrequently in documents can be an important keyword for distinguishing between documents. In this embodiment, therefore, the number of documents in which each keyword appears (its document frequency) is stored in the keyword document frequency dictionary 12. The document frequency of each keyword is read from the keyword document frequency dictionary 12, and the importance of the keyword is calculated with a monotonically decreasing function of the document frequency, for example its reciprocal. The importance may also be stored in advance in the keyword document frequency dictionary 12 for each keyword, together with the document frequency.

[0027] The importance obtained in this way and the mutual information are added at a fixed ratio, and keywords with a larger sum are preferentially determined as search keywords. The document set is then divided with the determined search keyword, and the search keywords determined during division are output to the search expression creation unit 33.
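The fixed-ratio sum can be sketched as a weighted score. The reciprocal importance is the example given in [0026]; the `ratio` parameter, its default of 0.5, and the sample numbers are hypothetical, since the source says only that the two values are added at a fixed ratio.

```python
def importance(doc_freq):
    """Reciprocal of the document frequency, the example given in [0026]."""
    return 1.0 / doc_freq


def combined_score(mutual_info, doc_freq, ratio=0.5):
    """Weighted sum of mutual information and importance.

    `ratio` is a hypothetical weight in [0, 1] applied to the mutual
    information; the remainder is applied to the importance.
    """
    return ratio * mutual_info + (1.0 - ratio) * importance(doc_freq)


def pick_keyword(candidates, ratio=0.5):
    """candidates: dict of keyword -> (mutual_information, document_frequency)."""
    return max(candidates, key=lambda k: combined_score(*candidates[k], ratio))


# Hypothetical values: kwd2 has slightly more mutual information,
# but kwd4 occurs in far fewer documents.
candidates = {"kwd2": (0.40, 50), "kwd4": (0.38, 5)}
picked = pick_keyword(candidates)
```

With `ratio=1.0` the score reduces to plain mutual information, so the conventional choice is recovered as a special case.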

[0028] FIG. 2 is an explanatory diagram showing how, in the search expression creation device 10, one document set 31A is divided step by step into several document sets until the search expression is created. The procedure for creating a search expression according to this embodiment is described below with reference to FIG. 2.

[0029] First, the keyword with the largest mutual information in the initial document set Set₀ output by the keyword extraction unit 31 is determined as a search keyword. Suppose here that search keyword kwd3 is determined. Using kwd3, Set₀ is divided into Set₁, the set of necessary documents that contain kwd3, and Set₂, the set of documents that do not contain kwd3.

[0030] Set₁ cannot be divided further, but Set₂ can, so the mutual information of each keyword is calculated in Set₂ and the keyword with the largest mutual information is identified. In this example, two keywords, kwd2 and kwd4, are identified in Set₂. When several keywords are candidates for the search keyword in this way, the document frequencies stored for those keywords in the keyword document frequency dictionary 12 are read and compared. Here kwd4 is taken to have the lower document frequency and is determined as the search keyword. Set₂ is then divided into Set₃, the set of documents that do not contain kwd4, and Set₄, the set of documents that contain kwd4.
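The tie-break between kwd2 and kwd4 can be sketched as follows. The function name, the tolerance `tol` used to treat nearly equal floating-point values as tied, and all numbers are hypothetical illustrations.

```python
def select_search_keyword(mutual_info, doc_freq, tol=1e-9):
    """Pick the keyword with maximum mutual information.

    Keywords within `tol` of the maximum are treated as tied, and the tie
    is broken by the lowest document frequency.
    """
    best = max(mutual_info.values())
    tied = [k for k, v in mutual_info.items() if best - v <= tol]
    return min(tied, key=lambda k: doc_freq[k])


# Hypothetical values: kwd2 and kwd4 tie on mutual information.
selected = select_search_keyword(
    {"kwd1": 0.10, "kwd2": 0.42, "kwd4": 0.42},
    {"kwd1": 3, "kwd2": 120, "kwd4": 15},
)
```

Note that kwd1 has the lowest document frequency overall but is not selected, because document frequency is consulted only among the keywords that share the maximum mutual information.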

[0031] Set₄ can be divided further, so it is divided into Set₅, the set of necessary documents that contain search keyword kwd1, and Set₆, the set of documents that do not contain kwd1. Neither Set₅ nor Set₆ can be divided further, so the division processing ends.

[0032] The search keywords kwd1, kwd3, and kwd4 determined in the division processing are held successively in storage means (not shown) and output to the search expression creation unit 33 when the division processing finishes. The search expression creation unit 33 joins the search keywords received from the document set division unit 11 with the logical operators "and" and "or" to create the search expression query. Reference sign 33A shows an example of the search expression output by the search expression creation unit 33.
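The text does not reproduce the expression shown at 33A, so the following is only one plausible reading of how the keywords could be joined. It assumes a hypothetical tuple encoding of the division tree, assumes that Set₃ and Set₆ hold unnecessary documents, and joins the keywords on the "contains" branches along each path to a necessary leaf with "and", then joins the paths with "or".

```python
def paths_to_necessary(node, path=()):
    """Yield the 'contains' keywords on the way to each necessary leaf."""
    if node[0] == "leaf":
        if node[1]:
            yield path
        return
    _, kw, contains, lacks = node
    yield from paths_to_necessary(contains, path + (kw,))
    yield from paths_to_necessary(lacks, path)   # "lacks" branch adds no term


def build_query(tree):
    """Join each path with "and" and the paths with "or"."""
    terms = []
    for path in paths_to_necessary(tree):
        joined = " and ".join(path)
        terms.append(f"({joined})" if len(path) > 1 else joined)
    return " or ".join(terms)


# Hypothetical encoding of the division in [0029] to [0031]:
# ("split", keyword, subtree_containing, subtree_lacking) or ("leaf", necessary).
tree = ("split", "kwd3",
        ("leaf", True),                                   # Set1
        ("split", "kwd4",
         ("split", "kwd1", ("leaf", True), ("leaf", False)),  # Set5, Set6
         ("leaf", False)))                                # Set3
query = build_query(tree)
```

Dropping the "does not contain" branches keeps the expression to the two operators the source names; the sketch does not handle a necessary leaf that is reached only through such branches.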

[0033] Thus, according to the search expression creation device 10 of this embodiment, when for example several candidate search keywords share the maximum mutual information I, the document frequencies of those keywords are read from the keyword document frequency dictionary 12 and compared, and the keyword with the lower document frequency is preferentially selected as the search keyword. Likewise, a keyword for which the fixed-ratio sum of importance and mutual information is large is determined as the search keyword. This enables fast search expression creation and ensures that the resulting search expression contains important keywords.

[0034]

[Effects of the invention] As is clear from the above description, the present invention has the effect of producing a search expression that takes the importance of individual keywords into account. Using the resulting search expression also makes it possible to keep the retrieval precision for document data at or above a fixed level, which greatly improves the efficiency of retrieval processing.

[Brief description of the drawings]

【図１】本発明の検索式作成装置の実施形態を表す機能
構成図。FIG. 1 is a functional configuration diagram showing an embodiment of a search formula creation device of the present invention.

【図２】本発明の検索式作成装置の処理過程において得
られる情報の模式図。FIG. 2 is a schematic diagram of information obtained in a process of the search formula creation device of the present invention.

【図３】従来の検索式作成装置の機能構成図。FIG. 3 is a functional configuration diagram of a conventional search expression creation device.

【図４】従来の検索式作成装置の文書集合分割処理にお
ける手順図。FIG. 4 is a diagram showing a procedure in a document set division process of the conventional search expression creating apparatus.

【図５】従来の検索式作成装置の処理過程において得ら
れる情報の模式図。FIG. 5 is a schematic diagram of information obtained in a processing process of a conventional search expression creating device.

[Explanation of symbols]

10, 30: search expression creation device (検索式作成装置)
31: keyword extraction unit (キーワード抽出部)
11, 32: document set division unit (文書集合分割部)
12: keyword document frequency dictionary (キーワード文書頻度辞書)
33: search expression creation unit (検索式作成部)

Claims

[Claims]

1. A method for creating a search expression, comprising: a step of morphologically analyzing a designated document group in a document database to extract a plurality of keywords; a step of detecting, for each extracted keyword, the number of documents in which that keyword appears; a step of determining, as a search keyword, a keyword that maximizes the mutual information, obtained by subtracting the information amount of the document group that includes the keyword and the information amount of the document group that does not include it from the total information amount of the designated document group, and that minimizes the number of documents including the keyword; and a step of combining the determined search keywords with a logical expression to create a search expression used for searching the document database.

2. A method for creating a search expression using a computer device, comprising: a step of morphologically analyzing a designated document group in a document database to extract a plurality of keywords; a step of detecting, for each extracted keyword, the number of documents in which that keyword appears; a step of determining, as a search keyword, the keyword for which the sum, at a fixed ratio, of the importance of the keyword calculated using a monotonically decreasing function and the mutual information, obtained by subtracting the information amounts of the document groups that do and do not include the keyword from the total information amount of the designated document group, is maximum; and a step of combining the determined search keywords with a logical expression to create a search expression used for document search.

3. An apparatus for creating a search expression for retrieving a specific document from a document database, comprising: a keyword extraction unit that extracts a plurality of keywords from a designated document group in the document database by morphological analysis and generates a document set in which discrimination information indicating whether each extracted keyword is included in the document and designation information indicating whether the document is a required document or an unnecessary document are collected together with identification information of each designated document; a document set division unit that determines a single keyword as a search keyword based on the mutual information, obtained by subtracting the information amounts of the document groups that do and do not include the keyword from the total information amount of the designated document group, and on the number of documents in which the keyword appears, and that divides one document set into a plurality of document sets using the determined search keyword; and a search expression creation unit that creates the search expression by combining, with a logical expression, the search keywords used in dividing the document set.

4. The search expression creation device according to claim 3, further comprising a dictionary in which the number of documents in which each extracted keyword appears is counted in advance and held for each keyword, wherein the document set division unit determines the search keyword by looking up the number of documents held in the dictionary when dividing the document set.

5. The search expression creation device, wherein the document set division unit is configured to determine, as the search keyword, a keyword for which the mutual information is maximum and the number of documents is minimum.

6. The search expression creation device according to claim 4, wherein the document set division unit calculates the importance of each keyword using a monotonically decreasing function of the number of documents, and determines, as the search keyword, the keyword for which the sum of the calculated importance and the mutual information at a fixed ratio is maximum.

7. A computer-readable recording medium storing a program for causing a computer device to execute: a morphological analysis process of extracting a plurality of keywords from a designated document group in a document database; a keyword extraction process of generating a document set in which discrimination information indicating whether each keyword is included in the document and designation information indicating whether the document is a required document or an unnecessary document are collected together with identification information of each designated document; a document set division process of determining a single keyword as a search keyword based on the mutual information, obtained by subtracting the information amounts of the document groups that do and do not include the keyword from the total information amount of the designated document group, and on the number of documents in which the keyword appears, and of dividing one document set into a plurality of document sets using the determined search keyword; and a search expression creation process of creating the search expression by combining, with a logical expression, the search keywords used in dividing the document set.