JP2000207404A

JP2000207404A - Document search method and apparatus, and recording medium

Info

Publication number: JP2000207404A
Application number: JP11004587A
Authority: JP
Inventors: Junichi Aoe; 順一青江; Masakatsu Matsuoka; 正勝松岡
Original assignee: Sumitomo Metal Industries Ltd
Current assignee: Nippon Steel Corp
Priority date: 1999-01-11
Filing date: 1999-01-11
Publication date: 2000-07-28

Abstract

(57)【要約】【課題】任意の文書に類似する類似文書の検索に用い
るＮ−gramの数を有効に削減でき、類似文書の検索処理
を高速に行える文書検索方法及び装置、並びに、その検
索方法を実施するためのプログラムを記録した記録媒体
を提供する。【解決手段】検索対象文書に含まれる単語を抽出し
（Ｓ２〜Ｓ６）、その単語のＮ−gramを生成して（Ｓ７
〜Ｓ９）、検索対象文書と含まれるＮ−gramとの関係を
示す転置ベクトル表を準備しておく（Ｓ10，Ｓ11）。任
意の検索キー文書に含まれる単語を抽出して、その単語
のＮ−gramを生成する。転置ベクトル表と、その検索キ
ー文書におけるＮ−gramとに基づき、検索キー文書に類
似する類似文書を検索する。 (57) [Summary] [PROBLEMS] A document search method and apparatus capable of effectively reducing the number of N-grams used for searching for a similar document similar to an arbitrary document and performing high-speed similar document search processing, and a document search method therefor. Provided is a recording medium on which a program for executing a search method is recorded. SOLUTION: A word included in a search target document is extracted (S2 to S6), and an N-gram of the word is generated (S7).
To S9), a transposed vector table indicating the relationship between the search target document and the included N-gram is prepared (S10, S11). A word included in an arbitrary search key document is extracted, and an N-gram of the word is generated. A similar document similar to the search key document is searched based on the transposed vector table and the N-gram in the search key document.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、任意の文書に類似
する類似文書を複数の検索対象文書から検索する方法及
び装置、並びに、その検索方法を実施するためのプログ
ラムを記録した記録媒体に関する。[0001] 1. Field of the Invention [0002] The present invention relates to a method and an apparatus for retrieving a similar document similar to an arbitrary document from a plurality of retrieval target documents, and a recording medium on which a program for executing the retrieval method is recorded.

【０００２】[0002]

【従来の技術及び発明が解決しようとする課題】ワード
プロセッサなどによって作成される電子文書はその量が
増大しており、ユーザが指定した質問文書に類似する類
似文書を検索する対象である検索対象文書のデータ量も
膨大であり、その類似文書を検索する処理の高速化が困
難な状況である。このような質問文書に対する類似文書
を検索対象文書から検索する方法として、以下のような
３種の方法、特開平２−2458号公報に開示の方法（以下
従来例１という）、特開平６−110948号公報に開示の方
法（以下従来例２という）、特開平９−153051号公報に
開示の方法（以下従来例３という）が公知である。2. Description of the Related Art The amount of electronic documents created by a word processor or the like is increasing, and a search target document for searching for a similar document similar to a question document specified by a user. Is also enormous, and it is difficult to speed up the process of searching for similar documents. As a method of searching for a similar document to such a question document from a search target document, the following three methods, a method disclosed in JP-A-2-2458 (hereinafter referred to as Conventional Example 1), and a method described in A method disclosed in Japanese Patent Application Laid-Open No. 110948 (hereinafter referred to as Conventional Example 2) and a method disclosed in Japanese Patent Application Laid-Open No. Hei 9-153051 (hereinafter referred to as Conventional Example 3) are known.

【０００３】従来例１は、質問文書及び検索対象文書か
らキーワードを抽出し、抽出したキーワードを比較して
類似度を判定し、類似度が高い文書を類似文書として出
力する。この従来例１では、キーワードの抽出の仕方に
よって検索できない単語が発生する。また、検索漏れが
生じる。例えば、検索対象文書１で「開発作業」、検索
対象文書２で「開発」というキーワードが抽出されてい
る場合に、質問文書で「開発」とキーワード抽出がなさ
れると、検索対象文書１が検索されないことになる。[0003] In Conventional Example 1, keywords are extracted from a question document and a search target document, the extracted keywords are compared to determine similarity, and a document having a high similarity is output as a similar document. In the first conventional example, a word that cannot be searched is generated depending on a keyword extraction method. Also, search omission occurs. For example, if the keywords “development work” are extracted in the search target document 1 and “development” is extracted in the search target document 2 and the keyword “development” is extracted in the question document, the search target document 1 is searched. Will not be.

【０００４】従来例２は、複数の検索対象文書に対して
Ｎ−gram参照列を生成し、その各Ｎ−gramに重み付けを
行い、複数の参照列間の共通性を除去し、質問文書をＮ
−gramに分解し、分解した各Ｎ−gramに重み付けを行
い、質問文書のＮ−gramと検索対象文書の参照列のＮ−
gramとを比較し、質問文書と検索対象文書との類似度を
表す点数を求め、その点数に応じて類似文書を検索す
る。この従来例２では、検索対象文書の共通性を求め
て、類似度の判断に利用するので、検索対象文書に追
加，削除などの変化が起こった場合、共通性を再度求め
なくてはならず、検索対象文書の変化に柔軟に対応でき
ない。また、検索対象文書の集合が変化した場合、それ
らの共通性も変化するので、同一の質問文書であっても
類似度を表す点数が変化し、類似度の評価判断が逆転す
る可能性がある。また、検索対象文書に偏りがある場
合、それらの共通性を取り除くと、その分野の特定に有
効な情報が取り除かれることになって、正しい類似度を
判断できない。更に、文書に含まれるすべてのＮ−gram
を対象として類似度を判断するので、文字種が多い日本
語の場合、その類似度計算に膨大な時間がかかる。日本
語の場合、ＪＩＳ第一水準の漢字だけでも1024種存在
し、例えば２−gramの種類数はその２乗の1048576 種に
達する。これにカタカナ，ひらがななども加えると、２
−gramの種類は莫大な数となり、類似度計算に長時間を
要する。In Conventional Example 2, an N-gram reference sequence is generated for a plurality of search target documents, each N-gram is weighted, commonality among the plurality of reference sequences is removed, and a query document is generated. N
-N-gram of the query document and N-gram of the reference sequence of the search target document are weighted to each of the decomposed N-grams.
The gram is compared with the gram, a score representing the similarity between the question document and the search target document is obtained, and a similar document is searched according to the score. In the second conventional example, the commonality of the search target documents is obtained and used for determining the similarity. Therefore, when a change such as addition or deletion occurs in the search target document, the commonality must be obtained again. However, it is not possible to flexibly respond to changes in the search target document. Further, when the set of search target documents changes, their commonality also changes, so that even for the same question document, the score indicating the similarity changes, and the similarity evaluation judgment may be reversed. . In addition, if the search target documents are biased, removing their commonality removes information that is effective in identifying the field, and it is not possible to determine the correct similarity. Further, all N-grams included in the document
Since the similarity is determined for the Japanese language, it takes an enormous amount of time to calculate the similarity for Japanese characters. In the case of Japanese, there are 1024 kinds of JIS first-level kanji alone, and for example, the number of kinds of 2-gram reaches 1048576 kinds of the square. If you add katakana and hiragana to this, you get 2
The number of types of grams is enormous, and it takes a long time to calculate the similarity.

【０００５】従来例３は、検索対象文書から所定の部分
文字列とその出現頻度とを求め、出現頻度に基づいて部
分文字列の重要度を求め、質問文書から所定の部分文字
列とその出現頻度とを求め、その出現頻度及び上記重要
度に基づいて類似度を求め、その類似度に応じて類似文
書を検索する。この従来例３では、一度にすべての部分
文字列を抽出するので、重要度の算出処理に長時間を要
する。出現頻度で重要度を求めても、例えば数字などを
考えると、出現頻度が重要度に直結しているとは考えら
れない場合もあり、正確な検索を行えない。In prior art 3, a predetermined partial character string and its appearance frequency are obtained from a search target document, the importance of the partial character string is obtained based on the appearance frequency, and a predetermined partial character string and its occurrence frequency are obtained from a question document. A similarity is determined based on the frequency of occurrence and the importance, and a similar document is searched according to the similarity. In the third conventional example, since all partial character strings are extracted at once, it takes a long time to calculate the importance. Even if the importance is calculated based on the frequency of appearance, for example, in consideration of a number or the like, the frequency of appearance may not be considered to be directly related to the importance, and an accurate search cannot be performed.

【０００６】本発明は斯かる事情に鑑みてなされたもの
であり、検索に用いるＮ−gramの数を有効に削減でき、
類似文書の検索処理を高速に行える文書検索方法及び装
置、並びに、その検索方法を実施するためのプログラム
を記録した記録媒体を提供することを目的とする。The present invention has been made in view of such circumstances, and can effectively reduce the number of N-grams used for retrieval.
It is an object of the present invention to provide a document search method and apparatus capable of performing similar document search processing at high speed, and a recording medium on which a program for executing the search method is recorded.

【０００７】[0007]

【課題を解決するための手段】請求項１に係る文書検索
方法は、任意の文書に類似する類似文書を、複数の文書
から検索する文書検索方法において、前記複数の文書に
含まれる単語を抽出するステップと、抽出した単語のＮ
−gramを生成するステップと、前記任意の文書に含まれ
る単語を抽出するステップと、抽出した単語のＮ−gram
を生成するステップと、前記任意の文書について生成し
たＮ−gramと前記複数の文書について生成したＮ−gram
とを比較し、その比較結果に基づいて類似文書を検索す
るステップとを有することを特徴とする。According to a first aspect of the present invention, there is provided a document search method for searching for a similar document similar to an arbitrary document from a plurality of documents by extracting words included in the plurality of documents. And N of the extracted word
-Generating a gram, extracting a word included in the arbitrary document, and N-gram of the extracted word.
Generating the N-gram generated for the arbitrary document and the N-gram generated for the plurality of documents.
And searching for a similar document based on the comparison result.

【０００８】請求項２に係る文書検索方法は、請求項１
において、単語を抽出する際に、字面情報を用いて単語
を抽出することとし、抽出した単語の中から不要な単語
を除去することを特徴とする。[0008] The document search method according to claim 2 is based on claim 1.
Is characterized in that when extracting a word, the word is extracted using the face information, and unnecessary words are removed from the extracted words.

【０００９】請求項３に係る文書検索方法は、請求項１
または２において、生成したＮ−gramの中から不要なＮ
−gramを除去することを特徴とする。According to a third aspect of the present invention, there is provided a document search method.
Or in 2, the unnecessary N out of the generated N-grams
-It is characterized in that gram is removed.

【００１０】請求項４に係る文書検索方法は、請求項１
において、前記任意の文書について生成したＮ−gramと
前記複数の文書について生成したＮ−gramとに基づい
て、前記任意の文書と前記複数の文書夫々との類似度を
計算し、その類似度に応じて類似文書を検索することと
し、類似文書の候補を選定し、選定した類似文書の候補
について前記類似度を計算することを特徴とする。According to a fourth aspect of the present invention, there is provided a document search method according to the first aspect.
Calculating a similarity between the arbitrary document and each of the plurality of documents based on the N-gram generated for the arbitrary document and the N-gram generated for the plurality of documents; A similar document is searched in response thereto, a similar document candidate is selected, and the similarity is calculated for the selected similar document candidate.

【００１１】請求項５に係る文書検索方法は、請求項１
において、前記複数の文書について生成したＮ−gramの
存在情報をベクトルの形式で記憶手段に格納することと
し、前記ベクトルの各行を複数の要素を単位としてブロ
ック分けし、不要なブロックは除去して前記ベクトルを
圧縮することを特徴とする。According to a fifth aspect of the present invention, there is provided a document search method according to the first aspect.
In the above, the existence information of the N-gram generated for the plurality of documents is stored in the storage unit in the form of a vector, and each row of the vector is divided into blocks using a plurality of elements as units, and unnecessary blocks are removed. The method is characterized in that the vector is compressed.

【００１２】請求項６に係る文書検索方法は、請求項１
において、前記複数の文書について生成したＮ−gramの
存在情報をベクトルの形式で、読み出し速度が異なる複
数の記憶手段に分けて格納することを特徴とする。According to a sixth aspect of the present invention, there is provided a document search method.
Wherein the existence information of N-grams generated for the plurality of documents is stored in the form of a vector separately in a plurality of storage units having different reading speeds.

【００１３】請求項７に係る文書検索装置は、任意の文
書に類似する類似文書を、複数の文書から検索する文書
検索装置において、前記複数の文書に含まれる単語を抽
出する第１単語抽出手段と、抽出した単語のＮ−gramを
生成する第１Ｎ−gram生成手段と、前記任意の文書に含
まれる単語を抽出する第２単語抽出手段と、抽出した単
語のＮ−gramを生成する第２Ｎ−gram生成手段と、前記
任意の文書について生成したＮ−gramと前記複数の文書
について生成したＮ−gramとを比較し、その比較結果に
基づいて類似文書を検索する類似文書検索手段とを備え
ることを特徴とする。According to a seventh aspect of the present invention, in the document search apparatus for searching a plurality of documents for a similar document similar to an arbitrary document, a first word extracting unit for extracting a word included in the plurality of documents. First N-gram generating means for generating an N-gram of the extracted word, second word extracting means for extracting a word contained in the arbitrary document, and second N-gram generating an N-gram of the extracted word -Gram generation means, and similar document search means for comparing the N-gram generated for the arbitrary document with the N-gram generated for the plurality of documents, and searching for a similar document based on the comparison result. It is characterized by the following.

【００１４】請求項８に係る文書検索装置は、請求項７
において、前記第１単語抽出手段及び第２単語抽出手段
は、文書の字面情報を用いて単語を抽出する手段と、抽
出した単語の中から不要な単語を除去する手段とを有す
ることを特徴とする。[0014] The document retrieval apparatus according to claim 8 is the claim 7.
, Wherein the first word extracting means and the second word extracting means have means for extracting a word using document face information, and means for removing unnecessary words from the extracted words. I do.

【００１５】請求項９に係る文書検索装置は、請求項７
または８において、前記第１Ｎ−gram生成手段及び第２
Ｎ−gram生成手段は、生成したＮ−gramの中から不要な
Ｎ−gramを除去する手段を有することを特徴とする。According to a ninth aspect of the present invention, there is provided a document search apparatus according to the seventh aspect.
Or 8, the first N-gram generating means and the second
The N-gram generation means is characterized by having means for removing unnecessary N-grams from the generated N-grams.

【００１６】請求項10に係る文書検索装置は、請求項７
において、前記類似文書検索手段は、前記任意の文書に
ついて生成したＮ−gramと前記複数の文書について生成
したＮ−gramとに基づいて、前記複数の文書から類似文
書の候補を選定する手段と、前記任意の文書と選定した
類似文書の候補夫々との類似度を計算する手段と、その
類似度に応じて類似文書を検索する手段とを有すること
を特徴とする。According to a tenth aspect of the present invention, there is provided a document search apparatus according to the seventh aspect.
In the similar document search means, based on the N-gram generated for the arbitrary document and the N-gram generated for the plurality of documents, means for selecting a similar document candidate from the plurality of documents, It is characterized by comprising means for calculating the degree of similarity between the arbitrary document and each of the selected similar document candidates, and means for searching for a similar document according to the degree of similarity.

【００１７】請求項11に係る文書検索装置は、請求項７
において、前記複数の文書について生成したＮ−gramの
存在情報をベクトルの形式で格納する格納手段を更に備
え、前記ベクトルの各行を複数の要素を単位としてブロ
ック分けし、不要なブロックは除去して前記ベクトルを
圧縮するようにして前記格納手段に格納するようにした
ことを特徴とする。[0017] The document retrieval apparatus according to claim 11 is the same as claim 7.
, Further comprising storage means for storing the presence information of the N-gram generated for the plurality of documents in the form of a vector, dividing each row of the vector into a plurality of elements as units, and removing unnecessary blocks. The vector is compressed and stored in the storage unit.

【００１８】請求項12に係る文書検索装置は、請求項７
において、前記複数の文書について生成したＮ−gramの
存在情報の一部を格納する第１格納手段と、前記複数の
文書について生成したＮ−gramの存在情報の残りの部分
を格納する、前記第１格納手段とは読み出し速度が異な
る第２格納手段とを更に備えることを特徴とする。According to a twelfth aspect of the present invention, there is provided a document search apparatus according to the seventh aspect.
A first storage unit that stores a part of the N-gram existence information generated for the plurality of documents; and a second storage unit that stores a remaining part of the N-gram existence information generated for the plurality of documents. The storage device further includes a second storage device having a different reading speed from the first storage device.

【００１９】請求項13に係る記録媒体は、任意の文書に
類似する類似文書を、複数の文書から検索するためのプ
ログラムを記録してあるコンピュータでの読み取り可能
な記録媒体において、前記複数の文書に含まれる単語を
抽出することを前記コンピュータにさせるプログラムコ
ード手段と、抽出した単語のＮ−gramを生成することを
前記コンピュータにさせるプログラムコード手段と、前
記任意の文書に含まれる単語を抽出することを前記コン
ピュータにさせるプログラムコード手段と、抽出した単
語のＮ−gramを生成することを前記コンピュータにさせ
るプログラムコード手段と、前記任意の文書について生
成したＮ−gramと前記複数の文書について生成したＮ−
gramとを比較し、その比較結果に基づいて類似文書を検
索することを前記コンピュータにさせるプログラムコー
ド手段とを有することを特徴とする。A recording medium according to claim 13 is a computer-readable recording medium storing a program for retrieving a similar document similar to an arbitrary document from the plurality of documents. Program code means for causing the computer to extract words included in the program, program code means for causing the computer to generate N-grams of the extracted words, and extract words included in the arbitrary document Program code means for causing the computer to do the above, program code means for causing the computer to generate an N-gram of the extracted word, N-gram generated for the arbitrary document and generated for the plurality of documents. N-
and program code means for causing the computer to compare the gram with the gram and search for a similar document based on the comparison result.

【００２０】請求項14に係る記録媒体は、請求項13にお
いて、単語を抽出することを前記コンピュータにさせる
前記プログラムコード手段は、字面情報を用いて単語を
抽出することを前記コンピュータにさせるプログラムコ
ード手段と、抽出した単語の中から不要な単語を除去す
ることを前記コンピュータにさせるプログラムコード手
段とを含むことを特徴とする。A recording medium according to claim 14, wherein the program code means for causing the computer to extract a word is a program code for causing the computer to extract a word using face information. Means and program code means for causing the computer to remove unnecessary words from the extracted words.

【００２１】請求項15に係る記録媒体は、請求項13また
は14において、抽出した単語のＮ−gramを生成すること
を前記コンピュータにさせるプログラムコード手段は、
生成したＮ−gramの中から不要なＮ−gramを除去するこ
とを前記コンピュータにさせるプログラムコード手段を
含むことを特徴とする。According to a fifteenth aspect of the present invention, in the recording medium according to the thirteenth or fourteenth aspect, the program code means for causing the computer to generate an N-gram of the extracted word is:
It is characterized by including program code means for causing the computer to remove unnecessary N-grams from the generated N-grams.

【００２２】請求項16に係る記録媒体は、請求項13にお
いて、類似文書を検索することを前記コンピュータにさ
せる前記プログラムコード手段は、前記任意の文書につ
いて生成したＮ−gramと前記複数の文書について生成し
たＮ−gramとに基づいて、前記複数の文書から類似文書
の候補を選定することを前記コンピュータにさせるプロ
グラムコード手段と、前記任意の文書と選定した類似文
書の候補夫々との類似度を計算することを前記コンピュ
ータにさせるプログラムコード手段と、その類似度に応
じて類似文書を検索することを前記コンピュータにさせ
るプログラムコード手段とを含むことを特徴とする。According to a sixteenth aspect of the present invention, in the storage medium according to the thirteenth aspect, the program code means for causing the computer to search for a similar document includes the N-gram generated for the arbitrary document and the plurality of documents. Program code means for causing the computer to select a similar document candidate from the plurality of documents based on the generated N-gram, and a similarity between the arbitrary document and the selected similar document candidate. It is characterized by including program code means for causing the computer to calculate and program code means for causing the computer to search for a similar document according to the degree of similarity.

【００２３】請求項17に係る記録媒体は、請求項13にお
いて、前記複数の文書について生成したＮ−gramの存在
情報をベクトルの形式で格納手段に格納することを前記
コンピュータにさせるプログラムコード手段と、前記ベ
クトルの各行を複数の要素を単位としてブロック分け
し、不要なブロックは除去して前記ベクトルを圧縮する
ことを前記コンピュータにさせるプログラムコード手段
とを更に有することを特徴とする。The recording medium according to claim 17 is the recording medium according to claim 13, wherein the computer is configured to cause the computer to store the existence information of the N-gram generated for the plurality of documents in the form of a vector in the storage unit. And a program code unit for causing the computer to divide each row of the vector into blocks in units of a plurality of elements, remove unnecessary blocks, and compress the vector.

【００２４】請求項18に係る記録媒体は、請求項13にお
いて、前記複数の文書について生成したＮ−gramの存在
情報をベクトルの形式で、読み出し速度が異なる複数の
記憶手段に分けて格納することを前記コンピュータにさ
せるプログラムコード手段を更に有することを特徴とす
る。In the recording medium according to the present invention, the N-gram existence information generated for the plurality of documents is stored in a form of a vector separately in a plurality of storage units having different reading speeds. The program further comprises program code means for causing the computer to execute the following.

【００２５】請求項１，７，13による第１発明では、検
索対象文書から単語を抽出し、その単語のＮ−gramを生
成すると共に、質問文書から単語を抽出し、その単語の
Ｎ−gramを生成する。そして、検索対象文書におけるＮ
−gramと質問文書におけるＮ−gramとを比較して、検索
対象文書と質問文書との類似度を求め、その類似度に応
じて類似文書を検索する。According to the first aspect of the present invention, a word is extracted from a document to be searched, an N-gram of the word is generated, and a word is extracted from a question document, and the N-gram of the word is extracted. Generate Then, N in the search target document
The gram and the N-gram in the question document are compared to determine the similarity between the search target document and the question document, and a similar document is searched according to the similarity.

【００２６】このように、単語抽出を行った後に、Ｎ−
gram生成を行うので、すべてのＮ−gramを生成する方法
に比べて、Ｎ−gram生成に要する時間を短縮できる。ま
た、検索処理で重要とならない語（助詞など）が単語抽
出によって取り除かれているので、検索時に考慮を必要
とするＮ−gramの数を削減できて、文書検索処理を高速
化できる。As described above, after the word extraction, N-
Since gram generation is performed, the time required for N-gram generation can be reduced as compared with the method of generating all N-grams. Further, since words (particles and the like) that are not important in the search processing are removed by word extraction, the number of N-grams that need to be considered at the time of search can be reduced, and the document search processing can be speeded up.

【００２７】請求項２，８，14による第２発明では、字
面情報により単語を抽出する。よって、極めて短時間に
単語抽出を行える。また、抽出した単語の中の不要な単
語を除去した後、Ｎ−gram生成を行う。よって、類似度
を求めるＮ−gramの数を削減でき、検索処理の高速化を
図れる。According to the second aspect of the present invention, a word is extracted from character information. Therefore, word extraction can be performed in a very short time. After removing unnecessary words from the extracted words, N-gram generation is performed. Therefore, the number of N-grams for obtaining the similarity can be reduced, and the search processing can be speeded up.

【００２８】請求項３，９，15による第３発明では、例
えばＮ−gram辞書を参照して、生成したＮ−gramの中の
不要なＮ−gramは除去する。よって、類似度を求めるＮ
−gramの数を大幅に削減でき、検索処理の高速化を図れ
る。According to the third aspect of the present invention, unnecessary N-grams in the generated N-grams are removed by referring to, for example, an N-gram dictionary. Therefore, the similarity N
-The number of gram can be significantly reduced, and the search process can be speeded up.

【００２９】請求項４，10，16による第４発明では、簡
単な演算処理によって類似文書候補を選定し、その選定
した類似文書候補についてのみ類似度を求める。よっ
て、転置ベクトル表に登録されているすべての検索対象
文書について類似度を求める方法と比べて、少ない回数
の計算処理にて類似度を求めることができ、検索処理の
高速化を図れる。According to the fourth aspect of the present invention, similar document candidates are selected by simple arithmetic processing, and the similarity is obtained only for the selected similar document candidates. Therefore, as compared with the method of calculating the similarity for all the search target documents registered in the transposed vector table, the similarity can be calculated by a smaller number of calculation processes, and the search process can be speeded up.

【００３０】請求項５，11，17による第５発明では、検
索対象文書におけるＮ−gramの情報を示すベクトルの各
行を複数の要素毎にブロック化し、検索処理において意
味がないブロックは除外してベクトルを圧縮する。よっ
て、そのベクトルの読み出しの回数を低減でき、検索時
間の短縮化を図れる。According to the fifth aspect of the present invention, each row of the vector indicating the information of the N-gram in the search target document is divided into a plurality of elements, and blocks having no meaning in the search processing are excluded. Compress the vector. Therefore, the number of times of reading the vector can be reduced, and the search time can be reduced.

【００３１】請求項６，12，18による第６発明では、ベ
クトルの中で頻繁に利用するものは、高速読み出し可能
な主記憶に格納しておく。よって、検索処理の高速化を
図れる。In the sixth invention according to claims 6, 12 and 18, frequently used vectors are stored in a high speed readable main memory. Therefore, the search processing can be speeded up.

【００３２】[0032]

【発明の実施の形態】以下、本発明をその実施の形態を
示す図面を参照して具体的に説明する。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The present invention will be specifically described below with reference to the drawings showing the embodiments.

【００３３】図１は、本発明の文書検索方法を実施する
ためのシステム構成を示す概念図である。類似文書検索
が行えるように、検索対象文書をインデックスに登録す
る文書登録装置１と、ユーザが指定または入力した検索
キー文書（質問文書）に対する類似文書をインデックス
から検索する類似文書検索装置２とを設けている。FIG. 1 is a conceptual diagram showing a system configuration for implementing the document search method of the present invention. A document registration device 1 that registers a search target document in an index and a similar document search device 2 that searches the index for a similar document to a search key document (question document) specified or input by a user so that a similar document search can be performed. Provided.

【００３４】図２は、文書登録装置１の機能ブロック図
である。文書登録装置１は、文書データベースから検索
対象文書を選択する文書選択装置11と、選択された文書
から単語（キーワード）を抽出するキーワード抽出装置
12と、単語のＮ−gram（Ｎ文字連鎖）を生成するＮ−gr
am生成装置13と、インデックス（転置ベクトル表）を生
成するインデックス生成装置14とを有する。FIG. 2 is a functional block diagram of the document registration device 1. The document registration device 1 includes a document selection device 11 for selecting a search target document from a document database and a keyword extraction device for extracting a word (keyword) from the selected document.
12 and N-gr which generates N-gram (N character chain) of words
An am generator 13 and an index generator 14 for generating an index (transposed vector table) are provided.

【００３５】図３は、キーワード抽出装置12の機能ブロ
ック図である。キーワード抽出装置12は、字面解析によ
って単語を抽出する字面解析装置121 と、類似文書検索
においてノイズとなる文字を除去する不要文字除去装置
122 と、不要語辞書を参照して不要な文字列を除去する
不要文字列除去装置123 と、同義語辞書を参照して同義
語を加える同義語展開装置124 と、抽出された単語の出
現頻度を集計する頻度集計装置125 とを有する。FIG. 3 is a functional block diagram of the keyword extracting device 12. The keyword extraction device 12 includes a character analysis device 121 that extracts words by character analysis, and an unnecessary character removal device that removes characters that become noise in similar document search.
122, an unnecessary character string removing device 123 that removes unnecessary character strings by referring to the unnecessary word dictionary, a synonym expanding device 124 that adds synonyms by referring to the synonym dictionary, and an appearance frequency of the extracted words. And a frequency totaling device 125 for totalizing.

【００３６】図４は、Ｎ−gram生成装置13の機能ブロッ
ク図である。Ｎ−gram生成装置13は、文字列（単語）を
Ｎ−gramに分解するＮ−gram分解装置131 と、Ｎ−gram
辞書を参照して登録されているＮ−gramを抽出するＮ−
gram辞書参照装置132 と、不要なＮ−gramを除去する不
要Ｎ−gram除去装置133 とを有する。FIG. 4 is a functional block diagram of the N-gram generation device 13. The N-gram generation device 13 includes an N-gram decomposition device 131 that decomposes a character string (word) into N-grams,
N-gram extracting registered N-gram by referring to dictionary
It has a gram dictionary reference device 132 and an unnecessary N-gram removal device 133 for removing unnecessary N-grams.

【００３７】図５は、インデックス生成装置14の機能ブ
ロック図である。インデックス生成装置14は、生成され
たＮ−gramに応じて文書特徴ベクトルを生成する文書特
徴ベクトル生成装置141 と、文書特徴ベクトルを転置し
た転置ベクトルを生成し、転置ベクトル表に登録する転
置ベクトル表登録装置142 とを有する。FIG. 5 is a functional block diagram of the index generation device 14. The index generation device 14 includes a document feature vector generation device 141 that generates a document feature vector according to the generated N-gram, and a transposed vector table that generates a transposed vector obtained by transposing the document feature vector and registers the transposed vector in a transposed vector table. Registration device 142.

【００３８】次に、文書登録処理について説明する。図
６は、その処理手順を示すフローチャートである。文書
データベースから文書（キーワード，入力文，文章全体
またはその一部分などの何らかの意味がある文字列）を
文書選択装置11にて選択して、キーワード抽出装置12へ
送る（Ｓ１）。Next, the document registration process will be described. FIG. 6 is a flowchart showing the processing procedure. A document (a character string having some meaning such as a keyword, an input sentence, a whole sentence, or a part thereof) is selected from the document database by the document selecting device 11 and sent to the keyword extracting device 12 (S1).

【００３９】入力された文書の中から、字面解析装置12
1 にて、同じ文字種（ひらがな，カタカナ，漢字，英
字，記号など）が２文字以上連続する文字列を単語とし
て抽出する（Ｓ２）。類似文書検索において誤検索の原
因となる文字（漢数字，アラビア数字など）を、不要文
字除去装置122 にて文字列から除去する（Ｓ３）。除去
対象の不要文字の例としては、漢数字（一，二，三）、
単位記号（円）、日付（年，月，日）などがある。不要
文字列除去装置123 にて、文字列の不要／必要を判定し
て不要な文字列を除去する（Ｓ４）。文字列が不要語辞
書に登録されているか否かを判定し、登録されている場
合にはその文字列を不要文字列として除去し、登録され
ていない場合にはその文字列を必要な文字列と判定して
残す。除去対象の不要文字列の例としては、金額（一億
円）、日付（一九九五年十一月）などがある。また、同
義語展開装置124 にて、同義語辞書を検索し、登録され
ている同義語も抽出単語に加える（Ｓ５）。頻度集計装
置125 にて、以上のようにして抽出された文字列（単
語）の出現頻度を集計し、抽出された文字列を出現頻度
順に並べて文字列表を作成する（Ｓ６）。作成した文字
列表をＮ−gram生成装置13へ送る。From the input document, a character surface analysis device 12
In step 1, a character string in which two or more characters having the same character type (hiragana, katakana, kanji, kanji, alphabet, symbol, etc.) are extracted as a word is extracted (S2). Characters (kanji numerals, Arabic numerals, etc.) that cause an erroneous search in similar document search are removed from the character string by the unnecessary character removal device 122 (S3). Examples of unnecessary characters to be removed include Chinese numerals (1, 2, 3),
There are unit symbols (yen) and dates (year, month, day). The unnecessary character string removing device 123 determines whether or not the character string is unnecessary and removes the unnecessary character string (S4). Determines whether a character string is registered in the unnecessary word dictionary. If registered, removes the character string as an unnecessary character string. And leave it. Examples of unnecessary character strings to be removed include an amount (100 million yen) and a date (November 1995). Further, the synonym expansion device 124 searches the synonym dictionary and adds registered synonyms to the extracted words (S5). The frequency counting device 125 counts the appearance frequencies of the character strings (words) extracted as described above, and arranges the extracted character strings in order of appearance frequency to create a character string table (S6). The created character string table is sent to the N-gram generation device 13.

【００４０】Ｎ−gram分解装置131 にて、文字列表に含
まれる文字列をＮ−gramに分解する（Ｓ７）。Ｎ−gram
辞書参照装置132 にて、分解したＮ−gramの中から、Ｎ
−gram辞書に登録されているものを抽出する（Ｓ８）。
また、次のような処理により、不要Ｎ−gram除去装置13
3 にて、不要と判断されるＮ−gramを除去する（Ｓ
９）。Ｓ８で抽出したＮ−gramを出現頻度順に並べて、
Ｎ−gramリストを作成し、そのＮ−gramリストに含まれ
るＮ−gramについて、その出現頻度などに基づいて重要
度を算出する。算出した重要度が所定値を下回るＮ−gr
amは不要なＮ−gramと判定して、Ｎ−gramリストから削
除する。このようにして生成したＮ−gramを、インデッ
クス生成装置14へ送る。このようにＮ−gramの数を減ら
すので、類似文書の検索速度の向上を図れる。The N-gram decomposing device 131 decomposes the character strings included in the character string table into N-grams (S7). N-gram
The dictionary reference unit 132 selects N-grams from the decomposed N-grams.
-Extract what is registered in the gram dictionary (S8).
The unnecessary N-gram removing device 13 is executed by the following processing.
At 3, the N-grams determined to be unnecessary are removed (S
9). N-grams extracted in S8 are arranged in order of appearance frequency,
An N-gram list is created, and the importance of the N-grams included in the N-gram list is calculated based on the appearance frequency and the like. N-gr for which the calculated importance is lower than the predetermined value
am is determined to be unnecessary N-gram and is deleted from the N-gram list. The N-gram generated in this way is sent to the index generation device 14. Since the number of N-grams is reduced in this way, the search speed of similar documents can be improved.

【００４１】文書特徴ベクトル生成装置141 にて、生成
されたＮ−gramの登録番号に対応したビットを“１”に
して、文書特徴ベクトルを生成する（Ｓ10）。図７に、
文書特徴ベクトルの一例を示す。図７に示す例では、４
種の検索対象文書１〜検索対象文書４のそれぞれに４種
のＮ−gram１〜Ｎ−gram４が含まれているか否かを表し
ており、例えば、文書２では、Ｎ−gram２及びＮ−gram
４を含み、Ｎ−gram１及びＮ−gram３を含まない。The document feature vector generation device 141 sets the bit corresponding to the registration number of the generated N-gram to "1" to generate a document feature vector (S10). In FIG.
4 shows an example of a document feature vector. In the example shown in FIG.
It indicates whether or not each of the types of search target documents 1 to 4 includes four types of N-gram 1 to N-gram 4. For example, in document 2, N-gram 2 and N-gram
4 and not including N-gram1 and N-gram3.

【００４２】転置ベクトル表登録装置142 にて、文書特
徴ベクトルを転置して転置ベクトルを生成し、転置ベク
トル表にその転置ベクトルを登録する（Ｓ11）。図８
に、図７の文書特徴ベクトルに対応した転置ベクトル表
を示す。例えば、Ｎ−gram２が文書２及び文書４に含ま
れていることを表している。本発明では、Ｎ−gramが含
まれている場合にはビット“１”、含まれていない場合
にはビット“０”を割り当てるのみであり、きわめて簡
易に転置ベクトル表を作成することができる。The transposed vector table registration unit 142 transposes the document feature vector to generate a transposed vector, and registers the transposed vector in the transposed vector table (S11). FIG.
FIG. 7 shows a transposed vector table corresponding to the document feature vector in FIG. For example, it indicates that N-gram2 is included in document 2 and document 4. In the present invention, a bit “1” is assigned only when an N-gram is included, and a bit “0” is assigned otherwise, so that a transposed vector table can be created very easily.

【００４３】そして、登録終了の指示がなされた否かを
判断し（Ｓ12）、なされた場合には（Ｓ12：YES)登録処
理を終了し、なされていない場合には（Ｓ12：NO）Ｓ１
に戻って別の検索対象文書における登録処理を繰り返
す。Then, it is determined whether or not an instruction to end the registration has been made (S12). If the instruction has been made (S12: YES), the registration processing is ended. If not, (S12: NO) S1.
And the registration process for another search target document is repeated.

【００４４】次に、効率が良い転置ベクトル表作成につ
いて説明する。図９は、転置ベクトル表登録装置142 の
機能ブロック図である。転置ベクトル表登録装置142
は、転置ベクトル表を生成する転置ベクトル表生成装置
1421と、転置ベクトル表の各行の要素をブロック化する
ブロック分解装置1422と、転置ベクトル表を圧縮するベ
クトル圧縮装置1423とを有する。Next, the creation of an efficient transposed vector table will be described. FIG. 9 is a functional block diagram of the transposed vector table registration device 142. Transposed vector table registration device 142
Is a transposed vector table generator that generates a transposed vector table
1421, a block decomposer 1422 that blocks the elements of each row of the transposed vector table, and a vector compressor 1423 that compresses the transposed vector table.

【００４５】転置ベクトル表生成装置1421で生成した転
置ベクトル表に対して、ブロック分解装置1422にて、各
行の要素を長さＬ毎に区切る。ベクトル圧縮装置1423に
て、長さＬに区切られたブロック内に“０”要素のみが
含まれるブロックは保持せず、少なくとも１つの“１”
要素を含むブロックを位置情報と共に保持する。このよ
うに圧縮化しておいても、転置ベクトル表を再構成しよ
うとする場合、位置情報を基に“１”要素を含むブロッ
クを復元して、ブロックが存在しない部分をすべて
“０”要素で埋めることにより再構成が可能である。こ
のように転置ベクトル表を圧縮することにより、計算機
資源の有効利用、転置ベクトル表へのアクセス回数の低
減を実現でき、検索速度の向上を図れる。With respect to the transposed vector table generated by the transposed vector table generating device 1421, the elements of each row are divided for each length L by the block decomposition device 1422. The vector compression device 1423 does not hold a block including only the “0” element in the block divided into the length L, and stores at least one “1”.
Holds the block containing the element with the location information. In order to reconstruct the transposed vector table even with the compression as described above, the block including the “1” element is restored based on the position information, and all the parts where the block does not exist are replaced by the “0” element. Reconstruction is possible by filling. By compressing the transposed vector table in this manner, effective use of computer resources, reduction in the number of accesses to the transposed vector table can be realized, and search speed can be improved.

【００４６】図10は、類似文書検索装置２の機能ブロッ
ク図である。類似文書検索装置２は、検索キー文書（質
問文書）を入力する検索キー文書入力装置21と、入力さ
れた検索キー文書からキーワード（単語）を抽出するキ
ーワード抽出装置22と、単語のＮ−gramを生成するＮ−
gram生成装置23と、生成したＮ−gramに対応する転置ベ
クトルを取り出すインデックス参照装置24と、検索キー
文書に対する類似文書を判定する類似文書判定装置25と
を有する。FIG. 10 is a functional block diagram of the similar document search device 2. The similar document search device 2 includes a search key document input device 21 for inputting a search key document (question document), a keyword extraction device 22 for extracting a keyword (word) from the input search key document, and an N-gram of the word. N-
It includes a gram generation device 23, an index reference device 24 for extracting a transposed vector corresponding to the generated N-gram, and a similar document determination device 25 for determining a similar document to the search key document.

【００４７】なお、キーワード抽出装置22の構成は、前
述したキーワード抽出装置12の構成と同じであり、Ｎ−
gram生成装置23の構成は、前述したＮ−gram生成装置13
の構成と同じである。The structure of the keyword extracting device 22 is the same as the structure of the keyword extracting device 12 described above.
The configuration of the gram generation device 23 is the same as that of the N-gram generation device 13 described above.
The configuration is the same as

【００４８】図11は、インデックス参照装置24の機能ブ
ロック図である。インデックス参照装置24は、使用頻度
が高い転置ベクトル表を主記憶から読み出す転置ベクト
ル表先読み装置241 と、インデックスファイルから転置
ベクトル表を読み出す転置ベクトル表読み出し装置242
とを有する。FIG. 11 is a functional block diagram of the index reference device 24. The index reference device 24 includes a transposed vector table look-ahead device 241 that reads a frequently used transposed vector table from main storage, and a transposed vector table read device 242 that reads a transposed vector table from an index file.
And

【００４９】図12は、類似文書判定装置25の機能ブロッ
ク図である。類似文書判定装置25は、転置ベクトルに基
づいて類似文書候補を選定する類似文書候補選定装置25
1 と、類似文書候補について類似度を計算する類似度計
算装置252 と、類似度順に検索結果をソートする検索結
果ソート装置253 と、類似度順にソートされた検索結果
を出力する検索結果出力装置254 とを有する。FIG. 12 is a functional block diagram of the similar document determination device 25. The similar document determination device 25 selects a similar document candidate based on the transposed vector.
1, a similarity calculation device 252 for calculating similarity for similar document candidates, a search result sorting device 253 for sorting search results in order of similarity, and a search result output device 254 for outputting search results sorted in order of similarity And

【００５０】次に、類似文書検索処理について説明す
る。図13は、その処理手順を示すフローチャートであ
る。検索キー文書入力装置21にて、文書（キーワード，
入力文，文章全体またはその一部分などの何らかの意味
がある文字列）を入力して、キーワード抽出装置22へ送
る（Ｓ21）。Next, the similar document search processing will be described. FIG. 13 is a flowchart showing the processing procedure. In the search key document input device 21, a document (keyword,
An input sentence, a character string having some meaning such as the whole sentence or a part thereof is input and sent to the keyword extracting device 22 (S21).

【００５１】キーワード抽出装置22における動作は、前
述したキーワード抽出装置12における動作と同様であ
る。即ち、入力された文書の中から、同じ文字種が２文
字以上連続する文字列を単語として抽出し（Ｓ22）、不
要な文字を文字列から除去し（Ｓ23）、不要語辞書を参
照して文字列の不要／必要を判定して不要な文字列を除
去し（Ｓ24）、同義語辞書を検索して同義語も抽出単語
に加え（Ｓ25）、抽出された文字列（単語）の出現頻度
を集計して文字列表を作成し（Ｓ26）、作成した文字列
表をＮ−gram生成装置23へ送る。The operation of the keyword extracting device 22 is the same as the operation of the keyword extracting device 12 described above. That is, from the input document, a character string in which two or more characters of the same character type are consecutive is extracted as a word (S22), unnecessary characters are removed from the character string (S23), and a character string is referred to the unnecessary word dictionary. Unnecessary / necessary columns are determined to remove unnecessary character strings (S24), a synonym dictionary is searched, and synonyms are added to the extracted words (S25), and the appearance frequency of the extracted character strings (words) is determined. A character string table is created by totaling (S26), and the created character string table is sent to the N-gram generation device 23.

【００５２】Ｎ−gram生成装置23における動作は、前述
したＮ−gram生成装置13における動作と同様である。即
ち、文字列表に含まれる文字列（単語）をＮ−gramに分
解し（Ｓ27）、分解したＮ−gramの中から、Ｎ−gram辞
書に登録されているものを抽出し（Ｓ28）、不要と判断
されるＮ−gramを除去し（Ｓ29）、残ったＮ−gramをイ
ンデックス参照装置24へ送る。The operation of the N-gram generator 23 is the same as the operation of the N-gram generator 13 described above. That is, the character strings (words) included in the character string table are decomposed into N-grams (S27), and those registered in the N-gram dictionary are extracted from the decomposed N-grams (S28), which is unnecessary. The N-gram determined as is removed (S29), and the remaining N-gram is sent to the index reference device 24.

【００５３】転置ベクトル表を読み出す場合に、まず主
記憶に書き込まれているか否かを判断する（Ｓ30）。主
記憶に書き込まれている場合には（Ｓ30：YES)、そこか
ら転置ベクトル表先読み装置241 にて転置ベクトル表を
読み出す（Ｓ31）。主記憶に書き込まれていない場合に
は（Ｓ30：NO）、転置ベクトル表読み出し装置242 に
て、インデックスファイルから転置ベクトル表を読み出
す（Ｓ32）。読み出した転置ベクトル表を類似文書判定
装置25へ送る。頻繁に必要となる転置ベクトル表を、高
速にアクセスできる主記憶上に書き込むようにしたの
で、検索処理の高速化を図れる。When reading the transposed vector table, it is first determined whether or not the transposed vector table has been written to the main memory (S30). If the data has been written to the main memory (S30: YES), the transposed vector table is read out therefrom by the transposed vector table look-ahead device 241 (S31). If not written in the main memory (S30: NO), the transposed vector table reading device 242 reads the transposed vector table from the index file (S32). The read transposed vector table is sent to the similar document determination device 25. Since the transposed vector table that is frequently required is written in the main storage that can be accessed at high speed, the search processing can be speeded up.

【００５４】類似文書候補選定装置251 にて、次のよう
にして類似文書候補を選定する（Ｓ33）。検索キー文書
から取り出されたＮ−gramに対応する転置ベクトルを、
入力された転置ベクトル表から取り出す。取り出した転
置ベクトルを足し合わせる。そして、この加算結果（重
なり語数）が１以上となる検索対象文書を類似文書候補
とする。類似文書候補となった文書について、類似度計
算装置252 にて、類似度を後述する所定の計算式に従っ
て計算する（Ｓ34）。そして、検索結果ソート装置253
にて、類似度順に検索結果をソートし、類似度順にソー
トされた検索結果を検索結果出力装置254 から出力する
（Ｓ35）。The similar document candidate selecting device 251 selects similar document candidates as follows (S33). The transposed vector corresponding to the N-gram extracted from the search key document is
Extract from the input transposed vector table. Add the transposed vectors taken out. Then, a search target document for which the addition result (the number of overlapping words) is 1 or more is set as a similar document candidate. The similarity calculation device 252 calculates the similarity of the document which is a similar document candidate according to a predetermined calculation formula described later (S34). Then, the search result sorting device 253
The search results are sorted in order of similarity, and the search results sorted in order of similarity are output from the search result output device 254 (S35).

【００５５】ここで、類似度計算の具体例について説明
する。検索対象文書及び転置ベクトル表は、前述した図
８に示すものとする。また、検索キー文書には、Ｎ−gr
am１とＮ−gram３とが含まれているとする。まず、これ
らのＮ−gram１及びＮ−gram３に対応する転置ベクトル
を図８から取り出して、それらを各文書１〜文書４につ
いて足し合わせる。具体的にその加算値は、文書１：
２，文書２：０，文書３：２，文書４：１となる。その
加算値が１以上である文書１，文書３，文書４を類似文
書候補とする。文書２は加算値が０であるので類似文書
候補としない。次に、これらの類似文書候補（文書１，
文書３，文書４）それぞれについて、以下の式（１）に
従って類似度Ｄを計算する。Ｄ＝γ／（α・β）^1/2 但し、 α：検索対象文書（文書１，文書３，文書４）のＮ−gr
amの総数 β：検索キー文書（質問文書）のＮ−gramの総数 γ：対応する転置ベクトルの加算値（上記例では、文書
１：２，文書文書３：２，文書４：１）Here, a specific example of the similarity calculation will be described. The search target document and the transposed vector table are as shown in FIG. Also, the search key document includes N-gr
It is assumed that am1 and N-gram3 are included. First, the transposed vectors corresponding to these N-gram1 and N-gram3 are extracted from FIG. 8, and they are added for each of documents 1 to 4. Specifically, the added value is calculated in Document 1:
2, document 2: 0, document 3: 2, document 4: 1. Document 1, document 3, and document 4 whose addition value is 1 or more are set as similar document candidates. Document 2 is not considered as a similar document candidate because the added value is 0. Next, these similar document candidates (document 1,
The similarity D is calculated for each of the documents 3 and 4) according to the following equation (1). D = γ / (α · β) ^1/2 where α: N-gr of the search target document (document 1, document 3, document 4)
total number of am β: total number of N-grams of the search key document (question document) γ: addition value of the corresponding transposed vectors (in the above example, document 1: 2, document 3: 2, document 4: 1)

【００５６】なお、上述の式（１）に示す類似度Ｄの計
算式は一例であり、他の式に従って類似度Ｄを求めるよ
うにしても良いことは言うまでもない。The equation for calculating the similarity D shown in the above equation (1) is merely an example, and it goes without saying that the similarity D may be obtained according to another equation.

【００５７】本発明では、このように、転置ベクトル表
に登録されているすべての検索対象文書について類似度
を計算するのではなく、最初に類似文書候補を選定し、
それらの類似文書候補についてのみ類似度を計算するよ
うにしているので、少ない回数の計算で類似度を求める
ことができ、検索処理の高速化を図れる。In the present invention, instead of calculating the similarity for all the search target documents registered in the transposed vector table, similar document candidates are selected first,
Since the similarity is calculated only for those similar document candidates, the similarity can be obtained with a small number of calculations, and the search processing can be speeded up.

【００５８】図14は、本発明の記録媒体の実施の形態の
構成を示すブロック図である。ここに例示するプログラ
ムは、図６及び図13に示すＳ１〜Ｓ12及びＳ21〜Ｓ35を
含んでおり、以下に説明する記録媒体に記録されてい
る。FIG. 14 is a block diagram showing the configuration of an embodiment of the recording medium of the present invention. The program exemplified here includes S1 to S12 and S21 to S35 shown in FIGS. 6 and 13 and is recorded on a recording medium described below.

【００５９】図14において、コンピュータ30とオンライ
ン接続する記録媒体31は、コンピュータ30の設置場所か
ら隔たって設置される例えばＷＷＷ(World Wide Web)の
サーバコンピュータを用いてなり、記録媒体31には前述
の如きプログラム31a が記録されている。記録媒体31か
ら読み出されたプログラム31a がコンピュータ30を制御
することにより、コンピュータ30が文書登録処理及び類
似文書検索処理を実行する。In FIG. 14, a recording medium 31 that is connected online to the computer 30 is, for example, a WWW (World Wide Web) server computer that is installed separately from a place where the computer 30 is installed. Is recorded. The computer 30 executes a document registration process and a similar document search process by controlling the computer 30 by the program 31a read from the recording medium 31.

【００６０】コンピュータ30の内部に設けられた記録媒
体32は、内蔵設置される例えばハードディスクドライブ
またはＲＯＭなどを用いてなり、記録媒体32には前述の
如きプログラム32a が記録されている。記録媒体32から
読み出されたプログラム32aがコンピュータ30を制御す
ることにより、コンピュータ30が文書登録処理及び類似
文書検索処理を実行する。The recording medium 32 provided inside the computer 30 uses, for example, a hard disk drive or a ROM installed therein, and the recording medium 32 stores the program 32a as described above. By controlling the computer 30 by the program 32a read from the recording medium 32, the computer 30 executes a document registration process and a similar document search process.

【００６１】コンピュータ30に設けられたディスクドラ
イブ30a に装填して使用される記録媒体33は、運搬可能
な例えば光磁気ディスク，ＣＤ−ＲＯＭまたはフレキシ
ブルディスクなどを用いてなり、記録媒体33には前述の
如きプログラム33a が記録されている。記録媒体33から
読み出されたプログラム33a がコンピュータ30を制御す
ることにより、コンピュータ30が文書登録処理及び類似
文書検索処理を実行する。The recording medium 33 used by being loaded into the disk drive 30a provided in the computer 30 is a transportable medium such as a magneto-optical disk, a CD-ROM or a flexible disk. The program 33a is recorded as follows. The computer 30 executes the document registration process and the similar document search process by controlling the computer 30 by the program 33a read from the recording medium 33.

【００６２】[0062]

【発明の効果】以上のように本発明では、検索対象文書
及び質問文書からその中に含まれる単語を特徴量として
抽出し、その抽出した単語からＮ−gramを生成し、検索
対象文書におけるＮ−gramと質問文書におけるＮ−gram
とを比較して類似度を求めて、類似文書を検索するよう
にしたので、検索処理に必要なＮ−gramを短時間で得る
ことができ、検索処理の高速化を図れる。As described above, according to the present invention, words included in a search target document and a question document are extracted as feature amounts, an N-gram is generated from the extracted words, and the N-gram in the search target document is extracted. -Gram and N-gram in question document
Is compared to obtain a similar document, and a similar document is searched. Therefore, an N-gram required for the search processing can be obtained in a short time, and the search processing can be speeded up.

[Brief description of the drawings]

【図１】本発明の文書検索方法を実施するためのシステ
ム構成を示す概念図である。FIG. 1 is a conceptual diagram showing a system configuration for implementing a document search method of the present invention.

【図２】文書登録装置の機能ブロック図である。FIG. 2 is a functional block diagram of the document registration device.

【図３】キーワード抽出装置の機能ブロック図である。FIG. 3 is a functional block diagram of the keyword extracting device.

【図４】Ｎ−gram生成装置の機能ブロック図である。FIG. 4 is a functional block diagram of the N-gram generation device.

【図５】インデックス生成装置の機能ブロック図であ
る。FIG. 5 is a functional block diagram of the index generation device.

【図６】文書登録処理の手順を示すフローチャートであ
る。FIG. 6 is a flowchart illustrating a procedure of a document registration process.

【図７】文書特徴ベクトルの一例を示す図である。FIG. 7 is a diagram illustrating an example of a document feature vector.

【図８】図７に対応する転置ベクトル表を示す図であ
る。FIG. 8 is a diagram showing a transposed vector table corresponding to FIG. 7;

【図９】転置ベクトル表登録装置の機能ブロック図であ
る。FIG. 9 is a functional block diagram of a transposed vector table registration device.

【図１０】類似文書検索装置の機能ブロック図である。FIG. 10 is a functional block diagram of a similar document search device.

【図１１】インデックス参照装置の機能ブロック図であ
る。FIG. 11 is a functional block diagram of an index reference device.

【図１２】類似文書判定装置の機能ブロック図である。FIG. 12 is a functional block diagram of the similar document determination device.

【図１３】類似文書検索処理の手順を示すフローチャー
トである。FIG. 13 is a flowchart illustrating a procedure of a similar document search process.

【図１４】本発明の記録媒体の実施の形態の構成を示す
ブロック図である。FIG. 14 is a block diagram illustrating a configuration of a recording medium according to an embodiment of the present invention.

[Explanation of symbols]

１文書登録装置２類似文書検索装置 12，22 キーワード抽出装置 13，23 Ｎ−gram生成装置 14 インデックス生成装置 24 インデックス参照装置 25 類似文書判定装置 30 コンピュータ 31，32，33 記録媒体 121 字面解析装置 122 不要文字除去装置 123 不要文字列除去装置 131 Ｎ−gram分解装置 132 Ｎ−gram辞書参照装置 133 不要Ｎ−gram除去装置 141 文書特徴ベクトル生成装置 142 転置ベクトル表登録装置 241 転置ベクトル表先読み装置 242 転置ベクトル表読み出し装置 251 類似文書候補選定装置 252 類似度計算装置 1421 転置ベクトル表生成装置 1422 ブロック分解装置 1423 ベクトル圧縮装置 DESCRIPTION OF SYMBOLS 1 Document registration device 2 Similar document search device 12, 22 Keyword extraction device 13, 23 N-gram generation device 14 Index generation device 24 Index reference device 25 Similar document determination device 30 Computer 31, 32, 33 Recording medium 121 Character analysis device 122 Unnecessary character removing device 123 Unnecessary character string removing device 131 N-gram decomposition device 132 N-gram dictionary reference device 133 Unnecessary N-gram removing device 141 Document feature vector generating device 142 Transposed vector table registration device 241 Transposed vector table look-ahead device 242 Transposed Vector table reading device 251 Similar document candidate selection device 252 Similarity calculation device 1421 Transposed vector table generation device 1422 Block decomposition device 1423 Vector compression device

Claims

[Claims]

1. A document retrieval method for retrieving a similar document similar to an arbitrary document from a plurality of documents, a step of extracting words included in the plurality of documents, and generating an N-gram of the extracted words. Extracting a word included in the arbitrary document;
Generating an N-gram for the arbitrary document and an N-gram generated for the plurality of documents.
Comparing the gram with the gram and searching for a similar document based on the comparison result.

2. The document search method according to claim 1, wherein when extracting the word, the word is extracted using the face information, and unnecessary words are removed from the extracted words.

3. Unnecessary N-gr from the generated N-grams
3. The document search method according to claim 1, wherein am is removed.

4. An N-gr generated for the arbitrary document
The similarity between the arbitrary document and each of the plurality of documents is calculated based on am and the N-gram generated for the plurality of documents, and a similar document is searched for according to the similarity. 2. The document search method according to claim 1, wherein a document candidate is selected, and the similarity is calculated for the selected similar document candidate.

5. The N-gr generated for the plurality of documents
2. The document according to claim 1, wherein the presence information of am is stored in a storage unit in the form of a vector, and each row of the vector is divided into blocks in units of a plurality of elements, and unnecessary blocks are removed to compress the vector. retrieval method.

6. The N-gr generated for the plurality of documents
2. The document search method according to claim 1, wherein the presence information of am is separately stored in a plurality of storage units having different read speeds in a vector format.

7. A document search device for searching a plurality of documents for a similar document similar to an arbitrary document, a first word extracting means for extracting words included in the plurality of documents, and N- first N-gram generating means for generating a gram, second word extracting means for extracting a word included in the arbitrary document, and second N-gram for generating an N-gram of the extracted word
-Gram generating means and N generated for the arbitrary document
And a similar document search means for comparing the N-gram generated for the plurality of documents with the N-gram and searching for a similar document based on the comparison result.

8. The first word extracting means and the second word extracting means have means for extracting a word using document face information, and means for removing an unnecessary word from the extracted words. Item 7. The document search device according to Item 7.

9. The first N-gram generation means and the second N-gram generation means.
The gram generation means selects unnecessary N-grams from the generated N-grams.
9. The document search device according to claim 7, further comprising means for removing a gram.

10. The similar document search means selects a similar document candidate from the plurality of documents based on the N-gram generated for the arbitrary document and the N-gram generated for the plurality of documents. 8. The document search apparatus according to claim 7, further comprising: means for calculating a similarity between the arbitrary document and each of the selected similar document candidates; and means for searching for a similar document according to the similarity.

11. An N-file generated for the plurality of documents.
storage means for storing presence information of the gram in the form of a vector, wherein each row of the vector is divided into blocks in units of a plurality of elements, unnecessary blocks are removed and the vector is compressed, and the storage means is provided. 8. The document search device according to claim 7, wherein the document search device is stored.

12. An N-file generated for the plurality of documents.
a first storage unit that stores a part of the existence information of the gram, and a second storage unit that stores the remaining part of the existence information of the N-gram generated for the plurality of documents, the second storage unit having a different reading speed from the first storage unit. The document search device according to claim 7, further comprising a storage unit.

13. A computer-readable recording medium that stores a program for searching a plurality of documents for a similar document similar to an arbitrary document, and extracts words included in the plurality of documents. Program code means for causing the computer to do the above, program code means causing the computer to generate an N-gram of the extracted word, and a program causing the computer to extract a word contained in the arbitrary document Code means, program code means for causing the computer to generate an N-gram of the extracted word, and comparing the N-gram generated for the arbitrary document with the N-gram generated for the plurality of documents. A program that causes the computer to search for a similar document based on the comparison result. Recording medium and having a code means.

14. The program code means for causing the computer to extract a word includes program code means for causing the computer to extract a word using face information, and an unnecessary word from the extracted words. 14. A recording medium according to claim 13, further comprising program code means for causing the computer to remove the program.

15. A program code means for causing the computer to generate an N-gram of an extracted word, wherein the program code causes the computer to remove an unnecessary N-gram from the generated N-gram. The recording medium according to claim 13 or 14, comprising means.

16. The program code means, which causes the computer to search for a similar document, based on the N-gram generated for the arbitrary document and the N-gram generated for the plurality of documents. Program code means for causing the computer to select a similar document candidate from among the documents, and program code means for causing the computer to calculate the degree of similarity between the arbitrary document and each of the selected similar document candidates. 14. Program code means for causing the computer to search for a similar document according to the degree of similarity.
The recording medium according to the above.

17. An N-file generated for the plurality of documents.
program code means for causing the computer to store the presence information of the gram in the storage means in the form of a vector; and dividing each row of the vector into a plurality of elements as a unit, removing unnecessary blocks to remove the vector. 14. The recording medium according to claim 13, further comprising program code means for causing said computer to perform compression.

18. An N-file generated for the plurality of documents.
14. The recording medium according to claim 13, further comprising program code means for causing the computer to store the gram presence information in a plurality of storage means having different read speeds in a vector format.