JPH11143902A

JPH11143902A - Similar document search method using n-gram

Info

Publication number: JPH11143902A
Application number: JP9309078A
Authority: JP
Inventors: Tadataka Matsubayashi; 忠孝松林; Katsumi Tada; 勝己多田; Takuya Okamoto; 卓哉岡本; Natsuko Sugaya; 菅谷　　奈津子; Yasushi Kawashita; 靖司川下
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-11-11
Filing date: 1997-11-11
Publication date: 1999-05-28

Abstract

PROBLEM TO BE SOLVED: To provide a similar document retrieval system that can search at high speed and with high accuracy even for languages with many character types, such as Japanese. SOLUTION: The system comprises a step of storing, in an appearance frequency file 104, the appearance frequency of each feature character string in a text 103 contained in a text database; a step of extracting feature character strings from a text designated by a user; and a step of counting the appearance frequency of those feature character strings in the designated text. The similarity to the designated text is then calculated from the appearance frequencies stored in the file 104 and those counted in the designated text, and similar documents are retrieved using the calculated similarity.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method for searching a document database for documents similar to a document specified by a user.

[0002]

[Description of the Related Art] In recent years, with the spread of personal computers, the Internet, and the like, the number of electronic documents has grown explosively and is expected to keep growing at an accelerating rate. Under these circumstances, there is increasing demand for fast and efficient retrieval of documents containing the information a user wants.

[0003] Full-text search is one technique that meets this demand. In full-text search, the documents to be searched are registered in a computer system as text to form a database, and documents containing a search character string specified by the user (hereinafter, a search term) are retrieved from it. Because full-text search operates on the character strings in the documents themselves, it has the advantage that, in principle, no matches are missed, unlike conventional search systems that assign keywords in advance and search using those keywords as clues.

[0004] However, to accurately retrieve documents containing the desired information, the user must enter a complex search condition expression that precisely expresses the search intent.

[0005] To remove this burden, attention has turned to similar document search technology, in which the user designates a document whose content matches what the user wants (hereinafter, a seed document) and documents similar to it are retrieved.

[0006] As one similar document search method, Japanese Patent Application Laid-Open No. 8-335222 discloses a technique that extracts the words contained in a seed document by morphological analysis or the like and uses them to search for similar documents (hereinafter, Prior Art 1).

[0007] In addition, Japanese Patent Application Laid-Open No. 6-110948 discloses a technique that extracts strings of n consecutive characters (hereinafter, n-grams) from a seed document and uses them to search for similar documents (hereinafter, Prior Art 2).

[0008] These two prior arts are outlined below.

[0009] In Prior Art 1, the words contained in the seed document are extracted by morphological analysis, and documents containing these words are retrieved as similar documents. For example, when the document 「この装置は地下水脈の観測にも使える。」 ("This device can also be used to observe groundwater veins.") is the seed document, morphological analysis with reference to a word dictionary extracts the words 「装置」 (device), 「地下」 (underground), 「水脈」 (water vein), 「観測」 (observation), and 「使える」 (usable). As a result, the document 「地下水脈を観測することによる地震の発生を予測する。」 ("Predict the occurrence of earthquakes by observing groundwater veins.") can be retrieved as a similar document. However, because Prior Art 1 uses a word dictionary to extract words, the following two problems arise.

[0010] First, a word that is not in the word dictionary is not extracted from the seed document as a search word, so documents containing it cannot be retrieved. Consequently, when the information the user wants is expressed by a new word that is missing from the dictionary, documents containing that information cannot be found.

[0011] Second, even when the word expressing the desired information is in the dictionary, matches can be missed depending on how the words are extracted. For example, from the seed document 「この装置は地下水脈の観測にも使える。」 above, the words 「装置」, 「地下」, 「水脈」, 「観測」, and 「使える」 are extracted. Because the word 「地下水」 (groundwater) is not extracted, however, the document 「地下水の大量汲み上げで地盤沈下地域が拡大した。」 ("Large-scale pumping of groundwater has expanded the area of land subsidence.") cannot be retrieved as a similar document.

[0012] These are the problems of Prior Art 1.

[0013] Prior Art 2, a similar document search method using n-grams, was proposed to solve these problems.
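As a rough illustration of why character n-grams avoid the dictionary problem, the following sketch extracts 2-grams from the seed sentence and from the groundwater document quoted above and collects the ones they share. The helper name `ngrams` is hypothetical and is not taken from the cited publications.

```python
def ngrams(text, n=2):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

seed = "この装置は地下水脈の観測にも使える。"
doc = "地下水の大量汲み上げで地盤沈下地域が拡大した。"

# 2-grams present in both texts; no word dictionary is consulted.
shared = set(ngrams(seed)) & set(ngrams(doc))
print(sorted(shared))
```

The two texts share 2-grams taken from 「地下水」 even though a dictionary-based extractor never produced that word from the seed.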

[0014] The specific processing of Prior Art 2, and its problems, are described below using a database in which Document 1 「新開発の心電計による発作時の心電図」 ("Electrocardiogram during a seizure, taken with a newly developed electrocardiograph"), Document 2 「新しいソフトウェアの開発作業」 ("Development work on new software"), and Document 3 「ソフト開発を支援するソフトウェア」 ("Software that supports software development") are registered, with n set to 2 and Document 2 designated by the user as the seed document.

[0015] First, 2-grams are extracted from Documents 1 to 3 in the database.

[0016]

[Table 1]

[0017] Table 1 shows the 2-grams extracted from Document 1 with duplicates removed (hereinafter, deduplicated 2-grams). Next, a weight is calculated for each of these 2-grams by dividing its appearance frequency by the total number of 2-gram occurrences in the document, duplicates included. The weight is the proportion of the document's 2-grams that a given 2-gram accounts for; the larger the value, the more frequently that 2-gram appears in the document. The same processing is applied to Documents 2 and 3 to obtain their weights. Tables 2 and 3 show the results.
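The weight calculation described above can be sketched as follows. Tables 1 to 3 are not reproduced in this text, so the values are computed directly from the document strings; the function name `ngram_weights` is an illustrative assumption.

```python
from collections import Counter

def ngram_weights(text, n=2):
    """Map each distinct n-gram to (its count) / (total n-gram occurrences)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)  # duplicates are included in the total
    return {gram: count / total for gram, count in Counter(grams).items()}

doc1 = "新開発の心電計による発作時の心電図"
doc2 = "新しいソフトウェアの開発作業"

w1 = ngram_weights(doc1)
w2 = ngram_weights(doc2)
print(w1["開発"], w1["心電"], w2["開発"])
```

Document 1 yields 16 2-gram occurrences and Document 2 yields 13, so 「開発」 gets 1/16 and 1/13 respectively, which is consistent with the rounded values 0.063 and 0.077 quoted in paragraph [0023].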

[0018]

[Table 2]

[0019]

[Table 3]

[0020] Next, the commonality among the documents in the database is removed. First, a commonality weight is calculated for each deduplicated 2-gram in the database by dividing the sum of that 2-gram's weights over all documents in the database by the total number of documents. The commonality weight is the 2-gram's average proportion across the whole database; the larger the value, the more commonly the 2-gram appears in every document in the database.

[0021]

[Table 4]

[0022] Table 4 shows the commonality weights among Documents 1, 2, and 3.

[0023] For example, the commonality weight of the 2-gram 「新開」 is (0.063 + 0.0 + 0.0) / 3 = 0.021. The 2-gram 「新開」 does not appear in Document 2 or Document 3, so its weight in each is 0.0. The commonality weight of the 2-gram 「開発」 is (0.063 + 0.077 + 0.067) / 3 = 0.069.
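The arithmetic in this paragraph can be checked with a short sketch. It uses the rounded per-document weights quoted in the text; `commonality_weight` is a hypothetical helper name.

```python
def commonality_weight(weights_per_document):
    """Average one n-gram's weight over all documents in the database."""
    return sum(weights_per_document) / len(weights_per_document)

# Weights of each 2-gram in Documents 1, 2, and 3, as quoted in the text.
shinkai = commonality_weight([0.063, 0.0, 0.0])        # 「新開」
kaihatsu = commonality_weight([0.063, 0.077, 0.067])   # 「開発」
print(round(shinkai, 3), round(kaihatsu, 3))
```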

[0024] As described above, the commonality weight is the average of each n-gram's weights.

[0025] Subtracting this commonality weight from each n-gram's weight removes the commonality among the documents in the database (Prior Art 2 calls the resulting value the normalized weight). The normalized weight expresses how unevenly each n-gram is distributed in the database; the larger the value, the more the n-gram is concentrated in a particular document.

[0026] If an n-gram appears in every document at the same rate, its weight and its commonality weight are equal, so its normalized weight is 0. In other words, for n-grams that appear at about the same rate in every document, the normalized weight approaches 0.
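A minimal sketch of the normalization, assuming one n-gram's weights are given as a list with one entry per document. The helper name and the uniform example values are illustrative, not taken from the publication.

```python
def normalized_weights(weights_per_document):
    """Subtract the commonality weight (the mean) from each document's weight."""
    common = sum(weights_per_document) / len(weights_per_document)
    return [weight - common for weight in weights_per_document]

# An n-gram that appears at the same rate everywhere carries no signal.
uniform = normalized_weights([0.05, 0.05, 0.05])

# 「新開」 appears only in Document 1, so it stands out there.
skewed = normalized_weights([0.063, 0.0, 0.0])
print(uniform, skewed)
```

Note that the normalized weight is negative in documents where the n-gram is rarer than its database-wide average.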

[0027] Tables 5, 6, and 7 show the normalized weights for Documents 1, 2, and 3.

[0028]

[Table 5]

[0029]

[Table 6]

[0030]

[Table 7]

[0031] Using the normalized weights obtained above, the similarity between the seed document designated by the user and every document in the database is calculated and expressed as a similarity score. With i denoting the document number, the similarity S(i) of document i is given by equation (1) below.

[0032]

[Equation 1]

[0033] Here, U(j) is the normalized weight of the j-th n-gram in the seed document, and R(j) is the normalized weight of the j-th n-gram in the database document. n is the total number of documents in the database. Calculating the similarity of every document in the database with this equation gives the following.

[0034] S(1) = 0.018, S(2) = 1.0, S(3) = 0.119. Finally, the documents are output in descending order of similarity. In this example, they are output in the order Document 2, Document 3, Document 1.

[0035] That is the specific processing of Prior Art 2. Because Prior Art 2 can retrieve documents similar to a seed document without morphological analysis based on a word dictionary, it solves the two problems of Prior Art 1.

[0036] However, Prior Art 2 has the following two problems.

[0037] The first problem is that the number of n-grams extracted from the seed document becomes enormous, so the search takes a very long time. For example, extracting every 2-gram from a seed document of 1,000 characters yields 999 2-grams. With the method of Prior Art 2, which uses all extracted 2-grams in the similarity search, even if the search for one 2-gram takes only 0.1 seconds, the 999 2-grams take 99.9 seconds, or about 1 minute 40 seconds.
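The estimate above follows from the fact that a text of L characters contains L - n + 1 overlapping n-grams. A small sketch, with hypothetical function names and the 0.1-second figure taken from the text:

```python
def ngram_count(text_length, n=2):
    """Number of overlapping n-grams in a text of the given length."""
    return max(text_length - n + 1, 0)

def estimated_search_seconds(text_length, seconds_per_ngram, n=2):
    """Total time if every extracted n-gram is searched one after another."""
    return ngram_count(text_length, n) * seconds_per_ngram

count = ngram_count(1000)
seconds = estimated_search_seconds(1000, 0.1)
print(count, seconds)
```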

[0038] The second problem is that, because all n-grams are used to search for similar documents, the search results contain noise.

[0039] This problem is described concretely below, taking as an example the case where Document 4 「新しいソフトクリーム券の配布作業」 ("Distribution work for new soft-serve ice cream tickets") is added to the database in which Documents 1 to 3 are registered.

[0040] In this example, Document 2 is designated by the user as the seed document.

[0041] Table 8 shows the 2-grams extracted from Document 4 and their weights.

[0042]

[Table 8]

[0043] The commonality weights are calculated from the weights of Document 4 and the weights of Documents 1 to 3 shown in Tables 1 to 3.

[0044]

[Table 9]

[0045] Table 9 shows the commonality weights among Documents 1 to 4. For example, the commonality weight of the 2-gram 「開発」 is (0.063 + 0.077 + 0.067 + 0.000) / 4 = 0.052. Next, this commonality weight is subtracted from the weight of each document's deduplicated 2-grams to obtain normalized weights with the commonality among the documents in the database removed.

[0046]

[Table 10]

[0047]

[Table 11]

[0048]

[Table 12]

[0049]

[Table 13]

[0050] Tables 10 to 13 show the normalized weights of the 2-grams in Documents 1 to 4. Using these, the similarity of each document to the seed document, Document 2, calculated with equation (1) is S(1) = 0.036, S(2) = 1.0, S(3) = 0.179, S(4) = 0.190.

[0051] Here, although Document 3 is about software just as Document 2 is, the unrelated Document 4 is judged to be more similar to Document 2. This is because the 2-grams 「ソフ」 and 「フト」 extracted from 「ソフトウェア」 (software) in Document 2 are also extracted from 「ソフトクリーム」 (soft-serve ice cream) in Document 4. Unlike a word, an n-gram is not a semantically coherent unit, so the same n-gram does not necessarily express the same meaning in different documents. As in this example, a completely unrelated document can therefore be found as a document with high similarity.
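The overlap that causes this noise can be reproduced in a few lines. The sketch compares only the two katakana words named in the text; `ngram_set` is a hypothetical helper.

```python
def ngram_set(text, n=2):
    """Return the set of distinct n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

software = ngram_set("ソフトウェア")      # from Document 2
soft_serve = ngram_set("ソフトクリーム")  # from Document 4

# Fragments shared by two words with unrelated meanings.
overlap = software & soft_serve
print(sorted(overlap))
```

Treating each katakana run as a single unit, as the invention does later in the text, would leave these two words with nothing in common.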

[0052]

[Problems to Be Solved by the Invention] In view of these problems of the prior art, the present invention aims to achieve the following.

[0053] (1) To provide a similar document search method with high search accuracy.

[0054] (2) To provide a method that enables fast similar document search even for languages with many character types, such as Japanese.

[0055]

[Means for Solving the Problems] To solve the above problems, the document search method according to the present invention retrieves documents similar to a seed document through the following steps.

[0056] The document search method according to the present invention has the following steps for document registration:

(Step 1) A document reading step of reading a document to be registered.

(Step 2) A same-character-type string extraction step of dividing the character string of the document read in Step 1 at character type boundaries, such as those between kanji and katakana, and extracting strings composed of a single character type (hereinafter, same-character-type strings).

(Step 3) A registration feature character string extraction step of determining the character type of each same-character-type string extracted in Step 2 and, if it is kanji, extracting from it strings of a predetermined length as strings that may be independent words (hereinafter, feature character strings); if it is katakana or alphabetic, extracting the same-character-type string itself as a feature character string; and, for any other character type, extracting no feature character string.

(Step 4) An appearance frequency counting step of counting the appearance frequency, within the document to be registered, of each feature character string extracted in Step 3.

(Step 5) An appearance frequency file creation and registration step of storing the appearance frequencies counted in Step 4 in the corresponding appearance frequency file.

It has the following steps for retrieving documents similar to a seed document:

(Step 6) A seed document reading step of reading the seed document.

(Step 7) A same-character-type string extraction step of dividing the character string of the seed document read in Step 6 at character type boundaries and extracting same-character-type strings.

(Step 8) A search feature character string extraction step of determining the character type of each same-character-type string extracted in Step 7 and, if it is kanji, extracting from it strings of a predetermined length as feature character strings; if it is katakana or alphabetic, extracting the same-character-type string itself as a feature character string; and, for any other character type, extracting no feature character string.

(Step 9) An appearance frequency counting step of counting the appearance frequency, within the seed document, of each feature character string extracted in Step 8.

(Step 10) An appearance frequency acquisition step of reading the appearance frequency file for every feature character string counted in Step 9 and obtaining its appearance frequency in each document in the database.

(Step 11) A similarity calculation step of calculating, for the feature character strings handled in Step 10, the similarity between the seed document and each document in the database according to a predetermined formula, using the appearance frequencies in the seed document counted in Step 9 and the appearance frequencies in each database document obtained in Step 10.

(Step 12) A search result display step of displaying a list of documents in descending order of the similarity calculated in Step 11.

[0057] The principle of the present invention using this document search method is described below.

[0058] When registering a document, Steps 1 to 5 are executed. First, in Step 1, the document to be registered is read. Next, in Step 2, the character string of the document read in Step 1 is divided at character type boundaries, such as those between kanji and katakana, and strings consisting of a single character type are extracted. For example, from Document 4 above, 「新しいソフトクリーム券の配布作業」, six same-character-type strings are extracted: 「新」, 「しい」, 「ソフトクリーム」, 「券」, 「の」, and 「配布作業」.
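Step 2 can be sketched as a grouping of consecutive characters by character type. The publication does not say how character types are detected, so the Unicode block ranges below are an assumption, as are the names `char_type` and `split_by_char_type`.

```python
from itertools import groupby

def char_type(ch):
    """Classify one character by Unicode block (an assumed classification)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"      # CJK Unified Ideographs
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"   # this block also contains the long-vowel mark ー
    if ch.isascii() and ch.isalpha():
        return "alpha"
    return "other"

def split_by_char_type(text):
    """Split text wherever the character type changes."""
    return ["".join(run) for _, run in groupby(text, key=char_type)]

runs = split_by_char_type("新しいソフトクリーム券の配布作業")
print(runs)
```

Applied to Document 4, this produces the same six strings listed in the paragraph above.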

[0059] Next, in Step 3, the character type of each same-character-type string extracted in Step 2 is determined. If it is kanji, strings of a predetermined length are extracted from it as feature character strings; if it is katakana or alphabetic, the same-character-type string itself is extracted as a feature character string; for any other character type, no feature character string is extracted. For example, if it is predetermined that 2-grams are extracted from kanji strings, the feature character strings 「ソフトクリーム」, 「配布」, 「布作」, and 「作業」 are extracted from the same-character-type strings of Step 2 above.
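The rule in Step 3 can be sketched as follows, taking the same-character-type strings of Step 2 as input. The Unicode-range check and the name `extract_features` are assumptions made for illustration; the publication specifies only the per-type behaviour.

```python
def run_type(run):
    """Character type of a same-character-type string, judged by its first character."""
    code = ord(run[0])
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if run[0].isascii() and run[0].isalpha():
        return "alpha"
    return "other"  # hiragana, digits, punctuation, and so on

def extract_features(runs, n=2):
    """Kanji runs yield n-grams; katakana and alphabetic runs are kept whole."""
    features = []
    for run in runs:
        if not run:
            continue
        kind = run_type(run)
        if kind == "kanji":
            features.extend(run[i:i + n] for i in range(len(run) - n + 1))
        elif kind in ("katakana", "alpha"):
            features.append(run)
        # any other character type contributes no feature character string
    return features

features = extract_features(["新", "しい", "ソフトクリーム", "券", "の", "配布作業"])
print(features)
```

Single-character kanji runs such as 「新」 and 「券」 are too short to yield a 2-gram, which matches the four feature character strings listed in the text.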

[0060] Next, in Step 4, the appearance frequency of each feature character string extracted in Step 3 within the document to be registered is counted. For example, for Document 4 above, 「新しいソフトクリーム券の配布作業」, this yields the information that the feature character string 「ソフトクリーム」 appears once and 「作業」 appears once.

[0061] Next, in Step 5, the appearance frequencies of the feature character strings counted in Step 4 are stored in the corresponding appearance frequency file. FIG. 2 shows an example of the appearance frequency file: the one obtained when Documents 1 to 4, shown in Tables 1, 2, 3, and 8, are registered.
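Steps 4 and 5 can be sketched with an in-memory structure. FIG. 2 is not reproduced in this text, so the layout below (feature character string mapped to per-document counts) is only an assumed stand-in for the appearance frequency file, and the feature lists were derived here by applying Step 3 with 2-grams to the four example documents.

```python
from collections import Counter, defaultdict

def build_frequency_index(features_by_document):
    """Map each feature string to {document number: appearance frequency}."""
    index = defaultdict(dict)
    for doc_id, features in features_by_document.items():
        for feature, count in Counter(features).items():
            index[feature][doc_id] = count
    return dict(index)

# Feature character strings per document (assumed derivation, see lead-in).
features_by_document = {
    1: ["新開", "開発", "心電", "電計", "発作", "作時", "心電", "電図"],
    2: ["ソフトウェア", "開発", "発作", "作業"],
    3: ["ソフト", "開発", "支援", "ソフトウェア"],
    4: ["ソフトクリーム", "配布", "布作", "作業"],
}

index = build_frequency_index(features_by_document)
print(index["開発"], index["作業"])
```

Keying the structure by feature string means the later lookup in Step 10 needs only one access per feature string of the seed document.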

[0062] At search time, the similar document search procedure consisting of Steps 6 to 12 is executed.

[0063] First, in Step 6, Document 2 is read as the seed document.

[0064] Next, in Step 7, the character string of the seed document (Document 2) read in Step 6 is divided at character type boundaries, and same-character-type strings are extracted.

[0065] Next, in Step 8, feature character strings are extracted from the same-character-type strings extracted in Step 7, in the same way as in Step 3 at registration. FIG. 3 outlines the feature character string extraction of Step 8 when Document 2 is designated as the seed document. In this figure, 2-grams are extracted when the same-character-type string is kanji. Whereas extracting every 2-gram from Document 2 yields 13 kinds of 2-gram, this method reduces them to four kinds of feature character string: 「ソフトウェア」 (software), 「開発」 (development), 「発作」 (seizure), and 「作業」 (work). Thus, compared with Prior Art 2 described above, which extracts all n-grams, the present invention greatly reduces the number of kinds of feature character string extracted.

[0066] Next, in Step 9, the appearance frequency within the seed document of each feature character string extracted in Step 8 is counted. Then, in Step 10, the appearance frequency file described above is consulted for the feature character strings extracted in Step 8 to obtain their appearance frequencies in each document in the database. Then, in Step 11, the similarity is calculated for the feature character strings extracted in Step 8, based on the appearance frequencies in the seed document and in each document in the database obtained in Steps 9 and 10. Equation (1) shown for Prior Art 2 may be used as the similarity formula. Calculating the similarity with equation (1) when Document 2 is designated as the seed document gives the following.

[0067] S(1) = 0.077, S(2) = 1.0, S(3) = 0.263, S(4) = 0.148. As a result, when the documents are displayed in descending order of similarity in Step 12, they appear in the order Document 2, Document 3, Document 4, Document 1. Unlike the similarity results of Prior Art 2 (S(1) = 0.036, S(2) = 1.0, S(3) = 0.179, S(4) = 0.190), these results (S(1) = 0.077, S(2) = 1.0, S(3) = 0.263, S(4) = 0.148) rank the documents correctly in order of their similarity to Document 2.

[0068] As described above, the similar document search method of the present invention extracts strings mechanically from the seed document without a word dictionary like that of Prior Art 1, even when searching documents in a language written without spaces between words, such as Japanese. Like Prior Art 2, it can therefore search for any word without omission. In addition, rather than simply extracting n-grams from the document as in Prior Art 2, it extracts feature character strings according to character type, so the search uses semantically coherent strings and achieves highly accurate similar document search. Furthermore, compared with Prior Art 2, which extracts all n-grams, the number of kinds of string extracted is greatly reduced, so similar documents can be retrieved quickly.

[0069]

[Embodiments of the Invention] A first embodiment of the present invention is described below with reference to FIG. 1.

[0070] The first example of a similar document search system to which the present invention is applied comprises a display 100, a keyboard 101, a central processing unit (CPU) 102, a magnetic disk device 105, a floppy disk drive (FDD) 106, a main memory 109, and a bus 108 connecting them.

[0071] The magnetic disk device 105 is a secondary storage device and stores the text 103 and the appearance frequency file 104. Information stored on a floppy disk 107 is read into the main memory 109 or the magnetic disk device 105 via the FDD 106.

【００７２】主メモリ１０９には、システム制御プログ
ラム１１０、文書登録制御プログラム１１１、共有ライ
ブラリ１１２、テキスト登録プログラム１１３、出現頻
度ファイル作成登録プログラム１１４、検索制御プログ
ラム１１８、検索条件式解析プログラム１１９、類似文
書検索プログラム１２０および類似度ソートプログラム
１２６が格納されるとともにワークエリア１３０が確保
される。The main memory 109 stores a system control program 110, a document registration control program 111, a shared library 112, a text registration program 113, an appearance frequency file creation registration program 114, a search control program 118, a search condition expression analysis program 119, a similar document search program 120, and a similarity sort program 126, and a work area 130 is also secured in it.

【００７３】共有ライブラリ１１２は、同一文字種文字
列抽出プログラム１１５、特徴文字列抽出プログラム１
１６、漢字文字列対応特徴文字列抽出プログラム１２７
およびカタカナ文字列対応特徴文字列抽出プログラム１
２８で構成される。The shared library 112 is composed of a same character type character string extraction program 115, a characteristic character string extraction program 116, a kanji character string corresponding characteristic character string extraction program 127, and a katakana character string corresponding characteristic character string extraction program 128.

【００７４】出現頻度ファイル作成登録プログラム１１
４は、出現頻度ファイル作成プログラム１１７で構成さ
れると共に、後述するように同一文字種文字列抽出プロ
グラム１１５と特徴文字列抽出プログラム１１６を呼び
出す構成をとる。The appearance frequency file creation registration program 114 is composed of an appearance frequency file creation program 117 and, as described later, calls the same character type character string extraction program 115 and the characteristic character string extraction program 116.

【００７５】類似文書検索プログラム１２０は、種文書
読込みプログラム１２１、同一文字種文字列抽出プログ
ラム１１５、出現頻度計数プログラム１２３、出現頻度
ファイル読込みプログラム１２４および類似度算出プロ
グラム１２５で構成されると共に、後述するように特徴
文字列抽出プログラム１１６を呼び出す構成をとる。The similar document search program 120 is composed of a seed document reading program 121, the same character type character string extraction program 115, an appearance frequency counting program 123, an appearance frequency file reading program 124, and a similarity calculation program 125 and, as described later, calls the characteristic character string extraction program 116.

【００７６】文書登録制御プログラム１１１および検索
制御プログラム１１８は、ユーザによるキーボード１０
１からの指示に応じてシステム制御プログラム１１０に
よって起動され、それぞれテキスト登録プログラム１１
３および出現頻度ファイル作成登録プログラム１１４の
制御と、検索条件式解析プログラム１１９、類似文書検
索プログラム１２０および類似度ソートプログラム１２
６の制御を行なう。The document registration control program 111 and the search control program 118 are started by the system control program 110 in response to an instruction from the user via the keyboard 101. The former controls the text registration program 113 and the appearance frequency file creation registration program 114; the latter controls the search condition expression analysis program 119, the similar document search program 120, and the similarity sort program 126.

【００７７】以下、本実施例における類似文書検索シス
テムの処理手順について説明する。Hereinafter, the processing procedure of the similar document search system according to the present embodiment will be described.

【００７８】まず、システム制御プログラム１１０の処
理手順について図４のＰＡＤ（ＰｒｏｂｌｅｍＡｎａ
ｌｙｓｉｓＤｉａｇｒａｍ）図を用いて説明する。First, the processing procedure of the system control program 110 will be described with reference to the PAD (Problem Analysis Diagram) of FIG. 4.

【００７９】システム制御プログラム１１０は、まずス
テップ４００で、キーボード１０１から入力されたコマ
ンドを解析する。First, in step 400, the system control program 110 analyzes a command input from the keyboard 101.

【００８０】そしてステップ４０１で、この結果が登録
実行のコマンドであると解析された場合には、ステップ
４０２で文書登録制御プログラム１１１を起動して、文
書の登録を行なう。If it is determined in step 401 that the result is a command to execute registration, the document registration control program 111 is activated in step 402 to register a document.

【００８１】またステップ４０３で、検索実行のコマン
ドであると解析された場合には、ステップ４０４で検索
制御プログラム１１８を起動して、類似文書の検索を行
なう。If it is determined in step 403 that the command is a search execution command, the search control program 118 is activated in step 404 to search for a similar document.

【００８２】以上が、システム制御プログラム１１０の
処理手順である。The processing procedure of the system control program 110 has been described above.

【００８３】次に、図４に示したステップ４０２でシス
テム制御プログラム１１０により起動される文書登録制
御プログラム１１１の処理手順について、図５のＰＡＤ
図を用いて説明する。Next, the processing procedure of the document registration control program 111, which is started by the system control program 110 in step 402 shown in FIG. 4, will be described with reference to the PAD of FIG. 5.

【００８４】文書登録制御プログラム１１１は、まずス
テップ５００でテキスト登録プログラム１１３を起動
し、ＦＤＤ１０６に挿入されたフロッピディスク１０７
から登録すべき文書のテキストデータをワークエリア１
３０に読み込み、これをテキスト１０３として磁気ディ
スク装置１０５に格納する。テキストデータは、フロッ
ピディスク１０７を用いて入力するだけに限らず、通信
回線やＣＤ−ＲＯＭ装置（図１には示していない）等を
用いて他の装置から入力するような構成を取ることも可
能である。The document registration control program 111 first starts the text registration program 113 in step 500, reads the text data of the document to be registered from the floppy disk 107 inserted in the FDD 106 into the work area 130, and stores it in the magnetic disk device 105 as text 103. Text data need not be input from the floppy disk 107; it may also be input from another device via a communication line, a CD-ROM device (not shown in FIG. 1), or the like.

【００８５】次に、ステップ５０１で出現頻度ファイル
作成登録プログラム１１４を起動し、磁気ディスク装置
１０５に格納されているテキスト１０３を読み出し、そ
の中の各文書における出現頻度ファイル１０４を作成
し、磁気ディスク装置１０５に格納する。Next, in step 501, the appearance frequency file creation registration program 114 is started; it reads the text 103 stored in the magnetic disk device 105, creates the appearance frequency file 104 for each document in the text, and stores it in the magnetic disk device 105.

【００８６】以上が、文書登録制御プログラム１１１の
処理手順である。The above is the processing procedure of the document registration control program 111.

【００８７】次に、図５に示したステップ５０１で文書
登録制御プログラム１１１により起動される出現頻度フ
ァイル作成登録プログラム１１４の処理手順について、
図６のＰＡＤ図を用いて説明する。Next, the processing procedure of the appearance frequency file creation registration program 114, which is started by the document registration control program 111 in step 501 shown in FIG. 5, will be described with reference to the PAD of FIG. 6.

【００８８】出現頻度ファイル作成登録プログラム１１
４は、まずステップ６００で同一文字種文字列抽出プロ
グラム１１５を起動し、テキスト１０３をワークエリア
１３０に読み込み、文字種境界でその文字列を分割する
ことにより同一文字種文字列を抽出し、ワークエリア１
３０に格納する。The appearance frequency file creation registration program 114 first starts the same character type character string extraction program 115 in step 600, reads the text 103 into the work area 130, extracts same character type character strings by dividing the text at character type boundaries, and stores them in the work area 130.

【００８９】次に、ステップ６０１において、特徴文字
列抽出プログラム１１６を起動し、ワークエリア１３０
に格納されている同一文字種文字列から特徴文字列を抽
出し、同じくワークエリア１３０に格納する。Next, in step 601, the characteristic character string extraction program 116 is started; it extracts characteristic character strings from the same character type character strings stored in the work area 130 and stores them in the work area 130 as well.

【００９０】そして、ステップ６０２において、出現頻
度ファイル作成プログラム１１７を起動し、ワークエリ
ア１３０に格納されている特徴文字列を参照して、その
出現頻度を計数し、出現頻度ファイル１０４として磁気
ディスク装置１０５に格納する。Then, in step 602, the appearance frequency file creation program 117 is started; it counts the appearance frequency of each characteristic character string stored in the work area 130 and stores the result in the magnetic disk device 105 as the appearance frequency file 104.

【００９１】以上が、出現頻度ファイル作成登録プログ
ラム１１４の処理手順である。The above is the processing procedure of the appearance frequency file creation registration program 114.

【００９２】次に、図６に示したステップ６０１におい
て出現頻度ファイル作成登録プログラム１１４により起
動される特徴文字列抽出プログラム１１６の処理手順に
ついて、図７のＰＡＤ図を用いて説明する。Next, the processing procedure of the characteristic character string extraction program 116 started by the appearance frequency file creation / registration program 114 in step 601 shown in FIG. 6 will be described with reference to the PAD diagram of FIG.

【００９３】特徴文字列抽出プログラム１１６は、同一
文字種文字列抽出プログラム１１５により抽出された同
一文字種文字列の数を調べ、全ての同一文字種文字列に
ついてステップ７０１以降を繰り返し実行する（ステッ
プ７００）。The characteristic character string extraction program 116 checks the number of identical character type character strings extracted by the identical character type character string extraction program 115, and repeatedly executes Step 701 and subsequent steps for all the identical character type character strings (Step 700).

【００９４】ステップ７０１では、ワークエリア１３０
に格納されている同一文字種文字列の文字種を判定し、
その文字種が漢字の場合にはステップ７０２を実行し、
カタカナの場合には、ステップ７０３を実行する。In step 701, the character type of the same character type character string stored in the work area 130 is determined. If the character type is kanji, step 702 is executed; if it is katakana, step 703 is executed.

【００９５】ステップ７０２では、後述する漢字文字列
対応特徴文字列抽出プログラム１２７を起動し、漢字文
字列から特徴文字列を抽出する。In step 702, a kanji character string corresponding characteristic character string extraction program 127 to be described later is started to extract a characteristic character string from the kanji character string.

【００９６】ステップ７０３では、同様に後述するカタ
カナ文字列対応特徴文字列抽出プログラム１２８を起動
し、カタカナ文字列から特徴文字列を抽出する。In step 703, a katakana character string corresponding characteristic character string extraction program 128 similarly described later is started to extract a characteristic character string from the katakana character string.

【００９７】以上が、特徴文字列抽出プログラム１１６
の処理手順である。The above is the processing procedure of the characteristic character string extraction program 116.

【００９８】次に、図７に示したステップ７０２で特徴
文字列抽出プログラム１１６により起動される漢字文字
列対応特徴文字列抽出プログラム１２７の処理手順につ
いて、図８のＰＡＤ図を用いて説明する。Next, the processing procedure of the kanji character string corresponding characteristic character string extraction program 127 started by the characteristic character string extraction program 116 in step 702 shown in FIG. 7 will be described with reference to the PAD diagram of FIG.

【００９９】漢字文字列対応特徴文字列抽出プログラム
１２７では、ステップ８００において、同一文字種文字
列抽出プログラム１１５により抽出されワークエリア１
３０に格納されている漢字文字列を取得する。そし
て、ステップ８０１において、上記ステップ８００で取
得した漢字文字列の先頭から一文字ずつずらしながら、
n-gram（nの値は、予め定めておく)を特徴文字列として
抽出する。In the kanji character string corresponding characteristic character string extraction program 127, in step 800, the kanji character string extracted by the same character type character string extraction program 115 and stored in the work area 130 is acquired. Then, in step 801, n-grams (the value of n is determined in advance) are extracted as characteristic character strings while shifting one character at a time from the beginning of the kanji character string acquired in step 800.

【０１００】以上が、漢字文字列対応特徴文字列抽出プ
ログラム１２７の処理手順である。The above is the processing procedure of the kanji character string corresponding characteristic character string extraction program 127.

【０１０１】次に、図７に示したステップ７０３で特徴
文字列抽出プログラム１１６により起動されるカタカナ
文字列対応特徴文字列抽出プログラム１２８の処理手順
について、図９のＰＡＤ図を用いて説明する。Next, the processing procedure of the katakana character string corresponding characteristic character string extraction program 128 started by the characteristic character string extraction program 116 in step 703 shown in FIG. 7 will be described with reference to the PAD diagram of FIG.

【０１０２】カタカナ文字列対応特徴文字列抽出プログ
ラム１２８では、ステップ９００において、同一文字種
文字列抽出プログラム１１５により抽出されワークエリ
ア１３０に格納されているカタカナ文字列を取得する。In the katakana character string corresponding characteristic character string extraction program 128, in step 900, katakana character strings extracted by the same character type character string extraction program 115 and stored in the work area 130 are obtained.

【０１０３】そして、ステップ９０１において、上記ス
テップ９００で取得したカタカナ文字列そのものを特徴
文字列として抽出する。At step 901, the katakana character string itself obtained at step 900 is extracted as a characteristic character string.

【０１０４】以上が、カタカナ文字列対応特徴文字列抽
出プログラム１２８の処理手順である。The above is the processing procedure of the katakana character string corresponding characteristic character string extraction program 128.

【０１０５】以下に、図７に示した特徴文字列抽出プロ
グラム１１６の処理手順について具体例を用いて説明す
る。Hereinafter, the processing procedure of the characteristic character string extraction program 116 shown in FIG. 7 will be described using a specific example.

【０１０６】まず、図７の特徴文字列抽出プログラム１
１６のステップ７０２における漢字文字列対応特徴文字
列抽出プログラム１２７と、ステップ７０３におけるカ
タカナ文字列対応特徴文字列抽出プログラム１２８の処
理手順について、図１０〜図１２の例を用いて説明す
る。漢字文字列対応特徴文字列抽出プログラム１２７
とカタカナ文字列対応特徴文字列抽出プログラム１２８
は特徴文字列抽出プログラム１１６によって起動され
る。このとき、同一文字種文字列抽出プログラム１１５
によって抽出された同一文字種文字列が漢字文字列対応
特徴文字列抽出プログラム１２７とカタカナ文字列対応
特徴文字列抽出プログラム１２８へワークエリア１３０
を介して渡される。First, the processing procedures of the kanji character string corresponding characteristic character string extraction program 127 in step 702 and the katakana character string corresponding characteristic character string extraction program 128 in step 703 of the characteristic character string extraction program 116 in FIG. 7 will be described using the examples of FIGS. 10 to 12. Both programs are started by the characteristic character string extraction program 116. At that time, the same character type character strings extracted by the same character type character string extraction program 115 are passed to them via the work area 130.

【０１０７】図１０は文書１、文書２、文書３および文
書４からなるテキスト１０３から、同一文字種文字列抽
出プログラム１１５により同一文字種文字列が抽出され
た結果を示したものである。例えば、文書２「新しいソ
フトウェアの開発作業」からは「新」「しい」「ソフト
ウェア」「の」「開発作業」という５個の同一文字種文
字列が抽出される。FIG. 10 shows the result of extracting same character type character strings from the text 103, which consists of document 1, document 2, document 3, and document 4, using the same character type character string extraction program 115. For example, five same character type character strings, 「新」 ("new"), 「しい」 (an inflectional ending), 「ソフトウェア」 ("software"), 「の」 (a particle), and 「開発作業」 ("development work"), are extracted from document 2, 「新しいソフトウェアの開発作業」 ("new software development work").
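
The division at character type boundaries can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `char_type` and `split_by_char_type` are hypothetical, and the Unicode ranges used to classify characters are a simplification that the patent does not specify.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Classify one character (assumed, simplified Unicode ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block, including the long-vowel mark
        return "katakana"
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    return "other"


def split_by_char_type(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing one character type."""
    return [(ctype, "".join(run)) for ctype, run in groupby(text, key=char_type)]


runs = split_by_char_type("新しいソフトウェアの開発作業")
print([s for _, s in runs])  # ['新', 'しい', 'ソフトウェア', 'の', '開発作業']
```

Applied to document 2, the sketch yields the same five strings as the example above. Each run keeps its character type so that a later step can choose how to treat it.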

【０１０８】この抽出された同一文字種文字列の文字種
にしたがって、特徴文字列抽出プログラム１１６は、漢
字文字列対応特徴文字列抽出プログラム１２７あるいは
カタカナ文字列対応特徴文字列抽出プログラム１２８を
起動する。The characteristic character string extraction program 116 activates the kanji character string corresponding characteristic character string extraction program 127 or the katakana character string corresponding characteristic character string extraction program 128 in accordance with the extracted character type of the same character type character string.

【０１０９】漢字文字列対応特徴文字列抽出プログラム
１２７は、ワークエリア１３０に格納されている漢字文
字列の先頭から一文字ずつずらしながら、全ての2-gram
を特徴文字列として抽出する。図１１は、図１０の例で
抽出された漢字文字列から、漢字文字列対応特徴文字列
抽出プログラム１２７により特徴文字列を抽出した結果
を示している。例えば、同一文字種文字列１０００の中
で文書２から抽出された「新」「しい」「ソフトウェ
ア」「の」「開発作業」からは、「開発」「発作」「作
業」が抽出される。The kanji character string corresponding characteristic character string extraction program 127 extracts all 2-grams as characteristic character strings while shifting one character at a time from the beginning of the kanji character string stored in the work area 130. FIG. 11 shows the result of extracting characteristic character strings, using this program, from the kanji character strings extracted in the example of FIG. 10. For example, from the same character type character strings 1000 extracted from document 2 (「新」, 「しい」, 「ソフトウェア」, 「の」, 「開発作業」), the 2-grams 「開発」, 「発作」, and 「作業」 are extracted.
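
The one-character sliding extraction for kanji strings can be illustrated with a short Python sketch. The function name `kanji_ngrams` is hypothetical; the default n = 2 follows the example in this paragraph.

```python
def kanji_ngrams(kanji_string: str, n: int = 2) -> list[str]:
    """Return every n-gram of kanji_string, sliding one character at a time."""
    return [kanji_string[i:i + n] for i in range(len(kanji_string) - n + 1)]


print(kanji_ngrams("開発作業"))  # ['開発', '発作', '作業']
print(kanji_ngrams("新"))        # [] because the string is shorter than n
```

Note that the single-character string 「新」 produces no 2-gram, which matches the example: only 「開発」, 「発作」, and 「作業」 are obtained from document 2.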

【０１１０】カタカナ文字列対応特徴文字列抽出プログ
ラム１２８は、ワークエリア１３０に格納されているカ
タカナ文字列そのものを特徴文字列として抽出する。図
１２は、図１０の例で抽出されたカタカナ文字列から、
カタカナ文字列対応特徴文字列抽出プログラムにより特
徴文字列を抽出した結果である。例えば、同一文字種文
字列１０００の中で文書２から抽出された「新」「し
い」「ソフトウェア」「の」「開発作業」からは、「ソ
フトウェア」が抽出される。The katakana character string corresponding characteristic character string extraction program 128 extracts the katakana character string stored in the work area 130, as it is, as a characteristic character string. FIG. 12 shows the result of extracting characteristic character strings, using this program, from the katakana character strings extracted in the example of FIG. 10. For example, from the same character type character strings 1000 extracted from document 2 (「新」, 「しい」, 「ソフトウェア」, 「の」, 「開発作業」), 「ソフトウェア」 is extracted.

【０１１１】以上が、第一の実施例における特徴文字列
抽出プログラム１１６のステップ７０２における漢字文
字列対応特徴文字列抽出プログラム１２７と、ステップ
７０３におけるカタカナ文字列対応特徴文字列抽出プロ
グラム１２８の処理手順である。The above are the processing procedures of the kanji character string corresponding characteristic character string extraction program 127 in step 702 and the katakana character string corresponding characteristic character string extraction program 128 in step 703 of the characteristic character string extraction program 116 in the first embodiment.

【０１１２】この例では、漢字文字列対応特徴文字列抽
出プログラム１２７の処理として、漢字文字列から2-gr
amを特徴文字列として抽出するものとして説明したが、
1-gram、あるいは3-gram以上であっても、さらには、そ
れらの組み合わせであっても、同様に特徴文字列抽出の
処理を行うことができることは明らかであろう。In this example, the kanji character string corresponding characteristic character string extraction program 127 was described as extracting 2-grams from a kanji character string as characteristic character strings. However, it will be apparent that characteristic character strings can be extracted in the same way using 1-grams, n-grams with n of 3 or more, or a combination of these.

【０１１３】次に、図４に示したステップ４０４でシス
テム制御プログラム１１０により起動される検索制御プ
ログラム１１８による類似文書検索の処理手順につい
て、図１３のＰＡＤ図を用いて説明する。Next, the processing procedure of similar document search by the search control program 118 started by the system control program 110 in step 404 shown in FIG. 4 will be described with reference to the PAD diagram of FIG.

【０１１４】検索制御プログラム１１８は、まずステッ
プ１３００で検索条件式解析プログラム１１９を起動
し、キーボード１０１から入力された検索条件式を解析
し、検索条件式のパラメータとして指定された種文書番
号を抽出する。The search control program 118 first starts the search condition expression analysis program 119 in step 1300, analyzes the search condition expression input from the keyboard 101, and extracts the seed document number specified as a parameter of the search condition expression.

【０１１５】次に、ステップ１３０１で類似文書検索プ
ログラム１２０を起動し、上記ステップ１３００で抽出
された種文書番号に対し、磁気ディスク装置１０５に格
納されているテキスト１０３中の各文書の類似度を算出
する。Next, in step 1301, the similar document search program 120 is started, and the similarity of each document in the text 103 stored in the magnetic disk device 105 to the seed document with the number extracted in step 1300 is calculated.

【０１１６】そして、ステップ１３０２において、類似
度ソートプログラム１２６を起動し、上記ステップ１３
０１で算出された各文書の類似度を降順にソートする。Then, in step 1302, the similarity sort program 126 is started, and the similarities of the documents calculated in step 1301 are sorted in descending order.

【０１１７】最後に、ステップ１３０３において上記ス
テップ１３０２でソートされた類似度を各文書番号と共
に出力する。Finally, in step 1303, the similarities sorted in step 1302 are output together with the respective document numbers.
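
Steps 1302 and 1303 amount to a descending sort followed by output. A minimal Python sketch is shown below; the dictionary layout and the name `rank_documents` are assumptions made for illustration, and the sample values are the similarities S(1) to S(4) reported for FIG. 18 in paragraph [0131].

```python
def rank_documents(similarities: dict[int, float]) -> list[tuple[int, float]]:
    """Sort (document number, similarity) pairs by similarity, highest first."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)


similarities = {1: 0.077, 2: 1.0, 3: 0.263, 4: 0.148}
for doc_id, score in rank_documents(similarities):
    print(doc_id, score)
```

With these values the documents are listed in the order 2, 3, 4, 1.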

【０１１８】以上が、検索制御プログラム１１８による
文書検索の処理手順である。The above is the processing procedure of the document search by the search control program 118.

【０１１９】次に、図１３に示したステップ１３０１で
検索制御プログラム１１８により起動される類似文書検
索プログラム１２０の処理手順について、図１４のＰＡ
Ｄ図を用いて説明する。類似文書検索プログラム１２
０は、まずステップ１４００で種文書読込みプログラム
１２１を起動し、検索条件式解析プログラム１１９によ
って検索条件式から抽出された文書番号の種文書をワー
クエリア１３０に読み込む。ここで、種文書は、テキス
ト１０３中に格納されている文書を読み込むだけでな
く、フロッピディスク１０７、ＣＤ−ＲＯＭ装置（図１
には示していない）や通信回線等を用いて、他の装置か
ら入力するような構成を取ることも可能であり、また、
全文検索システム等による検索結果から入力するような
構成を取ることも可能であり、類似度ソートプログラム
１２６の出力から種文書を選択する構成を取ることも可
能である。Next, the processing procedure of the similar document search program 120, which is started by the search control program 118 in step 1301 shown in FIG. 13, will be described with reference to the PAD of FIG. 14. The similar document search program 120 first starts the seed document reading program 121 in step 1400 and reads the seed document whose document number was extracted from the search condition expression by the search condition expression analysis program 119 into the work area 130. The seed document need not be a document stored in the text 103: it may be input from another device using the floppy disk 107, a CD-ROM device (not shown in FIG. 1), a communication line, or the like; it may be taken from the search results of a full-text search system or the like; or it may be selected from the output of the similarity sort program 126.

【０１２０】次に、ステップ１４０１において、同一文
字種文字列抽出プログラム１１５を起動し、上記種文書
読込みステップ１４００で読み込んだ種文書のテキスト
を、文字種境界で分割して同一文字種文字列を取得し、
ワークエリア１３０に格納する。Next, in step 1401, the same character type character string extraction program 115 is started; it divides the text of the seed document read in the seed document reading step 1400 at character type boundaries to obtain same character type character strings and stores them in the work area 130.

【０１２１】そして、ステップ１４０２において特徴文
字列抽出プログラム１１６を起動し、上記同一文字種文
字列抽出ステップ１４０１で取得した同一文字種文字列
から、特徴文字列を抽出する。Then, in step 1402, the characteristic character string extraction program 116 is started, and a characteristic character string is extracted from the same character type character string acquired in the same character type character string extraction step 1401.

【０１２２】図１５に、この処理の具体例を示す。特徴
文字列抽出プログラム１１６の処理手順に関しては、前
に説明した通りである。FIG. 15 shows a specific example of this processing. The processing procedure of the characteristic character string extraction program 116 is as described above.

【０１２３】本例では、種文書である文書２「新しいソ
フトウェアの開発作業」から、「新」「しい」「ソフト
ウェア」「の」「開発作業」という５個の同一文字種文
字列１５００が抽出されることになる。この抽出された
同一文字種文字列１５００の文字種にしたがって、特徴
文字列を抽出する。この結果、文書２からは「ソフトウ
ェア」「開発」「発作」「作業」の４個の特徴文字列１
５０１が抽出される。In this example, five same character type character strings 1500, 「新」, 「しい」, 「ソフトウェア」, 「の」, and 「開発作業」, are extracted from the seed document, document 2, 「新しいソフトウェアの開発作業」 ("new software development work"). Characteristic character strings are then extracted according to the character type of each of these strings. As a result, four characteristic character strings 1501, 「ソフトウェア」, 「開発」, 「発作」, and 「作業」, are extracted from document 2.
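
The dispatch by character type (steps 701 to 703) can be sketched in Python as follows. The function name `extract_features` and the representation of the input as (character type, string) pairs are assumptions for illustration; strings of types other than kanji and katakana are simply skipped, as in the first embodiment.

```python
def extract_features(runs: list[tuple[str, str]], n: int = 2) -> list[str]:
    """Extract characteristic strings from typed runs, in document order."""
    features = []
    for ctype, s in runs:
        if ctype == "kanji":
            # Kanji run: every n-gram, sliding one character at a time.
            features.extend(s[i:i + n] for i in range(len(s) - n + 1))
        elif ctype == "katakana":
            # Katakana run: the whole run is one characteristic string.
            features.append(s)
        # Runs of any other character type yield no characteristic string.
    return features


runs = [
    ("kanji", "新"),
    ("hiragana", "しい"),
    ("katakana", "ソフトウェア"),
    ("hiragana", "の"),
    ("kanji", "開発作業"),
]
print(extract_features(runs))  # ['ソフトウェア', '開発', '発作', '作業']
```

The output contains the same four characteristic character strings as the example for document 2.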

【０１２４】次に図１４のステップ１４０３で、出現頻
度計数プログラム１２３を起動し、上記特徴文字列抽出
ステップ１４０２で抽出した特徴文字列の種文書内にお
ける出現頻度を計数する。Next, in step 1403 of FIG. 14, the appearance frequency counting program 123 is started, and the appearance frequency of the characteristic character string extracted in the characteristic character string extraction step 1402 in the seed document is counted.

【０１２５】図１６に、この具体例を示す。本図は、図
１５に例示した種文書から抽出された特徴文字列１５０
１の出現頻度を計数した結果を示している。すなわち、
「（ソフトウェア，１）、（開発，１）、（発作，
１）、（作業，１）」という出現頻度１６００が得られ
ている。ここで、例えば（開発，１）は、特徴文字列
「開発」が「１」回出現するということを示している。FIG. 16 shows a specific example. It shows the result of counting the appearance frequency of the characteristic character strings 1501 extracted from the seed document illustrated in FIG. 15. That is, the appearance frequency 1600 "(ソフトウェア, 1), (開発, 1), (発作, 1), (作業, 1)" is obtained. Here, for example, (開発, 1) indicates that the characteristic character string 「開発」 appears once.
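
Counting the appearance frequency in the seed document is a plain tally of the extracted strings. A minimal Python sketch, with the hypothetical helper name `count_features`, is:

```python
from collections import Counter


def count_features(features: list[str]) -> dict[str, int]:
    """Count how many times each characteristic string occurs."""
    return dict(Counter(features))


seed_features = ["ソフトウェア", "開発", "発作", "作業"]
print(count_features(seed_features))
# {'ソフトウェア': 1, '開発': 1, '発作': 1, '作業': 1}
```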

【０１２６】次に、図１４のステップ１４０４で、出現
頻度ファイル読込みプログラム１２４を起動し、上記特
徴文字列抽出ステップ１４０２で抽出した特徴文字列
の、テキスト１０３中の各文書における出現頻度を出現
頻度ファイル１０４から読み込む。Next, in step 1404 of FIG. 14, the appearance frequency file reading program 124 is started, and the appearance frequency, in each document in the text 103, of the characteristic character strings extracted in the characteristic character string extraction step 1402 is read from the appearance frequency file 104.

【０１２７】図１７に、この具体例を示す。ここでは、
図１５の例で抽出された特徴文字列１５０１のテキスト
１０３中の各文書における出現頻度を、読み込んだ出現
頻度ファイルから取得した結果を示している。FIG. 17 shows a specific example. It shows the appearance frequency, in each document in the text 103, of the characteristic character strings 1501 extracted in the example of FIG. 15, as obtained from the appearance frequency file that was read.

【０１２８】この例では、種文書から抽出された「ソフ
トウェア」「開発」「発作」「作業」という４個の特徴
文字列１５０１の出現頻度を、出現頻度ファイル１０４
から得る。この結果、出現頻度１７００として、例えば
文書３の場合「（ソフトウェア，１）、（開発，１）、
（発作，０）、（作業，０）」という値を得ることがで
きる。In this example, the appearance frequencies of the four characteristic character strings 1501 extracted from the seed document, 「ソフトウェア」, 「開発」, 「発作」, and 「作業」, are obtained from the appearance frequency file 104. As a result, the appearance frequency 1700 is obtained; for document 3, for example, the values are "(ソフトウェア, 1), (開発, 1), (発作, 0), (作業, 0)".
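
The lookup in step 1404 can be pictured as reading, for each registered document, the counts of only those strings that were extracted from the seed document. The Python sketch below is an illustration under assumptions: the in-memory dictionary stands in for the appearance frequency file 104, the name `lookup_frequencies` is hypothetical, and only the rows for documents 2 and 3 are filled in because the counts for documents 1 and 4 are not given in this passage.

```python
def lookup_frequencies(frequency_table, features):
    """For each document, list (string, count) for the seed's characteristic strings."""
    return {
        doc_id: [(f, counts.get(f, 0)) for f in features]
        for doc_id, counts in frequency_table.items()
    }


# Assumed stand-in for the appearance frequency file (partial rows only).
frequency_table = {
    2: {"ソフトウェア": 1, "開発": 1, "発作": 1, "作業": 1},
    3: {"ソフトウェア": 1, "開発": 1},
}
seed_features = ["ソフトウェア", "開発", "発作", "作業"]
print(lookup_frequencies(frequency_table, seed_features)[3])
# [('ソフトウェア', 1), ('開発', 1), ('発作', 0), ('作業', 0)]
```

A string that is absent from a document's row is reported with frequency 0, which reproduces the values quoted for document 3.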

【０１２９】最後に、図１４のステップ１４０５で、類
似度算出プログラム１２５を起動し、上記出現頻度計数
ステップ１４０３で計数した特徴文字列の種文書内にお
ける出現頻度と、上記出現頻度ファイル読込みステップ
１４０４で読み込んだ特徴文字列のテキスト１０３内の
各文書における出現頻度から、テキスト１０３中の各文
書との類似度を算出する。Finally, in step 1405 of FIG. 14, the similarity calculation program 125 is started. It calculates the similarity to each document in the text 103 from the appearance frequency of the characteristic character strings in the seed document, counted in the appearance frequency counting step 1403, and their appearance frequency in each document in the text 103, read in the appearance frequency file reading step 1404.

【０１３０】図１８に、この具体例を示す。ここでは、
図１６の例で計数した種文書における出現頻度１６００
および図１７の例で取得したテキスト１０３中の各文書
における出現頻度１７００を用いて、各文書の類似度S
(1)〜S(4)を算出した結果を示している。すなわち、次
のような結果が得られる。FIG. 18 shows a specific example. It shows the result of calculating the similarities S(1) to S(4) of the documents, using the appearance frequency 1600 in the seed document counted in the example of FIG. 16 and the appearance frequency 1700 in each document in the text 103 obtained in the example of FIG. 17. That is, the following results are obtained.

【０１３１】 S(1)=0.077 S(2)=1.0 S(3)=0.263 S(4)=0.148 本実施例では、この類似度の算出に、従来技術２に開示
されている式（１）を用いるが、他の方法を用いても構
わない。S(1) = 0.077, S(2) = 1.0, S(3) = 0.263, S(4) = 0.148. In the present embodiment, the similarity is calculated using equation (1) disclosed in prior art 2, but other methods may be used.
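
Equation (1) of prior art 2 is not reproduced in this part of the description, so it cannot be shown here. Since the text allows other methods, the following Python sketch substitutes cosine similarity purely as a stand-in. It is not the formula used in the embodiment and does not reproduce the values S(1) to S(4) above; the function name and the restriction of the document vector to the seed's characteristic strings are likewise assumptions.

```python
import math


def cosine_similarity(seed_freq: dict[str, int], doc_freq: dict[str, int]) -> float:
    """Cosine similarity over the seed's characteristic strings (stand-in measure)."""
    dot = sum(count * doc_freq.get(f, 0) for f, count in seed_freq.items())
    seed_norm = math.sqrt(sum(count * count for count in seed_freq.values()))
    # Only strings extracted from the seed are looked up, so the document
    # vector is restricted to those strings as well.
    doc_norm = math.sqrt(sum(doc_freq.get(f, 0) ** 2 for f in seed_freq))
    if seed_norm == 0 or doc_norm == 0:
        return 0.0
    return dot / (seed_norm * doc_norm)


seed = {"ソフトウェア": 1, "開発": 1, "発作": 1, "作業": 1}
doc3 = {"ソフトウェア": 1, "開発": 1}
print(cosine_similarity(seed, seed))
print(cosine_similarity(seed, doc3))
```

Whatever measure is chosen, a document with the same frequencies as the seed should receive the maximum score, as document 2 does in the example.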

【０１３２】以上が、類似文書検索プログラム１２０の
処理手順である。The above is the processing procedure of the similar document search program 120.

【０１３３】以上が、本発明の第一の実施例である。The above is the first embodiment of the present invention.

【０１３４】なお、本実施例においては、特徴文字列抽
出プログラム１１６は、漢字対応特徴文字列抽出プログ
ラム１２７およびカタカナ文字列対応特徴文字列抽出プ
ログラム１２８を含む構成としたが、英字や数字等に対
応した特徴文字列抽出プログラムを含む構成としてもよ
いし、漢字文字列対応特徴文字列抽出プログラム１２７
あるいはカタカナ文字列対応特徴文字列抽出プログラム
１２８を含まない構成であってもよい。In the present embodiment, the characteristic character string extraction program 116 includes the kanji character string corresponding characteristic character string extraction program 127 and the katakana character string corresponding characteristic character string extraction program 128. However, it may also include characteristic character string extraction programs for other character types such as alphabetic characters and numerals, and it may omit the kanji character string corresponding characteristic character string extraction program 127 or the katakana character string corresponding characteristic character string extraction program 128.

【０１３５】また、本実施例においては、同一文字種文
字列から特徴文字列を抽出する構成としたが、特定の文
字種間を境界として前後に跨る部分文字列を特徴文字列
として抽出することにより、例えば、「Ｆ１」や「ビタ
ミンＣ」等の文字列を検索に用いることもでき、さらに
高精度な類似文書検索を実現することが可能となる。Also, in the present embodiment, characteristic character strings are extracted from same character type character strings. However, by extracting as a characteristic character string a partial character string that spans the boundary between specific character types, character strings such as 「Ｆ１」 and 「ビタミンＣ」 ("vitamin C") can also be used for the search, which makes an even more accurate similar document search possible.
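
One possible reading of this variation is sketched below in Python. It joins two adjacent runs whenever their pair of character types is on an allowed list. The name `boundary_spanning_strings`, the type labels, and the choice to join whole runs rather than shorter partial strings are all assumptions made for illustration; the patent does not fix these details.

```python
def boundary_spanning_strings(runs, allowed_pairs):
    """Join adjacent runs whose (left type, right type) pair is allowed."""
    return [
        left + right
        for (ltype, left), (rtype, right) in zip(runs, runs[1:])
        if (ltype, rtype) in allowed_pairs
    ]


allowed_pairs = {("katakana", "alpha"), ("alpha", "digit")}
runs = [
    ("katakana", "ビタミン"),
    ("alpha", "Ｃ"),
    ("hiragana", "を"),
    ("kanji", "含"),
    ("hiragana", "む"),
]
print(boundary_spanning_strings(runs, allowed_pairs))  # ['ビタミンＣ']
```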

【０１３６】さらに、本実施例においては、出現頻度フ
ァイル１０４を図２に示した表形式で作成されるものと
したが、この方法では、データベースが大規模になるに
伴い特徴文字列の種類が増加するため、出現頻度ファイ
ル読込みステップ１４０４の処理に長大な時間を要する
ことになる。この問題は、特徴文字列に対して検索用の
インデクスを付加することにより解決できる。これによ
り、大規模なデータベースに対しても高速な類似文書検
索を実現することが可能となる。この特徴文字列に対す
る検索用インデクスとしては、「特開平８−３２９１１
２号公報」等に開示されているような単語インデクス方
式を用いることができる。Furthermore, in the present embodiment, the appearance frequency file 104 is created in the table format shown in FIG. 2. With this method, however, the number of distinct characteristic character strings increases as the database grows, so the appearance frequency file reading step 1404 takes a long time. This problem can be solved by adding a search index over the characteristic character strings, which makes a high-speed similar document search possible even for a large-scale database. As the search index for the characteristic character strings, a word index method such as that disclosed in Japanese Patent Application Laid-Open No. 8-329112 can be used.

【０１３７】次に、本発明の第二の実施例について図１
９を用いて説明する。Next, a second embodiment of the present invention will be described with reference to FIG. 19.

【０１３８】本発明を適用した類似文書検索システムの
第二例は、種文書から抽出した特徴文字列のデータベー
ス内の各文書における出現頻度の取得に、検索漏れのな
い全文検索インデクスを利用するものである。これによ
り、本類似文書検索システムを全文検索システムと組み
合わせて実現した場合に、出現頻度ファイルをもつ必要
がなくなる。The second example of a similar document search system to which the present invention is applied uses a full-text search index with no search omissions to obtain the appearance frequency, in each document in the database, of the characteristic character strings extracted from the seed document. Consequently, when this similar document search system is implemented in combination with a full-text search system, there is no need to keep an appearance frequency file.

【０１３９】すなわち、本方法によれば、第一の実施例
における出現頻度ファイル１０４の特徴文字列の検索に
全文検索インデクスを利用することができ、大規模なデ
ータベースに対しても高速な類似文書検索を実現するこ
とが可能となる。さらに、出現頻度ファイル１０４を全
文検索用インデクスで代用するため、第一の実施例に比
べ必要となる磁気ディスク容量を削減できることにな
る。That is, according to the present method, the full-text search index can be used for searching for the characteristic character string of the appearance frequency file 104 in the first embodiment, and a high-speed similar document can be used for a large-scale database. A search can be realized. Furthermore, since the appearance frequency file 104 is substituted by the full-text search index, the required magnetic disk capacity can be reduced as compared with the first embodiment.

【０１４０】本実施例は、第一の実施例（図１）とほぼ
同様の構成を取るが、類似文書検索プログラム１２０を
構成する出現頻度ファイル読込みプログラム１２４が異
なる。このプログラムの代わりに、図１９に示すよう
に、特徴文字列検索プログラム１９００が用いられる。This embodiment has substantially the same configuration as the first embodiment (FIG. 1), but differs in the appearance frequency file reading program 124 constituting the similar document search program 120. Instead of this program, a characteristic character string search program 1900 is used as shown in FIG.

【０１４１】以下、本実施例における処理手順のうち、
第一の実施例とは異なる類似文書検索プログラム１２０
ａの処理手順について図２０を用いて説明する。Hereinafter, of the processing procedures in this embodiment, the processing procedure of the similar document search program 120a, which differs from the first embodiment, will be described with reference to FIG. 20.

【０１４２】ここで、第一の実施例における類似文書検
索プログラム１２０（図１４）と異なる点は、出現頻度
取得ステップ２００４だけである。他の処理ステップの
処理手順は、第一の実施例で説明した通りである。Here, the only difference from the similar document search program 120 (FIG. 14) in the first embodiment is the appearance frequency acquisition step 2004. The processing procedure of the other processing steps is as described in the first embodiment.

【０１４３】出現頻度取得ステップ２００４では、特徴
文字列検索プログラム１９００を起動し、特徴文字列抽
出ステップ１４０２で抽出された特徴文字列を全文検索
システム１９０１で検索することにより、テキスト１０
３内の各文書における出現頻度を取得する。In the appearance frequency acquisition step 2004, the characteristic character string search program 1900 is started, and the characteristic character strings extracted in the characteristic character string extraction step 1402 are searched for by the full-text search system 1901 to obtain their appearance frequency in each document in the text 103.

【０１４４】本実施例の出現頻度取得ステップ２００４
で用いる特徴文字列検索プログラム１９００は、検索漏
れがなく、かつ、各文書における出現頻度を取得できる
全文検索方式であれば、どのような方式を適用しても構
わない。例えば、「特開昭６４−３５６２７号公報」
（以下、従来技術３と呼ぶ）で開示されているようなn-
gramインデクス方式を用いることも可能である。The characteristic character string search program 1900 used in the appearance frequency acquisition step 2004 of this embodiment may use any full-text search method, provided that it has no search omissions and can obtain the appearance frequency in each document. For example, an n-gram index method such as that disclosed in Japanese Patent Application Laid-Open No. 64-35627 (hereinafter referred to as prior art 3) can be used.

【０１４５】この従来技術３によるn-gramインデクス方
式では、文書の登録時に、データベースへ登録する文書
のテキストデータからn-gramとそのn-gramのテキスト中
における出現位置を抽出し、全文検索用インデクス１９
０３として磁気ディスク装置１９０２に格納しておく。
検索時には指定された検索ターム中に出現するｎ−ｇｒ
ａｍを抽出し、これらに対応するインデクスを上記磁気
ディスク装置１９０２から読み込み、インデクス中のｎ
−ｇｒａｍの出現位置を比較し、検索タームから抽出し
たn-gramの位置関係とインデクス中のn-gramの位置関係
が等しいかどうかを判定することによって、指定された
検索タームが出現する文書を高速に検索する。In the n-gram index method according to prior art 3, when a document is registered, the n-grams and their appearance positions in the text are extracted from the text data of the document to be registered in the database and stored in the magnetic disk device 1902 as a full-text search index 1903. At search time, the n-grams appearing in the specified search term are extracted, the corresponding index entries are read from the magnetic disk device 1902, and the appearance positions of the n-grams in the index are compared. By determining whether the positional relationship of the n-grams extracted from the search term equals the positional relationship of the n-grams in the index, documents in which the specified search term appears are found at high speed.

【０１４６】この方式を用いて、特徴文字列を検索ター
ムとして全文検索システム１９０１へ入力し、該検索タ
ームの出現文書とその位置情報を取得することにより、
該特徴文字列の各文書における出現頻度を求めることが
可能となる。Using this method, a characteristic character string is input to the full-text search system 1901 as a search term, and the documents in which the search term appears and its position information are obtained, which makes it possible to determine the appearance frequency of the characteristic character string in each document.

【０１４７】以下、この従来技術３を用いた出現頻度の
算出方法を図２１を用いて具体的に説明する。なお本図
では、n-gramのnの値を２としている。Hereinafter, a method of calculating the appearance frequency using the prior art 3 will be specifically described with reference to FIG. In the figure, the value of n of the n-gram is 2.

【０１４８】まず、文書の登録時にデータベースに登録
するテキスト２１０１がインデクス作成部２１０２に読
み込まれ、n-gramインデクス２１００が作成される。こ
のn-gramインデクス２１００には、テキスト２１０１に
出現する全ての2-gramとテキスト２１０１におけるその
2-gramの出現位置が格納される。First, at the time of document registration, the text 2101 to be registered in the database is read by the index creation unit 2102, and the n-gram index 2100 is created. The n-gram index 2100 stores every 2-gram that appears in the text 2101 together with the appearance positions of that 2-gram in the text 2101.

【０１４９】本図に示すテキスト２１０１では、「心
電」という2-gramはテキスト２１０１（文書番号
「１」）の５文字目、１５文字目、・・・に現われるの
で、n-gramインデクス２１００には2-gram「心電」とこ
れに対応したかたちで出現位置｛（１，５）、（１，１
５）、・・・｝が格納される。In the text 2101 shown in this figure, the 2-gram "心電" appears at the 5th character, the 15th character, and so on of the text 2101 (document number "1"), so the n-gram index 2100 stores the 2-gram "心電" together with the corresponding appearance positions {(1,5), (1,15), ...}.
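The index layout described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of prior art 3: the function name `build_ngram_index`, the dictionary layout, and the sample text are assumptions, with the sample text chosen so that "心電" falls at the 5th and 15th characters as in the example.

```python
from collections import defaultdict


def build_ngram_index(documents, n=2):
    """Map each n-gram to a list of (document number, 1-based position)."""
    index = defaultdict(list)
    for doc_no, text in documents.items():
        for start in range(len(text) - n + 1):
            # Positions are 1-based to match the description of FIG. 21.
            index[text[start:start + n]].append((doc_no, start + 1))
    return dict(index)


# Hypothetical document 1, built so that "心電" occurs at characters 5 and 15.
documents = {1: "abcd心電efghijkl心電図"}
index = build_ngram_index(documents)
print(index["心電"])  # [(1, 5), (1, 15)]
print(index["電図"])  # [(1, 16)]
```

Storing the document number with every position keeps the later adjacency test a simple comparison of pairs.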

【０１５０】検索時には、まず、検索タームがn-gram抽
出部２１０３に入力され、検索ターム中に出現する全て
のn-gramとそのn-gramの検索タームにおける出現位置が
抽出される。次に、抽出されたn-gramとこれに対応する
n-gramの検索タームにおける出現位置がインデクス検索
部２１０４に入力される。At search time, the search term is first input to the n-gram extraction unit 2103, which extracts every n-gram appearing in the search term together with the appearance position of each n-gram within the search term. Next, the extracted n-grams and their corresponding appearance positions in the search term are input to the index search unit 2104.

【０１５１】インデクス検索部２１０４では、検索ター
ムから抽出されたn-gramに対応するインデクスがn-gram
インデクス２１００から読み込まれ、これらのインデク
スの中から文書番号が一致し、かつ検索ターム中の位置
関係と同じ位置関係を持つものが抽出され、検索結果と
して出力される。In the index search unit 2104, the index entries corresponding to the n-grams extracted from the search term are read from the n-gram index 2100; from these entries, those whose document numbers match and whose positional relationship is the same as that in the search term are extracted and output as the search result.

【０１５２】検索タームとして「心電図」が入力された
本図の場合、まず、n-gram抽出部２１０３において、
（n-gram「心電」、n-gram位置「１」）と（n-gram「電
図」、n-gram位置「２」）が抽出される。ここで、 n-g
ram位置「１」は検索タームの先頭、 n-gram位置「２」
はその次の文字位置を示す。In the case of this figure, in which "心電図" (electrocardiogram) is input as the search term, the n-gram extraction unit 2103 first extracts (n-gram "心電", n-gram position "1") and (n-gram "電図", n-gram position "2"). Here, n-gram position "1" indicates the head of the search term, and n-gram position "2" indicates the next character position.
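The extraction of n-grams and their positions from a search term can be illustrated with a short Python sketch. The function name `extract_ngrams` is hypothetical, and positions are 1-based only to match the n-gram positions "1" and "2" in the description.

```python
def extract_ngrams(term, n=2):
    """Return (n-gram, 1-based position) pairs for every n-gram in the term."""
    return [(term[i:i + n], i + 1) for i in range(len(term) - n + 1)]


print(extract_ngrams("心電図"))  # [('心電', 1), ('電図', 2)]
```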

【０１５３】次に、インデクス検索部２１０４におい
て、n-gramインデクス２１００からn-gram「心電」と
「電図」に対応するインデクスが読み込まれる。これら
のインデクスにおける出現位置がn-gram位置「１」とn-
gram位置「２」のように連続するものが、すなわち隣接
するものが抽出され検索結果として出力される。Next, in the index search unit 2104, the index entries corresponding to the n-grams "心電" and "電図" are read from the n-gram index 2100. Those entries whose appearance positions are consecutive, that is, adjacent, in the same way as n-gram position "1" and n-gram position "2", are extracted and output as the search result.

【０１５４】本図では、 n-gram「心電」の出現位置
「１５」とn-gram「電図」の出現位置「１６」が隣接す
るため、 n-gram「心電図」が文字列として存在するこ
とが分かり、文書１中に検索ターム「心電図」が出現す
ることが示される。しかし、 n-gram「心電」の出現位
置「５」とn-gram「電図」の出現位置「１６」は隣接し
ていないため、この位置には検索ターム「心電図」が出
現しないことが分かる。In this figure, the appearance position "15" of the n-gram "心電" and the appearance position "16" of the n-gram "電図" are adjacent, so "心電図" is found to exist as a character string, which shows that the search term "心電図" appears in document 1. In contrast, the appearance position "5" of the n-gram "心電" and the appearance position "16" of the n-gram "電図" are not adjacent, so the search term "心電図" does not appear at that position.
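The adjacency test in this example can be sketched as follows. This is an illustrative reading of the procedure, not the code of prior art 3: the function name `search_term` is hypothetical, the index is a plain dictionary holding the entries quoted in the example, and returning no hits for terms shorter than n is an assumption.

```python
def search_term(index, term, n=2):
    """Return (document number, start position) for each occurrence of term."""
    grams = [term[i:i + n] for i in range(len(term) - n + 1)]
    if not grams:
        return []  # Assumption: terms shorter than n are not searchable here.
    postings = [set(index.get(gram, [])) for gram in grams]
    hits = []
    for doc_no, pos in sorted(postings[0]):
        # The k-th n-gram of the term must occur k characters further on
        # in the same document.
        if all((doc_no, pos + k) in postings[k] for k in range(1, len(grams))):
            hits.append((doc_no, pos))
    return hits


# Index entries taken from the example in the text.
index = {"心電": [(1, 5), (1, 15)], "電図": [(1, 16)]}
print(search_term(index, "心電図"))  # [(1, 15)]
```

Position 5 is rejected because no "電図" entry exists at position 6 of document 1, which mirrors the reasoning in the paragraph above.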

【０１５５】本方法において、検索タームとして特徴文
字列入力した場合、上記インデクス検索部２１０４から
検索結果として出力される出現位置を計数することによ
り、該当特徴文字列の出現頻度を得ることが可能とな
る。In this method, when a characteristic character string is input as the search term, the appearance frequency of that characteristic character string can be obtained by counting the appearance positions output as the search result from the index search unit 2104.
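Counting the output positions per document might look like the following sketch. The function name `appearance_frequency` and the sample hits are assumptions for illustration; the search result is taken to be a list of (document number, position) pairs.

```python
from collections import Counter


def appearance_frequency(hits):
    """Count matched positions per document number."""
    return Counter(doc_no for doc_no, _position in hits)


# Hypothetical search result: (document number, position) pairs.
hits = [(1, 15), (1, 42), (3, 7)]
print(appearance_frequency(hits))  # Counter({1: 2, 3: 1})
```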

【０１５６】以上説明したように、本実施例によれば、
出現頻度ファイルの特徴文字列検索用インデクスと出現
頻度ファイルの代わりに、全文検索インデクスを利用で
きるため、大規模なデータベースに対しても余分なファ
イルを増やさずに、高速に類似文書検索を実現すること
が可能となる。As described above, according to this embodiment, the full-text search index can be used in place of the appearance frequency file and its index for characteristic character string search, so that similar document search can be performed at high speed even on a large-scale database without adding extra files.

【０１５７】次に、本発明の第三の実施例について図２
２を用いて説明する。Next, a third embodiment of the present invention will be described with reference to FIG. 22.

【０１５８】本発明を適用した類似文書検索システムの
第三例は、種文書から抽出した特徴文字列の重要度を算
出し、この重要度が所定値を満たす特徴文字列に限定し
て、データベース内の各文書における出現頻度を取得
し、これに基づいて類似度を算出するものである。The third example of the similar document search system to which the present invention is applied calculates the importance of each characteristic character string extracted from the seed document, acquires the appearance frequency in each document in the database only for those characteristic character strings whose importance satisfies a predetermined value, and calculates the similarity on that basis.

【０１５９】すなわち、本方法は、第一の実施例におけ
る出現頻度ファイル読込みステップ１４０４で出現頻度
の取得対象とする特徴文字列数を削減することによっ
て、類似度算出に用いる特徴文字列数を削減し、文字数
の多い種文書に対しても高速な類似文書検索を実現でき
るようにするものである。That is, this method reduces the number of characteristic character strings whose appearance frequency is acquired in the appearance frequency file reading step 1404 of the first embodiment, thereby reducing the number of characteristic character strings used for similarity calculation, so that a high-speed similar document search can be realized even for a seed document with a large number of characters.

【０１６０】本実施例は、第一の実施例（図１）とほぼ
同様の構成を取るが、類似文書検索プログラム１２０が
異なり、図２２に示すように、特徴文字列選択プログラ
ム２２００を有する。This embodiment has substantially the same configuration as the first embodiment (FIG. 1), except that the similar document search program 120 differs and, as shown in FIG. 22, includes a characteristic character string selection program 2200.

【０１６１】以下、本実施例における処理手順のうち、
第一の実施例とは異なる類似文書検索プログラム１２０
ｂの処理手順について図２３のＰＡＤ図を用いて説明す
る。Hereinafter, of the processing procedures in this embodiment, the processing procedure of the similar document search program 120b, which differs from that of the first embodiment, will be described with reference to the PAD diagram of FIG. 23.

【０１６２】ここで、第一の実施例における類似文書検
索プログラム１２０（図１４）の処理手順と異なる点
は、特徴文字列選択ステップ２３００だけである。他の
処理ステップの処理手順は、第一の実施例で説明した通
りである。Here, the only difference from the processing procedure of the similar document search program 120 (FIG. 14) in the first embodiment is a characteristic character string selection step 2300. The processing procedure of the other processing steps is as described in the first embodiment.

【０１６３】特徴文字列選択ステップ２３００では、特
徴文字列選択プログラム２２００を起動し、特徴文字列
抽出ステップ１４０２（特徴文字列抽出プログラム１１
６）で抽出した特徴文字列の重要度を算出し、所定の値
を満たす文字列を類似検索用の特徴文字列として選択す
る。In the characteristic character string selection step 2300, the characteristic character string selection program 2200 is started, the importance of each characteristic character string extracted in the characteristic character string extraction step 1402 (characteristic character string extraction program 116) is calculated, and the character strings whose importance satisfies a predetermined value are selected as the characteristic character strings for similarity search.

【０１６４】以下、特徴文字列選択ステップ２３００で
起動される特徴文字列選択プログラム２２００の処理手
順を図２４のＰＡＤ図を用いて説明する。Hereinafter, the processing procedure of the characteristic character string selection program 2200 started in the characteristic character string selection step 2300 will be described with reference to the PAD diagram of FIG.

【０１６５】特徴文字列選択プログラム２２００は、ま
ず、ステップ２４００において特徴文字列抽出ステップ
１４０２で抽出された特徴文字列を取得し、ワークエ
リア１３０に格納する。The characteristic character string selection program 2200 first obtains, in step 2400, the characteristic character strings extracted in the characteristic character string extraction step 1402 and stores them in the work area 130.

【０１６６】次に、ステップ２４０１で各特徴文字列が
出現する文書数を出現頻度ファイル１０４から取得す
る。Next, in step 2401, the number of documents in which each characteristic character string appears is obtained from the appearance frequency file 104.

【０１６７】そして、ステップ２４０２において、所定
の重要度算出式を用いて該特徴文字列の重要度を算出す
る。Then, in step 2402, the importance of the characteristic character string is calculated using a predetermined importance calculation formula.

【０１６８】この結果、該重要度が所定値を満たす特徴
文字列に限定し、これを類似度算出用の特徴文字列とし
て抽出する（ステップ２４０３）。この重要度には、従
来技術２の共通性ウェイトを用いてもよい。本実施例で
は、重要度の算出に以下に示す式（２）を用いる。Based on the result, only the characteristic character strings whose importance satisfies a predetermined value are extracted as the characteristic character strings for similarity calculation (step 2403). The commonality weight of prior art 2 may be used as this importance. In this embodiment, equation (2) shown below is used to calculate the importance.

【０１６９】[0169]

【数２】 (Equation 2)

【０１７０】ここで、nはデータベース中の文書数、Num
Docは特徴文字列のデータベースにおける出現文書数を
示す。この値は、特徴文字列がデータベース中の全ての
文書に出現する場合に最も小さく、特定の文書に偏って
出現する場合に大きくなる。Here, n is the number of documents in the database, and NumDoc is the number of documents in the database in which the characteristic character string appears. This value is smallest when the characteristic character string appears in every document in the database, and becomes larger when its occurrences are concentrated in particular documents.

【０１７１】また、特徴文字列を抽出する際に基準とす
る閾値としては、上限とする重要度と下限とする重要度
を予め定めておいてもよいし、重要度の上位k個（kは1
以上の予め定められた整数）を採るものとしてもよい。As the threshold used as the criterion for extracting characteristic character strings, an upper-limit importance and a lower-limit importance may be determined in advance, or the k character strings with the highest importance may be taken (k being a predetermined integer of 1 or more).
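The two selection criteria, a fixed importance range and the top k by importance, can be sketched in Python as below. The function names and the importance values are hypothetical (the values are not taken from FIG. 25), and treating both bounds as inclusive is an assumption.

```python
def select_by_bounds(importance, lower, upper):
    """Keep strings whose importance lies within [lower, upper]."""
    return [s for s, value in importance.items() if lower <= value <= upper]


def select_top_k(importance, k):
    """Keep the k strings with the highest importance (k >= 1)."""
    if k < 1:
        raise ValueError("k must be an integer of 1 or more")
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


# Hypothetical importance values, not taken from FIG. 25.
importance = {"software": 3.2, "development": 1.4, "seizure": 2.6, "work": 2.1}
print(select_by_bounds(importance, 2.0, 3.0))  # ['seizure', 'work']
print(select_top_k(importance, 2))             # ['software', 'seizure']
```

A fixed range gives a variable number of strings per seed document, whereas the top-k rule bounds the cost of the similarity calculation directly.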

【０１７２】以下、図２５に示す具体例で特徴文字列選
択ステップ２２００の処理手順を説明する。なお本図で
は、図１５の例で抽出した特徴文字列１５０１を対象と
し、重要度が3.0以上である特徴文字列を選択するもの
とする。The processing procedure of the characteristic character string selection step 2200 will be described below with reference to a specific example shown in FIG. In this figure, it is assumed that the characteristic character string 1501 extracted in the example of FIG. 15 is targeted and a characteristic character string having an importance of 3.0 or more is selected.

【０１７３】まず、ステップ１４０４（図２３）でワー
クエリア１３０に読み込んだ出現頻度ファイル１０４か
ら各特徴文字列の出現文書数を取得する。この例では、
文書２の特徴文字列１５０１の各出現文書数２５００と
して、［ソフトウェア，２］、［開発，３］、［発作，
２］、［作業，２］が得られる。ここで、［ソフトウェ
ア，２］は、特徴文字列「ソフトウェア」がデータベー
ス中の「２」つの文書に出現することを表わす。First, the number of documents in which each characteristic character string appears is obtained from the appearance frequency file 104 that was read into the work area 130 in step 1404 (FIG. 23). In this example, [software, 2], [development, 3], [seizure, 2], and [work, 2] are obtained as the appearing-document counts 2500 for the characteristic character strings 1501 of document 2. Here, [software, 2] indicates that the characteristic character string "software" appears in 2 documents in the database.

【０１７４】次に、各特徴文字列の出現文書数２５００
から重要度２５０１を算出し、重要度が3.0以上の特徴
文字列を抽出する。この結果、「ソフトウェア」という
１個の特徴文字列２５０２が類似度算出用の特徴文字列
として選択されることになる。Next, the importance 2501 is calculated from the appearing-document count 2500 of each characteristic character string, and the characteristic character strings with an importance of 3.0 or more are extracted. As a result, the single characteristic character string 2502, "software", is selected as the characteristic character string for similarity calculation.

【０１７５】このように、特徴文字列の個数を４個から
１個に削減することができるため、類似度算出に要する
時間を大幅に削減することができる。As described above, since the number of characteristic character strings can be reduced from four to one, the time required for similarity calculation can be greatly reduced.

【０１７６】なお、本実施例では、出現頻度ファイル１
０４を参照して、各特徴文字列の出現文書数を取得する
構成としたが、文書登録時に各文書中の特徴文字列を計
数し、各特徴文字列の出現文書数を求め、これを出現文
書数ファイルとして記憶しておくことにより、さらに高
速に特徴文字列を選択することも可能である。In this embodiment, the number of documents in which each characteristic character string appears is obtained by referring to the appearance frequency file 104. However, it is also possible to select characteristic character strings even faster by counting the characteristic character strings in each document at document registration time, determining the number of documents in which each characteristic character string appears, and storing this as an appearing-document-count file.
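Precomputing the number of appearing documents at registration time could be sketched as follows. The function name `count_appearing_documents` and the registered documents are hypothetical; the sample data is arranged only so that the counts match the [software, 2], [development, 3], [seizure, 2], [work, 2] example, and writing the result to a file is left out.

```python
from collections import Counter


def count_appearing_documents(strings_per_document):
    """Count, for each characteristic string, the documents containing it."""
    counts = Counter()
    for strings in strings_per_document.values():
        # A set ensures each document is counted once per string.
        counts.update(set(strings))
    return counts


# Hypothetical registered documents and their characteristic strings.
registered = {
    1: ["software", "development", "software"],
    2: ["software", "development", "work"],
    3: ["development", "seizure", "work"],
    4: ["seizure"],
}
doc_counts = count_appearing_documents(registered)
print(doc_counts["software"], doc_counts["development"])  # 2 3
```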

【０１７７】また、本実施例では、出現頻度ファイル１
０４を参照して、各特徴文字列の出現文書数を取得し重
要度を算出する構成としたが、文書登録時に各文書にお
ける特徴文字列の重要度を算出し、これを重要度ファイ
ルとして記憶しておくことにより、さらに高速に特徴文
字列を選択することが可能となる。Also, in this embodiment, the number of documents in which each characteristic character string appears is obtained by referring to the appearance frequency file 104 and the importance is then calculated. However, by calculating the importance of the characteristic character strings in each document at document registration time and storing it as an importance file, characteristic character strings can be selected even faster.

【０１７８】さらに、本実施例では、重要度の算出に特
徴文字列のデータベース中の出現文書数を用いたが、例
えば、特徴文字列の文字種類や文字列長、種文書内の出
現頻度あるいは出現位置等の情報のいずれか一つ、ある
いは、それらを組み合わせることにより算出することも
可能である。Furthermore, although this embodiment uses the number of documents in the database in which a characteristic character string appears to calculate the importance, the importance can also be calculated from any one of, or a combination of, items of information such as the character type and character string length of the characteristic character string, and its appearance frequency or appearance position in the seed document.

【０１７９】以上説明したように、本発明によれば、分
かち書きのない日本語のような文書に対して、類似文書
検索を行なった場合においても、種文書から文字列を機
械的に抽出することにより、どんな単語についても漏れ
のない検索を行なうことが可能となる。また、文字種に
応じて特徴文字列を抽出することにより、意味のまとま
った文字列を用いて検索を行なうことができるため、高
精度な類似文書検索を実現することができるようにな
る。さらに、抽出する文字列の種類が大幅に削減される
ため、高速に類似文書を検索することができるようにな
る。As described above, according to the present invention, even when a similar document search is performed on documents in a language such as Japanese that is written without word delimiters, character strings are extracted mechanically from the seed document, which makes it possible to search for any word without omission. Further, by extracting characteristic character strings according to character type, the search can be performed using character strings that form meaningful units, so that a highly accurate similar document search can be realized. Furthermore, since the number of distinct character strings to be extracted is greatly reduced, similar documents can be searched at high speed.

【０１８０】さらに、全文検索システムと組み合わせて
用いることにより、大規模な文書データベースに対して
も、高速な類似文書検索が実現可能となる。Further, when used in combination with a full-text search system, a high-speed similar document search can be realized even for a large-scale document database.

【０１８１】[0181]

【発明の効果】本発明によれば、単語辞書を用いずに類
似文書検索を行なった場合でも、意味のまとまった文字
列を用いて検索を行なうことができるため、高精度な類
似文書検索を実現することができる。また、抽出する文
字列の文字種に応じて最適な長さの部分文字列（n-gra
m）を抽出するため、高速に類似文書を検索することが
できるようになる。According to the present invention, even when a similar document search is performed without using a word dictionary, the search can be performed using character strings that form meaningful units, so that a highly accurate similar document search can be realized. In addition, since partial character strings (n-grams) of the optimal length are extracted according to the character type of the character string to be extracted, similar documents can be searched at high speed.

[Brief description of the drawings]

【図１】本発明による類似文書検索システムの第一の実
施例の全体構成を示す図である。FIG. 1 is a diagram showing an entire configuration of a first embodiment of a similar document search system according to the present invention.

【図２】出現頻度ファイルの構成例を示す図である。FIG. 2 is a diagram illustrating a configuration example of an appearance frequency file.

【図３】特徴文字列抽出処理の流れを示すＰＡＤ図であ
る。FIG. 3 is a PAD diagram showing a flow of a characteristic character string extraction process.

【図４】本発明の第一の実施例におけるシステム制御プ
ログラムの処理手順を示すＰＡＤ図である。FIG. 4 is a PAD showing a processing procedure of a system control program according to the first embodiment of the present invention.

【図５】本発明の第一の実施例における文書登録制御プ
ログラムの処理手順を示すＰＡＤ図である。FIG. 5 is a PAD showing a processing procedure of a document registration control program according to the first embodiment of the present invention.

【図６】本発明の第一の実施例における出現頻度ファイ
ル作成プログラムの処理手順を示すＰＡＤ図である。FIG. 6 is a PAD diagram showing a processing procedure of an appearance frequency file creation program in the first embodiment of the present invention.

【図７】本発明の第一の実施例における特徴文字列抽出
プログラムの処理手順を示すＰＡＤ図である。FIG. 7 is a PAD showing a processing procedure of a characteristic character string extraction program in the first embodiment of the present invention.

【図８】本発明の第一の実施例における漢字文字列対応
特徴文字列抽出プログラムの処理手順を示すＰＡＤ図で
ある。FIG. 8 is a PAD showing a processing procedure of a kanji character string corresponding characteristic character string extraction program in the first embodiment of the present invention.

【図９】本発明の第一の実施例におけるカタカナ文字列
対応特徴文字列抽出プログラムの処理手順を示すＰＡＤ
図である。FIG. 9 is a PAD diagram showing the processing procedure of the katakana character string corresponding characteristic character string extraction program in the first embodiment of the present invention.

【図１０】本発明の第一の実施例における同一文字種文
字列抽出プログラムの処理例を示す図である。FIG. 10 is a diagram showing a processing example of a program for extracting the same character type character string in the first embodiment of the present invention.

【図１１】本発明の第一の実施例における漢字文字列対
応特徴文字列抽出プログラムの処理例を示す図である。FIG. 11 is a diagram illustrating a processing example of a kanji character string corresponding feature character string extraction program according to the first embodiment of the present invention.

【図１２】本発明の第一の実施例におけるカタカナ文字
列対応特徴文字列抽出プログラムの処理例を示す図であ
る。FIG. 12 is a diagram illustrating a processing example of a katakana character string corresponding feature character string extraction program in the first embodiment of the present invention.

【図１３】本発明の第一の実施例における検索制御プロ
グラムの処理手順を示すＰＡＤ図である。FIG. 13 is a PAD showing a processing procedure of a search control program according to the first embodiment of the present invention.

【図１４】本発明の第一の実施例における類似文書検索
プログラムの処理手順を示すＰＡＤ図である。FIG. 14 is a PAD diagram showing a processing procedure of a similar document search program in the first embodiment of the present invention.

【図１５】本発明の第一の実施例における特徴文字列抽
出プログラムの処理例を示す図である。FIG. 15 is a diagram illustrating a processing example of a characteristic character string extraction program in the first embodiment of the present invention.

【図１６】本発明の第一の実施例における出現頻度計数
プログラムの処理例を示す図である。FIG. 16 is a diagram showing a processing example of an appearance frequency counting program in the first embodiment of the present invention.

【図１７】本発明の第一の実施例における出現頻度取得
ファイル読込みプログラムの処理例を示す図である。FIG. 17 is a diagram illustrating a processing example of an appearance frequency acquisition file reading program according to the first embodiment of this invention.

【図１８】本発明の第一の実施例における類似度算出プ
ログラムの処理例を示す図である。FIG. 18 is a diagram illustrating a processing example of a similarity calculation program in the first embodiment of the present invention.

【図１９】本発明の第二の実施例における検索処理系の
プログラム構成を示す図である。FIG. 19 is a diagram showing a program configuration of a search processing system according to the second embodiment of the present invention.

【図２０】本発明の第二の実施例における類似文書検索
プログラムの処理手順を示すＰＡＤ図である。FIG. 20 is a PAD diagram showing a processing procedure of a similar document search program according to the second embodiment of the present invention.

【図２１】本発明の第二の実施例におけるn-gramインデ
クスの例を示す図である。FIG. 21 is a diagram illustrating an example of an n-gram index according to the second embodiment of the present invention.

【図２２】本発明の第三の実施例における検索処理系の
プログラム構成を示す図である。FIG. 22 is a diagram showing a program configuration of a search processing system according to a third embodiment of the present invention.

【図２３】本発明の第三の実施例における類似文書検索
プログラムの処理手順を示すＰＡＤ図である。FIG. 23 is a PAD showing a processing procedure of a similar document search program according to the third embodiment of the present invention.

【図２４】本発明の第三の実施例における特徴文字列選
択プログラムの処理手順を示すＰＡＤ図である。FIG. 24 is a PAD showing a processing procedure of a characteristic character string selection program according to the third embodiment of the present invention.

【図２５】本発明の第三の実施例における特徴文字列の
選択の例を示す図である。FIG. 25 is a diagram showing an example of selecting a characteristic character string in the third embodiment of the present invention.

[Explanation of symbols]

１００…ディスプレイ、１０１…キーボード、１０２…中央演算処理装置（ＣＰＵ）、１０３…テキスト、１０４…出現頻度ファイル、１０５…磁気ディスク装置、１０６…フロッピディスクドライブ（ＦＤＤ）、１０７…フロッピディスク、１０８…バス、１０９…主メモリ、１１０…システム制御プログラム、１１１…文書登録制御プログラム、１１２…共有ライブラリ、１１３…テキスト登録プログラム、１１４…出現頻度ファイル作成登録プログラム、１１５…同一文字種文字列抽出プログラム、１１６…登録用特徴文字列抽出プログラム、１１７…出現頻度ファイル作成プログラム、１１８…検索制御プログラム、１１９…検索条件式解析プログラム、１２０…類似文書検索プログラム、１２１…種文書読込みプログラム、１２３…出現頻度計数プログラム、１２４…出現頻度読込みプログラム、１２５…類似度算出プログラム、１２６…類似度ソートプログラム、１２７…漢字文字列対応特徴文字列抽出プログラム、１２８…カタカナ文字列対応特徴文字列抽出プログラ
ム、１３０…ワークエリア Reference Signs List: 100: display, 101: keyboard, 102: central processing unit (CPU), 103: text, 104: appearance frequency file, 105: magnetic disk device, 106: floppy disk drive (FDD), 107: floppy disk, 108: bus, 109: main memory, 110: system control program, 111: document registration control program, 112: shared library, 113: text registration program, 114: appearance frequency file creation and registration program, 115: same-character-type character string extraction program, 116: characteristic character string extraction program for registration, 117: appearance frequency file creation program, 118: search control program, 119: search condition expression analysis program, 120: similar document search program, 121: seed document reading program, 123: appearance frequency counting program, 124: appearance frequency reading program, 125: similarity calculation program, 126: similarity sorting program, 127: kanji character string corresponding characteristic character string extraction program, 128: katakana character string corresponding characteristic character string extraction program, 130: work area

フロントページの続き (72)発明者菅谷奈津子神奈川県川崎市幸区鹿島田890番地株式会社日立製作所情報・通信開発本部内 (72)発明者川下靖司神奈川県横浜市戸塚区戸塚町3090番地株式会社日立製作所ソフトウェア開発本部内 Continuation of the front page: (72) Inventor Natsuko Sugaya, 890 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa, c/o Information & Communication Development Division, Hitachi, Ltd.; (72) Inventor Yasushi Kawashita, 3090 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa, c/o Software Development Division, Hitachi, Ltd.

Claims

[Claims]

1. A similar document search method for searching a text database, which stores character information as code data, for documents similar to a text designated by a user, the method comprising: a character string extraction step of extracting character strings from the designated text, delimited at boundaries between predetermined character types; a search partial character string extraction step of extracting, from character strings of one or more predetermined types, one or more partial character strings according to the type of each character string; and a similarity calculation step of calculating, using a predetermined similarity calculation formula, the similarity of each text in the text database to the designated text.

2. The similar document search method according to claim 1, wherein the character string extraction step has a same-character-type character string extraction step of extracting, as the character strings to be extracted from the designated text, character strings each consisting of a single character type, delimited at the boundaries between all character types.

3. The similar document search method according to claim 2, wherein the search partial character string extraction step has a character-type-specific search partial character string extraction step of extracting, as search partial character strings, partial character strings of a character string length predetermined according to each character type.

4. The similar document search method according to claim 1, wherein the search partial character string extraction step comprises any one of: a step of extracting character strings of a predetermined length as search partial character strings; a step of extracting the character string itself extracted in the character string extraction step as a search partial character string; a step of calculating the appearance frequency ratio, in the designated text, of the character string extracted in the character string extraction step and of its partial character strings, and extracting the partial character strings that satisfy a predetermined value as search partial character strings; a step of extracting, from the character strings extracted in the character string extraction step, as search partial character strings, the character strings not included in an exclusion character string dictionary created in advance, in which character strings not to be extracted as search partial character strings are described as unnecessary words; and a step of not extracting partial character strings as search partial character strings from the character string extracted in the character string extraction step; or a step of extracting search partial character strings by a combination of these.

5. The similar document search method according to claim 1, 2 or 3, wherein the search partial character string extraction step has any one of: a predetermined-length character string extraction step of extracting character strings of a predetermined length as search partial character strings; a longest character string extraction step of extracting the character string itself extracted in the character string extraction step as a search partial character string; a high appearance frequency ratio character string extraction step of calculating the appearance frequency ratio, in the designated text, of the character string extracted in the character string extraction step and of its partial character strings, and extracting the partial character strings that satisfy a predetermined value as search partial character strings; a step of deleting, from the partial character strings extracted in at least one of the predetermined-length character string extraction step, the longest character string extraction step, and the high appearance frequency ratio character string extraction step, the character strings included in an exclusion character string dictionary in which character strings not to be extracted as search partial character strings are described as unnecessary words; and a step of not extracting partial character strings as search partial character strings from the character string extracted in the character string extraction step; or a partial character string extraction step of extracting search partial character strings by a combination of these.

6. The similar document search method according to claim 1, further comprising a search partial character string selection step of calculating the importance of each search partial character string extracted in the search partial character string extraction step using a predetermined calculation formula, and extracting the search partial character strings whose importance satisfies a predetermined value.

7. The similar document search method according to claim 5, wherein the search partial character string selection step has an importance calculation step of calculating the importance of a search partial character string from any one of, or a combination of, items of information such as the character type and character string length of the search partial character string extracted in the search partial character string extraction step, the number of documents in the text database in which it appears, and its appearance frequency and appearance position in the designated text.

8. The similar document search method according to claim 6, having, at the time of registration, an appearing-document-count file creation step of saving the number of documents in the text database in which each search partial character string appears as an appearing-document-count file, wherein the importance calculation step at the time of search includes an appearing-document-count file reading step of reading the number of documents in which the search partial character string appears from the appearing-document-count file.

9. The similar document search method according to claim 6, having, at the time of registration, an importance file creation step of calculating the importance of each search partial character string using a predetermined calculation formula and storing it as an importance file, wherein the importance calculation step at the time of search includes an importance file reading step of reading the importance of the search partial character string from the importance file.

10. The similar document search method according to claim 1, having, at the time of registration, an appearance frequency file creation step of saving the appearance frequency of each search partial character string in each text in the text database as an appearance frequency file, wherein the similarity calculation step at the time of search includes an appearance frequency file reading step of reading appearance frequency information from the appearance frequency file.