JPH11238068A - Text retrieval device

Info

Publication number: JPH11238068A
Application number: JP10038743A
Authority: JP
Inventors: Takehiro Koyama; 剛弘小山
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-02-20
Filing date: 1998-02-20
Publication date: 1999-08-31
Anticipated expiration: 2018-02-20
Also published as: JP3924899B2

Abstract

PROBLEM TO BE SOLVED: To prevent retrieval omissions when executing a fuzzy retrieval. SOLUTION: A retrieval character string developing part 10 generates substitution character strings in which part of the retrieval character string is replaced by an erroneous recognition permission character '#'. The number of erroneous recognition permission characters is determined in accordance with a matching degree designated through an input part 4. A matching part 12 can detect a candidate character string containing an error character by regarding '#' as one arbitrary character. In addition, '#' may be regarded as two arbitrary characters, or '##' as one arbitrary character. Candidate character strings affected by erroneous division or erroneous connection in the character recognition that generated the text being retrieved can thus also be detected.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a text search device for text whose character strings may be inaccurate, and in particular to reducing search omissions and to weighting search results.

[0002]

[Prior Art] Text search devices that take a designated search string and look for it in a document or character string have long existed. The string search function built into word processors is one example.

[0003] The documents and character strings to be searched are basically assumed to be free of errors. In such a search, a string of interest is judged to exist in the search target text only when a string in that text matches the search string exactly.

[0004] In contrast, when the search target text is text data read by an optical character reader (OCR), recognition errors during reading make it likely that the result is imperfect text containing inaccurate strings. Because Japanese OCR has low accuracy, this risk is particularly high. If such imperfect text is searched by exact match against the search string as described above, search omissions may occur. That is, a portion of the text that would have been hit had the text been read correctly may fail to hit because of a recognition error.

[0005] To prevent such search omissions, there are techniques that search while tolerating a certain degree of ambiguity in the search target (hereinafter, "fuzzy search"). The first prior-art fuzzy search, disclosed in Japanese Unexamined Patent Publication No. S62-44878, embeds candidate characters in the search target text when recognition yields multiple candidates, and searches that text (e.g. "文[字学]認[識織]による[本木]文...", where each bracket holds the alternative candidates for one character). The second prior-art fuzzy search, disclosed in Japanese Unexamined Patent Publication No. H8-7033, keeps the candidates in the index when multiple candidates are obtained for a recognized character. In this case, in an index that stores the recognition result character by character, multiple candidate characters may be stored for a single recognized character. These two techniques place the ambiguity in the recognition result, that is, in the search target text. In contrast, the third prior-art fuzzy search, disclosed in Japanese Unexamined Patent Publication Nos. H6-195387, H7-152774, and H8-63487, places the ambiguity on the search string side. This method creates imperfect search strings in which the parts of the search string that are prone to misrecognition are replaced with the character patterns they may be misrecognized as (misrecognition patterns), and searches with the imperfect search strings as well as with the correct one. Misrecognition patterns fall into types such as character error, erroneous division, and erroneous connection. For example, "字" is easily recognized as "学"; this type is a character error. "化" is easily recognized as "イヒ"; this type is erroneous division. Conversely, "５１" is easily recognized as "引"; this type is erroneous connection.

[0006] Fuzzy search has the merit of reducing search omissions, but conversely it may introduce search errors, in which a string that actually differs from the search string is judged to match it.

[0007]

[Problems to Be Solved by the Invention] The first prior-art fuzzy search described above has the problems that the size of the search target text produced by recognition increases, and that a string cannot be found unless it remains among the recognition results. It also cannot handle the erroneous-division and erroneous-connection types of misrecognition pattern.

[0008] The second prior-art fuzzy search likewise has the problems that the size of the index produced by recognition increases, and that a string cannot be found unless it remains among the recognition results.

[0009] The third prior-art fuzzy search requires misrecognition patterns to be prepared separately from the search string, and their volume grows. Erroneous connection, for example, arises depending on the combination of adjacent characters, so many patterns can exist. It is therefore difficult to prepare every possible misrecognition pattern in advance, and when a misrecognition pattern that was not prepared occurs, a search omission results. Growth in the size of the search target text, index, misrecognition patterns, and other data used for the search not only requires more storage but also lengthens search processing.

[0010] Moreover, registering multiple candidates in the index or enriching the misrecognition patterns in order to reduce search omissions may cause strings that are inherently unrelated to the search string to be hit as well. The search results then contain much "garbage" (search errors), which lowers their reliability.

[0011] The present invention was made to solve the above problems. Its object is to provide a text search device that reduces the data needed for fuzzy search while reducing search omissions and mitigating the effect of search errors.

[0012]

[Means for Solving the Problems] The text search device according to the first aspect of the invention comprises subsequence generation means for generating subsequences of a search string according to a designated matching degree between the search string and its subsequence, and candidate search means for searching the search target text for candidate strings containing a string pattern that matches a generated subsequence.

[0013] The "matching degree" is the degree to which a subsequence of the search string matches that search string, and can be defined, for example, as the ratio of the number of characters in the subsequence to the number in the search string. The designated matching degree may be a numerical range or a threshold. A "subsequence" is the search string with some of its characters masked, and its characters retain their positions in the original search string. For example, the subsequences "キー##ド" and "キ##ード" of the search string "キーワード" (# denotes a masked character) consist of the same set of characters but differ in their masking positions, so they are treated as different subsequences. As this example shows, the characters of a subsequence need not be contiguous; masked positions may lie between them. The candidate search means extracts from the search target text, as candidate strings, strings in which the same characters occur at the positions of the characters making up the subsequence. In other words, when candidate strings are extracted, it does not matter whether the characters at the masked positions of the search string match.
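
The position-preserving masking described in [0013] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `matches_subsequence` is hypothetical, "#" stands for a masked position, and only the one-for-one case (a mask covers exactly one text character) is handled. The candidate "キーパッド" is an invented test string.

```python
MASK = "#"  # stands for a masked character, as in the text


def matches_subsequence(subsequence: str, candidate: str) -> bool:
    """True if candidate has the same character at every unmasked position.

    Masked positions are ignored, so whether they agree does not matter.
    """
    if len(subsequence) != len(candidate):
        return False
    return all(s == MASK or s == c for s, c in zip(subsequence, candidate))


# Both masks are derived from the search string "キーワード".
# They keep the same characters but at different positions,
# so they accept different candidates.
first_mask = "キー##ド"
second_mask = "キ##ード"

accepts_first = matches_subsequence(first_mask, "キーパッド")
accepts_second = matches_subsequence(second_mask, "キーパッド")
```

Here the first mask accepts "キーパッド" while the second does not, which is why the two subsequences have to be treated as distinct.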

[0014] The text search device according to the second aspect of the invention comprises: an error string registration unit in which error string patterns that can arise in generating the search target text are registered; candidate search means for searching for candidate strings based on the search string; error string detection means for detecting, in the part of a candidate string that differs from the search string, a registered error string pattern held in the error string registration unit; and priority assignment means for assigning to the candidate string a priority according to its likelihood of matching the search string. The priority assignment means determines the priority of the candidate string according to the detection of the registered error string pattern in it.

[0015] Search target text generated by character recognition or the like may contain erroneous strings, but the error string patterns are not random: for a given correct character or string, certain patterns tend to occur. The error string registration unit mainly stores such frequently occurring error string patterns. The invention detects error patterns stored in the error string registration unit in the part of a candidate string that differs from the search string. Then, for example, when the detected error pattern is one registered for the string in the corresponding part of the search string, that part is judged likely to have been the correct string before the search target text was generated, and the priority reflecting the likelihood of a match can be set high.

[0016] In a preferred form of the invention, the device has subsequence generation means for generating subsequences of the search string according to a designated matching degree between the search string and its subsequence, and the candidate strings are strings containing a string pattern that matches a generated subsequence.

[0017] The text search device according to the third aspect of the invention is characterized in that, in the above inventions, the candidate search means performs the search with one or more of the following: means for regarding an ambiguous character, which forms the difference between the search string and the subsequence, as any one character in the search target text; means for regarding an ambiguous character as any two characters in the search target text; and means for regarding two consecutive ambiguous characters as any one character in the search target text.

[0018] These make it possible to search for candidate strings whose error pattern is a character error, an erroneous division, or an erroneous connection, respectively.

[0019] The text search device according to the fourth aspect of the invention is characterized in that the priority assignment means determines the priority according to the detection frequency of the registered error string pattern. According to this aspect, for example, the original string corresponding to a frequently detected registered error string pattern is judged to be prone to that error, and a high priority can be assigned.

[0020] In a preferred form of the invention, the error string registration unit stores, in addition to each registered error string pattern, its detection frequency.

[0021] The text search device according to the fifth aspect of the invention has candidate string display means for displaying the candidate strings according to their priority. According to this aspect, the user can gauge from the priority how likely each of several candidate strings is to match the search string, which is convenient, for example, when checking the results of search processing.

[0022]

[Embodiments of the Invention] Next, embodiments of the invention are described with reference to the drawings.

[0023] [Embodiment 1] FIG. 1 is a schematic block diagram of a text search device according to an embodiment of the invention. The device searches text obtained by OCR character recognition and comprises an index storage unit 2, an input unit 4, a search unit 6, a target character position information storage unit 8, a search string expansion unit 10, a matching unit 12, and an output unit 14.

[0024] The index storage unit 2 stores, prior to the search, the search target text obtained by OCR in the form of an index. The index uses each character appearing in the search target text as a key and associates with it the positions at which that character appears.

[0025] The input unit 4 accepts search conditions, such as the search string and the matching degree, from the user.

[0026] The search unit 6 obtains the search string from the input unit 4, looks up each of its characters in the index stored in the index storage unit 2, and outputs the appearance positions of each character of the search string to the target character position information storage unit 8, which stores them.

[0027] The search string expansion unit 10 is the subsequence generation means that generates subsequences of the search string. It obtains the search string and the matching degree from the input unit 4 and, according to the matching degree, expands the search string into replacement strings that contain its subsequences. A replacement string is generated by replacing some characters of the search string with, for example, the symbol "#", thereby masking the original characters. The part of a replacement string other than the part replaced with "#" is a subsequence made up of the original characters of the search string.

[0028] The matching unit 12 uses the replacement strings output by the search string expansion unit 10 to perform matching against the target character position information stored in the target character position information storage unit 8. The matching result is output to the output unit 14 and displayed as the search result on a display device such as a CRT.

[0029] Next, the operation of each unit is described using a concrete example. FIG. 2 is a schematic diagram of the search target text. The search target text in this example consists of three documents: A, B, and C. Document A contains the string "ペルシャ" ("Persia") starting at the 10th character from its beginning. Likewise, document B contains the string "ベルシャ" starting at the 5th character, and document C contains the string "ペノレシャ" starting at the 21st character.

[0030] FIG. 3 is a schematic diagram of the index generated for these search target texts and stored in the index storage unit 2. The index uses each kind of character appearing in the search target text (shown at the left edge of the figure) as a key and groups, per key, the positions in the documents at which that character appears. In the figure, an appearance position is expressed as a pair (Ndoc, Nchar) of the number Ndoc that identifies documents A to C (document A is "1", document B is "2", document C is "3") and the character count Nchar from the beginning of that document.
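
The index of [0024] and [0030] can be sketched as a mapping from each character to its (Ndoc, Nchar) pairs. This is a minimal illustration under stated assumptions: `build_index` is a hypothetical name, and since the full text of documents A to C is not given, the documents are padded with a placeholder character so that the three strings start at the 10th, 5th, and 21st characters as in FIG. 2.

```python
from collections import defaultdict


def build_index(documents):
    """Map each character to the (Ndoc, Nchar) positions where it appears.

    Both the document number and the character position count from 1.
    """
    index = defaultdict(list)
    for n_doc, text in enumerate(documents, start=1):
        for n_char, ch in enumerate(text, start=1):
            index[ch].append((n_doc, n_char))
    return index


FILLER = "□"  # placeholder for text the source does not give (assumption)
doc_a = FILLER * 9 + "ペルシャ"     # string starts at the 10th character
doc_b = FILLER * 4 + "ベルシャ"     # string starts at the 5th character
doc_c = FILLER * 20 + "ペノレシャ"  # string starts at the 21st character

index = build_index([doc_a, doc_b, doc_c])
pe_positions = index["ペ"]
shi_positions = index["シ"]
```

Looking up one character then costs a single dictionary access, which is what lets the search unit fetch only the positions of the characters in the search string.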

[0031] As search conditions, the user enters the search string "ペルシャ" and a matching degree of 70% into the input unit 4. Here the matching degree η is defined as the ratio of the number of characters in the subsequence to the number in the search string. That is, if the search string has M characters and m characters of the replacement string remain unreplaced, then η = m/M × 100 [%]. The matching degree entered into the input unit 4 is a threshold ηth for η, and the device searches the search target text for strings whose matching degree exceeds ηth. Because a low threshold ηth increases the "garbage" in the search results, a suitable value of ηth is generally about 70% or higher. On the other hand, an unnecessarily high ηth makes search omissions more likely. Taking this into account, ηth is set to 70% here.

[0032] The input unit 4 passes the search string "ペルシャ" to the search unit 6. On receiving it, the search unit 6 searches the index storage unit 2 using each of its characters "ペ", "ル", "シ", and "ャ" as a key and stores the results in the target character position information storage unit 8. Specifically, in this example the following are stored: appearance positions (1,10) and (3,21) for "ペ"; (1,11) and (2,6) for "ル"; (1,12), (2,7), and (3,24) for "シ"; and (1,13), (2,8), and (3,25) for "ャ". FIG. 4 is a schematic diagram of the target character position information stored in the target character position information storage unit 8.

[0033] The search string expansion unit 10 receives the search string and the matching-degree threshold ηth from the input unit 4 and generates replacement strings in which a number of characters of the search string, determined by the matching degree, are replaced with misrecognition-allowed characters. The misrecognition-allowed character (ambiguous character) is written here as the symbol "#". The character of the search string at a position where a misrecognition-allowed character is placed is masked. Masking means that, in the matching of the replacement string against the search target text described below, it does not matter whether the two agree at that position.

[0034] Here the search string has M = 4 characters, so a subsequence satisfying the threshold ηth = 70% has m = 3 or 4 characters (η = 75% or 100%). The search string expansion unit 10 therefore generates the replacement string that contains no misrecognition-allowed character (which equals the search string) and the replacement strings that contain exactly one. Specifically, in this example five replacement strings are generated: "ペルシャ", "#ルシャ", "ペ#シャ", "ペル#ャ", and "ペルシ#".
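
The expansion step of [0033] and [0034] can be sketched as below. This is an illustrative sketch, not the patent's implementation: `expand` is a hypothetical name, and the threshold is treated as inclusive (η ≥ ηth), which is an assumption; with M = 4 and ηth = 70% an exclusive comparison gives the same result.

```python
from itertools import combinations

MASK = "#"  # misrecognition-allowed character


def expand(search_string: str, threshold_pct: float) -> list[str]:
    """Generate replacement strings whose matching degree meets the threshold.

    Matching degree = (unmasked characters m) / (total characters M) * 100.
    """
    total = len(search_string)
    result = []
    for masked in range(total + 1):
        remaining = total - masked
        if remaining * 100 < threshold_pct * total:
            break  # masking even more characters only lowers the degree
        for mask_positions in combinations(range(total), masked):
            chars = list(search_string)
            for p in mask_positions:
                chars[p] = MASK
            result.append("".join(chars))
    return result


replacements = expand("ペルシャ", 70)
```

For a longer search string the same threshold admits more masks (for example, up to three of ten characters), so the number of replacement strings grows combinatorially with the number of masks allowed.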

[0035] The matching unit 12 is the candidate search means: it matches these replacement strings against the search target text and searches for candidate strings that may match the search string. Matching is performed by comparing relative character positions within the replacement string with the appearance positions stored in the target character position information storage unit 8. In the following, α and β denote ordinary characters appearing in a replacement string.

[0036] For example, the two-character part "αβ" of a replacement string is matched as follows. First, the matching unit 12 looks up α and β in the target character position information storage unit 8. Let the appearance positions for α and β be (Ndoc(α), Nchar(α)) and (Ndoc(β), Nchar(β)), respectively. The matching unit 12 detects the presence of the two consecutive characters "αβ" by finding appearance positions for which Ndoc(α) = Ndoc(β) for the document number and Nchar(β) = Nchar(α) + 1 for the character position.

[0037] Next, matching for string parts that contain the misrecognition-allowed character "#", such as "α#β", "α##β", "α###β", and "α###…β", is performed as follows. There are three basic matching rules for "#".

[0038] (i) "#" is regarded as identical to any one character; (ii) "#" is regarded as identical to any two characters; (iii) "##" is regarded as identical to any one character.

[0039] Rule (i) corresponds to character errors. Rules (ii) and (iii) correspond to erroneous division and erroneous connection, respectively.

[0040] Rule (i), applied to "α#β", is realized as a search for appearance positions with Ndoc(α) = Ndoc(β) and Nchar(β) = Nchar(α) + 2. Rule (ii), applied to "α#β", is realized as a search for appearance positions with Ndoc(α) = Ndoc(β) and Nchar(β) = Nchar(α) + 3. Rule (iii), applied to "α##β", is realized as a search for appearance positions with Ndoc(α) = Ndoc(β) and Nchar(β) = Nchar(α) + 2. Through these searches, the matching unit 12 detects the presence of string patterns such as "α#β".
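
Each rule in [0036] to [0040] reduces to a fixed offset between Nchar(α) and Nchar(β) within one document. The sketch below shows that reduction; the names `OFFSETS` and `pair_starts` are hypothetical, and the position lists are the ones given for the "ペルシャ" example in [0032].

```python
# Required value of Nchar(β) - Nchar(α) for each pattern and rule.
OFFSETS = {
    "adjacent": 1,  # "αβ"
    "rule_i": 2,    # "α#β",  "#" stands for one text character
    "rule_ii": 3,   # "α#β",  "#" stands for two text characters
    "rule_iii": 2,  # "α##β", "##" stands for one text character
}


def pair_starts(pos_alpha, pos_beta, offset):
    """Positions of α that have β in the same document, `offset` characters later."""
    beta = set(pos_beta)
    return [(doc, ch) for (doc, ch) in pos_alpha if (doc, ch + offset) in beta]


# Appearance positions from the example in [0032].
pe = [(1, 10), (3, 21)]           # "ペ"
ru = [(1, 11), (2, 6)]            # "ル"
shi = [(1, 12), (2, 7), (3, 24)]  # "シ"

adjacent_ru_shi = pair_starts(ru, shi, OFFSETS["adjacent"])  # "ルシ"
pe_mask_shi_i = pair_starts(pe, shi, OFFSETS["rule_i"])      # "ペ#シ", rule (i)
pe_mask_shi_ii = pair_starts(pe, shi, OFFSETS["rule_ii"])    # "ペ#シ", rule (ii)
```

With these data, rule (i) finds "ペ#シ" only in document 1 and rule (ii) finds it only in document 3, where the single character "ル" was split into the two characters "ノレ".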

[0041] The matching unit 12 applies the above matching to each part of a replacement string and thereby detects occurrences of the replacement string in the search target text. For example, for the replacement string "ペルシャ", "ペ (1,10)", "ル (1,11)", "シ (1,12)", and "ャ (1,13)" satisfy the basic matching rule described above, and the matching unit 12 outputs as the matching result the pair of the replacement string and the appearance position of its first character, "ペルシャ (1,10)". Positions that have matched once are removed from consideration so that they do not match another replacement string. For the replacement string "#ルシャ", based on the basic rule and rule (i), an arbitrary character preceding "ル (2,6)" together with "ル (2,6)", "シ (2,7)", and "ャ (2,8)" is detected, and the matching unit 12 outputs "#ルシャ (2,5)" as the matching result. For the replacement string "ペ#シャ", based on the basic rule and rule (ii), "ペ (3,21)", the two arbitrary characters that follow it, and "シ (3,24)" and "ャ (3,25)" that follow those two characters satisfy the matching rules, and the matching unit 12 outputs "ペ#シャ (3,21)" as the matching result. The replacement strings "ペル#ャ" and "ペルシ#" are also searched for, but in this example no string (candidate string) hits them.
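
The walk-through in [0041] can be reproduced with the sketch below. It is a simplified illustration under several assumptions: all function names are hypothetical; only runs of at most one "#" are supported (rules (i) and (ii)), so rule (iii) is omitted; a leading "#" is reported with the rule (i) alignment, as in the source's example; a trailing "#" is left unchecked; and "removing matched positions" is modeled by a set of already used character positions.

```python
MASK = "#"
# Text characters a run of masks may cover: none between adjacent literals,
# one (rule i) or two (rule ii) for a single "#".
WIDTHS = {0: (0,), 1: (1, 2)}

# Target character position information from [0032].
positions = {
    "ペ": [(1, 10), (3, 21)],
    "ル": [(1, 11), (2, 6)],
    "シ": [(1, 12), (2, 7), (3, 24)],
    "ャ": [(1, 13), (2, 8), (3, 25)],
}


def tokenize(replacement):
    """Split into (masks_before, literal) pairs; trailing masks are dropped."""
    tokens, run = [], 0
    for ch in replacement:
        if ch == MASK:
            run += 1
        else:
            tokens.append((run, ch))
            run = 0
    return tokens


def extend(doc, pos, rest):
    """Return positions of the remaining literals if they line up, else None."""
    if not rest:
        return []
    (run, ch), tail = rest[0], rest[1:]
    for width in WIDTHS[run]:
        nxt = (doc, pos + width + 1)
        if nxt in positions.get(ch, ()):
            more = extend(doc, nxt[1], tail)
            if more is not None:
                return [nxt] + more
    return None


def find_matches(replacement, used):
    """Start positions of matches, skipping character positions already used."""
    (lead, first), rest = tokenize(replacement)[0], tokenize(replacement)[1:]
    hits = []
    for doc, pos in positions.get(first, ()):
        chain = extend(doc, pos, rest)
        if chain is None:
            continue
        literals = [(doc, pos)] + chain
        if any(p in used for p in literals):
            continue  # already matched by an earlier replacement string
        used.update(literals)
        hits.append((doc, pos - WIDTHS[lead][0]))
    return hits


used = set()
results = {r: find_matches(r, used)
           for r in ["ペルシャ", "#ルシャ", "ペ#シャ", "ペル#ャ", "ペルシ#"]}
```

Trying the exact string first matters: the characters of document 1 are consumed by "ペルシャ", so the masked variants no longer report that same occurrence.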

【００４２】出力部１４は、マッチング部１２で得られ
たマッチング結果に基づいて、画面上に検索結果を表示
する。上述のマッチング部１２は、マッチング結果とし
て候補文字列の位置を出力するものであり、出力部１４
はそれを例えば、「文書Ａ：（1,10）文書Ｂ：（2,
5）文書Ｃ：（3,21）」と表示することができる。そ
の他、誤認識許容文字数によって、完全一致、１文字曖
昧、２文字曖昧というようにランキングを行い、それら
のグループごとに区分して表示してもよい。The output unit 14 displays the search result on a screen based on the matching result obtained by the matching unit 12. The matching unit 12 described above outputs the positions of the candidate character strings as the matching result, and the output unit 14 can display them as, for example, "Document A: (1,10) Document B: (2,5) Document C: (3,21)". Alternatively, the results may be ranked by the number of erroneous recognition permission characters, as exact match, one-character ambiguity, or two-character ambiguity, and displayed separately for each of those groups.
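
The grouping by number of erroneous recognition permission characters can be sketched as follows. This is only an illustration: the function name and the input format (pairs of replacement string and position, with ASCII `#` standing in for ＃) are assumptions.

```python
from collections import defaultdict


def group_by_ambiguity(matches):
    """Group match positions by the number of '#' in the replacement string that hit.

    Key 0 is an exact match, 1 is one-character ambiguity, 2 is two-character ambiguity.
    """
    groups = defaultdict(list)
    for replacement, position in matches:
        groups[replacement.count("#")].append(position)
    return dict(sorted(groups.items()))


matches = [("ペルシャ", (1, 10)), ("#ルシャ", (2, 5)), ("ペ#シャ", (3, 21))]
print(group_by_ambiguity(matches))
```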

【００４３】また、出力部１４は、マッチング部１２か
ら得た文書番号と文字位置を基に、検索対象テキストに
アクセスして、候補文字列を得てそれを表示してもよ
い。また、マッチング部１２自体が、候補文字列の位置
情報に基づいてインデックス記憶部２にアクセスし、候
補文字列をその位置情報と併せて出力部１４へ出力する
ように構成することもできる。このような構成により、
出力部１４は、候補文字列を含んだ内容、例えば「文書
Ａ：ペルシャ（1,10）文書Ｂ：ベルシャ（2,5）文
書Ｃ：ペノレシャ（3,21）」を表示することができる。The output unit 14 may also access the search target text based on the document number and character position obtained from the matching unit 12, obtain the candidate character string, and display it. Alternatively, the matching unit 12 itself may be configured to access the index storage unit 2 based on the position information of the candidate character string and to output the candidate character string to the output unit 14 together with its position information. With such a configuration, the output unit 14 can display content that includes the candidate character strings, for example "Document A: ペルシャ (1,10) Document B: ベルシャ (2,5) Document C: ペノレシャ (3,21)".

【００４４】ちなみに本装置は、汎用コンピュータを用
いて構成することができ、特に検索部６、検索文字列展
開部１０、マッチング部１２の機能は、中央演算処理部
（ＣＰＵ：Central Processing Unit）により実行され
うる。Incidentally, the present apparatus can be configured using a general-purpose computer. In particular, the functions of the search unit 6, the search character string developing unit 10, and the matching unit 12 can be executed by a central processing unit (CPU).

【００４５】本実施形態にて説明した本発明によれば、
例えば文字認識において誤って認識されることにより、
ある文字又は文字列がどのような誤った文字又は文字列
に変換されて検索対象テキストが生成されるかという情
報を用いないのにも拘わらず、文字誤り、誤分割、誤結
合に対応することができ、検索文字列に一致する可能性
のある候補文字列をもれなく検索することができる。According to the invention described in this embodiment, character errors, erroneous division, and erroneous connection can be handled without using any information about which wrong character or character string a given character or character string is converted into, for example through misrecognition in character recognition, when the search target text is generated. Candidate character strings that may match the search character string can therefore be retrieved without omission.

【００４６】［実施形態２］図５は、本発明の第２の実
施形態であるテキスト検索装置の概略のブロック構成図
である。本装置の構成要素のうち上記実施形態と同様の
ものについては同一の符号を付し説明を簡単にする。本
装置は、上記装置の構成に加えて、テキスト記憶部２
０、誤り文字列登録部２４、ランキング部２６とをさら
に備えた点が主たる相違点である。[Second Embodiment] FIG. 5 is a schematic block diagram of a text search apparatus according to a second embodiment of the present invention. Components of this apparatus that are the same as those of the above embodiment are denoted by the same reference numerals, and their description is simplified. The main difference is that, in addition to the configuration of the above apparatus, this apparatus further includes a text storage unit 20, an error character string registration unit 24, and a ranking unit 26.

【００４７】テキスト記憶部２０は、検索対象テキスト
を格納しており、各文書は文書番号を付され互いに区別
されうる。The text storage unit 20 stores search target texts, and each document is assigned a document number and can be distinguished from each other.

【００４８】マッチング部２２は、上記実施形態の装置
と同様の処理を行って、候補文字列の位置情報を得る。
本装置のマッチング部２２は、さらにその位置情報に基
づいて、テキスト記憶部２０にアクセスし、候補文字列
を取得し出力する。このとき、位置情報も併せて出力す
ることができる。The matching section 22 performs the same processing as in the apparatus of the above embodiment to obtain position information of the candidate character string.
The matching unit 22 of the present apparatus further accesses the text storage unit 20 based on the position information, acquires and outputs the candidate character string. At this time, the position information can also be output.

【００４９】誤り文字列登録部２４は、文字認識におい
て誤認識されやすい文字又は文字列である誤り文字列パ
ターンを格納している。The erroneous character string registration unit 24 stores an erroneous character string pattern that is a character or a character string that is easily erroneously recognized in character recognition.

【００５０】ランキング部２６は、マッチング部２２か
ら候補文字列を得ると、当該候補文字列中に誤り文字列
登録部２４に登録された誤り文字列パターンを探索す
る。そして、ランキング部２６はその結果に応じて候補
文字列と検索文字列との一致可能性に応じた優先度を定
める（優先度付与手段）。ランキング部２６は、候補文
字列とその優先度とを出力部１４へ出力する。Upon obtaining the candidate character string from the matching unit 22, the ranking unit 26 searches the candidate character string for an error character string pattern registered in the error character string registration unit 24. Then, the ranking unit 26 determines a priority according to the possibility of matching between the candidate character string and the search character string according to the result (priority assigning means). The ranking unit 26 outputs the candidate character strings and their priorities to the output unit 14.

【００５１】次に、具体的な例を用いて、本装置の動作
の特徴を説明する。図６は、検索対象テキストのイメー
ジを示す模式図である。ここで例に用いる検索対象テキ
ストは文書Ａ、文書Ｂ、文書Ｃの３つである。文書Ａに
はその先頭から１０文字目から文字列「スキャナ」が存
在する。同様に文書Ｂにはその先頭から５文字目から文
字列「スキャン」が存在し、文書Ｃにはその先頭から２
１文字目から文字列「スキヤナ」が存在する。Next, the characteristic operation of this apparatus is described using a concrete example. FIG. 6 is a schematic diagram showing an image of the search target text. The search target text used in this example consists of three documents: document A, document B, and document C. Document A contains the character string "スキャナ" starting at the 10th character from its beginning. Similarly, document B contains the character string "スキャン" starting at the 5th character, and document C contains the character string "スキヤナ" starting at the 21st character.

【００５２】図７は、これらの検索対象テキストに対し
て生成され、インデックス記憶部２に格納されているイ
ンデックスのイメージを示す模式図である。FIG. 7 is a schematic diagram showing an image of an index generated for these search target texts and stored in the index storage unit 2.

【００５３】ユーザは検索条件として、検索文字列「ス
キャナ」、一致度７０％を、入力部４に対し入力する。The user inputs the search character string "スキャナ" and a matching degree of 70% to the input unit 4 as search conditions.

【００５４】入力部４は、検索文字列「スキャナ」を検
索部６へ通知する。検索部６はこの検索文字列を得る
と、上記実施形態と同様、それを構成する各文字をキー
としてインデックス記憶部２を検索し、その結果を対象
文字位置情報記憶部８に格納する。The input unit 4 passes the search character string "スキャナ" to the search unit 6. On obtaining this search character string, the search unit 6, as in the above embodiment, searches the index storage unit 2 using each of its constituent characters as a key and stores the result in the target character position information storage unit 8.

【００５５】検索文字列展開部１０は、入力部４から検
索文字列と一致度の閾値ηthを受け取って、それに応じ
た置換文字列を生成する。The search character string developing unit 10 receives the search character string and the threshold value ηth of the degree of coincidence from the input unit 4, and generates a replacement character string corresponding to the search character string.

【００５６】ここでは検索文字列の文字数Ｍ＝４、及び
一致度の閾値ηth＝７０％に基づいて、検索文字列展開
部１０は、誤認識許容文字を全く含まない置換文字列と
誤認識許容文字を１つだけ含む置換文字列を生成する。
具体的には、この例では置換文字列として、「スキャ
ナ」、「＃キャナ」、「ス＃ャナ」、「スキ＃ナ」、
「スキャ＃」の５つが生成される。Here, based on the number of characters M = 4 of the search character string and the matching-degree threshold ηth = 70%, the search character string developing unit 10 generates the replacement character string that contains no erroneous recognition permission character and the replacement character strings that contain exactly one. Specifically, in this example five replacement character strings are generated: "スキャナ", "＃キャナ", "ス＃ャナ", "スキ＃ナ", and "スキャ＃".
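
The expansion step can be sketched in Python as below. The formula used to turn the matching degree into a number of permitted wildcards, floor(M × (100 − ηth) / 100), is an assumption chosen because it reproduces this example (M = 4, 70%, one wildcard); the function names and the ASCII `#` standing in for ＃ are likewise hypothetical.

```python
from itertools import combinations


def allowed_wildcards(length, threshold_percent):
    """Assumed rule: floor(M * (100 - threshold) / 100), in integer arithmetic."""
    return length * (100 - threshold_percent) // 100


def expand(search, threshold_percent, wildcard="#"):
    """Generate the search string itself plus every variant with up to k wildcards."""
    k = allowed_wildcards(len(search), threshold_percent)
    variants = []
    for n in range(k + 1):
        for positions in combinations(range(len(search)), n):
            chars = list(search)
            for p in positions:
                chars[p] = wildcard
            variants.append("".join(chars))
    return variants


print(expand("スキャナ", 70))
```

For the four-character search string and a 70% threshold this yields the unmodified string followed by the four single-wildcard variants, in the same order as listed above.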

【００５７】マッチング部２２は、この置換文字列と検
索対象テキストとのマッチングを行い、検索対象テキス
ト中における置換文字列の存在を検知する。例えば、置
換文字列「スキャナ」に対しては「ス（1,10）」、「キ
（1,11）」、「ャ（1,12）」、「ナ（1,13）」がマッチ
する。マッチング部２２は、この位置情報に基づいて、
テキスト記憶部２０に格納された検索対象テキストから
候補文字列「スキャナ」を取得し、これとその先頭文字
の出現位置との組「スキャナ（1,10）」を、ランキング
部２６へ出力する。また、置換文字列「スキャ＃」に対
しては上記実施形態で述べた基本規則と規則(i)に基づ
いて「ス（2,5）」、「キ（2,6）」、「ャ（2,7）」及
びこれに後続する任意の１文字がマッチする。マッチン
グ部２２は、この位置情報を基にテキスト記憶部２０に
アクセスして候補文字列「スキャン」を取得し、マッチ
ング結果として「スキャン（2,5）」を出力する。ま
た、置換文字列「スキ＃ナ」に対しては基本規則と規則
(i)に基づいて「ス（3,21）」、「キ（3,22）」、これ
に続く任意の１文字、この任意の１文字に後続する「ナ
（3,24）」がマッチする。マッチング部２２はこの位置
情報を基にテキスト記憶部２０にアクセスして候補文字
列「スキヤナ」を取得し、マッチング結果として「スキ
ヤナ（3,21）」を出力する。The matching unit 22 matches these replacement character strings against the search target text and detects the presence of the replacement character strings in the search target text. For example, for the replacement character string "スキャナ", "ス(1,10)", "キ(1,11)", "ャ(1,12)", and "ナ(1,13)" match. Based on this position information, the matching unit 22 acquires the candidate character string "スキャナ" from the search target text stored in the text storage unit 20 and outputs the pair "スキャナ(1,10)", consisting of the candidate character string and the appearance position of its first character, to the ranking unit 26. For the replacement character string "スキャ＃", based on the basic rule and rule (i) described in the above embodiment, "ス(2,5)", "キ(2,6)", "ャ(2,7)", and the one arbitrary character that follows them match. The matching unit 22 accesses the text storage unit 20 based on this position information, acquires the candidate character string "スキャン", and outputs "スキャン(2,5)" as the matching result. For the replacement character string "スキ＃ナ", based on the basic rule and rule (i), "ス(3,21)", "キ(3,22)", the one arbitrary character that follows them, and "ナ(3,24)" following that character match. The matching unit 22 accesses the text storage unit 20 based on this position information, acquires the candidate character string "スキヤナ", and outputs "スキヤナ(3,21)" as the matching result.

【００５８】ランキング部２６は、マッチング部２２か
らマッチング結果を得ると、候補文字列のランキングを
行う。ここでランキングは、候補文字列が検索文字列に
一致する可能性に応じた優先度を定める処理であり、候
補文字列が検索文字列と異なる部分（誤り文字列）の文
字数と誤り文字列登録部２４に誤り文字列パターンとし
て登録されているかどうかに基づいて定められる。Upon obtaining the matching results from the matching unit 22, the ranking unit 26 ranks the candidate character strings. Here, ranking is the process of determining a priority according to the possibility that a candidate character string matches the search character string. The priority is determined from the number of characters in the part of the candidate character string that differs from the search character string (the error character string) and from whether that part is registered as an error character string pattern in the error character string registration unit 24.

【００５９】図８は、誤り文字列登録部２４に登録され
た誤り文字列パターンの一例を示す模式図である。図
は、検索対象テキストを生成する際の文字認識におい
て、「→」の左側の文字又は文字列が、右側の文字又は
文字列と誤って認識されやすいことを示している。例え
ば、「ス」は「イ」に、「ャ」は「ヤ」や「ゃ」に、
「ナ」は「メ」に、「ル」は「ノレ」に誤って認識され
やすいことを示している。FIG. 8 is a schematic diagram showing an example of the error character string patterns registered in the error character string registration unit 24. The figure indicates that, in the character recognition performed when the search target text is generated, the character or character string to the left of "→" is easily misrecognized as the character or character string to its right. For example, "ス" is easily misrecognized as "イ", "ャ" as "ヤ" or "ゃ", "ナ" as "メ", and "ル" as "ノレ".

【００６０】ランキング部２６は、例えば、候補文字列
が検索文字列と完全一致の場合には、優先度を表す数値
としてポイント「１００」を付与し、１文字不一致の場
合にはポイント「１０」を付与する。その上でランキン
グ部２６は、候補文字列と検索文字列との差分である誤
り文字列が、誤り文字列登録部２４に誤り文字列パター
ンとして登録されているかどうかを調べ、もし登録され
ている場合は、既に獲得しているポイントに、例えば
「４０」ポイントを加える。For example, the ranking unit 26 assigns "100" points, as a numerical value representing the priority, when the candidate character string exactly matches the search character string, and "10" points when one character does not match. The ranking unit 26 then checks whether the error character string, that is, the difference between the candidate character string and the search character string, is registered as an error character string pattern in the error character string registration unit 24. If it is, for example "40" points are added to the points already obtained.

【００６１】よって、例えば、候補文字列「スキャナ」
は完全一致であるので、ポイント「１００」を獲得し、
候補文字列「スキヤナ」は１文字不一致で、さらに誤り
文字列登録部２４に「ャ→ヤ」が登録されているので、
それぞれのポイント「１０」、「４０」を加算したポイ
ント「５０」を得る。一方、候補文字列「スキャン」は
１文字不一致であるが、誤り文字列登録部２４にその誤
り文字列が登録されていないので、ポイント「１０」の
みを得る。そして、ランキング部２６は、例えば、ラン
キング結果として、候補文字列とその位置情報とポイン
トの組、例えば「スキャナ（1,10,100）」、「スキヤナ
（2,5,50）」、「スキャン（3,21,10）」を出力部１４
へ出力する。Thus, for example, the candidate character string "スキャナ" is an exact match and obtains "100" points. The candidate character string "スキヤナ" has a one-character mismatch and, because "ャ→ヤ" is registered in the error character string registration unit 24, obtains "50" points, the sum of "10" and "40". The candidate character string "スキャン", on the other hand, has a one-character mismatch, but its error character string is not registered in the error character string registration unit 24, so it obtains only "10" points. The ranking unit 26 then outputs to the output unit 14, as the ranking result, sets each consisting of a candidate character string, its position information, and its points, for example "スキャナ(1,10,100)", "スキヤナ(2,5,50)", and "スキャン(3,21,10)".
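
The point assignment of this embodiment can be sketched as follows. The pattern set copies the examples given for FIG. 8; the function name and the simplifying assumption that the candidate has the same length as the search string and differs in at most one character are introduced only for this illustration.

```python
# Error character string patterns (correct -> misrecognized), from the FIG. 8 examples.
ERROR_PATTERNS = {("ス", "イ"), ("ャ", "ヤ"), ("ャ", "ゃ"), ("ナ", "メ"), ("ル", "ノレ")}

EXACT_POINTS = 100      # exact match
MISMATCH_POINTS = 10    # one character differs
PATTERN_BONUS = 40      # the differing character is a registered pattern


def score(search, candidate):
    """Score a same-length candidate that differs from the search string in <= 1 character."""
    if candidate == search:
        return EXACT_POINTS
    diffs = [(s, c) for s, c in zip(search, candidate) if s != c]
    points = MISMATCH_POINTS
    if len(diffs) == 1 and diffs[0] in ERROR_PATTERNS:
        points += PATTERN_BONUS
    return points


for candidate in ("スキャナ", "スキヤナ", "スキャン"):
    print(candidate, score("スキャナ", candidate))
```

Sorting the candidates by this score in descending order gives the display order described for the output unit 14.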

【００６２】出力部１４は、ランキング部２６からのラ
ンキング結果を得ると、それに含まれるポイントを用い
た表示を行うことができる。例えば、ポイントが高い、
すなわち検索文字列と一致する可能性が高い順に、候補
文字列を画面表示するといったことができる。また、出
力部１４は、ある値以上のポイントを得た候補文字列の
みを表示してもよいし、ポイントが指定された範囲内に
あるものをグループ化して表示してもよい。On obtaining the ranking result from the ranking unit 26, the output unit 14 can produce a display that uses the points it contains. For example, the candidate character strings can be displayed on the screen in descending order of points, that is, in descending order of the possibility of matching the search character string. The output unit 14 may also display only the candidate character strings whose points are at or above a certain value, or may group and display those whose points fall within specified ranges.

【００６３】上述のランキング部２６は、誤り文字列登
録部２４に登録された誤り文字列パターンに対して一定
のポイントを付与したが、必ずしも付与されるポイント
は一律でなくてもよい。例えば、誤り文字列登録部２４
に各誤り文字列パターンの検出頻度などで表される誤り
やすさの度合いを格納し、これをランキングに反映させ
ることにより、より詳細なランキングを行うことができ
る。例えば、誤りやすさを０〜１の調整係数で設定し、
ポイントは、誤り文字列パターン共通のポイントに誤り
やすさの調整係数を乗じるといった方法がある。このよ
うな方法では、例えば、上述の例において候補文字列
「スキヤナ」の調整係数を０．８とすれば、そのポイン
トは１０＋４０×０．８＝４２となるわけである。ま
た、ユーザが検索結果に基づいて、誤り文字列パターン
の検出頻度を増減するように構成することができる。The ranking unit 26 described above assigns a fixed number of points for any error character string pattern registered in the error character string registration unit 24, but the points need not be uniform. For example, the error character string registration unit 24 may store, for each error character string pattern, a degree of error susceptibility expressed by, for instance, its detection frequency, and reflecting this in the ranking allows a finer-grained ranking. One method is to set the error susceptibility as an adjustment coefficient between 0 and 1 and to multiply the points common to all error character string patterns by that coefficient. With this method, if the adjustment coefficient for the candidate character string "スキヤナ" in the above example is 0.8, its points are 10 + 40 × 0.8 = 42. The apparatus can also be configured so that the user increases or decreases the detection frequency of an error character string pattern based on the search results.
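
A minimal sketch of the weighted variant, assuming the coefficient is stored per error character string pattern: only the 0.8 value for "ャ→ヤ" comes from the example above, and the dictionary layout and function name are hypothetical.

```python
# Adjustment coefficient (0 to 1) per registered pattern; 0.8 is the value in the example.
ERROR_WEIGHTS = {("ャ", "ヤ"): 0.8}

EXACT_POINTS = 100.0
MISMATCH_POINTS = 10.0
PATTERN_BONUS = 40.0


def weighted_score(search, candidate):
    """Score a same-length candidate; the pattern bonus is scaled by its coefficient."""
    if candidate == search:
        return EXACT_POINTS
    diffs = [(s, c) for s, c in zip(search, candidate) if s != c]
    points = MISMATCH_POINTS
    if len(diffs) == 1:
        # Unregistered differences get a coefficient of 0, so no bonus is added.
        points += PATTERN_BONUS * ERROR_WEIGHTS.get(diffs[0], 0.0)
    return points


print(weighted_score("スキャナ", "スキヤナ"))
```

Setting every coefficient to 1 reduces this to the fixed-bonus scheme of the earlier paragraphs.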

【００６４】本実施形態にて説明した本発明によれば、
第一の実施形態で説明した発明と同様、検索処理のうち
マッチング自体は、誤り文字列登録部２４に登録された
誤り文字列パターンを必要とせずに、文字誤り、誤分
割、誤結合に対応することができ、検索文字列に一致す
る可能性のある候補文字列をもれなく検索することがで
きる。このもれなく検索することにより、検索文字列と
の一致可能性が低いものも候補文字列として検出され、
マッチング結果に含まれる「ゴミ」（検索誤り）の割合
が増加することは否めない。本発明は、もれなく検索す
るとともに、その検索結果をより確からしい順番にて表
示することを可能にし、これによりユーザが検索結果を
利用する際に各候補文字列の重要度（優先度）を把握す
ることが可能となり、検索誤りが生じても実際の利用に
おけるその影響を軽減することができる。従来の検索文
字列を誤り文字列パターンを用いて展開して検索を行う
方法では、誤り文字列パターンを登録した辞書がある程
度充実していないと検索もれが多くなり、信頼性が低く
なる。これに対し本発明では、誤り文字列登録部２４の
データが無い場合でも、もれなく検索でき、誤り文字列
登録部２４のデータを充実させていくことによりランキ
ングの精度を向上させていくことができる。According to the invention described in this embodiment, as with the invention described in the first embodiment, the matching step of the search process handles character errors, erroneous division, and erroneous connection without requiring the error character string patterns registered in the error character string registration unit 24, and candidate character strings that may match the search character string can be retrieved without omission. Because the search is exhaustive, character strings with a low possibility of matching the search character string are also detected as candidates, and it cannot be denied that the proportion of "garbage" (search errors) in the matching results increases. The present invention performs an exhaustive search and also makes it possible to display the search results in order of likelihood, so that the user can grasp the importance (priority) of each candidate character string when using the results, which reduces the practical impact of any search errors. In the conventional method, which expands the search character string using error character string patterns before searching, search omissions become frequent and reliability falls unless the dictionary of registered error character string patterns is reasonably complete. In the present invention, by contrast, the search is exhaustive even when the error character string registration unit 24 holds no data, and the accuracy of the ranking can be improved by enriching the data in the error character string registration unit 24.

【００６５】［実施形態３］本発明の第３の実施形態
は、他の検索文字列を用いた他の検索処理例に係るもの
であり、本実施形態に係るテキスト検索装置の構成は、
上記第二の実施形態の装置と同様である。[Embodiment 3] The third embodiment of the present invention concerns another example of search processing that uses a different search character string. The configuration of the text search apparatus according to this embodiment is the same as that of the apparatus of the second embodiment.

【００６６】図９は、検索対象テキストのイメージを示
す模式図である。ここで例に用いる検索対象テキストは
文書Ａ、文書Ｂ、文書Ｃの３つである。文書Ａにはその
先頭から１０文字目から文字列「アルタリア」が存在す
る。同様に文書Ｂにはその先頭から５文字目から文字列
「アル列ア」が存在し、文書Ｃにはその先頭から２１文
字目から文字列「アル夕リア」（“夕”は漢字）が存在
する。FIG. 9 is a schematic diagram showing an image of the search target text. The search target text used in this example consists of three documents: document A, document B, and document C. Document A contains the character string "アルタリア" starting at the 10th character from its beginning. Similarly, document B contains the character string "アル列ア" starting at the 5th character, and document C contains the character string "アル夕リア" (where "夕" is a kanji) starting at the 21st character.

【００６７】図１０は、これらの検索対象テキストに対
して生成され、インデックス記憶部２に格納されている
インデックスのイメージを示す模式図である。FIG. 10 is a schematic diagram showing an image of an index generated for these search target texts and stored in the index storage unit 2.

【００６８】ユーザは検索条件として、検索文字列「ア
ルタリア」及び、一致度６０％を入力部４に対し入力す
る。The user inputs the search character string "アルタリア" and a matching degree of 60% to the input unit 4 as search conditions.

【００６９】入力部４は、検索文字列「アルタリア」を
検索部６へ通知する。検索部６はこの検索文字列を得る
と、上記実施形態と同様、それを構成する各文字をキー
としてインデックス記憶部２を検索し、その結果を対象
文字位置情報記憶部８に格納する。The input unit 4 passes the search character string "アルタリア" to the search unit 6. On obtaining this search character string, the search unit 6, as in the above embodiment, searches the index storage unit 2 using each of its constituent characters as a key and stores the result in the target character position information storage unit 8.

【００７０】検索文字列展開部１０は、入力部４から検
索文字列と一致度の閾値ηthを受け取って、それに応じ
た置換文字列を生成する。The search character string developing unit 10 receives the search character string and the threshold value ηth of the degree of coincidence from the input unit 4, and generates a replacement character string according to the search character string.

【００７１】ここでは検索文字列の文字数Ｍ＝５、及び
一致度の閾値ηth＝６０％から誤認識許容文字は２文字
許される。検索文字列展開部１０は具体的には、この例
では置換文字列として、「アルタリア」、「＃＃タリ
ア」、「＃ル＃リア」、「＃ルタ＃ア」、「＃ルタリ
＃」、「ア＃＃リア」、「ア＃タ＃ア」、「ア＃タリ
＃」、「アル＃＃ア」、「アル＃リ＃」、「アルタ＃
＃」を生成しマッチング部２２へ出力する。Here, from the number of characters M = 5 of the search character string and the matching-degree threshold ηth = 60%, two erroneous recognition permission characters are allowed. Specifically, in this example the search character string developing unit 10 generates the replacement character strings "アルタリア", "＃＃タリア", "＃ル＃リア", "＃ルタ＃ア", "＃ルタリ＃", "ア＃＃リア", "ア＃タ＃ア", "ア＃タリ＃", "アル＃＃ア", "アル＃リ＃", and "アルタ＃＃", and outputs them to the matching unit 22.
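
The listing in this paragraph consists of the unmodified search string followed by every placement of two wildcards in a five-character string. The short Python sketch below enumerates those placements; the function name and the ASCII `#` standing in for ＃ are assumptions for illustration.

```python
from itertools import combinations


def two_wildcard_strings(search, wildcard="#"):
    """Return the search string followed by every variant with exactly two wildcards."""
    variants = [search]
    for i, j in combinations(range(len(search)), 2):
        chars = list(search)
        chars[i] = chars[j] = wildcard
        variants.append("".join(chars))
    return variants


variants = two_wildcard_strings("アルタリア")
print(len(variants))
print(variants)
```

For five characters there are 10 such placements, so the function returns 11 strings in the same order as the listing.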

【００７２】マッチング部２２は、この置換文字列と検
索対象テキストとのマッチングを行い、検索対象テキス
ト中における置換文字列の存在を検知する。例えば、置
換文字列「アルタリア」に対しては「ア（1,10）」、
「ル（1,11）」、「タ（1,12）」、「リ（1,13）」、
「ア（1,14）」がマッチする。マッチング部２２は、こ
の位置情報に基づいて、テキスト記憶部２０に格納され
た検索対象テキストから候補文字列「アルタリア」を取
得し、これとその先頭文字の出現位置との組「アルタリ
ア（1,10）」を、ランキング部２６へ出力する。また、
置換文字列「＃ル＃リア」に対しては任意の１文字、こ
れに続く「ル（3,22）」、これに続く任意の１文字、
「リ（3,24）」、「ア（3,25）」がマッチする。マッチ
ング部２２は、この位置情報を基にテキスト記憶部２０
にアクセスして候補文字列「アル夕リア」（“夕”は漢
字）を取得し、マッチング結果として「アル夕リア（3,
21）」を出力する。また、置換文字列「アル＃＃ア」に
対しては上記第一の実施形態で述べた規則(iii)から
「ア（2,5）」、「ル（2,6）」、これに続く１文字、及
び「ア（2,8）」がマッチする。マッチング部２２はこ
の位置情報を基にテキスト記憶部２０にアクセスして候
補文字列「アル列ア」を取得し、マッチング結果として
「アル列ア（2,5）」を出力する。The matching unit 22 matches these replacement character strings against the search target text and detects the presence of the replacement character strings in the search target text. For example, for the replacement character string "アルタリア", "ア(1,10)", "ル(1,11)", "タ(1,12)", "リ(1,13)", and "ア(1,14)" match. Based on this position information, the matching unit 22 acquires the candidate character string "アルタリア" from the search target text stored in the text storage unit 20 and outputs the pair "アルタリア(1,10)", consisting of the candidate character string and the appearance position of its first character, to the ranking unit 26. For the replacement character string "＃ル＃リア", one arbitrary character, "ル(3,22)" following it, one arbitrary character following that, "リ(3,24)", and "ア(3,25)" match. The matching unit 22 accesses the text storage unit 20 based on this position information, acquires the candidate character string "アル夕リア" (where "夕" is a kanji), and outputs "アル夕リア(3,21)" as the matching result. For the replacement character string "アル＃＃ア", under rule (iii) described in the first embodiment, "ア(2,5)", "ル(2,6)", the one character that follows them, and "ア(2,8)" match. The matching unit 22 accesses the text storage unit 20 based on this position information, acquires the candidate character string "アル列ア", and outputs "アル列ア(2,5)" as the matching result.

【００７３】ランキング部２６は、マッチング部２２か
らマッチング結果を得ると、候補文字列のランキングを
行う。本装置では、ランキング部２６が付与するポイン
トは、誤認識許容文字が２つの場合に拡張され、その場
合に生じ得るそれぞれのケースについて定められてい
る。例えば、以下のように定めることができる。Upon obtaining the matching result from the matching unit 22, the ranking unit 26 ranks the candidate character strings. In the present apparatus, the points given by the ranking unit 26 are extended to the case where the number of erroneously recognizable characters is two, and are defined for each case that may occur in that case. For example, it can be determined as follows.

【００７４】 (a)検索文字列と完全一致の場合：ポイント「１０
０」、 (b)１文字不一致の場合：ポイント「５
０」、 (b-1)不一致の１文字が誤り文字列パターンと一致する
場合：ポイント「３０」を加算、 (c)２文字不一致の場合：ポイント「１
０」、 (c-1)不一致の１文字が誤り文字列パターンと一致する
場合：ポイント「３０」を加算、 (c-2)不一致の２文字が誤り文字列パターンと一致する
場合：ポイント「６０」を加算。
(a) Exact match with the search character string: "100" points.
(b) One character mismatched: "50" points.
(b-1) The one mismatched character matches an error character string pattern: add "30" points.
(c) Two characters mismatched: "10" points.
(c-1) One mismatched character matches an error character string pattern: add "30" points.
(c-2) The two mismatched characters match an error character string pattern: add "60" points.

【００７５】また、誤り文字列登録部２４に登録された
誤り文字列パターンには、「タリ→列」、「タ（カタカ
ナ）→夕（漢字）」が含まれているものとする。It is assumed that the error character string patterns registered in the error character string registration unit 24 include "タリ→列" and "タ (katakana)→夕 (kanji)".

【００７６】ランキング部２６は、候補文字列「アルタ
リア」に対しては完全一致の場合のポイント「１００」
を付与し、「アル列ア」は２文字不一致かつ誤り文字列
「タリ→列」が誤り文字列登録部２４に登録されている
ので、１０＋６０＝７０ポイントを付与される。また、
候補文字列「アル夕リア」（“夕”は漢字）は１文字不
一致かつ誤り文字列「タ→夕」が誤り文字列登録部２４
に登録されているので、５０＋３０＝８０ポイントを付
与される。そして、ランキング部２６は、例えば、ラン
キング結果として、候補文字列とその位置情報とポイン
トの組、例えば「アルタリア（1,10,100）」、「アル列
ア（2,5,70）」、「アル夕リア（3,21,80）」を出力部
１４へ出力する。The ranking unit 26 assigns the candidate character string "アルタリア" the "100" points for an exact match. "アル列ア" has a two-character mismatch, and the error character string "タリ→列" is registered in the error character string registration unit 24, so it is assigned 10 + 60 = 70 points. The candidate character string "アル夕リア" (where "夕" is a kanji) has a one-character mismatch, and the error character string "タ→夕" is registered in the error character string registration unit 24, so it is assigned 50 + 30 = 80 points. The ranking unit 26 then outputs to the output unit 14, as the ranking result, sets each consisting of a candidate character string, its position information, and its points, for example "アルタリア(1,10,100)", "アル列ア(2,5,70)", and "アル夕リア(3,21,80)".
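
The extended point table (a) to (c-2) can be sketched as follows. The sketch assumes that the candidate differs from the search string in one contiguous region, which it isolates by stripping the common prefix and suffix; that simplification, the function names, and the use of Unicode escapes to keep the katakana タ and the kanji 夕 apart are choices made for this illustration only.

```python
KATAKANA_TA = "\u30bf"  # katakana character in the search string
KANJI_YU = "\u5915"     # visually similar kanji produced by misrecognition

SEARCH = "アル" + KATAKANA_TA + "リア"
# Registered error character string patterns assumed in the text (correct -> misrecognized).
PATTERNS = {(KATAKANA_TA + "リ", "列"), (KATAKANA_TA, KANJI_YU)}


def diff_segment(search, candidate):
    """Strip the common prefix and suffix and return the differing parts."""
    limit = min(len(search), len(candidate))
    p = 0
    while p < limit and search[p] == candidate[p]:
        p += 1
    s = 0
    while s < limit - p and search[-1 - s] == candidate[-1 - s]:
        s += 1
    return search[p:len(search) - s], candidate[p:len(candidate) - s]


def score(search, candidate):
    if candidate == search:
        return 100                                   # (a)
    src, got = diff_segment(search, candidate)
    if len(src) == 1:                                # (b), (b-1)
        return 50 + (30 if (src, got) in PATTERNS else 0)
    if len(src) == 2:
        if (src, got) in PATTERNS:                   # (c) + (c-2)
            return 10 + 60
        if len(got) == 2 and any((a, b) in PATTERNS for a, b in zip(src, got) if a != b):
            return 10 + 30                           # (c) + (c-1)
        return 10                                    # (c)
    raise ValueError("difference is outside the scope of this sketch")


for cand in (SEARCH, "アル列ア", "アル" + KANJI_YU + "リア"):
    print(cand, score(SEARCH, cand))
```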

【００７７】出力部１４は、ランキング部２６からのラ
ンキング結果を得ると、例えば、ポイントが高い順に、
候補文字列を画面表示する。また、出力部１４は、ある
値以上のポイントを得た候補文字列のみを表示してもよ
いし、ポイントが指定された範囲内にあるものをグルー
プ化して表示してもよい。On obtaining the ranking result from the ranking unit 26, the output unit 14 displays the candidate character strings on the screen, for example in descending order of points. The output unit 14 may also display only the candidate character strings whose points are at or above a certain value, or may group and display those whose points fall within specified ranges.

【００７８】なお、誤り文字列登録部２４を用いたラン
キングではなく、簡単に、完全一致、１文字曖昧、２文
字曖昧というランキングを行うことも可能である。Instead of ranking with the error character string registration unit 24, it is also possible simply to rank the results as exact match, one-character ambiguity, or two-character ambiguity.

[Brief description of the drawings]

【図１】本発明の第一の実施形態であるテキスト検索
装置の概略のブロック構成図である。FIG. 1 is a schematic block configuration diagram of a text search device according to a first embodiment of the present invention.

【図２】第一の実施形態に係る検索対象テキストのイ
メージを示す模式図である。FIG. 2 is a schematic diagram illustrating an image of a search target text according to the first embodiment.

【図３】第一の実施形態に係る検索対象テキストのイ
ンデックスのイメージを示す模式図である。FIG. 3 is a schematic diagram showing an image of an index of a search target text according to the first embodiment.

【図４】対象文字位置情報記憶部に格納される対象文
字位置情報のイメージを示す模式図である。FIG. 4 is a schematic diagram illustrating an image of target character position information stored in a target character position information storage unit.

【図５】本発明の第二の実施形態であるテキスト検索
装置の概略のブロック構成図である。FIG. 5 is a schematic block diagram of a text search device according to a second embodiment of the present invention.

【図６】第二の実施形態に係る検索対象テキストのイ
メージを示す模式図である。FIG. 6 is a schematic diagram illustrating an image of a search target text according to the second embodiment.

【図７】第二の実施形態に係る検索対象テキストのイ
ンデックスのイメージを示す模式図である。FIG. 7 is a schematic diagram illustrating an image of an index of a search target text according to the second embodiment.

【図８】誤り文字列登録部に登録された誤り文字列パ
ターンの一例を示す模式図である。FIG. 8 is a schematic diagram illustrating an example of an error character string pattern registered in an error character string registration unit.

【図９】第三の実施形態に係る検索対象テキストのイ
メージを示す模式図である。FIG. 9 is a schematic diagram illustrating an image of a search target text according to the third embodiment.

【図１０】検索対象テキストに対して生成され、イン
デックス記憶部に格納されているインデックスのイメー
ジを示す模式図である。FIG. 10 is a schematic diagram illustrating an image of an index generated for a search target text and stored in an index storage unit.

[Explanation of symbols]

２インデックス記憶部、４入力部、６検索部、８
対象文字位置情報記憶部、１０検索文字列展開部、
１２，２２マッチング部、１４出力部、２０テキ
スト記憶部、２４誤り文字列登録部、２６ランキン
グ部。2 index storage unit, 4 input unit, 6 search unit, 8
Target character position information storage unit, 10 search character string development unit,
12, 22 matching unit, 14 output unit, 20 text storage unit, 24 error character string registration unit, 26 ranking unit.

Claims

[Claims]

1. A text search apparatus for performing a search process on a search target text based on a search character string, comprising: substring generating means for generating a substring of the search character string in accordance with a specified degree of matching between the search character string and the substring; and candidate search means for searching the search target text for a candidate character string including a character string pattern that matches the generated substring.

2. A text search apparatus for searching a search target text for a candidate character string that may match a search character string, comprising: an error character string registration unit in which error character string patterns that may arise in the generation of the search target text are registered; candidate search means for searching for the candidate character string based on the search character string; error character string detecting means for detecting, in a part of the candidate character string that differs from the search character string, a registered error character string pattern registered in the error character string registration unit; and priority assigning means for assigning to the candidate character string a priority according to its possibility of matching the search character string, wherein the priority assigning means determines the priority of the candidate character string according to detection of the registered error character string pattern in the candidate character string.

3. The text search apparatus according to claim 2, further comprising substring generating means for generating a substring of the search character string in accordance with a designated degree of matching between the search character string and the substring, wherein the candidate character string includes a character string pattern that matches the substring.

4. The text search apparatus according to claim 1 or 3, wherein the candidate search means includes means for performing the search by regarding an ambiguous character, which constitutes a difference between the search character string and the substring, as one arbitrary character in the search target text.

5. The text search apparatus according to claim 1 or 3, wherein the candidate search means includes means for performing the search by regarding an ambiguous character, which constitutes a difference between the search character string and the substring, as two arbitrary characters in the search target text.

6. The text search apparatus according to claim 1 or 3, wherein the candidate search means includes means for performing the search by regarding two consecutive ambiguous characters, which constitute a difference between the search character string and the substring, as one arbitrary character in the search target text.

7. The text search apparatus according to claim 2, wherein the priority assigning means determines the priority according to a detection frequency of the registered error character string pattern.

8. The text search apparatus according to claim 7, wherein the error character string registration unit further stores the detection frequency of the registered error character string pattern.

9. The text search device according to claim 2, further comprising candidate character string display means for displaying the candidate character strings according to the priority.