JPH1027183A

JPH1027183A - Data registration method and device

Info

Publication number: JPH1027183A
Application number: JP9093439A
Authority: JP
Inventors: Kanji Kato; 寛次加藤; Hiromichi Fujisawa; 浩道藤澤; Mitsuo Oyama; 光男大山; Hisamitsu Kawaguchi; 川口　　久光; Atsushi Hatakeyama; 敦畠山; Noriyuki Kaneoka; 則幸兼岡; Mitsuru Akisawa; 充秋沢
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-04-11
Filing date: 1997-04-11
Publication date: 1998-01-27

Abstract

(57) [Abstract] [Problem] To provide a data registration method and apparatus for high-speed full-text search. [Solution] The body text (1604) to be searched is registered (1601). In addition, at least one of the following is created (1602, 1603) and registered: a condensed text (1605), in which repeated occurrences of words appearing more than once in the body text are eliminated, and a character component table (1603), which indicates whether each predetermined character is contained in the body text.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a data (document) registration method and apparatus that enable full-text search in a data retrieval system. In particular, it relates to a data (document) registration method and apparatus for retrieval that eliminate the search omissions caused by synonyms and differences in notation when searching with uncontrolled keywords (called free words). The invention also relates to an information registration method and apparatus for an information retrieval system suited to determining collectively whether sets of multiple character strings are present in the character string being searched.

[0002] The invention further relates to a collective magnetic disk device that is suited to an apparatus implementing the above information retrieval system, has a large storage capacity, and can be written and read in a short time, and to a collective magnetic disk device suited to continuous writing and reading of multiple files.

[0003]

[Prior Art] In recent years, large-scale database services that include not only secondary information (bibliographic information) such as literature and patent information but also primary information (body text) have grown in importance. Information retrieval in such databases (sometimes abbreviated DB) has conventionally relied on keywords and classification codes.

[0004] When information is registered in the database, keywords are chosen from a controlled vocabulary (called a thesaurus) and attached by a specialist; this keyword assignment is called indexing. DB searchers likewise select keywords from the thesaurus to perform a search. Keyword assignment, however, is extremely laborious: the content of the data to be registered must be read, and appropriate vocabulary expressing that content must be selected from the thesaurus. If indexing is not done properly, correct information cannot be obtained from the database. Indexing therefore requires a specialist who has expert knowledge of the data content and is also familiar with the vocabulary registered in the thesaurus. Similarly, at search time, unless appropriate vocabulary conforming to the thesaurus is specified as keywords, the requested data may not be retrieved, or unwanted items may be mixed into the retrieved data.

[0005] In addition, because the classification system of the thesaurus itself changes over the years, keywords and classification codes must be updated constantly.

[0006] Furthermore, because indexing takes time, new data is registered in sizable batches, so the information available for search always lags by a certain period. For these reasons, as DBs have spread, there has been demand for a system in which even users who are not DB specialists can easily register and search data with free words (also called uncontrolled words), without being bound by a thesaurus or the like.

[0007] As databases grow larger, the controlled terms described in a thesaurus alone can no longer describe data content in sufficient detail, so a keyword search can narrow the results only to several tens or several hundreds of items. The only way to find the desired data among these is to read their content directly, which is a major problem for search efficiency.

[0008] Automatic abstracting and automatic indexing have been attempted to address these problems of the current retrieval method, which is based on indexing with the restricted (controlled) terms of a thesaurus. In the case of Japanese, however, linguistic difficulties mean that various dictionaries are still required, so the essential problems above remain unsolved.

[0009] In searches using such free words, the keyword specified by the user (the search string) and the words used in the DB being searched often denote the same thing but differ in notation or expression, which causes search omissions. For example, the word “ピアノ” (piano) may be written “ピヤノ”, and the word “インターフェイス” (interface) may be written “インタフェース”, “インタフェイス”, or “インターフェース”. Such subtle variations in syllable notation can prevent the desired information from being retrieved.

[0010] Hereinafter, expansion into character strings with different notations is called variant-notation expansion, and expansion into other character strings using a dictionary is called synonym expansion. A character string with a different notation is called a variant notation.

[0011] As a fundamental solution to these problems, full-text search systems have been proposed, in which the searcher uses freely chosen keywords (called free words or uncontrolled words) and the content is searched by referring directly to the body text of the data.

[0012] A typical configuration is shown in FIG. 1 and is described below.

[0013] The search system 101 is connected to a host computer and receives search requests and sends search results over a communication line. When the host computer sends a search request 107, the search control means 102 accepts and analyzes it and sends the corresponding search control information 108 to the character string matching means 105 and the compound condition determination means 104. The search control means 102 also controls the storage device control means 103 so that the character string data (text data) 111 stored in the character string storage means 106 is transferred to the character string matching means 105.

[0014] The character string matching means 105 matches the input character string data against the preset search strings (keywords) and, on detecting a matching string, outputs detection information 110 to the compound condition determination means 104. The compound condition determination means 104 checks whether the detection information 110 satisfies the compound conditions described in the search request, such as positional relationships and co-occurrence relationships between character strings. If it does, the identification information and content of the corresponding data are output as the search result 109 and returned to the host computer.

[0015] One such conventional example is described in R. L. Haskin and L. A. Hollaar, “Operational Characteristics of a Hardware-Based Pattern Matcher”, ACM Transactions on Database Systems, Vol. 8, No. 1, 1983.

[0016] As the matching method used in the character string matching means 213, the key component of the character string search apparatus 200 described above, a method is known that uses a finite automaton to search for multiple character strings in a single scan. A representative example is disclosed in A. V. Aho and M. J. Corasick, “Efficient String Matching”, Communications of the ACM (CACM), Vol. 18, No. 6, 1975.

[0017] That reference describes in detail two methods of constructing an automaton and the character string matching method that uses the automaton. Each is described below.

[0018] First, the first method (hereinafter, conventional method 1) is described with reference to FIG. 2, which is a state transition diagram of an automaton for searching character string data for the user-supplied keyword “インタフェース”. Circles represent the states of the automaton and arrows represent state transitions. The character attached to each arrow is the input character that causes the corresponding transition. In the figure, a negation such as “a character other than ‘ン’ and other than ‘イ’” is written by attaching a negation symbol to the set {“ン”, “イ”}. Arrow 403 indicates the start state, where state transitions begin. The number inside each circle is the state number. The double circle is the final state, which indicates that “インタフェース” has been matched. The characteristic of this method is that the automaton describes a state transition for every input character that can possibly occur. The number of state transitions is therefore large, and as the number of keywords grows the time needed to construct the automaton becomes extremely long.

[0019] The matching operation of conventional method 1 is described below with reference to FIG. 2. When a character is input to the automaton, the state in which that character should be matched is made explicit by placing a token. That is, the token is a marker indicating the current position among the states of the automaton. As the initial setting, the token is placed in state 0, the start state. In this example, if the input character is “イ”, the token moves to state 1. If any character other than “イ” arrives, the token moves to state 0. If the token is in state 1 and the input character is “ン”, the token moves to state 2; if it is “イ”, the token moves to state 1; if it is any character other than “イ” and “ン”, the token moves to state 0. Next, if the token is in state 2 and the input character is “タ”, the token moves to state 3. If “イ” is input here instead, the token moves to state 1. Further, if the token is in state 3 and “フェース” is input, the token moves through state 4, state 5, and state 6 to state 7. State 7 is drawn as a double circle, meaning that the character string “インタフェース” has been matched.
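The token movement described in [0019] can be sketched as a table-driven deterministic automaton. The following Python is a minimal sketch under stated assumptions, not the procedure of the cited reference: the names `build_dfa` and `find_matches` are hypothetical, the table is built with the standard single-keyword construction, and a character missing from a state's map sends the token to state 0 (this default stands in for the explicit "all other characters" arrows that conventional method 1 would enumerate). A non-empty keyword is assumed.

```python
def build_dfa(keyword):
    """Build a transition table for one keyword (hypothetical helper).

    delta[state] maps an input character to the next state.
    Characters absent from the map send the token back to state 0.
    """
    alphabet = set(keyword)
    delta = [dict() for _ in range(len(keyword) + 1)]
    delta[0][keyword[0]] = 1
    restart = 0  # state the token would be in after the longest proper suffix
    for state in range(1, len(keyword) + 1):
        # Mismatch transitions are copied from the restart state.
        for ch in alphabet:
            nxt = delta[restart].get(ch, 0)
            if nxt:
                delta[state][ch] = nxt
        if state < len(keyword):
            # Match transition: advance to the next state.
            delta[state][keyword[state]] = state + 1
            restart = delta[restart].get(keyword[state], 0)
    return delta


def find_matches(delta, text):
    """Return the end offsets in text at which the keyword is matched."""
    final = len(delta) - 1
    state = 0  # the token starts in state 0
    ends = []
    for i, ch in enumerate(text):
        state = delta[state].get(ch, 0)
        if state == final:  # double-circle state reached
            ends.append(i)
    return ends


delta = build_dfa("インタフェース")
print(find_matches(delta, "イインタフェース"))
```

For the keyword “インタフェース” the table has states 0 to 7, and an “イ” read in state 2 leads to state 1, as in the description above.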

[0020] Because conventional method 1 describes in the automaton a state transition for every input character that can possibly occur, the number of state transitions grows with the number of keywords and the time needed to construct the automaton becomes extremely long. Hardware implementing this method is disclosed in JP-A-60-105039 and JP-A-60-105040.

[0021] Next, the second method (hereinafter, conventional method 2) is described. Conventional method 2 is designed to shorten the automaton construction time relative to conventional method 1: the construction time is reduced to one third, a substantial improvement. Details are given in JP-A-63-311530. Conventional method 2 is described with reference to FIG. 3 and FIG. 4. FIG. 3, like FIG. 2, is a state transition diagram of an automaton for matching “インタフェース”. As the initial setting, the token is placed in state 0, the start state. If the character “イ” is input, it is matched in state 0, where the token is located, and the token moves to state 1. If any character other than “イ” arrives in state 0, the token moves to state 0.

[0022] If the token is in state 1 and the character “ン” is input, the token moves to state 2. If the token is in state 2 and “タ” is input, the token moves to state 3. If the token is in state 3 and a character other than “フ” is input, for example “イ”, which is not described in the automaton at that state, conventional method 2 is said to have “failed”, and the fail table of FIG. 4 is consulted. For each state number in which the token may be located, the fail table stores the state number of the fail destination, where matching is to be retried. In this case, the fail destination 0 corresponding to the current state number 3 is obtained and the token is moved to state 0. The input character “イ” is then matched there, and the token moves to state 1. This mechanism is called the fail function. If the characters of “ンタフェース” then arrive one at a time, the token moves through state 2, state 3, state 4, state 5, and state 6 to state 7. State 7 is drawn as a double circle, meaning that the character string “インタフェース” has been matched. Note that when “インタフェース”, for example, is given as the keyword, the body text may use a notation different from the search term specified by the user (a variant notation).
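The fail function of [0022] can be sketched with the goto/failure construction of the Aho and Corasick paper cited in [0016]. This is a minimal sketch under assumptions: the function names `build_goto_and_fail` and `scan` are hypothetical, and the fail values are computed by the standard breadth-first construction rather than copied from FIG. 4, which is not reproduced in this text.

```python
from collections import deque


def build_goto_and_fail(keywords):
    """Build the trie edges (goto), the fail table, and accepting sets."""
    goto = [{}]        # goto[state][char] -> next state (trie edges only)
    accepts = [set()]  # keywords recognized on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                accepts.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accepts[state].add(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to state 0
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            accepts[child] |= accepts[fail[child]]
    return goto, fail, accepts


def scan(text, goto, fail, accepts):
    """Return (start offset, keyword) for every match found in one pass."""
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # "fail": retry the same character elsewhere
        state = goto[state].get(ch, 0)
        for word in accepts[state]:
            hits.append((i - len(word) + 1, word))
    return hits


goto, fail, accepts = build_goto_and_fail(["インタフェース"])
print(fail)
print(scan("インタインタフェース", goto, fail, accepts))
```

With the single keyword “インタフェース”, whose characters are all distinct, every fail destination is state 0, which agrees with the example in the text where state 3 fails to state 0 and the character “イ” is matched again from there.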

[0023] In body text, a “−” (minus sign) may be used in place of “ー” (the long-vowel mark), as in “インタフェ−ス” (called a long-vowel-mark variant); a “ー” may be added, as in “インターフェース” (called presence or absence of the long-vowel mark); or “フェー” may be written “フェイ” because of a difference in how the pronunciation is transcribed, as in “インタフェイス” (called a pronunciation variant).

[0024] To retrieve all of these, every one of the nine words that combine these variant notations must be used as a keyword: “インタフェース”, “インターフェース”, “インタフェイス”, “インターフェイス”, “インタ−フェイス”, “インタフェ−ス”, “インターフェ−ス”, “インタ−フェ−ス”, and “インタ−フェース”.
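The nine keywords listed in [0024] are the product of two independent choices: what follows “インタ” (nothing, “ー”, or “−”) and how the ending is written (“フェース”, “フェ−ス”, or “フェイス”). The short Python sketch below enumerates them. The function name `interface_variants` and the constants are hypothetical, and the code points chosen for the two marks (U+30FC and U+2212) are an assumption about how the characters are encoded.

```python
from itertools import product

LONG = "\u30fc"   # katakana long-vowel mark "ー" (assumed code point)
MINUS = "\u2212"  # minus sign "−" (assumed code point)


def interface_variants():
    """Enumerate the variant notations of the keyword as a 3 x 3 product."""
    middles = ["", LONG, MINUS]  # what follows "インタ"
    endings = [
        "フェ" + LONG + "ス",   # standard ending
        "フェ" + MINUS + "ス",  # long-vowel-mark variant
        "フェイス",             # pronunciation variant
    ]
    return ["インタ" + m + e for m, e in product(middles, endings)]


for word in interface_variants():
    print(word)
```

Each additional independent variation multiplies the number of keywords, which is why variant-notation expansion inflates the automaton so quickly.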

[0025] An example of this case is described with reference to FIG. 5 and FIG. 6. FIG. 5 is a state transition diagram of an automaton for matching the above nine words, including the variant notations, against character string data.

[0026] The keywords are compared from their first character, and where the transition characters differ, the automaton branches to separate states.

[0027] For example, comparing the keywords “インタフェース” and “インターフェース” from the front, they agree up to “インタ” but differ in the next character, “フ” versus “ー”. The state transitions therefore branch: from state 3, the transition character “フ” leads to state 22 and the transition character “ー” leads to state 4.

[0028] That is, because a separate destination state is allocated whenever the transition characters differ in a given state, the automaton takes the form of a tree. FIG. 6 is an explanatory diagram of the fail table, which gives the transition destination when a character not shown in this automaton is input. Thus, when matching is to cover variant notations as well, the number of keywords increases and the number of states grows very large.

[0029] In character string search, don't-care characters are sometimes used in keywords. An example of a keyword with a fixed-length don't-care character is described with reference to FIG. 7 and FIG. 8. FIG. 7 is a state transition diagram of the automaton for searching for the keyword “A?B”, which contains the one-character fixed-length don't-care character “?”. FIG. 8 is an explanatory diagram of the fail table, which gives the transition destination when a character not shown in this automaton is input.

[0030] In this example the automaton is constructed for a 1-byte character code (the JIS code is used). “?” is a symbol meaning that a match with any character or symbol is allowed. The transition for the don't-care character “?” is therefore represented as transitions on every character code from 00 to FF, all originating from state 1 in the figure. In other words, “A?B” specifies a search for a character string that begins with “A”, has any one character in between, and ends with “B”.
[0031] Thus, even for such a simple search condition, including a fixed-length don't-care character greatly increases the number of states of the automaton.

[0032] A method for dealing with variant notations and synonyms is given in JP-A-62-011932. In that publication, variant-notation expansion is called variant-notation generation, and synonym expansion is called similar-word extraction.

[0033] FIG. 9 is a block diagram of the configuration of that reference. In this configuration, a search string entered in romaji or katakana is first converted entirely into a katakana string in a standardized notation. That is, by the reverse of variant-notation generation, a notation standardization step that merges multiple notations into one is performed first. A search string entered in alphabetic form is likewise unified into katakana by foreign-word-to-kana conversion.

[0034] The standardized katakana string is then expanded into similar words using a synonym dictionary, and words synonymous with the input katakana string are output as katakana strings. After similar-word extraction, each katakana string is converted into a kanji string by kana-kanji conversion, into an alphabetic foreign word by kana-to-foreign-word conversion, and into a romaji string by kana-to-romaji conversion.

[0035] In this way, the katakana strings resulting from similar-word extraction are converted into kanji, romaji, katakana, and foreign-language representations, and each is expanded into its variant notations.

[0036] In the conventional character string search apparatus 101 of FIG. 1, the character string storage means 106, a component of the apparatus, requires a magnetic disk device capable of storing large volumes of data. Ordinary magnetic disk devices cannot input and output data at high speed, while multi-head magnetic disk devices, which can, are very expensive.

[0037] Collective magnetic disk devices have therefore been considered, in which several inexpensive ordinary small magnetic disks are connected to speed up data input and output. One example is the “image data division storage device” described in JP-A-60-117326.

[0038] This device comprises multiple magnetic disk devices, the same number of magnetic disk controllers, and a master controller that controls data transport between the input/output buffers and an external device. The master controller divides data input from the external device into pieces no larger than the input/output buffer capacity and transfers the pieces to the magnetic disk controllers in turn; each magnetic disk controller writes its piece to the corresponding magnetic disk device. The master controller has the controllers of the disk devices that are not currently writing perform their seek operations, so that the seek time of the second and subsequent disk devices storing the data is apparently eliminated and the data write and read times are shortened.
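The division and sequential distribution described in [0038] can be sketched as round-robin striping. The Python below is a minimal sketch under assumptions: the names `stripe` and `unstripe` are hypothetical, each "disk" is modeled as a plain list of byte chunks, and the overlapped seek operations that give the speed-up are not modeled.

```python
def stripe(data, num_disks, buffer_size):
    """Split data into chunks of at most buffer_size bytes and hand them
    to the disks in turn (chunk 0 to disk 0, chunk 1 to disk 1, ...)."""
    disks = [[] for _ in range(num_disks)]
    for n, offset in enumerate(range(0, len(data), buffer_size)):
        disks[n % num_disks].append(data[offset:offset + buffer_size])
    return disks


def unstripe(disks):
    """Read the chunks back in the order in which they were distributed."""
    out = []
    longest = max(len(chunks) for chunks in disks)
    for i in range(longest):
        for chunks in disks:
            if i < len(chunks):
                out.append(chunks[i])
    return b"".join(out)


disks = stripe(b"abcdefghij", num_disks=3, buffer_size=2)
print(disks)
print(unstripe(disks))
```

Because consecutive chunks land on different disks, one disk can be positioning its head while another is transferring data, which is the effect the cited device relies on.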

[0039] Moreover, none of the prior art has taken into consideration the registration of data for performing full-text search.

[0040]

[Problems to Be Solved by the Invention] Attempting to search a large text database with a conventional search system such as that shown in FIG. 1 raises several problems. The first is search time. For example, a full-text search over 20,000 documents of 20 KB each requires scanning 400 MB of data. Even if this 400 MB of text data is stored in the character string storage means, read out at an average effective rate of about 1 MB/s, and matched by the character string matching means at an equivalent rate, the search takes about 7 minutes to complete. That is, with an ordinary magnetic disk device, reading the text data takes too long for practical use. The read rate of the character string storage means holding the text data must therefore be raised to about the same level as the processing rate of the character string matching means. This is the first problem to be solved by the present invention.

[0041] However, even if the read rate of the character string storage means is raised to the level of the character string matching means, for example to 10 MB/s, scanning the 400 MB of text data still takes 40 seconds. Bringing this down to a few seconds, which is acceptable in practice, is the second problem addressed by the present invention.
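The figures in [0040] and [0041] follow from dividing the data volume by the read rate. The short Python check below reproduces them; the function name `scan_seconds` is hypothetical, and decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) are an assumption, since the text does not say which convention it uses.

```python
KB = 1_000
MB = 1_000_000


def scan_seconds(num_docs, bytes_per_doc, bytes_per_second):
    """Time to read every document once at a constant effective rate."""
    return num_docs * bytes_per_doc / bytes_per_second


total_bytes = 20_000 * 20 * KB                  # 400 MB to scan
slow = scan_seconds(20_000, 20 * KB, 1 * MB)    # ordinary disk, about 1 MB/s
fast = scan_seconds(20_000, 20 * KB, 10 * MB)   # storage raised to 10 MB/s

print(total_bytes // MB, "MB")
print(slow, "s, about", round(slow / 60), "minutes")
print(fast, "s")
```

Raising the transfer rate alone gives only a tenfold gain, so reaching a few seconds requires reducing the amount of text that is actually scanned.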

[0042] As a technique for speeding up this scan, JP-A-62-241026, “Character string search method”, has been filed. In that method, to speed up the process of searching a text database (called a file there) for a specified character string, the content of the text (called data there) is examined in advance to determine which characters are used and how often, and a “used-character frequency distribution table” is created.

[0043] At search time, this “used-character frequency distribution table” is consulted, and the text is first searched using the least frequently used character in the user-specified keyword as a clue; where that character is found, the characters before and after it are then also compared.

[0044] JP-A-62-241026 also states that when the least frequent character of the keyword has a frequency of zero in the “used-character frequency distribution table”, the search can be finished without scanning the text at all.
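The approach summarized in [0042] to [0044] can be sketched as follows. This is an illustrative reading of the description, not the implementation of JP-A-62-241026: the names `build_frequency_table` and `search` are hypothetical, the table is a plain character counter over the whole database, and a non-empty keyword is assumed.

```python
from collections import Counter


def build_frequency_table(texts):
    """Count how often each character is used across the whole database."""
    table = Counter()
    for text in texts:
        table.update(text)
    return table


def search(keyword, texts, table):
    """Return (document index, start offset) for every occurrence."""
    # Pick the keyword character that is rarest in the database.
    rare = min(keyword, key=lambda ch: table[ch])
    if table[rare] == 0:
        return []  # the character never occurs: no scan is needed
    offset = keyword.index(rare)
    hits = []
    for doc_id, text in enumerate(texts):
        pos = text.find(rare)
        while pos != -1:
            start = pos - offset
            # Compare the characters before and after the rare one.
            if start >= 0 and text.startswith(keyword, start):
                hits.append((doc_id, start))
            pos = text.find(rare, pos + 1)
    return hits


texts = ["abcabd", "xyz"]
table = build_frequency_table(texts)
print(search("abd", texts, table))
print(search("abq", texts, table))
```

The early exit applies only when a keyword character is absent from the entire database, because the table is built once for all documents.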

[0045] According to JP-A-62-241026, therefore, the number of wasted character comparisons can be reduced, which has the effect of increasing search processing speed.

[0046] However, that method creates the “used-character frequency distribution table” for the database (file) as a whole and searches the text files (data) in it on that basis. It therefore improves search efficiency when the keyword contains a character that never appears anywhere in the database. In general, though, as the database grows, almost no character fails to appear at least once somewhere in the database, so the benefit of the method largely disappears.

【００４７】こうした問題を解決し、効率的なサーチ処
理を実現し、延いては等価的に高速なフルテキストサー
チを可能とすることが本発明の第二の課題となる。It is a second object of the present invention to solve such a problem, realize an efficient search process, and enable an equivalently high-speed full-text search.

【００４８】一方、自由語を用いたフルテキストサーチ
においては、しばしば検索者が指定したキーワードとテ
キスト本文中に記述されている言葉の間に、同じ意味を
表していても表現に食い違いがあることがある。このよ
うな場合には、異なる表現形態を持つ文献が検索漏れと
なり、目的の文書が検索されないことが生じてくる。こ
のような言葉の例として、同義語や異形語（異表記語あ
るいは単に異表記とも呼ぶ）などがある。同義語の例と
しては「計算機」に対して「電子計算機」や「電算機」
「Computer」などが挙げられる。また、異表記の例とし
ては、「コンピュータ」に対して「コンピューター」や
「コンピュータ」，「コンピューター」，「コンヒ°ュ
ーター」，「コンヒ°ユータ」，「コンヒ°ユータ
ー」，「コンピュータ」，「コンピュ−タ−」や「コン
ピュ−タ」，「コンピユ−タ−」，「コンヒ°ュ−
タ」，「コンヒ°ュ−タ−」，「コンヒ°ユ−タ」，
「コンヒ°ユ−タ−」が、「Computer」に対して「comp
uter」，「COMPUTER」などが挙げられる。検索者が指定
するキーワードと文書の内容に記述されている言葉との
表記上の食い違いの問題に対処するためには検索者がこ
れらの同義語や異表記をすべて指定して検索を行う必要
がある。しかし、異表記などは場合によって数百にも及
ぶ形態を取り得るため、検索者が一々指定するのは事実
上困難である。こうした問題を解決するのが、本発明の
第三の課題である。On the other hand, in a full-text search using free words, the keyword specified by the searcher and the words written in the text often differ in expression even though they mean the same thing. Documents that use a different expression are then missed, and the target document is not retrieved. Such words include synonyms and variant forms (also called variant notations). Examples of synonyms of "計算機" (computer) are "電子計算機", "電算機", and "Computer". Examples of variant notations of "コンピュータ" are "コンピューター" and the other forms listed in the Japanese text above, which vary, for example, the small ュ and the long-vowel mark; variant notations of "Computer" include "computer" and "COMPUTER". To cope with this mismatch in notation, the searcher would have to specify all of these synonyms and variant notations. However, variant notations can number in the hundreds in some cases, so specifying them one by one is practically impossible. Solving this problem is the third object of the present invention.

【００４９】すなわち、上記従来例では、表記を標準化
する際に、元の文字列が持つ情報を変えてしまうため期
待する展開結果が得られないことがあった。That is, in the conventional example above, standardizing the notation alters the information carried by the original character string, so the expected expansion result was sometimes not obtained.

【００５０】このことを、カタカナ表記の標準化用の部
分文字列の変換ルール「“ホオ”→“ホウ”」を例にし
て説明する。この変換ルールを適用すると文字列“ジョ
ウホオ”を“ジョウホウ”（情報）と正しく標準化され
る。しかし、この同じ変換ルールを用いても“ジョウホ
オン”（定保温）が入力された場合には“ジョウホウ
ン”と誤った文字列へ標準化してしまう。このことは標
準化処理の後の同義語展開処理、更にその後に続く異表
記展開処理に影響をおよぼし、期待する展開結果が得ら
れないことになる。本発明の課題の一つは上記の標準化
を行なわずに、常に期待する展開結果を得ることにあ
る。This is explained using the substring conversion rule "ホオ" → "ホウ", a rule for standardizing katakana notation. Applying this rule correctly standardizes the character string "ジョウホオ" to "ジョウホウ" (情報, information). However, when "ジョウホオン" (定保温, constant heat retention) is input, the same rule standardizes it to the incorrect string "ジョウホウン". This affects the synonym expansion that follows standardization and the variant-notation expansion after that, so the expected expansion result is not obtained. One object of the present invention is to always obtain the expected expansion result without performing such standardization.
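The failure described in this paragraph can be reproduced with a minimal Python sketch. The function name `normalize` and the one-entry rule table are illustrative assumptions, not part of the patent; the rule itself and the two inputs follow the text, reading 定保温 as ジョウホオン.

```python
# Illustrative only: a single substring rule of the kind used for katakana standardization.
RULES = [("ホオ", "ホウ")]

def normalize(text: str) -> str:
    """Apply every substring rule blindly, as the conventional method does."""
    for src, dst in RULES:
        text = text.replace(src, dst)
    return text

print(normalize("ジョウホオ"))    # intended rewrite toward ジョウホウ (情報)
print(normalize("ジョウホオン"))  # unintended rewrite: the rule fires across ホ|オン
```

Because the rule sees only substrings, it cannot tell the two cases apart, which is why the text proposes avoiding standardization altogether.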

【００５１】また上記従来技術では、同義語辞書によっ
て“計算機”から“コンピュータ”にキーワードを同義
語展開するときに、ユーザが入力する検索キーワード
を、一旦すべてカタカナ表現に変換してから同義語展開
し、そのあとでカナ漢字変換、カナローマ字変換及びカ
ナ外国語変換をする構成となっている。そのため、同義
語辞書は必ずカタカナ文字列からカタカナ文字列へ展開
するようなものでなければならなかった。すなわち、
見出し語：“コンピュータ”
同義語１：“ケイサンキ”
同義語２：“ジョウホウショリソウチ”
などと、単語間の同義関係を常にカタカナ文字列で記述
しなければならなかった。このことは、同義語展開後の
カナ漢字変換辞書及びカナ外来語変換辞書でも、必ずこ
れらに対応する表現の文字列を出力するよう登録してお
かなければならないために、辞書が大きくなるという問
題がある。また日本語には同じ読みを持っていても、意
味の異なる同音異義語が多く存在し、これが同義語展開
時に弊害を生じる。例えば“ケンサク”という文字列は
“検索”とも解釈できるし“研削”とも解釈できるの
で、カタカナ表現のみによる同義語辞書では両者を区別
できないという問題がある。さらに、同義語展開後のカ
タカナ漢字変換において、同音異義語を選択をユーザが
対話的に行わなければならないという問題があった。In the above prior art, when a keyword is expanded from "計算機" to "コンピュータ" by the synonym dictionary, the search keyword entered by the user is first converted entirely into katakana, synonym expansion is then performed, and kana-kanji, kana-romaji, and kana-foreign-language conversion follow. The synonym dictionary therefore always had to expand katakana strings into katakana strings. That is, synonym relations between words always had to be written in katakana, for example headword "コンピュータ", synonym 1 "ケイサンキ", synonym 2 "ジョウホウショリソウチ". As a result, the kana-kanji and kana-foreign-word conversion dictionaries used after synonym expansion must also be registered so that they output the corresponding strings, which makes the dictionaries large. In addition, Japanese has many homophones that share a reading but differ in meaning, and these cause problems during synonym expansion. For example, the string "ケンサク" can be read as "検索" (search) or as "研削" (grinding), so a synonym dictionary written only in katakana cannot distinguish the two. Furthermore, in the kana-kanji conversion after synonym expansion, the user had to choose among homophones interactively.

【００５２】また、検索キーワードをカタカナ表現に変
換するための外国語カナ変換辞書や、同義語展開した後
のカナ漢字変換辞書及びカナ外国語変換辞書が必要であ
り、多種類の大規模な辞書を使うためにその作成と保守
が大変となるという問題もある。すなわち、本発明の第
三の課題は上記のカナ漢字変換、カナ外国語変換時にお
ける同音異義語の問題と、これらの変換に用いる大規模
な辞書の作成、保守の問題を解決することにある。In addition, a foreign-language-to-kana conversion dictionary is needed to convert search keywords into katakana, and kana-kanji and kana-foreign-language conversion dictionaries are needed after synonym expansion. Using many kinds of large dictionaries makes them laborious to create and maintain. That is, the third object of the present invention is to solve the homophone problem in kana-kanji and kana-foreign-language conversion and the problem of creating and maintaining the large dictionaries used for these conversions.

【００５３】また、こうした数百にも及ぶ同義語や異表
記を含めてキーワードとして検索を行おうとすると、ど
うしてもこれらを一括して照合する文字列照合手段が必
要となってくる。さもなければ、同義語や異表記を含め
て検索すると、検索時間が数百倍掛ってしまい、とても
実用に耐えられなくなってしまう。このように一千語に
近い語数のキーワードが指定されても、照合速度が低下
することなく検索処理を行い得る文字列照合手段を提供
することが、本発明の第四の課題である。Further, searching with keywords that include hundreds of such synonyms and variant notations requires a character string matching means that matches all of them at once. Otherwise, the search would take hundreds of times longer and would be impractical. The fourth object of the present invention is to provide a character string matching means that can perform the search without a drop in matching speed even when close to one thousand keywords are specified.

【００５４】また、従来のオートマトンを用いた検索方
式では、異表記の場合、異表記を含むキーワードを全て
列挙し、キーワードに展開する。さらに、これらに基づ
いたオートマトンを作成する。ここで作成されるオート
マトンは木状に記述されるため非常に多くのオートマト
ンの状態が必要となる。In a conventional search method using an automaton, in the case of a different notation, all keywords including the different notation are enumerated and expanded into keywords. Furthermore, an automaton based on these is created. Since the automaton created here is described in a tree shape, a large number of automaton states are required.

【００５５】また、don't care文字指定検索を行なう場
合もdon't care文字の部分が許容する文字コードの全て
の組合せを列挙し、キーワードに展開する。これらに基
づきオートマトンを作成するため、異表記と同様に、非
常に多くのオートマトンの状態が必要となる。このよう
にオートマトンの状態数の増加は、オートマトン作成時
間の増加や、更にはオートマトンを格納するための状態
遷移テーブルの容量が増加、すなわちハードウェアの増
大という問題を発生する。Likewise, in a search that specifies don't-care characters, every combination of character codes allowed at the don't-care positions is enumerated and expanded into keywords. Because the automaton is built from these, a very large number of states is needed, as with variant notations. A larger number of states lengthens automaton creation and enlarges the state transition table that stores the automaton, that is, it increases the hardware.

【００５６】本発明はオートマトンを用いた検索方式に
おいて、異表記やdon't care文字が指定された検索を行
なう場合もオートマトンの遷移を網状にまとめて記述す
ることにより、状態数を従来より低減しオートマトンの
作成時間の短縮をはかると共に、状態遷移テーブルの容
量が小さくて済むためコンパクトなハードウェアで実現
可能な検索方式を提供することを目的とする。An object of the present invention is to provide an automaton-based search method in which, even for searches that specify variant notations or don't-care characters, the transitions of the automaton are described together as a network. This reduces the number of states compared with the conventional method, shortens automaton creation time, and keeps the state transition table small, so the method can be realized with compact hardware.
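The difference in state counts can be illustrated with a small Python sketch. It only counts states and is not the patent's construction, which this section does not detail. The `pattern` representation (one tuple of allowed characters per position) and both function names are assumptions made for illustration.

```python
from itertools import product

def trie_states(pattern):
    """States of a tree-shaped automaton built from every expanded keyword."""
    prefixes = {()}  # the root state
    for word in product(*pattern):
        for i in range(1, len(word) + 1):
            prefixes.add(word[:i])
    return len(prefixes)

def network_states(pattern):
    """States when each position is one state and alternatives share a transition."""
    return len(pattern) + 1

# Hypothetical keyword with two allowed characters at each of three positions.
pattern = [("A", "a"), ("B", "b"), ("C", "c")]
print(trie_states(pattern), network_states(pattern))
```

In this toy case the tree needs one state per distinct prefix of the eight expanded keywords, while the network needs only one state per position.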

【００５７】さらに、テキストデータベースに文書デー
タが逐次登録されて行くと、ある時点で文字列記憶手段
を構成する磁気ディスク装置の容量が満杯に達してしま
う場合がでてくる。こうした時にも、それ迄蓄積したデ
ータを損なうことなくシステムの蓄積容量を拡大できる
ことが必要となる。また、被検索テキストデータベース
の容量が例えば１０万件、すなわち４ＧＢにも達する程
に大規模化してきた場合、唯単に磁気ディスク装置の格
納容量を拡張するだけでは処理時間が増加し、当初の目
的が達っせなくなってしまう。検索時間を低下させるこ
となく、蓄積容量の大規模化に応えられなければならな
い。こうした要求に応え得るアーキテクチャを持つ検索
装置を提供することが、本発明の第五の課題である。Further, as document data is successively registered in the text database, the magnetic disk device that constitutes the character string storage means may become full at some point. Even then, it must be possible to expand the storage capacity of the system without losing the data accumulated so far. Moreover, when the text database to be searched grows to, for example, 100,000 documents, that is, as much as 4 GB, simply expanding the storage capacity of the magnetic disk device increases processing time and defeats the original purpose. The system must accommodate larger storage capacity without lengthening search time. The fifth object of the present invention is to provide a search device whose architecture can meet these requirements.

【００５８】文字列検索装置の文字列記憶手段で重要と
なる要素は、記憶容量が大きいこと、ファイルのサイズ
にかかわらず、複数のファイルを連続的に高速で入出力
できること、安価であることの３点であり、これらの要
素を満足する集合型磁気ディスク装置が必要とされてい
る。The important factors in the character string storage means of the character string search device are that the storage capacity is large, that a plurality of files can be continuously input / output at a high speed regardless of the file size, and that it is inexpensive. There are three points, and a collective magnetic disk drive that satisfies these factors is needed.

【００５９】従来技術では、ただシーク時間のアクセス
時間を見掛け上なくすことにより、データの書き込み読
み出し時間を短縮しようとするもので、外部機器の要求
するデータ転送速度に対して何台の磁気ディスク装置を
用いて構成すれば良いかについて配慮されておらずコス
トパフォーマンスの点で問題があった。The prior art merely tries to shorten data write and read times by apparently eliminating access time such as seek time. It gives no consideration to how many magnetic disk devices are needed to meet the data transfer rate required by the external device, which is a problem in terms of cost performance.

【００６０】また、従来技術は画像データのようにデー
タサイズの大きなファイルが複数の磁気ディスク装置に
またがるような場合にはアクセス時間を削減できる効果
があるが、複数の磁気ディスク装置にまたがらないデー
タサイズの小さなファイルの書き込み、読み出しを行な
う場合には、シーク時間を隠すことができず、１台の磁
気ディスク装置と同じアクセス時間となってしまう問題
があった。The prior art reduces access time when a file with a large data size, such as image data, spans a plurality of magnetic disk devices. When writing or reading a small file that does not span a plurality of devices, however, the seek time cannot be hidden, and the access time is the same as with a single magnetic disk device.

【００６１】また、従来技術は複数のファイルの連続的
な書き込み、読み出しを行なう点に配慮がされておら
ず、上位機器からの書き込み、読み出し命令を１件のフ
ァイルについてのみ処理可能で、複数のファイルをアク
セスする場合には、１件の処理を繰返し行なう必要があ
り、それに要するオーバヘッド時間が長くなってしまう
問題があった。Further, the prior art does not take into consideration the point that continuous writing and reading of a plurality of files are performed, and it is possible to process write and read commands from a higher-level device for only one file. When accessing a file, it is necessary to repeat one process, and there is a problem that the overhead time required for the process becomes long.

【００６２】また、オーバヘッド時間のひとつとして、
上位機器からアクセス対象となるファイルを指定するた
めのファイル識別コードから磁気ディスク装置の格納位
置情報を検索する処理がある。従来の一般的な磁気ディ
スク装置では、ファイル識別コードとしてＡＳＣＩＩコ
ード等の文字コード列で構成されるファイル名称で表現
されており、このファイル名称により、磁気ディスク装
置のファイル管理情報エリアに格納されているファイル
管理情報を検索して物理的な格納位置を求めなければな
らず、それに要する処理時間が大きい問題があった。
本発明の目的は、記憶容量が大きい、ファイルのサイズ
にかかわらず複数のファイルを連続的に高速に入出力で
きる、安価な集合型磁気ディスク装置を提供するもので
ある。一方、文書情報はテキストデータだけで構成され
ている訳ではなく、図面や写真などもその構成要素とし
て含まれている。したがって、検索された文献の印刷イ
メージでの閲読の要求にも応えることが必要になる。こ
れに応え得るアーキテクチャを持つ検索装置を提供する
ことが本発明の第六の課題である。One source of overhead is the processing that looks up storage position information on the magnetic disk device from the file identification code by which the higher-level device specifies the file to be accessed. In a conventional, general magnetic disk device, the file identification code is a file name made up of a string of character codes such as ASCII codes. The physical storage position has to be found by searching, with this file name, the file management information stored in the file management information area of the magnetic disk device, and this processing takes a long time.
An object of the present invention is to provide an inexpensive collective magnetic disk device that has a large storage capacity and can input and output a plurality of files continuously at high speed regardless of file size. Meanwhile, document information does not consist of text data alone; drawings, photographs, and the like are also among its components. It is therefore necessary to meet requests to read a retrieved document as a printed image. The sixth object of the present invention is to provide a search device whose architecture can meet this need.

【００６３】さらに、テキストデータベースは複数のユ
ーザによって共有されるべきものであり、例えばＬＡＮ
（ローカルエリアネットワーク）を介して検索対話
用のワークステーションからアクセスできなければなら
ない。したがって、検索装置はＬＡＮに接続され、他の
複数のワークステーションからの検索要求に応えられる
機能を持たなければならない。こうした機能を備えた全
文検索装置を提供することが、本発明の第七の課題であ
る。以上述べた各課題に応え得るフルテキストサーチシ
ステムを提供することが本発明の最終的な目的である。
特に、このフルテキストサーチシステムに好適なデータ
の登録を提供することを目的とする。Further, a text database should be shared by a plurality of users and must be accessible from workstations used for search dialogue via, for example, a LAN (local area network). The search device must therefore be connected to the LAN and be able to respond to search requests from a plurality of other workstations. The seventh object of the present invention is to provide a full-text search device with this function. The ultimate object of the present invention is to provide a full-text search system that can meet each of the objects described above.
In particular, an object is to provide data registration suited to this full-text search system.

【００６４】[0064]

【課題を解決するための手段】本発明では、上記の目的
を達成するために以下の構成とした。検索対象となり得
るデータをデータ格納手段に格納すると共に、サーチの
際検索キーワード自身を含む可能性のないデータを除く
ことが可能な検索ファイルを登録する。検索ファイルと
しては、予め定めた各文字が前記登録されたデータに含
まれるか否かを示す「文字成分表」、登録されたデータ
中に繰り返し現れる単語の重複を排除した「凝縮本文デ
ータ」がある。To achieve the above objects, the present invention adopts the following configuration. Data that may become a search target is stored in the data storage means, and a search file is also registered that makes it possible, at search time, to exclude data that cannot contain the search keyword. Search files include a "character component table", which indicates whether each predetermined character is contained in the registered data, and "condensed text data", in which duplicate occurrences of words that appear repeatedly in the registered data are eliminated.
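A minimal Python sketch of such a registration step follows. It assumes the document arrives already segmented into words and that the predetermined character set is the hypothetical string `ALPHABET`; the function names and the dictionary layout of `store` are likewise illustrative and are not taken from the patent.

```python
def build_component_table(text, alphabet):
    """Bit i is set when alphabet[i] occurs anywhere in text."""
    bits = 0
    for i, ch in enumerate(alphabet):
        if ch in text:
            bits |= 1 << i
    return bits

def build_condensed_text(words):
    """Keep the first occurrence of each word, preserving order."""
    return list(dict.fromkeys(words))

def register(store, doc_id, words, alphabet):
    """Store the body together with its two search files."""
    text = "".join(words)
    store[doc_id] = {
        "body": text,
        "condensed": build_condensed_text(words),
        "components": build_component_table(text, alphabet),
    }

store = {}
ALPHABET = "検索文書登録"  # hypothetical predetermined character set
register(store, 1, ["文書", "検索", "文書", "登録", "検索"], ALPHABET)
print(store[1]["condensed"])
print(bin(store[1]["components"]))
```

Storing one bit per predetermined character keeps the table small enough to scan for every document before any text is read.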

【００６５】[0065]

【発明の実施の形態】以下、本発明の第一の実施例を、
図１０を用いて説明する。本実施例は、キーボード１１
０１、サーチマシン制御用コンピュータ（ＣＰＵ０）１
１５０、ディスプレイ１１２０、オートマトン生成用コ
ンピュータ（ＣＰＵ１）１１０５ａ、ビットサーチ用コ
ンピュータ（ＣＰＵ３）１１０７ａ、ストリングサーチ
エンジン１１０６、複合条件判定用コンピュータ（ＣＰ
Ｕ２）１１４５ａ、検索結果格納メモリ１１４６、及び
テキストデータファイル１１１０から構成される。ま
た、サーチマシン制御用コンピュータ（ＣＰＵ０）１１
５０では、検索式解析プログラム１１０２、同義語異表
記展開プログラム１１０３ａ、複合条件解析プログラム
１１４１ａ、検索実行制御プログラム１１０８、及び検
索結果表示プログラム１１４７が実行され、オートマト
ン生成用コンピュータ（ＣＰＵ１）１１０５ａではオー
トマトン生成プログラム１１０５が、ビットサーチ用コ
ンピュータ（ＣＰＵ３）１１０７ａではビットサーチプ
ログラム１１０７が、複合条件判定用コンピュータ（Ｃ
ＰＵ２）１１４５ａでは複合条件判定プログラム１１４
５が実行される。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS A first embodiment of the present invention is described below with reference to FIG. 10. This embodiment comprises a keyboard 1101, a search machine control computer (CPU0) 1150, a display 1120, an automaton generation computer (CPU1) 1105a, a bit search computer (CPU3) 1107a, a string search engine 1106, a compound condition determination computer (CPU2) 1145a, a search result storage memory 1146, and a text data file 1110. The search machine control computer (CPU0) 1150 executes a search expression analysis program 1102, a synonym and variant-notation expansion program 1103a, a compound condition analysis program 1141a, a search execution control program 1108, and a search result display program 1147. The automaton generation computer (CPU1) 1105a executes an automaton generation program 1105, the bit search computer (CPU3) 1107a executes a bit search program 1107, and the compound condition determination computer (CPU2) 1145a executes a compound condition determination program 1145.

【００６６】先ず、キーボード１１０１から入力された
検索条件式はサーチマシン制御用コンピュータ（ＣＰＵ
０）１１５０上の検索式解析プログラム１１０２により
解析される。すなわち、検索式解析プログラム１１０２
では検索条件式を構成するキーワード部分とそれらの包
含条件及び配置条件を記述した複合条件記述部に分離さ
れる。包含条件は論理条件として記述され、配置条件は
近傍条件や文脈条件として記述されたものである。分離
抽出後、キーワード部分は同じくＣＰＵ０１１５０上の
同義語異表記展開プログラム１１０３ａに渡され、複合
条件記述部は複合条件解析プログラム１１４１ａに渡さ
れる。First, the search condition expression entered from the keyboard 1101 is analyzed by the search expression analysis program 1102 on the search machine control computer (CPU0) 1150. The search expression analysis program 1102 separates the expression into the keyword part and the compound condition description part, which describes the inclusion conditions and arrangement conditions of the keywords. Inclusion conditions are written as logical conditions, and arrangement conditions as neighborhood conditions or context conditions. After separation, the keyword part is passed to the synonym and variant-notation expansion program 1103a, also on CPU0 1150, and the compound condition description part is passed to the compound condition analysis program 1141a.

【００６７】同義語異表記展開プログラム１１０３ａで
は、ここに内蔵された同義語辞書を参照して入力された
キーワードの同義語が、また変換ルールによって異表記
が求められる。例えば、“計算機”というキーワードが
入力されると、同義語としては“計算機”のほかに“電
算機”や“コンピュータ”などが生成され、異表記とし
ては“コンピュータ”から“コンピューター”などが生
成される。The synonym and variant-notation expansion program 1103a obtains synonyms of the input keyword by consulting its built-in synonym dictionary, and obtains variant notations by applying conversion rules. For example, when the keyword "計算機" (computer) is input, synonyms such as "電算機" and "コンピュータ" are generated in addition to "計算機", and variant notations such as "コンピューター" are generated from "コンピュータ".
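The two-step expansion in this paragraph can be sketched in Python as follows. The dictionary contents come from the example in the text; the single variant rule (a word ending in タ may also end in ター) and the names `SYNONYMS`, `variants`, and `expand` are assumptions for illustration only.

```python
SYNONYMS = {"計算機": ["電算機", "コンピュータ"]}  # toy dictionary from the example

def variants(word):
    """Toy variant rule: a katakana word ending in タ may also end in ター."""
    result = [word]
    if word.endswith("タ"):
        result.append(word + "ー")
    return result

def expand(keyword):
    """Synonym expansion first, then variant-notation expansion of each term."""
    expanded = []
    for term in [keyword] + SYNONYMS.get(keyword, []):
        for v in variants(term):
            if v not in expanded:
                expanded.append(v)
    return expanded

print(expand("計算機"))
```

Note that the synonym dictionary here maps kanji directly to katakana, so no intermediate all-katakana form is needed.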

【００６８】同義語としては、上記の例のような同位語
のほかに、上位語や下位語、関連語などがあり、これら
も含めて同義語として展開される。この場合の上位語の
例としては“電子機器”などがあり、下位語としては
“電卓”など、関連語としては“オフィスオートメーシ
ョン”などがある。Besides equivalent terms like those in the example above, synonyms include broader terms, narrower terms, and related terms, and all of these are expanded as synonyms. In this case, an example of a broader term is "電子機器" (electronic equipment), a narrower term is "電卓" (calculator), and a related term is "オフィスオートメーション" (office automation).

【００６９】また、異表記展開としては、カタカナ展開
のほか、漢字ひらがな展開、アルファベット展開があ
る。図示されているのはこの中のカタカナ展開の例であ
る。漢字ひらがな展開としては、新旧字体の変換と送り
がな展開がある。新旧字体変換の例としては、“斉”か
ら“齋”、“齊”への変換などがある。また、送りがな
展開としては、“読取”から“読取り”、“読み取り”へ
の展開などがある。アルファベット展開としては、ロー
マ字のヘボン式展開、ローマ字の訓令式展開及びアルフ
ァベットの大文字小文字展開がある。ローマ字の訓令
式展開の例としては“チシキ”から“ＴＩＳＩＫＩ”へ
の展開が、ローマ字のヘボン式展開の例としては“CHISHI
KI”への展開があり、アルファベットの大文字小文字展
開例としては“ＴＩＳＩＫＩ”から“ｔｉｓｉｋｉ”へ
の展開などがある。Variant-notation expansion includes katakana expansion, kanji and hiragana expansion, and alphabet expansion. The figure shows an example of katakana expansion. Kanji and hiragana expansion covers conversion between new and old character forms and okurigana expansion. An example of character-form conversion is "斉" to "齋" and "齊"; an example of okurigana expansion is "読取" to "読取り" and "読み取り". Alphabet expansion covers Hepburn romanization, Kunrei romanization, and upper and lower case. An example of Kunrei romanization is "チシキ" to "ＴＩＳＩＫＩ", an example of Hepburn romanization is the expansion to "CHISHIKI", and an example of case expansion is "ＴＩＳＩＫＩ" to "ｔｉｓｉｋｉ".

【００７０】以上説明した同義語展開並びに異表記展開
の展開種類については、ユーザの指定によって組み合わ
せ選択できるようにすることも可能である。The types of synonym expansion and variant-notation expansion described above can also be made selectable, in any combination, by the user.

【００７１】英語の同義語の例としては
looking glass → mirror
pingpong → table tennis
the Lord → God
typhoon → cyclone → hurricane
WS → work station
等があり、英語の異表記の例としては
center → centre
liter → litre
brier → briar
humor → humour
modeler → modeller
Chile → Chili
orangutan → orangoutan → orangoutang
MacDonald → McDonald
等の例がある。Examples of English synonyms include looking glass → mirror, pingpong → table tennis, the Lord → God, typhoon → cyclone → hurricane, and WS → work station. Examples of English variant spellings include center → centre, liter → litre, brier → briar, humor → humour, modeler → modeller, Chile → Chili, orangutan → orangoutan → orangoutang, and MacDonald → McDonald.

【００７２】さらに、ドイツ語の同義語の例としては
Brief → Schreiben
Mostert → Mostrich
Maschine → Motor
等があり、ドイツ語の異表記の例としては
Foto → Photo
Coda → Koda
Code → Kode
Buffet → Buffet
Friburg → Fribourg
等が挙げられる。Further, examples of German synonyms include Brief → Schreiben, Mostert → Mostrich, and Maschine → Motor. Examples of German variant spellings include Foto → Photo, Coda → Koda, Code → Kode, Buffet → Buffet, and Friburg → Fribourg.

【００７３】こうして同義語及び異表記展開されたキー
ワード群は、次にオートマトン生成用コンピュータ（Ｃ
ＰＵ１）１１０５ａ上のオートマトン生成プログラム１
１０５に送られる。The keyword group obtained by synonym and variant-notation expansion is then sent to the automaton generation program 1105 on the automaton generation computer (CPU1) 1105a.

【００７４】オートマトン生成プログラム１１０５で
は、同義語異表記展開プログラム１１０３ａから送られ
てきたキーワード群に対して、これらを一括照合するオ
ートマトンを作成する。同義語及び異表記展開を施す
と、初期入力されたキーワードの数によっては、数百に
も及ぶ展開結果が得られることになる。The automaton generation program 1105 creates an automaton for collectively collating the keyword groups sent from the synonym variant expression expansion program 1103a. By performing synonym and variant notation expansion, several hundreds of expansion results can be obtained depending on the number of initially input keywords.

【００７５】これらのキーワードを一つずつ入力テキス
トデータから探索していたので、高速な検索を実現する
ことが不可能である。すなわち、これらのキーワードを
まとめて、テキストデータをただ一回走査するだけで探
索する必要がある。このように複数のキーワードを一括
して照合する（多量照合とも呼ぶ）方法としてオートマ
トンを用いた照合方法が知られている。その中で、この
オートマトンをハードウェアで実行する方式として「特
開昭６３−３１１５３０」を提案している。サーチエン
ジン１１０６はこの方式をさらに発展させて実現した高
速多重文字列照合回路である。したがって、本オートマ
トン生成プログラム１１０５では、このサーチエンジン
１１０６に設定する状態遷移テーブルと照合すべきキー
ワードの識別コード情報を生成し、これらをサーチエン
ジン１１０６へ転送することになる。If these keywords were searched for in the input text data one at a time, high-speed search could not be achieved. The keywords must instead be handled together and found in a single scan of the text data. Matching with an automaton is a known way to match multiple keywords at once (also called mass matching). Japanese Patent Application Laid-Open No. 63-311530 proposes a method of executing such an automaton in hardware. The search engine 1106 is a high-speed multiple character string matching circuit that develops this method further. The automaton generation program 1105 therefore generates the state transition table to be set in the search engine 1106 and the identification code information of the keywords to be matched, and transfers them to the search engine 1106.
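As a software illustration of matching many keywords in one scan, the sketch below uses the well-known Aho-Corasick construction in Python. This technique is substituted here for illustration; the hardware state transition table of the search engine 1106 is not described in this section. The keyword identifiers, the sample text, and the zero-based position convention are assumptions.

```python
from collections import deque

def build_automaton(keywords):
    """Build goto/fail/output tables from a {keyword_id: keyword} map."""
    goto, fail, out = [{}], [0], [[]]
    for kid, word in keywords.items():
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(kid)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit keywords ending at the suffix
    return goto, fail, out

def scan(text, automaton):
    """Scan text once; yield (keyword_id, index of the keyword's last character)."""
    goto, fail, out = automaton
    state = 0
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kid in out[state]:
            yield kid, pos

keywords = {1: "知的検索", 2: "検索"}  # hypothetical keyword identifiers
text = "あいまい検索のための知的検索技術を開発した。"
print(list(scan(text, build_automaton(keywords))))
```

The scan cost depends on the text length rather than on the number of keywords, which is the property the paragraph asks for.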

【００７６】また、同義語異表記展開プログラム１１０
３ａで同義語及び異表記展開されたキーワード群は、該
当キーワード識別コード（キーワード識別子とも呼ぶ）
と共に、ビットサーチ用コンピュータ（ＣＰＵ３）１１
０７ａ上のビットサーチプログラム１１０７へ渡され
る。The keyword group obtained by synonym and variant-notation expansion in the program 1103a is also passed, together with the corresponding keyword identification codes (also called keyword identifiers), to the bit search program 1107 on the bit search computer (CPU3) 1107a.

【００７７】一方、検索式解析プログラム１１０２から
入力検索条件式中の複合条件記述部を受け取ったサーチ
マシン制御用コンピュータ（ＣＰＵ０）１１５０上の複
合条件解析プログラム１１４１では、近傍条件や文脈条
件、並びに論理条件などを解析し、各条件を判定するた
めの制御情報として、指定されたキーワードの識別コー
ドとその間の指定距離情報や指定文脈コード情報及び指
定論理条件コード情報に変換され、複合条件判定用コン
ピュータ（ＣＰＵ２）１１４５ａ上の複合条件判定プロ
グラム１１４５に渡される。Meanwhile, the compound condition analysis program 1141 on the search machine control computer (CPU0) 1150 receives the compound condition description part of the input search condition expression from the search expression analysis program 1102 and analyzes the neighborhood conditions, context conditions, and logical conditions. As control information for judging each condition, it converts them into the identification codes of the specified keywords together with the specified distance information between them, specified context code information, and specified logical condition code information, and passes these to the compound condition determination program 1145 on the compound condition determination computer (CPU2) 1145a.

【００７８】さて、上述した検索式解析処理、同義語異
表記展開処理、オートマトン生成処理、複合条件解析処
理が終わり、ビットサーチ用コンピュータ（ＣＰＵ３）
１１０７ａ上のビットサーチプログラム１１０７、サー
チエンジン１１０６、及び複合条件判定用コンピュータ
（ＣＰＵ２）１１４５ａ上の複合条件判定プログラム１
１４５にそれぞれ制御情報が渡し終わると、検索処理が
始められる。When the search expression analysis, synonym and variant-notation expansion, automaton generation, and compound condition analysis described above are finished, and the control information has been passed to the bit search program 1107 on the bit search computer (CPU3) 1107a, to the search engine 1106, and to the compound condition determination program 1145 on the compound condition determination computer (CPU2) 1145a, the search processing begins.

【００７９】検索処理は、サーチマシン制御用コンピュ
ータ（ＣＰＵ０）１１５０上の検索実行制御プログラム
１１０８により制御される。すなわち、検索実行制御プ
ログラム１１０８では、ビットサーチプログラム１１０
７、サーチエンジン１１０６、及び複合条件判定プログ
ラム１１４５に対して起動を掛け、テキストデータファ
イル１１１０から被検索テキストデータを読み込み、階
層型プリサーチと本文サーチを実行する。まず、テキス
トデータファイル１１１０からビットサーチプログラム
１１０７へ文字成分表を読み出して文字成分表サーチを
行う。文字成分表サーチ結果は、該当文書識別子として
検索結果格納メモリ１１４６に書き出される。次に、該
文書識別子で指定される文書の凝縮本文をテキストデー
タファイル１１１０からストリングサーチエンジン１１
０６へ読み込み凝縮本文サーチを行う。ストリングサー
チエンジン１１０６では、あらかじめ設定された状態遷
移テーブル情報にしたがって指定されたキーワード群を
入力凝縮本文データの中から探し出す。そして、キーワ
ードのどれかでも見つかると、そのテキストファイルの
識別子と該当キーワードの識別コード並びに検出された
位置情報を、複合条件判定用コンピュータ（ＣＰＵ２）
１１４５ａ上の複合条件判定プログラム１１４５に送出
する。The search processing is controlled by the search execution control program 1108 on the search machine control computer (CPU0) 1150. The search execution control program 1108 starts the bit search program 1107, the search engine 1106, and the compound condition determination program 1145, reads the text data to be searched from the text data file 1110, and executes the hierarchical pre-search and the body text search. First, the character component table is read from the text data file 1110 into the bit search program 1107 and the character component table search is performed. Its result is written to the search result storage memory 1146 as the identifiers of the matching documents. Next, the condensed text of each document specified by those identifiers is read from the text data file 1110 into the string search engine 1106 and the condensed text search is performed. The string search engine 1106 looks for the specified keyword group in the input condensed text data according to the state transition table set in advance. When any of the keywords is found, it sends the identifier of the text file, the identification code of the keyword, and the detected position information to the compound condition determination program 1145 on the compound condition determination computer (CPU2) 1145a.
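The staged narrowing described in this paragraph can be sketched in Python as below. The sketch assumes pre-segmented documents, single-word keywords, and a hypothetical predetermined character set `ALPHABET`; `make_entry`, `search`, and the in-memory `store` stand in for the text data file and are not names from the patent.

```python
ALPHABET = "検索文書登録技術"  # hypothetical predetermined character set

def component_bits(text):
    """Bitmask with bit i set when ALPHABET[i] occurs in text."""
    return sum(1 << i for i, ch in enumerate(ALPHABET) if ch in text)

def make_entry(words):
    body = "".join(words)
    return {"body": body,
            "condensed": list(dict.fromkeys(words)),
            "bits": component_bits(body)}

def search(store, keyword, need_positions=False):
    need = component_bits(keyword)
    # Stage 1: character component table search (bit test only).
    hits = [d for d, e in store.items() if e["bits"] & need == need]
    # Stage 2: condensed text search on the survivors (single-word keywords assumed).
    hits = [d for d in hits if keyword in store[d]["condensed"]]
    if not need_positions:
        return hits
    # Stage 3: body text search, run only when match positions are required.
    found = [(d, store[d]["body"].find(keyword)) for d in hits]
    return [(d, pos) for d, pos in found if pos >= 0]

store = {
    1: make_entry(["文書", "検索", "技術", "文書"]),
    2: make_entry(["文書", "登録"]),
    3: make_entry(["登録", "技術"]),
    4: make_entry(["検査", "索引"]),  # has 検 and 索 but not the word 検索
}
print(search(store, "検索"))
print(search(store, "検索", need_positions=True))
```

Document 4 passes the bit test but is rejected by the condensed text stage, which shows why the stages are ordered from cheapest to most expensive.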

【００８０】サーチエンジンの出力情報として付加され
る位置情報とは、そのキーワードが見つかった文書中の
位置を表す情報のことであり、具体的にはその文書の先
頭から数えて何文字目に当るのかを文字数でカウントし
た値である。図１１に具体例で照合位置情報を示した。
本図は、文書の内容が、「あいまい検索のための知的検
索技術を開発した。The position information added to the output of the search engine indicates where in the document the keyword was found; specifically, it is the character count from the beginning of the document. FIG. 11 shows a concrete example of match position information.
The figure uses a document whose content begins "あいまい検索のための知的検索技術を開発した。" (roughly, "We developed an intelligent search technique for fuzzy search.").

【００８１】・・・・・・」という場合、これを“知的
検索”というキーワードで検索した場合を想定したもの
である。ここでは、“知的検索技術”の中の“知的検
索”の部分がキーワードと一致することになるので、こ
の部分が検出されることになる。照合位置情報として
は、“知的検索”の末尾文字“索”の文書先頭からの文
字位置が採られる。この例では、１３が照合位置情報と
なる。The figure assumes that this document is searched with the keyword "知的検索" (intelligent search). The "知的検索" portion of "知的検索技術" matches the keyword, so this portion is detected. The match position is the character position, counted from the beginning of the document, of the last character "索" of "知的検索". In this example, the match position is 13.

【００８２】この照合位置情報を付加したサーチエンジ
ンの出力情報は、図１５に示した構成を取る。すなわ
ち、本実施例では３２ビット長のキーワード識別子と、
同じく３２ビット長のキーワード照合位置情報で構成さ
れる。また、各文書毎にキーワード識別子の出力に先立
って文書識別子が出力され、照合出力情報がどの文書に
対応するものかが分かるようにしてある。The search engine output to which this match position information is added has the configuration shown in FIG. 15. In this embodiment it consists of a 32-bit keyword identifier and a 32-bit keyword match position. For each document, the document identifier is output before the keyword identifiers so that it is clear which document the match output belongs to.
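One way to picture this output layout is the Python sketch below. The text fixes only the two 32-bit fields; the big-endian byte order, the 32-bit width of the document identifier, and the names `encode` and `decode` are assumptions made for illustration.

```python
import struct

RECORD = struct.Struct(">II")  # keyword identifier, match position; byte order assumed

def encode(doc_id, matches):
    """Document identifier first, then one (keyword id, position) record per match."""
    data = struct.pack(">I", doc_id)  # identifier width is an assumption
    for keyword_id, position in matches:
        data += RECORD.pack(keyword_id, position)
    return data

def decode(data):
    """Inverse of encode for a single document."""
    (doc_id,) = struct.unpack_from(">I", data, 0)
    matches = [RECORD.unpack_from(data, off)
               for off in range(4, len(data), RECORD.size)]
    return doc_id, matches

blob = encode(7, [(1, 13), (2, 5)])
print(len(blob), decode(blob))
```

Fixed-width records let the receiving program step through the stream without parsing delimiters.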

【００８３】凝縮本文サーチ結果は、該当文書識別子と
照合キーワード識別子及びキーワード照合位置情報が組
み合わされた照合情報として、複合条件判定用コンピュ
ータ（ＣＰＵ２）１１４５ａ上の複合条件判定プログラ
ム１１４５に渡される。複合条件判定プログラム１１４
５では、先に設定された複合条件判定制御情報に基づい
て、指定条件に合致する文書を判定し、その文書識別子
を検索結果格納メモリ１１４６に書き出す。検索実行制
御プログラム１１０８は、複合条件中に近傍条件あるい
は文脈条件が設定されているかを判定し、もし設定され
ている場合には最後の本文サーチを行う。すなわち、凝
縮本文サーチの結果得られた該当文書識別子に対応する
本文データをテキストデータファイル１１１０からスト
リングサーチエンジン１１０６へ読み込み本文サーチを
行うことになる。ストリングサーチエンジン１１０６か
ら出力される照合情報は複合条件判定プログラム１１４
５に渡され、ここで指定された近傍条件及び文脈条件に
合致するか否かの判定処理が行われる。この判定処理結
果は、最終的な検索結果情報として、該当文書識別子と
いう形で検索結果格納メモリ１１４６に出力される。The condensed text search result is passed to the compound condition determination program 1145 on the compound condition determination computer (CPU2) 1145a as match information combining the document identifier, the matched keyword identifier, and the keyword match position. Based on the compound condition determination control information set earlier, the program 1145 determines which documents satisfy the specified conditions and writes their document identifiers to the search result storage memory 1146. The search execution control program 1108 checks whether a neighborhood condition or a context condition is set in the compound condition and, if so, performs the final body text search. That is, the body text data corresponding to the document identifiers obtained by the condensed text search is read from the text data file 1110 into the string search engine 1106 and searched. The match information output from the string search engine 1106 is passed to the compound condition determination program 1145, which judges whether the specified neighborhood and context conditions are satisfied. The result is output to the search result storage memory 1146 as the final search result, in the form of document identifiers.

【００８４】凝縮本文サーチあるいは本文サーチが済
み、最終的に検索処理が終わると、サーチマシン制御用
コンピュータ（ＣＰＵ０）１１５０上の検索結果表示プ
ログラム１１４７が検索結果格納メモリ１１４６上の該
当文書識別子に基づいて、検索結果件数、あるいはヒッ
トした文書の書誌情報である文書名や著者などの書誌事
項をテキストデータファイル１１１０から読み出してデ
ィスプレイ１１２０へ一覧表示したり、あるいはユーザ
の指定に応じてヒットした文書の本文データをテキスト
データファイル１１１０から読み出して表示したりす
る。When the condensed text search or the text search has finished and the search processing is finally complete, the search result display program 1147 on the search machine control computer (CPU0) 1150 uses the matching document identifiers in the search result storage memory 1146 to display the number of hits, or to read bibliographic items of the hit documents, such as document name and author, from the text data file 1110 and list them on the display 1120. At the user's request, it also reads the text data of a hit document from the text data file 1110 and displays it.

【００８５】以上が本発明により提供されるフルテキス
トサーチ装置の第一の実施例についての説明である。The above is the description of the first embodiment of the full text search apparatus provided by the present invention.

【００８６】次に、本発明の第二の実施例について、図
２５を用いて説明する。本実施例は、キーボード２５０
１、サーチマシン制御用コンピュータ（ＣＰＵ０）２５
５０、ディスプレイ２５２０、オートマトン生成用コン
ピュータ（ＣＰＵ１）２５０５ａ、ビットサーチ用コン
ピュータ（ＣＰＵ３）２５０７ａ、ストリングサーチエ
ンジン２５０６、複合条件判定用コンピュータ（ＣＰＵ
２）２５４５ａ、検索結果格納メモリ２５４６、半導体
メモリ装置２５１０ａ、ＲＡＭディスク装置２５１０
ｂ、集合型磁気ディスク装置２５１０ｃ、及びイメージ
データファイル２５３０から構成される。また、サーチ
マシン制御用コンピュータ（ＣＰＵ０）２５５０では、
検索式解析プログラム２５０２、同義語展開プログラム
２５０３、異表記展開プログラム２５０４、複合条件解
析プログラム２５４１、近傍条件解析プログラム２５４
２、文脈条件解析プログラム２５４３、論理条件解析プ
ログラム２５４４、検索実行制御プログラム２５０８、
及び検索結果表示プログラム２５４７が実行され、オー
トマトン生成用コンピュータ（ＣＰＵ１）２５０５ａで
はオートマトン生成プログラム２５０５が、ビットサー
チ用コンピュータ（ＣＰＵ３）２５０７ａではビットサ
ーチプログラム２５０７が、複合条件判定用コンピュー
タ（ＣＰＵ２）２５４５ａでは複合条件判定プログラム
２５４５が実行される。また、集合型磁気ディスク装置
２５１０ｃは、集合型磁気ディスク制御装置２５１０ｄ
と磁気ディスク装置２５１０ｅ₁〜２５１０ｅ₁₂から構
成される。Next, a second embodiment of the present invention will be described with reference to FIG. 25. This embodiment comprises a keyboard 2501, a search machine control computer (CPU0) 2550, a display 2520, an automaton generation computer (CPU1) 2505a, a bit search computer (CPU3) 2507a, a string search engine 2506, a composite condition determination computer (CPU2) 2545a, a search result storage memory 2546, a semiconductor memory device 2510a, a RAM disk device 2510b, a collective magnetic disk device 2510c, and an image data file 2530. The search machine control computer (CPU0) 2550 executes the search expression analysis program 2502, synonym expansion program 2503, variant notation expansion program 2504, composite condition analysis program 2541, neighborhood condition analysis program 2542, context condition analysis program 2543, logical condition analysis program 2544, search execution control program 2508, and search result display program 2547. The automaton generation computer (CPU1) 2505a executes the automaton generation program 2505, the bit search computer (CPU3) 2507a executes the bit search program 2507, and the composite condition determination computer (CPU2) 2545a executes the composite condition determination program 2545. The collective magnetic disk device 2510c consists of a collective magnetic disk controller 2510d and magnetic disk devices 2510e₁ to 2510e₁₂.

【００８７】本図において、先ずキーボード２５０１か
ら入力された検索条件式はサーチマシン制御用コンピュ
ータ（ＣＰＵ０）２５５０上の検索式解析プログラム２
５０２により解析される。すなわち、検索式解析プログ
ラム２５０２では検索条件式を構成するキーワード部分
とそれらの包含条件及び配置条件を記述した複合条件記
述部に分離する。包含条件は論理条件として記述され、
配置条件は近傍条件や文脈条件として記述されたもので
ある。分離抽出後、キーワード部分は同じくＣＰＵ０、
２５５０上の同義語展開プログラム２５０３に渡され、
複合条件記述部は複合条件解析プログラム２５４１に渡
される。In this figure, the search condition expression entered from the keyboard 2501 is first analyzed by the search expression analysis program 2502 on the search machine control computer (CPU0) 2550. Specifically, the search expression analysis program 2502 separates the search condition expression into its keyword portion and a composite condition description portion that describes the inclusion conditions and placement conditions of those keywords. Inclusion conditions are described as logical conditions, and placement conditions as neighborhood conditions or context conditions. After separation, the keyword portion is passed to the synonym expansion program 2503, which also runs on CPU0 2550, and the composite condition description portion is passed to the composite condition analysis program 2541.

【００８８】同義語展開プログラム２５０３では、ここ
に内蔵された同義語辞書を参照して、入力されたキーワ
ードの同義語が求められる。そして、ここで同義語展開
されたキーワード群は異表記展開プログラム２５０４へ
渡される。本図の例の場合、“計算機”から、“電算
機”、“コンピュータ”、“COMPUTER”などが生成され
る。The synonym expansion program 2503 looks up the input keyword in its built-in synonym dictionary to obtain its synonyms. The keyword group produced by this synonym expansion is then passed to the variant notation expansion program 2504. In the example of this figure, “電算機”, “コンピュータ”, “COMPUTER”, and so on are generated from “計算機” (computer).

【００８９】異表記展開プログラム２５０４では、ここ
に入力されてきたキーワード群に対して異表記展開処理
が施される。本図の例の場合、“コンピュータ”から
“コンピューター”が、また“COMPUTER”から“Comput
er”などが生成される。The variant notation expansion program 2504 applies variant notation expansion to the keyword group it receives. In the example of this figure, “コンピューター” is generated from “コンピュータ”, and “Computer” from “COMPUTER”, among others.
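The two expansion steps can be illustrated with a small sketch. The dictionary contents are taken from the example in the text, but the dictionary structure and the names `SYNONYMS`, `VARIANTS`, and `expand` are assumptions for illustration, not the format used by programs 2503 and 2504.

```python
# Hypothetical dictionaries holding only the entries named in the example.
SYNONYMS = {"計算機": ["電算機", "コンピュータ", "COMPUTER"]}
VARIANTS = {"コンピュータ": ["コンピューター"], "COMPUTER": ["Computer"]}

def expand(keyword):
    """Synonym expansion followed by variant notation expansion."""
    terms = [keyword] + SYNONYMS.get(keyword, [])
    expanded = []
    for term in terms:
        for form in [term] + VARIANTS.get(term, []):
            if form not in expanded:      # keep first occurrence, preserve order
                expanded.append(form)
    return expanded
```

Every string in the returned list would be assigned to the same keyword identifier, so that a hit on any form counts as a hit on the original keyword.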

【００９０】こうして同義語及び異表記展開されたキー
ワード群は、次にオートマトン生成用コンピュータ（Ｃ
ＰＵ１）２５０５ａ上のオートマトン生成プログラム２
５０５に送られる。The keyword group produced by this synonym and variant notation expansion is then sent to the automaton generation program 2505 on the automaton generation computer (CPU1) 2505a.

【００９１】オートマトン生成プログラム２５０５で
は、異表記展開プログラム２５０４から送られてきたキ
ーワード群に対して、これらを一括照合するオートマト
ンを生成し、状態遷移テーブルと照合すべきキーワード
の識別コード情報として、サーチエンジン２５０６に設
定する。サーチエンジン２５０６は有限オートマトン方
式に基づく高速多重文字列照合回路である。The automaton generation program 2505 generates an automaton that matches all of the keywords received from the variant notation expansion program 2504 in a single pass, and sets it in the search engine 2506 as a state transition table together with the identification codes of the keywords to be matched. The search engine 2506 is a high-speed multiple-string matching circuit based on the finite automaton method.
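As an illustration of matching a whole keyword group in one pass, the sketch below builds an Aho-Corasick automaton in software. The text does not name the construction used by program 2505, and the search engine 2506 is a hardware circuit, so this is a stand-in technique chosen for the example; `build_automaton` and `search` are hypothetical names.

```python
from collections import deque

def build_automaton(keywords):
    """keywords: dict mapping keyword string -> identifier.
    Returns (goto, fail, out): transition table, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for word, ident in keywords.items():
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((ident, len(word)))
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches ending here
    return goto, fail, out

def search(text, automaton):
    """Scan text once; return (identifier, start position) for every match."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for ident, length in out[state]:
            hits.append((ident, pos - length + 1))
    return hits
```

The `goto` list plays the role of the state transition table, and the identifiers stored in `out` correspond to the keyword identification codes set alongside it.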

【００９２】また、異表記展開プログラム２５０４で異
表記展開されたキーワード群は、該当キーワード識別コ
ードと共に、ビットサーチ用コンピュータ（ＣＰＵ３）
２５０７ａ上のビットサーチプログラム２５０７へ渡さ
れる。The keyword group expanded by the variant notation expansion program 2504 is also passed, together with the corresponding keyword identification codes, to the bit search program 2507 on the bit search computer (CPU3) 2507a.

【００９３】一方、検索式解析プログラム２５０２から
入力検索条件式中の複合条件記述部を受け取ったサーチ
マシン制御用コンピュータ（ＣＰＵ０）２５５０上の複
合条件解析プログラム２５４１では、これを解析して近
傍条件記述部と文脈条件記述部並びに論理条件記述部に
分離する。そして、各条件記述部をそれぞれ近傍条件解
析プログラム２５４２、文脈条件解析プログラム２５４
３及び論理条件解析プログラム２５４４へ渡す。Meanwhile, the composite condition analysis program 2541 on the search machine control computer (CPU0) 2550 receives the composite condition description portion of the input search condition expression from the search expression analysis program 2502, analyzes it, and separates it into a neighborhood condition description portion, a context condition description portion, and a logical condition description portion. It then passes these portions to the neighborhood condition analysis program 2542, the context condition analysis program 2543, and the logical condition analysis program 2544, respectively.

【００９４】近傍条件解析プログラム２５４２では、字
間距離条件や語間距離条件が抽出される。ここで抽出さ
れた各条件は、指定されたキーワードの識別コードとそ
の間の距離情報に変換され、複合条件判定用コンピュー
タ（ＣＰＵ２）２５４５ａ上の複合条件判定プログラム
２５４５に渡される。The neighborhood condition analysis program 2542 extracts inter-character distance conditions and inter-word distance conditions. Each extracted condition is converted into the identification codes of the specified keywords and the distance between them, and is passed to the composite condition determination program 2545 on the composite condition determination computer (CPU2) 2545a.

【００９５】文脈条件解析プログラム２５４３では、同
一文内共起条件や同一段落内共起条件、同一節内共起条
件、同一章内共起条件などの各種の共起条件が抽出され
る。ここで抽出された各条件は、指定されたキーワード
の識別コードと指定文脈コード情報に変換され、複合条
件判定用コンピュータ（ＣＰＵ２）２５４５ａ上の複合
条件判定プログラム２５４５に渡される。The context condition analysis program 2543 extracts the various co-occurrence conditions, such as co-occurrence within the same sentence, the same paragraph, the same section, or the same chapter. Each extracted condition is converted into the identification codes of the specified keywords and a code for the specified context, and is passed to the composite condition determination program 2545 on the composite condition determination computer (CPU2) 2545a.

【００９６】論理条件解析プログラム２５４４では、検
索条件式中に指定された論理条件が抽出され、論理条件
コード情報に変換され、複合条件判定用コンピュータ
（ＣＰＵ２）２５４５ａ上の複合条件判定プログラム２
５４５に渡される。The logical condition analysis program 2544 extracts the logical conditions specified in the search condition expression, converts them into logical condition code information, and passes this to the composite condition determination program 2545 on the composite condition determination computer (CPU2) 2545a.

【００９７】さて、上述した検索式解析処理、同義語及
び異表記展開処理、オートマトン生成処理、複合条件解
析処理、近傍条件解析処理、文脈条件解析処理、及び論
理条件解析処理が終わり、ビットサーチ用コンピュータ
（ＣＰＵ３）２５０７ａ上のビットサーチプログラム２
５０７、サーチエンジン２５０６、及び複合条件判定用
コンピュータ（ＣＰＵ２）２５４５ａ上の複合条件判定
プログラム２５４５にそれぞれ制御情報が渡し終わる
と、検索処理が始められる。When the search expression analysis, synonym and variant notation expansion, automaton generation, composite condition analysis, neighborhood condition analysis, context condition analysis, and logical condition analysis described above have finished, and the control information has been passed to the bit search program 2507 on the bit search computer (CPU3) 2507a, the search engine 2506, and the composite condition determination program 2545 on the composite condition determination computer (CPU2) 2545a, the search processing begins.

【００９８】検索処理は、サーチマシン制御用コンピュ
ータ（ＣＰＵ０）２５５０上の検索実行制御プログラム
２５０８により制御される。すなわち、検索実行制御プ
ログラム２５０８では、まずビットサーチプログラム２
５０７に起動を掛け、半導体メモリ装置２５１０ａから
文字成分表を読み出して文字成分表サーチを行う。文字
成分表サーチ結果は、該当文書識別子として検索結果格
納メモリ２５４６に書き出される。The search processing is controlled by the search execution control program 2508 on the search machine control computer (CPU0) 2550. The search execution control program 2508 first starts the bit search program 2507, which reads the character component table from the semiconductor memory device 2510a and performs the character component table search. The result of the character component table search is written to the search result storage memory 2546 as the identifiers of the matching documents.

【００９９】次に、ストリングサーチエンジン２５０
６、複合条件判定プログラム２５４５及びＲＡＭディス
ク装置２５１０ｂに起動を掛けて、検索結果格納メモリ
２５４６に書き出された文書識別子で指定される文書の
凝縮本文をＲＡＭディスク装置２５１０ｂからストリン
グサーチエンジン２５０６へ読み込み凝縮本文サーチを
行う。凝縮本文サーチ結果は、該当文書識別子と照合キ
ーワード識別子及びキーワード照合位置情報が組み合わ
された照合情報として、複合条件判定用コンピュータ
（ＣＰＵ２）２５４５ａ上の複合条件判定プログラム２
５４５に渡される。複合条件判定プログラム２５４５で
は、先に設定された複合条件判定制御情報に基づいて、
指定条件に合致する文書を判定し、その文書識別子を検
索結果格納メモリ２５４６に書き出す。Next, the string search engine 2506, the composite condition determination program 2545, and the RAM disk device 2510b are started, and the condensed texts of the documents specified by the document identifiers written to the search result storage memory 2546 are read from the RAM disk device 2510b into the string search engine 2506, where the condensed text search is performed. The result of the condensed text search is passed to the composite condition determination program 2545 on the composite condition determination computer (CPU2) 2545a as collation information that combines the matching document identifier, the matched keyword identifier, and the keyword collation position information. Based on the composite condition determination control information set earlier, the composite condition determination program 2545 determines which documents satisfy the specified conditions and writes their document identifiers to the search result storage memory 2546.

【０１００】そして、検索実行制御プログラム２５０８
は、複合条件中に近傍条件あるいは文脈条件が設定され
ているかを判定し、もし設定されている場合には最後の
本文サーチを行う。すなわち、ストリングサーチエンジ
ン２５０６、複合条件判定プログラム２５４５及び集合
型磁気ディスク装置２５１０ｃに起動を掛けて、凝縮本
文サーチの結果得られた検索結果格納メモリ２５４６中
の該当文書識別子に対応する本文データを集合型磁気デ
ィスク装置２５１０ｃからストリングサーチエンジン２
５０６へ読み込み本文サーチを行うことになる。The search execution control program 2508 then checks whether a neighborhood condition or a context condition is set in the composite condition and, if so, performs the final text search. That is, it starts the string search engine 2506, the composite condition determination program 2545, and the collective magnetic disk device 2510c, and the text data corresponding to the document identifiers left in the search result storage memory 2546 by the condensed text search is read from the collective magnetic disk device 2510c into the string search engine 2506, where the text search is performed.

【０１０１】集合型磁気ディスク装置２５１０ｃは複数
台の磁気ディスク装置２５１０ｅ₁〜２５１０ｅ₁₂から
構成され、文字成分表、凝縮本文、本文、及び書誌事項
などの各種テキストデータがこれらの磁気ディスク装置
２５１０ｅ₁〜２５１０ｅ₁₂に分散して格納される。そ
して、これらの磁気ディスク装置２５１０ｅ₁〜２５１
０ｅ₁₂は集合磁気ディスク制御装置２５１０ｄの制御の
もとに、平行して独立にテキストデータを読み出す。読
み出されたそれぞれのテキストデータは、集合磁気ディ
スク制御装置２５１０ｄで統合され、すなわちマルチプ
レクシングされて高速にストリングサーチエンジン２５
０６へ送り出される。１２台の磁気ディスク装置を同時
に動作させた場合、一台だけの場合に比較して約１０倍
の読み出し速度が得られることになる。The collective magnetic disk device 2510c consists of a plurality of magnetic disk devices 2510e₁ to 2510e₁₂, and the various kinds of text data, such as the character component table, condensed texts, texts, and bibliographic items, are stored distributed across these magnetic disk devices 2510e₁ to 2510e₁₂. Under the control of the collective magnetic disk controller 2510d, the magnetic disk devices 2510e₁ to 2510e₁₂ read out text data independently and in parallel. The text data read from each device is merged, that is, multiplexed, by the collective magnetic disk controller 2510d and sent to the string search engine 2506 at high speed. When all twelve magnetic disk devices operate simultaneously, a read speed about ten times that of a single device is obtained.
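The distribute-then-multiplex idea can be sketched as below. The text only says that the data is spread over twelve devices and merged by the controller 2510d; the round-robin block placement and the names `stripe` and `multiplex` are assumptions made for this illustration.

```python
def stripe(blocks, n_disks=12):
    """Distribute blocks over n_disks round-robin (assumed placement policy)."""
    return [blocks[i::n_disks] for i in range(n_disks)]

def multiplex(disks):
    """Merge the per-disk streams back into the original block order,
    taking one block from each disk in turn."""
    merged = []
    depth = max((len(d) for d in disks), default=0)
    for row in range(depth):
        for disk in disks:
            if row < len(disk):
                merged.append(disk[row])
    return merged
```

With this placement each device supplies only every twelfth block, which is why reading the devices concurrently can approach a multiple of the single-device rate; the figure of about ten times quoted in the text is the patent's own estimate.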

【０１０２】ストリングサーチエンジン２５０６から出
力される照合情報は、複合条件判定プログラム２５４５
に渡され、ここで指定された近傍条件及び文脈条件に合
致するか否かの判定処理が行われる。この判定処理結果
は、最終的な検索結果情報として、該当文書識別子とい
う形で検索結果格納メモリ２５４６に出力される。The collation information output from the string search engine 2506 is passed to the composite condition determination program 2545, which determines whether the specified neighborhood and context conditions are satisfied. The result of this determination is output to the search result storage memory 2546 as the final search result information, in the form of the matching document identifiers.

【０１０３】凝縮本文サーチあるいは本文サーチが済
み、最終的に検索処理が終わると、サーチマシン制御用
コンピュータ（ＣＰＵ０）２５５０上の検索結果表示プ
ログラム２５４７が、検索結果格納メモリ２５４６上の
該当文書識別子に基づいて、検索結果件数、あるいはヒ
ットした文書の書誌情報である文書名や著者などの書誌
事項を集合型磁気ディスク装置２５１０ｃから読み出し
てディスプレイ２５２０へ一覧表示したり、あるいはユ
ーザの指定に応じてヒットした文書の本文データを集合
磁気ディスク装置２５１０ｃから読み出して表示したり
する。更に、ユーザがヒットした文献の図面や画像情報
の閲覧を指定した場合には、イメージデータファイル２
５３０から該当するイメージデータを読み出しディスプ
レイ２５２０へ表示する。以上が本発明により提供され
るフルテキストサーチ装置の第二の実施例についての説
明である。When the condensed text search or the text search has finished and the search processing is finally complete, the search result display program 2547 on the search machine control computer (CPU0) 2550 uses the matching document identifiers in the search result storage memory 2546 to display the number of hits, or to read bibliographic items of the hit documents, such as document name and author, from the collective magnetic disk device 2510c and list them on the display 2520. At the user's request, it also reads the text data of a hit document from the collective magnetic disk device 2510c and displays it. Furthermore, when the user asks to view the drawings or image information of a hit document, the corresponding image data is read from the image data file 2530 and shown on the display 2520. The above is the description of the second embodiment of the full text search apparatus provided by the present invention.

【０１０４】また、本実施例ではテキストデータを格納
するテキストデータファイル１１０（図１）として集合磁気
ディスク制御装置１１０ｄ（図２０）を用いているが、
テキストデータファイル１１０の容量を拡大するために
集合型の光ディスク装置を用いることも可能である。す
なわち、磁気ディスク装置１１０ｅ₁〜１１０ｅ₁₂の代
わりに、光ディスク装置を用いることも可能である。た
だし、磁気ディスク装置を用いる場合に比較して、アク
セス速度が落ちるため、本文サーチ速度がその分低下す
ることになる。さらに、この場合、光ディスク装置とし
て、テキストデータの修正がない場合には追記型の光デ
ィスク装置が使え、テキストデータの修正が生じる場合
には書替え型の光ディスク装置を用いることになる。In this embodiment, the collective magnetic disk controller 110d (FIG. 20) is used as the text data file 110 (FIG. 1) that stores the text data, but a collective optical disk device can also be used to increase the capacity of the text data file 110. That is, optical disk devices can be used in place of the magnetic disk devices 110e₁ to 110e₁₂. In that case, however, the access speed is lower than with magnetic disk devices, so the text search speed drops accordingly. As the optical disk device, a write-once type can be used when the text data is never modified, and a rewritable type is used when the text data may be modified.

【０１０５】次に、上述した第二の実施例におけるＲＡ
Ｍディスク装置２５１０ｂの具体的実施例について、図
７５を用いて説明する。本図において、ＲＡＭディスク
装置２５１０ｂは、凝縮本文を納める半導体メモリ７１
００（ＲＡＭ）と、この半導体メモリ７１００上の凝縮
本文の読み出しを制御するＲＡＭディスクコントローラ
７２００から構成される。Next, a specific example of the RAM disk device 2510b in the second embodiment described above will be explained with reference to FIG. 75. In this figure, the RAM disk device 2510b consists of a semiconductor memory 7100 (RAM) that holds the condensed texts and a RAM disk controller 7200 that controls the reading of the condensed texts from the semiconductor memory 7100.

【０１０６】ＲＡＭディスクコントローラ７２００は、
ダイレクトメモリアクセスコントローラ７２１０（ＤＭ
ＡＣ）、アドレスコントローラ７２２０、アドレスメモ
リ７２３０から構成される。アドレスメモリ７２３０に
は、半導体メモリ７１００内のどこからどこまで読みだ
すのかを、それぞれ開始アドレスＳＴＡＲＴｎと終了ア
ドレスＥＮＤｎの対データとして、複数組設定できるよ
うにしている。この開始アドレス７３６０と終了アドレ
ス７３７０は、検索実行制御プログラム２５０８によ
り、検索結果格納メモリ２５４６内に書き込まれた読み
出し対象とすべき凝縮本文の識別子情報をもとに、検索
実行制御プログラム２５０８内で管理される凝縮本文格
納情報を参照して与えられる。The RAM disk controller 7200 consists of a direct memory access controller 7210 (DMAC), an address controller 7220, and an address memory 7230. The address memory 7230 can hold multiple pairs of a start address STARTn and an end address ENDn, each pair specifying a range of the semiconductor memory 7100 to be read. The start address 7360 and end address 7370 are supplied by the search execution control program 2508, which takes the identifiers of the condensed texts to be read, as written in the search result storage memory 2546, and looks them up in the condensed text storage information that it manages.

【０１０７】アドレスコントローラ７２２０は、検索実
行制御プログラム２５０８から与えられる起動信号に基
づいて、アドレスメモリ７２３０内の読み出し領域アド
レス情報、すなわち開始アドレスＳＴＡＲＴ１と終了ア
ドレスＥＮＤ１を読み出し、これから読み出すべき領域
の先頭アドレス７３１０と読み出すべきワード数７３２
０を求めて、これをダイレクトメモリアクセスコントロ
ーラ７２１０に設定され、これに起動を掛ける。ダイレ
クトメモリアクセスコントローラ７２１０は、指定され
たアドレス７３１０とワード数７３２０に基づき、該当
領域のデータを半導体メモリ７１００から読み出し出力
する。Triggered by a start signal from the search execution control program 2508, the address controller 7220 reads the read-area address information, that is, the start address START1 and the end address END1, from the address memory 7230, derives from them the head address 7310 of the area to be read and the number of words 7320 to be read, sets these in the direct memory access controller 7210, and starts it. The direct memory access controller 7210 reads the data of that area from the semiconductor memory 7100, based on the specified address 7310 and word count 7320, and outputs it.

【０１０８】ダイレクトメモリアクセスコントローラ７
２１０は、読み出しが終了したら終了信号７３７０をア
ドレスコントローラ７２２０へ送出する。アドレスコン
トローラ７２２０はこれを受けて、次の転送アドレス情
報、すなわち開始アドレスＳＴＡＲＴ２と終了アドレス
ＥＮＤ２を読み出し、同様にしてこれから読み出すべき
領域の先頭アドレス７３１０と読み出すべきワード数７
３２０を求めて、これをダイレクトメモリアクセスコン
トローラ７２１０に設定し、起動を掛ける。これを受け
てダイレクトメモリアクセスコントローラ７２１０は指
定されたアドレス７３１０とワード数７３２０に基づ
き、該当領域のデータを半導体メモリ７１００から読み
出し出力する。When the read completes, the direct memory access controller 7210 sends an end signal 7370 to the address controller 7220. On receiving it, the address controller 7220 reads the next transfer address information, that is, the start address START2 and the end address END2, derives the head address 7310 and the number of words 7320 of the next area in the same way, sets them in the direct memory access controller 7210, and starts it. The direct memory access controller 7210 then reads the data of that area from the semiconductor memory 7100, based on the specified address 7310 and word count 7320, and outputs it.

【０１０９】以下同様の処理をくりかえして、アドレス
メモリ７２３０内に設定された転送情報に対応する半導
体メモリ７１００内のデータを読み出すことになる。Thereafter, the same processing is repeated to read data in the semiconductor memory 7100 corresponding to the transfer information set in the address memory 7230.
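The read sequence of the RAM disk controller can be mimicked in software as follows. This is a sketch under stated assumptions: END addresses are treated as inclusive, one list element stands for one memory word, and `to_dma_request` and `read_regions` are hypothetical names, not parts of the described hardware.

```python
def to_dma_request(start, end):
    """Address controller step: turn a (START, END) pair into the head
    address and word count handed to the DMA controller.
    END is assumed to be inclusive."""
    return start, end - start + 1

def read_regions(memory, address_memory):
    """Process the (START, END) pairs in order, one DMA transfer per pair,
    and return the data read for each region."""
    regions = []
    for start, end in address_memory:
        head, count = to_dma_request(start, end)
        regions.append(memory[head:head + count])   # DMA read of `count` words
    return regions
```

The loop corresponds to the repeated cycle in the text: each completed transfer (the end signal) prompts the address controller to fetch the next pair until the address memory is exhausted.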

【０１１０】以上が、ＲＡＭディスク装置２５１０ｂの
実施例の説明である。The preceding is an explanation of the embodiment of the RAM disk device 2510b.

【０１１１】次に、上記第二の実施例における複合条件
解析プログラム２５４１（図２５）の更に詳細な実施例
について図１３を用いて説明する。本実施例では、複合
条件解析プログラム１１４１が、近傍条件判定プログラ
ム３３０、文脈条件判定プログラム３４０、及び論理条
件判定プログラム３５０によりパイプライン的に構成さ
れている。Next, a more detailed example of the composite condition analysis program 2541 (FIG. 25) in the second embodiment will be described with reference to FIG. 13. In this example, the composite condition analysis program 1141 is organized as a pipeline of the neighborhood condition determination program 330, the context condition determination program 340, and the logical condition determination program 350.

【０１１２】また、検索実行制御段階としては、本文サ
ーチを行う場合を例にしている。すなわち、入力テキス
トデータとしては、集合型磁気ディスク装置１１１０ｃ
から本文データを入力し、このなかからサーチエンジン
１１０６でキーワードの探索照合を行う場合である。The search execution control stage is illustrated for the case of a text search. That is, text data is input from the collective magnetic disk device 1110c, and the search engine 1106 searches it for the keywords.

【０１１３】探索条件式としては、論理条件、近傍条件
及び文脈条件を含む複合条件式３０１が入力されるもの
とする。As a search condition expression, a complex condition expression 301 including a logical condition, a neighborhood condition, and a context condition is input.

【０１１４】複合条件式３０１：Ｑ＝ａｎｄ（文書〔４
Ｃ〕理解、文書〔Ｓ〕検索）この複合条件式３０１は、「文書」と「理解」がこの順
序で現れ、かつ４文字以内の距離に近接し、さらに「文
書」と「検索」が同一文中に共起するものを検索するこ
とを意味している。すなわち、“文書〔４Ｃ〕理解”
が、「文書」と「理解」がこの順序で現れ、かつ４文字
以内の距離に近接するという近傍条件を示し、“文書
〔Ｓ〕検索”が、「文書」と「検索」が同一文中に共起
する文脈条件を、“ａｎｄ（……、……）”が、これら
両者が同時に起こるという論理条件を示している。Composite conditional expression 301: Q = and(文書〔4C〕理解, 文書〔S〕検索). This composite conditional expression 301 requests documents in which “文書” (document) and “理解” (understanding) appear in this order within four characters of each other, and in which “文書” (document) and “検索” (search) co-occur in the same sentence. That is, “文書〔4C〕理解” is a neighborhood condition stating that “文書” and “理解” appear in this order within four characters of each other, “文書〔S〕検索” is a context condition stating that “文書” and “検索” co-occur in the same sentence, and “and(..., ...)” is a logical condition stating that both hold at once.

【０１１５】このような複合条件検索式３０１が指定さ
れると、第二の実施例（図２５）で説明したように、先
ずこの検索条件式が検索式解析プログラム１１０２で解
析され、これに含まれるキーワード、すなわち単語「文
書」、「理解」及び「検索」が抽出される。そして、こ
れらにそれぞれＴ１，Ｔ２及びＴ３という識別子が付与
され、同義語展開プログラム１１０３、さらには異表記
展開プログラム１１０４へ渡される。ここでは、説明を
簡単にするために、同義語及び異表記展開される言葉が
なかったものとして説明する。したがって、同義語及び
異表記展開された結果は、入力キーワードと変わらず、
「文書」、「理解」及び「検索」の３単語ということに
なる。これらは、オートマトン生成プログラム１１０７
に渡され、ここで各文字列を照合するオートマトンが作
成され、その状態遷移テーブルがサーチエンジン１１０
６に設定されることになる。When such a composite condition search expression 301 is specified, the search expression is first analyzed by the search expression analysis program 1102, as described for the second embodiment (FIG. 25), and the keywords it contains, namely the words “文書” (document), “理解” (understanding), and “検索” (search), are extracted. These are given the identifiers T1, T2, and T3, respectively, and are passed to the synonym expansion program 1103 and then to the variant notation expansion program 1104. To keep the explanation simple, it is assumed here that no synonyms or variant notations exist. The result of synonym and variant notation expansion is therefore the same as the input keywords: the three words “文書”, “理解”, and “検索”. These are passed to the automaton generation program 1107, which builds an automaton that matches each string, and its state transition table is set in the search engine 1106.

【０１１６】一方、検索条件式中の複合条件について
は、複合条件解析プログラム１１４１にて、それぞれ近
傍条件“文書〔４Ｃ〕理解”、文脈条件“文書〔Ｓ〕検
索”、及び論理条件“and（……，……）に分解され
る。この時、各条件式中のキーワードは、先にオートマ
トン生成に際して付与されたキーワード識別子（ターム
識別子とも呼ぶ）で置き換えられる。したがって、近傍
条件は“Ｔ１〔４Ｃ］Ｔ２”と、文脈条件は“Ｔ１
〔Ｓ〕Ｔ３”という形式で表される。また、これらの条
件式にもそれぞれ項識別子Ｉ１及びＩ２が付与される。
したがって、論理条件式は“ａｎｄ（Ｉ１，Ｉ２）”と
表されることになる。以上の処理は、それぞれ近傍条件
解析プログラム２５４２（図２５）、文脈条件解析プロ
グラム２５４３（図２５）及び論理条件解析プログラム
（図２５）２５４４にて行われる。このようにしてター
ム識別子及び項識別子で表現された各条件は、複合条件
判定プログラム２５４５（図２５）の各条件判定処理プ
ログラムに送られる。Meanwhile, the composite condition analysis program 1141 decomposes the composite condition in the search condition expression into the neighborhood condition “文書〔4C〕理解”, the context condition “文書〔S〕検索”, and the logical condition “and(..., ...)”. At this point, the keywords in each conditional expression are replaced by the keyword identifiers (also called term identifiers) assigned earlier during automaton generation. The neighborhood condition is therefore written “T1〔4C〕T2” and the context condition “T1〔S〕T3”. These conditional expressions are in turn given the item identifiers I1 and I2, so the logical conditional expression is written “and(I1, I2)”. This processing is performed by the neighborhood condition analysis program 2542 (FIG. 25), the context condition analysis program 2543 (FIG. 25), and the logical condition analysis program 2544 (FIG. 25), respectively. The conditions expressed in this way with term identifiers and item identifiers are sent to the corresponding condition determination programs within the composite condition determination program 2545 (FIG. 25).
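The decomposition into term identifiers and item identifiers can be sketched as a small parser. The input syntax here is a simplification assumed for the example (ASCII brackets, a single top-level operator, comma-separated items), and the function name `decompose` and the tuple formats are hypothetical.

```python
import re

# One item: <left keyword>[<n>C or S]<right keyword>  (assumed ASCII syntax)
ITEM = re.compile(r"(.+?)\[(\d+C|S)\](.+)")

def decompose(expr):
    """Split e.g. 'and(文書[4C]理解, 文書[S]検索)' into term identifiers,
    neighborhood conditions, context conditions, and the logical condition."""
    op, body = re.fullmatch(r"(\w+)\((.*)\)", expr.strip()).groups()
    terms, neighborhood, context, items = {}, [], [], []

    def term_id(word):
        # Assign T1, T2, ... in order of first appearance.
        return terms.setdefault(word, f"T{len(terms) + 1}")

    for n, part in enumerate(body.split(","), start=1):
        left, cond, right = ITEM.fullmatch(part.strip()).groups()
        item = f"I{n}"
        items.append(item)
        a, b = term_id(left), term_id(right)
        if cond == "S":
            context.append((item, a, "S", b))                 # same-sentence
        else:
            neighborhood.append((item, a, int(cond[:-1]), b))  # within n chars
    return terms, neighborhood, context, (op, items)
```

For the expression in the text this yields T1, T2, T3 for “文書”, “理解”, “検索”, the neighborhood item I1, the context item I2, and the logical condition and(I1, I2).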

【０１１７】こうしてサーチエンジン１１０６に各検索
ターム照合用のオートマトン状態遷移テーブル及び検索
ターム識別子情報が設定され、近傍条件判定プログラム
３３０、文脈条件判定プログラム３４０、及び論理条件
判定プログラム３５０にそれぞれ検索ターム識別子及び
項識別子で記述された各条件式が設定されると、検索実
行制御プログラム１１０８により集合型磁気ディスク装
置１１１０ｃ、サーチエンジン１１０６、複合条件解析
プログラム１１４５、近傍条件判定プログラム３３０、
文脈条件判定プログラム３４０、及び論理条件判定プロ
グラム３５０に起動が掛けられる。Once the automaton state transition table for matching the search terms and the search term identifier information have been set in the search engine 1106, and the conditional expressions written in terms of search term identifiers and item identifiers have been set in the neighborhood condition determination program 330, the context condition determination program 340, and the logical condition determination program 350, the search execution control program 1108 starts the collective magnetic disk device 1110c, the search engine 1106, the composite condition determination program 1145, the neighborhood condition determination program 330, the context condition determination program 340, and the logical condition determination program 350.

【０１１８】そうすると、集合型磁気ディスク装置１１
１０ｃからはテキストデータが読み出されサーチエンジ
ン１１０６へ送られる。サーチエンジン１１０６では、
指定された検索ターム「文書」、「理解」及び「検索」
のどれかが見つかると、その検索ターム識別子Ｔ１，Ｔ
２及びＴ３が見つかったテキスト内の位置情報と一緒に
近傍条件判定プログラム３３０へ送られる。また、文間
の区切り記号となる「。」についても、とくにユーザか
らの指定がなくともサーチエンジン１１０６で検出しこ
れに対応する句点識別子Ｔ０並びに位置情報を近傍条件
判定プログラム３３０に送り出す。Text data is then read from the collective magnetic disk device 1110c and sent to the search engine 1106. Whenever the search engine 1106 finds one of the specified search terms “文書”, “理解”, or “検索”, it sends the corresponding search term identifier T1, T2, or T3 to the neighborhood condition determination program 330, together with the position in the text at which the term was found. The search engine 1106 also detects the full stop “。”, which delimits sentences, even without any instruction from the user, and sends the corresponding period identifier T0 and its position information to the neighborhood condition determination program 330.

【０１１９】近傍条件判定プログラム３３０では、サー
チエンジン１１０６から送られてくる検索ターム識別子
をその位置情報も加味して指定された近傍条件と照らし
合わせる。もし指定近傍条件“Ｔ１〔４Ｃ〕Ｔ２”、す
なわち“文書〔４Ｃ〕理解”に合致するものがあれば、
その照合結果として該当条件に対応した項識別子Ｉ１
を、サーチエンジン１１０６から入力した句点識別子Ｔ
０、検索ターム識別子Ｔ１，Ｔ２及びＴ３に加えて文脈
条件判定プログラム３４０へ送り出す。The neighborhood condition determination program 330 checks the search term identifiers sent from the search engine 1106, together with their position information, against the specified neighborhood condition. If something satisfies the specified neighborhood condition “T1〔4C〕T2”, that is, “文書〔4C〕理解”, it sends the item identifier I1 corresponding to that condition to the context condition determination program 340 as the collation result, in addition to the period identifier T0 and the search term identifiers T1, T2, and T3 received from the search engine 1106.

【０１２０】文脈条件判定プログラム３４０では、上記
近傍条件判定プログラム３３０から入力した句点識別子
Ｔ０及び検索ターム識別子Ｔ１，Ｔ３並びにその位置情
報を基に、指定文脈条件をチェックする。文脈条件“Ｔ
１［Ｓ〕Ｔ３”は、上記句点識別子Ｔ０と、Ｔ１及びＴ
３の並びから判定する。すなわち、Ｔ１とＴ３がこの順
序でその前後を二つのＴ０で挟まれていれば文脈条件
“Ｔ１〔Ｓ〕Ｔ３”が成立したものと判断する。もしこ
の文脈条件“文書〔Ｓ〕検索”に合致するものが見つか
れば、その照合結果として該当条件に対応した項識別子
Ｉ２を、近傍条件判定プログラム３３０から入力した句
点識別子Ｔ０、及び検索ターム識別子Ｔ１，Ｔ３並びに
項識別子Ｉ１に加えて論理条件判定プログラム３５０に
送り出す。The context condition determination program 340 checks the specified context condition based on the period identifier T0 and the search term identifiers T1 and T3 received from the neighborhood condition determination program 330, together with their position information. The context condition “T1〔S〕T3” is judged from the sequence of the period identifier T0 and the identifiers T1 and T3: if T1 and T3 occur in this order with a T0 before and a T0 after them, the context condition “T1〔S〕T3” is taken to hold. If something satisfying this context condition “文書〔S〕検索” is found, the item identifier I2 corresponding to the condition is sent to the logical condition determination program 350 as the collation result, in addition to the period identifier T0, the search term identifiers T1 and T3, and the item identifier I1 received from the neighborhood condition determination program 330.

【０１２１】論理条件判定プログラム３５０では、文脈
条件判定プログラム３４０から送られてくる句点識別子
Ｔ０及び検索ターム識別子Ｔ１，Ｔ３並びに項識別子Ｉ
１，Ｉ２の中から指定論理条件“ａｎｄ（Ｉ１，Ｉ
２）”に合致する識別子Ｉ１，Ｉ２があるかどうか調べ
る。すなわち、項識別子Ｉ１とＩ２の両者が見つかれば
大元の複合条件検索式Ｑが成り立ったことになり、その
テキスト（文書）は検索式Ｑで検索されたことになる。
該当テキストの例としては、同図に示したテキスト３０
２のようなものが検索されることになる。The logical condition determination program 350 examines the period identifier T0, the search term identifiers T1 and T3, and the item identifiers I1 and I2 sent from the context condition determination program 340 to see whether identifiers I1 and I2 satisfying the specified logical condition “and(I1, I2)” are present. That is, if both item identifiers I1 and I2 are found, the original composite condition search expression Q holds, and the text (document) has been retrieved by search expression Q. The text 302 shown in the same figure is an example of a text that would be retrieved.
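The three determination stages can be illustrated on a list of hits for one document. Several points are assumptions made for this sketch: each hit is a tuple (identifier, start position, length), the distance in the neighborhood condition is counted as the number of characters between the end of the first term and the start of the second, the start of the document is treated as a sentence boundary, and the stages are plain function calls rather than a streaming pipeline. The function names are hypothetical.

```python
def neighborhood(hits, a, b, max_gap):
    """True if some hit for `a` is followed by a hit for `b` with at most
    max_gap characters between them (gap definition is an assumption)."""
    return any(
        0 <= start_b - (start_a + len_a) <= max_gap
        for id_a, start_a, len_a in hits if id_a == a
        for id_b, start_b, _ in hits if id_b == b
    )

def same_sentence(hits, a, b, period="T0"):
    """True if `a` and then `b` occur with no period identifier between them."""
    seen_a = False
    for ident, _, _ in sorted(hits, key=lambda h: h[1]):
        if ident == period:
            seen_a = False          # sentence boundary resets the state
        elif ident == a:
            seen_a = True
        elif ident == b and seen_a:
            return True
    return False

def evaluate(hits):
    """Q = and(I1, I2) with I1 = T1[4C]T2 and I2 = T1[S]T3."""
    items = set()
    if neighborhood(hits, "T1", "T2", 4):
        items.add("I1")
    if same_sentence(hits, "T1", "T3"):
        items.add("I2")
    return {"I1", "I2"} <= items    # logical condition: both items present
```

For the text “文書の理解と文書検索。” the hits would be T1 at 0, T2 at 3, T1 at 6, T3 at 8, and T0 at 10, and `evaluate` accepts the document.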

【０１２２】一方、上記集合型磁気ディスク装置１１１
０ｃから、サーチエンジン１１０６、近傍条件判定プロ
グラム３３０、文脈条件判定プログラム３４０及び論理
条件判定プログラム３５０へ流れる照合情報の中にはこ
れまで説明しなかったテキストデータの識別子も含まれ
ている。すなわち、論理条件判定プログラム３５０では
検索式Ｑが成立したテキストデータについては、その文
書識別子を次段の検索結果表示プログラムへ送られ、こ
こでヒット件数が表示されたり、あるいはこの文書識別
子をもとに集合型磁気ディスク装置１１１０ｃから該当
文書の書誌事項が読み出され、これがディスプレイ１１
２０へ表示されることになる。The collation information flowing from the collective magnetic disk device 1110c through the search engine 1106, the neighborhood condition determination program 330, the context condition determination program 340, and the logical condition determination program 350 also includes the identifier of the text data, which has not been mentioned so far. For text data that satisfies search expression Q, the logical condition determination program 350 sends the document identifier to the search result display program in the next stage, which displays the number of hits or uses the document identifier to read the bibliographic items of the document from the collective magnetic disk device 1110c and show them on the display 1120.

【０１２３】次に、本発明が提供するフルテキストサー
チ方式について具体的に説明する。本発明においては、
スキャン型のフルテキストサーチを加速する方法とし
て、２段階のプリサーチ、すなわち図１５に示す文字成
分表サーチ４０２と凝縮本文サーチ４０３を行スクに格
納されたテキスト本文を参照しに行く件数を予め絞り込
んでおく。こうすることによって、検索処理時間に占め
る割合が高い本文検索処理量を減らすことができ、全体
の検索処理時間を短縮することが可能となる。Next, the full-text search method provided by the present invention will be described in detail. In the present invention, as a way of accelerating scan-type full-text search, a two-stage pre-search is performed, namely the character component table search 402 and the condensed text search 403 shown in FIG. 15, so that the number of documents whose text body stored on disk must be consulted is narrowed down in advance. This reduces the amount of text-body search processing, which accounts for a large share of the search processing time, and shortens the overall search processing time.

【０１２４】これらは全て検索実行制御プログラムによ
って制御される。先ず、第１段階目のプリサーチである
文字成分表サーチの実施例について説明する。These are all controlled by the search execution control program. First, an embodiment of a character component table search which is a first-stage pre-search will be described.

【０１２５】本文字成分表サーチでは、図１６の登録処
理全体の流れ及び図１８に詳細に示したハッシュコード
化手順に示すように、後述する凝縮本文中のすべての文
字コードに対してその文字コードをテキスト中に含む文
書のリストを作成しておく。In this character component table search, as shown in the overall flow of the registration process in FIG. 16 and in the hash coding procedure detailed in FIG. 18, a list of the documents whose text contains a given character code is prepared in advance for every character code appearing in the condensed text described later.

【０１２６】すなわち、各文字コードの文書毎の有無を
１ビットの情報（ビットリストと呼ぶ）で表し、更にこ
れをハッシュ化したものを文字成分表５００として持
つ。That is, the presence or absence of each character code in each document is represented by one bit of information (called a bit list), and this is further hashed and held as the character component table 500.

【０１２７】例えば、「検索」というキーワードが指定
された場合には、図１８に示すように「検」と「索」の
それぞれの文字毎にハッシュ関数５１０を介して文字成
分表５００のエントリアドレスを求める。そして、それ
ぞれの文字コードのハッシュ値から求められたビットリ
スト５０３および５０６のビット間の論理積を取ること
によって、「検」と「索」の両文字を含む文献のビット
リスト５２０が求められる。For example, when the keyword "検索" (search) is specified, as shown in FIG. 18, the entry address in the character component table 500 is obtained through the hash function 510 for each of the characters "検" and "索". Then, by taking the bitwise logical product (AND) of the bit lists 503 and 506 obtained from the hash values of the respective character codes, the bit list 520 of the documents containing both "検" and "索" is obtained.
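
The lookup described in this paragraph can be sketched roughly as follows in Python. The table size, the modulo hash, and the use of one Python integer per table entry as a bit list are assumptions made for illustration only; the text states only that a hash function 510 maps each character to an entry of the character component table 500.

```python
TABLE_SIZE = 4096  # assumed number of table entries (not given in the text)


def entry_address(ch: str) -> int:
    """Assumed stand-in for hash function 510: character code modulo table size."""
    return ord(ch) % TABLE_SIZE


def build_component_table(docs: list[str]) -> list[int]:
    """Bit i of an entry is set when document i contains a character hashing to it."""
    table = [0] * TABLE_SIZE
    for doc_no, text in enumerate(docs):
        for ch in set(text):
            table[entry_address(ch)] |= 1 << doc_no
    return table


def keyword_bit_list(table: list[int], keyword: str, n_docs: int) -> int:
    """AND together the bit lists of every character in the keyword."""
    bits = (1 << n_docs) - 1  # start with every document as a candidate
    for ch in keyword:
        bits &= table[entry_address(ch)]
    return bits


docs = ["文書検索の技術", "検査と探索", "画像処理"]  # hypothetical sample documents
table = build_component_table(docs)
candidates = keyword_bit_list(table, "検索", len(docs))
print(format(candidates, "03b"))
```

In this toy data set, documents 0 and 1 both remain candidates, even though document 1 contains "検" and "索" only in separate words. A table-level match is therefore only a candidate, not a confirmed hit.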

【０１２８】以上の文字成分表サーチの処理手順は図２
３に示したとおりである。すなわち、指定された検索条
件式中に含まれるキーワード数分だけ文字成分表サーチ
を繰返し、各キーワードの文字成分表サーチでは、この
キーワードを構成する文字数分、それぞれの文字の存在
を示したビットリストの論理積ＡＮＤをとることにな
る。この結果、各キーワード毎に、これを含む可能性を
持った文書候補がビットリストの形で求まることにな
る。最後に、こうして求まったビットリストを文書識別
子へ変換する。この文書識別子はシステム内部でユニー
クに定められた文書番号であり、ビットリストの先頭か
らビット位置に対応して付与されている。The processing procedure of the character component table search described above is as shown in FIG. 23. That is, the character component table search is repeated once for each keyword contained in the specified search condition expression. In the search for each keyword, the bit lists indicating the presence of each character are ANDed together for as many characters as make up the keyword. As a result, for each keyword, the candidate documents that may contain it are obtained in the form of a bit list. Finally, the bit list thus obtained is converted into document identifiers. A document identifier is a document number uniquely assigned within the system, and is assigned in correspondence with bit positions counted from the head of the bit list.
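
As a small illustration of the final conversion step, the sketch below turns a bit list into document identifiers. It assumes that the bit list is held as a Python integer, that the lowest-order bit corresponds to the head of the bit list, and that document numbers start at 1; none of these details is fixed by the text.

```python
def bits_to_doc_ids(bits: int, first_id: int = 1) -> list[int]:
    """Return the document numbers whose bits are set, in ascending order."""
    ids = []
    position = 0
    while bits:
        if bits & 1:
            ids.append(first_id + position)
        bits >>= 1
        position += 1
    return ids


print(bits_to_doc_ids(0b1011))
```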

【０１２９】また、文字成分表サーチにおいて、指定さ
れた検索条件式中に論理積条件（ＡＮＤ）が設定されて
いる場合には、文字成分表サーチ処理の中で論理積条件
の処理も行い、これ以降の検索処理対象文書件数を絞り
込んでおくことによって、全体の検索処理時間を短縮す
ることが可能となる。In the character component table search, if a logical product condition (AND) is set in the designated search condition expression, the logical product condition is also processed in the character component table search process. By narrowing down the number of search processing target documents after this, it is possible to shorten the entire search processing time.

【０１３０】例えば、 “Ｑ＝ａｎｄ（文書、検索）” という検索条件式が入力された場合について説明する。
この検索条件式は、“文書”と“検索”が両方共表われ
る文書を検索する意味を表す。この場合、まずキーワー
ド“文書”について文字成分表サーチを行い、次に“検
索”というキーワードについて文字成分表サーチを行
う。その後、この両者の検索結果のビットリスト間の相
互のビット毎の論理積ＡＮＤをとり、文字成分表サーチ
の最終的な検索結果とする。この処理手順を図２４に示
す。本図では、検索条件式中に含まれるキーワード、す
なわちキーワード数分文字成分表サーチを繰り返すこと
になる。For example, consider the case where the search condition expression "Q = and(文書, 検索)" is input. This expression means: retrieve documents in which both "文書" (document) and "検索" (search) appear. In this case, a character component table search is first performed for the keyword "文書", and then for the keyword "検索". The bit lists of the two results are then ANDed bit by bit, and the result is taken as the final result of the character component table search. FIG. 24 shows this processing procedure. In this figure, the character component table search is repeated once for each keyword contained in the search condition expression.

【０１３１】そして、この各キーワード毎の文字成分表
サーチにおいては、このキーワードを構成する文字数
分、それぞれの文字の存在を示したビットリストの論理
積ＡＮＤをとる。この処理を、全キーワード数分行った
後、各キーワードの文字成分表サーチ結果のビットリス
ト間の論理積ＡＮＤをとる。こうして得られた最終ビッ
トリストは、検索条件式中の論理積条件で指定されたキ
ーワードを同時に含みうる文書候補を表すことになる。In the character component table search for each keyword, the bit lists indicating the presence of each character are ANDed for as many characters as make up the keyword. After this has been done for all keywords, the bit lists resulting from the character component table search of each keyword are ANDed together. The final bit list thus obtained represents the candidate documents that may contain, at the same time, all the keywords specified by the logical product condition in the search condition expression.
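
The two levels of AND described here (over the characters of one keyword, then over keywords) might look like the following sketch. For brevity it skips the hashing step and assumes a plain dictionary from character to bit list; the dictionary contents are made-up sample values, not data from the figures.

```python
def keyword_candidates(char_bits: dict[str, int], keyword: str, all_docs: int) -> int:
    """AND the bit lists of the characters that make up one keyword."""
    bits = all_docs
    for ch in keyword:
        bits &= char_bits.get(ch, 0)  # an unseen character rules out every document
    return bits


def and_query(char_bits: dict[str, int], keywords: list[str], n_docs: int) -> int:
    """AND the per-keyword results, as for a condition such as and(文書, 検索)."""
    all_docs = (1 << n_docs) - 1
    result = all_docs
    for keyword in keywords:
        result &= keyword_candidates(char_bits, keyword, all_docs)
    return result


# Hypothetical bit lists for four documents (bit 0 = first document).
char_bits = {"文": 0b0111, "書": 0b0101, "検": 0b1101, "索": 0b1001}
final_bits = and_query(char_bits, ["文書", "検索"], n_docs=4)
print(format(final_bits, "04b"))
```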

【０１３２】以上の処理のように、指定された検索条件
式中に論理積条件（ＡＮＤ）が設定されている場合に
は、文字成分表サーチ処理の中で論理積条件の処理も行
い、これ以降の検索処理対象文書件数を絞り込むことに
よって、全体の検索処理時間を短縮することが可能とな
る。As described above, when the logical product condition (AND) is set in the specified search condition expression, the logical product condition is also processed in the character component table search process. By narrowing down the number of subsequent search target documents, the entire search processing time can be reduced.

【０１３３】この文字成分表５００は、各文字コードの
文献毎の有無を１ビットの情報で表すと共に、更にこれ
をハッシュ化しているため、テーブル容量は原テキスト
データの数十分の一になり、サーチすべきデータ容量も
極めて小さくなり、検索の高速化に大きく寄与すること
になる。ただし、この文字成分表サーチだけではノイズ
が生じてしまう。すなわち、検索処理手順を示す図１７
の文書３の様に「検」と「索」がばらばらに表れるテキ
ストも検索されてしまうことになる。このノイズを消去
するのが第二のプレサーチ、すなわち、凝縮本文サーチ
である。Because the character component table 500 represents the presence or absence of each character code in each document with one bit of information and is further hashed, the table is only a small fraction of the size of the original text data (on the order of one part in several tens). The amount of data to be searched is therefore extremely small, which contributes greatly to faster search. However, the character component table search alone produces noise. That is, a text in which "検" and "索" appear separately, such as document 3 in FIG. 17 showing the search processing procedure, is also retrieved. The second pre-search, namely the condensed text search, removes this noise.

【０１３４】第二のプリサーチである凝縮本文サーチで
は、凝縮本文を対象に検索を行う。凝縮本文は、予めテ
キスト本文の中から助詞や接続詞などの付属語を削除す
ると共に繰り返し現れる単語の重複を排除したものであ
る。図１９にこの凝縮本文の作成方法を示す。The condensed text search, which is the second pre-search, searches the condensed text. The condensed text is produced in advance from the text body by deleting attached words such as particles and conjunctions and by eliminating duplicates of words that appear repeatedly. FIG. 19 shows how the condensed text is created.

【０１３５】ここでは、「あいまい検索のための知的検
索技術」６０１というテキスト文字列を例にとる。先ず
最初に文字種分割処理６１０において、入力文字列を異
なる文字種の間で分割する。この例では、「あいま
い」、「検索」、「のための」および「知的検索技術」
の４つの文字列６０２に分解される。Here, the text string 601 "あいまい検索のための知的検索技術" ("intelligent search technology for fuzzy search") is taken as an example. First, in the character type division process 610, the input string is split at the boundaries between different character types. In this example it is decomposed into the four strings 602: "あいまい", "検索", "のための", and "知的検索技術".

【０１３６】次に付属語解析処理６２０において、文字
種分割された文字列６０２のうち、ひらがな文字列「あ
いまい」と「のための」に対して付属語解析を加え、付
属語と解釈できるものは検索には用いられない言葉とし
て取り除く。すなわち、助詞や接続詞とみなせるものに
ついては捨ててしまう。このような言葉は、もし検索の
キーワードとして用いたとしても、ほとんど全ての文書
に現れるため、ほぼ全件がヒットしてしまうことにな
り、検索という意味をなさないことになる。この例で
は、ひらがな文字列「のための」６０３が助詞「の」
と、接続詞「ため」及び助詞「の」と、すべての部分文
字列が不要語と解釈できるので、検索には使われ得ない
文字列とみなして除去する。一方、「あいまい」は付属
語と解釈することができないので、そのまま凝縮本文と
して残す。この場合、「あいまい」を名詞として認識し
て残しているのではない。したがって、どのような新語
が文書に現れようとも、必ず凝縮本文に登録されること
になる。Next, in the attached-word analysis process 620, attached-word analysis is applied to the hiragana strings "あいまい" and "のための" among the character strings 602 obtained by character type division, and anything that can be interpreted as an attached word is removed as a word not used for search. That is, anything that can be regarded as a particle or a conjunction is discarded. Even if such a word were used as a search keyword, it appears in almost every document, so nearly all documents would be hit and the search would be meaningless. In this example, the hiragana string "のための" 603 decomposes into the particle "の", the conjunction "ため", and the particle "の". Since every substring can be interpreted as an unnecessary word, the string is regarded as one that cannot be used for search and is removed. On the other hand, "あいまい" cannot be interpreted as an attached word, so it is left in the condensed text as it is. Note that "あいまい" is not kept because it has been recognized as a noun. Consequently, whatever new word appears in a document, it is always registered in the condensed text.

【０１３７】最後に、重複登録排除処理６３０におい
て、不要語として除去された残りの文字列群６０２の中
に、同じ言葉がないかどうかを調べる。もし、同じもの
があれば二重登録しないように次のものを捨ててしま
う。まったく同じでなくとも、どちらかの文字列がもう
一方の文字列に含まれていれば、その含まれる文字列は
不要であるので捨ててしまう。本図の例では、「検索」
が「知的検索技術」に含まれるため、重複登録排除とい
うことで切り落とされる。その結果、凝縮本文として最
終的に、「あいまい」、「検索」及び「技術」が残るこ
とになる。このように、凝縮本文は単語単位で原文書を
情報圧縮したことになるため、この凝縮本文をサーチす
ることによって、例えば「検索」と連続した文字列、す
なわち単語としてキーワードが現れる文書のみを拾い出
すことが可能になる。Finally, in the duplicate registration elimination process 630, the character string group 602 remaining after removal of unnecessary words is checked for identical words. If identical words exist, the later one is discarded so that it is not registered twice. Even when two strings are not identical, if one is contained in the other, the contained string is unnecessary and is discarded. In the example in the figure, "検索" is contained in "知的検索技術" and is therefore cut off by duplicate registration elimination. As a result, "あいまい", "検索", and "技術" finally remain as the condensed text. Because the condensed text is thus the original document compressed in units of words, searching it makes it possible to pick out only documents in which a keyword such as "検索" appears as a contiguous character string, that is, as a word.
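
The three steps (character type division 610, attached-word analysis 620, duplicate registration elimination 630) can be sketched as below. This is only a literal reading of the rules stated in the text, under several assumptions: the Unicode ranges used to classify characters, the tiny hard-coded list of attached words standing in for the grammatical analysis, and the function names are all hypothetical. The sketch keeps the longer string "知的検索技術" and drops the contained "検索"; it does not attempt the finer word segmentation that would be needed to reproduce the word list reported for FIG. 19.

```python
from itertools import groupby

# Assumed sample of particles and conjunctions; a real analysis would be grammatical.
ATTACHED_WORDS = ("の", "ため", "は", "を", "に", "が", "と", "で", "も")


def char_type(ch: str) -> str:
    """Classify a character by (assumed) Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_type(text: str) -> list[str]:
    """Step 610: cut the text wherever the character type changes."""
    return ["".join(group) for _, group in groupby(text, key=char_type)]


def is_all_attached(s: str) -> bool:
    """True when s can be segmented entirely into attached words."""
    if not s:
        return True
    return any(s.startswith(w) and is_all_attached(s[len(w):]) for w in ATTACHED_WORDS)


def condense(text: str) -> list[str]:
    pieces = split_by_type(text)
    # Step 620: drop hiragana runs that consist only of attached words.
    kept = [p for p in pieces if not (char_type(p[0]) == "hiragana" and is_all_attached(p))]
    # Step 630: drop exact repeats and strings contained in another kept string.
    result: list[str] = []
    for p in kept:
        if p in result or any(p in q for q in kept if q != p):
            continue
        result.append(p)
    return result


print(split_by_type("あいまい検索のための知的検索技術"))
print(condense("あいまい検索のための知的検索技術"))
```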

【０１３８】このようにして作成された凝縮本文は、原
テキストと比較しその約２０〜２５％に容量が減じられ
る。したがって、フルテキストサーチを等価的に約５倍
高速化できることになる。さらに、この凝縮本文を半導
体メモリなどの高速アクセスが可能なメモリ上に置くこ
とによって、さらに等価スキャン速度を高めることが可
能となる。The capacity of the condensed text thus created is reduced to about 20 to 25% of that of the original text. Therefore, the full-text search can be equivalently speeded up about five times. Further, by placing the condensed text on a memory that can be accessed at high speed such as a semiconductor memory, it is possible to further increase the equivalent scan speed.
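
As a rough check of the figures above, assume that scan time is proportional to the volume of data scanned (an assumption; the text does not state a cost model). If the condensed text is a fraction $r$ of the original text, the equivalent speedup is about $1/r$:

```latex
\text{speedup} \approx \frac{1}{r}, \qquad
r = 0.25 \;\Rightarrow\; \frac{1}{0.25} = 4, \qquad
r = 0.20 \;\Rightarrow\; \frac{1}{0.20} = 5
```

Under that assumption, a reduction to 20 to 25% corresponds to a speedup of roughly four to five times, consistent with the "about five times" stated above.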

【０１３９】また、本凝縮本文の作成方式は、キーワー
ド辞書などを用いて検索に必要とする単語を切り出して
くる方法と異なり、辞書を用いず文法的に解析し得る不
要語だけを除去する方法を用いているために、必要な単
語を切り落してしまう危険性がなく、検索漏れが生じに
くい特徴がある。従来の検索方式ではキーワード辞書に
登録されていない新語などが採取できないことにより検
索漏れが生じたりするが、本方式では新語であっても凝
縮本文から落ちることがないため、新語ということによ
る検索漏れが生じることはない。Unlike methods that use a keyword dictionary or the like to cut out the words needed for search, this method of creating the condensed text uses no dictionary and removes only the unnecessary words that can be identified by grammatical analysis. There is therefore no danger of cutting off a needed word, and search omissions are unlikely. With conventional search methods, omissions can occur because new words not registered in the keyword dictionary cannot be extracted. With this method, even new words never drop out of the condensed text, so no omission arises from a word being new.

【０１４０】また、この凝縮本文検索は、サーチエンジ
ン１１０６（図１０）を用いて行われ、この後この凝縮
本文検索の結果絞り込まれた文書について、該当する本
文データをサーチし最後の複合条件による検索を行うこ
とになる。すなわち、本文サーチではテキスト本体をス
キャンしなければ判定ができない近傍条件と文脈条件の
判定処理を行いながら検索をすることになる。The condensed text search is performed using the search engine 1106 (FIG. 10). Afterwards, for the documents narrowed down by the condensed text search, the corresponding text-body data is searched and the final search by the compound condition is carried out. That is, the text-body search evaluates the neighborhood conditions and context conditions, which cannot be judged without scanning the text body itself.

【０１４１】通常、文字成分表と凝縮本文は、本文デー
タと共に集合型磁気ディスク装置１１１０ｃ（図３）に
格納されていて、検索システムの立ち上げ時にそれぞれ
半導体メモリ装置１１１０ａ及びＲＡＭディスク装置１
１１０ｂへローディングされる。検索時には、それぞれ半
導体メモリ装置１１１０ａ及びＲＡＭディスク装置１１
１０ｂから読み出されることになる。また、本文データ
は格納元の集合型磁気ディスク装置１１１０ｃ（図１
３）から直接読み出されて、検索されることになる。Normally, the character component table and the condensed text are stored together with the text data in the collective magnetic disk device 1110c (FIG. 3), and when the search system starts up they are loaded into the semiconductor memory device 1110a and the RAM disk device 1110b, respectively. At search time they are read from the semiconductor memory device 1110a and the RAM disk device 1110b, respectively. The text data, by contrast, is read directly from the collective magnetic disk device 1110c (FIG. 13) in which it is stored, and is then searched.

【０１４２】以上説明したように、事前に「文字成分表
サーチ」と「凝縮本文サーチ」という２段階のプリサー
チを行い、最も時間を要する「本文サーチ」の対象とな
る文書数を予め最小に絞り込んでおくことによって、等
価的に高速なフルテキストサーチが実現できるようにな
る。As described above, a two-stage pre-search of “character component table search” and “condensed text search” is performed in advance, and the number of documents to be subjected to the most time-consuming “text search” is minimized in advance. By narrowing down, equivalently high-speed full-text search can be realized.

【０１４３】本文検索では、テキストデータをスキャン
しなければ判別ができない近傍条件と文脈条件の判別処
理を加えて検索を行うことになる。通常、文字成分表及
び凝縮本文は集合磁気ディスクに格納されているが、シ
ステムの立上時にＲＡＭディスクにロードされ、検索時
にはＲＡＭディスクから読み出される。テキスト本文は
集合磁気ディスク装置２５１０（図２５）から読み出さ
れることになる。In the text search, a search is performed by adding a process of determining a neighborhood condition and a context condition that cannot be determined unless text data is scanned. Normally, the character component table and the condensed text are stored on the collective magnetic disk, but are loaded on the RAM disk when the system starts up, and are read from the RAM disk when searching. The text body is read from the collective magnetic disk drive 2510 (FIG. 25).

【０１４４】このように、事前に２段階のプリサーチを
行い、最も時間を要する本文検索の対象となる文献数を
予め最小に絞り込んでおくことによって、等価的に高速
なフルテキストサーチが実現できることになる。In this way, by performing the two-stage pre-search in advance and narrowing down to a minimum the number of documents subject to the text-body search, which takes the most time, an equivalently high-speed full-text search can be realized.

【０１４５】この３段階検索では、近傍条件検索と文脈
条件検索が指定されなかった場合には、本文をサーチす
る必要がないので、文字成分表サーチと凝縮本文サーチ
だけで検索を終了することができる。すなわち、図２１
に示すように、指定検索条件式中に近傍条件あるいは文
脈条件が含まれない場合には、キーワードが単語として
存在するか否かだけを探索すればよいことになるため、
文字成分表サーチで指定キーワードを構成する文字を含
む文書を抽出し、その結果求められた文書の凝縮本文を
サーチしてキーワードが単語として含まれるもののみを
抽出し、検索を終えることができる。この結果、サーチ
時間が掛かる不要な本文サーチを省略できるため、検索
時間を全体として短縮することが可能となる。In this three-stage search, if neither a neighborhood condition search nor a context condition search is specified, the text body need not be searched, so the search can be completed with only the character component table search and the condensed text search. That is, as shown in FIG. 21, when the specified search condition expression contains no neighborhood condition or context condition, it suffices to determine whether or not each keyword exists as a word. Documents containing the characters that make up the specified keyword are therefore extracted by the character component table search, the condensed texts of the resulting documents are searched to extract only those that contain the keyword as a word, and the search ends there. Since the time-consuming and unnecessary text-body search is omitted, the overall search time is shortened.

【０１４６】また、この３段階の階層検索において、最
初の文字成分表サーチ結果がゼロ件で該当文書がなかっ
た場合には、ここで検索を打ち切ることが可能である。
すなわち、図２２に示すように、近傍条件あるいは文脈
条件が設定されていたとしても、次段の凝縮本文サーチ
とその後の本文サーチを省略することができる。同様
に、凝縮本文サーチ結果件数がゼロ件の場合には、たと
え近傍条件あるいは文脈条件が設定されていたとして
も、次段の本文サーチを省略することが可能である。こ
の結果、入力された検索条件式に応じて最小の時間で検
索処理を済ませることが可能となる。In the three-stage hierarchical search, if the first character component table search result is zero and there is no corresponding document, the search can be terminated here.
That is, as shown in FIG. 22, even if the neighborhood condition or the context condition is set, the next-stage condensed text search and the subsequent text search can be omitted. Similarly, if the number of condensed text search results is zero, the next-stage text search can be omitted even if the neighborhood condition or the context condition is set. As a result, it is possible to complete the search processing in a minimum time according to the input search condition expression.
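
The control flow of the three-stage search with its early exits could be sketched as follows. The three stage functions are passed in as callables because their internals are described elsewhere; the function name, the parameter names, and the use of sets of document numbers are assumptions made for this illustration.

```python
from typing import Callable


def hierarchical_search(
    component_search: Callable[[], set[int]],
    condensed_search: Callable[[set[int]], set[int]],
    body_search: Callable[[set[int]], set[int]],
    needs_body_scan: bool,
) -> set[int]:
    """Run the pre-searches in order and stop as early as the conditions allow.

    needs_body_scan is True when the query contains a neighborhood or
    context condition, which can only be judged on the text body.
    """
    candidates = component_search()
    if not candidates:
        return set()  # zero hits in stage 1: skip both later stages
    candidates = condensed_search(candidates)
    if not candidates or not needs_body_scan:
        return candidates  # zero hits in stage 2, or no condition needing a body scan
    return body_search(candidates)


hits = hierarchical_search(
    component_search=lambda: {1, 2, 3},
    condensed_search=lambda docs: docs - {3},
    body_search=lambda docs: {1},
    needs_body_scan=False,
)
print(sorted(hits))
```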

【０１４７】以上説明した階層型のプリサーチでは、半
導体メモリ上に置いた文字成分表と凝縮本文で絞り込み
を行い、最後に本文を集合磁気ディスク装置から読み出
して検索を行う方式としている。このように凝縮本文を
半導体メモリに置く方式では、半導体メモリを用いる分
検索装置のコストが高くなる。したがって、凝縮本文を
磁気ディスク装置上に置いて検索を行うことにより、半
導体メモリを不要とすることができ、装置のコストを低
く抑えることが可能となる。In the hierarchical pre-search described above, candidates are narrowed down using the character component table and the condensed text held in semiconductor memory, and finally the text body is read from the collective magnetic disk device and searched. With the condensed text held in semiconductor memory in this way, the cost of the search device rises by the cost of that memory. By instead placing the condensed text on the magnetic disk device and searching it there, the semiconductor memory becomes unnecessary and the cost of the device can be kept low.

【０１４８】ただし、文字成分表サーチで絞り込んだ結
果で凝縮本文サーチを行う場合、凝縮本文を集合磁気デ
ィスク装置上から選択的に読み出すことになる。この場
合、比較的小容量の多数のデータをアクセスすることに
なるため、集合磁気ディスク装置からの実効的な読み出
し速度、すなわちスループットは、データの読み出し時
間よりも、むしろシーク時間に大きく影響されることに
なる。したがって、文字成分表サーチの結果件数が多い
場合には、アクセス時間が極めて短い半導体メモリ上に
凝縮本文を置いた場合に比較して、凝縮本文サーチ時間
が極めて大きくなることになる。このような場合には、
凝縮本文を選択的に拾い読みするより、全件を１ファイ
ルとしてまとめ読みする方がシーク回数を減少させるこ
とができるため、はるかに短時間で読み出しを行うこと
が可能となる。However, when the condensed text search is performed on the results narrowed down by the character component table search, the condensed texts are read selectively from the collective magnetic disk device. Since this means accessing a large number of relatively small pieces of data, the effective read speed from the collective magnetic disk device, that is, the throughput, is governed by seek time rather than by data transfer time. Consequently, when the character component table search returns many results, the condensed text search takes far longer than it would with the condensed text held in semiconductor memory, whose access time is extremely short. In such a case, reading all the condensed texts together as a single file, rather than reading selected ones piecemeal, reduces the number of seeks and so completes the read in a much shorter time.

【０１４９】したがって、検索装置のコストを低減する
ために、凝縮本文を半導体メモリではなく磁気ディスク
装置上に置いたまま検索する場合、図２０Ａに示すよう
な手順で検索を行うことによって、検索速度を大きく落
とすことなく検索を行うことが可能となる。すなわち、
文字成分表サーチの結果件数が所定件数よりも多い場合
には、この文字成分表サーチの検索結果を無視して、新
たに凝縮本文を全件集合磁気ディスク装置から読み出し
て指定キーワードの存在を検索する。もし、文字成分表
サーチの結果件数が所定件数よりも少ない場合には、集
合磁気ディスク装置上の該当凝縮本文を選択的に読み出
して凝縮本文サーチを行う。Therefore, when the condensed text is kept on the magnetic disk device rather than in semiconductor memory in order to reduce the cost of the search device, following the procedure shown in FIG. 20A makes it possible to search without a large loss of search speed. That is, if the number of results of the character component table search exceeds a predetermined number, the result of the character component table search is ignored, and all the condensed texts are read from the collective magnetic disk device and searched for the specified keyword. If the number of results is smaller than the predetermined number, only the corresponding condensed texts on the collective magnetic disk device are read selectively and the condensed text search is performed on them.

【０１５０】この場合の所定件数とは、凝縮本文をこの
所定件数分選択的に読み出す時間と、凝縮本文を全件一
つのファイルとして連続的に読み出す時間が等しくなる
ような読み出し件数のことである。また、この場合も当
然凝縮本文サーチ結果件数がゼロ件の場合には、近傍条
件及び文脈条件の設定の有無にかかわらず、ここで検索
処理を打ち切ることが可能である。In this case, the predetermined number is the number of readouts in which the time for selectively reading the condensed texts by the predetermined number is equal to the time for continuously reading out the condensed texts as a single file. . Also, in this case, if the number of condensed text search results is zero, the search process can be terminated here regardless of whether the neighborhood condition and the context condition are set.
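
To make the break-even definition concrete, the sketch below computes such a predetermined number under a deliberately simple cost model: each selective read costs one seek plus transfer time, and a bulk read costs one seek plus the transfer time of the whole file. The cost model, the function names, and all numeric values are assumptions for illustration; the text gives no formula or device parameters.

```python
def break_even_count(n_docs: int, avg_bytes: float, seek_s: float, bytes_per_s: float) -> float:
    """Number of selective reads whose total time equals one bulk read of all documents."""
    per_doc_time = seek_s + avg_bytes / bytes_per_s       # one seek per selected document
    bulk_time = seek_s + n_docs * avg_bytes / bytes_per_s  # one seek, then a sequential read
    return bulk_time / per_doc_time


def choose_read_mode(n_candidates: int, threshold: float) -> str:
    """Bulk-read everything when the candidate count exceeds the threshold."""
    return "bulk" if n_candidates > threshold else "selective"


# Hypothetical figures: 100,000 condensed texts of 500 bytes, 20 ms seek, 1 MB/s transfer.
threshold = break_even_count(n_docs=100_000, avg_bytes=500, seek_s=0.02, bytes_per_s=1_000_000)
print(round(threshold))
print(choose_read_mode(5_000, threshold), choose_read_mode(300, threshold))
```

Rotational latency is folded into the seek time here for simplicity. A real device would need measured values.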

【０１５１】また、本文データの容量が小さい場合に
は、一般的に冗長な文章が少ないため、凝縮本文の大き
な圧縮率は望めない。したがって、ファイルの読み出し
時間においてディスクのシーク時間と回転待ち時間が支
配的なことを考慮すれば、凝縮本文の読み出し時間と本
文の読み出し時間に大きな差が生じなくなることにな
る。すなわち、文字成分表サーチの結果件数が所定件数
よりも少ない場合には、図２２Ｂに示すように集合型磁
気ディスク装置上の該当本文データを選択的に読み出し
て本文サーチを行う方が効率的になる。つまり、最初の
文字成分表サーチの結果件数が所定件数よりも多い場合
には、この文字成分表サーチの検索結果を無視して、新
たに凝縮本文を全件集合型磁気ディスク装置から読み出
して指定キーワードの存在を検索する。この場合、当該
凝縮本文サーチ結果件数がゼロ件の場合には、近傍条件
及び文脈条件の設定の有無にかかわらず、ここで検索処
理を打ち切る。ゼロ件でない場合には、条件式中に近傍
条件あるいは文脈条件が設定されているかを見て、もし
設定されているときには本文サーチを行うことになる。
一方、文字成分表サーチの結果件数が所定件数よりも少
ない場合には、集合型磁気ディスク装置上の該当本文を
選択的に読み出して近傍条件及び文脈条件を含めて本文
サーチを行うことになる。このような検索手順を踏むこ
とによって、文書データの平均容量が小さい場合には、
さらに効率的な検索が行えるようになる。When the text data of each document is small, there is generally little redundant text, so a high compression ratio cannot be expected from the condensed text. Considering that disk seek time and rotational latency dominate the file read time, the difference between the time to read a condensed text and the time to read the text body then becomes small. That is, when the number of results of the character component table search is smaller than the predetermined number, it is more efficient to selectively read the corresponding text data on the collective magnetic disk device and perform the text-body search directly, as shown in FIG. 22B. Specifically, when the number of results of the first character component table search is larger than the predetermined number, the result of the character component table search is ignored, and all the condensed texts are read from the collective magnetic disk device and searched for the specified keyword. If the number of condensed text search results is then zero, the search processing is terminated at this point regardless of whether a neighborhood condition or context condition is set. If it is not zero, the conditional expression is checked for a neighborhood condition or context condition, and if one is set, the text-body search is performed. When, on the other hand, the number of results of the character component table search is smaller than the predetermined number, the corresponding text bodies on the collective magnetic disk device are read selectively and the text-body search is performed including the neighborhood and context conditions. Following this procedure makes the search still more efficient when the average size of the document data is small.

【０１５２】このように、文字成分表サーチの結果件数
に応じて凝縮本文の読み出し方法を変えることによっ
て、凝縮本文を集合磁気ディスク装置上に置いても、検
索時間を大幅に増やすことなく検索処理ができるように
なるため、低価格で高性能な全文検索装置の提供が可能
となる。As described above, by changing the method of reading the condensed text according to the number of results of the character component table search, even if the condensed text is placed on the collective magnetic disk drive, the search processing can be performed without significantly increasing the search time. Thus, a low-cost and high-performance full-text search device can be provided.

【０１５３】次に本発明による同義語展開及び異表記展
開の変形例について説明する。Next, modified examples of the synonym expansion and the different notation expansion according to the present invention will be described.

【０１５４】図２８は本発明の実施例の構成を示すブロ
ック図である。本実施例は、コンソール２８００、対話
制御部２８０１、異表記展開処理部２８０２及び２８０
５、同義語展開処理部２８０３、同義語辞書ファイル２
８０４、文字列統合列部２８０６、文字列検索処理部２
８０７、テキストデータベース２８０８から構成されて
いる。コンソール２８００から入力された検索文字列４
０は、対話制御部２８０１を介して異表記展開処理部２
８０２へ送られる。異表記展開処理部２８０２で展開し
た文字列群４１は、同義語展開処理部２８０３へ送られ
ると共に、文字列統合処理部２８０６へも送られる。同
義語展開処理部２８０３では、同義語辞書２８０４を参
照し送られてきた文字列群４１の各文字列と辞書の見出
しとのマッチングをとり、一致した文字列が存在すれ
ば、同義語展開モード制御信号２８１０に従い、辞書に
記載してある見出しに対応する言葉を出力し、異表記展
開処理部２８０５へ文字列群４２を送る。異表記展開処
理部２８０５では、同義語展開された文字列４２に対
し、異表記展開処理部２８０３と全く同じ処理方法で異
表記展開して、文字列群４３を文字列統合処理部２８０
６へ出力する。文字列統合処理部２８０６は、異表記展
開処理部２８０２と２８０５から受け取った文字列群４
１と文字列群４３を、一つの文字列群４４にまとめて文
字列検索部２８０７へ出力する。文字列検索部２８０７
は、受け取った文字列群４４のうちのいずれかの文字列
が存在するものをテキストＤＢから検索して、ヒットし
た文書の識別子情報などを、対話制御部２８０１へ検索
結果４５として出力する。対話制御部２８０１は、この
検索結果４５を受けて、検索結果件数４６や、テキスト
情報４６を適宜コンソール２８００へ出力する。FIG. 28 is a block diagram showing the configuration of an embodiment of the present invention. This embodiment comprises a console 2800, a dialogue control unit 2801, different notation development processing units 2802 and 2805, a synonym expansion processing unit 2803, a synonym dictionary file 2804, a character string integration processing unit 2806, a character string search processing unit 2807, and a text database 2808. The search character string 40 input from the console 2800 is sent to the different notation development processing unit 2802 via the dialogue control unit 2801. The character string group 41 expanded by the different notation development processing unit 2802 is sent to the synonym expansion processing unit 2803 and also to the character string integration processing unit 2806. The synonym expansion processing unit 2803 refers to the synonym dictionary 2804 and matches each character string of the received group 41 against the dictionary headwords. If a matching string exists, it outputs the words listed for that headword in the dictionary in accordance with the synonym expansion mode control signal 2810, and sends the resulting character string group 42 to the different notation development processing unit 2805. The different notation development processing unit 2805 applies different notation development to the synonym-expanded character string group 42 by exactly the same processing method as the different notation development processing unit 2802, and outputs the character string group 43 to the character string integration processing unit 2806. The character string integration processing unit 2806 combines the character string groups 41 and 43 received from the different notation development processing units 2802 and 2805 into a single character string group 44 and outputs it to the character string search unit 2807. The character string search unit 2807 searches the text DB for documents containing any of the character strings in the received group 44, and outputs the identifier information and the like of the documents hit to the dialogue control unit 2801 as the search result 45. On receiving the search result 45, the dialogue control unit 2801 outputs the number of search results 46 and the text information 46 to the console 2800 as appropriate.
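
The data flow between the units (40 to 41, 41 to 42, 42 to 43, then 41 and 43 merged into 44) can be sketched as below. The variant function and the synonym dictionary are toy stand-ins chosen for this example, the synonym expansion mode control signal 2810 is not modeled, and removing duplicates when merging is an assumption: the text says only that the two groups are combined.

```python
from typing import Callable


def expand_query(
    term: str,
    variants: Callable[[str], list[str]],
    synonyms: dict[str, list[str]],
) -> list[str]:
    """Variant expansion, synonym expansion, variant expansion again, then merge."""
    group41 = variants(term)                                         # unit 2802
    group42 = [syn for s in group41 for syn in synonyms.get(s, [])]  # unit 2803
    group43 = [v for s in group42 for v in variants(s)]              # unit 2805
    group44: list[str] = []                                          # unit 2806
    for s in group41 + group43:
        if s not in group44:
            group44.append(s)
    return group44


def toy_variants(s: str) -> list[str]:
    """Hypothetical variant rule: also offer the capitalized spelling."""
    capitalized = s.capitalize()
    return [s] if capitalized == s else [s, capitalized]


toy_synonyms = {"computer": ["calculator"]}  # hypothetical dictionary entry
print(expand_query("computer", toy_variants, toy_synonyms))
```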

【０１５５】異表記展開処理部２８０２と２８０５は全
く同一のものである。文字列検索部２８０７は公知の技
術で、例えば特開昭６３−３１１５３０を用いて実現で
きる。テキストＤＢ４０８は、文字コード情報であれ
ば、新聞記事データでも、ワープロで作成した文書のデ
ータでも、電子ファイリングシステムの書誌事項データ
でも構わない。The different notation development processing units 2802 and 2805 are identical. The character string search unit 2807 can be realized with a known technique, for example that of Japanese Patent Laid-Open No. 63-311530. The text DB 408 may hold newspaper article data, data of documents created with a word processor, or bibliographic data of an electronic filing system, as long as it is character code information.

【０１５６】以下、異表記展開処理部２８０２、２８０
５と同義語展開処理部２８０３の構成作用について詳細
に説明する。The configuration and operation of the different notation development processing units 2802 and 2805 and the synonym expansion processing unit 2803 are described in detail below.

【０１５７】まず、異表記展開処理の概要を図２９を用
いて説明する。ここでは、最初に入力文字列２９０１を
異なる字種の間で切断し、部分文字列へ分割する。First, the outline of the different notation development process will be described with reference to FIG. Here, first, the input character string 2901 is cut between different character types and divided into partial character strings.

【０１５８】例えば、入力文字列２９０１“卓上型イン
タフォーン”の場合には、漢字文字列２９０２“卓上
型”と、カタカナ文字列２９０３“インタフォーン”へ
文字種に従って分割する。次に、分割した文字列毎に異
表記展開を行ない、漢字異表記文字列リスト２９０４、
カタカナ異表記文字列リスト２９０５を得る。その後、
漢字異表記文字列リスト２９０４及びカタカナ異表記文
字列リスト２９０５をそれぞれ展開し、２つの文字種で
別々に展開した文字列群を１つに組み合せて最終結果２
９０６として出力する。For example, the input character string 2901 "卓上型インタフォーン" (desktop interphone) is divided according to character type into the kanji string 2902 "卓上型" and the katakana string 2903 "インタフォーン". Next, different notation development is performed on each of the divided strings to obtain the kanji different-notation string list 2904 and the katakana different-notation string list 2905. The lists 2904 and 2905 are then each expanded, and the character string groups expanded separately for the two character types are combined into one and output as the final result 2906.
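
One way to read the final combination step is as a Cartesian product of the per-segment variant lists, as in the sketch below. The variant lists shown are invented placeholders, since the actual contents of lists 2904 and 2905 appear only in FIG. 29, and the product interpretation itself is an assumption.

```python
from itertools import product


def combine_variants(variant_lists: list[list[str]]) -> list[str]:
    """Join one variant from each segment, in segment order, for every combination."""
    return ["".join(parts) for parts in product(*variant_lists)]


# Hypothetical variant lists for the two segments of "卓上型インタフォーン".
kanji_variants = ["卓上型"]
katakana_variants = ["インタフォーン", "インターフォン", "インタホン"]
for s in combine_variants([kanji_variants, katakana_variants]):
    print(s)
```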

【０１５９】次に、図３０を用いて異表記展開の処理内
容を詳細に説明する。図３０は、本発明における異表記
展開手段の実施例を示すブロック図である。本実施例の
構成は、文字種分割・選別部３００１、ローマ字判別部
３００２、ローマ字カナ変換部３００３、漢字異表記展
開部３００４、カタカナ異表記展開部３００５、アルフ
ァベット異表記展開部３００６、カナローマ字変換部３
００７、分割文字列統合部３０１０よりなる。Next, the different notation development process is described in detail with reference to FIG. 30. FIG. 30 is a block diagram showing an embodiment of the different notation development means of the present invention. This embodiment comprises a character type division/selection unit 3001, a Roman character discrimination unit 3002, a Roman character/kana conversion unit 3003, a kanji different notation development unit 3004, a katakana different notation development unit 3005, an alphabet different notation development unit 3006, a kana/Roman character conversion unit 3007, and a divided character string integration unit 3010.

【０１６０】異表記展開処理部２８０２あるいは２８０
５への入力文字列３０２０は、まず文字種分割・選別部
３００１へ送られる。文字種分割・選別部３００１では
入力文字列３０２０を、上述したように漢字及びひらが
な文字列３０３１、カタカナ文字列３０３２、アルファ
ベット文字列３０３３、それ以外の文字列３０３０の４
種類の部分文字列に分割する。分割した部分文字列をそ
れぞれの文字種に従って分類し、別々の展開処理を施
す。以下文字種別に、その展開処理の概要を示す。The input character string 3020 to the different notation development processing unit 2802 or 2805 is first sent to the character type division/selection unit 3001. As described above, the character type division/selection unit 3001 divides the input character string 3020 into four kinds of partial character strings: kanji and hiragana strings 3031, katakana strings 3032, alphabet strings 3033, and other strings 3030. The divided partial strings are classified by character type and subjected to separate development processing. The development processing for each character type is outlined below.

【０１６１】（１）漢字・ひらがな・カタカナ・アルフ
ァベット以外の文字列この文字種には数字、記号、特殊文字あるいは外字コー
ド等が当たる。本実施例ではこれらの文字種を、展開せ
ずに入力した文字列３０３０をそのまま分割文字列統合
部３０１０へ出力している。しかし数字に関して英数字
を漢数字に変換したり、英記号に関して記号“・”を
“−”や“／”に展開することなども考えられる。(1) Character strings other than kanji, hiragana, katakana, and alphabetic characters. This character type covers numerals, symbols, special characters, user-defined character codes, and the like. In this embodiment, these character types are not developed, and the input character string 3030 is output as it is to the divided character string integration unit 3010. However, it is also conceivable to convert Arabic numerals into kanji numerals, or to expand the symbol "・" into "−" or "／".

[0162] (2) Kanji and hiragana character strings
For these character types, the kanji variant-notation expansion unit 3004 expands the character string 3031 into variants with respect to new and old kanji forms and okurigana. The output character strings 3041 of the kanji variant-notation expansion unit 3004 are sent to the divided character string integration unit 3010.

[0163] (3) Katakana character strings
For this character type, the katakana variant-notation expansion unit 3005 expands the character string 3032 into variants with respect to the notation of similar syllables. The expanded character strings 3042 are sent to the divided character string integration unit 3010 and, at the same time, to the kana-to-romaji conversion unit 3007. The character strings 3053 converted into romaji by the kana-to-romaji conversion unit 3007 are expanded with respect to upper and lower case by the alphabet variant-notation expansion unit 3006 and sent to the divided character string integration unit 3010 as character strings 3043.

[0164] (4) Alphabet character strings
A string of this character type is either a romaji (romanized) rendering of Japanese or a word in the original spelling of a foreign language.

[0165] Here, the romaji discrimination unit 3002 first determines whether the character string 3033 is romaji or a foreign-language word. The criterion is the romaji notation system: if the sequence of alphabetic characters conforms to romaji notation, the string is judged to be romaji; if it cannot be interpreted as romaji, it is judged to be a foreign-language word. This determination can also be performed by the romaji-to-kana conversion unit 3003; that is, if romaji-to-kana conversion succeeds, the string is judged to be romaji, and otherwise it is judged to be a foreign-language word. Besides the method of this embodiment, other methods, such as consulting a foreign-language dictionary, can also be used for this determination.

[0166] The character string 3051 judged to be romaji by the romaji discrimination unit 3002 is sent to the romaji-to-kana conversion unit 3003, where it is converted into a katakana character string 3052. The katakana variant-notation expansion unit 3005 then expands this string with respect to similar syllables to obtain the character string group 3042. The subsequent processing of the character string group 3042 is the same as the processing of katakana character strings in (3). That is, the expanded character string group 3042 is sent to the divided character string integration unit 3010 and also to the kana-to-romaji conversion unit 3007. The kana-to-romaji conversion unit 3007 converts each string of the katakana character string group 3042 into romaji and sends the result to the alphabet variant-notation expansion unit 3006 as the romaji character string group 3053. The alphabet variant-notation expansion unit 3006 expands the romaji character string group 3053 with respect to upper and lower case, and the result is sent to the divided character string integration unit 3010.

[0167] On the other hand, the character string 3034 judged to be a foreign-language word by the romaji discrimination unit 3002 is sent to the alphabet variant-notation expansion unit 3006 without romaji-to-kana conversion, and its output character strings 3043 are sent to the divided character string integration unit 3010.

[0168] The flow of the variant-notation expansion processing has been described above. Next, each processing block in this processing is described in detail.

[0169] First, the processing of the katakana variant-notation expansion unit 3005 is described. FIG. 31 illustrates the processing in the variant-notation expansion unit, using a katakana character string as an example. Here the input character string is “インタフォーン”. Variant-notation expansion is performed by referring to conversion rules: whenever a substring of the input character string is eligible for conversion, it is replaced with other notations according to the applicable conversion rule. The table format of the conversion rules is also shown in FIG. 31. The conversion rule table consists of a heading part and an expansion character string list part. If the input character string contains a substring that matches a heading in the conversion rule table, that substring is replaced in turn with each of the variant strings listed in the expansion character string list part.

[0170] The search for heading strings proceeds from the beginning of the input character string using the longest-match method. That is, as shown in FIG. 31, when both “フォー” and “フォ” match substrings of the input and both appear as headings, the conversion rule of the longer heading, “フォー”, is applied.

[0171] The heading search and the replacement with expansion character string lists are explained using the example of FIG. 31. In the heading search, a search pointer is set in order to match substrings of the input character string against heading strings. During matching, the search pointer is moved along the input, and the string that begins at the search pointer is compared with the heading strings. First, the search pointer is set to the first character of the input character string, so in this example the heading search starts from the character “イ”. Since no heading matches, the search pointer is advanced by one character and the search is repeated for the string beginning with “ン”. Again no heading matches, so the search pointer is advanced by one more character and the search is repeated from the character “タ”. This time the heading “タ” is found, so the “タ” portion is replaced with “タ” and “ター”, as listed in the expansion character string list part. The search pointer is then advanced by the length of the heading “タ”, that is, by one character.
Next, the heading search from the character “フ” yields the matching headings “フォー” and “フォ”. Two headings match this time; when more than one heading matches, the conversion rule with the longest heading is adopted in accordance with the longest-match method. In this example the heading “フォー” is longer than “フォ”, so “フォー” is adopted as the conversion rule, and the substring “フォー” of the input character string is replaced with “フォー”, “フォ”, “ホー”, and “ホ”, as listed in the expansion character string list part. The search pointer is then advanced by the length of the heading “フォー”, that is, by three characters. Finally, the heading search is performed from the last character of the input, “ン”; no heading matches, so “ン” is left as is without expansion. The search pointer has now reached the end of the input character string, and the processing ends.

[0172] The processing above produces a character string containing expansion lists, “イン（タ，ター）（フォー，フォ，ホー，ホ）ン”, and the final variant strings are obtained by combining the entries of these expansion lists. In this example the “タ” portion expands two ways and the “フォー” portion four ways, so the result is 2 × 4 = 8 character strings:
1) “インタフォーン”
2) “インタフォン”
3) “インタホーン”
4) “インタホン”
5) “インターフォーン”
6) “インターフォン”
7) “インターホーン”
8) “インターホン”
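The combination step amounts to a Cartesian product over the expansion lists. A minimal Python sketch follows, assuming the string with expansion lists is held as a list of "slots" (a representation chosen here for illustration, not one given in the description); `combine_expansion_lists` is a hypothetical name.

```python
from itertools import product

def combine_expansion_lists(slots):
    """Return every string formed by picking one entry from each slot, in order."""
    return ["".join(choice) for choice in product(*slots)]

# "イン(タ,ター)(フォー,フォ,ホー,ホ)ン" written as slots; fixed text is a
# slot with a single entry.
slots = [["イン"], ["タ", "ター"], ["フォー", "フォ", "ホー", "ホ"], ["ン"]]
variants = combine_expansion_lists(slots)
print(len(variants))  # 8
```

Because `itertools.product` varies the last slot fastest, the variants come out in the same order as the eight strings listed above.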

[0173] The heading search and replacement processing described above is explained with the PAD (problem analysis diagram) of FIG. 32. First, the search pointer is set to the beginning of the input character string. Next, the conversion rules are searched for a heading that matches the string beginning at the current search pointer. If no heading matches, the search pointer is advanced by one character, and the search is repeated for the string beginning at the new position. If one or more headings match, the longest of them is adopted and the matching portion is replaced with its expansion character string list; after the replacement, the search pointer is advanced by the number of characters in the matched heading. The heading search and the replacement with expansion lists are repeated until the search pointer reaches the end of the input character string.
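The loop described for FIG. 32 can be sketched in Python as below. This is a sketch under stated assumptions, not the patented implementation: the rule table is a plain dictionary holding only the three rules mentioned in the text, the result is returned as a list of slots (one list of alternatives per matched or unmatched portion), and the names `RULES` and `expand_to_slots` are hypothetical.

```python
# Heading -> expansion character string list (only the rules named in the text).
RULES = {
    "タ": ["タ", "ター"],
    "フォー": ["フォー", "フォ", "ホー", "ホ"],
    "フォ": ["フォー", "フォ", "ホー", "ホ"],
}

def expand_to_slots(text, rules):
    """Scan text left to right, replacing the longest matching heading at each
    position with its expansion list; unmatched characters pass through."""
    max_len = max(map(len, rules))
    slots = []
    pos = 0  # the search pointer
    while pos < len(text):
        # Try the longest candidate first (longest-match method).
        for length in range(min(max_len, len(text) - pos), 0, -1):
            head = text[pos:pos + length]
            if head in rules:
                slots.append(list(rules[head]))
                pos += length  # advance by the length of the matched heading
                break
        else:
            slots.append([text[pos]])  # no heading matches here
            pos += 1
    return slots

slots = expand_to_slots("インタフォーン", RULES)
```

For the example input, the slots correspond to “イ”, “ン”, (“タ”, “ター”), (“フォー”, “フォ”, “ホー”, “ホ”), and “ン”.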

[0174] Another embodiment of the heading search method for the variant-notation expansion described so far is explained with reference to FIG. 33. In this embodiment an automaton is used for the heading search. The procedure is as follows. First, an automaton such as the one shown in the figure is generated from the conversion rules. In the variant-notation expansion processing, the input character string is fed to this automaton one character at a time, and the character string containing expansion lists, as described for FIG. 31, is obtained from the automaton's output.

[0175] The operation is described concretely below. In FIG. 33, the symbol a denotes an input character, each circle denotes a state of the automaton, and the number inside a circle is its state number. A character on an edge indicates that the state transitions in the direction of the arrow when that character is input. The symbol ‘ ’ denotes any character other than the characters that follow it. The symbol ‘→’ indicates that the character strings that follow it are output. The automaton consists of a state transition table, which controls its operation, and an output table, which describes the output of each state. The state transition table is shown in FIG. 34.

[0176] For each state, the table lists pairs of an input character and the number of the state to which the automaton transitions when that character is input. Transitions to state 0 are omitted; that is, when a character not listed in the state transition table is input in any state, the automaton transitions to state 0. As shown in FIG. 35, the output table lists, for each state of the automaton, the character strings to be output. While the automaton is running, the output table is consulted and the corresponding expansion strings are output only on a transition from a state other than state 0 to state 0. After the output, the character that caused the transition to state 0 is input to the automaton once more, and the automaton transitions again according to the state transition table. When the automaton returns from state 0 to state 0, the input character is output as is.

[0177] The control of state transitions and the output in the automaton method have been described above. Next, the operation is described in detail using a concrete example: the behavior each time one character of the input example of FIG. 33, “インタフォーン”, is input. Initially the automaton is in state 0.

[0178] (1) When the character “イ” is input, no transition from state 0 is registered for it in the state transition table, so “イ” is output as is and the state remains 0.

[0179] (2) When the character “ン” is input, no transition from state 0 is registered for it in the state transition table, so “ン” is output as is and the state remains 0.

[0180] (3) When the character “タ” is input, the state transition table is consulted, state number 6 is read out as the destination from the current state 0, and the state moves to 6.

[0181] (4) When the character “フ” is input, the state transition table gives no destination for “フ” from state 6. Since the current state is not 0, the output table is consulted and the output strings of state 6, “タ” and “ター”, are output. The state then moves to 0. In this new state 0, the input character “フ” is input to the automaton again, and the state moves from state 0 to state 1 according to the state transition table.

[0182] (5) When the character “ォ” is input, the state moves from state 1 to state 2 according to the state transition table.

[0183] (6) When the character “ー” is input, the state moves from state 2 to state 3 according to the state transition table.

[0184] (7) When the character “ン” is input, the state transition table gives no destination state number for “ン” from state 3. Since the current state is not state 0, the output table is consulted and the output strings of state 3, “フォー”, “フォ”, “ホー”, and “ホ”, are output. The state then moves to 0, and the input character “ン” is input to the automaton again. No destination is obtained from the state transition table this time, so the input character “ン” is output as is.

[0185] (8) The last character of the input character string has been reached, so the processing ends. The character string containing expansion lists, “イン（タ，ター）（フォー，フォ，ホー，ホ）ン”, is thus obtained.
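The walkthrough in steps (1) to (8) can be reproduced with a short Python sketch. The tables below contain only the transitions and outputs named in the text (states 1, 2, 3, and 6); they are not the full tables of FIGS. 34 and 35. The final flush when the input ends in a non-zero state is an assumption added here, since the example ends in state 0 and the description does not cover that case. `TRANSITIONS`, `OUTPUTS`, and `run_automaton` are hypothetical names.

```python
# (state, input character) -> next state; missing pairs mean "go to state 0".
TRANSITIONS = {
    (0, "フ"): 1, (1, "ォ"): 2, (2, "ー"): 3,
    (0, "タ"): 6,
}
# state -> strings output when the automaton leaves that state for state 0.
OUTPUTS = {
    1: ["フ"],
    2: ["フォー", "フォ", "ホー", "ホ"],
    3: ["フォー", "フォ", "ホー", "ホ"],
    6: ["タ", "ター"],
}

def run_automaton(text, transitions, outputs):
    """Feed text one character at a time and collect the emitted slots."""
    state = 0
    emitted = []
    for ch in text:
        nxt = transitions.get((state, ch))
        if nxt is None and state != 0:
            # Non-zero state -> state 0: emit that state's output, then
            # feed the same character again from state 0.
            emitted.append(outputs[state])
            state = 0
            nxt = transitions.get((state, ch))
        if nxt is None:
            emitted.append([ch])  # state 0 -> state 0: pass the character through
        else:
            state = nxt
    if state != 0:
        emitted.append(outputs[state])  # assumed end-of-input flush
    return emitted

result = run_automaton("インタフォーン", TRANSITIONS, OUTPUTS)
```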

[0186] Next, the method of generating the automaton for the heading search is described with reference to FIG. 36. The automaton needs to be generated only once, before any input character string actually arrives. FIG. 36 is a PAD showing how the search automaton is generated, that is, how the state transition table and the output table are created; it is explained below. First, the state transition table and the output table are initialized. Then the following processing is repeated, taking the conversion rules one at a time, until all conversion rules have been processed.

[0187] (1) Set the state number to 0.
(2) Build the state transition table while taking characters from the heading string one at a time until its end. That is, the state transition table is consulted, and if a destination for the current character is already registered, the state moves to that destination. If no destination is registered, a new state number is generated and registered in the state transition table, and the current state moves to the new state. In addition, the character string that has led from state 0 to the current state is registered in the output table.
(3) After step (2) has been completed for all characters of the heading string, the expansion character string list of the conversion rule is registered in the output table under the current state number.

[0188] The concrete flow of processing is explained using two conversion rules, [“フォー” → (“フォー”, “フォ”, “ホー”, “ホ”)] and [“フォ” → (“フォー”, “フォ”, “ホー”, “ホ”)].

[0189] First, the heading string “フォー” of the first conversion rule is processed.

[0190] (1) Input of the character “フ”
Since the state transition table has just been initialized, no destination state number is registered. A new state number 1 is therefore generated, and the state moves to 1. The string “フ”, which causes the transition from state 0 to state 1, is registered in the output table as the output of state number 1.

[0191] (2) Input of the character “ォ”
No destination from the current state 1 is defined in the state transition table. A new state number 2 is therefore generated, and the state moves to 2. The string “フォ”, which leads from state 0 to state 2, is registered in the output table as the output of state number 2.

[0192] (3) Input of the character “ー”
No destination from the current state 2 is defined in the state transition table. A new state number 3 is therefore generated, and the state moves to 3. The string “フォー”, which leads from state 0 to state 3, is registered in the output table as the output of state number 3. Because this is the last character of the heading string, the expansion character string list of the conversion rule (“フォー”, “フォ”, “ホー”, “ホ”) is then registered in the output table as the output of state 3, replacing the previously registered output string “フォー”.

[0193] Next, the heading string “フォ” of the second conversion rule is processed. Before processing, the state returns to 0.

[0194] (4) Input of the character “フ”
The previously registered state transition table is consulted, the destination state number 1 is obtained, and the state moves to 1.

[0195] (5) Input of the character “ォ”
The state transition table is consulted, the destination state number 2 is obtained, and the state moves to 2. An output for state 2 is already registered in the output table; however, because this is the last character of the heading string, that output is replaced as follows.

[0196] The string “フォ”, already registered in the output table as the output of state 2, is rewritten to the expansion character string list of the conversion rule (“フォー”, “フォ”, “ホー”, “ホ”).

[0197] By the above processing, an automaton that searches for the two conversion rules is created. Conversion rules other than the two used in this example can be turned into an automaton by exactly the same procedure.
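The construction steps can be sketched in Python as follows, again as an illustration rather than the patented implementation. The dictionary-based tables and the name `build_automaton` are assumptions of this sketch; state numbers are simply allocated in order of creation.

```python
def build_automaton(rules):
    """Build (transitions, outputs) from (heading, expansion list) pairs."""
    transitions = {}  # (state, character) -> next state
    outputs = {}      # state -> list of output strings
    next_state = 1
    for heading, variants in rules:
        state = 0  # step (1): start each rule from state 0
        for i, ch in enumerate(heading):  # step (2): one character at a time
            key = (state, ch)
            if key not in transitions:
                # No destination registered: create a new state and record the
                # prefix that leads from state 0 to it as its output.
                transitions[key] = next_state
                outputs[next_state] = [heading[:i + 1]]
                next_state += 1
            state = transitions[key]
        # Step (3): the state reached at the end of the heading outputs the
        # rule's expansion list, replacing the prefix registered earlier.
        outputs[state] = list(variants)
    return transitions, outputs

variants = ["フォー", "フォ", "ホー", "ホ"]
transitions, outputs = build_automaton([("フォー", variants), ("フォ", variants)])
```

With the two example rules this yields states 1 to 3, where state 1 outputs “フ” and states 2 and 3 both output the four-entry expansion list.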

[0198] FIG. 37 shows the details of the conversion rule table for katakana variant-notation expansion used in the examples above. In addition, the conversion rule table can be created with the help of “Notation of Loanwords” (外来語の表記), a 1954 (Showa 29) report of the National Language Council that set out principles for the notation of similar syllables in katakana strings. That report lists the variant notations of katakana strings and states principles for unifying them; used in reverse, these principles yield conversion rules.

[0199] The variant-notation expansion of katakana strings has been described so far by example, but the variant-notation expansion of kanji strings can be realized by exactly the same processing simply by using a conversion rule table for kanji strings. FIG. 38 shows an example of a conversion rule table for expanding new and old kanji forms, and FIG. 39 shows an example of expansion rules for okurigana.

[0200] The variant-notation expansion rule tables shown in FIGS. 37 to 39 can be extended or modified as needed, so the variant-notation expansion can be tailored to the user's wishes.

[0201] The above are the details of the processing of the kanji variant-notation expansion unit 3004 and the katakana variant-notation expansion unit 3005.

[0202] Next, the variant-notation expansion of romaji is described. In this embodiment, romaji is expanded by first converting the character string input in romaji into a katakana string, expanding the katakana string into its variants, and then converting the results back into romaji by kana-to-romaji conversion. The parts involved in the variant-notation expansion of romaji are therefore the romaji-to-kana conversion unit 3003 and the kana-to-romaji conversion unit 3007 in FIG. 30.

[0203] First, the processing of the romaji-to-kana conversion unit is described. When an alphabet character string is input, romaji-to-kana conversion is performed first. The romaji-to-kana conversion unit 3003 performs the conversion using a correspondence table between romaji and katakana such as the one shown in FIG. 40. In that table, the first record, for example, indicates that the romaji “A” corresponds to the katakana “ア”. For a record that lists several strings in the romaji field, every listed string corresponds to the entry in the katakana field. For example, the romaji “SYA” and “SHA” both correspond to the katakana “シャ”; these are the Kunrei-shiki and Hepburn romanizations, respectively. The romaji-to-kana conversion unit 3003 therefore converts a string into katakana whether it is written in Kunrei-shiki, in Hepburn, or in a mixture of the two. The conversion method is the same as for the kanji and katakana variant-notation expansion described above: the input character string is searched against the romaji strings of the correspondence table by longest match, and each match is replaced in turn with the corresponding katakana string. If no corresponding romaji string is found in the correspondence table, the input character string is judged not to be romaji, and no katakana string is output.
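A minimal Python sketch of this longest-match conversion is given below. The table holds only a few illustrative rows (“A”, “SYA”/“SHA”, and the rows needed for one sample word); the full table of FIG. 40 is not reproduced. ASCII input, the uppercase normalization, the `None` return for non-romaji input, and the names `ROMAJI_TO_KANA` and `romaji_to_katakana` are all assumptions of this sketch.

```python
# A few illustrative rows of a romaji -> katakana table (not the full FIG. 40).
ROMAJI_TO_KANA = {
    "A": "ア",
    "SI": "シ", "SHI": "シ",     # Kunrei-shiki and Hepburn spellings
    "SYA": "シャ", "SHA": "シャ",
    "MO": "モ",
}

def romaji_to_katakana(text, table):
    """Convert text to katakana by longest match, or return None when some
    portion cannot be read as romaji (the string is then treated as foreign)."""
    text = text.upper()
    max_len = max(map(len, table))
    out = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            kana = table.get(text[pos:pos + length])
            if kana is not None:
                out.append(kana)
                pos += length
                break
        else:
            return None  # no table entry matches at this position
    return "".join(out)

mixed = romaji_to_katakana("SISHAMO", ROMAJI_TO_KANA)   # mixed Kunrei/Hepburn
foreign = romaji_to_katakana("SMITH", ROMAJI_TO_KANA)   # not readable as romaji
```

Returning `None` also lets the same routine double as the romaji discrimination test, in line with the remark that unit 3003 can perform that determination.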

[0204] Next, the kana-to-romaji conversion unit 3007 is described. The correspondence table of FIG. 40 is used here as well, without modification. This time, conversely, the input character string is matched against the katakana strings of the correspondence table by longest match, and each match is replaced in turn with romaji. When several romaji notations correspond, as in the “シャ” example above, the matching portion is replaced with a list of substrings. That is, just as in the variant-notation expansion processing, the katakana input is expanded into a character string containing expansion lists, and combining the entries of those lists gives the romaji variant-notation expansion result. Take the katakana input “シシャモ” as an example. Matching the input “シシャモ” against the katakana-romaji correspondence table yields the character string containing expansion lists “(SI, SHI)(SYA, SHA)MO”. Combining the expansion lists therefore gives four romaji variant strings:
1) “SISYAMO”
2) “SISHAMO”
3) “SHISYAMO”
4) “SHISHAMO”
This concludes the description of the variant-notation expansion processing.
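The “シシャモ” example can be checked with the following Python sketch. The table again contains only the rows needed for this one word, and passing unmatched characters through unchanged is an assumption made here; `KANA_TO_ROMAJI` and `katakana_to_romaji_variants` are hypothetical names.

```python
from itertools import product

# Katakana -> all romaji spellings (only the rows needed for the example).
KANA_TO_ROMAJI = {
    "シ": ["SI", "SHI"],
    "シャ": ["SYA", "SHA"],
    "モ": ["MO"],
}

def katakana_to_romaji_variants(text, table):
    """Match katakana by longest match, then combine the romaji alternatives."""
    max_len = max(map(len, table))
    slots = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            romaji = table.get(text[pos:pos + length])
            if romaji is not None:
                slots.append(romaji)
                pos += length
                break
        else:
            slots.append([text[pos]])  # assumed: unmatched characters pass through
            pos += 1
    # One choice per slot, last slot varying fastest.
    return ["".join(choice) for choice in product(*slots)]

variants = katakana_to_romaji_variants("シシャモ", KANA_TO_ROMAJI)
```

Longest match matters here: at the second position “シャ” is matched as a unit rather than as “シ” followed by a stray small “ャ”.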

[0205] In the embodiment of variant-notation expansion described so far, expansion is performed for all of kanji-hiragana strings, katakana strings, romaji strings, and alphabet strings, but these processes can also be performed selectively. That is, the output can be controlled so that the variant-notation expansion result consists of any one of the following kinds of output string, or any mixture of them:
1) kanji-hiragana character strings
2) katakana character strings
3) romaji character strings
4) alphabet character strings
Making variant-notation expansion selectable per character type in this way eliminates unnecessary expansion processing and enables search processing that matches the user's requirements.

【０２０６】The method of controlling the character types of the variant notation expansion result is described below.

【０２０７】As shown in FIG. 41, the output character type is controlled by adding, to the configuration of the embodiment of FIG. 30, a switch a 3008 that controls output of the alphabetic string 3034 and a switch b 3009 that controls output of the katakana string group 3042. Control signal lines 3061, 3062, 3063, 3064, and 3065 are provided to control the outputs of the kanji variant notation expansion unit 3004, the katakana variant notation expansion unit 3005, the kana-to-romaji conversion unit 3007, switch a 3008, and switch b 3009, respectively. These control signals are turned ON or OFF according to the output character type mode set by the user, which realizes variant notation expansion according to the mode. For example, when romaji variant notation expansion is not needed, control signal 3063 of the kana-to-romaji conversion unit 3007 is turned OFF to stop its output. FIG. 42 shows the combinations of control signals for the conversion units, expansion units, and switches in each expansion mode. In the figure, each expansion mode indicates that the following expansion is performed and output: c: kanji and hiragana variant notation expansion; k: katakana variant notation expansion; r: romaji variant notation expansion; a: alphabetic variant notation expansion. A mode written with several letters specifies output of several kinds of variant notation. For example, expansion mode 'cka' outputs the kanji-hiragana, katakana, and alphabetic variants as the expansion result. A circle (○) in the table indicates that the module outputs a string; where there is no circle, the module outputs nothing. For example, in expansion mode 'cka', the romaji-to-kana conversion unit 3003, the kanji variant notation expansion unit 3004, and the katakana variant notation expansion unit 3005 output expanded strings, switch a 3008 and switch b 3009 pass their input strings, and the kana-to-romaji conversion unit 3007 outputs nothing.

【０２０８】Finally, synonym expansion processing is described.

【０２０９】The synonym expansion processing unit 2803 has a synonym dictionary such as the one shown in FIG. 43. In the figure, the record number is a serial number assigned to each headword string in the dictionary. Equivalent terms, broader terms, narrower terms, and related terms are defined for each headword. The numbers entered in the equivalent-term, broader-term, narrower-term, and related-term fields are all record numbers of the same dictionary. For example, the headword “計算機” (computer) has record numbers 2 and 3 as equivalent terms, that is, “コンピュータ” and “情報処理装置” (information processing device). The user can set, through the synonym expansion mode control signal 2810 of FIG. 28, which fields of the dictionary are used for expansion. The available modes are: u: expansion using equivalent terms; b: expansion using broader terms; n: expansion using narrower terms; r: expansion using related terms. Expansion that combines any of the modes u, b, n, and r is also possible.

【０２１０】Synonym expansion is performed by searching the input string of the synonym expansion unit for dictionary headwords. As in the search for the headword strings of conversion rules in variant notation expansion, headwords are searched by longest match from the beginning of the input string. For equivalent-term expansion, each matched substring of the input string is replaced in turn by a list of equivalent-term strings; the matched headword itself is also included in the list. For broader-term, narrower-term, and related-term expansion, the replacement output is produced only when the input string matches the headword completely. A partial match within the input string does not trigger expansion, because replacing only part of a string with a broader, narrower, or related term would produce meaningless words.

【０２１１】Equivalent-term expansion is explained with an example.

【０２１２】Suppose the input string is “大型計算機” (large computer). Searching the headwords of the synonym dictionary matches the headword “計算機” starting at the third character of the input string. From record number 1, “計算機”, the equivalent terms “コンピュータ” (record number 2) and “情報処理装置” (record number 3) are obtained, giving the equivalent-term list (“計算機”, “コンピュータ”, “情報処理装置”). Replacing the matched part of the input string with this list yields, as in variant notation expansion, a string containing an expansion list: “大型(計算機, コンピュータ, 情報処理装置)”. Combining the expansion lists (here there is only one) gives three equivalent-term strings: (1) “大型計算機”, (2) “大型コンピュータ”, (3) “大型情報処理装置”.

【０２１３】Next, broader-term expansion is explained with an example.

【０２１４】When the input string is “計算機”, the headword search of the synonym dictionary finds that the input string matches the headword “計算機” completely. The broader term “電子機器” (electronic equipment) of record number 4 is therefore output. Here there is only one broader term, but there may of course be several; when several broader terms exist, they are output as a list in the manner described above.

【０２１５】Narrower terms and related terms are processed in exactly the same way as broader terms. If the dictionary contains no matching string during synonym expansion, the synonym expansion processing unit 2803 outputs no string at all.
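The two matching rules above (longest partial match for equivalent terms, complete match only for the other relations) can be sketched as follows. This is a minimal illustration under stated assumptions: the four-record dictionary merely imitates the example of FIG. 43, and the field names and function names are hypothetical.

```python
from itertools import product

# Hypothetical records modelled on the FIG. 43 example; the layout is an assumption.
DICTIONARY = {
    1: {"head": "計算機", "equivalent": [2, 3], "broader": [4]},
    2: {"head": "コンピュータ", "equivalent": [1, 3], "broader": [4]},
    3: {"head": "情報処理装置", "equivalent": [1, 2], "broader": [4]},
    4: {"head": "電子機器", "equivalent": [], "broader": []},
}
BY_HEAD = {record["head"]: record for record in DICTIONARY.values()}

def expand_equivalent(text):
    """Replace each longest-matching headword by [headword + equivalent terms]."""
    longest = max(len(head) for head in BY_HEAD)
    segments, pos, matched = [], 0, False
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            record = BY_HEAD.get(text[pos:pos + size])
            if record is not None:
                terms = [record["head"]]
                terms += [DICTIONARY[n]["head"] for n in record["equivalent"]]
                segments.append(terms)
                pos += size
                matched = True
                break
        else:
            segments.append([text[pos]])  # character outside any headword
            pos += 1
    if not matched:
        return []  # nothing in the dictionary: no output
    return ["".join(parts) for parts in product(*segments)]

def expand_relation(text, relation):
    """Broader/narrower/related expansion: only on a complete match of the input."""
    record = BY_HEAD.get(text)
    if record is None:
        return []
    return [DICTIONARY[n]["head"] for n in record[relation]]

print(expand_equivalent("大型計算機"))
print(expand_relation("計算機", "broader"))
```

Calling `expand_relation("大型計算機", "broader")` returns an empty list, which reflects the rule that a partial match does not trigger broader-term expansion.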

【０２１６】An embodiment of synonym expansion processing has been described above. Because synonym expansion uses a dictionary, the headword search can become slow as the number of dictionary records grows. One solution is to search the dictionary headwords through an index table. FIG. 44 outlines this approach. The synonym dictionary is sorted in advance in alphabetical order of its headwords. Separately from the synonym dictionary, an index table is kept that collects only the first characters of the headwords and records where in the dictionary the headwords beginning with each character start. For example, it shows that headwords beginning with the character “A” start at record number 1. To search for a headword, the index table is consulted first to obtain a record number, and the synonym dictionary is then accessed on that basis. This removes the need to scan every headword in the synonym dictionary and so shortens processing time. For example, when searching for strings beginning with the character “計”, the index table shows that such strings start at record number 501 of the dictionary, so the wasted search over earlier records is avoided. Furthermore, because the headwords are sorted, the search can stop as soon as it reaches a headword with a different first character. For example, when searching for headwords beginning with “計”, once the search reaches the headword “情報処理装置”, which begins with a character other than “計”, no further search is needed.
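A minimal sketch of the first-character index follows, under stated assumptions: the headword list is invented for illustration, sorting uses Python's default code-point order rather than whatever ordering FIG. 44 uses, and positions are 0-based list indices rather than the record numbers of the figure.

```python
# Hypothetical headwords; sorting makes entries with the same first character contiguous.
HEADWORDS = sorted(["計算機", "計算", "情報処理装置", "電子機器", "計器"])

def build_index(headwords):
    """Map each first character to the position of the first headword starting with it."""
    index = {}
    for position, head in enumerate(headwords):
        index.setdefault(head[0], position)
    return index

def longest_head_at(text, pos, headwords, index):
    """Longest headword that matches `text` at `pos`, scanning only the relevant block."""
    start = index.get(text[pos])
    if start is None:
        return None  # no headword begins with this character
    best = None
    for head in headwords[start:]:
        if head[0] != text[pos]:
            break  # reached a different first character: stop scanning
        if text.startswith(head, pos) and (best is None or len(head) > len(best)):
            best = head
    return best

INDEX = build_index(HEADWORDS)
print(longest_head_at("大型計算機", 2, HEADWORDS, INDEX))
```

Only the block of headwords sharing the first character is scanned; both the records before the block and those after it are skipped.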

【０２１７】Record numbers have been used in the index table and in the equivalent-term and other fields of the dictionary, but describing these entries as addresses that identify positions in the dictionary speeds up dictionary access further. An address indicates the position at which the corresponding headword string begins in the dictionary, for example the number of bytes from the start of the dictionary. Because the access position in the dictionary is then identified directly, access is faster than when a record number is specified.

【０２１８】Embodiments of the automaton of the present invention are now described with reference to FIGS. 45 to 55.

【０２１９】For the automaton for variant notation search, the operation is described using the state transition diagram of the set-permitting automaton of FIG. 46. Hereafter, the set-permitting automaton is referred to simply as the automaton. Like the automaton shown in FIG. 5, the automaton of this figure searches for nine search terms in total: “インタフェース” (interface) and its variant notations, such as “インターフェース”, “インタフェイス”, and “インターフェイス”, including the forms written with the alternative long-vowel mark “―”.

【０２２０】These terms can be represented by the compound-word expression string shown at the bottom of FIG. 46. The variant notation of “フェー”, namely (“フェイ”, “フェ”(“ー”, “―”)), is explained below.

【０２２１】First, the pronunciation variant rule replaces “フェー” with “フェイ”, so this part can be written as (“フェイ”, “フェー”).

【０２２２】Next, the long-vowel variant rule replaces the long-vowel mark “ー” of “フェー” with “―”, which can be written as (“ー”, “―”).

【０２２３】Applying the long-vowel variants (“ー”, “―”) to (“フェイ”, “フェー”) gives (“フェイ”, “フェ”(“ー”, “―”)). With this compound-word expression string, the partial strings inside the parentheses are equivalent, so the transitions on the last character of each partial string can be merged into the same destination state.

【０２２４】However, when one partial string is contained within the transitions of another, as with (“タ”(“ー”, “―”), “タ”), the transition on the next character “フ” has two source states: state 3, which is the destination of “タ”, and state 4, which is the destination of “ター” and “タ―”. Transitions from both of these source states are therefore described, with state 5 as their destination.

【０２２５】Merging transitions in this way reduces the number of states substantially, to about one third of that of the automaton of FIG. 5.

【０２２６】The automaton construction method used here is disclosed in the cited reference: A. V. Aho and M. J. Corasick, “Efficient String Matching,” CACM, Vol. 18, No. 6, 1975. This automaton is controlled by the above-mentioned concurrent-state automaton method, which is described in detail below.

【０２２７】Next, the method of controlling the state transitions of the automaton is described. This method controls state transitions without using “fail processing”. That is, instead of performing “fail processing”, it represents the state transitions of the automaton with multiple tokens.
Instead of performing the “fail processing”, the state transition of the automaton is represented by using a plurality of tokens.

【０２２８】In the automaton methods described so far, the state transition diagram is built on the condition that, apart from the initial state, there is exactly one active state, that is, exactly one state marking a match in progress (a state holding a token). As a result, when the input character fails to match partway through a match, the movement of the token becomes discontinuous, and “fail processing” is required.

【０２２９】In this method, a token is generated every time an active state arises, and a token is discarded when a mismatch occurs partway through a match, which makes fail processing unnecessary. Depending on the input string, several tokens may therefore exist on the state transition diagram at the same time. For this reason the method is called the concurrent-state automaton method.

【０２３０】Because this method does not use “fail processing”, it yields a string search device in which the fail destination states need not be computed even when the automaton is constructed.

【０２３１】First, token generation at the start state is described. At the start state, matching is performed every time an input character arrives. When the character matches a transition character, a new token is generated and moved from the start state to the destination state. No token is generated, however, for a transition from the start state back to the start state. Such a transition has no effect and can therefore be omitted.

【０２３２】Next, the operation of the automaton is described for the case where the string “インタフェイス” is input one character at a time.

【０２３３】First, when “イ” is input, the match at the start state succeeds, token T1 is generated, and it moves to state 1. With T1 at state 1, “ン” is input and T1 moves to state 2. At the same time “ン” is also matched at the start state, but it does not match, so no new token is generated. Then, at state 2, “タ” is input and T1 moves to state 3; the match at the start state fails, so no new token is generated. When “フェ” follows, T1 moves state 4 → state 5 → state 6. During this time the matches at the start state fail, so no new token is generated. Next, when “イ” is input, T1 moves from state 6 to state 7. The match at the start state succeeds this time, so a new token T2 is generated and moves to state 1. Next, when “ス” is input, T1 moves from state 7 to state 8. T2 fails to match at state 1 and disappears here. The match at the start state also fails, so no new token is generated. When T1 reaches state 8, the string “インタフェイス” has been found.

【０２３４】By controlling state transitions with multiple tokens in this way, a variant-notation-tolerant search can be realized with an automaton that needs only about one third as many states.

【０２３５】The processing method for fixed-length don't-care search, in which fixed-length don't-care characters are specified in the search term, is described next.

【０２３６】The automaton of FIG. 47 is used. Like FIG. 7, this figure is an automaton for searching for “A?B”. By using the above method, which adopts set transitions, it is realized with about 1/150 of the number of states of FIG. 7.

【０２３７】This automaton is constructed in the same way as the variant notation automaton described above.

【０２３８】The operation of this method is described for the case where the string “AXB” is input.

【０２３９】First, when “A” is input, the match at the start state succeeds, so token T1 is newly generated and moves to state 1. Next, when “X” is input, T1 moves from state 1 to state 2; the match at the start state fails, so no new token is generated. Next, when “B” is input, T1 moves from state 2 to state 3; at the same time the match at the start state fails, so no new token is generated. State 3 is drawn as a double circle, which means that “A?B” has been found.
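A minimal sketch of the same token control for a single pattern containing the fixed-length don't-care character “?” is given below. It assumes, for illustration only, that state k means "the first k pattern characters have matched"; the function name is hypothetical and the state numbering is not that of FIG. 47.

```python
def search_pattern(text, pattern):
    """Return end positions of matches; '?' in `pattern` matches any one character."""
    tokens, ends = set(), []
    for pos, ch in enumerate(text):
        moved = set()
        # State 0 is the start state and is tried on every input character.
        for state in sorted(tokens | {0}):
            if pattern[state] == "?" or pattern[state] == ch:
                if state + 1 == len(pattern):
                    ends.append(pos)      # reached the accepting (double-circle) state
                else:
                    moved.add(state + 1)  # token advances by one state
            # otherwise the token disappears
        tokens = moved
    return ends

print(search_pattern("AXB", "A?B"))
print(search_pattern("AABB", "A?B"))
```

For the input “AABB”, two tokens are alive at once after the second “A”, and both “AAB” and “ABB” are reported.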

【０２４０】By controlling state transitions with multiple tokens in this way, a fixed-length don't-care search can be realized, as with variant-notation-tolerant search, with an automaton that needs only about 1/150 as many states.

【０２４１】The automaton-based processing method for distance-specified search, in which a character distance such as an upper limit, a lower limit, or both upper and lower limits is specified, is described next.

【０２４２】First, the method of realizing distance specification with an upper limit is described. The example used here is an upper-limit specification in which the distance between “A” and “B” is at most four characters. An upper-limit distance specification can be expressed with fixed-length don't-care characters: in this example, it can be expressed by the five keywords “AB”, “A?B”, “A??B”, “A???B”, and “A????B”.

【０２４３】The method of constructing an automaton from these keywords is as follows. First, the automaton for “AB” is constructed; this creates state 0, state 1, and state 7. Next, the automaton for “A?B” is constructed. The second character “?” stands for any single character, so a new destination must be created for transition characters other than the transition from state 1 to state 7 on the character “B”. That is, a transition from state 1 to state 2 is created for the set of characters other than “B”. Then, for the third character “B”, transitions to state 8 are created from state 2 and state 7, which are the destinations of the second character “?”. Constructing “A??B”, “A???B”, and “A????B” in the same way gives the automaton of FIG. 48. This construction method is the same as that for the fixed-length don't-care automaton described above.

【０２４４】Next, the operation of the automaton is described for the case where the string “ABCBBBC” is input one character at a time. First, when “A” is input, the match at the start state succeeds, a token is generated, and it moves to state 1. When “B” is input, the token moves to state 7, which matches “AB”, where “A” and “B” are adjacent. When “C” is then input, the token moves to state 3. Next, when “B” is input, the token moves to state 9, which matches “ABCB”, where “A” and “B” are two characters apart.

【０２４５】Next, when “B” is input, the token moves to state 10, which matches “ABCBB”, where “A” and “B” are three characters apart. Next, when “B” is input, the token moves to state 6, which matches “ABCBBB”, where “A” and “B” are four characters apart. When “C” is then input, the token has no destination from state 6 and disappears.

【０２４６】This shows that, from “ABCBBBC”, the search terms in which “A” and “B” are at most four characters apart, namely “AB”, “ABCB”, “ABCBB”, and “ABCBBB”, have been matched. In other words, “AB”, “A??B”, “A???B”, and “A????B” have been found.
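The expansion of an upper-limit distance into fixed-length don't-care keywords can be checked with a short sketch. For brevity it compares each keyword directly at every position instead of building the merged automaton of FIG. 48, so it illustrates the keyword expansion only; all function names are assumptions.

```python
def upper_limit_keywords(first, second, limit):
    """Keywords meaning 'at most `limit` characters between first and second'."""
    return [first + "?" * gap + second for gap in range(limit + 1)]

def matches_at(text, start, pattern):
    """True if `pattern` ('?' = any one character) matches `text` at `start`."""
    segment = text[start:start + len(pattern)]
    return len(segment) == len(pattern) and all(
        p == "?" or p == c for p, c in zip(pattern, segment)
    )

def find_all(text, patterns):
    """Substrings of `text` matched by any of the patterns, grouped by pattern."""
    return [
        text[start:start + len(pattern)]
        for pattern in patterns
        for start in range(len(text))
        if matches_at(text, start, pattern)
    ]

keywords = upper_limit_keywords("A", "B", 4)
found = find_all("ABCBBBC", keywords)
print(keywords)
print(found)
```

The keyword “A?B” finds nothing in “ABCBBBC” because the character two positions after “A” is “C”, which is why only four of the five keywords produce a match.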

【０２４７】With this automaton as well, controlling the tokens in the same way as for fixed-length don't-care characters realizes distance-specified search with an upper limit.

【０２４８】Next, the method of realizing distance specification with a lower limit is described. The example used here is a lower-limit specification in which the distance between “A” and “B” is two characters or more. With a lower-limit specification the upper limit is infinite, so the specification cannot be expressed with don't-care characters. That is, expressing "the distance between “A” and “B” is two characters or more" with fixed-length don't-care characters gives “A??B”, “A???B”, “A????B”, and so on, and the number of keywords becomes infinite.

【０２４９】A method of solving this problem is described using the automaton of FIG. 49.

【０２５０】First, an automaton is constructed from the keyword that expresses the lower-limit distance with fixed-length don't-care characters. The construction method is the same as for fixed-length don't-care characters. Here the lower-limit distance is 2, so the automaton is constructed with “A??B” as the search term. Next, the state reached after moving by the lower-limit distance (state 3 in this example) is taken as a provisional start point, and an automaton is constructed for the states that follow this start state. This automaton can be constructed using conventional method 1, which describes transitions for all input characters. In this way an automaton can be constructed even when the upper-limit distance is infinite. The construction method is the same as for the upper-limit distance automaton described above.

【０２５１】Next, the operation of the automaton is described for the case where the string “ACDEFB” is input one character at a time. First, when “A” is input, the match at the start state succeeds, a token is generated, and it moves to state 1.

【０２５２】When “C” is input, the token moves to state 2. When “D” is then input, the token moves to state 3 because “D” is a character other than “B”. Next, when “E” is input, the token loops at state 3.

【０２５３】Next, when “F” is input, the token likewise loops at state 3 again. When “B” is then input, the token moves to state 4. State 4 is a double-circle state, which indicates that a string in which “A” and “B” are two or more characters apart has been matched.

【０２５４】That is, “ACDEFB” has been found as a string in which “A” and “B” are two or more characters apart, in this case four characters apart.
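The lower-limit case can be sketched with the same token control. The sketch below is a simplification under stated assumptions: “A” and “B” are single characters, states 1 to `lower` are don't-care states, and a token that has reached the waiting state (state 3 in the example above) stays there and reports a match whenever the second character arrives, instead of moving to a separate accepting state as in FIG. 49. The function name is hypothetical.

```python
def search_lower_limit(text, first, second, lower):
    """End positions where `second` follows `first` with at least `lower` characters between."""
    waiting = lower + 1  # state reached once `lower` characters have been skipped
    tokens, ends = set(), []
    for pos, ch in enumerate(text):
        moved = set()
        if ch == first:
            moved.add(1)  # a new token is generated at the start state
        for state in sorted(tokens):
            if state < waiting:
                moved.add(state + 1)  # don't-care transition: any character advances
            else:
                moved.add(waiting)    # loop in the waiting state
                if ch == second:
                    ends.append(pos)  # the second character closes a match
        tokens = moved
    return ends

print(search_lower_limit("ACDEFB", "A", "B", 2))
```

With `lower` set to 0 the waiting state is state 1, so the same function also covers the case in which any number of characters, including none, may separate the two characters.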

【０２５５】Next, the method of searching with a keyword that contains the variable-length don't-care character “*”, using lower-limit distance specification, is described.

【０２５６】可変長ｄｏｎｔｃａｒｅ文字“＊”は、
下限距離に０を指定した場合の距離指定を用いて実現す
ることができる。すなわち“ＡＢ＊ＣＤ”は“ＡＢ”と
“ＣＤ”の距離が０文字以上の距離という下限距離指定
に置き換えることができる。この場合のオートマトンは
図５０のようになる。このように可変長ｄｏｎｔｃａ
ｒｅ文字を指定した検索も下限距離指定と同様に実現す
ることができる。The variable length don't care character "*"
This can be realized by using a distance designation when 0 is designated as the lower limit distance. In other words, "AB * CD" can be replaced with a lower limit distance designation in which the distance between "AB" and "CD" is a distance of 0 or more characters. The automaton in this case is as shown in FIG. Thus, variable length dont ca
The search specifying the re character can be realized in the same manner as the lower limit distance specification.

【０２５７】オートマトンの作成方法および動作は前述
した下限距離指定のオートマトンの場合と同様である。The method and operation of creating an automaton are the same as in the case of the automaton with the lower limit distance specified above.
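
As a rough illustration of treating “*” as a distance of 0 or more characters, the following Python sketch simulates tokens over a pattern containing literals, “?” and “*”. The function name `find_matches` is hypothetical, and letting a token both remain in the looping state and advance on the same character is an assumption of this sketch; FIG. 50 may arrange the transitions differently.

```python
def find_matches(pattern, text):
    """Token simulation for literals, '?' (any one character) and '*' (any run of characters)."""
    units, loops = [], set()
    for ch in pattern:
        if ch == "*":
            loops.add(len(units))   # the state reached so far gets a self-loop
        else:
            units.append(ch)
    final = len(units)

    tokens, matches = set(), []
    for pos, ch in enumerate(text):
        nxt = set()
        for state in tokens | {0}:                   # the start state is always active
            if state in loops:
                nxt.add(state)                       # distance of 0 or more: stay put
            if state < final and units[state] in ("?", ch):
                nxt.add(state + 1)                   # consume one pattern character
        if final in nxt:
            matches.append(pos)
        tokens = nxt
    return matches


print(find_matches("AB*CD", "ABCD"))    # [3]
print(find_matches("AB*CD", "ABxxCD"))  # [5]
```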

【０２５８】Finally, a method of realizing a distance specification with both upper and lower limits will be described. As an example, take an upper and lower limit distance specification in which the distance between “A” and “B” is at least 2 characters and at most 4 characters. An upper and lower limit distance specification can be expressed with fixed-length don't-care characters. In this example, where the distance between “A” and “B” is at least 2 and at most 4 characters, it can be expressed by the four keywords “A?B”, “A??B”, “A???B”, and “A????B”, and from these the automaton shown in FIG. 51 can be created in the same way as for fixed-length don't-care characters.

【０２５９】The method of creating the automaton and its operation are the same as for the lower limit distance automaton described above.
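
The expansion of a bounded distance into fixed-length don't-care keywords can be sketched in a few lines of Python. The function name `distance_keywords` and its parameters `min_gap` and `max_gap`, which count the don't-care characters placed between the two keywords, are hypothetical. Called with 1 and 4, the sketch yields the four keywords listed in the text.

```python
def distance_keywords(first, second, min_gap, max_gap):
    """Expand a bounded distance into one fixed-length don't-care keyword per gap size."""
    return [first + "?" * gap + second for gap in range(min_gap, max_gap + 1)]


print(distance_keywords("A", "B", 1, 4))  # ['A?B', 'A??B', 'A???B', 'A????B']
```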

【０２６０】A processing method for the one-character-error-tolerant search, which is a search that tolerates a single character error, will be described.

【０２６１】An example in which “ABCD” is specified as the keyword will be described. In this example, automata are created for the following keywords: “ABCD” for the error-free case; “ABC”, “ABD”, “ACD”, and “BCD” for one-character deletion; “A?CD”, “AB?D”, and “ABC?” for one-character substitution; and “A?BCD”, “AB?CD”, “ABC?D”, and “ABCD?” for one-character insertion. Expressed as compound-word expression strings, these are as shown in FIG. 76. Based on this, the automaton shown in FIG. 52 can be created in the same way as for fixed-length don't-care characters.

【０２６２】The method of creating the automaton and its operation are the same as for the upper and lower limit distance automaton described above.
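
The keyword variants for a one-character error can be enumerated mechanically. The Python sketch below is an illustration under assumptions: the function name `one_error_keywords` is hypothetical, and it substitutes at every position, so for “ABCD” it also produces “?BCD”, which is not among the keywords listed in the text.

```python
def one_error_keywords(word):
    """Return the keyword itself plus its one-character deletion, substitution and insertion variants."""
    variants = {word}
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])         # one-character deletion
        variants.add(word[:i] + "?" + word[i + 1:])   # one-character substitution
    for i in range(1, len(word) + 1):
        variants.add(word[:i] + "?" + word[i:])       # one-character insertion after position i
    return variants


listed = {"ABCD", "ABC", "ABD", "ACD", "BCD", "A?CD", "AB?D", "ABC?",
          "A?BCD", "AB?CD", "ABC?D", "ABCD?"}
print(listed <= one_error_keywords("ABCD"))  # True
```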

【０２６３】A processing method for the one-character-transposition-tolerant search, which is a search that tolerates the transposition of one pair of characters, will be described. An example in which “ABCD” is specified as the keyword will be described. In this example, automata are created for “ABCD” as the error-free keyword and for “BACD”, “ACBD”, and “ABDC” as the one-character transpositions. Expressed as compound-word expression strings, these are as shown in FIG. 77. Based on this, the automaton shown in FIG. 53 can be created in the same way as for fixed-length don't-care characters.

【０２６４】The method of creating the automaton and its operation are the same as for the one-character-error-tolerant automaton described above.
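
Transposition variants can be generated by swapping each pair of adjacent characters. The following Python sketch is a minimal illustration; the function name `transposition_keywords` is hypothetical.

```python
def transposition_keywords(word):
    """Return the keyword followed by every variant with one adjacent pair swapped."""
    variants = [word]
    for i in range(len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if swapped not in variants:      # skip duplicates caused by repeated letters
            variants.append(swapped)
    return variants


print(transposition_keywords("ABCD"))  # ['ABCD', 'BACD', 'ACBD', 'ABDC']
```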

【０２６５】As described above, according to the present invention, automata that implement search functions such as variant-notation search, fixed-length don't-care character search, character distance search, variable-length don't-care character search, one-character-error-tolerant search, and one-character-transposition-tolerant search can be created with a small number of states. Consequently, the automaton creation time is shortened and the state transition table is made compact, so a character string search device with a small amount of hardware can be realized.

【０２６６】An embodiment of a character string search device that uses the above character string search method and corresponds to search engine 1106 of FIG. 10 will now be described. FIG. 45 shows the configuration of this embodiment. It comprises a state transition table 220 that outputs a next state number when given a character code and a state number; a collation result table 260 that outputs, when given a state number, a collation result ID indicating whether a match has occurred; a selector 262 that controls token elimination; a comparator 252 that determines, as token elimination control information, whether a token has no destination state; a register 251 that stores the initial state number; and buffers 280 and 281 that store current and next state numbers.

【０２６７】The state transition operation of the automaton in this embodiment is described below. In FIG. 45, a character string 301 read from a given character string storage means is stored in register 211 one character at a time. The character code 302 output from register 211 is input as address information to state transition table 220, which holds the transition table of the automaton according to the present invention. State transition table 220 outputs the destination state number 303 to which a transition should be made, based on the current state number 305 and the character code 302. A next state number equal to the initial state number indicates that no transition is described in the automaton. Therefore, when the next state number is the initial state number, the token must be eliminated. After being stored in register 250, next state number 303 passes through selector 262 and multiplexer 260 and is stored in whichever of buffer 280 or buffer 281 is currently selected. At this point, selector 262 controls whether the token is eliminated. This decision is made by using comparator 252 to check whether next state number 303 differs from the initial state number stored in register 251 (state number 0 in this example).

【０２６８】That is, when next state number 303 is the initial state number, there is no state for the token to move to, so selector 262 does not select next state number 303. Next state number 303 is therefore not sent to multiplexer 260, and the token is eliminated.

【０２６９】Conversely, when next state number 303 is not the initial state number, there is a state for the token to move to, so selector 262 selects next state number 303 and sends it to multiplexer 260, and the token is not eliminated.

【０２７０】In this way, eliminating the token when next state number 303 is the initial state number solves the problem of tokens carrying the initial state number accumulating in buffers 280 and 281, so that redundant tokens fill the buffers and cause them to overflow.
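
The elimination rule can be illustrated in software. The Python sketch below is a simplified model rather than the circuit of FIG. 45: the function name `advance_tokens`, the dictionary used as the transition table, and the small two-character table in the usage lines are all hypothetical. A missing table entry is modelled as a next state equal to the initial state, and such tokens are dropped.

```python
INITIAL_STATE = 0


def advance_tokens(current_states, ch, table):
    """Advance every token by one character, dropping tokens whose next state is the initial state."""
    next_states = [INITIAL_STATE]            # the start state is preset at the head of the buffer
    for state in current_states:
        nxt = table.get((state, ch), INITIAL_STATE)   # no entry means "no transition described"
        if nxt != INITIAL_STATE:             # role of the comparator
            next_states.append(nxt)          # role of the selector: pass the token on
    return next_states


table = {(0, "A"): 1, (1, "B"): 2}           # hypothetical automaton for the keyword "AB"
print(advance_tokens([0], "A", table))       # [0, 1]
print(advance_tokens([0, 1], "X", table))    # [0]
```

Because dropped tokens are never written back, the buffer holds only the preset start state plus the tokens that actually moved.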

【０２７１】Buffers 280 and 281 are used as a double buffer, one for reading data and the other for writing, to achieve high-speed processing. The two buffers can also be merged into one to reduce circuit size, although processing speed then decreases. The start state number is set at the head address of buffers 280 and 281 as an initial value. Next state numbers 303 sent to buffer 280 or 281 are stored starting from the address following the start state. Current state numbers 305 are read sequentially from whichever of buffers 280 and 281 is selected by selector 261, and a read end signal 307 is generated when all have been read. Multiplexer 260 and selector 261 operate in synchronization: when multiplexer 260 selects buffer 280, selector 261 selects buffer 281, and when multiplexer 260 selects buffer 281, selector 261 selects buffer 280. The selection is switched at the time read end signal 307 is generated for the buffer selected by selector 261. State transition table 220 stores the state transition table shown in FIG. 54, which corresponds to the automaton of FIG. 46. Register 211 normally takes in character string data in synchronization with register 250, but it holds the character string data until the read end signal is generated, waiting for the next input until all current state numbers have been read. Search result table 260 stores, for each end state that marks the end of a character string, a predetermined code identifying the corresponding search term. FIG. 55 shows the contents of search result table 260 corresponding to the automaton of FIG. 46. A search term number other than 0 represents a valid search term number. That is, when the search term number corresponding to a state number is other than 0, it is sent as a collation result to the compound condition determination means 1145 of FIG. 10. The search is carried out by repeating the above operation for each character of the input character string, thereby executing the automaton shown in FIG. 46.

【０２７２】The matching operation of this method will be described for the case where, for example, the character string “インタフェイス” (“interface”) is input.

【０２７３】First, the following initialization is performed. The state transition table shown in FIG. 54 is stored in state transition table 220, and the search result table of FIG. 55 is stored in search result table 260. These tables correspond to the automaton of FIG. 46.

【０２７４】In buffers 280 and 281, the start state number 0 is stored at the head address as an initial value. Registers 250 and 251 are loaded with the initial state number 0. Multiplexer 260 selects buffer 281, and selector 261 selects buffer 280. Consequently, next state number 303 is 0, the start state number.

【０２７５】Next, the matching operation based on these initial settings will be described.

【０２７６】First, the first character “イ” is stored in register 211. Next state number 1 is then read from state transition table 220, using character code 302 and current state number 305 as the address, and is stored in register 250. At this time, current state number 305 is 0.

【０２７７】Comparator 252 compares the initial state number 0 stored in register 251 with next state number 303, which is 1, stored in register 250. Since they are not equal, selector 262 selects next state number 303. This indicates that a transition from state 0 to state 1 on the transition character “イ” is described.

【０２７８】Since the search term number for state 1 in search result table 260 is 0, collation result 306 is not output. This indicates that no collation result is stored for state 1.

【０２７９】Since multiplexer 260 has selected buffer 281, next state number 1 is stored in buffer 281 as the second next state number, following the start state number. Since all current state numbers have been read from buffer 280, end signal 307 is generated.

【０２８０】As a result, multiplexer 260 selects buffer 280 and selector 261 selects buffer 281. That is, the two next state numbers in the buffer are used as current state numbers for the transitions on the next character.

【０２８１】Viewed as state transitions of the automaton, these operations are as follows. First, for state 0, the current state number stored in buffer 280, it is checked whether a transition on the character “イ” is described. If no transition is described, processing moves on to the next current state number. If a transition is described, the next state number is stored in buffer 281, and at the same time whether a collation result is stored is checked by examining whether a valid term ID is stored at the address of the collation result table indicated by the next state number. In this case a transition is described, so state 1, the next state number, is stored in buffer 281; and since examining the collation result table shows that no result is stored, no collation result is output.

【０２８２】Next, the second character “ン” is read into register 211. Then 0 is output as the next state number from state transition table 220, addressed by character code 302 and current state number 305, and is stored in register 250.

【０２８３】Comparator 252 compares the initial state number 0 stored in register 251 with the next state number 0 stored in register 250. Since they are equal, selector 262 does not select next state number 303, and next state number 303 is therefore not stored in buffer 280. This indicates that no transition on the character “ン” is described for state 0. Controlling the operation in this way prevents a token from being generated by a transition from the start state to the start state. Next, the second next state number, 1, is read from buffer 281, and next state number 2 is output from state transition table 220, addressed by character code 302 and state number 1, and is stored in register 250. Comparator 252 compares the initial state number 0 stored in register 251 with next state number 303, which is 2, stored in register 250. Since they are not equal, selector 262 selects next state number 303. This indicates that a transition from state 1 to state 2 on the transition character “ン” is described.

【０２８４】Since the search term number for state 2 in search result table 260 is 0, collation result 306 is not output. This indicates that no collation result is stored for state 2.

【０２８５】Since multiplexer 260 has selected buffer 280, next state number 303, which is 2, is stored in buffer 280 as the second next state number, following the start state number.

【０２８６】Since all current state numbers have been read from buffer 281, end signal 307 is generated.

【０２８７】As a result, multiplexer 260 selects buffer 281 and selector 261 selects buffer 280. That is, the two next state numbers in buffer 280 are used as the two current state numbers for the transitions on the next character.

【０２８８】Viewed as state transitions of the automaton, these operations first check, for state 0, the current state number stored in buffer 281, whether a transition on the character “ン” is described. Since none is described in this case, the same check is performed for state 1, the next current state number. Since a transition to state 2 is described, state 2 is stored in buffer 280 as the next state number. Since no collation result is stored for state 2, no collation result is output.

【０２８９】The third character “タ” is processed in the same way as the second character, so the description is abbreviated. First, the matching operation is performed for state 0, the current state number registered in buffer 280. Since no transition is described in this case, the same check is performed for state 2, the next current state number. Since a transition to state 3 is described, state 3 is stored in buffer 281 as the next state number. Since no collation result is stored for state 3, no collation result is output.

【０２９０】The fourth character “フ” is processed in the same way as the third character, so the description is abbreviated. First, the matching operation is performed for state 0, the current state number registered in buffer 281. Since no transition is described in this case, the same check is performed for state 3, the next current state number. Since a transition to state 4 is described, state 4 is stored in buffer 280 as the next state number. Since no collation result is stored for state 4, no collation result is output.

【０２９１】The fifth character “エ” is processed in the same way as the fourth character, so the description is abbreviated. First, the matching operation is performed for state 0, the current state number registered in buffer 280. Since no transition is described in this case, the same check is performed for state 4, the next current state number. Since a transition to state 5 is described, state 5 is stored in buffer 281 as the next state number. Since no collation result is stored for state 5, no collation result is output.

【０２９２】Next, the sixth character “イ” is read into register 211. In this step the number of tokens increases from two to three, so it is described in detail.

【０２９３】Next state number 1 is output from state transition table 220, addressed by character code 302 and state number 0, and is stored in register 250. Comparator 252 compares the initial state number 0 stored in register 251 with the next state number 1 stored in register 250. Since they are not equal, selector 262 selects next state number 303. Since the keyword number for state 1 in the search result table is 0, collation result 306 is not output. Since multiplexer 260 has selected buffer 280, next state number 1 is stored in buffer 280 as the second next state number, following the start state number.

【０２９４】Next, the second next state number, 6, is read from buffer 281, and 7 is output as the next state number from state transition table 220, addressed by character code 302 and state number 6, and is stored in register 250. Comparator 252 compares the initial state number 0 stored in register 251 with the next state number 7 stored in register 250. Since they are not equal, selector 262 selects the next state number. Since the keyword number for state 7 in the search result table is 0, collation result 306 is not output. Since multiplexer 260 has selected buffer 280, next state number 7 is stored in buffer 280 as the third next state number. Since all current state numbers have been read from buffer 281, end signal 307 is generated.

【０２９５】As a result, multiplexer 260 selects buffer 281 and selector 261 selects buffer 280. That is, the three next state numbers in buffer 280 are used as the three current state numbers for the transitions on the next character.

【０２９６】Next, matching of the seventh character “ス” begins, and the character is read into register 211. Then 0 is output as the next state number from state transition table 220, addressed by character code 302 and state number 0, and is stored in register 250. Comparator 252 compares the initial state number 0 stored in register 251 with the next state number 0 stored in register 250. Since they are equal, selector 262 does not select next state number 303. The next state number is therefore not stored in buffer 281.

【０２９７】Next, the second next state number, 1, is read from buffer 280, and 0 is output as the next state number from state transition table 220, addressed by character code 302 and state number 1, and is stored in register 250. Comparator 252 compares the initial state number 0 stored in register 251 with the next state number 0 stored in register 250. Since they are equal, selector 262 does not select the next state number. The next state number is therefore not stored in buffer 281.

【０２９８】In this embodiment, token elimination is realized by controlling the operation in this way.

【０２９９】Further, the third next state number, 7, is read from buffer 280, and 8 is output as the next state number from state transition table 220, addressed by character code 302 and state number 7, and is stored in register 250. Comparator 252 compares the initial state number 0 stored in register 251 with the next state number 8 stored in register 250. Since they are not equal, selector 262 selects next state number 303.

【０３００】Since the keyword number for state 8 in search result table 260 is 1, keyword number 1 is output as collation result 306. Since multiplexer 260 has selected buffer 281, next state number 8 is stored in buffer 281 as the second next state number, following the start state number.

【０３０１】Since all current state numbers have been read from buffer 280, end signal 307 is generated.

【０３０２】As a result, multiplexer 260 selects buffer 280. That is, the two next state numbers are used as the two current state numbers for the transitions on the next character.

【０３０３】As described above, the search for the character string “インタフェイス” is realized by controlling tokens.
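
The walkthrough above can be approximated in software with two lists standing in for buffers 280 and 281. The Python sketch below is a simplified model under stated assumptions: `build_tables` and `search` are hypothetical names, and the automaton is a plain linear chain built from the keyword, so its state numbers do not reproduce those of FIG. 46 or the tables of FIGS. 54 and 55.

```python
INITIAL_STATE = 0


def build_tables(keyword, term_id):
    """Hypothetical linear automaton: state i means the first i characters have matched."""
    transitions = {(i, ch): i + 1 for i, ch in enumerate(keyword)}
    results = {len(keyword): term_id}        # a term ID is stored only at the end state
    return transitions, results


def search(text, transitions, results):
    """Return (hits, token_counts); hits are (position, term_id) pairs."""
    buffers = [[INITIAL_STATE], [INITIAL_STATE]]   # start state preset at the head of each buffer
    read, write = 0, 1
    hits, token_counts = [], []
    for pos, ch in enumerate(text):
        del buffers[write][1:]                     # keep only the preset start state
        for state in buffers[read]:
            nxt = transitions.get((state, ch), INITIAL_STATE)
            if nxt == INITIAL_STATE:               # no transition described: the token vanishes
                continue
            buffers[write].append(nxt)
            term = results.get(nxt, 0)             # 0 means no valid term at this state
            if term != 0:
                hits.append((pos, term))
        token_counts.append(len(buffers[write]))
        read, write = write, read                  # swap roles once the read buffer is exhausted
    return hits, token_counts


transitions, results = build_tables("インタフェイス", 1)
hits, counts = search("インタフェイス", transitions, results)
print(hits)    # [(6, 1)]
print(counts)  # [2, 2, 2, 2, 2, 3, 2]
```

The token count rises to three on the sixth character, where the second “イ” both advances the existing token and starts a new one from the start state, and the new token disappears again on “ス”.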

【０３０４】According to this embodiment, controlling multiple tokens in this way realizes a character string search that does not require the concept of “fail” transitions. This has the advantage that the automaton creation time is shortened and, because the number of states is reduced, the state transition table can be made compact.

【０３０５】Next, an embodiment of a magnetic disk device used to implement the present invention will be described with reference to FIGS. 56 to 70.

【０３０６】FIG. 57 shows the configuration of the collective magnetic disk device. It comprises n data storage devices 15, each having a magnetic disk device 1; input/output buffers 3, each connected to one of the data storage devices 15 and having a capacity of one cylinder of magnetic disk device 1; and a multi-disk controller 4 that controls the data storage devices 15 and the input/output buffers 3.

【０３０７】Here, each data storage device 15 consists of a single magnetic disk device 1, and each input/output buffer 3 consists of a single bank of memory with the capacity of one cylinder of magnetic disk device 1.

【０３０８】Multi-disk controller 4 comprises a communication memory 5, into which host device 7 can directly set the file ID of the file to be accessed; a multiplexer controller 8 that controls high-speed data bus 10; a physical information table 6, which is a conversion table for obtaining the physical storage location in the magnetic disk devices from a file ID; and a master controller 9 that controls these components.

【０３０９】Host device 7 comprises a host controller that issues commands to the collective magnetic disk device, and a character string search device that detects specified character strings in the input data and outputs the detection information.

【０３１０】Before a database consisting of data files is built on this collective magnetic disk device, the structure of the database is defined.

【０３１１】本集合型磁気ディスク装置では論理的に関
連するファイルを物理的格納位置が近接するように配置
する手段として、最初に物理シリンダを階層構造を持つ
論理分類ＩＤに従い割り振っている。複数件のファイル
を一度にアクセスする場合、理論的に関連するファイル
を対象にすることが多い。そこで、格納位置を近接させ
ることにより、磁気ディスク装置のシリンダ間を磁気ヘ
ッドが移動する距離を短くし、アクセス時間の一部であ
るシーク時間を短縮させる。In this collective magnetic disk device, physical cylinders are first allocated according to hierarchical logical classification IDs, so that logically related files are placed at physically adjacent storage locations. When several files are accessed at once, they are usually logically related files. Placing them close together shortens the distance the magnetic head travels between cylinders of the magnetic disk device and so reduces the seek time, which is one component of the access time.

【０３１２】階層構造を持つ論理類ＩＤに従って物理シ
リンダの割り振りは、上位機器７が論理分類ＩＤと該フ
ァイル分類が必要とする記憶容量の組が集まって構成さ
れるデータベース構造定義情報を通信メモリ５に格納し
た後、マルチディスクコントローラ４に対しデータベー
スの構造定義命令を発行する。構造定義命令を受けたマ
ルチディスクコントローラ４内のマスタコントローラ９
は、通信メモリ５にセットされたデータベースの構造定
義情報に基づいて、論理分類に物理位置がどう対応する
かをマスタコントローラ９内のメモリ上に図５８図で示
すような構造の構造定義テーブルを作成する。図５８は
２階層でそれぞれの階層で２つの分類を持つ例で、磁気
ディスク装置全体を一台の磁気ディスク装置としてまと
めて、各分類ごとの格納位置をシリンダの位置で、記憶
容量をシリンダ数で示したものである。Physical cylinders are allocated according to the hierarchical logical classification IDs as follows. The host device 7 stores in the communication memory 5 the database structure definition information, which consists of pairs of a logical classification ID and the storage capacity required by that file classification, and then issues a database structure definition command to the multi-disk controller 4. On receiving the command, the master controller 9 in the multi-disk controller 4 uses the structure definition information set in the communication memory 5 to create, in its own memory, a structure definition table of the form shown in FIG. 58, which records how physical positions correspond to the logical classifications. FIG. 58 is an example with two hierarchy levels and two classifications at each level. It treats the whole set of magnetic disk devices as a single magnetic disk device and gives the storage position of each classification as a cylinder position and its storage capacity as a number of cylinders.
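As an illustration only, the allocation of cylinders to logical classifications described above can be sketched in Python. The names `ClassExtent`, `build_structure_table`, and `initial_pointers`, and the cylinder counts used, are hypothetical and do not come from the specification; the sketch assumes that classifications are laid out contiguously in ID order so that sibling classifications end up on neighbouring cylinders.

```python
from dataclasses import dataclass


@dataclass
class ClassExtent:
    class_id: tuple       # hierarchical logical classification ID, e.g. (1, 2)
    first_cylinder: int   # first physical cylinder assigned to the classification
    cylinders: int        # capacity of the classification, in cylinders


def build_structure_table(definitions):
    """Build a structure definition table from (class_id, cylinders) pairs.

    Sorting by class_id keeps classifications that share a parent adjacent,
    which is what keeps logically related files physically close.
    """
    table = {}
    next_cylinder = 0
    for class_id, cylinders in sorted(definitions):
        table[class_id] = ClassExtent(class_id, next_cylinder, cylinders)
        next_cylinder += cylinders
    return table


def initial_pointers(table):
    """Storage position pointers right after structure definition:
    (cylinder, track, sector, offset in sector) at the start of each class."""
    return {cid: (ext.first_cylinder, 0, 0, 0) for cid, ext in table.items()}


# Two hierarchy levels with two classifications each (sizes are made up).
table = build_structure_table([((1, 1), 10), ((1, 2), 5), ((2, 1), 8), ((2, 2), 4)])
pointers = initial_pointers(table)
```

A real controller would also have to map these logical cylinder numbers onto individual disks; that mapping is left out here.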

【０３１３】また、データベースの構造定義処理では、
マルチディスクコントローラ４内のマスタコントローラ
９は論理分類毎に、書き込みファイルの格納先の物理位
置を保持するために、マスタコントローラ９内のメモリ
上に図５８Ａに示すような、書き込みファイルの格納先
の物理位置を差し示す格納位置ポインタテーブルを作成
する。構造定義が終了した時点では、格納位置ポインタ
テーブルは構造定義で設定した各論理分類の先頭シリン
ダ、先頭トラック、先頭セクタ、セクタ内先頭位置を示
すことになる。図５９では、図５８Ａで示した例の分類
でファイルを格納した場合の格納位置ポインタ情報を格
納している。In the database structure definition process, the master controller 9 in the multi-disk controller 4 also creates in its own memory a storage position pointer table, as shown in FIG. 58A, which holds for each logical classification the physical position at which the next file is to be written. Immediately after structure definition, the storage position pointer table points to the first cylinder, first track, first sector, and first position within the sector of each logical classification set in the structure definition. FIG. 59 shows the storage position pointer information after files have been stored under the classifications of the example shown in FIG. 58A.

【０３１４】次にデータベースの構築について説明す
る。本集合型磁気ディスク装置ではアクセスの対象とな
るファイルをファイルＩＤ（論理分類ＩＤと論理分類内
の個有の番号で構成）により指定する手段として、ファ
イルＩＤを用いた管理情報を作成している。Next, the construction of the database is described. So that a file to be accessed can be designated by its file ID (which consists of a logical classification ID and a number unique within that logical classification), this collective magnetic disk device creates management information keyed by file ID.

【０３１５】上位機器７は通信メモリ５に書込み対象と
なるファイルのファイルＩＤとファイルサイズの組が複
数件分集まって構成されるファイル情報を格納した後、
マルチディスクコントローラ４に対し書き込み命令を発
行する。書き込み命令を受けたマルチディスクコントロ
ーラ４は、図６１に示すフローで処理を実行する。マル
チディスクコントローラ４内のマスタコントローラ９
は、通信メモリ５からファイル情報の中のファイルＩＤ
を読み出し、該ファイルＩＤが示すファイルを格納する
格納位置を格納位置ポインタテーブルから読み出す。The host device 7 stores in the communication memory 5 file information consisting of file ID and file size pairs for the files to be written, and then issues a write command to the multi-disk controller 4. On receiving the write command, the multi-disk controller 4 executes the flow shown in FIG. 61. The master controller 9 in the multi-disk controller 4 reads a file ID from the file information in the communication memory 5 and reads from the storage position pointer table the position at which the file indicated by that file ID is to be stored.

【０３１６】格納位置が求まるとその物理シリンダに書
き込める残り容量が求まる。その残り容量よりもファイ
ル情報のファイルサイズで与えられるファイルのサイズ
が小さければ図６０Ａに示すようなファイルＩＤをエン
トリとする物理情報テーブル６にその格納位置（ディス
ク番号，シリンダ番号，トラック番号，セクタ番号，セ
クタ内位置）、ファイルサイズ、ディスクまたがり数を
書き込む。ディスクまたがり数は、ファイルが何台の磁
気ディスク装置１にまたがっているかを表わすもので、
処理対象となっているファイルが、１台の磁気ディスク
装置の１つのシリンダに書き切れなかった場合はファイ
ルを分割して書き残したファイルを次のディスクに書き
込むことになる。このファイル分割した書き残しファイ
ルであれば、この値をカウントアップする。物理情報テ
ーブル６のエントリはファイル情報で与えられるファイ
ルＩＤで示される。Once the storage position is known, the remaining writable capacity of that physical cylinder is also known. If the file size given in the file information is smaller than this remaining capacity, the storage position (disk number, cylinder number, track number, sector number, position within the sector), the file size, and the disk span count are written to the physical information table 6, which is indexed by file ID as shown in FIG. 60A. The disk span count indicates how many magnetic disk devices 1 the file extends over. If the file being processed does not fit in one cylinder of one magnetic disk device, it is split and the unwritten remainder is written to the next disk; for such a remainder file the count is incremented. The entry in the physical information table 6 is selected by the file ID given in the file information.

【０３１７】物理情報テーブルへの書込みの後、格納位
置ポインタをファイルサイズ分進める。After writing to the physical information table, the storage position pointer is advanced by the file size.

【０３１８】ファイルサイズと残り容量が等しい場合
は、１台の磁気ディスク装置１のシリンダがいっぱいに
なった時で、その磁気ディスク装置１への書き込み処理
を行なう。If the file size equals the remaining capacity, the cylinder of that magnetic disk device 1 is now full, and the write process to that magnetic disk device 1 is performed.

【０３１９】残り容量よりもファイルサイズが大きい場
合には、残り容量と分割基準サイズを比べる。分割基準
サイズは構造定義処理で設定する値で、シリンダの残り
容量が非常に小さいにもかかわらずファイルを磁気ディ
スク装置１の間にまたがるように格納すると、そのファ
イルを読み出すためには２台の磁気ディスク装置１を制
御しなければならず、その処理分オーバヘッドが大きく
なる。そこで、ある基準を設定してその基準値よりも残
り容量が小さい場合には次の磁気ディスク装置１のシリ
ンダの先頭から書き込むようにするものである。If the file size is larger than the remaining capacity, the remaining capacity is compared with the division reference size. The division reference size is a value set in the structure definition process. If a file were stored across two magnetic disk devices 1 even though very little capacity remained in the cylinder, two magnetic disk devices 1 would have to be controlled to read that file, and the overhead would grow accordingly. A reference value is therefore set, and when the remaining capacity is below it, the file is written from the start of the cylinder of the next magnetic disk device 1.

【０３２０】残り容量が分割基準サイズ以上の場合に
は、物理情報テーブル６に格納位置、ファイルサイズを
格納した後、残り容量に書き込める分のファイルと書き
残した分の書き残しファイルとに分割する。物理情報テ
ーブル６には格納物理位置とファイルサイズを書込む。If the remaining capacity is equal to or larger than the division reference size, the file is divided into a part that fits in the remaining capacity and an unwritten remainder file, and the physical storage position and file size of the part are written to the physical information table 6.

【０３２１】１シリンダがいっぱいとなる物理情報を作
成した磁気ディスク装置１は書き込み処理を行なう。書
き残しファイルはループを戻り、次の処理対象ファイル
となる。A magnetic disk device 1 for which physical information filling one cylinder has been created performs the write process. The remainder file goes back to the top of the loop and becomes the next file to be processed.

【０３２２】残り容量が分割基準サイズよりも小さい場
合には、格納位置ポインタテーブルを次のシリンダの先
頭に進めた後、処理対象ファイルをそのまま次の処理対
象ファイルとしてループを戻り処理を続ける。この時、
１シリンダがいっぱいとなる物理情報を作成した磁気デ
ィスク装置は書き込み処理を行なう。If the remaining capacity is smaller than the division reference size, the storage position pointer is advanced to the start of the next cylinder, and the loop is re-entered with the same file as the next file to be processed. At this point, any magnetic disk device for which physical information filling one cylinder has been created performs the write process.
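The placement rules of paragraphs [0316] to [0322] can be summarised in a short Python sketch. This is a simplified model under stated assumptions, not the controller's actual logic: the function name `place_files` is hypothetical, a "slot" stands for one cylinder on one disk, and advancing to the next slot stands for moving to the next magnetic disk device.

```python
def place_files(files, cylinder_size, division_reference_size):
    """Decide where each file goes.

    files: list of (file_id, size) in write order.
    Returns {file_id: [(slot, offset, length), ...]}; more than one tuple
    means the file was split, so its disk span count would be len(pieces).
    """
    if not 0 <= division_reference_size <= cylinder_size:
        raise ValueError("division reference size must lie within one cylinder")
    placement = {}
    slot, offset = 0, 0                      # storage position pointer
    for file_id, size in files:
        pieces = []
        left = size
        while left > 0:
            free = cylinder_size - offset    # remaining capacity of the cylinder
            if left <= free:
                # Fits (or exactly fills): record it and advance the pointer.
                pieces.append((slot, offset, left))
                offset += left
                left = 0
                if offset == cylinder_size:  # cylinder full: move on
                    slot, offset = slot + 1, 0
            elif free >= division_reference_size:
                # Enough room to be worth splitting: fill the cylinder and
                # carry the remainder to the next slot.
                pieces.append((slot, offset, free))
                left -= free
                slot, offset = slot + 1, 0
            else:
                # Too little room: skip to the start of the next cylinder.
                slot, offset = slot + 1, 0
        placement[file_id] = pieces
    return placement


layout = place_files([("a", 70), ("b", 40), ("c", 95), ("d", 10)],
                     cylinder_size=100, division_reference_size=50)
```

With a division reference size of 50, file "b" is not split across the 30 bytes left after "a" but starts on the next slot, whereas "c" is split because 60 bytes remain.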

【０３２３】書き込み処理は、マスタコントローラ９が
シーク命令を磁気ディスク装置１に発行し、シーク動作
を開始する。次に、上位機器７にファイルの転送要求を
発行し、マスタコントローラ９は上位機器７にファイル
の転送を要求するとともに、マルチプレクスコントロー
ラ８を制御してデータバスを切り換え、転送されてくる
ファイルを物理情報で指定する入出力バッファ３へのフ
ァイルの転送を行なう。シーク動作が終了し、ファイル
の転送が終了するとマスタコントローラ９は書込み命令
を磁気ディスク装置１に発行し、該磁気ディスク装置１
は書き込み動作を実行する。In the write process, the master controller 9 first issues a seek command to the magnetic disk device 1, which starts the seek operation. The master controller 9 then requests the host device 7 to transfer the file and, at the same time, controls the multiplex controller 8 to switch the data bus so that the incoming file is transferred to the input/output buffer 3 designated by the physical information. When both the seek operation and the file transfer have finished, the master controller 9 issues a write command to the magnetic disk device 1, and the magnetic disk device 1 executes the write operation.

【０３２４】上記の動作を繰返しデータベースの構築を
行なう。The above operation is repeated to construct a database.

【０３２５】図６２は書き込み処理の時間的な関係を示
すもので、上位機器７から図に示すように“１−１”，
“２−１”，…ｉ，“ｎ−１”，“１−２”，“２−
２”……と次々と転送されてくるデータは、マルチディ
スクコントローラ４内のマルチプレクスコントローラ８
により、入出力バッファ３−１，３−２……，３−ｎ、
３−１，３−２，……に格納される。このとき、例えば
磁気データベース装置１−１は、データ“１−１”の転
送を開始する直前にマスタコントローラ９の指令により
シークを開始している。データ“１−１”の転送が終了
した時点で、マスタコントローラ９は磁気ディスク装置
１−１に書き込み命令を発行する。磁気ディスク装置１
−１は指定の書き込み位置に達するまで回転待ちを行な
った後、入出力バッファ３−１のデータ“１−１”を所
定のシリンダ，トラック，セクタへ書き込み始める。FIG. 62 shows the timing of the write process. As shown in the figure, the data items "1-1", "2-1", ..., "n-1", "1-2", "2-2", ... transferred one after another from the host device 7 are stored by the multiplex controller 8 in the multi-disk controller 4 into the input/output buffers 3-1, 3-2, ..., 3-n, 3-1, 3-2, ... in turn. Here, for example, the magnetic disk device 1-1 has already started seeking, on a command from the master controller 9, just before the transfer of data "1-1" begins. When the transfer of data "1-1" finishes, the master controller 9 issues a write command to the magnetic disk device 1-1. The magnetic disk device 1-1 waits for the disk to rotate to the specified write position and then begins writing data "1-1" from the input/output buffer 3-1 to the designated cylinder, track, and sector.

【０３２６】この間、他の磁気ディスク装置も図に示す
ように同様の処理を行なうことになる。During this time, the other magnetic disk devices perform the same processing as shown in the figure.

【０３２７】図６２とこれに関する以上の説明から明ら
かなように、各磁気ディスク装置はそれぞれ並行して、
連続でファイルの書き込みができ、短時間でデータベー
スの構築ができる。As is clear from FIG. 62 and the description above, the magnetic disk devices can write files continuously and in parallel with one another, so the database can be built in a short time.

【０３２８】次に、ファイルの読み出し処理について説
明する。また、同一磁気ディスク装置の同一シリンダ上
に読み出すファイルが複数件有る場合に、読み出すファ
イルの間にある読み出し不要のファイルも入出力バッフ
ァに一旦読み出し、上位機器に転送する際に読み出し不
要のファイルを削除する手段について説明する。Next, the file read process is described, together with the following technique: when several files to be read lie on the same cylinder of the same magnetic disk device, the unneeded files lying between them are also read into the input/output buffer, and are then dropped when the data is transferred to the host device.

【０３２９】上位機器７は読み出すファイルのファイル
ＩＤが複数件分集まって構成するファイル情報を通信メ
モリ５に格納した後、マルチディスクコントローラ４に
対して読み出し命令を発行する。The host device 7 stores in the communication memory 5 file information consisting of the file IDs of the files to be read, and then issues a read command to the multi-disk controller 4.

【０３３０】読み出し命令を受けたマルチディスクコン
トローラ４は、図６３に示すフローで処理を実行する。The multi-disk controller 4 that has received the read command executes the processing according to the flow shown in FIG.

【０３３１】マルチディスクコントローラ４内のマスタ
コントローラ９は、通信メモリ５から最初に読み出すべ
きファイルのファイルＩＤを読み出し、該ファイルＩＤ
から該ファイルが格納されている物理情報を物理情報テ
ーブル６により検索する。このファイルを先ファイル、
物理情報を先ファイルの物理情報とする。次に、通信メ
モリ５から次に読み出すべきファイルのファイルＩＤを
読み出し、該ファイルＩＤから該ファイルが格納されて
いる物理情報を物理情報テーブル６により検索する。こ
のファイルを後ファイル、物理情報を後ファイルの物理
情報とする。The master controller 9 in the multi-disk controller 4 reads from the communication memory 5 the file ID of the first file to be read and looks up in the physical information table 6 the physical information on where that file is stored. This file is called the preceding file, and its physical information the physical information of the preceding file. The master controller 9 then reads from the communication memory 5 the file ID of the next file to be read and looks up its physical information in the physical information table 6 in the same way. This file is called the following file, and its physical information the physical information of the following file.

【０３３２】求めた物理情報から先ファイルと後ファイ
ルが同一シリンダに存在するかを調べ、同一シリンダに
存在すれば先ファイルと後ファイルの間に、指定してい
ない読み出し不要のファイル群があるか調べ、あれば、
そのファイル群の総サイズを求める。読み出し不要のフ
ァイルのサイズが小さい場合には、先ファイルと後ファ
イルを一度の読み出し命令で読出せるように、物理情報
を合成する。次に合成した物理情報を先ファイルの物理
情報としてループを戻り、通信メモリ５から次のファイ
ルＩＤを読み出し、そのファイルを後ファイルとして同
様な処理を行なう。From the physical information obtained, it is checked whether the preceding file and the following file are in the same cylinder. If they are, it is checked whether any unrequested files that need not be read lie between them, and if so their total size is obtained. If this total size is small, the physical information of the two files is merged so that the preceding file and the following file can be read with a single read command. The merged physical information then becomes the physical information of the preceding file, the loop is re-entered, the next file ID is read from the communication memory 5, and that file is processed in the same way as the following file.

【０３３３】先ファイルと後ファイルが同一シリンダに
存在しない場合と読み出し不要ファイルのサイズが大き
い場合には、先ファイルの磁気ディスク装置から読み出
し処理を実行する。後フィルの物理情報は先ファイルの
物理情報としてループを戻り、通信メモリ５から次のフ
ァイルＩＤを読出し、それを後ファイルとし同様な処理
を行なう。If the preceding file and the following file are not in the same cylinder, or if the total size of the unneeded files is large, the read process for the preceding file is executed on its magnetic disk device. The physical information of the following file then becomes the physical information of the preceding file, the loop is re-entered, the next file ID is read from the communication memory 5, and that file is processed in the same way as the following file.

【０３３４】このような動作を指定したファイルすべて
を読み出すまで繰り返す。The above operation is repeated until all the specified files are read.
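The merging loop of paragraphs [0331] to [0334] can be sketched in Python as follows. The sketch is illustrative only: `coalesce_reads` and the dictionary layout are hypothetical, physical information is reduced to (disk, cylinder, offset, size), and the size limit for the unneeded gap is passed in as `max_gap` rather than derived from the merge condition.

```python
def coalesce_reads(requests, max_gap):
    """Merge requests on the same disk and cylinder into single reads.

    requests: list of (disk, cylinder, offset, size) in the order requested.
    Returns a list of reads; each read covers one contiguous span and lists
    the (offset, size) ranges inside it that the host actually asked for.
    """
    reads = []
    for disk, cylinder, offset, size in requests:
        if reads:
            prev = reads[-1]                       # the "preceding file"
            same_cylinder = (prev["disk"], prev["cylinder"]) == (disk, cylinder)
            gap = offset - (prev["offset"] + prev["length"])  # unneeded bytes
            if same_cylinder and 0 <= gap <= max_gap:
                # Merge: one read command now covers both files and the gap.
                prev["length"] = offset + size - prev["offset"]
                prev["wanted"].append((offset, size))
                continue
        reads.append({"disk": disk, "cylinder": cylinder, "offset": offset,
                      "length": size, "wanted": [(offset, size)]})
    return reads


plan = coalesce_reads(
    [(0, 3, 0, 10), (0, 3, 14, 6), (0, 3, 60, 5), (1, 3, 0, 8)], max_gap=5)
```

In this example the first two files are merged into one 20-byte read with a 4-byte unneeded gap, while the third file (gap of 40 bytes) and the fourth (different disk) are read separately.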

【０３３５】先ファイルの磁気ディスク装置からの読み
出し処理は、まず、マスタコントローラ９は先ファイル
の物理情報が示す磁気ディスク装置１−ｉの磁気ディス
クコントローラ２−ｉに物理情報が示す物理位置へ磁気
ヘッドを移動させるシーク命令を発行し、磁気ディスク
装置１−ｉはシーク動作を開始する。シーク動作が終了
すると、入出力バッファ３−ｉがデータを書き込んでも
良い状態であれば、マスタコントローラ９は読み出し命
令を磁気ディスクコントローラ２−ｉに発行し、入出力
バッファ３−ｉに磁気ディスク装置１−ｉから読み出し
たファイルの格納を開始する。格納が終了すると、マス
タコントローラ９はマルチプレクスコントローラ８を制
御して入出力バッファ３−ｉから上位機器７へのデータ
の転送を開始する。To read the preceding file from the magnetic disk device, the master controller 9 first issues, to the magnetic disk controller 2-i of the magnetic disk device 1-i indicated by the physical information of the preceding file, a seek command that moves the magnetic head to the physical position given by that physical information, and the magnetic disk device 1-i starts the seek operation. When the seek operation has finished, and provided the input/output buffer 3-i is ready to accept data, the master controller 9 issues a read command to the magnetic disk controller 2-i, and the file read from the magnetic disk device 1-i begins to be stored in the input/output buffer 3-i. When storage is complete, the master controller 9 controls the multiplex controller 8 to start transferring the data from the input/output buffer 3-i to the host device 7.

【０３３６】マルチプレクスコントローラ８は図６４に
示すように、上位機器７のデータバスに入出力バッファ
３−１から３−ｉのデータバスを選択して接続するマル
チプレクサ２０１と選択したｉ番目の入出力バッファ３
−ｉから上位機器７にマスタコントローラ９の介在なし
にデータを出力するＤＭＡコントローラ２０２と該ＤＭ
Ａコントローラ２０２に入出力バッファ３−ｉの転送範
囲を指定するための先頭アドレスと終了アドレスを格納
する先頭アドレス登録テーブル２０３と終了アドレス登
録テーブル２０４により構成している。As shown in FIG. 64, the multiplex controller 8 comprises a multiplexer 201, which selects one of the data buses of the input/output buffers 3-1 to 3-i and connects it to the data bus of the host device 7; a DMA controller 202, which outputs data from the selected i-th input/output buffer 3-i to the host device 7 without intervention by the master controller 9; and a start address registration table 203 and an end address registration table 204, which hold the start and end addresses that specify to the DMA controller 202 the range of the input/output buffer 3-i to be transferred.

【０３３７】マスタコントローラ９は入出力バッファ３
−ｉの転送すべきファイルが存在する先頭アドレスを先
頭アドレス登録テーブル２０３に、終了アドレスを終了
アドレス登録テーブル２０４に設定した後他の入出力バ
ッファ３から上位機器７へのデータの転送が行なわれて
いなければＤＭＡコントローラ２０２に起動命令を発行
する。ＤＭＡコントローラ２０２は先頭アドレス登録テ
ーブル２０３と終了アドレス登録テーブル２０４を参照
しながら指定した範囲のデータのみ上位機器７の要求す
る転送速度でマスタコントローラ９の介在なしに転送を
行なう。The master controller 9 sets, for the file to be transferred, its start address in the input/output buffer 3-i in the start address registration table 203 and its end address in the end address registration table 204. Then, if no data is being transferred from another input/output buffer 3 to the host device 7, it issues a start command to the DMA controller 202. Referring to the start address registration table 203 and the end address registration table 204, the DMA controller 202 transfers only the data in the specified range, at the transfer speed requested by the host device 7 and without intervention by the master controller 9.

【０３３８】先ファイルと後ファイルを一度の読出し命
令で読み出せるように、物理情報を合成する処理を行な
い入出力バッファ３−ｉに読み出した場合には、先頭ア
ドレス登録テーブル２０３と終了アドレス登録テーブル
２０４に必要なファイルすべてが転送されるようにアド
レスを複数件分設定し、同様な処理を行なう。When the physical information has been merged so that the preceding file and the following file are read into the input/output buffer 3-i with a single read command, several address pairs are set in the start address registration table 203 and the end address registration table 204 so that all of the required files are transferred, and the same processing is performed.
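The effect of the two address tables can be shown with a small Python sketch. It is only a software analogy for the DMA controller 202: the function name `transfer_ranges` is hypothetical, and the end address is treated as exclusive, which the specification does not state.

```python
def transfer_ranges(buffer, start_addresses, end_addresses):
    """Return only the bytes that lie in the registered address ranges.

    buffer: contents of one input/output buffer after a merged read.
    start_addresses / end_addresses: parallel lists playing the role of the
    start and end address registration tables (end treated as exclusive).
    """
    out = bytearray()
    for start, end in zip(start_addresses, end_addresses):
        out += buffer[start:end]      # unneeded data between ranges is skipped
    return bytes(out)


# A merged read brought in two wanted files with two unneeded bytes between them.
sent = transfer_ranges(b"AAAAxxBBB", [0, 6], [4, 9])
```

Here `sent` holds the two wanted files back to back, which is what the host device would receive.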

【０３３９】先ファイルと後ファイルを一度の読出し命
令で読み出せるように、物理情報を合成する処理は次の
条件を満足する場合に行なう。The physical information is merged, so that the preceding file and the following file can be read with a single read command, when the following condition is satisfied.

【０３４０】先ファイルのサイズをｆ１［Ｂｙｔｅ］、
後ファイルのサイズをｆ２［Ｂｙｔｅ］、読み出し不要
のファイル群の総サイズをｋ［Ｂｙｔｅ］、磁気ディス
ク装置１から入出力バッファ３へのシーク動作を含まな
い実効的な転送速度をｔ［Ｂｙｔｅ／ｓｅｃ］、回転速
度をＲ［ｒｐｓ］、平均シーク時間をｓ［ｓｅｃ］とす
るとき、平均回転待ち時間は（１／２Ｒ）であり、一度
に読み出す時間が一つずつ読み出す時間よりも短くなる
条件は、（数１）の通りになる。Let the size of the preceding file be f1 [Byte], the size of the following file be f2 [Byte], the total size of the unneeded files be k [Byte], the effective transfer speed from the magnetic disk device 1 to the input/output buffer 3, excluding seek operations, be t [Byte/sec], the rotation speed be R [rps], and the average seek time be s [sec]. The average rotational wait is then 1/(2R), and the condition under which reading the files in one operation takes less time than reading them one at a time is given by (Equation 1).

【０３４１】[0341]

【数１】 (Equation 1)

【０３４２】この（数１）は、以下の（数２）のように
書き表すことができる。This (Equation 1) can be written as the following (Equation 2).

【０３４３】[0343]

【数２】 (Equation 2)
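The formulas for (Equation 1) and (Equation 2) are not reproduced in this text. As a tentative reconstruction from the definitions in paragraph [0340], and assuming that every separate read incurs one average seek s and one average rotational wait 1/(2R), the comparison between a single merged read and two separate reads would take the following form. This is an assumption, not the published formula.

```latex
% Merged read (left) versus two separate reads (right):
s + \frac{1}{2R} + \frac{f_1 + k + f_2}{t}
  \;<\;
\left(s + \frac{1}{2R} + \frac{f_1}{t}\right)
  + \left(s + \frac{1}{2R} + \frac{f_2}{t}\right)

% Cancelling the common terms leaves a bound on the unneeded data k:
k \;<\; t\left(s + \frac{1}{2R}\right)
```

Under this reading, merging pays off whenever transferring the unneeded files takes less time than one extra seek plus one average rotational wait.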

【０３４４】ファイルの読み出し処理の時間的な関係
は、上位機器７が要求する転送速度をＴ［Ｂｙｔｅ／ｓ
ｅｃ］、各磁気ディスク装置１の１シリンダ分の容量が
Ｍ［Ｂｙｔｅ］、各磁気ディスク装置１から入出力バッ
ファ３への転送速度をｔ［Ｂｙｔｅ／ｓｅｃ］、各磁気
ディスク装置１の最少シーク時間をｓ［ｓｅｃ］、回転
速度をＲ［ｒｐｓ］とすると、最少シーク時間ｓ［ｓｅ
ｃ］がｉ番目の入出力バッファ３−ｉ上のファイルを上
位機器７に転送する時間（Ｍ／Ｔ）より大きい場合に
は、図６５に示すようになる。上位機器７の要求する転
送速度を満足するには、ｉ台目の磁気ディスク装置１−
ｉが入出力バッファ３−ｉにファイルを読み出す時間
（ｓ＋１／Ｒ＋Ｍ／ｔ）が、全ての入出力バッファ３上
のファイルを上位機器７に転送する時間（ｎＭ／Ｔ）以
内であれば良いことになる。ここでは、連続したシリン
ダを読み出すためシーク時間を最少シーク時間とした。
また、磁気ディスク装置１に読み出し命令を発行した時
点の磁気ヘッドの位置がいかなる場合でも、上位機器７
の要求する転送速度を満足するように、回転待ちの時間
を最大値である（１／Ｒ）とした。この関係を数式で表
わすと（数３）の通りとなる。The timing of the file read process is as follows. Let the transfer speed requested by the host device 7 be T [Byte/sec], the capacity of one cylinder of each magnetic disk device 1 be M [Byte], the transfer speed from each magnetic disk device 1 to the input/output buffer 3 be t [Byte/sec], the minimum seek time of each magnetic disk device 1 be s [sec], and the rotation speed be R [rps]. When the minimum seek time s [sec] is longer than the time (M/T) needed to transfer the file in the i-th input/output buffer 3-i to the host device 7, the timing is as shown in FIG. 65. To meet the transfer speed requested by the host device 7, the time (s + 1/R + M/t) that the i-th magnetic disk device 1-i takes to read a file into the input/output buffer 3-i must not exceed the time (nM/T) needed to transfer the files in all the input/output buffers 3 to the host device 7. The minimum seek time is used here because consecutive cylinders are read. The rotational wait is taken as its maximum value (1/R), so that the requested transfer speed is met wherever the magnetic head happens to be when the read command is issued to the magnetic disk device 1. Expressed as a formula, this relationship is (Equation 3).

【０３４５】[0345]

【数３】 (Equation 3)

【０３４６】また、（数３）は（数４）で示すように書
き表わすことができる。(Equation 3) can be written as shown in (Equation 4).

【０３４７】[0347]

【数４】 (Equation 4)
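The formulas for (Equation 3) and (Equation 4) are not reproduced in this text. Paragraph [0344] states the condition in words, namely that the read time (s + 1/R + M/t) must be within the transfer time (nM/T), so the pair presumably reads as follows. The rearranged form is one possible way of solving for n and may differ from the layout of the published formula.

```latex
% (Equation 3), as stated in words in paragraph [0344]:
s + \frac{1}{R} + \frac{M}{t} \;\le\; \frac{nM}{T}

% (Equation 4), the same condition solved for n:
n \;\ge\; \frac{T}{M}\left(s + \frac{1}{R}\right) + \frac{T}{t}
```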

【０３４８】また、最少シーク時間ｓ［ｓｅｃ］がｉ番
目の入出力バッファ３−ｉ上のファイルを上位機器７に
転送する時間（Ｍ／Ｔ）以下の場合のファイルの読み出
し処理の時間的な関係は、図６６に示すようになる。こ
の場合は、シーク動作が終了しても入出力バッファ３−
ｉはファイルを上位機器７に転送中であるため、読み出
し命令をｉ台目の磁気ディスク装置１−ｉに発行するこ
とができない。そこで、入出力バッファ３−ｉのファイ
ルが上位機器７に転送が終了した時点に読み出し命令を
ｉ台目のディスク装置１−ｉに発行することになる。従
って、上位機器７の要求する転送速度を満足するには、
ｉ台目の磁気ディスク装置１−ｉが入出力バッファ３−
ｉにファイルを読み出す時間（Ｍ／Ｔ＋１／Ｒ＋Ｍ／
ｔ）が、全ての入出力バッファ３上のファイルを上位機
器７に転送する時間（ｎＭ／Ｔ）以内であれば良いこと
になる。この関係を数式で表わすと（数５）のようにな
る。When the minimum seek time s [sec] is no longer than the time (M/T) needed to transfer the file in the i-th input/output buffer 3-i to the host device 7, the timing of the file read process is as shown in FIG. 66. In this case, even after the seek operation has finished, the input/output buffer 3-i is still transferring its file to the host device 7, so the read command cannot yet be issued to the i-th magnetic disk device 1-i. The read command is therefore issued to the i-th magnetic disk device 1-i when the transfer of the file in the input/output buffer 3-i to the host device 7 has finished. Consequently, to meet the transfer speed requested by the host device 7, the time (M/T + 1/R + M/t) that the i-th magnetic disk device 1-i takes to read a file into the input/output buffer 3-i must not exceed the time (nM/T) needed to transfer the files in all the input/output buffers 3 to the host device 7. Expressed as a formula, this relationship is (Equation 5).

【０３４９】[0349]

【数５】 (Equation 5)

【０３５０】また、（数５）は（数６）示すように書き
表わすことができる。(Equation 5) can be written as shown in (Equation 6).

【０３５１】[0351]

【数６】 (Equation 6)
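The formulas for (Equation 5) and (Equation 6) are likewise not reproduced in this text. From the wording of paragraph [0348], in which the read time (M/T + 1/R + M/t) must be within (nM/T), they presumably take the following form. The rearrangement for n is an assumption about how the published formula is laid out.

```latex
% (Equation 5), as stated in words in paragraph [0348]:
\frac{M}{T} + \frac{1}{R} + \frac{M}{t} \;\le\; \frac{nM}{T}

% (Equation 6), the same condition solved for n:
n \;\ge\; 1 + \frac{T}{MR} + \frac{T}{t}
```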

【０３５２】これらの条件式より、上位機器７が要求す
る転送速度を満足するには磁気ディスク装置１を何台組
み合わせればよいかを求めることができ、（数１）を満
足する最少の台数の磁気ディスク装置１で集合型磁気デ
ィスク装置を構成すれば最もコストパフォーマンスの良
いものとなる。From these conditions, the number of magnetic disk devices 1 that must be combined to meet the transfer speed requested by the host device 7 can be determined. Building the collective magnetic disk device from the smallest number of magnetic disk devices 1 that satisfies (Equation 1) gives the best cost performance.

【０３５３】例えば、１トラックの容量が２０k（キ
ロ）［Ｂｙｔｅ］の６トラックからなる、１シリンダ分
の容量が１２０ｋ［Ｂｙｔｅ］の磁気ディスク装置１に
より構成し、上位機器７が要求する転送速度２Ｍ（メ
ガ）［Ｂｙｔｅ／ｓｅｃ］、各磁気ディスク装置１から
入出力バッファ３へのシーク動作を含まない実効的な転
送速度を１Ｍ［Ｂｙｔｅ／ｓｅｃ］、各磁気ディスク装
置１の最少シーク時間１０ｍ（ミリ）［ｓｅｃ］、回転
速度５０［ｒｐｓ］とすると、（数４）は（数７）およ
び（数８）のようになる。これらの式を満足する最少の
ｎは４となる。For example, suppose that each magnetic disk device 1 has 6 tracks of 20 k [Byte] each, so that one cylinder holds 120 k [Byte], that the host device 7 requests a transfer speed of 2 M [Byte/sec], that the effective transfer speed from each magnetic disk device 1 to the input/output buffer 3, excluding seek operations, is 1 M [Byte/sec], that the minimum seek time of each magnetic disk device 1 is 10 m [sec], and that the rotation speed is 50 [rps]. Then (Equation 4) becomes (Equation 7) and (Equation 8). The smallest n that satisfies these is 4.

【０３５４】[0354]

【数７】 (Equation 7)

【０３５５】[0355]

【数８】 (Equation 8)
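The numerical example can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `min_disks` is hypothetical, k and M are taken as 10^3 and 10^6, and the two timing cases of paragraphs [0344] and [0348] are combined by using whichever of s and M/T is larger as the delay before the read command can be issued.

```python
import math


def min_disks(M, T, t, s, R):
    """Smallest number of disks n that sustains host transfer speed T.

    M: bytes per cylinder, T: host transfer speed [Byte/sec],
    t: disk-to-buffer transfer speed [Byte/sec],
    s: minimum seek time [sec], R: rotation speed [rps].
    """
    buffer_drain = M / T                 # time to send one buffer to the host
    lead = max(s, buffer_drain)          # seek, or wait for the buffer to empty
    read_time = lead + 1.0 / R + M / t   # worst-case rotational wait is 1/R
    return math.ceil(read_time / buffer_drain)


# Values from the worked example: 120 kB cylinder, 2 MB/s host, 1 MB/s disk,
# 10 ms minimum seek, 50 rps.
n = min_disks(M=120e3, T=2e6, t=1e6, s=0.010, R=50)
print(n)  # 4
```

Because the 10 ms seek is shorter than the 60 ms needed to drain one buffer, the second timing case governs here, and the result agrees with the value of 4 stated in the text.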

【０３５６】図６７に３台の磁気ディスク装置１で構成
した集合型磁気ディスク装置の読み出し中の時間関係
で、図６８に４台の磁気ディスク装置１で構成した集合
型磁気ディスク装置の読み出し中の時間関係、図１９に
５台の磁気ディスク装置１で構成した集合型磁気ディス
ク装置の読み出し中の時間関係を示す。FIG. 67 shows the timing during reading for a collective magnetic disk device built from three magnetic disk devices 1, FIG. 68 for one built from four, and FIG. 69 for one built from five.

【０３５７】図６７の３台の磁気ディスク装置１で構成
した場合には、図からもわかるように磁気ディスク装置
１から入出力バッファ３にデータを読み出す時間が入出
力バッファ３から上位機器７への転送時間に間に合わ
ず、入出力バッファ３から上位機器７にデータの転送が
できない時間ａが発生し、入出力バッファ３から上位機
器７への転送速度が約１．６Ｍ［Ｂｙｔｅ／ｓｅｃ］と
なり上位機器が要求する転送速度を満足できない。With the three magnetic disk devices 1 of FIG. 67, as the figure shows, reading data from a magnetic disk device 1 into its input/output buffer 3 cannot keep up with the transfer from the input/output buffers 3 to the host device 7. Periods a arise during which no data can be transferred from the input/output buffers 3 to the host device 7, the transfer speed to the host device 7 falls to about 1.6 M [Byte/sec], and the transfer speed requested by the host device cannot be met.

【０３５８】また、図６９の５台の磁気ディスク装置１
で構成した場合には、上位機器７が要求する転送速度を
満足するものの、図６８の４台の磁気ディスク装置１で
構成した場合に比べ、１台の磁気ディスク装置１が処理
しない時間ｂが長く磁気ディスク装置の使用効率が悪
い。With the five magnetic disk devices 1 of FIG. 69, the transfer speed requested by the host device 7 is met, but the idle time b during which each magnetic disk device 1 does no processing is longer than with the four magnetic disk devices 1 of FIG. 68, so the magnetic disk devices are used less efficiently.

【０３５９】従って、数１を満足する最少のｎに一致す
る４台の磁気ディスク装置１で構成した場合が、最もコ
ストパフォーマンスの良い集合型磁気ディスク装置と言
える。Therefore, the configuration with four magnetic disk devices 1, which matches the smallest n satisfying Equation 1, can be said to be the collective magnetic disk device with the best cost performance.

【０３６０】本発明を文字列検索装置に適用したもう１
つの実施例について図５６を用いて説明する。Another embodiment in which the present invention is applied to a character string search device is described with reference to FIG. 56.

【０３６１】図５７で説明した集合型磁気ディスク装置
は、指定したファイルのみを読み出す場合、指定したフ
ァイルが磁気ディスク装置１−１から１−ｎに平均して
存在すれば、実施例１で述べたような動作を実施して、
上位機器７へのデータ転送速度を高めることができる。
しかし、１台の磁気ディスク装置１−ｉにだけ指定した
ファイルが存在する場合、１台の磁気ディスク装置１−
ｉの読み出しが連続して行われることになる。この場
合、上位機器７へのデータ転送は、一旦磁気ディスク装
置１−ｉから入出力バッファ３−ｉに読み出した後、入
出力バッファ３−ｉから上位機器７へ転送する２段読み
出しを行なわねばならないため、データ転送が低下して
しまうという状況が発生する。このように、指定したフ
ァイルが偏って磁気ディスク装置１に存在すると上位機
器７へのデータ転送速度を効果的に高めることができな
い状況が発生し得る。そこで、実施例２は、ファイルが
偏って格納されないようにすることで、常に全磁気ディ
スク装置１を読み出し動作させ、上位機器７へのデータ
転送速度を高めるものである。When only specified files are read from the collective magnetic disk device described with reference to FIG. 57, the data transfer speed to the host device 7 can be raised by the operation described in the first embodiment, provided the specified files are spread evenly over the magnetic disk devices 1-1 to 1-n. If, however, the specified files are all on one magnetic disk device 1-i, that one magnetic disk device 1-i is read repeatedly. Data must then reach the host device 7 by a two-stage read, first from the magnetic disk device 1-i into the input/output buffer 3-i and then from the input/output buffer 3-i to the host device 7, so the data transfer speed drops. Thus, when the specified files are unevenly distributed over the magnetic disk devices 1, the data transfer speed to the host device 7 may not be raised effectively. The second embodiment therefore prevents files from being stored unevenly, so that all the magnetic disk devices 1 are always reading, and the data transfer speed to the host device 7 is raised.

【０３６２】また、本実施例では記憶容量をさらに高め
るため、磁気ディスク装置の台数を増やしている。In this embodiment, the number of magnetic disk devices is increased in order to further increase the storage capacity.

【０３６３】図５６は本発明を用いた集合型磁気ディス
ク装置の構成を示すもので、図５７と相違点は磁気ディ
スク装置１の１シリンダ分と同じ容量の入出力バッファ
３を２面持ち、第１面の入出力バッファ３ａのデータを
上位機器７に転送している間に、第２面の入出力バッフ
ァ３ｂに磁気ディスク装置１からの読み出したフィルム
を格納することができることである。FIG. 56 shows the configuration of a collective magnetic disk device according to the present invention. It differs from FIG. 57 in that each input/output buffer 3 has two planes, each with the capacity of one cylinder of the magnetic disk device 1, so that while data in the first plane, input/output buffer 3a, is being transferred to the host device 7, a file read from the magnetic disk device 1 can be stored in the second plane, input/output buffer 3b.

【０３６４】また、一つのデータ記憶装置１５をｍ台の
磁気ディスク装置１−ｉ−１〜１−ｉ−ｍとマルチプレ
クサ１４によって構成し、集合型磁気ディスク装置の総
記憶容量を１台の磁気ディスク装置の記憶容量（ｎ×
ｍ）倍にしている。In addition, each data storage device 15 consists of m magnetic disk devices 1-i-1 to 1-i-m and a multiplexer 14, which makes the total storage capacity of the collective magnetic disk device (n × m) times the storage capacity of one magnetic disk device.

【０３６５】動作を説明すると、まず、図５７の構成と
同様にデータベースの構造定義処理を行なうが、入出力
バッファ３にマルチプレクサ１４を介して接続するｍ台
の磁気ディスク装置１を識別する情報を構造定義情報に
追加する。In operation, the database structure definition process is first performed as in the configuration of FIG. 57, except that information identifying the m magnetic disk devices 1 connected to each input/output buffer 3 through the multiplexer 14 is added to the structure definition information.

【０３６６】データベースの構築は図５７の構成と同様
に行なうが、いくつかの相違点がある。図５７の構成と
の相違点は、ファイル情報で与えられるファイルを構成
する磁気ディスク装置の台数分に分割して、全磁気ディ
スク装置に分散して格納することである。また、入出力
バッファ３のデータを格納物理情報で与えられるｍ台の
内の１台の磁気ディスク装置１−ｉ−ｊにマルチプレク
サ１４を制御して格納することである。The database is built as in the configuration of FIG. 57, with two differences. First, each file given in the file information is divided into as many parts as there are magnetic disk devices in the configuration, and the parts are distributed over all the magnetic disk devices. Second, the data in the input/output buffer 3 is stored, by controlling the multiplexer 14, on the one magnetic disk device 1-i-j, out of the m devices, that the storage physical information designates.

【０３６７】ファイルの分割方法としては、ファイルサ
イズを台数で割った分割サイズを求め、ファイルの先頭
から分割サイズごとに１台目の磁気ディスク装置１−１
−ｊから１−２−ｊ，１−３−ｊと順番に格納していく
ものと、ファイルの先頭から１バイトずつと言ったよう
に、決められたサイズごとに１台目の磁気ディスク装置
１−１−ｊから１−２−ｊ，１−３−ｊと順番に格納し
ていくものがある。A file can be divided in two ways. In one, the file size is divided by the number of devices to obtain a division size, and the file is stored from its beginning, one division-size piece at a time, on the first magnetic disk device 1-1-j, then on 1-2-j, 1-3-j, and so on in order. In the other, the file is stored from its beginning in pieces of a fixed size, for example one byte at a time, on the first magnetic disk device 1-1-j, then on 1-2-j, 1-3-j, and so on in order.

【０３６８】ファイルサイズが磁気ディスク装置の台数
で割り切れない場合は、ファイルサイズが磁気ディスク
の倍数となるように無効データを末尾に付加して、常に
１台目の磁気ディスク装置１−１−ｊにファイルの先頭
がくるように格納する。If the file size is not divisible by the number of magnetic disk devices, invalid data is appended to the end so that the file size becomes a multiple of the number of magnetic disk devices, and the file is stored so that its beginning always falls on the first magnetic disk device 1-1-j.
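The second division method, together with the padding rule, can be illustrated with a short Python sketch. The names `stripe` and `unstripe`, the `unit` parameter, and the zero byte used as invalid data are hypothetical choices; the first division method corresponds to choosing `unit` equal to the file size divided by the number of devices, rounded up.

```python
def stripe(data, n_disks, unit=1, pad=b"\x00"):
    """Distribute data round-robin over n_disks in pieces of 'unit' bytes.

    The tail is padded with invalid data so that every disk receives the same
    amount and the file always starts on the first disk.
    """
    width = n_disks * unit
    padded_len = -(-len(data) // width) * width   # round up to a full stripe
    padded = data + pad * (padded_len - len(data))
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, padded_len, unit):
        disks[(i // unit) % n_disks] += padded[i:i + unit]
    return [bytes(d) for d in disks]


def unstripe(disks, original_len, unit=1):
    """Reassemble the file and drop the padding."""
    out = bytearray()
    for i in range(0, len(disks[0]), unit):
        for d in disks:
            out += d[i:i + unit]
    return bytes(out[:original_len])


parts = stripe(b"ABCDEFG", n_disks=3)       # 7 bytes padded to 9
restored = unstripe(parts, original_len=7)
```

Because every file is spread over all devices, a read of any single file keeps all devices busy, which is the property the second embodiment relies on.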

【０３６９】Next, file reading is described. It is performed in the same manner as in the example of FIG. 57. In this configuration, however, each input/output buffer 3 has two banks (3a and 3b), so the read of the next file can begin as soon as the file read from each magnetic disk device 1 has been stored in the input/output buffer 3.

【０３７０】The timing of the file read processing is as shown in FIG. 70. Compared with the example of FIG. 57, the wait until the input/output buffer 3 becomes writable is eliminated, so faster transfer is possible. Under the same conditions as in the example of FIG. 57, the transfer rate required by the host device 7 is satisfied if the time (s + 1/R + M/t) needed to read a file from one magnetic disk device 1-i-j into one bank 3a-i of the two-bank input/output buffer 3-i is no longer than the time (nM/T) needed to transfer the files in all of the other banks 3b-1 to 3b-n to the host device 7. This is expressed by (Equation 9).

【０３７１】[0371]

【数９】 (Equation 9)
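The equation image is not reproduced in this text. Going only by the wording of paragraph 0370, the condition can be written as below. This is a reading of the prose, not a copy of the published formula, and the symbol meanings (s as seek time, 1/R as rotational wait, M as the amount of data per buffer, t as the disk transfer rate, T as the transfer rate required by the host device, n as the number of data storage devices) are assumptions.

```latex
s + \frac{1}{R} + \frac{M}{t} \;\le\; \frac{nM}{T}
```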

【０３７２】(Equation 9) can easily be rewritten as (Equation 10).

【０３７３】[0373]

【数１０】 (Equation 10)

【０３７４】From this condition, the number of data storage devices 15 needed to satisfy the transfer rate required by the host device can be determined, as in Embodiment 1.
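As an illustration of how a device count might be derived, the sketch below solves the inequality stated in paragraph 0370, s + 1/R + M/t ≤ nM/T, for the smallest integer n. The function name, the units (seconds, bytes, bytes per second, revolutions per second) and the sample figures are all assumptions, and the rearrangement is not necessarily the form of the published (Equation 10).

```python
import math


def min_storage_devices(s: float, R: float, M: float, t: float, T: float) -> int:
    """Smallest integer n satisfying s + 1/R + M/t <= n*M/T."""
    read_time = s + 1.0 / R + M / t      # time to fill one buffer bank from disk
    n = max(1, math.ceil(read_time * T / M))
    while n * M / T < read_time:         # guard against floating-point rounding
        n += 1
    return n


# Assumed figures: 20 ms seek, 60 rev/s, 1 MB per buffer,
# 2 MB/s disk rate, 10 MB/s required by the host device.
print(min_storage_devices(0.02, 60.0, 1e6, 2e6, 10e6))  # 6
```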

【０３７５】When a large storage capacity is required, each data storage device 15 can be built from m magnetic disk devices 1 and a multiplexer 14, multiplying the storage capacity by m.

【０３７６】A collective magnetic disk device built from the minimum number of magnetic disk devices 1 determined in this way offers the best cost performance.

【０３７７】In the embodiment of FIG. 70, it is clear that the seek operation of each magnetic disk device may instead be started when the data transfer from the input/output buffers 3-1 to 3-n to the host device has finished.

【０３７８】The two embodiments above use magnetic disk devices, but it is clear that the same applies to other storage devices with a rotating storage medium, such as optical disk devices.

【０３７９】Next, embodiments in which the document information retrieval apparatus of the present invention can be connected to external equipment through a LAN or the like are described with reference to FIGS. 71 to 74.

【０３８０】In the document information retrieval apparatus (called a search machine) 3000 of FIG. 71, the search machine control computer 1150 has a LAN connection control function that allows it to be connected to a communication line 1000 such as a LAN. The search machine 3000 consists of the search machine control computer 1150 and a search unit 3100. The search machine control computer 1150 executes the search machine control program 1100. Within this program, the search expression analysis program 1102, the synonym and variant-notation expansion program 1103a, the compound condition analysis program 1141a, and the search execution control program 1108, all described in the first embodiment of the present invention (FIG. 10), are executed.

【０３８１】The search unit 3100 consists of an automaton generation computer (CPU1) 1105a, a bit search computer (CPU3) 1107a, a string search engine 1106, a compound condition determination computer (CPU2) 1145a, a semiconductor memory device 1110a, a RAM disk device 1110b, and a collective magnetic disk device 1110c.

【０３８２】The automaton generation computer (CPU1) 1105a executes the automaton generation program 1105, the bit search computer (CPU3) 1107a executes the bit search program 1107, and the compound condition determination computer (CPU2) 1145a executes the compound condition determination program 1145.

【０３８３】The collective magnetic disk device 1110c consists of a collective magnetic disk control device 1110d and magnetic disk devices 1110e1 to 1110e12. The control device 1110d in turn consists of a multi-disk controller 1110f and synchronization control buffers 1110g1 to 1110g12. The synchronization control buffers 1110g1 to 1110g12 align the outputs of the independently operating magnetic disk devices 1110e1 to 1110e12 and send them to the string search engine 1106 in synchronization. For example, suppose the text data in magnetic disk device 1110e1 must be sent before the text data in magnetic disk device 1110e2. Even if the data in 1110e2 happens to be read first, it is held in synchronization control buffer 1110g2. Once the text data in 1110e1 has been read into synchronization control buffer 1110g1 and passed from there through the multi-disk controller 1110f to the string search engine 1106, the data in buffer 1110g2 is read out to the string search engine 1106. The data thus reaches the string search engine 1106 in the original order. In this way, the synchronization control buffers 1110g1 to 1110g12 make it possible to send data out in the correct predetermined order, however the read order of the magnetic disk devices 1110e1 to 1110e12 varies with the seek time and rotational latency of each device. The multi-disk controller 1110f, under the control of the search execution control program 1108 in the search machine control program 1100, acts as a multiplexer that selects among the outputs of the synchronization control buffers 1110g1 to 1110g12.
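The reordering role of the synchronization control buffers in paragraph 0383 can be sketched as follows. This is a software analogy of what the text describes as hardware, under stated assumptions: the name `emit_in_order` is hypothetical, disks are indexed from 0, and each disk contributes exactly one block.

```python
def emit_in_order(arrivals):
    """Yield blocks in disk order 0, 1, 2, ... whatever order they arrive in.

    arrivals: iterable of (disk_index, data) pairs in arrival order.
    """
    held = {}        # plays the part of the per-disk synchronization buffers
    next_disk = 0    # index of the disk whose data must be sent next
    for disk, data in arrivals:
        held[disk] = data
        # Release every block that is now next in the predetermined order.
        while next_disk in held:
            yield held.pop(next_disk)
            next_disk += 1


# Disk 1 finishes before disk 0; its block is held until disk 0 has been sent.
print(list(emit_in_order([(1, "B"), (0, "A"), (2, "C")])))  # ['A', 'B', 'C']
```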

【０３８４】The RAM disk device 1110b consists of a RAM disk controller 4200 and semiconductor memory boards 4100a, 4100b, 4100c, and 4100d. The semiconductor memory boards 4100a, 4100b, 4100c, and 4100d are connected to the same bus, and the condensed text stored in them is accessed randomly under the control of the RAM disk controller 4200. The data read out is sent to the string search engine 1106.

【０３８５】In response to a search command sent via the LAN 1000, the search machine control program 1150 first performs search condition analysis, synonym expansion, variant-notation expansion, compound condition analysis, proximity condition analysis, context condition analysis, and logical condition analysis, and passes the resulting control information to the bit search program 1107, the automaton generation program 1105, and the compound condition determination program 1145. The state transition table data for keyword matching created by the automaton generation program 1105 is written to the string search engine 1106.

【０３８６】Second, once the search control information has been set, the search is executed. First, the character component table in the semiconductor memory 1110a is read and the bit search program 1107 performs the character component table search. The result is stored in the main memory of the search machine control computer 1150. Next, the condensed text search is executed on the set of documents narrowed down by the character component table search. That is, the identifiers of the matching documents, stored in the main memory of the search machine control computer 1150 as the result of the character component table search, are read out, the corresponding condensed text is read from the RAM disk device 1110b, and the string search engine 1106 matches it against the designated keywords. The matching information from the string search engine 1106 is passed to the compound condition determination program 1145, which determines whether the designated compound condition is satisfied. The condensed text search result is likewise stored in the main memory of the search machine control computer 1150. After this, if a proximity condition or context condition has been set in the compound condition, the text search is entered. In the text search, the identifiers of the matching documents, stored in the main memory of the search machine control computer 1150 as the result of the condensed text search, are read out, the corresponding text is read from the collective magnetic disk device 1110c, and the string search engine 1106 matches it against the designated keywords. The matching information from the string search engine 1106 is passed to the compound condition determination program 1145, which determines whether the designated compound condition, namely the proximity condition and the context condition, is satisfied. The text search result is likewise stored in the main memory of the search machine control computer 1150.
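The three stages of paragraph 0386 (character component table search, condensed text search, then text search only when a proximity condition is set) can be sketched in software as below. Everything here is an assumption made for illustration: the name `staged_search`, the use of whitespace-separated words, Python sets standing in for the table and the condensed text, and a word-distance `window` standing in for the proximity condition. The described apparatus performs these stages with dedicated processors and a string search engine.

```python
from typing import Dict, List, Optional


def staged_search(docs: Dict[int, str], keyword: str,
                  near: Optional[str] = None, window: int = 3) -> List[int]:
    """Return ids of documents that pass each stage in turn."""
    # Registration-time data, built here only to keep the sketch self-contained.
    char_table = {d: set(text) for d, text in docs.items()}
    condensed = {d: set(text.split()) for d, text in docs.items()}
    wanted = [keyword] + ([near] if near else [])

    # Stage 1: keep documents containing every character of every keyword.
    hits = [d for d in docs if all(set(w) <= char_table[d] for w in wanted)]
    # Stage 2: keep documents whose condensed text contains every keyword.
    hits = [d for d in hits if all(w in condensed[d] for w in wanted)]
    if near is None:
        return hits  # no proximity condition, so the full text is never read

    # Stage 3: scan the full text of the survivors for the proximity condition.
    result = []
    for d in hits:
        words = docs[d].split()
        pos_k = [i for i, w in enumerate(words) if w == keyword]
        pos_n = [i for i, w in enumerate(words) if w == near]
        if any(abs(i - j) <= window for i in pos_k for j in pos_n):
            result.append(d)
    return result


docs = {
    1: "full text search on a magnetic disk",
    2: "a disk stores text and a later step runs the search",
    3: "optical media",
}
print(staged_search(docs, "search"))                         # [1, 2]
print(staged_search(docs, "search", near="text", window=2))  # [1]
```

Document 3 is rejected at stage 1 because it lacks some characters of the keyword, so neither its condensed text nor its full text is consulted.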

【０３８７】All of these search means are controlled by the search execution control program 1108 in the search machine control program 1100.

【０３８８】Third, when the series of search processes described above has finished, the search results written to the main memory of the search machine control computer 1150 are returned via the LAN 1000, under the control of the search machine control program 1100, to the requesting search dialogue terminal, such as a workstation.

【０３８９】According to the modification of the present invention described above, the document information retrieval apparatus acts as a server on the LAN, so that searches can be made from multiple search dialogue terminals connected to the same LAN and the apparatus can be used effectively as a shared resource.

【０３９０】Next, yet another embodiment of the present invention is described with reference to FIG. 72.

【０３９１】The document information retrieval apparatus (called a search machine) 3000 of this embodiment has a LAN connection control adapter 2100 so that it can be connected to a communication line 1000 such as a LAN. In addition to the LAN connection control adapter 2100, the search machine 3000 consists of a search machine control computer 2200 and a plurality of search units 3001, 3002, and so on.

【０３９２】The search machine control computer 1150 executes the search machine control program 1100. This program consists of the search expression analysis program 1102, the synonym and variant-notation expansion program 1103a, the compound condition analysis program 1141a, and the search execution control program 1108, all described in the first embodiment of the present invention (FIG. 10).

【０３９３】The search unit 3001 consists of an automaton generation computer (CPU1) 1105a, a bit search computer (CPU3) 1107a, a string search engine 1106, a compound condition determination computer (CPU2) 1145a, a semiconductor memory device 1110a, a RAM disk device 1110b, a collective magnetic disk device 1110c, a search result storage memory 1146, and selectors 3610 and 3620.

【０３９４】The automaton generation computer (CPU1) 1105a executes the automaton generation program 1105, the bit search computer (CPU3) 1107a executes the bit search program 1107, and the compound condition determination computer (CPU2) 1145a executes the compound condition determination program 1145.

【０３９５】The collective magnetic disk device 1110c consists of a collective magnetic disk control device 1110d and magnetic disk devices 1110e1 to 1110e1n. The text data, condensed text data, character component tables, bibliographic items, and the like are stored in distributed form on the magnetic disk devices 1110e1 to 1110e1n. During the text search, the text data is read from here into the string search engine 1106.

【０３９６】When the system starts up, the character component table is loaded from the magnetic disk devices 1110e1 to 1110e1n into the semiconductor memory device 1110a. During a search it is accessed by the bit search program 1105 as the target data of the character component table search.

【０３９７】Similarly, when the system starts up, the condensed text is loaded from the magnetic disk devices 1110e1 to 1110e1n into the RAM disk device 1110b. During a search it is accessed by the string search engine 1106 as the target data of the condensed text search.

【０３９８】During the condensed text search and the text search, the string search engine 1106 reads the condensed text from the RAM disk device 1110b and the text from the collective magnetic disk device 1110c, respectively, and searches for and matches the designated keywords. The selector 3610 switches the input to the string search engine 1106 between the RAM disk device 1110b and the collective magnetic disk device 1110c. The selector 3620 switches the input to the search result storage memory 1146 between writing the character component table search result and writing the condensed text search and text search results.

【０３９９】In response to a search command sent via the LAN 1000, the search machine control program 1150 first performs search condition analysis, synonym expansion, variant-notation expansion, compound condition analysis, proximity condition analysis, context condition analysis, and logical condition analysis, and broadcasts the resulting control information to the bit search program 1107, the automaton generation program 1105, and the compound condition determination program 1145 of each search unit 3001, 3002, and so on. In each search unit, the state transition table data for keyword matching created by the automaton generation program 1105 is written to the string search engine 1106.

【０４００】Second, once the search control information has been set, the search is executed. Here, the search execution control program 1108 on the search machine control computer 1150 broadcasts search start information to each search unit 3001, 3002, and so on.

【０４０１】In each search unit 3001, 3002, and so on, the character component table in the semiconductor memory 1110a is first read and the bit search program 1107 performs the character component table search. The result is written to the search result storage memory 1146. At this time, the selector 3620 has been switched by the search execution control program 1108 to select writing from the bit search program 1107.

【０４０２】Next, the condensed text search is executed on the set of documents narrowed down by the character component table search. That is, for the identifiers of the matching documents stored in the search result storage memory 1146 as the result of the character component table search, the search execution control program 1108 refers to its built-in condensed text storage information and sets the storage area information of the corresponding condensed text on the RAM disk device 1110b in the RAM disk controller 7200 (FIG. 22). The corresponding condensed text is then read from the RAM disk device 1110b, and the string search engine 1106 matches it against the designated keywords. The matching information from the string search engine 1106 is passed to the compound condition determination program 1145, which determines whether the designated compound condition is satisfied. The condensed text search result is likewise written to the search result storage memory 1146. Naturally, the selector 3620 has been switched by the search execution control program 1108 to select writing from the compound condition determination program 1145.

【０４０３】After this, if a proximity condition or context condition has been set in the compound condition, the text search is entered. In the text search, for the identifiers of the matching documents stored in the search result storage memory 1146 as the result of the condensed text search, the search execution control program 1108 refers to its built-in text storage information and sets the storage information of the corresponding text in the collective magnetic disk control device 1110d (FIG. 20). The corresponding text is then read from the collective magnetic disk device 1110c, and the string search engine 1106 matches it against the designated keywords. The matching information from the string search engine 1106 is passed to the compound condition determination program 1145, which determines whether the designated compound condition, namely the proximity condition and the context condition, is satisfied. The text search result is written to the search result storage memory 1146. At this time, the search execution control program 1108 has switched the selector 3610 to select reading from the collective magnetic disk device 1110c and the selector 3620 to select writing from the compound condition determination program 1145.

【０４０４】All of the search procedure control described above is performed by the search execution control program 1108 broadcasting to the search units 3001, 3002, and so on.

【０４０５】Third, when the series of search processes described above has finished in all of the search units 3001, 3002, and so on, the search execution control program 1108 collects the search results written to each search result storage memory 1146 from all of the units, merges them, and returns them to the requester via the LAN 1000 under the control of the search machine control program 1100.
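The broadcast, collect, and merge pattern of paragraphs 0400 to 0405 can be sketched as follows. This is a loose software analogy under stated assumptions: `search_unit` and `broadcast_search` are hypothetical names, each search unit is modelled as a dictionary of document id to text, a thread pool stands in for units working in parallel, and a plain substring test stands in for the unit's internal staged search.

```python
from concurrent.futures import ThreadPoolExecutor


def search_unit(shard, keyword):
    """One search unit: return the ids of its own documents that match."""
    return [doc_id for doc_id, text in shard.items() if keyword in text]


def broadcast_search(shards, keyword):
    """Send the same request to every unit, then collect and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial = list(pool.map(lambda shard: search_unit(shard, keyword), shards))
    return sorted(doc_id for hits in partial for doc_id in hits)


shards = [
    {1: "full text search", 2: "image data"},
    {3: "text body", 4: "search unit"},
]
print(broadcast_search(shards, "text"))  # [1, 3]
```

Because every unit receives the same request and holds a disjoint part of the data, adding units shortens the scan each one performs without changing the controlling logic.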

【０４０６】A magnetic disk device 2400 is attached to the search machine control computer 1150 and stores the configuration information of the search machine 3000, such as the number of search units. When search requests arrive from multiple users via the LAN, the magnetic disk device 2400 is also used, depending on the operating status of the search units, to hold those requests temporarily or to hold temporarily the search result information to be returned.

【０４０７】The console 2300 attached to the search machine control computer 1150 displays the operating status of the search machine as appropriate and can also be used to issue maintenance instructions.

【０４０８】As the configuration of this embodiment shows, the present invention makes it very easy to configure an information retrieval apparatus by choosing the number of magnetic disk devices in the collective magnetic disk device and the number of search units according to the user's required specifications, such as the required text data storage capacity and the search time.

【０４０９】FIG. 73 shows yet another embodiment of the present invention. In the embodiments described so far, the collective magnetic disk device 1110c, the bit search computer 1107a or string search engine 1106, the compound condition determination computer 1145a, and the search result storage memory 1146 are connected in cascade and operated as a pipeline to increase processing speed. In this embodiment, by contrast, these parts are connected by a bus 8000, which simplifies the hardware configuration and thus keeps the scale of the apparatus down. In addition, one control computer 1150a is assigned to the collective magnetic disk device 1110c, the RAM disk device 1110b, and the semiconductor memory device 1110a, and one control computer 1150b to the bit search computer 1107a, the string search engine 1106, and the compound condition determination computer 1145a. This lightens the load on the search machine control computer 1150 and distributes the load overall, reducing the overhead of search processing.

【０４１０】Finally, an embodiment in which the document information retrieval apparatus of the present invention is used in a network system is described with reference to FIG. 74.

【０４１１】In FIG. 74, 1000 is a network such as a LAN, to which the search machine 3000 is connected by communication control means 2100 via the search machine control workstation 2200.

【０４１２】5200 is an image server that centrally controls optical disk devices 5510, 5520, ..., 5530. The image server 5200 is likewise connected to the network 1000 by communication control means 5100. 5400 is a magnetic disk device that stores management information for managing the location of the image data corresponding to documents on the optical disk devices 5510, 5520, ..., 5530.

【０４１３】1200 is a search dialogue workstation capable of displaying data, and it too is connected to the network 1000 by communication control means 1100. An image printer 1400, an image scanner 1500, a magnetic disk device 1600, and an optical disk device 1700 are also connected to this workstation.

【０４１４】6200 is also a search dialogue workstation, connected to the same network by communication control means 6100. It is a workstation dedicated to searching and reading. From the workstation 1200, image data corresponding to a document can be requested from the image server 5200 according to the search results from the search machine 3000, received over the network, and displayed on the console 1300, so that the document can be read together with its drawings. A hard copy of the image data can be made on the image printer 1400. The image data can also be edited on the workstation and stored in the optical disk device 1700 as a private file for personal use. Images input from the image scanner 1500 can be used in this editing.

【０４１５】Accordingly, it is also possible to leave the public data stored in the database of the search machine or image server unmodified, store only the edited portions in the magnetic disk device 1600 and the optical disk device 1700, and hold the corresponding mapping information in the magnetic disk device 1600.

【０４１６】[0416]

【発明の効果】[Effects of the Invention] As a way of accelerating scan-type full-text search, data can be registered in a form that makes a pre-search possible. The pre-search reduces the number of documents whose text body, stored in data storage means such as a magnetic disk, has to be consulted. This cuts the amount of text search processing, which accounts for a large share of the search processing time, and so shortens the overall search processing time.

[0417] Searching the search data files speeds up the search. Specifically, the character component table makes it possible to extract only those documents that contain all of the specified characters, which narrows the number of documents subject to subsequent search to the necessary minimum and shortens the overall search processing time. Furthermore, scanning the condensed text data makes it possible to extract only those documents in which the specified keyword is written as a word, which narrows the number of documents subject to the subsequent body-text search to the necessary minimum and further shortens the overall search processing time.
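The two search data files can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the specification: the function names are hypothetical, the character component table is modelled as a bitmask over an assumed fixed alphabet, and words are split with a simple regular expression, whereas Japanese text would need proper word segmentation.

```python
import re


def build_character_component_table(text, alphabet):
    """Return a bitmask in which bit i is set if alphabet[i] occurs in text."""
    mask = 0
    for i, ch in enumerate(alphabet):
        if ch in text:
            mask |= 1 << i
    return mask


def build_condensed_text(text):
    """Keep each distinct word once, in order of first appearance."""
    seen = []
    for word in re.findall(r"\w+", text):
        if word not in seen:
            seen.append(word)
    return " ".join(seen)


def may_contain(keyword, table, alphabet):
    """False only when the table proves that a keyword character is absent."""
    for ch in keyword:
        # Characters outside the alphabet are not tracked, so they cannot exclude.
        if ch in alphabet and not (table >> alphabet.index(ch)) & 1:
            return False
    return True


ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # assumed set of predetermined characters
doc = "the cat sat on the mat near the cat"
table = build_character_component_table(doc, ALPHABET)
condensed = build_condensed_text(doc)     # "the cat sat on mat near"
```

With this document, may_contain("dog", table, ALPHABET) is False because "d" never occurs, so the body need not be read. The keyword "each" passes the character check, since all of its letters occur somewhere in the document, but it is absent from the condensed text, which is why the second file narrows the candidates further.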

[0418] Consequently, data is read out and scanned only for the documents narrowed down by the pre-search, and the final body-text search under the compound condition is performed on them alone, so that, in effect, a very high-speed full-text search is achieved.
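The staged narrowing can be sketched end to end as follows. The sketch makes several assumptions that are not taken from the specification: a Python set of characters stands in for the character component table, the compound condition is a hypothetical proximity test (two keywords within max_gap words of each other), and the function names are illustrative.

```python
import re


def register(doc_id, text, store):
    """Register the body together with its two pre-search structures."""
    store[doc_id] = {
        "body": text,
        "chars": set(text),  # stand-in for the character component table
        "condensed": list(dict.fromkeys(re.findall(r"\w+", text))),
    }


def within(body, a, b, max_gap):
    """Compound condition: words a and b occur at most max_gap words apart."""
    words = re.findall(r"\w+", body)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def search(a, b, max_gap, store):
    """Return (matching doc ids, number of bodies actually scanned)."""
    # Stage 1: drop documents that lack any character of either keyword.
    stage1 = [d for d, r in store.items() if set(a + b) <= r["chars"]]
    # Stage 2: keep documents whose condensed text holds both keywords as words.
    stage2 = [d for d in stage1
              if a in store[d]["condensed"] and b in store[d]["condensed"]]
    # Stage 3: scan the bodies of the survivors under the compound condition.
    hits = [d for d in stage2 if within(store[d]["body"], a, b, max_gap)]
    return hits, len(stage2)


store = {}
register(1, "full text search scans every document body", store)
register(2, "a search engine reads text from disk and later builds a full index", store)
register(3, "magnetic disk devices store images", store)
register(4, "flux textures", store)
```

For the query search("full", "text", 1, store), document 3 is dropped at stage 1, document 4 at stage 2, and only the bodies of documents 1 and 2 are scanned, of which document 1 satisfies the proximity condition.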

[Brief description of the drawings]

FIG. 1 is a block diagram showing a conventional search system.

FIG. 2 is an explanatory diagram (part 1) showing the principle of character string search using a conventional finite automaton.

FIG. 3 is an explanatory diagram (part 2) showing the principle of character string search using a conventional finite automaton.

FIG. 4 is an explanatory diagram (part 1) of a fail table corresponding to the conventional example.

FIG. 5 is an explanatory diagram (part 3) showing the principle of character string search using a conventional finite automaton.

FIG. 6 is an explanatory diagram (part 2) of a fail table corresponding to the conventional example.

FIG. 7 is an explanatory diagram (part 4) showing the principle of character string search using a conventional finite automaton.

FIG. 8 is an explanatory diagram (part 3) of a fail table corresponding to the conventional example.

FIG. 9 is a block diagram showing a conventional configuration for different-notation expansion.

FIG. 10 is a block diagram showing an outline of a first embodiment of the present invention.

FIG. 11 is a diagram showing an example of collation position information.

FIG. 12 is a diagram showing an example of search engine output information with collation positions attached.

FIG. 13 is a diagram showing details of the compound condition determination unit.

FIG. 14 is a diagram showing a search example using two keywords.

FIG. 15 is a configuration diagram showing means for accelerating full-text search.

FIG. 16 is a diagram showing the procedure of text registration processing.

FIG. 17 is a diagram showing means for performing search processing using the character component table generated and registered by the registration procedure shown in FIG. 16.

FIG. 18 is a diagram showing the structure of the character component table and a specific example of a search using it.

FIG. 19 is a diagram showing creation of the condensed text.

FIG. 20 is a PAD diagram (part 1) showing the procedure of character component table search.

FIG. 21 is a PAD diagram (part 2) showing the procedure of character component table search.

FIG. 22 is a PAD diagram (part 3) showing the procedure of character component table search.

FIG. 23 is a PAD diagram (part 4) showing the procedure of character component table search.

FIG. 24 is a PAD diagram (part 5) showing the procedure of character component table search.

FIG. 25 is a configuration diagram showing a modification of the embodiment shown in FIG. 10.

FIG. 26 is a block diagram of an embodiment that performs synonym and different-notation expansion processing.

FIG. 27 is a diagram outlining the processing in the embodiment shown in FIG. 26.

FIG. 28 is a configuration block diagram of one embodiment of the present invention.

FIG. 29 is a diagram illustrating the course of different-notation expansion processing.

FIG. 30 is a block diagram of the different-notation expansion means.

FIG. 31 is a diagram explaining, using a katakana character string, how conversion rules are applied in the different-notation expansion processing unit.

FIG. 32 is a PAD diagram showing different-notation expansion processing.

FIG. 33 is a diagram explaining an embodiment in which the heading character string search is performed using an automaton.

FIG. 34 is a diagram of the state transition table of the automaton.

FIG. 35 is a diagram of the output table of the automaton.

FIG. 36 is a PAD diagram showing how the state transition table and the output table of the search automaton are created.

FIG. 37 is a diagram showing a katakana different-notation conversion rule table.

FIG. 38 is a diagram showing a different-notation conversion rule table for new and old kanji character forms.

FIG. 39 is a diagram showing a different-notation conversion rule table for the okurigana of kanji.

FIG. 40 is a diagram showing an example of a correspondence table between Roman letters and katakana.

FIG. 41 is a block diagram of a configuration in which the expansion mode of the different-notation expansion means can be set.

FIG. 42 is a diagram showing the output control states of each conversion unit, expansion unit, and switch in different-notation expansion.

FIG. 43 is a diagram showing a synonym dictionary.

FIG. 44 is a diagram outlining the search for heading character strings of the synonym dictionary using an index table.

FIG. 45 is a configuration block diagram of a character search device using a finite automaton according to an embodiment of the present invention.

FIG. 46 is an explanatory diagram (part 1) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 47 is an explanatory diagram (part 2) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 48 is an explanatory diagram (part 3) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 49 is an explanatory diagram (part 4) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 50 is an explanatory diagram (part 5) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 51 is an explanatory diagram (part 6) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 52 is an explanatory diagram (part 7) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 53 is an explanatory diagram (part 8) showing the principle of the character string search method using a finite automaton according to an embodiment of the present invention.

FIG. 54 is an explanatory diagram of the state transition table according to an embodiment of the present invention.

FIG. 55 is an explanatory diagram of the search result table.

FIG. 56 is a diagram showing a configuration example of a collective magnetic disk device according to an embodiment of the present invention.

FIG. 57 is a configuration diagram showing one embodiment of the present invention.

FIG. 58 is a diagram showing the structure of the structure definition table.

FIG. 59 is a diagram showing the structure of the storage position pointer table.

FIG. 60 is a diagram showing the structure of the physical information table.

FIG. 61 is a flowchart of file writing in the embodiment shown in FIG. 57.

FIG. 62 is a time chart of file write processing in the collective magnetic disk device shown in FIG. 57.

FIG. 63 is a flowchart of file read processing in the embodiment shown in FIG. 57.

FIG. 64 is a diagram showing the configuration of the multiplex controller.

FIG. 65 is a time chart of file read processing in the collective magnetic disk device of the embodiment shown in FIG. 57.

FIG. 66 is a time chart of file read processing in the collective magnetic disk device of the embodiment shown in FIG. 57.

FIG. 67 is a time chart of file read processing in a collective magnetic disk device composed of three magnetic disk devices in the embodiment shown in FIG. 57.

FIG. 68 is a time chart of file read processing in a collective magnetic disk device composed of four magnetic disk devices in the embodiment shown in FIG. 57.

FIG. 69 is a time chart of file reading in a collective magnetic disk device composed of five magnetic disk devices in the embodiment shown in FIG. 57.

FIG. 70 is a time chart of file read processing with two collective magnetic disk devices in the embodiment shown in FIG. 56.

FIG. 71 is a configuration block diagram showing an embodiment connected to a LAN.

FIG. 72 is a block diagram showing a modification of the embodiment shown in FIG. 71.

FIG. 73 is a diagram showing a modification of the embodiment shown in FIG. 71.

FIG. 74 is a configuration block diagram showing the configuration of a modification of FIG. 71.

FIG. 75 is a diagram showing a specific example of a RAM disk device.

FIG. 76 is a diagram (part 1) showing an example of a code word expression character string.

FIG. 77 is a diagram (part 2) showing an example of a code word expression character string.

[Explanation of symbols]

1101: keyboard, 1102: search expression analysis program, 1103a: synonym and different-notation expansion program, 1105: bit search computer, 1106: string search engine, 1107: bit search program, 1110: text data file, 1145a: compound condition determination computer, 1146: search result storage memory, 1150: search machine control computer

Continuation of front page

(72) Inventor: Hisamitsu Kawaguchi, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo

(72) Inventor: Atsushi Hatakeyama, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo

(72) Inventor: Noriyuki Kaneoka, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo

(72) Inventor: Mitsuru Akisawa, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo

Claims

[Claims]

1. A data registration method for registering data in data storage means capable of storing a plurality of data, characterized by registering the data in the data storage means, and registering, in association with the registered data, a character component table indicating whether or not each predetermined character is included in the registered data.

2. The data registration method according to claim 1, wherein the data includes text data composed of character codes.

3. The data registration method according to claim 1, wherein the character component table is registered in the data storage means.

4. The data registration method according to claim 1, wherein each of the predetermined characters is a character that appears in at least one of the data to be registered and data stored in the data storage means before the data to be registered.

5. A data registration method for registering data in data storage means capable of storing a plurality of data, characterized by registering the data in the data storage means, and registering, in association with the registered data, condensed text data obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data.

6. The data registration method according to claim 5, wherein said condensed text data is registered in said data storage means.

7. The data registration method according to claim 5, wherein the condensed text data is registered with exclusion of words that are included in at least a predetermined number of items among the data registered in advance in the data storage means and the data being registered.

8. The data registration method according to claim 5, wherein the condensed text data is registered with attached words removed from the registered data.

9. A data registration method for registering data in data storage means capable of storing a plurality of data, characterized by registering the data in the data storage means, and registering, each in association with the registered data, a character component table indicating whether or not each predetermined character is included in the registered data, and condensed text data obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data.

10. The data registration method according to claim 9, wherein the predetermined character is a character that appears in the condensed text data.

11. A storage medium storing a program that causes a computer to register data in data storage means capable of storing a plurality of data, and to register, in association with the registered data, a character component table indicating whether or not each predetermined character is included in the registered data.

12. The storage medium according to claim 11, wherein the predetermined characters are characters that appear in at least one of the data to be registered and data stored in the data storage means before the data to be registered.

13. A storage medium storing a program that causes a computer to register data in data storage means capable of storing a plurality of data, and to register, in association with the registered data, condensed text data obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data.

14. A storage medium storing a program that causes a computer to register data in data storage means capable of storing a plurality of data, and to register, each in association with the registered data, a character component table indicating whether or not each predetermined character is included in the registered data, and condensed text data obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data.

15. The storage medium according to claim 14, wherein the predetermined character is a character that appears in the condensed text data.

16. A storage medium storing a character component table that corresponds to each of a plurality of data stored in data storage means, indicates whether or not each predetermined character is included in the data, and is searched, when a search is performed on the data stored in the data storage means, in order to eliminate data that cannot possibly contain an input search keyword.

17. A storage medium storing condensed text data that corresponds to each of a plurality of data stored in data storage means, is obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data, and is searched, when a search is performed on the data stored in the data storage means, in order to eliminate data that cannot possibly contain an input search keyword.

18. A storage medium storing search data that corresponds to each of a plurality of data stored in data storage means and is searched, when a search is performed on the data stored in the data storage means, in order to eliminate data that cannot possibly contain an input search keyword, the search data comprising a character component table indicating whether or not each predetermined character is included in the data, and condensed text data obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data.

19. The storage medium according to claim 18, wherein the predetermined character is a character that appears in the condensed text data.

20. A data registration device comprising: means for registering data in data storage means capable of storing a plurality of data; and means for registering, in association with the registered data, a character component table indicating whether or not each predetermined character is included in the registered data.

21. The data registration device according to claim 20, wherein each of the predetermined characters is a character that appears in at least one of the data to be registered and data stored in the data storage means before the data to be registered.

22. The data registration device according to claim 20, wherein the data includes document data having a character code.

23. A data registration device comprising: means for registering data in data storage means capable of storing a plurality of data; and means for registering, in association with the registered data, condensed text data obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data.

24. The data registration device according to claim 23, wherein the data includes document data having a character code.

25. A data registration device comprising: means for registering data in data storage means capable of storing a plurality of data; and means for registering, each in association with the registered data, a character component table indicating whether or not each predetermined character is included in the registered data, and condensed text data obtained from the registered data by eliminating duplication of words that appear repeatedly in the registered data.

26. The data registration device according to claim 25, wherein the predetermined character is a character that appears in the condensed text data.

27. The data registration device according to claim 25, wherein the data includes document data having a character code.