JPH08314964A

JPH08314964A - Index model creation device

Info

Publication number: JPH08314964A
Application number: JP7121065A
Authority: JP
Inventors: Mitsuaki Inaba (稲葉光昭); Naohiko Noguchi (野口直彦); Yuji Sugano (菅野祐司)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1995-05-19
Filing date: 1995-05-19
Publication date: 1996-11-29
Anticipated expiration: 2015-01-11
Also published as: JP2996895B2

Abstract

(57) [Summary]

[Purpose] In a document retrieval system using an electronic computer, to enable high-speed retrieval by setting the indexing units of a signature file based on the pre-search method appropriately according to user requests, the user's search history, and the index capacity.

[Constitution] For characters and character strings designated by the user through special category input means 112, search speed is improved for search requests that contain them. In addition, search request character string appearance frequency calculating means is provided to improve search speed for the search requests the user issues frequently. Furthermore, maximum index amount input means is provided to specify an upper limit on the index amount, and narrowing rate calculating means is provided to automatically create an index type whose narrowing rate keeps the index within that maximum index amount.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an index type creating apparatus for searching for character strings and the like in documents, in a document retrieval system or document editing system that uses an electronic computer.

[0002]

[Prior Art] In recent years, with the spread of word processors and personal computers, the growth in computer storage capacity, and the practical use of character recognition by computers, full-text databases that store all the character information in documents have been increasing. As a result, interest has grown in full-text database retrieval systems, which accumulate large amounts of character information and retrieve document information as needed.

[0003] In conventional document database systems, keyword search, which uses keywords manually assigned to each document as the key for retrieval, was the norm. This approach had several problems: keyword assignment could not keep pace with the growth of stored documents, keywords became obsolete over time, and searches missed documents because the person who assigned the keywords and the person searching interpreted them differently. Against this background, a document retrieval method called full-text search has attracted attention in recent years.

[0004] Full-text search falls broadly into two types. The "full-text scan" method holds no auxiliary information besides the document data and scans the entire document data for every search. The index method automatically creates, before any search, index information from which information on the characters or character strings appearing in the document data can be retrieved quickly, and consults this index at search time.

[0005] Of these, the full-text scan method uses no information other than the original documents, so it requires little storage and can search document data immediately after it is updated. Its search time is also nearly constant, even for complex search conditions that include character string patterns such as regular expressions or logical conditions, and even when there are many results. However, because it scans all of the document data, it has been pointed out that its search time is slower than that of the index method.

[0006] The index method, on the other hand, is generally faster than the full-text scan method, and depending on how the index is built, its search speed is almost independent of the volume of documents. However, problems have been pointed out: the index information is large, creating the index takes a long time, and search speed drops when the search conditions are complex or there are many results.

[0007] One approach to the problems of the index method is the pre-search method using a signature file. Within this approach, the present applicant has proposed a technique that creates a signature file type by grouping characters or character sequences according to their frequency of appearance (Japanese Patent Application No. Hei 5-253032). FIG. 4 is a block diagram showing the configuration of an index type creating apparatus according to an embodiment of that technique. Each document record in sample document data 401 is cut out by document delimiting means 403 based on position information from sample document delimiter data 402, and character appearance frequency calculating means 404 statistically examines the degree of appearance of each character. For low-frequency characters, whose degree of appearance is at or below a predetermined value (the narrowing rate), character grouping means 407 groups multiple characters together, distributing them so that the degree to which at least one character of a group appears does not exceed the predetermined narrowing rate. For high-frequency characters, whose degree of appearance in sample document data 401 exceeds the narrowing rate, two-character sequence appearance frequency calculating means 405 examines the degree of appearance in the sample document data of two-character sequences made up of two high-frequency characters. Low-frequency two-character sequences, whose degree of appearance is at or below the narrowing rate, are grouped by two-character sequence grouping means 408 in the same way as characters. For high-frequency two-character sequences, whose degree of appearance exceeds the narrowing rate, three-character sequence appearance frequency calculating means 406 examines the degree of appearance in the sample documents of three-character sequences whose first two characters and last two characters are each a high-frequency two-character sequence. Low-frequency three-character sequences, whose degree of appearance is at or below the narrowing rate, are grouped by three-character sequence grouping means 409 in the same way as characters and two-character sequences. Each high-frequency three-character sequence forms a group by itself. Index type output means 410 then outputs index type data 411, which assigns 1 bit of index information to each group.

[0008] In this way, even when the degrees of appearance of characters and character strings differ, the documents to be searched can be narrowed to the narrowing rate or below regardless of the search condition, and index type data can be created that yields a small index even when there are many kinds of low-frequency characters.

[0009]

[Problems to Be Solved by the Invention] However, in the above prior art that groups characters and character strings, characters and character strings with a low frequency of appearance in the sample documents are grouped, regardless of how often users use them in search requests, so that the degree to which at least one character or character string of the group appears does not exceed the predetermined narrowing rate. The improvement in search speed is therefore limited to about the reciprocal of the narrowing rate, and search speed cannot be improved beyond that.

[0010] In addition, the only parameter the user can set is the narrowing rate for the documents to be searched, and the amount of index data, which varies with the narrowing rate, cannot be known in advance. When storage capacity is limited, for example, index type data must be created repeatedly to find a suitable narrowing rate, so it takes time to arrive at the required index type data.

[0011] The present invention solves the above problems of the prior art. One object is to provide an index type creating apparatus that, for specific search requests set by the user, or for search requests that the user issues frequently as determined from the past search history, enables searches faster than the reciprocal of the narrowing rate without slowing searches for other requests. A further object is to provide an index type creating apparatus that, given a direct limit on the amount of index data, automatically sets a suitable narrowing rate and creates an index type.

[0012]

[Means for Solving the Problems] To achieve the above objects, an index type creating apparatus according to the present invention comprises special category input means for designating characters or character strings to be placed in single-element groups. It further comprises the following, which prior-art index creating apparatuses also have: character appearance frequency calculating means for statistically examining the degree of appearance of a single character in sample document data; a plurality of N-character sequence appearance frequency calculating means for statistically examining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N being a natural number 2, 3, ...) that contain all of the previously examined characters; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means, the plurality of N-character sequence appearance frequency calculating means, and the special category input means.

[0013] Also, an index type creating apparatus according to the present invention comprises search request character string appearance frequency calculating means for calculating the frequency with which search request character strings appear in the past search request history. It further comprises the following, which prior-art index creating apparatuses also have: character appearance frequency calculating means for statistically examining the degree of appearance of a single character in sample document data; a plurality of N-character sequence appearance frequency calculating means for statistically examining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N being a natural number 2, 3, ...) that contain all of the previously examined characters; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means, the plurality of N-character sequence appearance frequency calculating means, and the search request character string appearance frequency calculating means.

[0014] Also, an index type creating apparatus according to the present invention comprises: character appearance frequency calculating means for statistically examining the degree of appearance of a single character in sample document data; a plurality of N-character sequence appearance frequency calculating means for statistically examining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N being a natural number 2, 3, ...) that contain all of the previously examined characters; maximum index amount input means for entering a limit on the amount of index data; narrowing rate calculating means, which determines, from the outputs of the character appearance frequency calculating means, the plurality of N-character sequence appearance frequency calculating means, and the maximum index amount input means, a narrowing rate that allows an index no larger than the maximum index amount to be created, and outputs the result back to the character appearance frequency calculating means and the plurality of N-character sequence appearance frequency calculating means; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means and the plurality of N-character sequence appearance frequency calculating means.

[0015]

[Operation] With the above configuration, the character appearance frequency calculating means and the plurality of N-character sequence appearance frequency calculating means first examine the degree of appearance of characters or character strings in the sample document data against a given narrowing rate. When the grouping means then determines the categories of characters and character strings, it registers each low-frequency constituent character of the characters and character strings entered through the special category input means as a group consisting of that character alone. This makes it possible to create an index type that enables even faster searches for search request character strings containing those characters, without slowing searches for other requests.

[0016] Likewise, after the character appearance frequency calculating means and the plurality of N-character sequence appearance frequency calculating means have examined the degree of appearance of characters or character strings in the sample document data against a given narrowing rate, and when the grouping means determines the categories of characters and character strings, the search request character string appearance frequency calculating means calculates the frequency with which search request character strings appear in the past search request history. For the search request character strings that appear with high frequency, each constituent character that is a low-frequency character in the sample documents is registered as a group consisting of that character alone. This makes it possible to automatically create an index type that enables high-speed searches for the search requests each user issues frequently.

[0017] Further, the character appearance frequency calculating means and the plurality of N-character sequence appearance frequency calculating means examine the degree of appearance of characters or character strings in the sample document data against a given narrowing rate. The narrowing rate calculating means then calculates a narrowing rate from the appearance frequencies of the characters and character strings, with the index amount entered through the maximum index amount input means as the upper limit. The character appearance frequency calculating means and the plurality of N-character sequence appearance frequency calculating means examine the degree of appearance again against this narrowing rate, and the grouping means then determines the categories of characters and character strings. In this way, even when storage capacity is limited, an index type that achieves the greatest speedup within the preset maximum index amount can be created in a short time.

[0018]

[Embodiments]

(Embodiment 1) A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the index type creating apparatus in the first embodiment of the present invention. In FIG. 1, 101 is sample document data storing the plurality of document records that make up the document data. Sample document data 101 may be all or part of the document data to be searched, or may be other document data whose statistical properties regarding the appearance of characters and character strings resemble those of the document data to be searched. 102 is sample document delimiter data recording the position of each document record in sample document data 101. 103 is document delimiting means, which cuts the designated document record out of sample document data 101 according to the position information in sample document delimiter data 102 and outputs a character string in which the special character <START>, marking the start of a record, is added to the head of the document record and the special character <END>, marking the end of a record, is added to its tail. 104 is character appearance frequency calculating means, which receives the document record character strings output by document delimiting means 103 and calculates the degree of appearance of each character appearing in sample document data 101 as "the sum of the character counts of the document records in which the character appears, divided by the sum of the character counts of all document records". 105 is two-character sequence appearance frequency calculating means, which receives the document record character strings output by document delimiting means 103 and the result of character appearance frequency calculating means 104, and calculates the degree of appearance of the two-character sequences that appear with high frequency in sample document data 101 as "the sum of the character counts of the document records in which the two-character sequence appears, divided by the sum of the character counts of all document records". 106 is three-character sequence appearance frequency calculating means, which receives the document record character strings output by document delimiting means 103 and the result of two-character sequence appearance frequency calculating means 105, and calculates the degree of appearance of the three-character sequences that appear with high frequency in sample document data 101 as "the sum of the character counts of the document records in which the three-character sequence appears, divided by the sum of the character counts of all document records". 107 is character grouping means, which receives the result of character appearance frequency calculating means 104, groups multiple characters whose degree of appearance is at or below a predetermined "narrowing rate", and adjusts each group so that the degree to which any character of the group appears is as close as possible to the narrowing rate without exceeding it. 108 is two-character sequence grouping means, which receives the result of two-character sequence appearance frequency calculating means 105, groups multiple two-character sequences whose degree of appearance is at or below the narrowing rate, and adjusts each group so that the degree to which any two-character sequence of the group appears is as close as possible to the narrowing rate without exceeding it. 109 is three-character sequence grouping means, which receives the result of three-character sequence appearance frequency calculating means 106. Where there are multiple three-character sequences whose degree of appearance is at or below the narrowing rate, it groups them and adjusts each group so that the degree to which any three-character sequence of the group appears is as close as possible to the narrowing rate without exceeding it; each three-character sequence whose degree of appearance is higher than the narrowing rate is made a group by itself. 110 is index type output means, which receives the grouping information output by character grouping means 107, two-character sequence grouping means 108, and three-character sequence grouping means 109, assigns a serial number to each group, and outputs a table mapping each group's serial number to its member characters, two-character sequences, or three-character sequences. 111 is the index type data output by index type output means 110. Finally, 112 is special category input means, which instructs character grouping means 107 to create groups consisting only of the designated elements.
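The degree-of-appearance definition used by calculating means 104 can be illustrated with a minimal Python sketch. The function name `appearance_degrees`, the representation of records as plain strings, and the sample data are assumptions made for illustration; the <START> and <END> marker characters are omitted for brevity.

```python
from collections import defaultdict

def appearance_degrees(records):
    """Map each character to its degree of appearance.

    Degree = (total length of the records containing the character)
             / (total length of all records).
    """
    total_chars = sum(len(record) for record in records)
    if total_chars == 0:
        return {}
    weighted = defaultdict(int)
    for record in records:
        for ch in set(record):  # each record counts once per character
            weighted[ch] += len(record)
    return {ch: length / total_chars for ch, length in weighted.items()}

# Hypothetical three-record sample (10 characters in total).
sample_records = ["abca", "bd", "aaaa"]
degrees = appearance_degrees(sample_records)
# "a" appears in records of length 4 and 4, so its degree is 8/10.
```

Because the measure weights each record by its length, it approximates the fraction of text that a full-text scan would still have to read after the index has filtered on that character.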

[0019] The operation of the index type creating apparatus configured as above is now described. First, each document record in sample document data 101 is cut out by document delimiting means 103 and sent to character appearance frequency calculating means 104, which calculates the degree of appearance of each character as (the sum of the character counts of the document records in which the character appears) / (the sum of the character counts of all document records). Through special category input means 112, the user enters the search request character strings for which search speed should be improved. Character grouping means 107 receives the result of character appearance frequency calculating means 104, groups multiple characters whose degree of appearance is at or below the predetermined "narrowing rate", and adjusts each group so that the degree to which any character of the group appears is as close as possible to the narrowing rate without exceeding it. The degree to which any character of a group appears is calculated from the following formula, on the assumption that the appearances of the characters in the group are statistically independent.

[0020]

[Equation 1] Here, P is the degree to which any of the n characters in the group appears, and P_j (j = 1, 2, ..., n) is the degree to which the j-th character in the group appears.
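The formula image for Equation 1 is not reproduced in this text. Under the stated assumption that the appearances of the characters in a group are statistically independent, a form consistent with the definitions of P and P_j would be the complement of the probability that none of the n characters appears. This is a reconstruction from the surrounding definitions, not a copy of the original figure:

```latex
P = 1 - \prod_{j=1}^{n} \left( 1 - P_j \right)
```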

[0021] During grouping, for each character or character string entered through special category input means 112, every constituent character that character appearance frequency calculating means 104 found to be low-frequency is registered as a singleton group containing only that character.
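The grouping step with special categories can be sketched as follows. This is a simple greedy packing, assumed here for illustration; the source only requires that each group's combined degree come as close as possible to the narrowing rate without exceeding it, and does not specify the packing strategy. The function name, parameters, and sample values are hypothetical, and the combined degree uses the independence-based form 1 - prod(1 - p).

```python
def group_low_frequency(degrees, rate, special_strings=()):
    """Greedily pack low-frequency characters into groups.

    A group's combined degree, 1 - prod(1 - p), must stay at or below
    `rate`. Low-frequency characters occurring in `special_strings`
    become singleton groups instead of being packed.
    """
    low = {ch: p for ch, p in degrees.items() if p <= rate}
    special = {ch for s in special_strings for ch in s if ch in low}
    groups = [[ch] for ch in sorted(special)]  # singleton groups first
    current, miss = [], 1.0  # miss = probability that no member appears
    for ch, p in sorted(low.items(), key=lambda kv: (-kv[1], kv[0])):
        if ch in special:
            continue
        if current and 1.0 - miss * (1.0 - p) > rate:
            groups.append(current)  # adding ch would exceed the rate
            current, miss = [], 1.0
        current.append(ch)
        miss *= 1.0 - p
    if current:
        groups.append(current)
    return groups

# Hypothetical degrees; "a" is high-frequency and is therefore not grouped here.
degrees = {"a": 0.8, "x": 0.05, "y": 0.04, "z": 0.03, "q": 0.01}
groups = group_low_frequency(degrees, rate=0.1, special_strings=["aq"])
```

In this example the user-designated string "aq" contributes only "q" as a singleton group, since "a" exceeds the narrowing rate and is handled by the two-character stage instead.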

[0022] When the first scan of sample document data 101 is finished, document delimiting means 103 starts a second scan of sample document data 101 and sends the cut-out document records to two-character sequence appearance frequency calculating means 105. Two-character sequence appearance frequency calculating means 105 extracts, from the two-character sequences in each document record, only those made up of two high-frequency characters, and calculates the degree of appearance of each as "the sum of the character counts of the document records in which the two-character sequence appears / the sum of the character counts of all document records". Of the two-character sequences made up of high-frequency characters, all those other than high-frequency two-character sequences are grouped by two-character sequence grouping means 108, using the same criterion as Equation (1), so that the degree to which any two-character sequence of a group appears is at or below the narrowing rate.
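The second scan can be sketched in Python as below. The function name `bigram_degrees`, the string representation of records, and the sample data are illustrative assumptions; only two-character sequences whose characters are both high-frequency are counted, as the paragraph describes.

```python
def bigram_degrees(records, high_chars):
    """Degree of appearance of 2-character sequences made of high-frequency characters.

    Degree = (total length of the records containing the sequence)
             / (total length of all records).
    """
    total_chars = sum(len(record) for record in records)
    if total_chars == 0:
        return {}
    weighted = {}
    for record in records:
        seen = {
            record[i:i + 2]
            for i in range(len(record) - 1)
            if record[i] in high_chars and record[i + 1] in high_chars
        }
        for bigram in seen:  # each record counts once per sequence
            weighted[bigram] = weighted.get(bigram, 0) + len(record)
    return {bigram: n / total_chars for bigram, n in weighted.items()}

# Hypothetical sample in which "a" and "b" are the high-frequency characters.
records = ["abab", "abc", "cab"]
bigrams = bigram_degrees(records, high_chars={"a", "b"})
```

The resulting degrees would then be split at the narrowing rate: sequences at or below it go to grouping means 108, and those above it seed the three-character stage.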

[0023] When the second scan of the sample document data 101 is complete, the document delimiting means 103 starts a third scan of the sample document data 101 and sends each extracted document record to the three-character sequence appearance frequency calculating means 106. The three-character sequence appearance frequency calculating means 106 extracts, from the three-character sequences in each document record, only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". The result is sent to the three-character sequence grouping means 109, which groups the sequences on the basis of the narrowing rate, using the same criterion as formula (1).
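The appearance measure and the three-character extraction rule described above can be sketched as follows. This is only an illustration under stated assumptions: document records are modelled as Python strings, and the function names `appearance_degree` and `candidate_trigrams` are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def appearance_degree(records, n):
    """Degree of appearance of every n-character sequence.

    For each sequence: (sum of the lengths of the records that contain it)
    divided by (sum of the lengths of all records).
    """
    total_chars = sum(len(r) for r in records)
    weight = defaultdict(int)
    for record in records:
        # A set, so a record is counted once per sequence it contains.
        seen = {record[i:i + n] for i in range(len(record) - n + 1)}
        for seq in seen:
            weight[seq] += len(record)
    return {seq: w / total_chars for seq, w in weight.items()}


def candidate_trigrams(record, frequent_bigrams):
    """Three-character sequences whose (1st, 2nd) and (2nd, 3rd) pairs
    are both high-frequency two-character sequences."""
    return {
        record[i:i + 3]
        for i in range(len(record) - 2)
        if record[i:i + 2] in frequent_bigrams
        and record[i + 1:i + 3] in frequent_bigrams
    }
```

Only the sequences returned by `candidate_trigrams` would need to be counted in the third scan, which keeps the number of three-character sequences under consideration small.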

[0024] The grouping information obtained in this way is sent to the index type output means 110, which outputs to the index type data 111 an index type that assigns 1 bit of index information to each low-frequency character group, each two-character sequence group, and each three-character sequence group.
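One way to picture an index type that assigns one bit per group is the sketch below. It assumes, hypothetically, that groups are given as a list of sets of characters or character sequences and that each document record gets an integer bit mask as its signature; the names `build_bit_assignment`, `record_signature`, and `may_contain` are illustrative and not taken from the patent.

```python
def build_bit_assignment(groups):
    """Map every member of the i-th group to bit position i."""
    return {key: bit for bit, group in enumerate(groups) for key in group}


def record_signature(text, bit_of, max_n=3):
    """Set one bit for every group that has a member occurring in the text."""
    signature = 0
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            bit = bit_of.get(text[i:i + n])
            if bit is not None:
                signature |= 1 << bit
    return signature


def may_contain(record_sig, query, bit_of):
    """Pre-search test: a record can match only if every bit required by
    the query is set in the record's signature."""
    query_sig = record_signature(query, bit_of)
    return record_sig & query_sig == query_sig
```

Records that fail `may_contain` can be skipped; the remaining records still need a full-text scan, because several members share each bit.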

[0025] As described above, according to this embodiment, a character that appears rarely in the sample documents but that the user wants to search for at high speed is designated as a special category and is not grouped. This makes it possible to create an index type that allows high-speed retrieval for search requests containing that character, without greatly increasing the index size and without reducing the retrieval speed for other search requests. In particular, consider a search for a single character designated as a special category, and let c be the narrowing rate and c′ the degree to which that character appears in the documents (c′ < c < 1). With the conventional method, in which the character is grouped, c times the total document volume must be full-text scanned, whereas in this embodiment only c′ times the total document volume needs to be full-text scanned, so the retrieval speed improves by a factor of c/c′.
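As a small worked illustration of the c/c′ argument, the sketch below computes the ratio of the two scanned fractions. The function name and the example values are assumptions chosen for illustration; exact fractions are used so that the arithmetic is not blurred by floating-point rounding.

```python
from fractions import Fraction


def full_text_scan_speedup(c, c_prime):
    """Ratio of the document fraction scanned with grouping (c) to the
    fraction scanned when the character has its own group (c')."""
    c, c_prime = Fraction(c), Fraction(c_prime)
    if not 0 < c_prime < c < 1:
        raise ValueError("expected 0 < c' < c < 1")
    return c / c_prime
```

For example, with a narrowing rate of 0.05 and a character that appears in 0.01 of the document volume, `full_text_scan_speedup("0.05", "0.01")` evaluates to 5.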

[0026] (Embodiment 2) Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a block diagram showing the configuration of the index type creating apparatus according to the second embodiment of the present invention. In FIG. 2, 201 is sample document data, 202 is sample document delimiter data, 203 is a document delimiting means, 204 is a character appearance frequency calculating means, 205 is a two-character sequence appearance frequency calculating means, 206 is a three-character sequence appearance frequency calculating means, 207 is a character grouping means, 208 is a two-character sequence grouping means, 209 is a three-character sequence grouping means, 210 is an index type output means, and 211 is index type data. In addition, 212 is search request history data, and 213 is a search request character string appearance frequency calculating means that calculates the appearance frequency of search request character strings from the past search request history data 212 and instructs the character grouping means 207 to create groups consisting of a single element.

[0027] The operation of the index type creating apparatus configured as described above will now be explained. First, each document record in the sample document data 201 is extracted by the document delimiting means 203 and sent to the character appearance frequency calculating means 204, where the degree of appearance of each character is calculated as "the total number of characters in the document records in which the character appears / the total number of characters in all document records". The character grouping means 207 receives the result from the character appearance frequency calculating means 204 and groups characters whose degree of appearance is no greater than a predetermined "narrowing rate", adjusting each group so that the degree to which any character belonging to the group appears is as close as possible to the narrowing rate without exceeding it. The degree to which any character of a group appears is obtained from formula (1), on the assumption that the appearances of the characters within a group are statistically independent. During grouping, for each search request character or character string that the search request character string appearance frequency calculating means 213 has found, from the search request history data 212, to have a high appearance frequency, every constituent character that the character appearance frequency calculating means 204 found to be low-frequency is registered as a standalone group consisting of that character alone.
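The grouping step, including the standalone groups derived from the search request history, might look roughly like the sketch below. Several things here are assumptions rather than the patent's method: formula (1) is taken to be the usual independence form 1 − Π(1 − p_i), groups are packed by a simple greedy first-fit heuristic, and the `min_count` threshold that decides which search requests count as frequent is hypothetical.

```python
from collections import Counter


def frequent_query_chars(history, min_count):
    """Characters of search requests that occur at least min_count times."""
    counts = Counter(history)
    return {ch for query, n in counts.items() if n >= min_count for ch in query}


def group_probability(probs):
    """Chance that at least one member appears, assuming independence."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss


def group_low_frequency(degree, rate, singletons=frozenset()):
    """Group low-frequency characters so that no group exceeds the rate.

    degree: dict mapping character -> degree of appearance.
    singletons: characters that must each form their own group.
    """
    fixed = [[ch] for ch in sorted(singletons)
             if ch in degree and degree[ch] <= rate]
    pending = sorted(
        (ch for ch, p in degree.items() if p <= rate and ch not in singletons),
        key=lambda ch: -degree[ch],
    )
    packed = []
    for ch in pending:
        for group in packed:
            trial = [degree[m] for m in group] + [degree[ch]]
            if group_probability(trial) <= rate:
                group.append(ch)
                break
        else:
            packed.append([ch])  # no existing group has room
    return fixed + packed
```

A greedy packing does not guarantee that every group ends up as close as possible to the narrowing rate; it only guarantees that no group exceeds it.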

[0028] When the first scan of the sample document data 201 is complete, the document delimiting means 203 starts a second scan of the sample document data 201 and sends each extracted document record to the two-character sequence appearance frequency calculating means 205. The two-character sequence appearance frequency calculating means 205 extracts, from the two-character sequences in each document record, only those made up of two high-frequency characters, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the two-character sequence appears / the total number of characters in all document records". Of the two-character sequences made up of high-frequency characters, all those other than the high-frequency two-character sequences are grouped by the two-character sequence grouping means 208, using the same criterion as formula (1), so that the degree to which any two-character sequence belonging to a group appears is no greater than the narrowing rate.

[0029] When the second scan of the sample document data 201 is complete, the document delimiting means 203 starts a third scan of the sample document data 201 and sends each extracted document record to the three-character sequence appearance frequency calculating means 206. The three-character sequence appearance frequency calculating means 206 extracts, from the three-character sequences in each document record, only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". The result is sent to the three-character sequence grouping means 209, which groups the sequences on the basis of the narrowing rate, using the same criterion as formula (1).

[0030] The grouping information obtained in this way is sent to the index type output means 210, which outputs to the index type data 211 an index type that assigns 1 bit of index information to each low-frequency character group, each two-character sequence group, and each three-character sequence group.

[0031] As described above, according to this embodiment, characters that appear rarely in the sample documents but that the user frequently uses in search requests are selected automatically from the search request history and are not grouped. This makes it possible to create an index type that allows high-speed retrieval tailored to each user, without greatly increasing the index size and without reducing the retrieval speed for other search requests.

[0032] (Embodiment 3) Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 3 is a block diagram showing the configuration of the index type creating apparatus according to an embodiment of the present invention. In FIG. 3, 301 is sample document data, 302 is sample document delimiter data, 303 is a document delimiting means, 304 is a character appearance frequency calculating means, 305 is a two-character sequence appearance frequency calculating means, 306 is a three-character sequence appearance frequency calculating means, 307 is a character grouping means, 308 is a two-character sequence grouping means, 309 is a three-character sequence grouping means, 310 is an index type output means, and 311 is index type data. In addition, 312 is a maximum index amount input means for entering the maximum size of the index to be created, and 313 is a narrowing rate calculating means that receives the input from the maximum index amount input means 312 together with the results of the character appearance frequency calculating means 304, the two-character sequence appearance frequency calculating means 305, and the three-character sequence appearance frequency calculating means 306, calculates a narrowing rate, and outputs the result back to the character appearance frequency calculating means 304, the two-character sequence appearance frequency calculating means 305, and the three-character sequence appearance frequency calculating means 306.

[0033] The operation of the index type creating apparatus configured as described above will now be explained. First, each document record in the sample document data 301 is extracted by the document delimiting means 303 and sent to the character appearance frequency calculating means 304, where the degree of appearance of each character is calculated as "the total number of characters in the document records in which the character appears / the total number of characters in all document records". Let N be the total number of characters that appear in the documents. Characters whose appearance frequency is higher than a predetermined value c, the initial value of the narrowing rate, are treated as high-frequency characters (their number is denoted α(c)), and all other characters are treated as low-frequency characters.

[0034] When the first scan of the sample document data 301 is complete, the document delimiting means 303 starts a second scan of the sample document data 301 and sends each extracted document record to the two-character sequence appearance frequency calculating means 305. The two-character sequence appearance frequency calculating means 305 extracts, from the two-character sequences in each document record, only those made up of two high-frequency characters (their total number is denoted W(c)), and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the two-character sequence appears / the total number of characters in all document records". Two-character sequences whose appearance frequency is higher than c are treated as high-frequency two-character sequences (their number is denoted β(c)).

[0035] When the second scan of the sample document data 301 is complete, the document delimiting means 303 starts a third scan of the sample document data 301 and sends each extracted document record to the three-character sequence appearance frequency calculating means 306. The three-character sequence appearance frequency calculating means 306 extracts, from the three-character sequences in each document record, only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences (their total number is denoted T(c)), and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". Three-character sequences whose appearance frequency is higher than c are treated as high-frequency three-character sequences (their number is denoted γ(c)).

[0036] From the character appearance frequency distribution, the two-character sequence appearance frequency distribution, and the three-character sequence appearance frequency distribution obtained in this way for the initial narrowing rate c, and from the upper limit on the index size entered through the maximum index amount input means 312, the narrowing rate calculating means 313 determines the narrowing rate by the following method, without examining the appearance frequency distributions again.

[0037] The character appearance frequency distribution does not change with the narrowing rate. The number of high-frequency characters at a narrowing rate $c_1$ can therefore be read directly from the character appearance frequency distribution obtained earlier; it is denoted $\alpha(c_1)$. Assuming that the number of two-character sequences of high-frequency characters that appear in the documents is proportional to the total number of possible pairings of high-frequency characters, the number $W(c_1)$ of two-character sequences of high-frequency characters at narrowing rate $c_1$ is given by formula (2).

$$W(c_1) = W(c) \times \frac{\alpha(c_1)^2}{\alpha(c)^2} \tag{2}$$

Assuming further that the two-character sequence appearance frequency distribution at narrowing rate $c_1$ (rank on the x-axis, appearance frequency on the y-axis) is the distribution at narrowing rate $c$ scaled along the x-axis, the number $\beta(c_1)$ of high-frequency two-character sequences at narrowing rate $c_1$ can be expressed as formula (3), using $\beta'(c_1)$, the number of two-character sequences whose appearance frequency is higher than $c_1$ in the two-character sequence appearance frequency distribution at narrowing rate $c$.

$$\beta(c_1) = \beta'(c_1) \times \frac{W(c_1)}{W(c)} = \beta'(c_1) \times \frac{\alpha(c_1)^2}{\alpha(c)^2} \tag{3}$$

[0038] Similarly, assume that the number of three-character sequences appearing in the documents in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences is proportional to the total number of possible pairings of high-frequency two-character sequences. The number $T(c_1)$ of three-character sequences satisfying this condition at narrowing rate $c_1$ is then given by formula (4).

$$T(c_1) = T(c) \times \frac{\beta(c_1)^2}{\beta(c)^2} \tag{4}$$

[0039] Assuming that the three-character sequence appearance frequency distribution at narrowing rate $c_1$ is the distribution at narrowing rate $c$ scaled along the x-axis, the number $\gamma(c_1)$ of high-frequency three-character sequences at narrowing rate $c_1$ can be expressed as formula (5), using $\gamma'(c_1)$, the number of three-character sequences whose appearance frequency is higher than $c_1$ in the three-character sequence appearance frequency distribution at narrowing rate $c$.

$$\gamma(c_1) = \gamma'(c_1) \times \frac{T(c_1)}{T(c)} = \gamma'(c_1) \times \frac{\beta(c_1)^2}{\beta(c)^2} \tag{5}$$
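Formulas (2) to (5) chain together, which a short sketch may make easier to follow. The function names and argument names below are assumptions; the inputs are the counts measured once at the initial rate c, plus the counts β′(c₁) and γ′(c₁) read off the stored distributions for the new rate c₁.

```python
def count_above(degrees, threshold):
    """Number of entries whose appearance degree is strictly above the
    threshold (used to read alpha, beta' or gamma' off a distribution)."""
    return sum(1 for d in degrees if d > threshold)


def rescale_counts(alpha_c, alpha_c1, W_c, beta_c, beta_prime_c1,
                   T_c, gamma_prime_c1):
    """Estimate W, beta, T and gamma at the new rate c1 without rescanning."""
    pair_ratio = (alpha_c1 / alpha_c) ** 2       # alpha(c1)^2 / alpha(c)^2
    W_c1 = W_c * pair_ratio                      # formula (2)
    beta_c1 = beta_prime_c1 * pair_ratio         # formula (3)
    triple_ratio = (beta_c1 / beta_c) ** 2       # beta(c1)^2 / beta(c)^2
    T_c1 = T_c * triple_ratio                    # formula (4)
    gamma_c1 = gamma_prime_c1 * triple_ratio     # formula (5)
    return W_c1, beta_c1, T_c1, gamma_c1
```

Note that β(c₁) from formula (3) feeds formulas (4) and (5), so the character-level ratio enters the three-character estimates to the fourth power.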

[0040] The size of the index to be created is the total number of low-frequency character groups, low-frequency two-character sequence groups, and three-character sequence groups obtained by the subsequent grouping means, multiplied by the number of document records R (in bits). If $S_1$ is the sum of the appearance frequencies of the low-frequency characters at narrowing rate $c_1$, the number of low-frequency character groups can be approximated as $S_1 / c_1$. The sum $S_2$ of the appearance frequencies of the low-frequency two-character sequences at narrowing rate $c_1$ can be regarded as the area enclosed by the three lines $x = \beta(c_1)$, $x = W(c_1)$, $y = 0$ and the curve of the two-character sequence appearance frequency distribution at narrowing rate $c_1$. It can therefore be expressed as formula (6) using $S_2'$, the area enclosed by the three lines $x = \beta'(c_1)$, $x = W(c)$, $y = 0$ and the curve of the two-character sequence appearance frequency distribution at narrowing rate $c$, that is, the sum of the appearance frequencies of those two-character sequences whose appearance frequency is $c_1$ or less in the distribution at narrowing rate $c$.

$$S_2 = S_2' \times \frac{W(c_1)}{W(c)} = S_2' \times \frac{\alpha(c_1)^2}{\alpha(c)^2} \tag{6}$$

[0041] For the total number of three-character sequence groups, the number of high-frequency three-character sequences is assumed to be negligibly small, and only low-frequency three-character sequences are considered. The sum $S_3$ of the appearance frequencies of the low-frequency three-character sequences at narrowing rate $c_1$ can be regarded as the area enclosed by the three lines $x = \gamma(c_1)$, $x = T(c_1)$, $y = 0$ and the curve of the three-character sequence appearance frequency distribution at narrowing rate $c_1$. It can therefore be expressed as formula (7) using $S_3'$, the area enclosed by the three lines $x = \gamma'(c_1)$, $x = T(c)$, $y = 0$ and the curve of the three-character sequence appearance frequency distribution at narrowing rate $c$, that is, the sum of the appearance frequencies of those three-character sequences whose appearance frequency is $c_1$ or less in the distribution at narrowing rate $c$.

$$S_3 = S_3' \times \frac{T(c_1)}{T(c)} = S_3' \times \frac{\beta(c_1)^2}{\beta(c)^2} = S_3' \times \frac{\beta'(c_1)^2}{\beta(c)^2} \times \frac{\alpha(c_1)^4}{\alpha(c)^4} \tag{7}$$

[0042] In other words, the size $I(c_1)$ (in bits) of the index created with narrowing rate $c_1$ can be obtained approximately, as in formula (8), from values that can be computed from the appearance frequency distributions at narrowing rate $c$.

$$I(c_1) = \frac{S_1 + S_2 + S_3}{c_1} \times R = \frac{S_1 + S_2' \times \dfrac{\alpha(c_1)^2}{\alpha(c)^2} + S_3' \times \dfrac{\beta'(c_1)^2}{\beta(c)^2} \times \dfrac{\alpha(c_1)^4}{\alpha(c)^4}}{c_1} \times R \tag{8}$$
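Formula (8) can be evaluated directly once the quantities at the initial rate c are known. The sketch below is a plain transcription of the formula; the function and parameter names are assumptions, with `S2_prime` and `S3_prime` standing for S₂′ and S₃′.

```python
def estimate_index_bits(c1, R, S1, S2_prime, S3_prime,
                        alpha_c, alpha_c1, beta_c, beta_prime_c1):
    """Approximate index size in bits for narrowing rate c1, formula (8)."""
    pair_ratio = (alpha_c1 / alpha_c) ** 2                 # alpha(c1)^2 / alpha(c)^2
    S2 = S2_prime * pair_ratio                             # formula (6)
    S3 = S3_prime * (beta_prime_c1 / beta_c) ** 2 * pair_ratio ** 2  # formula (7)
    return (S1 + S2 + S3) / c1 * R
```

With the illustrative values c₁ = 0.5, R = 1000, S₁ = 1, S₂′ = 2, S₃′ = 4, α(c) = 10, α(c₁) = 20, β(c) = 8 and β′(c₁) = 4, the estimate is (1 + 8 + 16) / 0.5 × 1000 = 50000 bits.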

[0043] The narrowing rate calculating means 313 calculates a narrowing rate $c_1$ that allows an index no larger than the maximum index amount to be created, and outputs it back to the character appearance frequency calculating means 304, the two-character sequence appearance frequency calculating means 305, and the three-character sequence appearance frequency calculating means 306.
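The text does not say how the narrowing rate calculating means searches for c₁. One simple possibility, shown here purely as an assumption, is to try a list of candidate rates from smallest to largest and keep the first whose estimated index size fits under the limit; `estimate_bits` is a hypothetical callable that implements the estimate of formula (8) for a given rate.

```python
def choose_narrowing_rate(candidates, estimate_bits, max_bits):
    """Smallest candidate rate whose estimated index size fits the limit.

    A smaller rate narrows the search more, so the candidates are tried
    in ascending order. Returns None if no candidate fits.
    """
    for c1 in sorted(candidates):
        if estimate_bits(c1) <= max_bits:
            return c1
    return None
```

Because only the stored distributions are consulted, trying many candidate rates costs no additional scans of the sample document data.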

[0044] The character grouping means 307 receives the result from the character appearance frequency calculating means 304 and groups characters whose degree of appearance is no greater than the narrowing rate $c_1$, adjusting each group so that the degree to which any character belonging to the group appears is as close as possible to the narrowing rate without exceeding it. The degree to which any character of a group appears is obtained from formula (1), on the assumption that the appearances of the characters within a group are statistically independent.

[0045] The two-character sequence appearance frequency calculating means 305 extracts, from the two-character sequences in each document record, only those made up of two high-frequency characters, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the two-character sequence appears / the total number of characters in all document records". Of the two-character sequences made up of high-frequency characters, all those other than the high-frequency two-character sequences are grouped by the two-character sequence grouping means 308, using the same criterion as formula (1), so that the degree to which any two-character sequence belonging to a group appears is no greater than the narrowing rate.

[0046] The three-character sequence appearance frequency calculating means 306 extracts, from the three-character sequences in each document record, only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". The result is sent to the three-character sequence grouping means 309, which groups the sequences on the basis of the narrowing rate $c_1$, using the same criterion as formula (1).

[0047] The grouping information obtained in this way is sent to the index type output means 310, which outputs to the index type data 311 an index type that assigns 1 bit of index information to each low-frequency character group, each two-character sequence group, and each three-character sequence group.

[0048] As described above, according to this embodiment, even when the storage capacity of the computer is limited, an appropriate narrowing rate is obtained automatically from a directly specified upper limit on the amount of index data, so that an index type of a size that meets the requirement can be created without repeating index type creation many times.

[0049]

[Effects of the Invention] As described above, according to the present invention, it is possible to create an index type that, for specific search requests set by the user or for search requests that the user often uses as determined from the past search history, achieves an improvement in retrieval speed by a factor greater than the reciprocal of the pre-specified narrowing rate, without reducing the retrieval speed for other search requests.

[0050] In addition, when the capacity of the storage device is limited, for example, the user can set an upper limit on the size of the index to be created. There is no need to determine the optimum narrowing rate by repeated trial and error, which saves the user effort and shortens the total time needed to create the index type.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an index type creating apparatus according to a first embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of an index type creating apparatus according to a second embodiment of the present invention.

FIG. 3 is a block diagram showing the configuration of an index type creating apparatus according to a third embodiment of the present invention.

FIG. 4 is a block diagram showing the configuration of a conventional index type creating apparatus.

[Explanation of symbols]

101 Sample document data
102 Sample document delimiter data
103 Document delimiting means
104 Character appearance frequency calculating means
105 Two-character sequence appearance frequency calculating means
106 Three-character sequence appearance frequency calculating means
107 Character grouping means
108 Two-character sequence grouping means
109 Three-character sequence grouping means
110 Index type output means
111 Index type data
112 Special category input means
201 Sample document data
202 Sample document delimiter data
203 Document delimiting means
204 Character appearance frequency calculating means
205 Two-character sequence appearance frequency calculating means
206 Three-character sequence appearance frequency calculating means
207 Character grouping means
208 Two-character sequence grouping means
209 Three-character sequence grouping means
210 Index type output means
211 Index type data
212 Search request history data
213 Search request character string appearance frequency calculating means
301 Sample document data
302 Sample document delimiter data
303 Document delimiting means
304 Character appearance frequency calculating means
305 Two-character sequence appearance frequency calculating means
306 Three-character sequence appearance frequency calculating means
307 Character grouping means
308 Two-character sequence grouping means
309 Three-character sequence grouping means
310 Index type output means
311 Index type data
312 Maximum index amount input means
313 Narrowing rate calculating means
401 Sample document data
402 Sample document delimiter data
403 Document delimiting means
404 Character appearance frequency calculating means
405 Two-character sequence appearance frequency calculating means
406 Three-character sequence appearance frequency calculating means
407 Character grouping means
408 Two-character sequence grouping means
409 Three-character sequence grouping means
410 Index type output means
411 Index type data

Claims

[Claims]

1. An index type creating apparatus comprising: character appearance frequency calculating means for statistically examining the degree of appearance of a given character in sample document data; a plurality of N-character consecutive appearance frequency calculating means for statistically examining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N being a natural number 2, 3, ...) that include all of the previously examined characters; special classification input means for specifying a character or character string that is to form a group of its own consisting of only one element; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means and the plurality of N-character consecutive appearance frequency calculating means and on the output of the special classification input means.
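By way of non-limiting illustration, the arrangement of claim 1 might be sketched as follows. The function names, the use of raw occurrence counts as the "degree of appearance", and the balancing policy (each string goes to the currently lightest group) are assumptions made for this sketch; the claim itself does not fix them.

```python
from collections import Counter


def count_strings(sections, n, must_contain=None):
    """Count n-character strings over all document sections.

    If must_contain is given, only strings that contain at least one of the
    previously examined strings are counted.
    """
    counts = Counter()
    for text in sections:
        for i in range(len(text) - n + 1):
            s = text[i:i + n]
            if must_contain is None or any(p in s for p in must_contain):
                counts[s] += 1
    return counts


def above(counts, threshold):
    """Return the strings whose count is higher than the threshold."""
    return {s for s, c in counts.items() if c > threshold}


def group_strings(counts, n, special, n_groups):
    """Group n-character strings; user-specified strings get singleton groups."""
    singles = [[s] for s in sorted(special) if len(s) == n]
    bins = [[] for _ in range(n_groups)]
    loads = [0] * n_groups
    rest = [(s, c) for s, c in counts.items() if s not in special]
    # Assumed policy: place the next most frequent string in the lightest bin.
    for s, c in sorted(rest, key=lambda item: (-item[1], item[0])):
        k = loads.index(min(loads))
        bins[k].append(s)
        loads[k] += c
    return singles + [b for b in bins if b]


def build_index_type(sections, threshold, special, n_groups):
    """Examine 1-, 2- and 3-character strings in turn, then group each level."""
    c1 = count_strings(sections, 1)
    c2 = count_strings(sections, 2, above(c1, threshold))
    c3 = count_strings(sections, 3, above(c2, threshold))
    return {n: group_strings(c, n, special, n_groups)
            for n, c in ((1, c1), (2, c2), (3, c3))}
```

In this sketch, calling `build_index_type(sections, threshold, special, n_groups)` returns a dictionary keyed by string length, and every string listed in `special` occupies a group by itself, so a search request containing it is not diluted by other members of the same group.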

2. An index type creating apparatus comprising: character appearance frequency calculating means for statistically examining the degree of appearance of a given character in sample document data; a plurality of N-character consecutive appearance frequency calculating means for statistically examining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N being a natural number 2, 3, ...) that include all of the previously examined characters; search request character string appearance frequency calculating means for calculating the appearance frequency of search request character strings in a history of past search requests; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means and the plurality of N-character consecutive appearance frequency calculating means and on the output of the search request character string appearance frequency calculating means.
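By way of non-limiting illustration, the search request character string appearance frequency calculating means of claim 2 might be sketched as follows. The function names, the limit of three characters, and the `min_count` cut-off are assumptions for this sketch; the claim does not state how the frequencies are used by the grouping means.

```python
from collections import Counter


def query_string_frequencies(history, max_n=3):
    """Count every 1- to max_n-character string found in past search requests."""
    counts = Counter()
    for query in history:
        for n in range(1, max_n + 1):
            for i in range(len(query) - n + 1):
                counts[query[i:i + n]] += 1
    return counts


def frequent_query_strings(history, min_count, max_n=3):
    """Return the strings that occur at least min_count times in the history."""
    counts = query_string_frequencies(history, max_n)
    return {s for s, c in counts.items() if c >= min_count}
```

One possible use, under the same assumptions, is to hand the returned set to the grouping means so that frequently requested strings are kept in small groups and searches for them are narrowed more strongly.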

3. An index type creating apparatus comprising: character appearance frequency calculating means for statistically examining the degree of appearance of a given character in sample document data; a plurality of N-character consecutive appearance frequency calculating means for statistically examining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N being a natural number 2, 3, ...) that include all of the previously examined characters; maximum index amount input means for inputting a limit on the amount of index data; narrowing-down rate calculating means for calculating, from the outputs of the character appearance frequency calculating means and the plurality of N-character consecutive appearance frequency calculating means and from the output of the maximum index amount input means, a narrowing-down rate that allows an index no larger than the maximum index amount to be created, and for outputting the result back to the character appearance frequency calculating means and the plurality of N-character consecutive appearance frequency calculating means; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means and the plurality of N-character consecutive appearance frequency calculating means.
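By way of non-limiting illustration, the narrowing-down rate calculating means of claim 3 might be sketched as follows. The sketch assumes that the rate is expressed as the percentage of document sections that may remain as candidates after the pre-search, and that a smaller rate requires a larger index. The size estimator `toy_index_size` is purely hypothetical and stands in for whatever estimate the apparatus derives from the frequency calculating means.

```python
def choose_narrowing_rate(estimate_index_size, max_index_size, candidate_rates):
    """Return the smallest candidate rate (in percent) whose index fits the limit.

    A smaller rate means fewer candidate sections survive the pre-search,
    which is assumed here to require a larger index.
    """
    for rate in sorted(candidate_rates):
        if estimate_index_size(rate) <= max_index_size:
            return rate
    return None  # no candidate fits under the limit


def toy_index_size(rate_percent, n_sections=1000, base_entries=64):
    """Hypothetical estimate in bytes: one bit per (index entry, section) pair."""
    entries = base_entries * 100 // rate_percent
    return entries * n_sections // 8
```

With candidate rates of 5, 10, 20 and 50 percent and a limit of 100000 bytes, this toy estimator yields 10 percent as the strongest narrowing that fits. The chosen rate would then be returned to the frequency calculating means, as recited in the claim.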