JPH08314964A - Index model creation device - Google Patents

Index model creation device

Info

Publication number
JPH08314964A
Authority
JP
Japan
Prior art keywords
character
characters
appearance frequency
appearance
consecutive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP7121065A
Other languages
Japanese (ja)
Other versions
JP2996895B2 (en)
Inventor
Mitsuaki Inaba
稲葉 光昭
Naohiko Noguchi
野口 直彦
Yuji Sugano
菅野 祐司
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd
Priority to JP7121065A
Publication of JPH08314964A
Application granted
Publication of JP2996895B2
Anticipated expiration
Expired - Fee Related (current legal status)

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

(57) [Abstract]
[Purpose] In a document retrieval system running on a computer, to enable fast retrieval by setting the indexing units of the signature file used for pre-search appropriately according to the user's requests, the user's search history, and the index capacity.
[Constitution] For characters and character strings that the user designates through the special category input means 112, search speed is improved for search requests that contain them. A search request character string appearance frequency calculating means improves search speed for the search requests the user issues frequently. In addition, a maximum index amount input means lets the user specify an upper limit on the index size, and a narrowing rate calculating means automatically creates an index type whose narrowing rate keeps the index within that maximum.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an index type creating device for searching for character strings and the like in documents, for use in document retrieval systems and document editing systems running on a computer.

[0002]

[Prior Art] In recent years, with the spread of word processors and personal computers, the growth in computer storage capacity, and the practical use of computer character recognition, full-text databases that store all of the character information in documents have been increasing. As a result, interest is growing in full-text database retrieval systems, which accumulate large amounts of character information and retrieve document information as needed.

[0003] In conventional document database systems, the usual approach was keyword search, in which keywords assigned manually to each document serve as the keys for retrieval. This approach had several problems: keyword assignment cannot keep pace with the growth of the stored documents; keywords become outdated over time; and documents are missed because the person who assigned the keywords and the person searching interpret them differently. Against this background, a retrieval method called full-text search has attracted attention in recent years.

[0004] Full-text search methods fall broadly into two types. The "full-text scan" method keeps no auxiliary information besides the document data and scans the entire document data on every search. The index method automatically creates, before any search, index information from which information on the characters or character strings appearing in the document data can be retrieved quickly, and searches that index at search time.

[0005] Because the full-text scan method uses no information other than the original documents, it needs little storage and can search immediately after the document data is updated. Its search time is also nearly constant even for complex search conditions that involve character string patterns such as regular expressions or logical conditions, and even when there are many results. However, because it scans all of the document data, its searches have been noted to be slower than those of the index method.

[0006] The index method, by contrast, is generally faster than the full-text scan method, and depending on how the index is built, search speed can be almost independent of the volume of documents. Its noted problems are the large size of the index information, the long time needed to build the index, and slower searches when the search conditions are complex or the results are numerous.

[0007] One way to address the problems of the index method is the pre-search method using a signature file. Within that approach, the present applicant has proposed a technique that creates the signature file type by grouping characters or character sequences according to their appearance frequency (Japanese Patent Application No. H05-253032, filed 1993). FIG. 4 is a block diagram showing the configuration of an index type creating device according to an embodiment of that method.

Each document record in the sample document data 401 is cut out by the document delimiting means 403 using the position information from the sample document delimiter data 402, and the character appearance frequency calculating means 404 statistically determines the degree of appearance of each character. For low-frequency characters, whose degree of appearance is at or below a predetermined value called the narrowing rate, the character grouping means 407 groups multiple characters together, distributing them so that the degree to which at least one character of a group appears does not exceed the predetermined narrowing rate.

For high-frequency characters, whose degree of appearance in the sample document data 401 exceeds the narrowing rate, the two-character sequence appearance frequency calculating means 405 determines the degree of appearance in the sample document data of each two-character sequence made up of two high-frequency characters. Low-frequency two-character sequences, whose degree of appearance is at or below the narrowing rate, are grouped by the two-character sequence grouping means 408 in the same way as characters.

For high-frequency two-character sequences, whose degree of appearance exceeds the narrowing rate, the three-character sequence appearance frequency calculating means 406 determines the degree of appearance in the sample documents of each three-character sequence whose first two characters and last two characters are both high-frequency two-character sequences. Low-frequency three-character sequences, whose degree of appearance is at or below the narrowing rate, are grouped by the three-character sequence grouping means 409 in the same way as characters and two-character sequences. Each high-frequency three-character sequence is placed in a group of its own. The index type output means 410 then outputs index type data 411 that assigns one bit of index information to each group.

[0008] In this way, even when characters and character strings differ in their degree of appearance, the documents to be searched can be narrowed down to the narrowing rate or less regardless of the search condition, and index type data can be created that allows a small index to be built even when there are many kinds of low-frequency characters.

[0009]

[Problems to Be Solved by the Invention] In the prior art described above, however, characters and character strings that appear infrequently in the sample documents are grouped so that the degree to which at least one member of a group appears does not exceed the predetermined narrowing rate, regardless of how often users actually use them in search requests. The improvement in search speed is therefore limited to roughly the reciprocal of the narrowing rate, and search speed cannot be raised beyond that.

[0010] In addition, the only parameter the user can set is the narrowing rate for the documents to be searched, and the amount of index data, which varies with the narrowing rate, cannot be known in advance. When storage capacity is limited, for example, the index type data must be created repeatedly to find a suitable narrowing rate, so obtaining the required index type data takes time.

[0011] The present invention solves these problems of the prior art. One object is to provide an index type creating device that, for specific search requests set by the user or for search requests that the user issues frequently as determined from the past search history, enables searches faster than the reciprocal of the narrowing rate without slowing down searches for other requests. A further object is to provide an index type creating device that, by accepting a limit on the amount of index data directly, automatically sets a suitable narrowing rate and creates the index type.

[0012]

[Means for Solving the Problems] To achieve these objects, an index type creating device according to the present invention comprises special category input means for designating characters or character strings to be placed in singleton groups, each consisting of only one element. In addition, like the prior-art index creating device, it comprises: character appearance frequency calculating means for statistically determining the degree of appearance of each single character in the sample document data; a plurality of N-character sequence appearance frequency calculating means for statistically determining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N = 2, 3, ...) that contain all of the previously examined characters; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means, the N-character sequence appearance frequency calculating means, and the special category input means.

[0013] An index type creating device according to the present invention also comprises search request character string appearance frequency calculating means for calculating the appearance frequency of search request character strings in the past search request history. In addition, like the prior-art index creating device, it comprises: character appearance frequency calculating means for statistically determining the degree of appearance of each single character in the sample document data; a plurality of N-character sequence appearance frequency calculating means for statistically determining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N = 2, 3, ...) that contain all of the previously examined characters; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means, the N-character sequence appearance frequency calculating means, and the search request character string appearance frequency calculating means.

[0014] An index type creating device according to the present invention also comprises: character appearance frequency calculating means for statistically determining the degree of appearance of each single character in the sample document data; a plurality of N-character sequence appearance frequency calculating means for statistically determining, when the degree of appearance of the previously examined characters is higher than a certain value, the degree of appearance of N-character strings (N = 2, 3, ...) that contain all of the previously examined characters; maximum index amount input means for entering a limit on the amount of index data; narrowing rate calculating means for determining, from the outputs of the character appearance frequency calculating means, the N-character sequence appearance frequency calculating means, and the maximum index amount input means, a narrowing rate that allows an index no larger than the maximum index amount to be created, and for outputting the result back to the character appearance frequency calculating means and the N-character sequence appearance frequency calculating means; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculating means and the N-character sequence appearance frequency calculating means.

[0015]

[Operation] With the above configuration, the character appearance frequency calculating means and the N-character sequence appearance frequency calculating means first determine the degree of appearance of characters or character strings in the sample document data against a given narrowing rate. Then, when the grouping means decides how to partition the characters and character strings, each low-frequency character among the constituent characters of the characters and character strings entered through the special category input means is registered as a group consisting of that character alone. This yields an index type that allows even faster searches for search request strings containing those characters, without slowing down searches for other requests.

[0016] Likewise, after the character appearance frequency calculating means and the N-character sequence appearance frequency calculating means have determined the degree of appearance of characters or character strings in the sample document data against a given narrowing rate, and when the grouping means decides how to partition the characters and character strings, the search request character string appearance frequency calculating means calculates the appearance frequency of search request strings in the past search request history. For search request strings that appear frequently, each constituent character that is a low-frequency character in the sample documents is registered as a group consisting of that character alone. This automatically yields an index type that allows fast searches for the search requests each user issues frequently.
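The history-driven selection described in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the `min_share` cutoff that decides which requests count as "frequent", and the treatment of characters missing from the sample statistics are all hypothetical, since the text does not specify them.

```python
from collections import Counter


def singleton_chars_from_history(history, degree, rate, min_share):
    """Pick low-frequency characters of frequently issued search requests.

    history   -- list of past search request strings
    degree    -- character -> degree of appearance in the sample documents
    rate      -- narrowing rate; characters at or below it are low-frequency
    min_share -- hypothetical cutoff: share of the history a request must reach
    """
    counts = Counter(history)
    total = sum(counts.values())
    chosen = []
    for query, n in counts.most_common():
        if n / total < min_share:
            break  # the remaining requests are rarer still
        for ch in query:
            # Characters absent from the sample statistics are skipped.
            if ch in degree and degree[ch] <= rate and ch not in chosen:
                chosen.append(ch)
    return chosen
```

Each returned character would then be registered as a group of its own instead of being packed together with other low-frequency characters.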

[0017] Further, the character appearance frequency calculating means and the N-character sequence appearance frequency calculating means determine the degree of appearance of characters or character strings in the sample document data against a given narrowing rate, and the narrowing rate calculating means calculates, from the appearance frequencies of the characters and character strings, a narrowing rate with the index amount entered through the maximum index amount input means as its upper bound. The character appearance frequency calculating means and the N-character sequence appearance frequency calculating means then determine the degrees of appearance again against this narrowing rate, after which the grouping means decides how to partition the characters and character strings. As a result, even when storage capacity is limited, an index type that gives the greatest speed-up without exceeding the preset maximum index amount can be created in a short time.
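The text does not say how the narrowing rate calculating means derives a rate from the maximum index amount. One plausible approach, shown here purely as an assumption, is a bisection over the rate using a size estimator. The callable `index_size_for_rate` is hypothetical (it would estimate the index size from the measured appearance frequencies), and the sketch assumes the estimated size never grows as the rate increases.

```python
def smallest_rate_within_budget(index_size_for_rate, max_size,
                                lo=1e-4, hi=1.0, steps=40):
    """Bisect for the smallest narrowing rate whose index fits max_size.

    index_size_for_rate -- hypothetical estimator: rate -> index size
    max_size            -- upper limit entered by the user
    Returns None when even the loosest rate (hi) does not fit.
    """
    if index_size_for_rate(hi) > max_size:
        return None
    for _ in range(steps):
        mid = (lo + hi) / 2
        if index_size_for_rate(mid) <= max_size:
            hi = mid  # mid fits, so try a tighter (smaller) rate
        else:
            lo = mid  # mid is too large, so loosen the rate
    return hi
```

Because a smaller narrowing rate leaves fewer candidate documents after pre-search, returning the smallest feasible rate corresponds to the fastest index type that still respects the size limit.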

[0018]

[Embodiments]

(Embodiment 1) A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the index type creating device in the first embodiment.

In FIG. 1, 101 is sample document data storing the multiple document records that make up the document data. The sample document data 101 may be all or part of the document data to be searched, or other document data whose statistical properties regarding the appearance of characters and character strings resemble those of the document data to be searched. 102 is sample document delimiter data recording the position of each document record in the sample document data 101. 103 is document delimiting means that cuts the designated document record out of the sample document data 101 according to the position information in the sample document delimiter data 102 and outputs a character string with the special character <START>, marking the start of a record, prepended to the record and the special character <END>, marking the end of a record, appended to it.

104 is character appearance frequency calculating means that receives the document record strings output by the document delimiting means 103 and calculates the degree of appearance of each character appearing in the sample document data 101 as "the total number of characters in the document records in which the character appears, divided by the total number of characters in all document records". 105 is two-character sequence appearance frequency calculating means that receives the document record strings output by the document delimiting means 103 and the results of the character appearance frequency calculating means 104, and calculates the degree of appearance of the two-character sequences that appear frequently in the sample document data 101 as "the total number of characters in the document records in which the two-character sequence appears, divided by the total number of characters in all document records". 106 is three-character sequence appearance frequency calculating means that receives the document record strings output by the document delimiting means 103 and the results of the two-character sequence appearance frequency calculating means 105, and calculates the degree of appearance of the three-character sequences that appear frequently in the sample document data 101 as "the total number of characters in the document records in which the three-character sequence appears, divided by the total number of characters in all document records".

107 is character grouping means that receives the results of the character appearance frequency calculating means 104, groups multiple characters whose degree of appearance is at or below the predetermined "narrowing rate", and adjusts each group so that the degree to which any character of the group appears is as close to the narrowing rate as possible without exceeding it. 108 is two-character sequence grouping means that receives the results of the two-character sequence appearance frequency calculating means 105, groups multiple two-character sequences whose degree of appearance is at or below the narrowing rate, and adjusts each group so that the degree to which any two-character sequence of the group appears is as close to the narrowing rate as possible without exceeding it. 109 is three-character sequence grouping means that receives the results of the three-character sequence appearance frequency calculating means 106 and, when there are multiple three-character sequences whose degree of appearance is at or below the narrowing rate, groups them and adjusts each group so that the degree to which any three-character sequence of the group appears is as close to the narrowing rate as possible without exceeding it; each three-character sequence whose degree of appearance is higher than the narrowing rate forms a group by itself.

110 is index type output means that receives the grouping information output by the character grouping means 107, the two-character sequence grouping means 108, and the three-character sequence grouping means 109, assigns a serial number to each group, and outputs a table mapping each group's serial number to its member characters, two-character sequences, or three-character sequences. 111 is the index type data output by the index type output means 110. Finally, 112 is special category input means that instructs the character grouping means 107 to create groups consisting only of the designated elements.
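The length-weighted measure used by the character appearance frequency calculating means 104 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and counting the <START> and <END> markers toward the character totals is an assumption the text does not settle.

```python
from collections import defaultdict

START, END = "<START>", "<END>"


def tokenize(record):
    """Wrap a record in the start/end markers, one token per character."""
    return [START, *record, END]


def char_appearance_degree(records):
    """Degree of appearance of each character.

    For a character, sum the lengths of the records that contain it and
    divide by the summed length of all records.
    """
    total = 0
    weighted = defaultdict(int)
    for record in records:
        tokens = tokenize(record)
        total += len(tokens)
        for ch in set(tokens):  # each record counts once per character
            weighted[ch] += len(tokens)
    return {ch: w / total for ch, w in weighted.items()}
```

Weighting by record length means that a character found only in long records scores higher than one found in the same number of short records, which matches the cost of scanning those records after pre-search.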

[0019] The operation of the index type creating device configured as above is as follows. First, each document record in the sample document data 101 is cut out by the document delimiting means 103 and sent to the character appearance frequency calculating means 104, which calculates the degree of appearance of each character as (total number of characters in the document records in which the character appears) / (total number of characters in all document records). Through the special category input means 112, the user enters the search request strings whose search speed is to be improved. The character grouping means 107 receives the results of the character appearance frequency calculating means 104, groups multiple characters whose degree of appearance is at or below the predetermined "narrowing rate", and adjusts each group so that the degree to which any character of the group appears is as close to the narrowing rate as possible without exceeding it. The degree to which any character of a group appears is calculated from the following formula, on the assumption that the appearances of the characters in the group are statistically independent.

[0020]

[Equation 1]

\[ P = 1 - \prod_{j=1}^{n} (1 - P_j) \tag{1} \]

Here P is the degree to which any of the n characters in the group appears, and P_j (j = 1, 2, ..., n) is the degree to which the j-th character in the group appears.
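Under the independence assumption, the degree to which at least one of n characters appears is one minus the product of the individual non-appearance degrees, 1 - (1 - P1)(1 - P2)...(1 - Pn). The Python sketch below assumes this form for the formula and pairs it with a first-fit packing of low-frequency characters. The packing heuristic and all names are assumptions: the text only requires that each group stay at or below the narrowing rate while coming as close to it as possible.

```python
def group_appearance(degrees):
    """P = 1 - prod(1 - Pj), assuming the appearances are independent."""
    miss = 1.0
    for p in degrees:
        miss *= 1.0 - p
    return 1.0 - miss


def group_low_frequency(degree, rate):
    """Pack low-frequency items into groups whose P stays <= rate.

    degree -- item -> degree of appearance
    rate   -- narrowing rate
    Items are tried in descending order of degree (first-fit decreasing).
    """
    low = sorted((item for item, p in degree.items() if p <= rate),
                 key=lambda item: degree[item], reverse=True)
    groups = []  # each entry: [members, probability that no member appears]
    for item in low:
        for g in groups:
            if 1.0 - g[1] * (1.0 - degree[item]) <= rate:
                g[0].append(item)
                g[1] *= 1.0 - degree[item]
                break
        else:
            groups.append([[item], 1.0 - degree[item]])
    return [g[0] for g in groups]
```

For example, with degrees 0.05, 0.04, and 0.03 and a narrowing rate of 0.1, the first two characters share a group (P = 0.088), while adding the third would push P to about 0.115, so it starts a new group.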

[0021] During grouping, for each character or character string entered through the special category input means 112, every constituent character that the character appearance frequency calculating means 104 found to be low-frequency is registered as a singleton group consisting of that character alone.
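A small Python sketch of this step, under the assumption that characters missing from the sample statistics are simply ignored, might look like the following. The function name and return shape are hypothetical.

```python
def split_special(degree, rate, special_strings):
    """Separate user-designated low-frequency characters into singletons.

    degree          -- character -> degree of appearance in the samples
    rate            -- narrowing rate; characters at or below it are low-frequency
    special_strings -- characters or strings entered by the user
    Returns (singleton_groups, remaining_low_frequency_characters).
    """
    low = {ch for ch, p in degree.items() if p <= rate}
    reserved = []
    for s in special_strings:
        for ch in s:
            # High-frequency characters are left to the sequence stages.
            if ch in low and ch not in reserved:
                reserved.append(ch)
    singletons = [[ch] for ch in reserved]
    remaining = sorted(low - set(reserved))
    return singletons, remaining
```

Only the remaining low-frequency characters would go on to ordinary grouping, so the designated characters never share an index bit with any other character.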

[0022] When the first scan of the sample document data 101 is complete, the document delimiting means 103 starts a second scan of the sample document data 101 and sends the extracted document records to the two-character sequence appearance frequency calculating means 105. From the two-character sequences in each document record, the two-character sequence appearance frequency calculating means 105 extracts only those in which both characters are high-frequency characters, and calculates the degree of appearance of each as "the total number of characters in the document records in which the two-character sequence appears / the total number of characters in all document records". All two-character sequences of high-frequency characters other than the high-frequency two-character sequences are grouped by the two-character sequence grouping means 108, by the same criterion as equation (1), so that the degree to which any two-character sequence of a group appears is at or below the narrowing rate.
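The second scan can be sketched in Python as follows. The function name is hypothetical, and for brevity the sketch omits the <START> and <END> markers and measures record length in plain characters, which is an assumption.

```python
from collections import defaultdict


def bigram_appearance_degree(records, high_chars):
    """Degree of appearance of two-character sequences of high-frequency characters.

    records    -- list of document record strings
    high_chars -- set of characters whose degree exceeds the narrowing rate
    """
    total = 0
    weighted = defaultdict(int)
    for record in records:
        total += len(record)
        # Keep only sequences in which both characters are high-frequency.
        seen = {record[i:i + 2] for i in range(len(record) - 1)
                if record[i] in high_chars and record[i + 1] in high_chars}
        for bigram in seen:  # each record counts once per sequence
            weighted[bigram] += len(record)
    return {bigram: w / total for bigram, w in weighted.items()}
```

Sequences that contain a low-frequency character are skipped because that character's own group already narrows the search sufficiently.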

[0023] When the second scan of the sample document data 101 is complete, the document delimiting means 103 starts a third scan of the sample document data 101 and sends each extracted document record to the three-character consecutive appearance frequency calculation means 106. From the three-character sequences in each document record, the three-character consecutive appearance frequency calculation means 106 extracts only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". The result is sent to the three-character consecutive grouping means 109, which groups the sequences on the basis of the narrowing rate, using the same criterion as equation (1).

[0024] The grouping information obtained in this way is sent to the index type output means 110, which outputs to the index type data 111 an index type that assigns one bit of index information to each low-frequency character group, each two-character consecutive group, and each three-character consecutive group.
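One way to picture "one bit per group" is the sketch below. It is an assumption-laden illustration, not the patent's data layout: the names `build_index_type` and `record_signature` are hypothetical, and the way a record's signature is derived from the index type is not described in this section.

```python
def build_index_type(groups):
    """Map every group member (character or sequence) to its group's bit position."""
    bit_of = {}
    for bit, group in enumerate(groups):
        for member in group:
            bit_of[member] = bit
    return bit_of


def record_signature(record, bit_of, max_len=3):
    """Set the bit of every group that has a member occurring in the record."""
    signature = 0
    for n in range(1, max_len + 1):
        for i in range(len(record) - n + 1):
            bit = bit_of.get(record[i:i + n])
            if bit is not None:
                signature |= 1 << bit
    return signature


# Hypothetical groups: one character group, one 2-character group, one 3-character group.
bit_of = build_index_type([{"x", "y"}, {"ab"}, {"abc"}])
signature = record_signature("zabc", bit_of)
```

A record whose signature lacks the bit for a queried group cannot contain any member of that group, so it can be skipped before the full-text scan.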

[0025] As described above, according to this embodiment, a character that rarely appears in the sample documents but that the user wants to search for at high speed is designated as a special classification and is not grouped with other characters. This makes it possible to create an index type that enables high-speed search for search requests containing that character, without greatly increasing the index size and without reducing the search speed for other search requests. In particular, for a search on a single character designated as a special classification, let the narrowing rate be $c$ and the degree to which that character appears in the documents be $c'$ ($c' < c < 1$). The conventional method, which groups the character, must full-text scan $c$ times the total document volume, whereas this embodiment needs to full-text scan only $c'$ times the total document volume, so the search speed improves by a factor of $c/c'$.
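As a worked illustration of the $c/c'$ factor, assume purely hypothetical values $c = 0.05$ and $c' = 0.001$ (neither figure appears in the patent). The conventional method would then scan 5% of the document volume and this embodiment 0.1%:

```latex
\frac{c}{c'} = \frac{0.05}{0.001} = 50
```

Under these assumed values the single-character search would be about 50 times faster.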

[0026] (Embodiment 2) Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a block diagram showing the configuration of the index type creation device according to the second embodiment of the present invention. In FIG. 2, 201 is sample document data, 202 is sample document delimiter data, 203 is document delimiting means, 204 is character appearance frequency calculation means, 205 is two-character consecutive appearance frequency calculation means, 206 is three-character consecutive appearance frequency calculation means, 207 is character grouping means, 208 is two-character consecutive grouping means, 209 is three-character consecutive grouping means, 210 is index type output means, and 211 is index type data. Further, 212 is search request history data, and 213 is search request character string appearance frequency calculation means, which calculates the appearance frequency of search request character strings from the past search request history data 212 and instructs the character grouping means 207 to create groups consisting of a single element.

[0027] The operation of the index type creation device configured as described above will now be described. First, each document record in the sample document data 201 is extracted by the document delimiting means 203 and sent to the character appearance frequency calculation means 204, where the degree of appearance of each character is calculated as "the total number of characters in the document records in which the character appears / the total number of characters in all document records". The character grouping means 207 receives the calculation result of the character appearance frequency calculation means 204 and groups characters whose degree of appearance is no greater than a predetermined "narrowing rate", adjusting each group so that the degree to which any character belonging to the group appears is as close as possible to the narrowing rate without exceeding it. The degree to which any character of a group appears is obtained from equation (1), on the assumption that the appearances of the characters in the group are statistically independent. In addition, during grouping, for each search request character or character string that the search request character string appearance frequency calculation means 213 has found to appear frequently in the search request history data 212, every constituent character that the character appearance frequency calculation means 204 has found to be low-frequency is registered as a single-element group consisting of that character alone.
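The grouping step can be sketched as a simple greedy pass. Two assumptions are made here and should be read as such: equation (1) is taken to be the usual union probability of independent events, P = 1 - (1 - P1)(1 - P2)...(1 - Pn), and the greedy strategy is only one way to approach the narrowing rate from below; the patent does not say how the adjustment is performed. The names `group_appearance` and `group_low_frequency` are hypothetical.

```python
def group_appearance(degrees):
    """Assumed form of equation (1): chance that at least one member appears."""
    p_none = 1.0
    for p in degrees:
        p_none *= 1.0 - p
    return 1.0 - p_none


def group_low_frequency(freq, rate, singletons=()):
    """Greedily pack low-frequency characters into groups not exceeding `rate`.

    freq       : dict mapping character -> degree of appearance
    singletons : characters forced into single-element groups
                 (special classification or frequent search requests)
    """
    groups = [[ch] for ch in singletons if ch in freq and freq[ch] <= rate]
    forced = {g[0] for g in groups}
    current = []
    for ch, p in sorted(freq.items(), key=lambda kv: -kv[1]):
        if p > rate or ch in forced:
            continue  # high-frequency characters and forced singletons are skipped
        if current and group_appearance([freq[x] for x in current] + [p]) > rate:
            groups.append(current)  # adding ch would exceed the rate: close the group
            current = []
        current.append(ch)
    if current:
        groups.append(current)
    return groups


freq = {"a": 0.5, "b": 0.04, "c": 0.03, "d": 0.02, "e": 0.01}
groups = group_low_frequency(freq, rate=0.05, singletons=("e",))
```

Forcing a character into its own group costs one extra index bit but keeps its filtering power at its own appearance degree instead of the group's.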

[0028] When the first scan of the sample document data 201 is complete, the document delimiting means 203 starts a second scan of the sample document data 201 and sends each extracted document record to the two-character consecutive appearance frequency calculation means 205. From the two-character sequences in each document record, the two-character consecutive appearance frequency calculation means 205 extracts only those formed by two high-frequency characters, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the two-character sequence appears / the total number of characters in all document records". All two-character sequences of high-frequency characters other than the high-frequency two-character sequences are then grouped by the two-character consecutive grouping means 208, using the same criterion as equation (1), so that the degree to which any two-character sequence belonging to a group appears is no greater than the narrowing rate.

[0029] When the second scan of the sample document data 201 is complete, the document delimiting means 203 starts a third scan of the sample document data 201 and sends each extracted document record to the three-character consecutive appearance frequency calculation means 206. From the three-character sequences in each document record, the three-character consecutive appearance frequency calculation means 206 extracts only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". The result is sent to the three-character consecutive grouping means 209, which groups the sequences on the basis of the narrowing rate, using the same criterion as equation (1).

[0030] The grouping information obtained in this way is sent to the index type output means 210, which outputs to the index type data 211 an index type that assigns one bit of index information to each low-frequency character group, each two-character consecutive group, and each three-character consecutive group.

[0031] As described above, according to this embodiment, characters that rarely appear in the sample documents but that the user frequently uses in search requests are selected automatically from the search request history and are not grouped. This makes it possible to create an index type that enables high-speed search tailored to each user, without greatly increasing the index size and without reducing the search speed for other search requests.

[0032] (Embodiment 3) Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 3 is a block diagram showing the configuration of an index type creation device according to an embodiment of the present invention. In FIG. 3, 301 is sample document data, 302 is sample document delimiter data, 303 is document delimiting means, 304 is character appearance frequency calculation means, 305 is two-character consecutive appearance frequency calculation means, 306 is three-character consecutive appearance frequency calculation means, 307 is character grouping means, 308 is two-character consecutive grouping means, 309 is three-character consecutive grouping means, 310 is index type output means, and 311 is index type data. Further, 312 is maximum index amount input means for entering the maximum size of the index to be created, and 313 is narrowing rate calculation means, which receives the input from the maximum index amount input means 312 together with the calculation results of the character appearance frequency calculation means 304, the two-character consecutive appearance frequency calculation means 305, and the three-character consecutive appearance frequency calculation means 306, calculates a narrowing rate, and outputs the result back to the character appearance frequency calculation means 304, the two-character consecutive appearance frequency calculation means 305, and the three-character consecutive appearance frequency calculation means 306.

[0033] The operation of the index type creation device configured as described above will now be described. First, each document record in the sample document data 301 is extracted by the document delimiting means 303 and sent to the character appearance frequency calculation means 304, where the degree of appearance of each character is calculated as "the total number of characters in the document records in which the character appears / the total number of characters in all document records". Let $N$ be the total number of characters that appear in the documents. Characters whose appearance frequency is higher than a value $c$, predetermined as the initial value of the narrowing rate, are treated as high-frequency characters (their number being $\alpha(c)$), and all other characters are treated as low-frequency characters.

[0034] When the first scan of the sample document data 301 is complete, the document delimiting means 303 starts a second scan of the sample document data 301 and sends each extracted document record to the two-character consecutive appearance frequency calculation means 305. From the two-character sequences in each document record, the two-character consecutive appearance frequency calculation means 305 extracts only those formed by two high-frequency characters (their total number being $W(c)$), and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the two-character sequence appears / the total number of characters in all document records". Two-character sequences whose appearance frequency is higher than $c$ are treated as high-frequency two-character sequences (their number being $\beta(c)$).

[0035] When the second scan of the sample document data 301 is complete, the document delimiting means 303 starts a third scan of the sample document data 301 and sends each extracted document record to the three-character consecutive appearance frequency calculation means 306. From the three-character sequences in each document record, the three-character consecutive appearance frequency calculation means 306 extracts only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences (their total number being $T(c)$), and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". Three-character sequences whose appearance frequency is higher than $c$ are treated as high-frequency three-character sequences (their number being $\gamma(c)$).

[0036] From the character appearance frequency distribution, the two-character consecutive appearance frequency distribution, and the three-character consecutive appearance frequency distribution obtained in this way for the initial narrowing rate $c$, together with the upper limit on the index size entered through the maximum index amount input means 312, the narrowing rate calculation means 313 determines the narrowing rate by the following method, without examining the appearance frequency distributions again.

[0037] The character appearance frequency distribution does not change with the narrowing rate. The number of high-frequency characters at a narrowing rate $c_1$ can therefore be read directly from the character appearance frequency distribution examined earlier; call it $\alpha(c_1)$. Assuming that the number of two-character sequences of high-frequency characters appearing in the documents is proportional to the total number of possible pairings of high-frequency characters, the number $W(c_1)$ of two-character sequences of high-frequency characters at narrowing rate $c_1$ is given by equation (2).

$$W(c_1) = W(c) \times \frac{\alpha(c_1)^2}{\alpha(c)^2} \qquad (2)$$

Assuming further that the two-character consecutive appearance frequency distribution at narrowing rate $c_1$ (rank on the x-axis, appearance frequency on the y-axis) is the distribution at narrowing rate $c$ scaled along the x-axis, the number $\beta(c_1)$ of high-frequency two-character sequences at narrowing rate $c_1$ can be expressed as in equation (3), using $\beta'(c_1)$, the number of two-character sequences whose appearance frequency exceeds $c_1$ in the two-character consecutive appearance frequency distribution for narrowing rate $c$.

$$\beta(c_1) = \beta'(c_1) \times \frac{W(c_1)}{W(c)} = \beta'(c_1) \times \frac{\alpha(c_1)^2}{\alpha(c)^2} \qquad (3)$$
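Equations (2) and (3) apply the same scale factor to two quantities measured at the initial rate, which a short sketch makes explicit. The function name `scaled_bigram_counts` and its parameter names are hypothetical, and the input numbers below are made up for illustration.

```python
def scaled_bigram_counts(alpha_c, alpha_c1, w_c, beta_prime_c1):
    """Estimate W(c1) and beta(c1) from statistics gathered at the initial rate c.

    alpha_c       : alpha(c), number of high-frequency characters at rate c
    alpha_c1      : alpha(c1), number of high-frequency characters at rate c1
    w_c           : W(c), two-character sequences of high-frequency characters at rate c
    beta_prime_c1 : beta'(c1), sequences in the rate-c distribution with frequency above c1
    """
    scale = alpha_c1 ** 2 / alpha_c ** 2
    w_c1 = w_c * scale               # equation (2)
    beta_c1 = beta_prime_c1 * scale  # equation (3)
    return w_c1, beta_c1


# Hypothetical inputs: halving the number of high-frequency characters
# scales both counts by a factor of one quarter.
w_c1, beta_c1 = scaled_bigram_counts(alpha_c=100, alpha_c1=50, w_c=4000, beta_prime_c1=200)
```

The point of the approximation is that only `alpha_c1` and `beta_prime_c1` need to be looked up for a new rate, both from distributions already computed at rate c.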

[0038] Similarly, assume that the number of three-character sequences appearing in the documents in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences is proportional to the total number of possible pairings of high-frequency two-character sequences. The number $T(c_1)$ of three-character sequences satisfying this condition at narrowing rate $c_1$ is then given by equation (4).

$$T(c_1) = T(c) \times \frac{\beta(c_1)^2}{\beta(c)^2} \qquad (4)$$

[0039] Assuming that the three-character consecutive appearance frequency distribution at narrowing rate $c_1$ is the distribution at narrowing rate $c$ scaled along the x-axis, the number $\gamma(c_1)$ of high-frequency three-character sequences at narrowing rate $c_1$ can be expressed as in equation (5), using $\gamma'(c_1)$, the number of three-character sequences whose appearance frequency exceeds $c_1$ in the three-character consecutive appearance frequency distribution for narrowing rate $c$.

$$\gamma(c_1) = \gamma'(c_1) \times \frac{T(c_1)}{T(c)} = \gamma'(c_1) \times \frac{\beta(c_1)^2}{\beta(c)^2} \qquad (5)$$

[0040] The size of the index that is created (in bits) is the total number of low-frequency character groups, low-frequency two-character consecutive groups, and three-character consecutive groups obtained by the subsequent grouping means, multiplied by the number of document records $R$. If $S_1$ is the sum of the appearance frequencies of the low-frequency characters at narrowing rate $c_1$, the number of low-frequency character groups can be approximated as $S_1 / c_1$. The sum $S_2$ of the appearance frequencies of the low-frequency two-character sequences at narrowing rate $c_1$ can be regarded as the area enclosed by the three lines $x = \beta(c_1)$, $x = W(c_1)$, $y = 0$ and the curve of the two-character consecutive appearance frequency distribution for narrowing rate $c_1$. It can therefore be expressed as in equation (6), using $S_2'$, the area enclosed by the three lines $x = \beta'(c_1)$, $x = W(c)$, $y = 0$ and the curve of the two-character consecutive appearance frequency distribution for narrowing rate $c$, that is, the sum of the appearance frequencies of the two-character sequences whose appearance frequency is no greater than $c_1$ in the distribution for narrowing rate $c$.

$$S_2 = S_2' \times \frac{W(c_1)}{W(c)} = S_2' \times \frac{\alpha(c_1)^2}{\alpha(c)^2} \qquad (6)$$

[0041] For the total number of three-character consecutive groups, the number of high-frequency three-character sequences is assumed to be negligibly small, and only low-frequency three-character sequences are considered. The sum $S_3$ of the appearance frequencies of the low-frequency three-character sequences at narrowing rate $c_1$ can be regarded as the area enclosed by the three lines $x = \gamma(c_1)$, $x = T(c_1)$, $y = 0$ and the curve of the three-character consecutive appearance frequency distribution for narrowing rate $c_1$. It can therefore be expressed as in equation (7), using $S_3'$, the area enclosed by the three lines $x = \gamma'(c_1)$, $x = T(c)$, $y = 0$ and the curve of the three-character consecutive appearance frequency distribution for narrowing rate $c$, that is, the sum of the appearance frequencies of the three-character sequences whose appearance frequency is no greater than $c_1$ in the distribution for narrowing rate $c$.

$$S_3 = S_3' \times \frac{T(c_1)}{T(c)} = S_3' \times \frac{\beta(c_1)^2}{\beta(c)^2} = S_3' \times \frac{\beta'(c_1)^2}{\beta(c)^2} \times \frac{\alpha(c_1)^4}{\alpha(c)^4} \qquad (7)$$

[0042] In other words, the size $I(c_1)$ (in bits) of the index created with narrowing rate $c_1$ can be obtained approximately, as in equation (8), from values that can be calculated from the appearance frequency distributions at narrowing rate $c$.

$$I(c_1) = \frac{S_1 + S_2 + S_3}{c_1} \times R = \frac{S_1 + S_2' \times \dfrac{\alpha(c_1)^2}{\alpha(c)^2} + S_3' \times \dfrac{\beta'(c_1)^2}{\beta(c)^2} \times \dfrac{\alpha(c_1)^4}{\alpha(c)^4}}{c_1} \times R \qquad (8)$$
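The index-size estimate of equation (8) combines equations (6) and (7) and can be written as one small function. This is a sketch only; the name `estimated_index_bits` and all parameter names are hypothetical, and the inputs used below are made-up numbers chosen so the arithmetic is easy to follow.

```python
def estimated_index_bits(c1, s1, s2_prime, s3_prime,
                         alpha_c, alpha_c1, beta_c, beta_prime_c1, num_records):
    """Approximate index size I(c1) in bits, following equations (6) to (8).

    s1            : S1, summed frequency of low-frequency characters at rate c1
    s2_prime      : S2', summed frequency of 2-character sequences at or below c1 (rate-c data)
    s3_prime      : S3', summed frequency of 3-character sequences at or below c1 (rate-c data)
    num_records   : R, number of document records
    """
    alpha_ratio = alpha_c1 ** 2 / alpha_c ** 2
    s2 = s2_prime * alpha_ratio                                             # equation (6)
    s3 = s3_prime * (beta_prime_c1 ** 2 / beta_c ** 2) * alpha_ratio ** 2   # equation (7)
    return (s1 + s2 + s3) / c1 * num_records                                # equation (8)


bits = estimated_index_bits(c1=0.5, s1=1.0, s2_prime=2.0, s3_prime=4.0,
                            alpha_c=2, alpha_c1=1, beta_c=4, beta_prime_c1=2,
                            num_records=8)
```

With these made-up inputs the three sums are 1.0, 0.5 and 0.0625, so the estimate is 1.5625 / 0.5 × 8 = 25 bits.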

[0043] The narrowing rate calculation means 313 calculates a narrowing rate $c_1$ that allows an index no larger than the maximum index amount to be created, and outputs it back to the character appearance frequency calculation means 304, the two-character consecutive appearance frequency calculation means 305, and the three-character consecutive appearance frequency calculation means 306.
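The patent text here does not say how the narrowing rate calculation means searches for $c_1$. One plausible reading, offered only as an assumption, is to try candidate rates from smallest to largest and keep the first whose estimated index size fits, since a smaller rate filters more documents but needs more groups. The names `choose_narrowing_rate` and `estimate_bits` are hypothetical.

```python
def choose_narrowing_rate(candidates, estimate_bits, max_bits):
    """Return the smallest candidate rate whose estimated index fits, or None.

    estimate_bits : callable c1 -> estimated index size in bits,
                    for example a wrapper around equation (8)
    max_bits      : upper limit entered through the maximum index amount input
    """
    for c1 in sorted(candidates):
        if estimate_bits(c1) <= max_bits:
            return c1
    return None


# Toy size model, purely illustrative: size shrinks as the rate grows.
rate = choose_narrowing_rate([0.01, 0.02, 0.05, 0.1],
                             estimate_bits=lambda c1: 100 / c1,
                             max_bits=3000)
```

Because every estimate is derived from distributions already gathered at the initial rate, this search needs no further scans of the sample documents.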

[0044] The character grouping means 307 receives the calculation result of the character appearance frequency calculation means 304 and groups characters whose degree of appearance is no greater than the narrowing rate $c_1$, adjusting each group so that the degree to which any character belonging to the group appears is as close as possible to the narrowing rate without exceeding it. The degree to which any character of a group appears is obtained from equation (1), on the assumption that the appearances of the characters in the group are statistically independent.

[0045] From the two-character sequences in each document record, the two-character consecutive appearance frequency calculation means 305 extracts only those formed by two high-frequency characters, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the two-character sequence appears / the total number of characters in all document records". All two-character sequences of high-frequency characters other than the high-frequency two-character sequences are then grouped by the two-character consecutive grouping means 308, using the same criterion as equation (1), so that the degree to which any two-character sequence belonging to a group appears is no greater than the narrowing rate.

[0046] From the three-character sequences in each document record, the three-character consecutive appearance frequency calculation means 306 extracts only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences, and calculates the degree of appearance of each such sequence as "the total number of characters in the document records in which the three-character sequence appears / the total number of characters in all document records". The result is sent to the three-character consecutive grouping means 309, which groups the sequences on the basis of the narrowing rate $c_1$, using the same criterion as equation (1).

[0047] The grouping information obtained in this way is sent to the index type output means 310, which outputs to the index type data 311 an index type that assigns one bit of index information to each low-frequency character group, each two-character consecutive group, and each three-character consecutive group.

[0048] As described above, according to this embodiment, even when the storage capacity of the computer is limited, an appropriate narrowing rate is obtained automatically by directly specifying the upper limit on the index data size, so an index type of a size that meets the requirement can be created without repeating index type creation many times.

[0049]

[Effects of the Invention] As described above, according to the present invention, for specific search requests set by the user, or for search requests that the user frequently uses as determined from the past search history, an index type can be created that achieves a search speed improvement exceeding the reciprocal of the pre-specified narrowing rate, without reducing the search speed for other search requests.

[0050] Furthermore, when, for example, the capacity of the storage device is limited, the user can set an upper limit on the size of the index to be created. There is no need to determine the optimum narrowing rate through repeated trial and error, which saves the user effort and shortens the total time needed to create the index type.

[Brief Description of the Drawings]

[FIG. 1] A block diagram showing the configuration of the index type creation device according to the first embodiment of the present invention.

[FIG. 2] A block diagram showing the configuration of the index type creation device according to the second embodiment of the present invention.

[FIG. 3] A block diagram showing the configuration of the index type creation device according to the third embodiment of the present invention.

[FIG. 4] A block diagram showing the configuration of an index type creation device according to the prior art.

[Explanation of Reference Numerals]

101 Sample document data
102 Sample document delimiter data
103 Document delimiting means
104 Character appearance frequency calculation means
105 Two-character consecutive appearance frequency calculation means
106 Three-character consecutive appearance frequency calculation means
107 Character grouping means
108 Two-character consecutive grouping means
109 Three-character consecutive grouping means
110 Index type output means
111 Index type data
112 Special classification input means
201 Sample document data
202 Sample document delimiter data
203 Document delimiting means
204 Character appearance frequency calculation means
205 Two-character consecutive appearance frequency calculation means
206 Three-character consecutive appearance frequency calculation means
207 Character grouping means
208 Two-character consecutive grouping means
209 Three-character consecutive grouping means
210 Index type output means
211 Index type data
212 Search request history data
213 Search request character string appearance frequency calculation means
301 Sample document data
302 Sample document delimiter data
303 Document delimiting means
304 Character appearance frequency calculation means
305 Two-character consecutive appearance frequency calculation means
306 Three-character consecutive appearance frequency calculation means
307 Character grouping means
308 Two-character consecutive grouping means
309 Three-character consecutive grouping means
310 Index type output means
311 Index type data
312 Maximum index amount input means
313 Narrowing rate calculation means
401 Sample document data
402 Sample document delimiter data
403 Document delimiting means
404 Character appearance frequency calculation means
405 Two-character consecutive appearance frequency calculation means
406 Three-character consecutive appearance frequency calculation means
407 Character grouping means
408 Two-character consecutive grouping means
409 Three-character consecutive grouping means
410 Index type output means
411 Index type data

Claims (3)

[Claims]
1. An index type creation device comprising: a character appearance frequency calculation means for statistically examining the degree of appearance of a single character in sample document data; a plurality of N-character consecutive appearance frequency calculation means which, when the degree of appearance of the previously examined characters is higher than a certain value, statistically examine the degree of appearance of character strings of N characters (N being a natural number 2, 3, ...) that contain all of the previously examined characters; a special classification input means for specifying characters or character strings to be placed in a single-member group consisting of only one element; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculation means and the plurality of N-character consecutive appearance frequency calculation means and on the output of the special classification input means.
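The flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of document frequency as the "degree of appearance", the cap of three characters, and the round-robin grouping policy are all assumptions made for illustration. The claim only requires that overly frequent units be extended to longer strings and that user-specified strings receive a group of their own.

```python
from collections import Counter


def ngram_document_frequency(docs, n):
    """Fraction of documents in which each n-character string appears."""
    counts = Counter()
    for doc in docs:
        # A set ensures each string is counted at most once per document.
        counts.update({doc[i:i + n] for i in range(len(doc) - n + 1)})
    return {gram: c / len(docs) for gram, c in counts.items()}


def select_index_units(docs, threshold, max_n=3):
    """Extend a unit by one character only while it is still too frequent."""
    units = {}
    frequent = None  # None means "examine every character" at n = 1
    for n in range(1, max_n + 1):
        freq = ngram_document_frequency(docs, n)
        next_frequent = set()
        for gram, f in freq.items():
            # For n > 1, examine only strings containing a too-frequent
            # (n - 1)-character string from the previous round.
            if (frequent is not None and gram[:-1] not in frequent
                    and gram[1:] not in frequent):
                continue
            if f > threshold and n < max_n:
                next_frequent.add(gram)  # too common: refine at n + 1
            else:
                units[gram] = f
        frequent = next_frequent
    return units


def group_units(units, special, n_groups):
    """Give each special string its own group; spread the rest (assumed policy)."""
    groups = [[s] for s in sorted(special)]
    rest = sorted((u for u in units if u not in special),
                  key=lambda u: (-units[u], u))
    shared = [[] for _ in range(n_groups)]
    for i, unit in enumerate(rest):
        shared[i % n_groups].append(unit)  # round-robin placeholder
    return groups + [g for g in shared if g]
```

For example, `group_units(select_index_units(docs, 0.9), ["abc"], 2)` would return a list whose first group holds only the hypothetical user-specified string `"abc"`. A real grouping means would presumably balance groups by appearance frequency rather than round-robin.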
2. An index type creation device comprising: a character appearance frequency calculation means for statistically examining the degree of appearance of a single character in sample document data; a plurality of N-character consecutive appearance frequency calculation means which, when the degree of appearance of the previously examined characters is higher than a certain value, statistically examine the degree of appearance of character strings of N characters (N being a natural number 2, 3, ...) that contain all of the previously examined characters; a search request character string appearance frequency calculation means for calculating the appearance frequency of search request character strings in a past search request history; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculation means and the plurality of N-character consecutive appearance frequency calculation means and on the output of the search request character string appearance frequency calculation means.
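As a rough illustration of the search request character string appearance frequency calculation in claim 2, the following Python sketch counts how often each string occurs in a search history. The function names and the `min_share` cutoff are hypothetical. The claim does not state how the frequencies influence grouping, so treating frequently requested strings as candidates for their own group is only an assumption.

```python
from collections import Counter


def search_request_frequency(history):
    """Relative frequency of each search string in the request history."""
    counts = Counter(history)
    total = sum(counts.values())
    return {query: c / total for query, c in counts.items()}


def frequent_requests(history, min_share):
    """Strings whose share of past requests is at least min_share (assumed rule)."""
    freq = search_request_frequency(history)
    return sorted(q for q, f in freq.items() if f >= min_share)
```

The result of `frequent_requests` could then be passed to a grouping step in the same way as user-specified strings, so that searches repeated often in the past are matched against less crowded index groups.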
3. An index type creation device comprising: a character appearance frequency calculation means for statistically examining the degree of appearance of a single character in sample document data; a plurality of N-character consecutive appearance frequency calculation means which, when the degree of appearance of the previously examined characters is higher than a certain value, statistically examine the degree of appearance of character strings of N characters (N being a natural number 2, 3, ...) that contain all of the previously examined characters; a maximum index amount input means for inputting a limit on the amount of index data; a narrowing rate calculation means which determines, from the outputs of the character appearance frequency calculation means and the plurality of N-character consecutive appearance frequency calculation means and the output of the maximum index amount input means, a narrowing rate that allows an index no larger than the maximum index amount to be created, and which outputs the result back to the character appearance frequency calculation means and the plurality of N-character consecutive appearance frequency calculation means; and a plurality of grouping means for grouping characters or character strings based on the outputs of the character appearance frequency calculation means and the plurality of N-character consecutive appearance frequency calculation means.
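The feedback loop in claim 3 can be sketched as a search over candidate narrowing rates. This Python sketch rests on assumptions the claim does not spell out: that a lower rate produces finer index units and therefore a larger index, and that the index size for a given rate can be estimated by a caller-supplied function. The names `choose_narrowing_rate` and `estimate_index_size` are hypothetical.

```python
def choose_narrowing_rate(estimate_index_size, max_index_size, candidate_rates):
    """Return the smallest candidate rate whose estimated index fits the limit.

    estimate_index_size is a caller-supplied function mapping a rate to an
    estimated index size. Returns None when no candidate satisfies the limit.
    """
    for rate in sorted(candidate_rates):
        if estimate_index_size(rate) <= max_index_size:
            return rate
    return None


def toy_estimate(rate):
    """Toy estimator for illustration only: index size shrinks as rate grows."""
    return 1000 / rate
```

With the toy estimator, `choose_narrowing_rate(toy_estimate, 6000, [0.1, 0.2, 0.5])` skips 0.1 because its estimate exceeds the limit and settles on 0.2. In the claimed device the chosen rate is fed back to the frequency calculation means, which would then recompute the index units under that rate.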

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP7121065A JP2996895B2 (en) 1995-05-19 1995-05-19 Index type making device

Publications (2)

Publication Number Publication Date
JPH08314964A (en) 1996-11-29
JP2996895B2 (en) 2000-01-11

Family

ID=14801979

Family Applications (1)

Application Number Title Priority Date Filing Date
JP7121065A Expired - Fee Related JP2996895B2 (en) 1995-05-19 1995-05-19 Index type making device

Country Status (1)

Country Link
JP (1) JP2996895B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005234688A (en) * 2004-02-17 2005-09-02 Ricoh Co Ltd Important language identification method, important language identification program, important language identification device, document search device, and keyword extraction device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07105237A (en) * 1993-10-08 1995-04-21 Matsushita Electric Ind Co Ltd Index creating method and apparatus and document retrieval apparatus

Also Published As

Publication number Publication date
JP2996895B2 (en) 2000-01-11

Similar Documents

Publication Title
US6947920B2 (en) Method and system for response time optimization of data query rankings and retrieval
US6658626B1 (en) User interface for displaying document comparison information
US6587850B2 (en) Method and apparatus for profile score threshold setting and updating
Faloutsos et al. Signature files: An access method for documents and its analytical performance evaluation
CA2618854C (en) Ranking search results using biased click distance
US7840524B2 (en) Method and apparatus for indexing, searching and displaying data
US8037061B2 (en) System and computer readable medium for generating refinement categories for a set of search results
US7058695B2 (en) System and media for simplifying web contents, and method thereof
CN1112647C (en) System and method for ranking documents in a collection of documents in response to a query
US6826576B2 (en) Very-large-scale automatic categorizer for web content
KR100304335B1 (en) Keyword Extraction System and Document Retrieval System Using It
US8046370B2 (en) Retrieval of structured documents
US8862565B1 (en) Techniques for web site integration
EP0890911A2 (en) Multistage intelligent string comparison method
US7203673B2 (en) Document collection apparatus and method for specific use, and storage medium storing program used to direct computer to collect documents
JP3333998B2 (en) Automatic classifying apparatus and method
WO2003091828A2 (en) Method and system for searching documents with numbers
JP3081093B2 (en) Index creation method and apparatus and document search apparatus
JPH08314964A (en) Index model creation device
JPH1145257A (en) Web document search support apparatus and computer-readable recording medium storing a program for causing a computer to function as the apparatus
JPH064584A (en) Text search device
JPH11282874A (en) Information filtering method and device
US6836772B1 (en) Key word deriving device, key word deriving method, and storage medium containing key word deriving program
JPH07192010A (en) Document processor
JPH06215036A (en) Search method of document collection

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees