JPH0749875A

JPH0749875A - Document information classification method, document information collection method using the same, and document information collection system

Info

Publication number: JPH0749875A
Application number: JP5195839A
Authority: JP
Inventors: Hiroko Yuasa; 寛子湯浅; Keiji Kojima; 啓二小島
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-08-06
Filing date: 1993-08-06
Publication date: 1995-02-21

Abstract

(57) [Abstract] (Amended) [Configuration] The document collection server system 100 automatically connects to multiple information sources to acquire new documents and, in the degree-of-match calculation 106, checks how well each document matches search conditions written in advance by the user. The document storage process 107 builds a classification scheme from the relationships among the search conditions, classifies the matching documents, and stores them in folders. The folder management process 108 monitors how information accumulates in each folder and automatically subdivides or merges folders and changes their structure to keep the information organized. [Effect] The classification scheme and the search conditions can be improved according to how information accumulates in each category, so that the amount of information in each category is kept to a size whose whole can be grasped easily.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an information collection system that automatically collects, classifies, and organizes information via a computer network.

[0002]

[Prior Art] Computer networks are being built out rapidly, and an environment is emerging in which so-called broad catching of information is possible: online information retrieval services, information gathering from net news, and questions and answers via electronic mail and electronic bulletin boards.

[0003] Although the usefulness of this up-to-date information is recognized, it is not used effectively because of the following problems.

[0004] (1) Each information source is used differently, so collecting information from multiple sources is cumbersome.

[0005] (2) Search expressions must be entered as logical expressions, and it is difficult to write an appropriate search expression that yields the desired information.

[0006] (3) Classifying and organizing the collected information takes time and effort.

[0007] The wide-area information server WAIS, described in "The Information Society of the 21st Century" (Nikkei Byte, November 1991, pp. 320-331), solved problem (1) by adopting a common protocol (an extension of NISO Z39.50) and by automating connection to and searching of information sources, and solved problem (2) by relevance feedback. Relevance feedback is a technique for refining search conditions that works as follows. When the user describes what he or she wants to find, WAIS uses that description as the search condition, searches for information matching it, and presents the results. When the user picks the wanted items from the results, WAIS feeds the selected items back into the search condition and improves it. With information retrieval based on relevance feedback, users can find the desired information without writing a search expression.
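
The feedback loop described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual WAIS algorithm: the search condition is modelled as a bag of weighted words, and `words`, `refine_condition`, and the `boost` parameter are hypothetical names introduced here.

```python
import re
from collections import Counter


def words(text):
    """Split text into lowercase words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def refine_condition(condition, chosen_docs, boost=1):
    """Return a new search condition enriched with words from chosen documents.

    `condition` maps each word to a weight. Hypothetical sketch only.
    """
    refined = Counter(condition)
    for doc in chosen_docs:
        # Each distinct word of a chosen document is reinforced once.
        for word in set(words(doc)):
            refined[word] += boost
    return dict(refined)


condition = {"speech": 1}
chosen = ["Speech recognition with neural networks"]
refined = refine_condition(condition, chosen)
```

The original condition is left unchanged, so each round of feedback yields a new, richer condition that can be used for the next search.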

[0008] Various automatic document classification systems have been devised to solve problem (3).

[0009] For example, the document classification system of JP-A-1-188934 obtains keyword occurrence-frequency information for each field by examining a set of sample documents, extracts keywords from an input document, computes a score for each field, and classifies the document into the highest-scoring field.

[0010] The notification document processing system of JP-A-63-214832 analyzes the format of a notification document, assigns weights per classification category to the words appearing in the message text, sums them, and classifies the document by choosing the category with the largest total.

[0011]

[Problems to Be Solved by the Invention] WAIS solved problems (1) and (2) above, but it gives no consideration to classifying and organizing the collected information.

[0012] It is desirable to classify and organize information hierarchically, but conventional methods were not suited to this.

[0013] Furthermore, conventional automatic classification systems that address (3) required the classification scheme to be established in advance. In addition, the keywords characterizing each field and their occurrence frequencies had to be supplied or computed in advance.

[0014] However, it is difficult to provide an appropriate general-purpose classification scheme in advance. If the scheme is not appropriate, a large amount of information may concentrate in one category, and when a category holds too much information the user can no longer easily grasp the whole of what has been collected.

[0015] Moreover, in leading-edge fields, widely accepted classification schemes and technical terms are often not yet settled, and they change frequently. It is therefore difficult for a conventional automatic classification system to classify documents on leading-edge fields appropriately.

[0016] A first object of the present invention is to provide a document information classification method suited to classifying and organizing information hierarchically, and a document information collection method and system using it.

[0017] A second object of the present invention is to provide a document information collection method and system that automatically improve the classification scheme and the search conditions used for classification, based on how the collected document information accumulates.

[0018]

[Means for Solving the Problems] A first document information classification method according to the present invention comprises the steps of: storing, for each of a plurality of folders related to one another in a hierarchy, a search condition group consisting of one or more search conditions; detecting, based on the search condition group stored for each folder, the degree of match between the information to be classified and each folder; determining, based on the detected degree of match between each folder and the information and on the hierarchy, one or more of the folders as the folder or folders to which the information corresponds; and storing the information in association with each folder so determined.

[0019] A second document information classification method according to the present invention comprises the steps of: storing, for each of a plurality of folders related to one another in a hierarchy, a search condition group consisting of one or more search conditions; determining, based on the search condition group stored for each folder and a predetermined criterion, one or more of the folders as the folder or folders to which information to be classified should correspond; storing the information in association with the determined folder or folders; performing the determining and storing for each of a plurality of pieces of information to be classified; judging whether the pieces of information stored for each folder satisfy a predetermined condition set for restructuring that folder; and, when any one folder satisfies the condition, restructuring the pieces of information stored for that folder and the group of search conditions stored for that folder.

[0020] A third document information classification method according to the present invention comprises the steps of: storing, for each of a plurality of folders related to one another in a hierarchy, a search condition group consisting of one or more search conditions; determining, based on the search condition group stored for each folder and a predetermined criterion, one or more of the folders as the folder or folders to which information to be classified should correspond; storing the information in association with the determined folder or folders; performing the determining and storing for each of a plurality of pieces of information to be classified; judging whether the pieces of information stored for a subset of two or more of the folders satisfy a predetermined condition set for restructuring those folders; and, when any such subset of folders satisfies the condition, restructuring the pieces of information stored for that subset of folders and the group of search conditions stored for that subset.

[0021]

[Operation] In the first document information classification method according to the present invention, the folder to which a document is assigned is determined by considering both the degree of match between the document and the search condition stored for each folder and the hierarchical structure of the search conditions. The group of search conditions written by the user can therefore be treated as a hierarchical classification scheme for classifying the collected document information.

[0022] In the second document information classification method according to the present invention, folders can be restructured, for example by splitting a folder, depending on the document information stored for each folder. The classification scheme can therefore be changed automatically according to how the retrieved document information accumulates.

[0023] In the third document information classification method according to the present invention, folders can be restructured across multiple folders.

[0024]

[Embodiment] An embodiment of the present invention is described below.

[0025] The document information collection system of this embodiment handles information that can be obtained electronically through online document retrieval services, electronic mail, electronic bulletin boards, and the like, each item being content meaningful to the user expressed as a group of characters. Such information is hereinafter called document information.

[0026] These services are run by various companies and organizations. Hereinafter they are called information sources. Because the document information offered by an information source generally covers a wide range, it is presented to users divided into several fields. These fields are called domains. The individual items of information offered in a domain are called documents. The storage area for search results, in which a document is stored when it matches a search condition, is called a folder.

[0027] FIG. 1 shows an example system configuration consisting of the document collection system of this embodiment and the external information sources from which it collects documents. The document collection system of this embodiment consists of a document collection client 500 and a document collection server 510.

[0028] Multiple document collection clients 500 can exist on the network and access the document collection server 510 at the same time.

[0029] The document collection client system 501, in the memory 522 of the document collection client 500, provides a graphical user interface with which the user creates folders for storing collected documents, registers in each folder a search condition describing what kind of documents to collect, and views the documents collected in the folders.

[0030] The document collection server system 100, in the memory 523 of the document collection server 510, provides document information in response to requests from the document collection client system 501. In parallel, it automatically collects documents that match the search conditions registered by users from external information sources such as the news server 520 and the document server 521, and then classifies and organizes them.

[0031] First, the document collection client system 501 is described.

[0032] When the user starts the document collection client system 501, it displays an interface screen 400 such as that shown in FIG. 3 on the CRT 502. On this interface screen 400 the user performs various operations with input devices such as the keyboard 503 and the mouse 504: creating and deleting folders for collected documents, writing search conditions for collecting documents, and viewing and rating the collection results.

[0033] FIG. 7 is a flowchart of the processing performed by the document collection client system 501. When started, the client system first connects to the document collection server system (step 120) and then displays the interface screen 400 shown in FIG. 3 (step 121).

[0034] It then enters the event loop 122 and repeats steps 123 to 126: it accepts and parses a user operation (step 123), sends the command corresponding to the operation to the document collection server system 100 (step 124), receives the execution result from the document collection server system 100 (step 125), and reflects the result on the interface screen 400 (step 126).

[0035] When the user chooses Quit from the menu, the document collection client system 501 sends an end command, leaves the event loop 122, disconnects from the document collection server system 100 (step 127), and terminates.
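
The client flow of steps 120 to 127 can be sketched as below. This is a minimal illustration, not the patent's implementation: `EchoServer` is a hypothetical stand-in for the document collection server system 100, the command strings (including `"quit"`) are made up, and the screen is modelled as a plain list.

```python
class EchoServer:
    """Hypothetical stand-in for the document collection server system 100."""

    def __init__(self):
        self.log = []

    def connect(self):
        self.log.append("connect")

    def execute(self, command):
        self.log.append(command)
        return f"result of {command}"

    def disconnect(self):
        self.log.append("disconnect")


def run_client(server, operations, screen):
    server.connect()                        # step 120: connect to the server
    screen.append("interface screen 400")   # step 121: show the interface
    for operation in operations:            # event loop 122
        command = operation.strip()         # step 123: accept and parse
        result = server.execute(command)    # steps 124 and 125: send, receive
        if command == "quit":               # hypothetical end command
            break
        screen.append(result)               # step 126: reflect on the screen
    server.disconnect()                     # step 127: disconnect


server, screen = EchoServer(), []
run_client(server, ["list user1/voice", "quit", "never reached"], screen)
```

Operations after the end command are never sent, which mirrors leaving the event loop before disconnecting.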

[0036] A concrete example of the interface screen 400 shown in FIG. 3 is now described. The screen shows the state after the user has created folders and registered search conditions for the documents to be collected in them, and after the document collection server system 100 has collected and classified documents matching the registered search conditions.

[0037] Item 402 is the name of the folder whose contents are being displayed. In this example, the contents of the folder voice, a subordinate directory of the folder user1, are displayed.

[0038] Area 403 displays the search condition that the user registered for the folder user1/voice. The search condition can be updated by editing the displayed text directly. In this embodiment, the search condition stored for each folder is a list of items the user can write in natural language: single words (hereinafter called words), phrases, or sentences.

[0039] Area 404 lists the folders below user1/voice. For each folder it shows the folder name, the number of documents collected in the folder, and the opening portion of the folder's search condition. In this example there are two subordinate folders below user1/voice, recognition and synthesis.

[0040] Clicking an item in this list moves to the clicked subordinate folder.

[0041] Area 405 lists the documents already collected in the folder user1/voice.

[0042] For each document it shows the title, the degree of match with the search condition of user1/voice, the words of the search condition that matched, the information source name, the domain name, and so on.

[0043] Clicking an item in this document list shows the content of the clicked document, which is displayed in area 406.

[0044] Folders are created and deleted with the File menu of the menu 401. The Goto menu can also be used to move to another folder.

[0045] The user can also rate collected documents, and the folders in which documents are stored, with the Edit menu of the menu 401. When the user marks a document or folder as useful or useless through the menu, the corresponding command is sent to the document collection server system 100. The server system reflects the rating in the search conditions, so that from the next collection onward it collects documents that better match the user's intent.

[0046] The document collection server system 100, in the memory of the server 510, processes requests from the client 500 and, in parallel, collects, classifies, and organizes documents based on the folders created by the user and the search conditions registered in each folder.

[0047] Specifically, the document collection server system 100 periodically accesses external information sources such as the news server 520 and the document server 521, acquires the documents accumulated at each source since the previous access, and searches them for documents that match the search conditions registered by users. The number of times each word of a search condition occurs in a target document is taken as the degree of match between that document and the search condition. From among the folders whose search conditions matched, the folder into which the document is classified is chosen taking the folder hierarchy into account, and the document is stored in that folder. The system also organizes documents according to how collection is going, for example by automatically splitting a folder in which many documents have accumulated.
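
As a small worked example of the degree of match defined above, the sketch below counts how often the words of a search condition occur in a document. The function name `match_degree`, the regular-expression tokenizer, and the choice to add the per-word counts into one number are assumptions made for illustration; the text only states that word occurrence counts are used.

```python
import re


def match_degree(condition, document):
    """Count occurrences in `document` of the words listed in `condition`."""
    condition_words = set(re.findall(r"[a-z0-9]+", condition.lower()))
    document_words = re.findall(r"[a-z0-9]+", document.lower())
    # Every occurrence of a condition word in the document adds one.
    return sum(1 for word in document_words if word in condition_words)


hit = match_degree("speech recognition", "Speech input and speech output")
miss = match_degree("speech recognition", "Weather report for Tokyo")
```

A document that contains none of the condition's words gets a degree of 0 and would not be stored in that folder.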

[0048] The external information sources from which documents are collected may reside on other networks accessible from the server.

[0049] Before the collection, classification, and organization of documents are described in more detail, the folders and search conditions created by users are explained using the example in FIG. 4.

[0050] When a user is registered with the document collection server system 100, one folder is assigned to that user. Below the assigned folder the user is free to create subordinate folders hierarchically and to register, for each folder, a search condition stating what kind of documents should be collected in it.

[0051] In the example of FIG. 4, two users, user1 and user2, are registered and are assigned folders 540 and 550, respectively. user1 has created folders 541-544 hierarchically below folder 540 and has registered search conditions 545-548 in those folders.
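
One way to picture such a per-user folder hierarchy is the tree sketch below. The `Folder` class and its methods are hypothetical, the folder names are borrowed from the interface example (user1/voice with recognition and synthesis), and the search-condition texts are invented placeholders because the actual conditions 545-548 are not given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    name: str
    condition: str = ""                            # search condition text
    children: dict = field(default_factory=dict)   # name -> Folder
    documents: list = field(default_factory=list)  # collected documents

    def add_child(self, name, condition=""):
        child = Folder(name, condition)
        self.children[name] = child
        return child

    def find(self, path):
        """Follow a slash-separated path relative to this folder."""
        folder = self
        for part in path.split("/"):
            folder = folder.children[part]
        return folder

    def paths(self, prefix=""):
        """Yield the full path of this folder and of every descendant."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path
        for child in self.children.values():
            yield from child.paths(path)


user1 = Folder("user1")
voice = user1.add_child("voice", "speech voice")          # placeholder text
voice.add_child("recognition", "speech recognition")      # placeholder text
voice.add_child("synthesis", "speech synthesis")          # placeholder text
all_paths = list(user1.paths())
```

Keeping the condition on the folder itself means that restructuring a folder can update the hierarchy and the condition together.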

[0052] user2, by contrast, has created no subordinate folders and has registered in folder 550 a search condition 551 that merely lists topics of interest.

[0053] Besides being created and updated by users, folders and their search conditions may also be created and updated automatically by the document collection server system 100 according to how document collection is going. Details are given later.

[0054] Consequently, even a user like user2, who merely lists topics of interest without creating hierarchical folders, has the collected documents classified and organized automatically.

[0055] The document collection server system 100 is described with reference to the flowchart of FIG. 2.

[0056] To respond at any time to requests from multiple users while also collecting documents periodically, the document collection server system 100 constantly monitors whether a user has requested a connection or whether a fixed time has elapsed (step 101). When a user requests a connection, it starts the client request processing 110. When the fixed time has elapsed, it starts the document collection process 102. In either case, the main processing of the document collection server system 100 returns immediately to step 101 and continues monitoring for connection requests and for elapse of the fixed time.
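
The decision made at step 101 can be sketched as a pure function. This is an illustrative assumption about how the monitoring might be expressed: `due_actions` and its parameters are hypothetical names, time is a plain number, and actually running each started process concurrently (so that the main loop can return to step 101 at once) is left out.

```python
def due_actions(pending_connections, now, last_collection, interval):
    """Step 101: list the processes the server should start right now."""
    actions = []
    if pending_connections > 0:
        # A user asked to connect: start client request processing 110.
        actions.append("client request processing 110")
    if now - last_collection >= interval:
        # The fixed time has elapsed: start document collection 102.
        actions.append("document collection 102")
    return actions


on_connect = due_actions(pending_connections=1, now=10, last_collection=0, interval=60)
on_timeout = due_actions(pending_connections=0, now=60, last_collection=0, interval=60)
idle = due_actions(pending_connections=0, now=10, last_collection=0, interval=60)
```

In a real server each returned action would be handed to a separate thread or process; the sketch only shows which condition triggers which process.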

[0057] FIG. 6 shows the flow of the client request processing 110 and its correspondence with the data structures used when commands are executed.

[0058] When the client request processing 110 starts, it first prepares for handling client requests by connecting to the client (step 111) and loading the folder table (step 112).

[0059] It then repeats parsing (step 113) and execution (step 114) of the commands sent from the client 500 until it receives an end command from the client.

[0060] On receiving the end command it stops repeating, disconnects the client, and ends the client request processing 110.

[0061] While executing a command it loads, consults, updates, and saves the various tables as needed.

[0062] For example, when the user requests a listing of the documents stored in a folder, the document collection client system 501 sends the corresponding command and the name of the target folder. On receiving the command and folder name, the client request processing 110 consults the folder table and sends the client the information on the documents stored in that folder (each document's title, degree of match, matched search-condition words, information source name, and so on).
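
The listing request in this example might be handled along the following lines. The dictionary layout of the folder table, the command name `"list"`, the field names, and the sample entry are all assumptions for illustration; the actual table format is described elsewhere in the specification.

```python
def execute_command(folder_table, command, folder_name):
    """Answer a document-listing command from the folder table."""
    if command != "list":
        raise ValueError(f"unsupported command: {command}")
    # Only the summary fields are returned, not the document body.
    fields = ("title", "match", "matched_words", "source")
    return [{name: doc[name] for name in fields}
            for doc in folder_table.get(folder_name, [])]


# Made-up sample data, for illustration only.
folder_table = {
    "user1/voice": [
        {"title": "Sample article", "match": 3, "matched_words": ["speech"],
         "source": "news", "body": "full text kept on the server"},
    ],
}
listing = execute_command(folder_table, "list", "user1/voice")
empty = execute_command(folder_table, "list", "user2")
```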

[0063] The document collection process is described following FIG. 5, which shows the flow of the document collection process 102 (FIG. 2) and its correspondence with the data structures used during collection.

[0064] First, the tables used for document collection (document number table 300, folder table 310, word-folder table 330, and word-document table 350) are loaded from the internal DB 511 into memory (step 103).

[0065] The document number table 300 records which information sources are available, which domains each source has, the range of document numbers present in each domain, and up to which number documents have already been acquired.

[0066] The folder table 310 records which folders exist, how they are arranged hierarchically, and which documents are stored in each folder.

[0067] The word-folder table 330 records which words occur in the search condition associated with each folder.

[0068] The word-document table 350 records which words occur in which document. Each table is described in detail later.

[0069] Next, steps 105 to 107 are repeated for every new document of every information source.

[0070] The new document acquisition of step 105 connects to each information source, checks whether there are documents newer than the document number recorded in the document number table 300, and acquires any such documents.
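
Step 105 can be illustrated with a dictionary standing in for the document number table 300. The table layout (`first`, `last`, `fetched`), the function name `fetch_new`, and the in-memory `source_docs` mapping that replaces a real network connection are assumptions of this sketch, and the sample source and domain names are made up.

```python
def fetch_new(number_table, key, source_docs):
    """Return (number, text) pairs newer than the last fetched number."""
    entry = number_table[key]
    new_numbers = [n for n in sorted(source_docs) if n > entry["fetched"]]
    # Record the newest number now present at the source.
    entry["last"] = max(entry["last"], max(source_docs, default=entry["last"]))
    if new_numbers:
        entry["fetched"] = new_numbers[-1]
    return [(n, source_docs[n]) for n in new_numbers]


# Key is (information source, domain); sample values are made up.
number_table = {("news", "speech"): {"first": 1, "last": 3, "fetched": 3}}
source_docs = {2: "old b", 3: "old c", 4: "new a", 5: "new b"}

first_run = fetch_new(number_table, ("news", "speech"), source_docs)
second_run = fetch_new(number_table, ("news", "speech"), source_docs)
```

Because the table remembers the last fetched number, a second run against an unchanged source returns nothing.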

[0071] Next, the degree-of-match calculation of step 106 computes the acquired document's degree of match for each folder. It first creates a folder search table 370, which records the words occurring in the acquired document, and creates and initializes a matching folder table 390 for holding the degree of match for each folder. It then compares the folder search table 370 with the word-folder table 330, computes the degree of match for each folder, and registers it in the matching folder table 390. The calculation is described in detail later.

[0072] Next, the document storage processing of step 107 determines the folder in which to store the document from the matching degree for each folder registered in the matching folder table 390 and the folder hierarchy represented by the folder table 310, and registers the document in the folder table 310 and the word/document table 350. The document storage processing is described in detail later.

[0073] Next, the folder management processing of step 108 analyzes the documents in each folder using the distribution of word frequencies in each document represented by the word/document table 350, automatically splits or merges folders, and updates the folder table 310 and the word/folder table 330. Details are given later.

[0074] The document collection tables updated during the repetition of steps 105 to 107 are then saved to the internal DB 511 (step 109).

[0075] This completes one pass of the document collection processing 102.

[0076] The data structures and processing used in the document collection processing 102 described above are now explained in more detail.

[0077] FIG. 13 shows the data structure of the document number table 300. The document number table 300 is a hash table, and each entry points to a document number list 302 shown in FIG. 12. The entry is determined by the value of a hash function that takes the information source name and the domain name as input.

[0078] The document number list 302 is a set consisting of a pointer 303 to the information source name, a pointer 304 to the domain name, the number 305 of the oldest document in the domain, the number 306 of the latest document, the number 307 of the document up to which the document collection system has already performed collection processing, and a pointer 308 to another document number list with the same hash value.

[0079] The document number table is loaded when document collection starts and is updated each time a document is acquired from an information source.

[0080] The internal DB stores which information sources and domains exist and the number up to which the documents of each domain have been processed. First, the stored information source names, domain names, and acquired document numbers are read from the internal DB 511, a document number list 302 is created for each, and each list is registered in the document number table 300 at the entry given by the value of the hash function that takes the information source name and the domain name as input. Next, the oldest and latest document numbers of each domain are obtained from each information source and written into the document number lists. If a domain is found that is not registered in the document number table 300, it is a domain newly created at that information source, so a document number list is generated with the acquired document number set to 0 and is registered in the document number table 300.

[0081] For example, the document number list 302-a in FIG. 13 indicates that the domain fj.ai of the information source internet news holds documents No. 123 through No. 145, of which those up to No. 130 have already been processed.
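The document number table and the way it yields the set of not-yet-collected documents can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `DocNumberEntry` and `new_document_numbers` are assumptions, and a Python dict keyed by (source, domain) stands in for the hash table with its collision chain (pointer 308).

```python
from dataclasses import dataclass


@dataclass
class DocNumberEntry:
    """One document number list 302 (field names are illustrative)."""
    source: str     # information source name (303)
    domain: str     # domain name (304)
    oldest: int     # oldest document number in the domain (305)
    latest: int     # latest document number in the domain (306)
    collected: int  # last number already processed (307); 0 for a new domain


def new_document_numbers(entry):
    """Return the document numbers that still have to be acquired."""
    start = max(entry.collected + 1, entry.oldest)
    return range(start, entry.latest + 1)


# Document number table 300: the dict replaces the hash function and chaining.
doc_number_table = {}
entry = DocNumberEntry("internet news", "fj.ai", 123, 145, 130)
doc_number_table[(entry.source, entry.domain)] = entry

pending = new_document_numbers(doc_number_table[("internet news", "fj.ai")])
print(pending.start, pending.stop - 1)  # 131 145
```

For the example of FIG. 13, documents No. 131 through No. 145 remain to be acquired.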

[0082] FIG. 15 shows the data structure of the folder table 310. The folder table 310 is a hash table, and each entry points to a folder list 314 shown in FIG. 14. The entry is determined by the value of a hash function that takes the folder name as input.

[0083] The folder list 314 is a set consisting of the folder ID number 315, a pointer 316 to the folder name, a pointer 317 to the folder list representing the upper folder, a pointer 318 to a lower folder list 321, a pointer 319 to a stored document list 324, and a pointer 320 to the folder list representing another folder with the same hash value.

[0084] The lower folder list 321 is a set consisting of a pointer 322 to the folder list representing a lower folder and a pointer 323 to the lower folder list representing another lower folder of the folder represented by the folder list 314.

[0085] The stored document list 324 is a set consisting of a pointer 325 to the information source name of the stored document, a pointer 326 to the domain name, the document number 327, the matching degree 328 of the document represented by this stored document list 324 in the folder represented by the folder list 314, and a pointer 329 to the stored document list representing another document stored in this folder.

[0086] For example, the folder list 314-a in FIG. 15 indicates that the folder named voice, with folder ID 1003, has as its upper folder the folder user1 represented by the folder list 314-b, has as a lower folder the folder synthesis represented by the folder list 314-c, and stores, among others, document No. 120 of the domain fj.ai of the information source internet news with a matching degree of 13 points.
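The folder table of FIG. 15 might be modelled as below. This is only a sketch: the class and function names are assumptions, Python lists replace the linked lower folder list 321 and stored document list 324, and the folder IDs 1001 and 1005 are made up because the source gives an ID only for the folder voice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StoredDoc:
    source: str  # information source name (325)
    domain: str  # domain name (326)
    number: int  # document number (327)
    score: int   # matching degree in this folder (328)


@dataclass(eq=False)
class Folder:
    folder_id: int                                            # folder ID (315)
    name: str                                                 # folder name (316)
    parent: Optional["Folder"] = field(default=None, repr=False)  # upper folder (317)
    children: list = field(default_factory=list)              # lower folders (318)
    documents: list = field(default_factory=list)             # stored documents (319)


def add_child(parent, child):
    """Link a lower folder to its upper folder in both directions."""
    child.parent = parent
    parent.children.append(child)


def path(folder):
    """Return the folder names from the root down to this folder."""
    names = []
    while folder is not None:
        names.append(folder.name)
        folder = folder.parent
    return "/".join(reversed(names))


user1 = Folder(1001, "user1")          # ID 1001 is hypothetical
voice = Folder(1003, "voice")
synthesis = Folder(1005, "synthesis")  # ID 1005 is hypothetical
add_child(user1, voice)
add_child(voice, synthesis)
voice.documents.append(StoredDoc("internet news", "fj.ai", 120, 13))

# Folder table 310: folder name -> folder record.
folder_table = {f.name: f for f in (user1, voice, synthesis)}
print(path(folder_table["synthesis"]))  # user1/voice/synthesis
```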

[0087] The word/folder table 330 shown in FIG. 17 is a hash table, and each entry points to a word/folder list 333 shown in FIG. 16. The entry is determined by the value of a hash function that takes a word as input.

[0088] The word/folder list 333 is a set consisting of a pointer 334 to the word, a pointer 335 to a folder frequency list 340, and a pointer 336 to another word/folder list with the same hash value. The folder frequency list 340 is a set consisting of the folder ID 341 of a folder whose search condition contains this word, the frequency 342 with which the word appears in that search condition, and a pointer 343 to another folder frequency list.

[0089] For example, the word/folder list 333-a and the folder frequency list 340-a in FIG. 17 indicate that the word "language" appears once in the search condition corresponding to the folder with folder ID 1003. The word/folder list 333-b and the folder frequency lists 340-b and 340-c indicate that the word "speech recognition" appears once each in the search conditions corresponding to the folder with ID 1003 and the folder with ID 1004.

[0090] FIG. 19 shows the data structure of the word/document table 350. The word/document table 350 is a hash table, and each entry points to a word/document list 354 shown in FIG. 18. The entry is determined by the value of a hash function that takes a word as input.

[0091] The word/document list 354 of FIG. 18 is a set consisting of a pointer 355 to the word, a pointer 356 to a document frequency list 360, and a pointer 357 to another word/document list with the same hash value. The document frequency list 360 is a set consisting of a pointer 361 to the information source name of a document in which this word appears, a pointer 362 to the domain name, the document number 363, the frequency 364 of the word in that document, and a pointer 365 to the document frequency list for another document in which this word appears.

[0092] For example, the word/document list 354-a and the document frequency list 360-a in FIG. 19 indicate that the word "language" appears five times in document No. 56 of the domain fj.sci.lang of the information source internet news. The word/document list 354-b and the document frequency lists 360-b and 360-c indicate that the word "speech recognition" appears twice in document No. 120 of the domain fj.ai of the information source internet news and twice in document No. 56 of the domain fj.sci.lang.
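The word/document table can be pictured as a mapping from each word to the documents containing it, with a frequency per document. The sketch below reproduces the example of FIG. 19 under that assumption; `register_document` is a hypothetical helper, and nested dicts replace the hash table and the linked document frequency lists 360.

```python
from collections import Counter, defaultdict

# Word/document table 350: word -> {(source, domain, number): frequency}
word_doc_table = defaultdict(dict)


def register_document(table, source, domain, number, words):
    """Record how often each word occurs in one stored document."""
    for word, freq in Counter(words).items():
        table[word][(source, domain, number)] = freq


register_document(word_doc_table, "internet news", "fj.sci.lang", 56,
                  ["language"] * 5 + ["speech recognition"] * 2)
register_document(word_doc_table, "internet news", "fj.ai", 120,
                  ["speech recognition"] * 2)

print(word_doc_table["language"])
# {('internet news', 'fj.sci.lang', 56): 5}
```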

[0093] The contents of the folder table 310, the word/folder table 330, and the word/document table 350 are stored in the internal DB 511. These tables are loaded into memory as needed when the document collection processing 102 starts, when the client request processing 110 starts, or when a command is executed. They are referenced and updated while the respective processing runs and are saved to the internal DB 511 when it ends. Each table is updated under mutual exclusion. Updates to the folder table caused by creating or deleting a folder, and updates to the word/folder table caused by changing a search condition, are saved immediately.

[0094] As an example, FIG. 8 shows a flowchart for loading the word/folder table 330. The word/folder table 330 is loaded after the folder table 310 has been loaded.

[0095] First, the word/folder table 330 is initialized (step 160).

[0096] Next, word registration for a folder (steps 164 to 166) is repeated for every folder registered in the folder table 310 (step 161).

[0097] In word registration for a folder, the search condition corresponding to the folder is first read from the internal DB 511 into the memory 523 (step 164), and words are extracted from it (step 165). For each extracted word, a word/folder list 333 of FIG. 16 is created, the hash value of the word is calculated, and the list is registered in the word/folder table 330 of FIG. 17 (step 166).

[0098] When word registration has been performed for all folders, the repetition ends and the word/folder table load processing 151 ends.
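The load procedure of steps 160 to 166 can be sketched as follows. The tokenizer is an assumption (the source does not say how words are extracted from a search condition), as are the function names and the sample search conditions; `read_condition` stands in for reading a folder's search condition from the internal DB 511.

```python
from collections import Counter, defaultdict


def extract_words(condition):
    """Assumed tokenizer: lower-case and split on whitespace (step 165)."""
    return condition.lower().split()


def load_word_folder_table(folder_ids, read_condition):
    """Build word -> {folder_id: frequency in that folder's search condition}."""
    table = defaultdict(dict)                      # step 160: initialize
    for folder_id in folder_ids:                   # step 161: every folder
        condition = read_condition(folder_id)      # step 164: read condition
        counts = Counter(extract_words(condition))
        for word, freq in counts.items():
            table[word][folder_id] = freq          # step 166: register word
    return table


# Hypothetical search conditions keyed by folder ID.
conditions = {1003: "language speech-recognition", 1004: "speech-recognition"}
word_folder_table = load_word_folder_table(conditions, conditions.get)
print(word_folder_table["speech-recognition"])  # {1003: 1, 1004: 1}
```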

[0099] The search processing performed by the matching degree calculation 106 is described with reference to FIG. 9.

[0100] This processing determines the matching degree between the acquired document and each folder by examining the similarity between the words appearing in the search conditions and the words appearing in the document.

[0101] The matching degree between a search target document and a folder, as used in this embodiment, is explained here.

[0102] Several methods are conceivable for determining the matching degree between a search target document and a folder. In this embodiment, for each word in the document that matches the folder (that is, each word in the document that coincides with a word contained in the search condition stored for that folder), the matching degree between the word and the folder is obtained. These matching degrees are summed, and the sum is taken as the matching degree between the document and the folder.

[0103] The matching degree between a matching word and the folder can likewise be obtained in various ways. In this embodiment, as a preferred choice, it is the product of the weight of the word in the document and the weight of the word in the folder.

[0104] The weight of the word in the document can be determined in various ways. In this embodiment, as a preferred choice, the frequency with which the word appears in the document is used as its weight in the document.

[0105] The weight of the word in the folder can also be determined in various ways. In this embodiment, as a preferred choice, the number of times the word appears in the search condition stored for the folder is used.

[0106] Accordingly, in this embodiment, the matching degree between a word and a folder is expressed as the product of the frequency of the word in the document and the frequency of the word in the search condition corresponding to the folder, and the matching degree between the search target document and the folder is given by the sum of the matching degrees of the individual words obtained in this way.
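Written as a formula, the matching degree described above might look as follows. The symbols are introduced here only for illustration and do not appear in the source: n_d(w) is the number of occurrences of word w in document d, n_f(w) is the number of occurrences of w in the search condition of folder f, and W_d and W_f are the sets of words in the document and in the search condition.

```latex
s(d, f) \;=\; \sum_{w \,\in\, W_d \cap W_f} n_d(w)\, n_f(w)
```

For instance, if a document contains "language" twice and "speech recognition" five times, and the search condition of a folder contains each of these words once, then s(d, f) = 2·1 + 5·1 = 7.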

[0107] More specifically, the words appearing in the acquired document are registered in the folder search table 370 of FIG. 21 and collated against the word/folder table 330, in which the words appearing in the search conditions of all folders are registered. The matching degree is totaled for each folder, and the folders are sorted in order of matching degree.

[0108] First, the folder search table 370 is initialized (step 170), and the matching folder table 390 of FIG. 23 is initialized (step 171).

[0109] Next, words are extracted from the acquired document (step 172), and each word is registered in the folder search table 370 (step 173).

[0110] The folder search table 370 is a hash table, and each entry points to a folder search list 372 shown in FIG. 20. The entry is determined by the value of a hash function that takes a word as its argument. The folder search list 372 is a set consisting of a pointer 373 to a word in the document, a pointer 374 to a matching folder list 380, the frequency 375 of the word in the document, and a pointer 376 to another folder search list with the same hash value. The matching folder list 380 is a set consisting of the folder ID 381 of a folder whose search condition contains the word, the matching degree 382 for that folder, and a pointer 383 to another matching folder list.

[0111] For example, the folder search list 372-a of the folder search table 370 in FIG. 21 indicates that the word "language" appears twice in the search target document. Because the search has not yet been executed, the pointer 374-a to the matching folder list is NULL. Similarly, the folder search lists 372-b and 372-c indicate that the word "silent period" appears three times and the word "speech recognition" appears five times in the target document, respectively.

[0112] Next, the folder search table 370 is collated against the word/folder table 330. Where a matching folder exists, a matching folder list is created and registered in the folder search table (step 174).

[0113] That is, if a word registered in the folder search table 370 is also registered in the word/folder table 330, the frequency 375 in the folder search list 372 is multiplied by the frequency 342 in each folder frequency list 340 registered in the word/folder list 333. Each product is taken as the matching degree of the word for the corresponding folder, and a matching folder list 380 is created for each and registered in the folder search list 372.

[0114] For example, the word "language" registered in the folder search table 370 appears once in the search condition corresponding to folder ID 1003, as indicated by the folder frequency list 340-a of the word/folder table 330. The matching degree of the word "language" for the folder with ID 1003 is therefore 2 points (two occurrences in the document times one in the search condition), and the pointer to the matching folder list, which was NULL before the search, now points to the matching folder list 380-a, as in the folder search table 370 after the search shown in FIG. 22.

[0115] Similarly, the word "speech recognition" appears once each in the search conditions corresponding to the folder with ID 1003 and the folder with ID 1004, as indicated by the folder frequency list 340-b of the word/folder table 330. The matching degree of the word "speech recognition" is therefore 5 points for each of the folders with IDs 1003 and 1004, and after the search the matching folder lists 380-b and 380-c are registered in the folder search table 370 of FIG. 22. The word "silent period", however, is not registered in the word/folder table. No matching folder exists for it, so its pointer to the matching folder list remains NULL after the search.

[0116] Finally, the matching degrees of the individual words are totaled for each folder, and the folders whose matching degree is not 0 are registered in the matching folder table 390 of FIG. 23 in descending order of matching degree (step 175). Each entry of the matching folder table points to a matching folder list 381. This matching folder list has the same data structure as the matching folder list of FIG. 20, except that its matching degree holds the total of the per-word matching degrees for the folder and the pointer 383 to another matching folder list is not used.
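Steps 170 to 175 can be condensed into a short sketch. The function name and the dict-based tables are assumptions; a `Counter` plays the role of the folder search table 370, and the returned list plays the role of the matching folder table 390. The sample data follows the example of FIGS. 17 and 21.

```python
from collections import Counter, defaultdict


def matching_degrees(doc_words, word_folder_table):
    """Return (folder_id, matching degree) pairs, highest degree first."""
    doc_freq = Counter(doc_words)        # steps 172-173: word frequencies
    totals = defaultdict(int)
    for word, freq in doc_freq.items():  # step 174: collate with table 330
        for folder_id, cond_freq in word_folder_table.get(word, {}).items():
            totals[folder_id] += freq * cond_freq
    ranked = [(fid, score) for fid, score in totals.items() if score > 0]
    ranked.sort(key=lambda item: item[1], reverse=True)  # step 175
    return ranked


word_folder_table = {
    "language": {1003: 1},
    "speech recognition": {1003: 1, 1004: 1},
}
doc_words = (["language"] * 2 + ["silent period"] * 3
             + ["speech recognition"] * 5)
print(matching_degrees(doc_words, word_folder_table))
# [(1003, 7), (1004, 5)]
```

Folder 1003 scores 2 + 5 = 7 and folder 1004 scores 5; "silent period" contributes nothing because no search condition contains it.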

[0117] Using the matching folder table 390 created by the matching degree calculation above and the folder table 310, in which the folder hierarchy is registered, the document storage processing 107 selects the folder in which the document should be stored and stores it there. FIG. 10 shows the flow of the document storage processing 107.

[0118] This document storage processing consists broadly of two stages (steps 180 and 181).

[0119] First, in step 180, the folder in which the target document should be stored is determined from the matching degree for each folder registered in the matching folder table 390 and the folder hierarchy registered in the folder table 310, and the target document is registered in the folder table 310 as a stored document.

[0120] FIG. 24 illustrates how the storage folder is determined. In this embodiment, on each branch of the folder hierarchy, the document is stored in the lowest of the folders that matched. Given folders A through H with the matching degrees shown in the figure, the document is stored in D, the lowest matching folder on the branch A-B-D-G, and likewise in E on the branch A-B-E and in H on the branch A-C-F-H. This method regards a lower folder as inheriting the search condition of its upper folder and stores the document in the folder whose search condition is described in the most detail.
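The selection rule can be sketched as a depth-first walk that remembers the deepest matching folder seen on the way to each leaf. The function name and the tree encoding are assumptions, and the matching degrees below are invented to reproduce the outcome described for FIG. 24 (the figure's actual values are not available in the text).

```python
def storage_folders(children, root, score):
    """Return the lowest matching folder on every root-to-leaf branch."""
    chosen = []

    def walk(node, deepest):
        if score.get(node, 0) > 0:
            deepest = node              # this folder matched; it is lower
        kids = children.get(node, [])
        if not kids:                    # leaf: the branch ends here
            if deepest is not None and deepest not in chosen:
                chosen.append(deepest)
            return
        for kid in kids:
            walk(kid, deepest)

    walk(root, None)
    return chosen


# Hierarchy with branches A-B-D-G, A-B-E and A-C-F-H.
children = {"A": ["B", "C"], "B": ["D", "E"], "D": ["G"],
            "C": ["F"], "F": ["H"]}
# Hypothetical matching degrees; G, C and F do not match.
score = {"A": 3, "B": 2, "D": 4, "E": 1, "H": 5}
print(storage_folders(children, "A", score))  # ['D', 'E', 'H']
```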

[0121] Next, in step 181, the words appearing in the target document are registered in the word/document table 350.

[0122] That is, a word/document list of FIG. 18 is created for every word appearing in the document and registered in the word/document table 350.

[0123] Through this processing, the word/document table 350 records, for every stored document, which words appear in it. The word/document table 350 is used to analyze the documents in a folder in the folder management processing described next.

[0124] The folder management processing 108 is described next.

[0125] When documents continue to be collected and classified with the search conditions of hierarchically arranged folders regarded as a classification system, documents may concentrate in particular folders until the number of documents in a folder exceeds what the user can keep track of. Documents may also increasingly be stored redundantly in multiple folders, which is wasteful.

[0126] These phenomena occur when the user has not constructed the classification system appropriately, or when social circumstances or research trends change and the classification system no longer fits.

[0127] The folder management processing 108 detects these phenomena by analyzing how documents accumulate in each folder, and improves the folder hierarchy and the search conditions corresponding to the folders. It thereby keeps the number of documents in a folder to a level the user can grasp, prevents documents from being stored redundantly in multiple folders without need, and keeps the folder hierarchy in line with actual conditions.

[0128] FIG. 11 shows the flow of the folder management processing.

[0129] Steps 201 to 205 are repeated for each folder (step 200).

[0130] First, the number of documents stored in the folder is checked (step 201). If the folder holds at least a preset number of documents, the documents in the folder are analyzed using a statistical method (step 202). The technique of gathering similar individuals from a population of mixed character and classifying them into groups is known as cluster analysis and is described, for example, in "Multivariate Analysis Handbook" (Gendai Sugakusha, 1986). Step 202 uses cluster analysis to reclassify the documents based on the frequencies of the words appearing in the documents in the folder. A set of reclassified documents is called a cluster.

[0131] Next, the relationships between the clusters are analyzed and the hierarchical structure of the clusters is determined (step 203). The analysis of inter-cluster relationships performed here is described later. A folder and a search condition are generated for each cluster, and folders are created hierarchically according to the hierarchical structure of the clusters (step 204). Next, words that appear with high frequency in common across the documents in each generated folder are extracted from the word/document table 350 and added to the search condition corresponding to the folder (step 205). This refines the search conditions.
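The word extraction of step 205 might be sketched as follows. The source does not define "in common" or "high frequency", so this sketch assumes one possible reading: a word qualifies if it occurs in every document of the folder, and qualifying words are ranked by their total frequency. The function name and the `top_n` parameter are likewise assumptions.

```python
from collections import Counter


def common_frequent_words(docs, top_n=3):
    """docs: list of word lists, one per document in the folder."""
    if not docs:
        return []
    counts = [Counter(words) for words in docs]
    shared = set(counts[0])
    for count in counts[1:]:
        shared &= set(count)            # keep words present in every document
    total = Counter()
    for count in counts:
        for word in shared:
            total[word] += count[word]
    # Highest total frequency first; ties broken alphabetically.
    ranked = sorted(shared, key=lambda w: (-total[w], w))
    return ranked[:top_n]


docs = [
    ["hmm", "hmm", "speech", "noise"],
    ["hmm", "speech", "speech", "speech"],
    ["speech", "hmm", "pitch"],
]
print(common_frequent_words(docs))  # ['speech', 'hmm']
```

The returned words would then be appended to the search condition of the newly generated folder.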

[0132] Once the processing up to this point has been applied to each folder, the document groups stored in the folders are analyzed and the folders are reorganized, that is, folders are merged and the hierarchy is changed (step 206). The analysis performed here is described later.

[0133] The method of analyzing the relationships between clusters performed in step 203 is described with reference to FIGS. 26 to 31.

[0134] As shown in FIG. 25, suppose there is a search condition consisting of word group w1 and word group w2, and a document group d is stored in the folder 450 corresponding to this search condition. For the documents in this folder, the patterns of the relationship between words and documents obtained by statistically analyzing the data from the word/document table are shown at 451 in FIG. 26, 455 in FIG. 26, and 458 in FIG. 28.

[0135] FIG. 26 shows the case where the document group d is classified into two independent clusters: a document group d1 in which word group w1 appears and a document group d2 in which word group w2 appears. In this case, a search condition consisting of word group w1 and a search condition consisting of word group w2 are generated, folders 453 and 454 corresponding to them and a folder 452 above both are provided, and the hierarchy shown in FIG. 26 is formed.

[0136] FIG. 27 shows the case where the documents are classified into two clusters: a document group d1 in which only word group w1 appears and a document group d2 in which both word group w1 and word group w2 appear. Word group w1 also appears in every document in which word group w2 appears. A search condition consisting of word group w1 and a search condition consisting of word group w2 are therefore generated, the corresponding folders 456 and 457 are provided, and the hierarchy shown in FIG. 27 is formed.

[0137] FIG. 28 shows the case where the documents are classified into three clusters: a document group d1 in which only word group w1 appears, a document group d3 in which only word group w2 appears, and a document group d2 in which both word group w1 and word group w2 appear. In this case, a search condition consisting only of word group w1, a search condition consisting only of word group w2, and a search condition requiring both word group w1 and word group w2 are generated. The corresponding folders 456, 457, and 458 and a folder 455 above them are provided, and a hierarchy like that of FIG. 28 is formed.
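The three patterns described for FIGS. 26 to 28 can be told apart with simple set operations. This sketch is a simplification, not the statistical analysis the patent refers to: it assumes a document "contains" a word group if it contains at least one word of that group, and the function name and pattern labels are invented.

```python
def relation_pattern(docs, w1, w2):
    """docs: {doc_id: set of words}; w1, w2: sets of words (word groups)."""
    d1 = {doc for doc, words in docs.items() if words & w1}
    d2 = {doc for doc, words in docs.items() if words & w2}
    if not (d1 & d2):
        return "independent"   # FIG. 26: no document contains both groups
    if d2 <= d1 or d1 <= d2:
        return "nested"        # FIG. 27: one group appears only with the other
    return "overlapping"       # FIG. 28: w1 only, w2 only, and both occur


w1, w2 = {"a"}, {"b"}
print(relation_pattern({1: {"a"}, 2: {"b"}}, w1, w2))               # independent
print(relation_pattern({1: {"a"}, 2: {"a", "b"}}, w1, w2))          # nested
print(relation_pattern({1: {"a"}, 2: {"a", "b"}, 3: {"b"}}, w1, w2))  # overlapping
```

The result selects which folder hierarchy to generate: two sibling folders under a common parent, a parent with one child, or three siblings under a common parent.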

[0138] The method of analyzing the relationships between folders performed in step 206 is described using the same FIGS. 25 to 30.

[0139] Suppose that, with folders in a hierarchical structure like that of FIG. 27, the number of documents stored in duplicate in folder 453 and folder 454 increases. The pattern of the relationship between words and documents is then considered to have changed from pattern 451 to pattern 457 or pattern 458. That documents are stored in duplicate can be detected from the folder table 310. Therefore, for the documents stored in folders 453 and 454, a statistical analysis similar to the inter-cluster relationship analysis is performed on the basis of the appearance frequency of words in each document, obtained from the word/document table 350, to examine the distribution pattern of words and documents, and the folders and search conditions are reconstructed according to that pattern.
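
The duplicate check described above can be sketched as follows. This is a minimal illustration that assumes the folder table can be read as a mapping from folder number to stored document numbers; the mapping, the function names, and the `max_duplicates` threshold are hypothetical, since the text does not state how large the increase in duplicates must be.

```python
def duplicated_documents(folder_table, folder_a, folder_b):
    """Return the document numbers stored in both folders."""
    return set(folder_table[folder_a]) & set(folder_table[folder_b])


def needs_reanalysis(folder_table, folder_a, folder_b, max_duplicates):
    """True when the two folders share more documents than the assumed limit."""
    shared = duplicated_documents(folder_table, folder_a, folder_b)
    return len(shared) > max_duplicates


if __name__ == "__main__":
    # Hypothetical folder table: folder number -> stored document numbers.
    table = {453: [1, 2, 3, 4], 454: [3, 4, 5]}
    print(sorted(duplicated_documents(table, 453, 454)))
    print(needs_reanalysis(table, 453, 454, max_duplicates=1))
```

When the check fires, the documents of both folders would be passed to the word-frequency analysis so that the folders and search conditions can be rebuilt.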

[0140] Further, with a hierarchical structure like that of FIG. 27, if the proportion of documents stored in folder 457 that do not match folder 456 comes to exceed a predetermined proportion, this means that the upper-lower relationship between the search conditions has changed. This can be detected by examining how well the documents stored in folder 457, as registered in the folder table 310, match folder 456. Cluster analysis is then performed on the documents stored in folders 456 and 457, and the folders are reconstructed.
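
The check on the upper-lower relationship can be illustrated with the following Python sketch. It assumes each document is represented by the set of word groups appearing in it and that "matching" the upper folder means containing its word group; both the representation and the names `nonconforming_ratio`, `hierarchy_changed`, and `limit` are assumptions for illustration only.

```python
def nonconforming_ratio(lower_docs, matches_upper):
    """Fraction of documents in the lower folder that do not match the upper folder."""
    if not lower_docs:
        return 0.0
    misses = sum(1 for doc in lower_docs if not matches_upper(doc))
    return misses / len(lower_docs)


def hierarchy_changed(lower_docs, matches_upper, limit):
    """True when the non-matching fraction exceeds the predetermined proportion."""
    return nonconforming_ratio(lower_docs, matches_upper) > limit


if __name__ == "__main__":
    # Hypothetical contents of the lower folder: word groups present per document.
    lower_folder_docs = [{"w1", "w2"}, {"w2"}, {"w2"}, {"w1", "w2"}]

    def matches_upper(doc):
        return "w1" in doc

    print(nonconforming_ratio(lower_folder_docs, matches_upper))
    print(hierarchy_changed(lower_folder_docs, matches_upper, limit=0.3))
```

The comparison is strict, so a folder sitting exactly at the limit is left alone; whether the boundary case should trigger reconstruction is not stated in the text.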

[0141]

[Effects of the Invention] According to the present invention, the following effects are obtained.

[0142] (1) Information matching a group of search conditions written by the user can be collected from a plurality of information sources, and the collected information can be classified by treating the hierarchical structure of the search condition group as a classification system.

[0143] (2) The classification system can be changed according to how information accumulates in the search result storage area corresponding to each search condition.

[0144] As a result, the amount of information stored in any one search result storage area can be kept to a number small enough for the user to grasp it as a whole. In addition, an appropriate classification system can be maintained as circumstances change.

[Brief description of drawings]

FIG. 1 is a system configuration diagram of an embodiment of the present invention.

FIG. 2 is a flowchart of the document collection server system of this embodiment.

FIG. 3 is an example of an interface screen of this embodiment.

FIG. 4 is an example of folders and the search conditions corresponding to them in this embodiment.

FIG. 5 is a flowchart of the document collection process of this embodiment.

FIG. 6 is a flowchart of the client request processing of this embodiment.

FIG. 7 is a flowchart of the document collection client system of this embodiment.

FIG. 8 is a flowchart of the word/folder table loading process of this embodiment.

FIG. 9 is a flowchart of the matching degree calculation of this embodiment.

FIG. 10 is a flowchart of the document storage process of this embodiment.

FIG. 11 is a flowchart of the folder management process of this embodiment.

FIG. 12 shows the data structure of the document number list of this embodiment.

FIG. 13 shows the data structure of the document number table of this embodiment.

FIG. 14 shows the data structures of the folder list, the lower folder list, and the stored document list of this embodiment.

FIG. 15 shows the data structure of the folder table of this embodiment.

FIG. 16 shows the data structures of the word/folder list and the folder frequency list of this embodiment.

FIG. 17 shows the data structure of the word/folder table of this embodiment.

FIG. 18 shows the data structures of the word/document list and the document frequency list of this embodiment.

FIG. 19 shows the data structure of the word/document table of this embodiment.

FIG. 20 shows the data structures of the folder search list and the matching folder list of this embodiment.

FIG. 21 shows the data structure of the folder search table of this embodiment.

FIG. 22 is an example of the folder search table after search processing in this embodiment.

FIG. 23 shows the data structure of the matching folder table of this embodiment.

FIG. 24 is an explanatory diagram of the method of determining the folder in which a document is stored in this embodiment.

FIG. 25 is an explanatory diagram of a folder subjected to in-folder document analysis in this embodiment.

FIG. 26 is an explanatory diagram of a first distribution pattern of words and documents resulting from the in-folder document analysis of this embodiment, and of the folders generated according to it.

FIG. 27 is an explanatory diagram of a second distribution pattern of words and documents resulting from the in-folder document analysis of this embodiment, and of the folders generated according to it.

FIG. 28 is an explanatory diagram of a third distribution pattern of words and documents resulting from the in-folder document analysis of this embodiment, and of the folders generated according to it.

[Explanation of symbols]

100: document collection server system; 102: document collection process; 106: matching degree calculation; 107: document storage process; 108: folder management process; 110: client request processing.

Claims

[Claims]

1. A document information classification method for automatically classifying document information by a computer, comprising the steps of: storing a plurality of search conditions, each corresponding to one of a plurality of folders associated with one another in a hierarchical relationship and each specifying one or more words to be searched for; detecting a matching degree between the search condition stored for each folder and document information to be classified; determining, from among the plurality of folders, one or more folders in which the document information should be registered, based on the detected matching degree between each folder and the document information and on the hierarchical relationship; and storing the document information in correspondence with the determined one or more folders.

2. The document information classification method according to claim 1, wherein the detecting comprises the steps of: detecting a matching degree between the document information and each of the words specified by the search condition stored for each folder; and determining the sum of the matching degrees detected between the document information and each of the words specified by the search condition stored for each folder as the matching degree between that folder and the document information.

3. The document information classification method according to claim 2, wherein the step of determining the matching degree between the document information and each of the one or more words included in the search condition stored for each folder comprises the step of determining the product of the weight of the word in the document information and the weight of the word in the search condition as the matching degree between the word and the document information.

4. The document information classification method according to claim 3, wherein the weight, in the document information, of each of the one or more words included in the search condition stored for each folder is a value proportional to the number of appearances of the word in the document information.

5. The document information classification method according to claim 3, wherein the weight, in the search condition, of each of the one or more words included in the search condition stored for each folder is a value proportional to the number of appearances of the word among the plurality of words in the search condition.

6. The document information classification method according to claim 1, wherein the determining comprises the steps of: detecting one or more folders matching the document information according to the matching degree detected between the document information and each folder; and, when a plurality of folders are detected, selecting, from among the detected folders, one or more folders to be associated with the document information depending on the relative positions of the detected folders in the hierarchical relationship.

7. The document information classification method according to claim 6, wherein the selecting comprises the step of, when the plurality of folders detected as matching the document information include a group of folders in a relative upper-lower relationship, selecting one folder representing that group of folders as a folder with which the document information is to be associated.

8. The document information classification method according to claim 7, wherein the selecting further comprises the step of, when the detected plurality of folders include another group of folders that is not in a relative upper-lower relationship with the group of folders, selecting one folder representing the other group of folders as another folder with which the document information is to be associated.

9. The document information classification method according to claim 7, wherein one folder representing the group of folders is a folder located at the lowest level of the group of folders.

10. A document information classification method for automatically classifying document information by a computer, comprising the steps of: storing a plurality of search conditions, each corresponding to one of a plurality of folders associated with one another in a hierarchical relationship and each specifying one or more words to be searched for; determining, based on the search condition stored for each folder and a predetermined judgment criterion, one or more of the plurality of folders as folders with which document information to be classified is to be associated; storing the document information in correspondence with the determined folders; performing the determining and the storing for each of a plurality of pieces of document information to be classified; judging whether the plurality of pieces of document information stored for each folder satisfy a predetermined condition defined for reconstruction of the folder; and, when any one of the folders satisfies the predetermined condition, reconstructing the plurality of pieces of document information stored for that folder and the search condition stored for that folder.

11. The document information classification method according to claim 10, wherein the predetermined condition is that the total number of registered document information in the one folder exceeds a predetermined value.

12. The document information classification method according to claim 10, wherein the reconstructing step comprises the steps of: dividing the search condition stored for the one folder into a plurality of new search conditions, each specifying a part of the plurality of words specified by that search condition; dividing the plurality of pieces of document information registered in the one folder into a plurality of document information groups; replacing the one folder with a plurality of new folders; and storing, in correspondence with each of the new folders, the one or more words specified by one of the new search conditions and one of the document information groups obtained by the division.

13. The document information classification method according to claim 12, wherein the step of dividing the plurality of pieces of document information comprises the step of dividing them into document information subgroups, each matching one of the new search conditions obtained by dividing the search condition stored for the one folder.

14. The document information classification method according to claim 10, wherein the reconstructing step comprises the steps of: selecting a part of the plurality of pieces of document information registered in the one folder and a part of the plurality of words specified by the search condition stored for that folder; arranging at least one new folder in the hierarchy below the one folder; and storing, in correspondence with the new folder, the selected part of the words and the selected part of the document information.

15. The document information classification method according to claim 14, wherein the step of selecting the part of the plurality of pieces of document information and the part of the words comprises the steps of: separating the plurality of pieces of document information into a part of the document information that can be retrieved by a part of the word group stored for the one folder but cannot be retrieved by the other words of that word group, and other document information that can be retrieved by any of the other words; and selecting the part of the document information obtained by the separation and the part of the words used for the separation.

16. The document information classification method according to claim 10, wherein the reconstructing step comprises the steps of: creating a new folder; registering, in correspondence with the new folder, a plurality of pieces of document information registered in duplicate in the one folder and another folder; deleting the registration of the duplicated pieces of document information in the one folder and the other folder; and storing, in correspondence with the new folder, a word group for retrieving the duplicated pieces of document information, taken from the word groups stored for the one folder and the other folder.

17. The document information classification method according to claim 16, wherein the new folder is arranged at the same level of the hierarchy as the one folder and the other folder.

18. The document information classification method according to claim 10, wherein the determining of one or more folders with which the document information is to be associated comprises the steps of: detecting a matching degree between the document information to be classified and each folder, based on the word group specified by the search condition stored for each folder; and determining, from among the plurality of folders, one or more folders in which the document information is to be registered, based on the detected matching degree between each folder and the document information and on the hierarchical relationship.

19. A document information classification method for automatically classifying document information by a computer, comprising the steps of: storing a plurality of search conditions, each corresponding to one of a plurality of folders associated with one another in a hierarchical relationship and each specifying one or more words to be searched for; determining, based on the search condition stored for each folder and a predetermined judgment criterion, one or more of the plurality of folders as folders with which document information to be classified is to be associated; storing the document information in correspondence with the determined folders; performing the determining and the storing for each of a plurality of pieces of document information to be classified; judging whether the plurality of pieces of document information stored for some of the plurality of folders satisfy a predetermined condition defined for reconstruction of those folders; and, when those folders satisfy the predetermined condition, reconstructing the plurality of pieces of document information stored for those folders and the plurality of search conditions stored for those folders.

20. The document information classification method according to claim 19, wherein the predetermined condition includes a condition relating to an upper folder and a lower folder thereof among the some of the plurality of folders, and the reconstructing comprises the steps of: redistributing the document information group registered for the lower folder and the document information group registered for the upper folder between the upper folder and the lower folder; and, after the redistribution, based on the new document information group registered for the lower folder and the new document information group registered for the upper folder, redistributing the document information group registered for the lower folder and the document information group registered for the upper folder to the upper folder and the lower folder.

21. The document information classification method according to claim 19, wherein the predetermined condition is a condition relating to the relative sizes of the number of pieces of document information registered for the lower folder and the number of pieces of document information registered for the upper folder.

22. The document information classification method according to claim 21, wherein the condition is that the number of pieces of document information registered for the lower folder is smaller than the number of pieces of document information registered for the upper folder.

23. The document information classification method according to claim 19, wherein the determining of one or more folders with which the document information is to be associated comprises the steps of: detecting a matching degree between the document information to be classified and each folder, based on the search condition stored for each folder; and determining, from among the plurality of folders, one or more folders with which the document information is to be associated, based on the detected matching degree between each folder and the document information and on the hierarchical relationship.

24. A document information collection method in a computer system having a storage device that holds a database and a computer that selectively retrieves user-specified document information from the database, the method comprising the steps of: storing a plurality of search conditions, each corresponding to one of a plurality of user-specified folders associated with one another in a hierarchical relationship and each specifying one or more words to be searched for; monitoring whether there is document information newly registered in the database; when there is newly registered document information, determining a matching degree between that document information and each search condition; and determining one or more folders with which the document information is to be associated, based on the matching degree between each search condition and the document information and on the hierarchical relationship.

25. The document information collection method according to claim 24, further comprising the steps of: judging whether each of the plurality of folders satisfies a reconstruction condition relating to the plurality of pieces of document information stored for that folder; when any one of the folders satisfies the reconstruction condition, generating a new folder and generating a new search condition for retrieving at least a part of the plurality of pieces of document information stored for that one folder; and storing the new search condition and that part of the document information in correspondence with the new folder.

26. The document information collection method according to claim 24, further comprising the steps of: judging whether any of the folders satisfies a division condition for dividing the plurality of pieces of document information stored for that folder among a plurality of new folders; when one folder satisfies the division condition, generating a plurality of new folders, dividing the plurality of pieces of document information into a plurality of groups based on the document information and the search condition stored for that folder, and determining a plurality of search conditions for the groups of document information; and storing, in correspondence with one of the new folders, one of the groups of document information obtained by the division and the search condition for that group.

27. The document information collection method according to claim 24, further comprising the steps of: judging whether a set of a plurality of folders satisfies a separation condition for separately storing a plurality of pieces of document information stored in duplicate for those folders; when the set of folders satisfies the separation condition, generating a new folder and generating a search condition for retrieving the pieces of document information stored in duplicate in those folders, based on the search conditions stored for the folders of the set; storing the generated search condition in correspondence with the new folder; and re-storing the duplicated pieces of document information in correspondence with the new folder.

28. A document information collection system comprising: a first computer having means for storing a database containing document information to be provided to users; a second computer that communicates with the first computer to retrieve document information in the database; and a user-operable terminal connected to the second computer, wherein the terminal has means for sending to the second computer the names of a plurality of user-designated folders associated with one another in a hierarchical relationship and a plurality of search conditions, each corresponding to one of the folders and including a plurality of user-designated words; the second computer has means for storing the names of the folders and the search conditions thus sent, means for communicating with the first computer to detect whether there is document information newly registered in the database, means for, when there is newly registered document information, detecting the matching degree between that document information and each of the search conditions and determining one or more folders with which the document information is to be associated, based on the matching degree between each search condition and the document information and on the hierarchical relationship, means for storing the document information and its name in correspondence with each of the determined folders, and means for sending to the terminal, in response to a request from the terminal, the names of the plurality of folders and the names of the pieces of document information stored for each folder; and the terminal further has means for displaying a plurality of folders bearing the names thus sent, in a form in which the hierarchical relationship can be identified, and for displaying the names of the pieces of document information thus sent in association with the folders to which they correspond.
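
For orientation only, the matching degree of claims 2 to 4 and the folder selection of claims 6 to 9 can be sketched in Python as below. The sketch assumes a proportionality constant of 1 for the word weight in the document, a search condition given as a mapping from word to weight, and a hierarchy given as a mapping from each folder to its upper folder; these representations and all names are hypothetical and are not part of the claims.

```python
from collections import Counter


def matching_degree(doc_words, condition):
    """Sum over condition words of (occurrences in document) x (weight in condition)."""
    counts = Counter(doc_words)          # weight in document: number of appearances
    return sum(counts[word] * weight for word, weight in condition.items())


def ancestors(folder, parent):
    """Yield the upper folders of `folder`, nearest first."""
    while folder in parent:
        folder = parent[folder]
        yield folder


def select_folders(matched, parent):
    """Keep only the lowest matched folder of each upper-lower chain."""
    covered = {upper for folder in matched for upper in ancestors(folder, parent)}
    return sorted(folder for folder in matched if folder not in covered)


if __name__ == "__main__":
    # Hypothetical search condition and document.
    condition = {"net": 2, "db": 1, "ai": 3}
    print(matching_degree(["net", "net", "db"], condition))

    # Hypothetical hierarchy: B and D are below A, C is below B.
    parent = {"B": "A", "C": "B", "D": "A"}
    print(select_folders({"A", "B", "C", "D"}, parent))
```

A folder would count as matched when its matching degree reaches some threshold; that threshold is left out here because the claims do not fix one.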