JP2003281186A

JP2003281186A - Example-based search method and search system for similarity determination

Info

Publication number: JP2003281186A
Application number: JP2002322059A
Authority: JP
Inventors: Jon Su Park; ジョンスパク; Yun Jin Pi; ユンジンピ; Chin San Kim; チンサンキム; Namu Gon Son; ナムゴンソン; Jon Hieoku Lee; ジョンヒェオクリー; Oo Uu Kuon; オーウークォン
Original assignee: Posco Co Ltd; Pohang University of Science and Technology
Current assignee: Posco Holdings Inc; Pohang University of Science and Technology
Priority date: 2001-11-13
Filing date: 2002-11-06
Publication date: 2003-10-03
Anticipated expiration: 2022-11-06
Also published as: KR100685023B1; JP3735335B2; KR20030039576A

Abstract

(57) [Abstract]

[Problem] To provide an example-based document search method and search system for similarity determination that automatically search documents. By quantitatively computing the degree of similarity between the documents in an already constructed database and an example document and presenting it to the user, the method and system help find related technology that is identical or similar to the example document in a short time.

[Solution] An example-based search method, and a search system therefor, comprising: an indexing process that includes inputting existing related technical documents, representing each related technical document as a word vector through analysis of the document's specific structure, and storing the resulting word vectors; and a search process that includes inputting an example document, representing the example document as a word vector through analysis of the document's specific structure, and computing the similarity between the word vectors of the related technical documents stored in the indexing process and the vector of the example document.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to an example-based search method and search system for automatically searching documents, and more specifically to an example-based search method and search system for similarity determination that use the specific structure of a document to search for identical or similar related technologies.

[0002]

[Prior art] New technology is sometimes developed from an entirely new idea, but today it more often arises as an improvement that builds on technology in the same or a related technical field. Its content is becoming more complex and diverse, and its value is increasing. Consequently, in industry it is frequently necessary to judge the identity and/or similarity between technologies that have already been developed and published.

[0003] For example, when a company or research institute plans to develop a new technology, or during or after its development, a search is made for related technologies that are identical or similar to that technology, and the technology is then judged for identity and similarity against the related technologies found.

[0004] When a conventional keyword search system is used to judge the identity and similarity between technologies in this way, the user must understand the example technology (for example, a new technology) in detail, search for technologies related to it, and then personally judge the identity and/or similarity (differences) between the example technology and the related technologies found.

[0005] Thus, with a conventional keyword search system, the user must absorb new knowledge and grasp the document content reliably, so judging the identity and/or similarity (differences) between the example technology and the related technologies found takes a great deal of time. In addition, because the search is driven by only a few keywords, related technologies may be overlooked, and search accuracy is poor.

[0006] To solve these problems of conventional keyword search systems, techniques related to example-based retrieval have been proposed.

[0007] For example, such functions are partly implemented in software that commercial search system vendors provide as part of their search systems, such as Search 97 from Verity and Intermedia from Oracle. On the theoretical side, basic methodologies for example-based search are also described in the literature, for example by Gerard Salton (Non-Patent Document 1) and by Ricardo Baeza-Yates and Berthier Ribeiro-Neto (Non-Patent Document 2).

[0008] In general, example-based search is a method in which, to find particular documents in an information retrieval system, an example document selected by the user is used directly as the query instead of a combination of a few core keywords. That is, keywords are extracted from the document and represented as a word vector (this step is called indexing), and the word vector plays the same role as a combination of keywords.

[0009] However, the commercial search systems and the methodologies in the literature mentioned above treat keywords and documents in the same way and consider only whether a particular word occurs during indexing, so important information that indicates the subject of a document, such as where a word appears, is overlooked. In other words, because the characteristics of the document are not processed and the important parts of a document are not distinguished from the unimportant parts, search accuracy decreases.

[0010] These problems are considered to arise because a document, by its structural nature, contains many fields. In view of this, some commercial systems let the user divide a document into several fields and search using the simple similarity between the fields the user selects. However, a search that compares only parts of documents does not meet the need for precise processing of the content of the whole document.

[0011]

[Non-Patent Document 1] Gerard Salton (1989). Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley, Reading, Massachusetts.

[Non-Patent Document 2] Ricardo Baeza-Yates & Berthier Ribeiro-Neto (1999). Modern information retrieval. Addison-Wesley, Reading, Seoul.

[0012]

[Problems to be solved by the invention] The present inventors conducted repeated research to solve these problems of the prior art and propose the present invention on the basis of the results. An object of the present invention is to provide an example-based search method and search system for similarity determination that display identical and/or similar related technologies together with their degrees of similarity, so that the identity and/or similarity of technologies can be judged more quickly and accurately.

[0013]

[Means for solving the problems] The present invention is described below. The present invention relates to an example-based search method comprising: an indexing process that includes inputting related technical documents, representing each related technical document as a word vector through analysis of the document's specific structure, and storing the resulting word vectors; and a search process that includes inputting an example document, representing the example document as a word vector through analysis of the document's specific structure, and computing the similarity between the word vectors of the related technical documents stored in the indexing process and the word vector of the example document.

[0014] Further, a preferred example-based search method for similarity determination according to the present invention comprises an example-based indexing process and an example-based search process (claim 1). The example-based indexing process includes: inputting related technical documents; dividing each input related technical document into paragraphs according to the structural characteristics of the document and extracting keywords from each paragraph; computing, for the keywords extracted from each paragraph, their weights within that paragraph and representing the keywords and weights of each paragraph as a word vector; and storing the keywords and weights represented as word vectors. The example-based search process includes: inputting an example document that describes the example technology; dividing the input example document into paragraphs according to the structural characteristics of the document and extracting keywords from each paragraph; computing, for the keywords extracted from each paragraph, their weights within that paragraph and representing the keywords and weights of each paragraph as a word vector; computing the similarity between corresponding paragraphs of the example document and each related technical document from the per-paragraph word vectors of the example document and the per-paragraph word vectors of the related technical documents stored in the indexing process, and computing the similarity between the example document and the related technical document from these inter-paragraph similarities; and sorting the related technical documents in descending order of the computed similarity and presenting them to the user.
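The two-level similarity described in this paragraph (first between corresponding paragraphs, then between whole documents) can be illustrated with a small Python sketch. The text here does not state how the inter-paragraph similarities are combined, so the weighted average below, the section names, and the `section_weights` values are all assumptions made only for illustration.

```python
def dot(vec_a, vec_b):
    """Inner product of two sparse word vectors stored as {word: weight}."""
    return sum(w * vec_b.get(word, 0.0) for word, w in vec_a.items())


def document_similarity(example_sections, related_sections, section_weights,
                        similarity=dot):
    """Combine per-paragraph similarities into one document-level score.

    ASSUMPTION: a weighted average over the paragraphs present in both
    documents. The combination rule is hypothetical, not taken from the text.
    """
    total = 0.0
    weight_sum = 0.0
    for name, weight in section_weights.items():
        if name in example_sections and name in related_sections:
            total += weight * similarity(example_sections[name],
                                         related_sections[name])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Hypothetical per-paragraph word vectors for one example/related pair.
example = {"claims": {"steel": 1.0}, "abstract": {"coating": 1.0}}
related = {"claims": {"steel": 2.0}, "abstract": {"glass": 1.0}}
weights = {"claims": 0.75, "abstract": 0.25}

score = document_similarity(example, related, weights)
```

Passing the paragraph-level similarity as a parameter keeps the sketch independent of whichever measure is actually used for corresponding paragraphs.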

[0015] The present invention further relates to an example-based search system for similarity determination comprising: an index unit that includes a related technical document input unit for inputting related technical documents, a related technical document representation unit that represents each related technical document input through the input unit as a word vector through structural analysis of the document, and a related technical document storage unit that stores the word vectors produced by the representation unit; an example document input unit for inputting an example document that describes the example technology; an example document representation unit that represents the example document input through the example document input unit as a word vector through structural analysis of the document; a similarity calculation unit that computes the similarity to the example document using the word vectors of the related technical documents stored in the related technical document storage unit and the word vector of the example document produced by the example document representation unit; and a display unit that sorts the related technical documents in descending order of the similarity computed by the similarity calculation unit and presents them to the user.

[0016] Further, a preferred example-based search system for similarity determination according to the present invention comprises an example-based index unit and an example-based search unit (claim 6). The example-based index unit includes: a related technical document input unit for inputting related technical documents; a first keyword extraction unit that divides each related technical document input through the input unit into paragraphs according to the structural characteristics of the document and extracts keywords from each paragraph; a first word vector representation unit that computes, for the keywords extracted from each paragraph by the first keyword extraction unit, their weights within that paragraph and represents the keywords and weights of each paragraph as a word vector; and a word vector storage unit that stores the keywords and weights represented as word vectors by the first word vector representation unit. The example-based search unit includes: an example document input unit for inputting an example document that describes the example technology; a second keyword extraction unit that divides the example document input through the example document input unit into paragraphs according to the structural characteristics of the document and extracts keywords from each paragraph; a second word vector representation unit that computes, for the keywords extracted from each paragraph by the second keyword extraction unit, their weights within that paragraph and represents the keywords and weights of each paragraph as a word vector; a similarity calculation unit that computes the similarity between corresponding paragraphs of the example document and each related technical document from the per-paragraph word vectors of the example document produced by the second word vector representation unit and the per-paragraph word vectors of the related technical documents stored in the word vector storage unit, and computes the similarity between the example document and the related technical document from these inter-paragraph similarities; and a display unit that sorts the related technical documents in descending order of the similarity computed by the similarity calculation unit and presents them to the user.

[0017] The present invention is described in detail below. As used herein, "example technology" means the technology for which one wishes to determine whether similar technology exists, and "related technology" means every technology other than the example technology. Related technology includes not only technology made public before the example technology but also technology made public after it.

[0018] As used herein, "similarity determination" means determining whether the example technology is identical and/or similar to a related technology. For example, when the example technology is an invention for which a patent is sought, "similarity determination" means determining whether the invention is identical and/or similar to inventions (devices, technologies, and the like) described in patent documents filed before or after the completion date and/or filing date of the invention, or in publications distributed before or after that date. It applies to judging patent requirements premised on identity or similarity (novelty, inventive step, and prior-application relationships), deciding whether to file an application, judging patent infringement, and the like.

[0019] As used herein, when the identity and/or similarity between documents describing technologies (including inventions, devices, and the like) is judged, "example document" means the document describing the technology whose identity and/or similarity to other related technologies is to be determined, and "related technical document" means a document describing a related technology other than that technology.

[0020] Representative examples of example documents and related technical documents include general technical literature and technical materials; patent documents (patent specifications and the like) in which inventions are described in accordance with the description requirements of each national patent office; and application documents whose content satisfies those requirements only partly or not at all (invention disclosure forms, proposals, and the like). The patent documents include unpublished specifications of pending applications, patent or utility model laid-open publications, patent or utility model examined publications, and patent or utility model registration publications in which an invention or device is described. The application documents include documents summarizing research topics, documents summarizing research results, and documents summarizing completed technical content (employee invention disclosure forms, proposals, and the like).

[0021]

[Embodiments of the invention] Preferred embodiments of the present invention are described below with reference to the accompanying drawings. The present invention provides an example-based search method and search system for similarity determination that quantitatively compute the degree of similarity between the documents in an already constructed database and an example document and present it to the user, thereby helping to find related technology identical or similar to the example document in a short time.

[0022] FIG. 1 shows the overall configuration of an example-based search system for similarity determination according to the present invention. As shown in FIG. 1, the example-based search system 100 according to the present invention is broadly divided, like a general information retrieval system, into an index unit 110 and a search unit 120. The index unit 110 includes a related technical document input unit 111 to which related technical documents are input, a related technical document representation unit 112 that represents each related technical document through structural analysis of the document, and a related technical document storage unit 113 that stores the represented documents. The search unit 120 includes an example document input unit 121 to which an example document is input, an example document representation unit 122 that represents the example document through analysis of the document's specific structure, and a similarity calculation unit 123.

[0023] In the present invention, "indexing" means the process of recording related technical documents in the example-based system in advance, in a structure that is easy to search, so that they can be retrieved. "Search" means the process of analyzing the example document (example technology) that the user presents for purposes such as similarity determination against related technical documents, and retrieving technically similar documents from the indexed related technical documents.

[0024] Document retrieval and information retrieval are generally based on a theory called the vector space model. In the present invention, too, indexing and search are built on the vector space model.

[0025] To explain the present invention, an example-based search system based on the vector space model is first described with reference to FIG. 2. As shown in FIG. 2, in a general example-based search system 200, both the index unit 210 and the search unit 220 pass through a document representation process, a common process that represents the given example document and related documents.

[0026] In an example-based search system based on the vector space model, every document is represented as a vector of words. If n is the number of words that appear in the set of stored documents, document D_i is represented as an n-dimensional word vector (W_{i,1}, W_{i,2}, ..., W_{i,n}).
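As a rough illustration of this representation, the Python sketch below builds a vocabulary of n words from a stored document set and maps one document to an n-dimensional vector. Whitespace tokenization, the function names, and the use of raw counts as placeholder weights are assumptions made for the example only.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted list of distinct words in the stored documents."""
    return sorted({word for doc in documents for word in doc.split()})


def to_word_vector(document, vocabulary, weight_fn=float):
    """Represent one document as the vector (W_i1, ..., W_in).

    ASSUMPTION: the weight is derived from the raw word count through
    weight_fn; a real system would substitute a tf-idf style weight here.
    """
    counts = Counter(document.split())
    return [weight_fn(counts[word]) for word in vocabulary]


docs = ["steel sheet coating", "steel plate rolling"]
vocab = build_vocabulary(docs)           # n = 5 distinct words
vector = to_word_vector(docs[0], vocab)  # one entry per vocabulary word
```

Words that do not occur in the document receive a weight of zero, so most entries of such a vector are zero for a large vocabulary.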

[0027] Here, W_{i,j} is the weight of word T_j for document D_i. In general document retrieval, the weight W_{i,j} of word T_j for document D_i is computed from the term frequency (tf) and the inverse document frequency (idf).

[0028] The term frequency tf_{i,j} of word T_j in document D_i is the number of times T_j appears in D_i, and is a measure of how well T_j represents the content of the document.

[0029] The inverse document frequency of word T_j, on the other hand, is the inverse of the document frequency, which is the proportion of documents in the document set in which T_j appears. The fewer the documents in which T_j appears, the better T_j distinguishes those documents from the others. The inverse document frequency of T_j therefore serves as a measure of how well the word discriminates between documents.

[0030] Various methods of computing word weights from the term frequency and the inverse document frequency have been studied. The present invention uses the word weighting method of the widely known INQUERY system. The weight W_{i,j} of word T_j for document D_i is computed by equation (1) below.

[Equation 1]

(where tf_{i,j} is the frequency of word T_j in document D_i, max_tf is the frequency of the most frequent word in document D_i, N is the total number of documents, and n is the number of documents in which word T_j appears)

[0031] Once the weight of each word appearing in a document has been computed with equation (1), the document can be represented by its words and their weights.
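To make the roles of tf_{i,j}, max_tf, N and n concrete, here is a minimal Python sketch of a tf-idf style weight. It is not equation (1): the formula used, (tf / max_tf) * log(N / n), is a simple stand-in chosen for illustration, and the function name and token lists are hypothetical.

```python
import math
from collections import Counter


def word_weights(doc_tokens, all_docs_tokens):
    """Return {word: weight} for one document of the collection.

    ASSUMPTION: weight = (tf / max_tf) * log(N / n). This is a generic
    normalized tf-idf, not the INQUERY formula referred to as equation (1).
    The document itself must belong to all_docs_tokens so that n >= 1.
    """
    n_docs = len(all_docs_tokens)            # N: total number of documents
    counts = Counter(doc_tokens)             # tf for every word
    max_tf = max(counts.values())            # frequency of the commonest word
    weights = {}
    for word, tf in counts.items():
        # n: number of documents in which the word appears
        df = sum(1 for tokens in all_docs_tokens if word in tokens)
        weights[word] = (tf / max_tf) * math.log(n_docs / df)
    return weights


corpus = [["steel", "steel", "coating"], ["steel", "rolling"]]
weights = word_weights(corpus[0], corpus)
```

In this toy corpus "steel" occurs in every document, so its weight is zero, while "coating" occurs in only one document and keeps a positive weight.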

[0032] As shown in FIG. 2, a related technical document input to the related technical document input unit 211 of the index unit 210 is first represented by words and word weights in the first document representation unit 212. The document storage unit 213 then stores this representation in an inverted index file structure, so that it is recorded in the system in a form suited to fast and easy retrieval. Inverted index files have traditionally been used in information retrieval.
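An inverted index maps each word to the documents that contain it, which lets a search touch only the documents sharing at least one word with the query. The sketch below is a minimal in-memory version in Python; the posting layout (document id plus weight) and the sample data are assumptions, and an on-disk file format is not modeled.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each word to a list of (doc_id, weight) postings.

    doc_vectors: {doc_id: {word: weight}} as produced by a document
    representation step (hypothetical input format).
    """
    index = defaultdict(list)
    for doc_id, weights in doc_vectors.items():
        for word, weight in weights.items():
            if weight > 0:                 # zero weights carry no information
                index[word].append((doc_id, weight))
    return dict(index)


stored = {
    "D1": {"steel": 0.2, "coating": 0.9},
    "D2": {"steel": 0.2, "rolling": 0.7},
}
index = build_inverted_index(stored)
```

Looking up `index["steel"]` returns postings for both documents, whereas `index["coating"]` returns only the posting for D1.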

[0033] In the search unit 220 of the example-based search system, as shown in FIG. 2, the example document input to the example document input unit 221 is represented in the second document representation unit 222 as a vector of words and their weights computed with equation (1). The document-to-document similarity calculation unit 223 then compares this vector with the vector representations of the related documents already stored in the document storage unit 213 to compute their similarity. Finally, the display unit sorts the related documents whose similarity is greater than 0 in order of similarity and presents them to the user.
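The search flow in this paragraph (score every stored document, keep those with similarity greater than 0, sort by similarity) can be sketched in Python as follows. The function names and sample vectors are hypothetical, and a plain inner product is used as a placeholder similarity measure.

```python
def dot(vec_a, vec_b):
    """Inner product of two sparse word vectors stored as {word: weight}."""
    return sum(w * vec_b.get(word, 0.0) for word, w in vec_a.items())


def rank_related(example_vec, stored_vecs, similarity=dot):
    """Return (doc_id, score) pairs with score > 0, highest score first."""
    scored = [(doc_id, similarity(vec, example_vec))
              for doc_id, vec in stored_vecs.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


stored = {
    "D1": {"steel": 1.0},
    "D2": {"steel": 1.0, "coating": 2.0},
    "D3": {"glass": 1.0},
}
example = {"steel": 1.0, "coating": 1.0}
ranked = rank_related(example, stored)
```

D3 shares no word with the example document, so its similarity is 0 and it is left out of the result list.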

[0034] The similarity sim(D_x, D_i) between the given example document D_i and each related document D_x stored in the indexing process is traditionally computed with the cosine similarity of equation (2) below and the inner product similarity of equation (3) below.

[0035]

[Equation 2]

(where W_{x,j} is the weight of word T_j for document D_x, and W_{i,j} is the weight of word T_j for document D_i)

[0036]

[Equation 3]

(where W_{x,j} is the weight of word T_j for document D_x, and W_{i,j} is the weight of word T_j for document D_i)
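For orientation, the following Python sketch implements the textbook definitions of the two measures named above: the inner product sums W_{x,j} * W_{i,j} over all words, and the cosine similarity divides that sum by the product of the two vector lengths. It is assumed, not confirmed by the text available here, that equations (2) and (3) take these standard forms; the sparse dictionary layout is also an assumption.

```python
import math


def inner_product_similarity(wx, wi):
    """Sum of W_xj * W_ij over all words; vectors are {word: weight}."""
    return sum(w * wi.get(word, 0.0) for word, w in wx.items())


def cosine_similarity(wx, wi):
    """Inner product divided by the product of the two vector lengths."""
    norm_x = math.sqrt(sum(w * w for w in wx.values()))
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    if norm_x == 0 or norm_i == 0:
        return 0.0                          # an empty vector matches nothing
    return inner_product_similarity(wx, wi) / (norm_x * norm_i)


d_x = {"steel": 3.0, "coating": 4.0}
d_i = {"steel": 3.0, "coating": 4.0}
d_other = {"glass": 2.0}
```

The cosine measure is independent of document length, while the inner product grows with the magnitude of the weights, which is one reason the two can rank the same documents differently.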

[0037] As can be seen from FIGS. 1 and 2, the present invention improves the document representation step of the indexing and search processes and the similarity computation step of the search process in the example-based search method. That is, the improvement is that, as shown in FIG. 1, document representation in both the example-based indexing process and the example-based search process is based on an understanding of the characteristic structure of the document, and the similarity in the example-based search process is computed using this structure-based document representation.

[0038] FIG. 3 shows an example of a search system for example-based search according to the present invention. As shown in FIG. 3, the search system 300 of the present invention includes an example-based index unit 310 and an example-based search unit 320. The example-based index unit 310 includes a related technical document input unit 311 for inputting related technical documents, a first keyword extraction unit 312, a first word vector representation unit 313, and a word vector storage unit 314.

[0039] The first keyword extraction unit 312 is configured to divide a related technical document received by the input unit into paragraphs according to the structural characteristics of the document and to extract keywords from each paragraph. The first word vector representation unit 313 is configured to compute, for each keyword extracted by the first keyword extraction unit 312, its weight within its paragraph, and to represent the keywords and their weights as a word vector for each paragraph.

[0040] The word vector storage unit 314 is configured to store the keywords and weights that the first word vector representation unit 313 has represented as word vectors.

[0041] The example-based search unit 320 includes an example document input unit 321 for inputting an example document that describes an example technology, a second keyword extraction unit 322, a second word vector representation unit 323, a similarity calculation unit 324, and a display unit 325.

[0042] The second keyword extraction unit 322 is configured to divide an example document (for example, one describing a new technology) received by the example document input unit 321 into paragraphs according to the structural characteristics of the document and to extract keywords from each paragraph. The second word vector representation unit 323 is configured to compute, for each keyword extracted by the second keyword extraction unit 322, its weight within its paragraph, and to represent the keywords and their weights as a word vector for each paragraph.

[0043] The similarity calculation unit 324 is configured to compute the similarity between corresponding paragraphs of the example document and a related technical document, using the per-paragraph word vectors of the example document produced by the second word vector representation unit 323 and the per-paragraph word vectors of the related technical documents stored in the word vector storage unit 314, and then to compute the similarity between the example document and the related technical document from these inter-paragraph similarities. The display unit 325 is configured to sort the related technical documents in descending order of the similarity computed by the similarity calculation unit 324 and present them to the user.

[0044] A search method according to the present invention, based on the search system of FIG. 3, is described below.

[0045] To perform an example-based search according to the present invention, related technical documents are first input to the related technical document input unit 311 of the index unit 310. Each input document is then divided into paragraphs according to the structural characteristics of the document, and the first keyword extraction unit 312 extracts keywords from each paragraph. Next, the weight of each extracted keyword within its paragraph is computed, and the first word vector representation unit 313 represents the keywords and their weights as a word vector for each paragraph. The keywords and weights represented as word vectors are then stored in the word vector storage unit 314.

[0046] Next, an example document describing the example technology is input to the example document input unit 321. The input example document is divided into paragraphs according to the structural characteristics of the document, and the second keyword extraction unit 322 extracts keywords from each paragraph. The weight of each extracted keyword within its paragraph is then computed, and the second word vector representation unit 323 represents the keywords and their weights as a word vector for each paragraph. The similarity calculation unit 324 then uses the per-paragraph word vectors of the example document and the per-paragraph word vectors of the related technical documents stored during the indexing process to compute the similarity between corresponding paragraphs of the example document and each related technical document, and from these inter-paragraph similarities computes the similarity between the example document and the related technical document. Finally, the display unit 325 sorts the related technical documents in descending order of the computed similarity and presents them to the user.

[0047] FIG. 4 shows an example of the example document representation unit 122 of the present invention. It comprises a paragraph division unit 1221, which divides an example document received by the example document input unit 121 into paragraphs by structural analysis of the document, and a per-paragraph document representation unit 1222, which represents the document paragraph by paragraph (that is, sets the word weights).

[0048] Referring to FIG. 4, in the example-based search method for similarity determination of the present invention, the document representation method based on the characteristic structure of the document, which is common to the indexing process and the search process, consists broadly of two steps. In the first step, the paragraph division unit 1221 of the example document representation unit 122 divides the example document received by the example document input unit 121 into paragraphs by analyzing the characteristic (patent) structure of the document. In the second step, the per-paragraph document representation unit 1222 represents the document paragraph by paragraph (sets the word weights).

[0049] For example, when the example technology is a patent-related invention (hereinafter also called the "example invention"), the method can be carried out as follows. To represent an example document describing the example invention, each input example document is first divided into paragraphs according to the patent structure. One example is to divide it into paragraphs under the following titles.
1. Title of the invention
2. Abstract
3. Index terms: used when the document author records important keywords other than those in the title of the invention
4. Detailed description of the drawings
5. Detailed description of the invention: applied when the description is written without explicitly separating the categories "related art, object of the invention, structure, operation, effect, field of use" listed below
6. Related art and technical field of the invention
7. Technical problem to be solved by the invention (or object of the invention): some patent documents use combined titles such as "object and structure of the invention" or "object, operation and effect of the invention"; in such cases the text is classified under the first item named, "object of the invention"
8. Structure of the invention: as explained for "object of the invention", text written under a title such as "structure and operation of the invention" is classified under "structure of the invention"
9. Operation of the invention: likewise, text under a title such as "operation and effect of the invention" is classified under "operation of the invention"
10. Effect of the invention: likewise, text under a title such as "effect and field of use of the invention" is classified under "effect of the invention"
11. Field of use of the invention
12. Content whose structure cannot be identified: all content that cannot be clearly assigned to a particular patent structure is classified here; for an unstructured patent document, the entire content falls under this title
13. Each claim of the claims
When the input example document is a patent document (such as a patent specification) in an application format that satisfies the description requirements of the patent office, these paragraphs can easily be separated by their titles.

[0050] In particular, in patent documents written in SGML (Standard Generalized Markup Language), as required by the patent office, or in XML (Extensible Markup Language), paragraph titles and paragraphs are easy to identify, so the paragraphs can easily be separated.

[0051] For a patent document written in such a markup language, in which paragraphs are easily distinguished, the present invention skips the paragraph division unit 1221, and the per-paragraph document representation unit 1222 directly represents each paragraph as a word vector to represent the whole document.

[0052] When paragraphs cannot be separated this easily, a method of finding the title of each paragraph is needed in order to locate specific paragraphs in the document. For an example invention, such a method is needed to locate specific paragraphs both in documents that satisfy the description requirements of the patent office and, in particular, in documents that satisfy them only partly or not at all.

[0053] FIGS. 5 and 6 respectively show an example of a preferred system and a preferred method for finding paragraph titles in a technical document according to the present invention.

[0054] FIG. 5 shows an example of a preferred paragraph title extraction system consistent with the present invention. Referring to FIG. 5, the paragraph title extraction system 400 includes a sentence extraction unit 410, a phrase extraction unit 420, a paragraph title display unit 430, a selection unit 440, a paragraph title validity determination unit 450, and a paragraph title extraction unit 460.

[0055] The sentence extraction unit 410 is configured to extract sentences from an input related technical document or example document. The phrase extraction unit 420 is configured to extract phrases from the sentences extracted by the sentence extraction unit 410. The paragraph title display unit 430 is configured to mark the paragraph titles for which a phrase extracted by the phrase extraction unit 420 satisfies a structure determination rule.

[0056] The selection unit 440 is configured to select, for a sentence whose paragraph title marking is complete, the paragraph title matched by the largest number of phrases in the sentence. The paragraph title validity determination unit 450 is configured to determine whether the ratio of matching phrases to total phrases for the selected paragraph title is at least a fixed value, preferably 0.8. If the ratio is at least that value (0.8), the sentence is judged to be a new paragraph title; if it is below that value, the sentence is included in the related paragraph. The paragraph title extraction unit 460 is configured to extract as a paragraph title whatever the paragraph title validity determination unit 450 has judged to be one.

[0057] The paragraph title extraction system is preferably provided in both the first keyword extraction unit 312 and the second keyword extraction unit 322.

[0058] The process of extracting paragraph titles from example documents and related technical documents with the paragraph title extraction system of FIG. 5 is described with reference to FIG. 6.

[0059] Referring to FIG. 6, to find the paragraph titles of an example document or related technical document, the sentence extraction unit 410 first extracts a sentence from the input document (step 510), and the phrase extraction unit 420 then extracts phrases from the extracted sentence (step 520). In the present invention, the unit of extraction for a sentence is preferably the unit delimited by return characters. Next, the paragraph title display unit 430 marks and accumulates the paragraph titles for which an extracted phrase satisfies a paragraph determination rule (step 530). Phrase extraction and paragraph title marking are repeated until the end of the extracted sentence (step 540).

[0060] When paragraph title marking is complete for the extracted sentence, the selection unit 440 selects the paragraph title matched by the largest number of phrases in the sentence (step 550). The paragraph title validity determination unit 450 then determines whether the ratio of matching phrases to total phrases for the selected title is at least a fixed value (0.8 in FIG. 6). If the ratio is 0.8 or more, the sentence is judged to be a new paragraph title; if it is below 0.8, the sentence is included in the related paragraph (step 560). Whatever the paragraph title validity determination unit 450 has judged to be a paragraph title is extracted (generated) as a paragraph title by the paragraph title extraction unit 460 (step 570). Repeating this process through the last sentence of the document checks the paragraph titles of the whole example document (step 580).
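Steps 510 to 580 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes that phrases are whitespace-separated tokens, that the caller supplies a hypothetical `count_matches` function returning how many phrases matched each candidate title, and that lines seen before any title go under a placeholder title.

```python
def decide_title(num_phrases, matches_per_title, threshold=0.8):
    """Return the winning paragraph title, or None if the line is body text."""
    if num_phrases == 0 or not matches_per_title:
        return None
    # Step 550: pick the title matched by the most phrases.
    best = max(matches_per_title, key=matches_per_title.get)
    # Step 560: accept it only if matched/total reaches the threshold.
    ratio = matches_per_title[best] / num_phrases
    return best if ratio >= threshold else None


def segment(lines, count_matches, default_title="unstructured"):
    """Group lines under the most recently accepted title (steps 510 to 580)."""
    paragraphs = {}
    current = default_title
    for line in lines:              # one "sentence" per return-delimited line
        phrases = line.split()      # assumption: phrases are whitespace-separated
        title = decide_title(len(phrases), count_matches(phrases))
        if title is not None:
            current = title         # step 570: a new paragraph starts here
            paragraphs.setdefault(current, [])
        else:
            paragraphs.setdefault(current, []).append(line)
    return paragraphs
```

A line that mentions a title word in passing is kept as body text, because one matching phrase out of many falls well below the 0.8 ratio.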

[0061] The following describes an example of the process of checking whether an extracted sentence is a paragraph title that signals the start of a new paragraph, that is, the process of extracting paragraph titles from extracted sentences.

[0062] A preferred form of the paragraph determination rule used to check for a paragraph title is as follows.
[paragraph name] [clue word set (mutually OR-related)] [match degree] [necessity] ＄
・[paragraph name]: the specific paragraph indicated when the rule matches.
・[clue word set]: a list of clue words. When several words are listed together, a match with any one of them suffices; that is, they are in an OR relationship.
・[match degree]: there are three kinds. "＋" means that the input phrase must match a word in the clue word set exactly. "−" means that the input phrase must match a clue word partially. "＝" means that the paragraph is matched for certain as soon as the clue word appears; that is, the sentence is clearly the first sentence of that structure without any need to apply other rules.
・[necessity]: there are two types. "ｙ" means that the rule must be satisfied for the sentence to be recognized as that structure. "ｎ" means that the rule is not required for recognition as that structure but may be satisfied.
・＄: a delimiter marking the end of a rule.
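The rule layout described above maps naturally onto a small record type. The sketch below is only an illustration under stated assumptions: rules are written with ASCII symbols (`+`, `-`, `=`, `y`, `n`, `$`), clue words are whitespace-separated inside the braces, and the names `Rule` and `parse_rules` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    paragraph: str   # [paragraph name]
    clues: tuple     # [clue word set], OR-related
    match: str       # "+" exact, "-" partial, "=" decisive
    required: bool   # True for "y", False for "n"


# One rule: name { clue words } match-degree necessity $
RULE_RE = re.compile(r"(\S+)\s*\{([^}]*)\}\s*([+\-=])\s*([yn])\s*\$")


def parse_rules(text):
    """Parse every rule found in text, in order of appearance."""
    return [
        Rule(m.group(1), tuple(m.group(2).split()), m.group(3), m.group(4) == "y")
        for m in RULE_RE.finditer(text)
    ]
```

Keeping the necessity flag on each rule lets a caller check separately that every required rule for a paragraph was satisfied by some phrase.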

[0063] To check whether an extracted sentence is a paragraph title, phrases are first extracted from the sentence, and each extracted phrase is compared with each rule to see whether it matches. For example, the rules for identifying the "detailed description of the invention" structure are as follows.
６ { 図面図案図名面の図面の簡単な図 } − ｙ ＄
６ { 添付内容説明名称氏名簡単化説明図書名構成 } − ｙ ＄
６ { 簡単な詳細な } ＋ ｎ ＄
６ { 考案発明 } ＋ ｎ ＄
６ { 各本 } − ｎ ＄
６ { 主要 } − ｎ ＄
６ { 対する } ＋ ｎ ＄
６ { 符合 } − ｎ ＄
６ { 部分 } − ｎ ＄

[0064] In this example, "６" is the [paragraph name] field, denoting "detailed description of the drawings". The first rule lists six clue words (図面 "drawing", 図案 "design", 図名 "figure name", 面の "of the surface", 図面の簡単な "brief ... of the drawing", and 図 "figure") and states that a "partial match" between any of them and the input phrase suffices. A "partial match" is, for example, the case in which the given sentence is 図面の詳細な説明 ("detailed description of the drawings") and the clue word 図面 ("drawing") matches part of the phrase 図面の ("of the drawing").

[0065] The third rule specifies an exact match ("＋") and applies to the phrase 詳細な ("detailed") of the same input sentence. If instead the given sentence were 図面が詳細であれば説明がより… ("if the drawing is detailed, the description is more ..."), the first rule would match at 図面が ("the drawing"). The third rule, however, would not apply: although 詳細であれば ("if ... is detailed") matches partially, it does not match any clue word exactly.
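The difference between the match degrees can be sketched as a single predicate. The source does not define "partial match" precisely, so this sketch assumes the simplest reading, namely that a clue word occurs inside the phrase; it also treats "=" like a partial match and leaves its decisive effect to the caller. The function name `phrase_matches` is hypothetical.

```python
def phrase_matches(clues, mode, phrase):
    """Check one phrase against one rule's clue word set."""
    if mode == "+":
        # Exact match: the phrase must be one of the clue words.
        return phrase in clues
    if mode in ("-", "="):
        # Assumed partial match: some clue word occurs inside the phrase.
        return any(clue in phrase for clue in clues)
    raise ValueError(f"unknown match degree: {mode!r}")
```

For instance, the clue word 図面 partially matches the phrase 図面の, while the phrase 詳細であれば fails an exact match against the clue set {簡単な, 詳細な}.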

[0066] Among the rules for "detailed description of the drawings" above, the rules that must be satisfied are the first and second, which are marked "ｙ". That is, for an input sentence to be taken as the start of the "detailed description of the drawings" paragraph, in other words as its paragraph title, the sentence must contain phrases that satisfy both of these rules.

[0067] The most important rule in analyzing the characteristic structure of a document is that a sentence becomes a new paragraph title only when at least 80% of all its phrases are matched by the rules for a single paragraph. For example, if the input sentence is 図面の説明で述べたように、図面１は… ("as stated in the description of the drawings, drawing 1 ..."), the phrases for "drawing" and "description" satisfy the first and second rules above, so the sentence is a likely candidate for the paragraph title "detailed description of the invention". However, because the remaining words of the sentence are not matched by the other rules, the sentence is not analyzed as that structure, that is, as a paragraph title.

[0068] Next, words are extracted from each paragraph obtained by the characteristic structure analysis described above, their weights are computed, and the words and weights are represented as word vectors.

[0069] Unlike conventional example-based search, which represents a document as a single vector, the present invention represents a document as the vectors of its paragraphs. For example, when the example technology is a patent-related invention, the example document (patent document) can be defined by the 13 paragraphs described above, of which the 13th, the individual claims, varies in number from one patent document to another. Each patent document therefore has at least 13 vectors, and a document can be represented by 13 or more vectors.

[0070] Accordingly, when a patent document D_i is written as the paragraph set (F_{i1}, F_{i2}, ..., F_{im}), m is 13 or more.

[0071] Each paragraph F_{ij} is in turn represented by an n-dimensional word vector (W_{ij,1}, W_{ij,2}, ..., W_{ij,n}), where W_{ij,q} is the weight of word T_q in paragraph F_{ij} of document D_i. The conventional weight formula of equation (1) above must therefore be changed to equation (4) below.

[0072]

[Equation 4]

(where tf_{ij,q} is the frequency of word T_q in paragraph F_{ij} of document D_i; max_tf is the frequency of the most frequent word in paragraph F_{ij} of document D_i; N is the total number of documents; and n is the number of documents in which word T_q appears)
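The quantities that equation (4) is defined over can be gathered in one pass. The sketch below deliberately stops short of combining them into a weight, since the exact form of equation (4) is not reproduced in this text. It assumes each document is a list of paragraphs and each paragraph a list of already extracted words; the function name `weight_inputs` is hypothetical.

```python
from collections import Counter


def weight_inputs(docs):
    """Collect N, n per word, and (tf, max_tf) per paragraph of every document."""
    total_docs = len(docs)                       # N: total number of documents
    doc_freq = Counter()                         # n: documents containing each word
    for doc in docs:
        doc_freq.update({w for para in doc for w in para})
    stats = []
    for doc in docs:
        doc_stats = []
        for para in doc:
            tf = Counter(para)                   # tf_{ij,q} for every word T_q
            max_tf = max(tf.values(), default=0) # most frequent word in the paragraph
            doc_stats.append((tf, max_tf))
        stats.append(doc_stats)
    return total_docs, doc_freq, stats
```

Because tf and max_tf are counted per paragraph rather than per document, the same word can receive different weights in different paragraphs of one document.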

[0073] The document representation method based on characteristic structure analysis is used by both the example-based index unit and the search unit of the present invention. In example-based indexing, this representation is preferably stored in the related technical document storage unit of the index unit as an inverted index file structure, as in conventional methods, so that it can be searched quickly at search time.
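An inverted index over per-paragraph vectors could look like the following minimal in-memory sketch. The posting layout (document id, paragraph number, weight) and the function names are assumptions for illustration; the source only states that an inverted index file structure is preferred.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each word to postings of (doc_id, paragraph_number, weight)."""
    index = defaultdict(list)
    for doc_id, paragraphs in doc_vectors.items():
        for j, vector in enumerate(paragraphs, start=1):
            for word, weight in vector.items():
                index[word].append((doc_id, j, weight))
    return index


def candidate_documents(index, query_vector):
    """Documents sharing at least one word with the query paragraph vector."""
    return {
        doc_id
        for word in query_vector
        for doc_id, _, _ in index.get(word, [])
    }
```

Looking up candidates by word first means that similarity only has to be computed for documents that share vocabulary with the example document.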

[0074] The example-based search process consistent with the present invention includes a step in which the similarity calculation unit of the search unit determines technical similarity, as shown in FIG. 1 and described below. For example, when the patentability of a new invention is being judged, technical similarity, that is, patent similarity, is determined as follows.

[0075] To make possible a patent similarity calculation that is useful for judging patentability, the example-based search process of the present invention follows the logical procedure by which a patent specialist judges patentability.

[0076] To judge the patentability of a new invention, a patent specialist normally first concludes that two patents coincide when a claim appears in the same form in the related patent. Next, if the objects, effects, and titles of the two patents are similar, the specialist regards the patents as similar, and then checks whether the claims and the structure and operation of the invention are similar. The remaining content is examined afterwards.

[0077] Accordingly, in the present invention the patent similarity of two patent documents is preferably determined as follows. If even one claim of the two patents matches 100%, the patent similarity of the two patents is set to 100% unconditionally. Otherwise, the more similar their important paragraphs are, the higher the patent similarity between the two patents is taken to be.

[0078] For example, when the example document and the related technical document are patent documents (patent specifications), their similarity is preferably determined as follows.

[0079] If the claims paragraphs of the two documents contain even one identical claim, the two patents are judged to be identical. If they contain no identical claim, similarity is determined as follows.

[0080] The similarities between the corresponding "title of the invention", "object of the invention", and "effect of the invention" paragraphs of the two documents are computed, the highest of them is chosen as the representative similarity of this group, and it is given the highest weight. The similarities between the corresponding "abstract", "structure of the invention", "operation of the invention", and "claims" paragraphs are computed, the highest of them is chosen as the representative similarity of this group, and it is given an intermediate weight. The similarities between the remaining paragraphs are computed, their average is chosen as the representative similarity, and it is given the lowest weight. The representative similarities, each multiplied by its weight, are summed, and the resulting values are compared to judge similarity.

[0081] The similarity between paragraphs according to the present invention can be computed with the similarity formulas obtained from equations (2) and (3) by replacing the document vectors with paragraph vectors. In the present invention, the cosine similarity formula of equation (2) is preferably used to compute inter-paragraph similarity.

[0082] The similarity sim_F(F_{ij}, F_{pq}) between the j-th paragraph F_{ij} of example document D_i and the q-th paragraph F_{pq} of related technical document D_p can be defined by equation (5) below.

[Equation 5]

$$\mathrm{sim}_F(F_{ij}, F_{pq}) = \frac{\sum_{l=1}^{n} W_{ij,l}\, W_{pq,l}}{\sqrt{\sum_{l=1}^{n} W_{ij,l}^{2}}\;\sqrt{\sum_{l=1}^{n} W_{pq,l}^{2}}}$$

(where W_{ij,l} is the weight of word T_l in the j-th paragraph F_{ij} of document D_i, and W_{pq,l} is the weight of word T_l in the q-th paragraph F_{pq} of document D_p)

[0083] Because equation (5) is a cosine similarity, the inter-paragraph similarity sim_F(F_{ij}, F_{pq}) always lies between 0 and 1. A similarity sim_F(F_{ij}, F_{pq}) of 1 means that the two paragraph vectors match each other 100%.

[0084] In the paragraph notation F_ij, the index j matches the paragraph order obtained from the paragraph identification described above. Thus, for example, F_i1 denotes the "title of invention" paragraph of patent document D_i, and F_i2 denotes the "abstract" paragraph of patent document D_i.

[0085] Next, after the inter-paragraph similarities have been obtained as described above, it is preferable to use them to obtain the technical similarity sim_P(D_i, D_p) between the given example document (example technology) D_i and the related technical document (related technology) D_p by the following equation (6).

[Equation 6]

[0086] For example, when the example document and the related technical document are patent documents, the first term of equation (6) expresses mathematically that, if any of the claims of the two patents match, the patent similarity sim_P(D_i, D_p) between the two patents is 1 and the two are regarded as completely identical in patentability. The second term means that, otherwise, the similarity is obtained from the paragraph importance described above. Here, α, β, and μ indicate the importance of each paragraph group, so the sum of α, β, and μ must always be 1. In the present invention, α is set to 0.5, β to 0.3, and μ to 0.2 on the basis of experiments.
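Equation (6) is not reproduced in this text. Based only on the description of its two terms, a piecewise form such as the following would be consistent with it. The group symbols G_1, G_2, and G_3 (the highest-weight group, the next group, and the remaining paragraphs) and the shorthand s_j are notation introduced here for illustration and are not taken from the original.

```latex
\mathrm{sim}_P(D_i, D_p) =
\begin{cases}
1, & \text{if any claim of } D_i \text{ matches a claim of } D_p,\\[4pt]
\alpha \max\limits_{j \in G_1} s_j
 + \beta \max\limits_{j \in G_2} s_j
 + \mu \dfrac{1}{|G_3|} \sum\limits_{j \in G_3} s_j, & \text{otherwise,}
\end{cases}
\qquad s_j = \mathrm{sim}_F(F_{ij}, F_{pj}), \quad \alpha + \beta + \mu = 1
```

Because each s_j lies between 0 and 1 and the weights sum to 1, the second case also lies between 0 and 1.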

[0087] Therefore, the second term is, for example, an expression that obtains the patent similarity between the two patents as the sum of three values: the highest of the similarities for "object of invention", "effect of invention", and "title of invention" multiplied by the weight 0.5; the highest of the similarities for "abstract", "composition of invention", "action of invention", and "claims" multiplied by the weight 0.3; and the average similarity of the remaining paragraphs multiplied by 0.2.
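The weighted combination described above can be sketched as follows. The function name, the English paragraph titles used as dictionary keys, the choice to compare only same-titled paragraphs, and the sample similarity values are all assumptions made for illustration. Only the weights 0.5, 0.3, and 0.2 and the grouping come from the text.

```python
ALPHA, BETA, MU = 0.5, 0.3, 0.2  # group weights stated in the text; they sum to 1

# Paragraph groups as described in the text (English titles are illustrative).
GROUP_1 = ("title of invention", "object of invention", "effect of invention")
GROUP_2 = ("abstract", "composition of invention", "action of invention", "claims")

def document_similarity(paragraph_sims, claims_match=False):
    """Combine per-paragraph similarities (title -> value in [0, 1]) into one score."""
    if claims_match:
        return 1.0  # first term: any matching claim makes the documents identical
    group_1 = [paragraph_sims[t] for t in GROUP_1 if t in paragraph_sims]
    group_2 = [paragraph_sims[t] for t in GROUP_2 if t in paragraph_sims]
    rest = [s for t, s in paragraph_sims.items()
            if t not in GROUP_1 and t not in GROUP_2]
    top_1 = max(group_1, default=0.0)   # representative value: highest in group
    top_2 = max(group_2, default=0.0)
    avg_rest = sum(rest) / len(rest) if rest else 0.0  # representative value: mean
    return ALPHA * top_1 + BETA * top_2 + MU * avg_rest

sims = {
    "title of invention": 0.4, "object of invention": 0.8, "effect of invention": 0.6,
    "abstract": 0.5, "claims": 0.7,
    "detailed description": 0.2, "brief description of drawings": 0.4,
}
score = document_similarity(sims)  # 0.5 * 0.8 + 0.3 * 0.7 + 0.2 * 0.3
```

With these sample values the three contributions are 0.40, 0.21, and 0.06, so the first group dominates the score, as the weighting intends.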

[0088] Equation (6) is merely an example. It goes without saying that, for instance, the number of paragraph-group importance levels can be set to two, or to four or more, instead of three, and that the importance values can also be changed.

[0089] Once the technical similarities of all the related technical documents to the example document D_i have been obtained by equation (6), the documents are sorted in descending order of similarity and provided to the user. The user can then search the related technologies in order of their technical similarity to the example technology.
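The ranking step amounts to a sort on the computed similarity. A minimal sketch, in which the function name, document identifiers, and scores are hypothetical:

```python
def rank_related_documents(similarities):
    """Return (document id, similarity) pairs, most similar first."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)

ranked = rank_related_documents({"doc_a": 0.42, "doc_b": 0.91, "doc_c": 0.67})
for doc_id, score in ranked:
    print(f"{doc_id}\t{score:.2f}")
```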

[0090] The preferred embodiments of the present invention are disclosed for the purpose of illustration. Those skilled in the art can make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications and changes should be regarded as belonging to the technical scope of the present invention.

[0091]

[Effect of the invention] As described above, the present invention displays technically similar related documents together with their degree of similarity, so that similarity can be judged easily and quickly. Furthermore, when the present invention is used at the time of reporting or filing a new invention, the person in charge of patentability judgment or the inventor can compare similar related documents together with their degree of similarity, so that the patentability of the invention can be judged easily and quickly. In addition, because a document describing the technology can be supplied directly to the system of the present invention, the user does not need to acquire and understand knowledge of the technology, and the search time can be shortened significantly.

[Brief description of drawings]

FIG. 1 is an overall configuration diagram of an example-based search system for similarity determination according to the present invention.

FIG. 2 is an overall configuration diagram of a conventional example-based search system.

FIG. 3 is a configuration diagram showing an example of a preferred example-based search system according to the present invention.

FIG. 4 is a flowchart showing an example of a method of expressing a document by identifying the specific structure of the document according to the present invention.

FIG. 5 is a configuration diagram showing an example of a paragraph title extraction system that extracts paragraph titles by identifying the specific structure of a document according to the present invention.

FIG. 6 is a flowchart showing an example of a method of extracting paragraph titles by identifying the specific structure of a document according to the present invention.

[Explanation of symbols]

100, 300: search system; 110, 310: index unit; 111, 311: related technical document input unit; 112: related technical document expression unit; 113: related technical document storage unit; 120, 320: search unit; 121, 321: example document input unit; 122: example document expression unit; 123, 324: similarity calculation unit; 312: first keyword extraction unit; 313: first word vector expression unit; 314: word vector storage unit; 322: second keyword extraction unit; 323: second word vector expression unit; 325: display unit; 400: paragraph title extraction system; 410: sentence extraction unit; 420: phrase extraction unit; 430: paragraph title display unit; 440: selection unit; 450: paragraph title determination unit; 460: paragraph title extraction unit.

Continued from front page
(72) Inventor: Jon Su Park, c/o Kwangyang Iron Works, Kangho-dong 700, Dongkwangyang-si, Chollanam-do, Republic of Korea
(72) Inventor: Yun Jin Pi, c/o Kwangyang Iron Works, Kangho-dong 700, Dongkwangyang-si, Chollanam-do, Republic of Korea
(72) Inventor: Chin San Kim, c/o POSCO, Dongchon-dong 5, Nam-ku, Pohang-si, Kyongsangbuk-do, Republic of Korea
(72) Inventor: Namu Gon Son, c/o POSCO, Kodong-dong 1, Nam-ku, Pohang-si, Kyongsangbuk-do, Republic of Korea
(72) Inventor: Jon Hieoku Lee, c/o Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-ku, Pohang-si, Kyongsangbuk-do, Republic of Korea
(72) Inventor: Oo Uu Kuon, c/o Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-ku, Pohang-si, Kyongsangbuk-do, Republic of Korea
F-term (reference): 5B075 ND03 NK02 NK32 NR05 PP24 PQ02 PQ46 PQ74 PR06 QM08

Claims

[Claims]

1. An example-based search method for similarity determination, comprising an example-based indexing process and an example-based search process, wherein the example-based indexing process comprises: inputting a related technical document; dividing the input related technical document into paragraphs according to the structural characteristics of the document and extracting keywords for each divided paragraph; obtaining, for the keywords extracted from each paragraph, their weights within that paragraph, and expressing the keywords and their weights for each paragraph as a word vector; and storing the keywords and weights expressed as word vectors; and wherein the example-based search process comprises: inputting an example document in which an example technology is described; dividing the input example document into paragraphs according to the structural characteristics of the document and extracting keywords for each divided paragraph; obtaining, for the keywords extracted from each paragraph, their weights within that paragraph, and expressing the keywords and their weights for each paragraph as a word vector; obtaining the similarity between corresponding paragraphs of the example document and the related technical document using the paragraph-based word vectors of the example document and the paragraph-based word vectors of the related technical document stored in the indexing process, and obtaining the similarity between the example document and the related technical document using the inter-paragraph similarities; and arranging the related technical documents in descending order of the obtained similarity and providing them to the user.
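The indexing process of claim 1 can be pictured with a minimal sketch that stores one word vector per paragraph. Everything concrete here is an assumption made for illustration: the tokenizer, the tiny stop-word list, the nested-dictionary index, and above all the use of raw term frequency as the keyword weight, since the claim itself does not fix a weighting formula.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real keyword extractor would be richer.
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "is", "for", "to", "in"})

def paragraph_vector(text):
    """Keyword -> weight for one paragraph; term frequency stands in for the weight."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(t for t in tokens if t not in STOPWORDS))

def index_document(index, doc_id, paragraphs):
    """Store one word vector per paragraph title for a related technical document."""
    index[doc_id] = {title: paragraph_vector(text)
                     for title, text in paragraphs.items()}

index = {}
index_document(index, "related-1",
               {"abstract": "A search method for the search of documents."})
```

The same `paragraph_vector` function would be applied to the example document at search time, so that both sides are compared in the same representation.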

2. The example-based search method for similarity determination according to claim 1, wherein, in the indexing process and the search process, the related technical document and the example document are divided into paragraphs on the basis of the paragraph titles prescribed for patent documents that satisfy the description requirements of each national patent office.

3. The example-based search method for similarity determination according to claim 2, wherein the paragraph titles of the example document are extracted by: extracting a sentence from the input example document and extracting a phrase from the extracted sentence; marking the paragraph titles for which the extracted phrase satisfies a structure determination rule; repeating the phrase extraction and paragraph title marking until the end of the extracted sentence; when the paragraph title marking has been completed for the extracted sentence, selecting the paragraph title matched by the largest number of phrases in the sentence; determining whether the ratio of the number of matching phrases to the total number of phrases for the selected paragraph title is 0.8 or more, judging the sentence to be a new paragraph title if the ratio is 0.8 or more, and including it in the relevant paragraph if the ratio is less than 0.8; and repeating the above steps up to the final sentence of the document.
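The sentence-by-sentence procedure of claim 3 can be sketched as below. The cue words, the word-level tokenizer standing in for phrase extraction, the exact-match rule, and the function names are hypothetical. Only the 0.8 ratio threshold and the overall flow come from the claim.

```python
import re
from collections import Counter

# Hypothetical cue words per paragraph title; the actual rule set is not given here.
CUE_WORDS = {
    "abstract": {"abstract"},
    "claims": {"claims"},
}

def detect_title(sentence, threshold=0.8):
    """Return the paragraph title a sentence announces, or None if it is body text."""
    phrases = re.findall(r"\w+", sentence.lower())
    # Mark every paragraph title whose cue words match a phrase of the sentence.
    hits = Counter(title
                   for phrase in phrases
                   for title, cues in CUE_WORDS.items()
                   if phrase in cues)
    if not hits:
        return None
    title, matched = hits.most_common(1)[0]  # title matched by the most phrases
    return title if matched / len(phrases) >= threshold else None

def split_into_paragraphs(sentences):
    """Group sentences under the most recently detected paragraph title."""
    paragraphs, current = {}, None
    for sentence in sentences:
        title = detect_title(sentence)
        if title is not None:
            current = title
            paragraphs.setdefault(current, [])
        elif current is not None:  # text before the first title is skipped here
            paragraphs[current].append(sentence)
    return paragraphs

result = split_into_paragraphs([
    "Abstract",
    "A search method is provided.",
    "Claims",
    "1. A method comprising an abstract step.",
])
```

The last sentence contains the cue word "abstract", but only one of its seven phrases matches, so the ratio falls below 0.8 and the sentence stays in the claims paragraph.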

4. The example-based search method for similarity determination according to claim 3, wherein the structure determination rule is configured as follows.
[Paragraph name] [Cue word set (mutual OR relationship)] [Matching degree] [Necessity] $
where
[Paragraph name]: indicates the specific paragraph identified when the rule matches.
[Cue word set]: a list of cue words; when several are listed together, a match with any one of them suffices, that is, they are in an OR relationship.
[Matching degree]: one of three kinds. "+" means that the input phrase must exactly match a word in the cue word set; "-" means that the input phrase need only partially match a cue word; "=" means that the paragraph is matched as soon as the cue word appears, that is, the sentence can be clearly judged to be the first sentence of the structure without applying any other rule.
[Necessity]: one of two kinds. "y" means that the current rule must be satisfied for the sentence to be recognized as that structure; "n" means that the current rule is not necessarily required for recognition as the current structure, but may be satisfied.
$: a delimiter that marks the end of one rule.
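One way to hold such a rule in memory is sketched below. The class and function names, the field layout, and the reading of "partial match" as a substring test are assumptions made for illustration. The claim defines the four fields but not a concrete syntax or matching algorithm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StructureRule:
    paragraph_name: str   # paragraph identified when the rule matches
    cue_words: tuple      # alternatives in an OR relationship
    match_degree: str     # "+", "-", or "="
    necessity: str        # "y" (must be satisfied) or "n" (optional)

def phrase_matches(rule, phrase):
    """Check one phrase against one rule under the assumed reading of each degree."""
    if rule.match_degree == "+":
        return phrase in rule.cue_words                      # exact match required
    if rule.match_degree in ("-", "="):
        return any(cue in phrase for cue in rule.cue_words)  # cue word appears
    raise ValueError(f"unknown match degree: {rule.match_degree!r}")

def is_decisive(rule):
    """A "=" rule settles the paragraph without consulting other rules."""
    return rule.match_degree == "="

exact = StructureRule("claims", ("claims", "claim"), "+", "y")
partial = StructureRule("abstract", ("abstract",), "-", "n")
```

Keeping the rule immutable (`frozen=True`) lets a rule set be shared safely between the indexing and search sides.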

5. The example-based search method for similarity determination according to claim 1, wherein the example document and the related technical document are patent specifications, and the similarity between the example document and the related technical document is determined by: judging the two patents to be identical if any identical claim exists in the claims paragraphs of the two documents; and, if no identical claim exists in the claims paragraphs of the two documents, obtaining the similarities between the "title of invention, object of invention, and effect of invention" paragraphs of the two documents, selecting the highest of these as the representative similarity value of these paragraphs, and giving it the highest weight; obtaining the similarities between the "abstract, composition of invention, action of invention, and claims" paragraphs, selecting the highest of these as the representative similarity value of these paragraphs, and giving it the next-highest weight; obtaining the similarities between the remaining paragraphs, selecting their average as the representative similarity value, and giving it the lowest weight; and determining the similarity by comparing the values obtained by multiplying each representative similarity value by its weight.

6. An example-based search system for similarity determination, comprising an example-based index unit and an example-based search unit, wherein the example-based index unit comprises: a related technical document input unit that inputs a related technical document; a first keyword extraction unit that divides the related technical document input by the input unit into paragraphs according to the structural characteristics of the document and extracts keywords for each divided paragraph; a first word vector expression unit that obtains, for the keywords extracted from each paragraph by the first keyword extraction unit, their weights within that paragraph, and expresses the keywords and their weights for each paragraph as a word vector; and a word vector storage unit that stores the keywords and weights expressed as word vectors by the first word vector expression unit; and wherein the example-based search unit comprises: an example document input unit that inputs an example document in which an example technology is described; a second keyword extraction unit that divides the example document input by the example document input unit into paragraphs according to the structural characteristics of the document and extracts keywords for each divided paragraph; a second word vector expression unit that obtains, for the keywords extracted from each paragraph by the second keyword extraction unit, their weights within that paragraph, and expresses the keywords and their weights for each paragraph as a word vector; a similarity calculation unit that obtains the similarity between corresponding paragraphs of the example document and the related technical document using the paragraph-based word vectors of the example document expressed by the second word vector expression unit and the paragraph-based word vectors of the related technical document stored in the word vector storage unit, and obtains the similarity between the example document and the related technical document using the inter-paragraph similarities; and a display unit that arranges the related technical documents in descending order of the similarity calculated by the similarity calculation unit and provides them to the user.

7. The example-based search system for similarity determination according to claim 6, wherein the first keyword extraction unit and the second keyword extraction unit are configured to divide the related technical document and the example document, respectively, into paragraphs on the basis of the paragraph titles prescribed for patent documents that satisfy the description requirements of each national patent office.

8. The example-based search system for similarity determination according to claim 7, wherein the first keyword extraction unit and the second keyword extraction unit each further comprise a paragraph title extraction system including: a sentence extraction unit that extracts a sentence from the input related technical document or example document; a phrase extraction unit that extracts a phrase from the sentence extracted by the sentence extraction unit; a paragraph title display unit that marks the paragraph titles for which the phrase extracted by the phrase extraction unit satisfies the structure determination rule; a selection unit that, when the paragraph title marking has been completed in the paragraph title display unit, selects the paragraph title matched by the largest number of phrases in the sentence; a paragraph title determination unit that determines whether the ratio of the number of matching phrases to the total number of phrases for the paragraph title selected by the selection unit is 0.8 or more, judges the sentence to be a new paragraph title if the ratio is 0.8 or more, and includes it in the relevant paragraph if the ratio is less than 0.8; and a paragraph title extraction unit that extracts the paragraph titles.