JPS63201867A

JPS63201867A - Document image automatic summarization method

Info

Publication number: JPS63201867A
Application number: JP62033263A
Authority: JP
Inventors: Akira Kagami; 晃加賀美; Koichi Honma; 弘一本間; Fuminobu Furumura; 文伸古村; Fumio Wakamori; 和歌森　文男
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1987-02-18
Filing date: 1987-02-18
Publication date: 1988-08-19

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to an automatic document image summarization method for obtaining material that helps a reader grasp the contents of a document easily, and in particular to an automatic document image summarization method suited to automatic indexing in a document image storage and retrieval system.

[Conventional technology]

In the conventional technique, as described in Japanese Unexamined Patent Publication No. Sho 60-138670, a summary document was created by providing a means for specifying the locations to extract from a document and excerpting the required portions.

[Problem that the invention seeks to solve]

The above conventional technology gives no consideration to eliminating the labor of reading each document in turn and precisely specifying the extraction locations within it, so summarizing a large volume of documents requires enormous cost.

Furthermore, it gives no consideration to the fact that documents are composed of diverse elements and that the means of summarizing them are correspondingly diverse, so an appropriate summary of the whole document cannot be produced.

The object of the present invention is to solve these problems of the prior art and to provide an automatic document image indexing technique for efficiently constructing and operating a large-scale document image storage and retrieval system.

[Means for solving problems]

To achieve the above object, the input document image is first automatically divided into regions according to its content structure, an appropriate summarization process prepared in advance is applied to each divided region, and the summary information thus obtained is edited into a single image and output.

[Effect]

By automatically dividing a document image into regions according to its content structure, each region can be associated with the summarization means it requires. As a result, documents made up of diverse elements can be processed flexibly, and because the region handled by each summarization means is restricted, automated and efficient processing is easy to achieve.
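The association between regions and summarization means described above can be pictured as a simple dispatch table. The following is a minimal sketch only: the region labels, the dictionary layout, and the three placeholder summarizer functions are hypothetical names introduced for illustration and do not appear in the source.

```python
from typing import Callable, Dict, List

# A region is represented here as a plain dict; this layout is an assumption.
Region = Dict[str, str]


def summarize_bibliographic(region: Region) -> str:
    # Placeholder for the bibliographic item extraction step.
    return "bibliographic items from " + region["name"]


def summarize_text(region: Region) -> str:
    # Placeholder for the frequent character string extraction step.
    return "frequent strings from " + region["name"]


def summarize_drawings(region: Region) -> str:
    # Placeholder for the main drawing extraction step.
    return "main drawing from " + region["name"]


# Each kind of region is paired with the summarization means it needs.
SUMMARIZERS: Dict[str, Callable[[Region], str]] = {
    "bibliographic": summarize_bibliographic,
    "text": summarize_text,
    "drawings": summarize_drawings,
}


def summarize_document(regions: List[Region]) -> List[str]:
    """Apply to every region the summarizer registered for its kind."""
    return [SUMMARIZERS[region["kind"]](region) for region in regions]
```

Because each summarizer only ever sees its own kind of region, a new kind of document element could be supported by registering one more entry in the table.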

[Example]

One embodiment of the present invention is described in detail below.

FIG. 1 is a block diagram showing the processing flow of the method according to the present invention for automatically creating abstract images of published patent gazettes.

A gazette document input as an image by the gazette image input unit 11 is first segmented by the component extraction unit 12 into individual characters or figures, each cut out in units of its circumscribed rectangular frame.

An example is shown in FIG. 2. This step can be carried out with known processing algorithms.

As shown in FIG. 3, the element arrangement analysis unit 13 extracts the following feature quantities from each element 32 cut out in units of a circumscribed rectangular frame 31.

(1) X coordinate of the upper-left point of the circumscribed rectangular frame: $X_i$
(2) Y coordinate of the upper-left point of the circumscribed rectangular frame: $Y_i$
(3) Horizontal size of the circumscribed rectangular frame: $W_i$
(4) Vertical size of the circumscribed rectangular frame: $H_i$
(5) X-axis projection marginal distribution of the element: $\{P_i(m)\}$ $(m = 1, \ldots, W_i)$
(6) Y-axis projection marginal distribution of the element: $\{Q_i(n)\}$ $(n = 1, \ldots, H_i)$

Here the subscript i indicates that the quantity is a feature of element i. Using a subset of these features, the document shown in FIG. 1 is roughly divided into the following three regions.

A. Bibliographic matter region 21 (page 1)
B. Specification text region 22 (page 1 onward)
C. Specification drawing region 23 (through the last page)

Because A, B, and C appear in this order, the above region division is possible if A can be distinguished from B and B from C.
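The feature quantities listed above (frame position, frame size, and the two projection marginal distributions) can be sketched in a few lines. This is an illustrative sketch, assuming that an element is given as a small binary raster (1 for ink, 0 for background) together with the page offset of that raster; the function name `element_features` and the dictionary keys are hypothetical.

```python
from typing import Dict, List


def element_features(bitmap: List[List[int]], x0: int = 0, y0: int = 0) -> Dict[str, object]:
    """Compute frame position, frame size and projection profiles of one element.

    bitmap is a binary raster (1 = ink); (x0, y0) is the page position of its
    upper-left pixel. Both conventions are assumptions made for this sketch.
    """
    if not bitmap or not any(any(row) for row in bitmap):
        raise ValueError("element contains no ink")

    # Rows and columns that contain at least one ink pixel.
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]

    # Crop to the circumscribed rectangular frame.
    crop = [row[left:right + 1] for row in bitmap[top:bottom + 1]]

    # P(m): ink count in each column (projection onto the X axis).
    p = [sum(row[m] for row in crop) for m in range(len(crop[0]))]
    # Q(n): ink count in each row (projection onto the Y axis).
    q = [sum(row) for row in crop]

    return {
        "X": x0 + left,
        "Y": y0 + top,
        "W": right - left + 1,
        "H": bottom - top + 1,
        "P": p,
        "Q": q,
    }
```

For a single character the two profiles are short integer lists, which is what makes the later profile-based comparison cheap compared with full character recognition.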

The boundary between A and B exists as a large blank region on the first page. It can therefore be detected as the element j that maximizes

$SP_j = (X_j - X_{j+1})^2 + 100\,(Y_j - Y_{j+1})^2$

The y direction is weighted more heavily in consideration of the characteristics of the blank region.
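The blank-region search can be sketched as follows. This is a minimal sketch, assuming that the elements are supplied in reading order as upper-left points (X, Y); the function name `find_ab_boundary` and the parameter `y_weight` are hypothetical, with the default of 100 taken from the weight in the expression for SP.

```python
from typing import List, Tuple


def find_ab_boundary(points: List[Tuple[int, int]], y_weight: int = 100) -> int:
    """Return the index j whose jump to element j+1 maximizes SP_j.

    points holds the upper-left point (X, Y) of each element in reading order
    (an assumption of this sketch). SP_j weights the vertical jump by y_weight.
    """
    if len(points) < 2:
        raise ValueError("at least two elements are needed")

    def sp(j: int) -> int:
        (xj, yj), (xn, yn) = points[j], points[j + 1]
        return (xj - xn) ** 2 + y_weight * (yj - yn) ** 2

    return max(range(len(points) - 1), key=sp)
```

With the heavier vertical weight, an ordinary line break (a large horizontal jump with a small vertical one) scores far lower than a jump across a tall blank band.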

The boundary between B and C, on the other hand, can be detected easily as the first element i whose $W_i$ or $H_i$, the size of its circumscribed rectangular frame, exceeds a tolerance. Alternatively, the irregular arrangement of elements that characterizes a non-text region can be detected from the upper-left points $(X_i, Y_i)$ of the circumscribed rectangular frames.
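The size-based test for the boundary between B and C amounts to a single scan. The sketch below assumes the frame sizes are given in reading order as (W, H) pairs; the function name `find_bc_boundary` and the tolerance parameters `max_w` and `max_h` are hypothetical, and suitable tolerance values are not given in the source.

```python
from typing import List, Optional, Tuple


def find_bc_boundary(sizes: List[Tuple[int, int]], max_w: int, max_h: int) -> Optional[int]:
    """Return the index of the first element whose frame exceeds the tolerance.

    sizes holds (W, H) of each circumscribed frame in reading order.
    Returns None when every frame is within the tolerance (no drawing found).
    """
    for i, (w, h) in enumerate(sizes):
        if w > max_w or h > max_h:
            return i
    return None
```

In practice the tolerance would presumably be set a little above the largest character size in the text region, so that only figure-sized frames trigger the boundary.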

Next, the detailed division and summarization processes shown in blocks 14 to 16 are applied to the three divided regions, respectively.

The bibliographic item extraction unit 14 extracts bibliographic items from the bibliographic matter region 21. In a gazette, each item is preceded by a specific character string (hereinafter, a headword) that concisely expresses its content. An example is shown in FIG. 4. By searching for these headwords 41, the location of each bibliographic item 42 can be narrowed down (the region is divided in detail). After that, only the bibliographic items are extracted, with high accuracy, from their positions relative to the headwords.

Because a headword is a character string made up of several (for example, N) characters, the probability of misjudging every character is very small. Exploiting this combination effect makes a highly accurate search possible overall even when the search accuracy for each individual character is insufficient. In other words, a costly device such as an OCR unit is not needed. For example, a match between a registered headword $\{a_t\}$; $t = 1, \ldots, N$ and a candidate character string $\{b_t\}$; $t = 1, \ldots, N$ in the document being processed is judged as N consecutive per-character similarities $a_t \sim b_t$. If the per-character similarity test is performed with the marginal distributions using the following formula, a simple and fast headword search can be carried out. That is, characters i and j are judged to be similar if

$\ldots < K \cdot \left( (S_i + S_j) / 2 \right)$

is satisfied (the left-hand side of the inequality is not legible in this text). Here c is the mutual shift amount for correcting positional misalignment; $W_i$ and $W_j$, $P_i$ and $P_j$, and $S_i$ and $S_j$ are, respectively, the horizontal sizes, X-axis projection marginal distributions, and areas of the circumscribed rectangular frames of i and j. K is a proportionality constant determined for the document as a whole.

The similarity test using the Y-axis projection marginal distribution is analogous, and performing the test in two stages can improve accuracy without greatly increasing the processing load.
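The headword search by N consecutive per-character similarity tests can be sketched as follows. The left-hand side of the similarity inequality is not available in this text, so the sketch assumes one plausible form: the sum of absolute differences between the two X-axis projection profiles, minimized over the mutual shift c. That choice, the function names, and the default values of `k` and `max_shift` are all assumptions for illustration, not the formula of the source.

```python
from typing import List, Tuple

Char = Tuple[List[int], int]  # (X-axis projection profile P, frame area S)


def _at(profile: List[int], m: int) -> int:
    # Profile value at position m, treating positions outside the frame as blank.
    return profile[m] if 0 <= m < len(profile) else 0


def profile_distance(p_i: List[int], p_j: List[int], max_shift: int = 2) -> int:
    """Smallest sum of absolute profile differences over shifts c (assumed measure)."""
    best = None
    for c in range(-max_shift, max_shift + 1):
        lo = min(0, c)
        hi = max(len(p_i), len(p_j) + c)
        d = sum(abs(_at(p_i, m) - _at(p_j, m - c)) for m in range(lo, hi))
        if best is None or d < best:
            best = d
    return best


def chars_similar(a: Char, b: Char, k: float = 0.2, max_shift: int = 2) -> bool:
    """Judge two characters similar when the distance is below K * ((S_i + S_j) / 2)."""
    (p_i, s_i), (p_j, s_j) = a, b
    return profile_distance(p_i, p_j, max_shift) < k * ((s_i + s_j) / 2)


def matches_headword(headword: List[Char], candidate: List[Char], k: float = 0.2) -> bool:
    """A candidate matches only if all N characters are similar in sequence."""
    if len(headword) != len(candidate):
        return False
    return all(chars_similar(a, b, k) for a, b in zip(headword, candidate))
```

A second stage using the Y-axis profiles could be added in the same way, applied only to candidates that pass the X-axis test, which keeps the extra cost small.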

The frequent character string extraction unit 15 extracts frequently occurring character strings from the specification text region as keyword candidates.

Processing the entire specification would lower processing efficiency and raise the likelihood of extracting non-keywords, so the headword search means described above is used to restrict the processing target to one part, for example the "Claims". This part is cut out as the region enclosed by the two headwords "Claims" and "Detailed Description of the Invention". One method for extracting frequent character strings from this restricted region is described in the present applicant's Japanese Patent Application No. Sho 61-288775, "Frequent character string extraction method".

The main drawing extraction unit 16 extracts the main drawing from the specification drawing region. In general the main drawing tends to be placed as FIG. 1, so among the circumscribed rectangular frames of at least a certain size, each of which is regarded as one figure, the one nearest the upper left (the earliest to appear) is extracted as the main drawing.

Because the size of the main drawing varies from gazette to gazette, a similarity transformation (uniform scaling) is applied so that the long side of the circumscribed rectangular frame does not exceed a fixed size.
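Main drawing selection and the subsequent scaling can be sketched together. The sketch assumes that frames are given as (x, y, w, h) on a single page, that "a certain size" is tested against the long side of the frame, and that "nearest the upper left" means topmost first and then leftmost; these interpretations, the function names, and the parameters `min_size` and `max_long_side` are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Frame = Tuple[int, int, int, int]  # (x, y, w, h) of a circumscribed frame


def select_main_drawing(frames: List[Frame], min_size: int) -> Optional[Frame]:
    """Pick the topmost, then leftmost, frame whose long side is at least min_size."""
    candidates = [f for f in frames if max(f[2], f[3]) >= min_size]
    if not candidates:
        return None
    return min(candidates, key=lambda f: (f[1], f[0]))


def scale_to_fit(w: int, h: int, max_long_side: int) -> Tuple[int, int, float]:
    """Return the scaled (w, h) and scale factor so the long side fits the limit.

    The same factor is applied to both sides, so the aspect ratio is preserved.
    Frames that already fit are returned unchanged with a factor of 1.0.
    """
    long_side = max(w, h)
    if long_side <= max_long_side:
        return (w, h, 1.0)
    s = max_long_side / long_side
    return (round(w * s), round(h * s), s)
```

Scaling only when the frame is too large avoids enlarging small drawings, which matches the stated goal of keeping the long side at or below a fixed size.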

In the abstract element editing unit 17 that follows, as shown in FIG. 5, the three abstract elements, namely the bibliographic items 51, the frequent character strings 52, and the main drawing 53, are edited into a single image to create the abstract image.

The abstract image output unit 18 displays the created abstract image on a CRT display or stores it in a storage device such as an optical disk.

According to this embodiment, the headword search allows the structure of a document to be grasped accurately, so more efficient and more accurate summarization is achieved.

Furthermore, in the above embodiment, applying conversion to character codes by OCR or the like to the divided specification text region and other regions allows even more flexible summarization. For example, existing software developed for character-coded text, such as Japanese language processing software, can then be used for semantic analysis, translation, and similar tasks.

[Effect of the invention]

According to the present invention, automatically dividing the input document image into regions according to its content structure restricts the region each process must handle and allows an appropriate summarization process to be prepared for each region, so automated document image summarization can be realized efficiently and with high accuracy.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of one embodiment of the method according to the present invention for creating abstract images of published patent gazettes. FIG. 2 is a diagram showing an example of the gazette components extracted by the component extraction unit of the abstract image creation method. FIG. 3 is a diagram showing the feature quantities of those components. FIG. 4 is a diagram showing the positional relationship between headwords and bibliographic items. FIG. 5 is a diagram showing an example of an abstract image.

Claims

[Claims]
1. An automatic document image summarization method characterized in that a document is input as an image, the input document image is divided into regions according to its content structure, a predetermined summarization process prepared individually for each divided image region is applied to that region, and the summary information extracted from each image region by the summarization process is edited into one image and output.
2. The automatic document image summarization method according to claim 1, characterized in that the divided image region is converted from an image into another representation, and a predetermined summarization process is then applied to that representation.
3. An automatic document image summarization method characterized in that a document is input as an image, a character string serving as a heading that indicates the contents of the document is registered in advance, the character string is searched for in the input document image by image processing, and the document image is divided into regions on the basis of the character string found.