JPH0660221A

JPH0660221A - Area extracting method for document image

Info

Publication number: JPH0660221A
Application number: JP4211661A
Authority: JP
Inventors: Naohiro Amamoto; 直弘天本; Akitoshi Tsukamoto; 明利塚本; Sadamasa Hirogaki; 節正広垣
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1992-08-07
Filing date: 1992-08-07
Publication date: 1994-03-04

Abstract

PURPOSE: To set an appropriate integration threshold for a document image such as a newspaper, and to correctly integrate body-text character areas and other character areas. CONSTITUTION: From an area image S1 generated by area image creation processing 1, a label image S2 is created by label image creation processing 2. Character determination processing 3 discriminates specific characters in the document image from other characters, and character image creation processing 4 creates a first character image S4a for the specific characters and a second character image S4b for the other characters. Integration threshold setting processing 5 derives an integration threshold S5 from the first character image S4a, so that an appropriate integration threshold is obtained. Area extraction processing 6 performs integration separately on the first and second character images S4a and S4b using the integration threshold S5. The areas of a document image such as a newspaper can thus be extracted accurately.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document image area extraction method that extracts the areas of the constituent elements of a document image, such as a newspaper, for use in communication equipment such as facsimile machines, document image database input devices, optical character readers (OCR), and the like.

[0002]

[Prior Art] Conventional document image area extraction methods of this kind include, for example, the one described in Japanese Unexamined Patent Publication No. 62-71379. In the method described in that publication, document image data is input, black pixels are counted along each line in the scanning direction (for example, horizontally) to detect lines whose count exceeds a threshold, and a first area is cut out by detecting a predetermined number of consecutive white lines whose counts are at or below the threshold. Within this first cut-out area, black pixels are counted along each column in the sub-scanning direction (for example, vertically) to detect columns whose count exceeds the threshold, and a second area is cut out by detecting a predetermined number of consecutive white columns whose counts are at or below the threshold.
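The row-then-column cut-out of the prior-art method can be illustrated with a minimal Python sketch. It assumes the image is a list of rows holding 1 for black and 0 for white; the names `cut_by_projection`, `transpose`, `count_threshold`, and `min_gap` are hypothetical and do not come from the cited publication.

```python
from typing import List, Tuple


def cut_by_projection(image: List[List[int]],
                      count_threshold: int,
                      min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) line ranges, end exclusive, of the cut-out areas.

    A line counts as "ink" when its black-pixel count exceeds
    count_threshold. An area is closed only after min_gap consecutive
    white lines (count at or below the threshold) have been seen.
    """
    ink = [sum(line) > count_threshold for line in image]
    areas = []
    start = None      # first ink line of the area currently open
    last_ink = None   # most recent ink line seen
    for y, is_ink in enumerate(ink):
        if is_ink:
            if start is None:
                start = y
            last_ink = y
        elif start is not None and y - last_ink >= min_gap:
            areas.append((start, last_ink + 1))
            start = None
    if start is not None:
        areas.append((start, last_ink + 1))
    return areas


def transpose(image: List[List[int]]) -> List[List[int]]:
    """Swap rows and columns so the same routine can scan columns."""
    return [list(column) for column in zip(*image)]


page = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
first_cut = cut_by_projection(page, count_threshold=0, min_gap=2)
top, bottom = first_cut[0]
second_cut = cut_by_projection(transpose(page[top:bottom]), 0, 2)
```

Here `first_cut` is the first (row-wise) cut-out and `second_cut` repeats the same test on the columns of the first area. Both `count_threshold` and `min_gap` have to be chosen per document, which is the tuning burden the later paragraphs set out to remove.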

[0003] Further, within the second cut-out area, a third area is cut out by the same processing as the first cut-out, and within the third cut-out area, a fourth area is cut out by the same processing as the second cut-out. For each area detected by the fourth cut-out, the attribute of the area (character portion, photograph portion, or figure/table portion) is then determined from the run-length information and black-pixel-ratio information of that area.

[0004]

[Problems to Be Solved by the Invention] However, the document image area extraction method configured as above has the following problems. (a) In the conventional method, when dividing and extracting areas, various thresholds had to be chosen to suit the material being divided. For example, when applying area division to individual journals, the thresholds had to be reset to values appropriate for each journal, and selecting those values took considerable effort. (b) To solve problem (a), the present applicant previously proposed area extraction methods in Japanese Patent Application No. 3-52846 (Proposal 1) and Japanese Patent Application No. 3-195437 (Proposal 2).

[0005] In Proposal 1, an all-black multi-valued area image of the same size as the input black-and-white binary image data is first created. The original image is then scanned horizontally and vertically; wherever a white run at or above the threshold T1 or T2 for the respective direction exists, the part of the area image corresponding to that white run is set to white, so that the constituent elements of the document image are represented as connected black-pixel areas. The black parts of the area image are then labeled, giving each connected black area a unique number to create a label image, and area division is performed using this label image. With this method, however, character areas can only be represented as small areas such as individual lines or individual characters, which makes it difficult to determine the structure of the document, the flow of the text, and so on when performing character recognition. To remove this drawback, Proposal 2 sets an integration threshold from a white-run-width histogram of the character areas after the area division, and extracts areas by integrating them using that threshold. However, when the document image to be processed is, for example, a newspaper, the problems shown in FIGS. 2(a) and 2(b) arise.

[0006] FIGS. 2(a) and 2(b) illustrate the problems of Proposal 2. As shown in FIG. 2(a), a document image such as a newspaper, containing headline characters A, body characters B, and so on, has narrow line spacing and narrow gaps between columns, and an appropriate integration threshold cannot be obtained for such a document. In addition, as in area C enclosed by the broken line in FIG. 2(b), different paragraphs, or the area of body characters B and other character areas such as headline characters A, are erroneously integrated. A technically satisfactory area extraction method had therefore not yet been obtained. The present invention provides an area extraction method for document images such as newspapers that solves these problems of the prior art, namely that an appropriate integration threshold cannot be obtained for documents such as newspapers, and that body-text character areas are erroneously integrated with other character areas.

[0007]

[Means for Solving the Problems] To solve the above problems, the first invention is a document image area extraction method that performs area image creation processing, which creates, from an all-black area image and an original image, an area image in which the constituent elements of the document image are represented as connected black-pixel areas, and label image creation processing, which creates a label image in which each constituent element of the area image is given a unique number, and that extracts the areas of the document image using the label image. In this method, character determination processing discriminates, from the label image, specific characters from other characters according to character size. Character image creation processing then creates, for the specific characters, a first character image in which all areas of the label image other than the character areas are made white, and, for the characters other than the specific characters, a second character image in which all areas of the label image other than the character areas are made white. After that, integration threshold setting processing, which scans the first character image and sets an integration threshold, and area extraction processing, which extracts areas from the first and second character images separately using the integration threshold, are executed to extract the areas of a document image such as a newspaper. In the second invention, the area extraction processing of the first invention changes the integration threshold when extracting areas from the second character image.

[0008]

[Operation] According to the first invention, with the document image area extraction method configured as above, when the label image created by the label image creation processing is used, the character determination processing discriminates specific characters from other characters according to character size. The character image creation processing creates a first character image composed of the specific characters and a second character image composed of the other characters. The integration threshold setting processing sets an integration threshold from the first character image, and the area extraction processing uses that threshold to perform integration on the first and second character images separately, thereby extracting the areas of a document image such as a newspaper. According to the second invention, areas are extracted from the second character image with a different integration threshold, so that area extraction suited to the layout of the document image can be performed. The above problems can therefore be solved.

[0009]

[Embodiment] An overall description (I) of the processing of an area extraction method according to an embodiment of the present invention, and the details of each process (II), are given below with reference to FIG. 1 and FIGS. 3 to 5.

[0010] (I) Overall description of the area extraction method (FIGS. 1 and 3)

FIG. 1 shows the overall processing of the area extraction method, and FIGS. 3(a) and 3(b) show examples of the character images in FIG. 1 (for example, from a newspaper). As shown in FIG. 1, area image creation processing 1 first creates, from an all-black area image and the original image, an area image S1 in which the constituent elements of the document image are represented as connected black-pixel areas, and label image creation processing 2 creates a label image S2 in which each constituent element of the area image S1 is given a unique number.
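As a rough illustration of processing 1 and 2, the sketch below builds an area image using the white-run rule that paragraph [0005] gives for Proposal 1 and then numbers its connected black areas. Several points are assumptions rather than statements from the patent: pixels are 1 (black) or 0 (white) instead of multi-valued, T1 is applied to horizontal runs and T2 to vertical runs, 4-connectivity is used for labeling, and the function names are hypothetical.

```python
from collections import deque


def white_runs(line):
    """Yield (start, end) pairs, end exclusive, for each run of 0 in line."""
    start = None
    for i, value in enumerate(line):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(line)


def make_area_image(original, t1, t2):
    """Start all black, then whiten every long white run of the original."""
    height, width = len(original), len(original[0])
    area = [[1] * width for _ in range(height)]
    for y in range(height):                      # horizontal scan
        for s, e in white_runs(original[y]):
            if e - s >= t1:
                for x in range(s, e):
                    area[y][x] = 0
    for x in range(width):                       # vertical scan
        column = [original[y][x] for y in range(height)]
        for s, e in white_runs(column):
            if e - s >= t2:
                for y in range(s, e):
                    area[y][x] = 0
    return area


def label_image(area):
    """Give each 4-connected black area a unique number starting at 1."""
    height, width = len(area), len(area[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if area[y][x] != 1 or labels[y][x] != 0:
                continue
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])
            while queue:                         # breadth-first flood fill
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and area[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels
```

White gaps shorter than the thresholds stay black in the area image, so neighbouring strokes fuse into one connected area, and the label image then identifies each such area by number.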

[0011] Next, character determination processing 3 classifies the characters into, for example, those making up the body text of the newspaper and all other characters, and sends the classification result to character image creation processing 4. In an ordinary newspaper, the characters making up the body text are of nearly constant size, and the characters used for headlines and the like are clearly larger than the body-text characters. Character determination processing 3 can therefore easily distinguish body-text characters from other characters by, for example, character size. Character image creation processing 4 creates, for the body-text characters, a first character image S4a in which all areas of the label image S2 other than the character areas are made white, and, for the other characters, a second character image S4b in which all areas of the label image S2 other than the character areas are made white, and sends the first character image S4a to integration threshold setting processing 5.
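The text does not specify how character size is measured, so the following Python sketch is only one plausible reading of processing 3 and 4. It assumes every labeled area is a character area, takes the bounding-box height of each label as its size, treats the most frequent height as the body-text size, and uses a hypothetical `tolerance` parameter; none of these choices come from the patent.

```python
from collections import Counter


def bounding_boxes(labels):
    """Map each non-zero label to (top, left, bottom, right), inclusive."""
    boxes = {}
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            if not label:
                continue
            if label in boxes:
                t, l, b, r = boxes[label]
                boxes[label] = (min(t, y), min(l, x), max(b, y), max(r, x))
            else:
                boxes[label] = (y, x, y, x)
    return boxes


def split_by_size(labels, tolerance=1):
    """Return (first_image, second_image) as 1/0 images.

    first_image keeps areas whose height is within `tolerance` of the most
    common height (assumed to be body text); second_image keeps the rest.
    """
    boxes = bounding_boxes(labels)
    heights = {label: b - t + 1 for label, (t, l, b, r) in boxes.items()}
    body_height = Counter(heights.values()).most_common(1)[0][0]
    body = {label for label, h in heights.items()
            if abs(h - body_height) <= tolerance}
    first = [[1 if label in body else 0 for label in row] for row in labels]
    second = [[1 if label and label not in body else 0 for label in row]
              for row in labels]
    return first, second
```

The two returned images play the roles of S4a and S4b: each shows only one class of characters, with everything else set to white (0).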

[0012] Integration threshold setting processing 5 scans the first character image S4a and sets the integration threshold S5. For a character image such as FIG. 3(a), for example, the same integration threshold setting operation as in Proposal 2 is performed. Whereas Proposal 2 performed this operation on a character image in which the elements of both FIG. 3(a) and FIG. 3(b) were mixed, the present embodiment performs integration threshold setting processing 5 only on a character image such as FIG. 3(a), so an appropriate integration threshold S5 is obtained more reliably. Area extraction processing 6 then extracts areas from each of the first and second character images S4a and S4b using the integration threshold S5. In area extraction processing 6, the same area extraction as in Proposal 2 is applied separately to character images such as FIGS. 3(a) and 3(b). For a character image such as FIG. 3(b), however, the gaps between blocks are wide, so it is desirable to extract areas using, for example, n times the integration threshold S5 set by integration threshold setting processing 5 (n being a real number) as the integration threshold.

[0013] (II) Details of each process in (I)

(II)(1) Integration threshold setting processing 5 (FIG. 4)

FIG. 4 shows the processing of integration threshold setting processing 5 in FIG. 1. In FIG. 4, the data 51 of the first character image S4a created by character image creation processing 4 in FIG. 1 is first scanned horizontally, and longest white run distribution extraction processing 52 extracts the horizontal white-run distribution. The same operation is performed vertically. Next, white-run-width histogram creation processing 53 creates horizontal and vertical histograms from the white-run distributions. White-run-width detection processing 54 then finds the width of consecutive white runs that gives the maximum value of each histogram, and integration threshold determination processing 55 sets the integration threshold S5, which consists of a horizontal and a vertical integration threshold.
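The exact rule used by processing 52 to 55 is defined in Proposal 2 and is not reproduced here, so the sketch below shows only a simplified interpretation: it histograms the widths of white runs that lie between black pixels and takes the most frequent width in each direction as that direction's threshold. The function names and the tie-breaking rule (smallest width wins) are assumptions.

```python
from collections import Counter


def interior_white_runs(line):
    """Lengths of runs of 0 that have a black pixel on both sides."""
    lengths = []
    run = 0
    seen_black = False
    for value in line:
        if value:
            if seen_black and run:
                lengths.append(run)
            seen_black = True
            run = 0
        else:
            run += 1
    return lengths  # leading and trailing white margins are ignored


def most_frequent_run(lines):
    """Width with the highest histogram count (0 if there are no runs)."""
    histogram = Counter()
    for line in lines:
        histogram.update(interior_white_runs(line))
    if not histogram:
        return 0
    top = max(histogram.values())
    return min(width for width, count in histogram.items() if count == top)


def integration_threshold(image):
    """Return (horizontal_threshold, vertical_threshold) for a 1/0 image."""
    horizontal = most_frequent_run(image)        # scan along rows
    vertical = most_frequent_run(zip(*image))    # scan along columns
    return horizontal, vertical
```

Because the input is the body-text image alone, the dominant white-run width reflects the regular character and line spacing of the body text, which is the motivation the embodiment gives for deriving S5 from S4a only.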

[0014] (II)(2) Area extraction processing 6 (FIG. 5)

FIG. 5 shows the content of area extraction processing 6 in FIG. 1. First, horizontal division processing 61 performs a first horizontal division on each of the first and second character images S4a and S4b; thereafter, vertical division processing 62 and horizontal division processing 64 are performed alternately. Determination processing 63 and 65 checks, for each of the first and second character images S4a and S4b, whether the first block count obtained by horizontal division matches the second block count obtained by vertical division. When the block counts of vertical and horizontal division become equal, the area extraction processing for that character image ends.
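The alternating division with its block-count stopping test can be sketched as follows. This is a simplified reading, not the procedure of Proposal 2: "horizontal division" is assumed to mean cutting a block into bands of rows, white gaps no wider than the threshold are kept inside one block (that is, integrated), and the names `split_block`, `extract_regions`, `h_gap`, and `v_gap` are hypothetical.

```python
def split_block(image, block, by_rows, gap):
    """Split block (top, left, bottom, right; end exclusive) along one axis.

    Lines without ink separate blocks only when more than `gap` of them
    occur in a row; narrower gaps stay inside a single block.
    """
    top, left, bottom, right = block
    if by_rows:
        profile = [any(image[y][left:right]) for y in range(top, bottom)]
    else:
        profile = [any(image[y][x] for y in range(top, bottom))
                   for x in range(left, right)]
    segments = []
    start = last = None
    for i, has_ink in enumerate(profile):
        if not has_ink:
            continue
        if start is None:
            start = i
        elif i - last - 1 > gap:          # white gap wide enough to cut
            segments.append((start, last + 1))
            start = i
        last = i
    if start is not None:
        segments.append((start, last + 1))
    if by_rows:
        return [(top + s, left, top + e, right) for s, e in segments]
    return [(top, left + s, bottom, left + e) for s, e in segments]


def extract_regions(image, h_gap, v_gap):
    """Alternate row and column division until the block count stops changing."""
    whole = (0, 0, len(image), len(image[0]))
    blocks = split_block(image, whole, True, v_gap)   # first division
    by_rows = True
    while True:
        previous = len(blocks)
        by_rows = not by_rows
        gap = v_gap if by_rows else h_gap
        blocks = [piece for block in blocks
                  for piece in split_block(image, block, by_rows, gap)]
        if len(blocks) == previous:       # counts match: extraction ends
            return blocks
```

Under these assumptions the routine would be called once per character image, for example with the thresholds of S5 for S4a and with each threshold multiplied by n for S4b.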

[0015] As described above, in this embodiment character determination processing 3 of FIG. 1 classifies characters into, for example, those making up the body text of a newspaper and all others, and integration threshold setting processing 5 sets the integration threshold S5 from the first character image S4a, which is created from the body-text characters, so an appropriate integration threshold is obtained. Moreover, because area extraction processing 6 extracts areas separately from the first and second character images S4a and S4b, which are created from the two kinds of characters, integration is performed correctly, whereas Proposal 2 risked erroneously integrating body-text character areas with other character areas. The present invention is not limited to the above embodiment, and various modifications are possible. For example, in area extraction processing 6 of FIG. 5, vertical division processing 62 may be performed first, followed by horizontal division processing 61, with the two then repeated alternately; area extraction processing 6 may be performed by a method other than that of FIG. 5; and integration threshold setting processing 5 may be performed by a method other than that of FIG. 4. The area extraction method of this embodiment is also applicable to general document images other than newspapers.

[0016]

[Effects of the Invention] As described in detail above, according to the first invention, specific characters in a document image are discriminated from other characters by the character determination processing, and the integration threshold setting processing sets the integration threshold from the first character image, which the character image creation processing creates from the specific characters, so an appropriate integration threshold is obtained. Moreover, because the area extraction processing extracts areas separately from the first and second character images, which are created from the two kinds of characters (the specific characters and the others), the first and second character areas are not erroneously integrated, and integration is performed correctly. According to the second invention, area extraction from the second character image uses a value different from the integration threshold used for the first character image, so accurate areas can be extracted even from document images such as newspapers in which the gaps between blocks are wide.

[Brief description of drawings]

FIG. 1 is a processing diagram of a document image area extraction method according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram of the problems of Proposal 2.

FIG. 3 shows examples of the character images in FIG. 1.

FIG. 4 shows the processing of integration threshold setting processing 5 in FIG. 1.

FIG. 5 shows the processing of area extraction processing 6 in FIG. 1.

[Explanation of symbols]

1: area image creation processing
2: label image creation processing
3: character determination processing
4: character image creation processing
5: integration threshold setting processing
6: area extraction processing
S1: area image
S2: label image
S4a, S4b: first and second character images
S5: integration threshold

Claims

[Claims]

1. A document image area extraction method that performs area image creation processing for creating, from an all-black area image and an original image, an area image in which the constituent elements of a document image are represented as connected black-pixel areas, and label image creation processing for creating a label image in which each constituent element of the area image is given a unique number, and that extracts areas of the document image using the label image, the method being characterized by executing: character determination processing for discriminating, from the label image, specific characters from other characters according to character size; character image creation processing for creating, for the specific characters, a first character image in which all areas of the label image other than character areas are made white, and, for characters other than the specific characters, a second character image in which all areas of the label image other than character areas are made white; integration threshold setting processing for scanning the first character image and setting an integration threshold; and area extraction processing for extracting areas from the first and second character images separately using the integration threshold.

2. The document image area extraction method according to claim 1, wherein, in the area extraction processing, the integration threshold is changed when extracting areas from the second character image.