JPH0660221A - Area extracting method for document image - Google Patents

Area extracting method for document image

Info

Publication number
JPH0660221A
Authority
JP
Japan
Prior art keywords
image
character
area
processing
integrated threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP4211661A
Other languages
Japanese (ja)
Inventor
Naohiro Amamoto
直弘 天本
Akitoshi Tsukamoto
明利 塚本
Sadamasa Hirogaki
節正 広垣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oki Electric Industry Co Ltd
Original Assignee
Oki Electric Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Character Input (AREA)
  • Facsimile Image Signal Circuits (AREA)
  • Image Analysis (AREA)

Abstract

PURPOSE: To set an appropriate integrated threshold for a document image such as a newspaper, and to correctly integrate the body-text character area and the other character areas. CONSTITUTION: A label image S2 is created by label image creation processing 2 from an area image S1 created by area image creation processing 1. Character determination processing 3 distinguishes specific characters in the document image from the other characters, and character image creation processing 4 creates a first character image S4a for the specific characters and a second character image S4b for the other characters. Integrated threshold setting processing 5 derives an integrated threshold S5 from the first character image S4a, which yields an appropriate integrated threshold. Area extraction processing 6 performs integration separately on the first and second character images S4a and S4b using the integrated threshold S5. Area extraction can thus be performed accurately for a document image such as a newspaper.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document image area extraction method that divides a document image, such as a newspaper page, into the areas of its constituent elements, for use in communication equipment such as facsimile machines, document image database input devices, optical character readers (OCR), and the like.

[0002]

[Prior Art] One conventional document image area extraction method of this kind is described in, for example, Japanese Patent Laid-Open No. 62-71379. In the method described there, document image data is input, black pixels are counted along the scanning direction (for example, horizontally) to detect lines whose count exceeds a threshold, and a first area cutout is made by detecting the state in which a predetermined number of white lines, whose count is at or below the threshold, occur consecutively. Within this first area cutout, black pixels are counted along the sub-scanning direction (for example, vertically) to detect columns whose count exceeds the threshold, and a second area cutout is made by detecting the state in which a predetermined number of white columns, whose count is at or below the threshold, occur consecutively.

[0003] Further, within the second area cutout, a third area cutout is made by the same processing as the first, and within the third area cutout a fourth area cutout is made by the same processing as the second. For each area detected by the fourth area cutout, its attribute (character portion, photograph portion, or chart portion) is then determined from the run-length information and black-pixel-ratio information of that area.
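As a rough illustration of the projection-based cutout used by the prior-art method, the following Python sketch performs one pass along the rows. It is only a sketch under stated assumptions: the function name `cut_rows`, its parameters, and the 0/1 list-of-lists image format are hypothetical and are not taken from the cited publication.

```python
def cut_rows(image, black_threshold, min_white_lines):
    """Split a binary image (1 = black, 0 = white) into horizontal bands.

    A line counts as "black" when its black-pixel count exceeds
    black_threshold. A band is closed once min_white_lines consecutive
    non-black lines follow it. Returns inclusive (first_row, last_row) pairs.
    Hypothetical helper: names and image format are illustrative only.
    """
    bands = []
    start = None       # first row of the band being built
    last_black = None  # most recent black line in that band
    white_run = 0      # consecutive white lines seen since last_black
    for y, row in enumerate(image):
        if sum(row) > black_threshold:
            if start is None:
                start = y
            last_black = y
            white_run = 0
        elif start is not None:
            white_run += 1
            if white_run >= min_white_lines:
                bands.append((start, last_black))
                start = None
    if start is not None:
        bands.append((start, last_black))
    return bands
```

Running the same function on the transpose of a band (for example `list(zip(*band))`) would give the column-wise cutout, and alternating the two corresponds to the first through fourth cutouts described above.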

[0004]

[Problems to Be Solved by the Invention] However, the document image area extraction method configured as above has the following problems. (a) In the conventional method, when areas are divided and extracted, various thresholds have to be set individually according to what is being divided. For example, when area division is applied to individual journals, the thresholds must be reset to values appropriate for each journal, and choosing those values takes time and effort. (b) To solve problem (a), the present applicant previously proposed area extraction methods in Japanese Patent Application No. 3-52846 (Proposal 1) and Japanese Patent Application No. 3-195437 (Proposal 2).

[0005] In Proposal 1, an entirely black multi-valued area image of the same size as the input black-and-white binary image data is first created. Next, the original image is scanned horizontally and vertically, and wherever a white run of at least threshold T1 or T2 exists in the respective direction, the corresponding part of the area image is set to white, so that the constituent elements of the document image are expressed as connected black-pixel areas. The black parts of the area image are then labeled, giving each connected black area a unique number to create a label image, and area division is performed using this label image. With this method, however, character areas can be expressed only as small areas such as individual lines or individual characters, which makes it difficult to learn the structure of the document or how the text connects when performing character recognition. To remove this drawback, Proposal 2 performs the above area division, then sets an integrated threshold for the character areas from a white-run-width histogram and extracts areas by integrating with that threshold. When the document image is, for example, a newspaper, however, the problems shown in FIGS. 2(a) and 2(b) arise.
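The area image and label image steps of Proposal 1 can be sketched in Python as follows. This is a minimal sketch, not the applicant's implementation: the function names, the 0/1 list-of-lists image format, and the use of 4-connectivity for labeling are all assumptions, since the text does not specify them.

```python
def white_runs(line):
    """Yield (start, length) for each maximal run of 0s (white) in a line."""
    start = None
    for i, v in enumerate(line):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            yield start, i - start
            start = None
    if start is not None:
        yield start, len(line) - start


def make_area_image(original, t1, t2):
    """Start from an all-black image and whiten every long white run.

    t1 applies to horizontal runs, t2 to vertical runs (assumed mapping).
    """
    h, w = len(original), len(original[0])
    area = [[1] * w for _ in range(h)]
    for y in range(h):
        for s, n in white_runs(original[y]):
            if n >= t1:
                for x in range(s, s + n):
                    area[y][x] = 0
    for x in range(w):
        column = [original[y][x] for y in range(h)]
        for s, n in white_runs(column):
            if n >= t2:
                for y in range(s, s + n):
                    area[y][x] = 0
    return area


def label_image(area):
    """Number each 4-connected black area; returns (labels, count)."""
    h, w = len(area), len(area[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if area[y][x] == 1 and labels[y][x] == 0:
                count += 1
                labels[y][x] = count
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and area[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count
```

Short white gaps (narrower than T1 or T2) stay black in the area image, which is why neighbouring strokes of one character or one line end up in the same labeled area.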

[0006] FIGS. 2(a) and 2(b) illustrate the problems of Proposal 2. As shown in FIG. 2(a), a document image such as a newspaper, which contains headline characters A, body characters B, and so on, has narrow line spacing and narrow gaps between columns, and an appropriate integrated threshold cannot be obtained for such a document. In addition, as in area C enclosed by the broken line in FIG. 2(b), different paragraphs, or the area of body characters B and other character areas such as headline characters A, are erroneously integrated. A technically fully satisfactory area extraction method had therefore not yet been obtained. The present invention provides an area extraction method for document images such as newspapers that solves these problems of the prior art, namely that an appropriate integrated threshold cannot be obtained for documents such as newspapers and that the body character area and the other character areas are erroneously integrated.

[0007]

[Means for Solving the Problems] To solve the above problems, the first invention is a document image area extraction method that applies area image creation processing, which creates from an entirely black area image and the original image an area image expressing the constituent elements of the document image as connected black-pixel areas, and label image creation processing, which creates a label image giving each constituent element of the area image a unique number, and that extracts the areas of the document image using the label image. In this method, character determination processing distinguishes specific characters from the other characters in the label image according to character size. Character image creation processing then creates, for the specific characters, a first character image in which every area of the label image other than those character areas is set to white and, for the characters other than the specific characters, a second character image in which every area of the label image other than those character areas is set to white.
Integrated threshold setting processing then scans the first character image to set an integrated threshold, and area extraction processing uses that integrated threshold to extract areas separately from the first and second character images, thereby extracting the areas of a document image such as a newspaper. In the second invention, the area extraction processing of the first invention changes the integrated threshold when extracting areas from the second character image.

[0008]

[Operation] According to the first invention, with the document image area extraction method configured as above, when character images are created from the label image produced by the label image creation processing, the character determination processing distinguishes specific characters from the other characters according to character size. The character image creation processing creates a first character image composed of the specific characters and a second character image composed of the other characters. The integrated threshold setting processing sets an integrated threshold from the first character image, and the area extraction processing uses that threshold to perform integration separately on the first and second character images, thereby extracting the areas of a document image such as a newspaper. According to the second invention, areas in the second character image are extracted with a different integrated threshold, so area extraction suited to the structure of the document image can be performed. The above problems can therefore be solved.

[0009]

[Embodiment] An overall description of the processing of an area extraction method according to an embodiment of the present invention (I), and the content of each process (II), are given below with reference to FIG. 1 and FIGS. 3 to 5.

[0010] (I) Overall description of the area extraction method (FIGS. 1 and 3)
FIG. 1 shows the overall processing of the area extraction method, and FIGS. 3(a) and 3(b) show examples of the character images in FIG. 1 (for example, from a newspaper). As shown in FIG. 1, area image creation processing 1 first creates, from an entirely black area image and the original image, an area image S1 that expresses the constituent elements of the document image as connected black-pixel areas, and label image creation processing 2 then creates a label image S2 that gives each constituent element of the area image S1 a unique number.

[0011] Next, character determination processing 3 classifies characters into, for example, those that make up the body text of a newspaper and all other characters, and sends the classification result to character image creation processing 4. In an ordinary newspaper, the characters that make up the body text are nearly uniform in size, and the characters used for headlines and the like are clearly larger than the body characters. Character determination processing 3 can therefore easily separate body characters from the other characters by, for example, character size. Character image creation processing 4 creates, for the body characters, a first character image S4a in which every area of the label image S2 other than those character areas is set to white and, for the characters other than body characters, a second character image S4b in which every area of the label image S2 other than those character areas is set to white, and sends the first character image S4a to integrated threshold setting processing 5.
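Character determination processing 3 and character image creation processing 4 might look like the following Python sketch. The text says only that characters are separated by size, so the rule used here (take the median bounding-box size as the body-text size and treat anything up to `tolerance` times that size as body text) is an assumption, as are the function names and the premise that every label in the input is a character area.

```python
from statistics import median


def bounding_boxes(labels):
    """Map each nonzero label to its inclusive (top, left, bottom, right) box."""
    boxes = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab:
                t, l, b, r = boxes.get(lab, (y, x, y, x))
                boxes[lab] = (min(t, y), min(l, x), max(b, y), max(r, x))
    return boxes


def split_by_size(labels, tolerance=1.5):
    """Return (first_image, second_image) as 0/1 grids.

    first_image keeps only the labels judged to be body-size characters;
    second_image keeps only the remaining labels. Everything else is white (0).
    The median-based size rule is a hypothetical stand-in for processing 3.
    """
    boxes = bounding_boxes(labels)
    sizes = {lab: max(b - t + 1, r - l + 1) for lab, (t, l, b, r) in boxes.items()}
    body_size = median(sizes.values())
    body = {lab for lab, s in sizes.items() if s <= tolerance * body_size}
    first = [[1 if lab in body else 0 for lab in row] for row in labels]
    second = [[1 if lab and lab not in body else 0 for lab in row] for row in labels]
    return first, second
```

The median is used because body characters are assumed to outnumber headline characters, so the typical size is the body size; a different statistic could be substituted without changing the structure.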

[0012] Integrated threshold setting processing 5 scans the first character image S4a and sets the integrated threshold S5. For a character image such as that in FIG. 3(a), for example, the same integrated threshold setting operation as in Proposal 2 is performed. Whereas Proposal 2 applied this operation to a character image in which the elements of both FIG. 3(a) and FIG. 3(b) were mixed, the present embodiment applies integrated threshold setting processing 5 only to a character image such as that in FIG. 3(a), so an appropriate integrated threshold S5 can be obtained more reliably. Area extraction processing 6 then extracts areas from each of the first and second character images S4a and S4b using the integrated threshold S5. In area extraction processing 6, the same area extraction as in Proposal 2 is applied separately to character images such as those in FIGS. 3(a) and 3(b). For a character image such as that in FIG. 3(b), however, the gaps between blocks are wide, so it is desirable to extract areas using, for example, n times the integrated threshold S5 set by integrated threshold setting processing 5 (n being a real number) as the integrated threshold.

[0013] (II) Content of each process in (I)
(II)(1) Integrated threshold setting processing 5 (FIG. 4)
FIG. 4 shows the content of integrated threshold setting processing 5 in FIG. 1. In FIG. 4, data 51 of the first character image S4a created by character image creation processing 4 in FIG. 1 is first scanned horizontally, and longest white run distribution extraction processing 52 extracts the horizontal white run distribution. The same operation is performed vertically. Next, white run width histogram creation processing 53 creates horizontal and vertical histograms from the white run distributions. White run width detection processing 54 then finds the continuous white run width that gives the maximum value of this histogram, after which integrated threshold determination processing 55 sets the integrated threshold S5, which consists of a horizontal and a vertical integrated threshold.
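The histogram step can be illustrated with the Python sketch below. The text does not spell out how processing 52 to 55 turn the histogram into the final value, so this sketch makes simplifying assumptions: it counts only white runs that have black pixels on both sides, and it returns the most frequent run width in each direction directly as the threshold. All function names are hypothetical.

```python
from collections import Counter


def interior_white_runs(line):
    """Lengths of white (0) runs that have black (1) on both sides."""
    runs, start, seen_black = [], None, False
    for i, v in enumerate(line):
        if v:
            if start is not None and seen_black:
                runs.append(i - start)
            start, seen_black = None, True
        elif start is None:
            start = i
    return runs


def mode_white_run(lines):
    """Most frequent interior white-run width over all lines (None if no runs).

    Ties are broken in favour of the smaller width (an arbitrary choice).
    """
    hist = Counter()
    for line in lines:
        hist.update(interior_white_runs(line))
    if not hist:
        return None
    best = max(hist.values())
    return min(width for width, n in hist.items() if n == best)


def integrated_threshold(char_image):
    """Return (horizontal, vertical) thresholds from a 0/1 character image."""
    rows = char_image
    cols = [list(c) for c in zip(*char_image)]
    return mode_white_run(rows), mode_white_run(cols)
```

On body text, the most frequent horizontal gap would typically be the spacing between characters and the most frequent vertical gap the spacing between lines, which is presumably why restricting the input to the first character image makes the result more stable.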

[0014] (II)(2) Area extraction processing 6 (FIG. 5)
FIG. 5 shows the content of area extraction processing 6 in FIG. 1. First, horizontal division processing 61 performs the first horizontal division on each of the first and second character images S4a and S4b, after which vertical division processing 62 and horizontal division processing 64 are performed alternately. Determination processing 63 and 65 checks, for each of the first and second character images S4a and S4b, whether the first block count obtained by horizontal division matches the second block count obtained by vertical division, and the area extraction for that character image ends when the block count after vertical division equals the block count after horizontal division.
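The alternating division loop can be sketched in Python as follows. This is a hedged reading rather than the patented procedure: it assumes that "horizontal division" cuts a block between rows and "vertical division" between columns, that a cut is made wherever the all-white gap is wider than the integrated threshold for that direction (narrower gaps stay integrated), and that blocks are inclusive `(top, left, bottom, right)` rectangles. The function names are hypothetical.

```python
def split_block(image, block, threshold, axis):
    """Cut one block where more than `threshold` all-white lines separate content.

    axis=0 cuts between rows, axis=1 cuts between columns.
    """
    top, left, bottom, right = block
    if axis == 0:
        lines = [any(image[y][left:right + 1]) for y in range(top, bottom + 1)]
    else:
        lines = [any(image[y][x] for y in range(top, bottom + 1))
                 for x in range(left, right + 1)]
    pieces, start, last, gap = [], None, None, 0
    for i, has_black in enumerate(lines):
        if has_black:
            if start is not None and gap > threshold:
                pieces.append((start, last))
                start = None
            if start is None:
                start = i
            last, gap = i, 0
        else:
            gap += 1
    if start is not None:
        pieces.append((start, last))
    if axis == 0:
        return [(top + a, left, top + b, right) for a, b in pieces]
    return [(top, left + a, bottom, left + b) for a, b in pieces]


def extract_areas(image, tx, ty):
    """Alternate row and column division until the block count stops changing.

    tx is the threshold for gaps between columns, ty for gaps between rows.
    """
    blocks = [(0, 0, len(image) - 1, len(image[0]) - 1)]
    blocks = [p for b in blocks for p in split_block(image, b, ty, 0)]
    axis = 1
    while True:
        threshold = tx if axis == 1 else ty
        new_blocks = [p for b in blocks for p in split_block(image, b, threshold, axis)]
        if len(new_blocks) == len(blocks):
            return new_blocks
        blocks, axis = new_blocks, 1 - axis
```

Calling `extract_areas` once on the first character image with S5 and once on the second character image with a scaled threshold (for example n times S5) would correspond to the separate extraction described in the text; a larger threshold leaves wider gaps uncut, so more widely spaced blocks are merged into one area.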

[0015] As described above, in this embodiment character determination processing 3 in FIG. 1 classifies characters into, for example, those that make up the body text of a newspaper and all other characters, and integrated threshold setting processing 5 sets the integrated threshold S5 from the first character image S4a created from the body characters, so an appropriate integrated threshold can be obtained. Moreover, area extraction processing 6 extracts areas separately from the first and second character images S4a and S4b created from the two kinds of characters. Whereas Proposal 2 risked erroneously integrating the body character area with the other character areas, the method of this embodiment integrates them correctly. The present invention is not limited to the above embodiment, and various modifications are possible. For example, in area extraction processing 6 of FIG. 5, vertical division processing 62 may be performed first, followed by horizontal division processing 61, with the two then repeated alternately; area extraction processing 6 may be performed by a method other than that of FIG. 5; and integrated threshold setting processing 5 may be performed by a method other than that of FIG. 4. The area extraction method of this embodiment can also be applied to general document images other than newspapers.

[0016]

[Effects of the Invention] As described in detail above, according to the first invention, character determination processing distinguishes specific characters in a document image from the other characters, and integrated threshold setting processing sets the integrated threshold from the first character image, which character image creation processing creates from the specific characters, so an appropriate integrated threshold can be obtained. Moreover, because area extraction processing extracts areas separately from the first and second character images created from the two kinds of characters (the specific characters and the others), the first character area and the second character area are not erroneously integrated, and integration is performed correctly. According to the second invention, extracting areas from the second character image with a value different from the integrated threshold used for the first character image makes accurate area extraction possible for document images, such as newspapers, in which the gaps between blocks are wide.

[Brief Description of the Drawings]

[FIG. 1] Processing content diagram of a document image area extraction method according to an embodiment of the present invention.

[FIG. 2] Explanatory diagram of the problems of Proposal 2.

[FIG. 3] Diagram showing examples of the character images in FIG. 1.

[FIG. 4] Diagram showing the content of integrated threshold setting processing 5 in FIG. 1.

[FIG. 5] Diagram showing the content of area extraction processing 6 in FIG. 1.

[Explanation of Symbols]

1 area image creation processing
2 label image creation processing
3 character determination processing
4 character image creation processing
5 integrated threshold setting processing
6 area extraction processing
S1 area image
S2 label image
S4a, S4b first and second character images
S5 integrated threshold

Claims (2)

[Claims]

[Claim 1] A document image area extraction method in which area image creation processing creates, from an entirely black area image and an original image, an area image expressing the constituent elements of a document image as connected black-pixel areas, label image creation processing creates a label image giving each constituent element of the area image a unique number, and the areas of the document image are extracted using the label image, the method being characterized by extracting the areas of the document image by executing:
character determination processing that distinguishes, in the label image, specific characters from the other characters according to character size;
character image creation processing that creates, for the specific characters, a first character image in which every area of the label image other than those character areas is set to white and, for the characters other than the specific characters, a second character image in which every area of the label image other than those character areas is set to white;
integrated threshold setting processing that scans the first character image to set an integrated threshold; and
area extraction processing that uses the integrated threshold to extract areas separately from the first and second character images.

[Claim 2] The document image area extraction method according to claim 1, characterized in that, in the area extraction processing, the integrated threshold is changed when extracting areas from the second character image.
JP4211661A 1992-08-07 1992-08-07 Area extracting method for document image Pending JPH0660221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP4211661A JPH0660221A (en) 1992-08-07 1992-08-07 Area extracting method for document image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP4211661A JPH0660221A (en) 1992-08-07 1992-08-07 Area extracting method for document image

Publications (1)

Publication Number Publication Date
JPH0660221A true JPH0660221A (en) 1994-03-04

Family

ID=16609502

Family Applications (1)

Application Number Title Priority Date Filing Date
JP4211661A Pending JPH0660221A (en) 1992-08-07 1992-08-07 Area extracting method for document image

Country Status (1)

Country Link
JP (1) JPH0660221A (en)

Legal Events

Date Code Title Description
A02 Decision of refusal

Free format text: JAPANESE INTERMEDIATE CODE: A02

Effective date: 19990518