JPH03250385A

JPH03250385A - Character string extracting system

Info

Publication number: JPH03250385A
Application number: JP2048362A
Authority: JP
Inventors: Takashi Ishikawa (石川 孝); Akihiro Oka (岡 昭宏)
Original assignee: Pentel Co Ltd
Current assignee: Pentel Co Ltd
Priority date: 1990-02-28
Filing date: 1990-02-28
Publication date: 1991-11-08

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] (Industrial Field of Application) The present invention relates to a document recognition system that, in order to input document information into a computer, reads a document with an image scanner and converts it into a coded document. In particular, it relates to a document recognition system that takes document image data as input and outputs the character strings of document items.

(Prior Art and Its Problems) For inputting document information into a computer, optical document readers that combine an image scanner with an image processing device have been known. As a method of extracting character string regions from document information, the circumscribed rectangles of connected pixels are merged according to a fixed algorithm. With this method, when ruled lines are present, the pixels forming them must be removed in a preprocessing step, which requires extra computation time.

(Means for Solving the Problems) The present invention was made in view of the above problems. It proposes a character string extraction method that, in character string extraction processing for extracting character string regions from a document image, compares the horizontal and vertical run lengths with a reference value and thereby extracts the character string regions separately from ruled lines.

(Operation) The character string extraction method of the present invention performs basic rectangle extraction and ruled line removal simultaneously and at high speed.

(Embodiment) The basic idea of the present invention is as follows. OR processing of two lines of pixels is performed twice, once from top to bottom and once from bottom to top. During this processing the horizontal and vertical run lengths are checked, and when a run length exceeds a predetermined reference value, OR processing is not performed on the pixels forming that run. In other words, the reference value for the run length corresponds to the maximum character size, and connected pixels exceeding this value are recognized not as characters but as ruled lines.
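The run length check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `horizontal_runs` and `classify_runs`, the 0/1 row representation, and the labels `"character"` and `"ruled_line"` are all assumptions introduced here. Only the horizontal direction is shown; the vertical case would apply the same logic to columns.

```python
from typing import List, Tuple


def horizontal_runs(row: List[int]) -> List[Tuple[int, int]]:
    """Return (start, length) for each maximal run of ON (1) pixels in a row."""
    runs = []
    start = None
    for x, value in enumerate(row):
        if value and start is None:
            start = x                        # a new run begins here
        elif not value and start is not None:
            runs.append((start, x - start))  # the run ended at x - 1
            start = None
    if start is not None:                    # run reaches the end of the row
        runs.append((start, len(row) - start))
    return runs


def classify_runs(row: List[int], reference: int) -> List[Tuple[int, int, str]]:
    """Label a run 'ruled_line' when its length exceeds the reference value.

    The reference value stands for the maximum character size in pixels.
    """
    return [
        (start, length, "ruled_line" if length > reference else "character")
        for start, length in horizontal_runs(row)
    ]


if __name__ == "__main__":
    row = [1, 1, 0, 1, 1, 1, 1, 1, 1, 0]
    print(classify_runs(row, reference=4))
```

With a reference value of 4, the two-pixel run is kept as a character candidate and the six-pixel run is treated as part of a ruled line.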

The present invention will now be described with reference to the accompanying drawings. FIG. 1 shows the general flow, FIG. 2 the data at each step, and FIG. 3 the basic rectangle extraction processing flow.

In step 1 of the general flow in FIG. 1, the image of the document is read with an image scanner, and an image such as that shown in FIG. 2(a) is input as data.

In step 2, OR processing in four directions is performed as basic rectangle extraction. The order of the directions (downward, upward, rightward, leftward) can be set arbitrarily.
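One way to handle four processing directions with a single routine is to describe each direction by a step vector and derive the scan order and the orthogonal neighbours from it. The sketch below is an assumption about how this could be organized; the names `DIRECTIONS`, `scan_order`, and `orthogonal_neighbours` do not come from the source.

```python
from typing import Iterator, List, Tuple

# Step (dy, dx) from the preceding pixel to the current pixel, per direction.
DIRECTIONS = {
    "down": (1, 0),
    "up": (-1, 0),
    "right": (0, 1),
    "left": (0, -1),
}


def scan_order(height: int, width: int, direction: str) -> Iterator[Tuple[int, int]]:
    """Yield (y, x) so that each pixel's predecessor along the direction comes first."""
    dy, dx = DIRECTIONS[direction]
    ys = range(height) if dy >= 0 else range(height - 1, -1, -1)
    xs = range(width) if dx >= 0 else range(width - 1, -1, -1)
    for y in ys:
        for x in xs:
            yield y, x


def orthogonal_neighbours(y: int, x: int, direction: str) -> List[Tuple[int, int]]:
    """Return the two pixels adjacent to (y, x) orthogonally to the direction."""
    dy, dx = DIRECTIONS[direction]
    # Swapping the step components gives a vector at right angles to it.
    return [(y + dx, x + dy), (y - dx, x - dy)]


if __name__ == "__main__":
    # The order of the passes is a free choice; this one is only an example.
    for name in ["down", "up", "right", "left"]:
        print(name, list(scan_order(2, 2, name)))
```

Because the order of the passes is just the order of the names in a list, changing it requires no change to the per-direction logic.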

In this embodiment the order is downward, upward, rightward, leftward (see FIG. 3). Because each OR process uses the same algorithm except for its direction, the downward process is taken as the example, shown in FIG. 4. FIG. 4 shows the original image before OR processing. O denotes an ON pixel (one that holds data), and × denotes the point of interest. Processing has proceeded downward; at the point of interest ×, the pixel itself is not ON but the preceding pixel in the processing direction (marked 0) is ON, so the point of interest × becomes a candidate point. Whether the pixel at this candidate point is turned ON is decided by the following two criteria.

(1) The run length (number of consecutive pixels) up to the preceding pixel in the processing direction is less than the reference value.

(2) At least one of the pixels adjacent to the candidate point in the direction orthogonal to the processing direction is ON.

Here, the reference value for the run length corresponds to the maximum character size (in pixels) and is set in advance. In the example of FIG. 4, if the reference value is set to 10, the run length at the point of interest × is 3, which satisfies criterion (1). Furthermore, the pixel to its left is ON, so criterion (2) is also satisfied. The point of interest × is therefore turned ON (shown as O; see FIG. 5). In the state of FIG. 5, the next point of interest × does not satisfy criterion (2), so it is not turned ON.
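A downward pass applying criteria (1) and (2) might look like the sketch below. It is written under several assumptions that the source does not settle: the image is a list of rows of 0/1 values, the run length counts pixels turned ON earlier in the same pass, and the left and right neighbours are read from the image as it stood before the pass. The function name `or_pass_down` and the sample grid are hypothetical and are not a reproduction of FIG. 4.

```python
from typing import List


def or_pass_down(image: List[List[int]], reference: int) -> List[List[int]]:
    """Return a copy of the image after one downward OR pass."""
    height = len(image)
    width = len(image[0]) if height else 0
    out = [row[:] for row in image]   # the input image is left unchanged
    run = [0] * width                 # vertical run length ending at the previous row
    for y in range(height):
        for x in range(width):
            if out[y][x]:
                run[x] += 1           # an ON pixel extends the run in its column
                continue
            # Candidate point: this pixel is OFF and the one above it is ON.
            prev_on = y > 0 and out[y - 1][x]
            if prev_on:
                left_on = x > 0 and image[y][x - 1]
                right_on = x + 1 < width and image[y][x + 1]
                # Criterion (1): run length below the reference value.
                # Criterion (2): at least one orthogonal neighbour is ON.
                if run[x] < reference and (left_on or right_on):
                    out[y][x] = 1
                    run[x] += 1
                    continue
            run[x] = 0                # the pixel stays OFF, so the run is broken
    return out


if __name__ == "__main__":
    image = [
        [0, 1, 0],
        [0, 1, 0],
        [0, 1, 0],
        [1, 0, 0],   # candidate at column 1: run of 3 above, left neighbour ON
        [0, 0, 0],
    ]
    for row in or_pass_down(image, reference=10):
        print(row)
```

In this grid the candidate below the three-pixel run is turned ON because both criteria hold, while the pixel below it stays OFF because neither of its side neighbours is ON. Lowering the reference value to 3 leaves the image unchanged, since the run length is then no longer below the reference.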

Next, the effect of criterion (1) on ruled line removal is described with reference to FIG. 6. The run length up to the pixel just before the point of interest × is 10, which is equal to or greater than the predetermined reference value (that is, not less than it), so the point is not turned ON even if criterion (2) is satisfied. This prevents the ruled line region from being enlarged. Criterion (1) is stated as "less than the reference value" for the following reason: when processing proceeds in the direction of FIG. 7 after the processing of FIG. 6, the point of interest × satisfies criterion (1) and would be turned ON, so a margin is provided for this case.

The extracted basic rectangles are shown in FIG. 2(b).

In step 3, the basic rectangles are projected in the direction orthogonal to the longitudinal direction of the character string, and each character element is formed as the rectangle circumscribing a set of basic rectangles whose projections overlap (see FIG. 2(c)).

In step 4, the character elements are projected in the longitudinal direction of the character string, and the rectangle circumscribing each set of overlapping elements is extracted as a character string region (see FIG. 2(d)).
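Steps 3 and 4 both group rectangles whose projections overlap and then take the circumscribing rectangle of each group. The sketch below illustrates that idea under stated assumptions: rectangles are `(x0, y0, x1, y1)` tuples with inclusive pixel coordinates, the text runs horizontally, and the example contains a single line of text, since the source does not describe how separate lines are kept apart. The names `bounding_box` and `merge_by_projection` are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive coordinates


def bounding_box(rects: List[Rect]) -> Rect:
    """Return the rectangle circumscribing all given rectangles."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def merge_by_projection(rects: List[Rect], axis: int) -> List[Rect]:
    """Group rectangles whose projections overlap and circumscribe each group.

    axis=0 compares the x intervals (projection onto the x axis),
    axis=1 compares the y intervals (projection onto the y axis).
    """
    lo, hi = (0, 2) if axis == 0 else (1, 3)
    groups: List[List[Rect]] = []
    ends: List[int] = []              # largest interval end seen in each group
    for rect in sorted(rects, key=lambda r: r[lo]):
        if groups and rect[lo] <= ends[-1]:
            groups[-1].append(rect)   # overlaps the current group
            ends[-1] = max(ends[-1], rect[hi])
        else:
            groups.append([rect])     # starts a new group
            ends.append(rect[hi])
    return [bounding_box(group) for group in groups]


if __name__ == "__main__":
    basic = [(0, 0, 3, 4), (2, 6, 5, 9), (8, 1, 11, 9), (10, 0, 12, 3)]
    # Step 3: for horizontal text, projecting orthogonally to the line
    # direction means comparing x intervals.
    elements = merge_by_projection(basic, axis=0)
    # Step 4: projecting along the line direction means comparing y intervals.
    lines = merge_by_projection(elements, axis=1)
    print(elements)
    print(lines)
```

Sorting by the interval start lets each group be built in one sweep, because a rectangle that does not overlap the running end of the current group cannot overlap any earlier group either.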

(Effects of the Invention) With the configuration described above, the basic rectangle extraction processing consists only of simple OR processing and comparison with the run length reference value, so it can be performed at high speed. In addition, ruled lines are removed at the same time, so the ruled line removal preprocessing required by conventional methods is unnecessary and processing becomes more efficient.

[Brief explanation of drawings]

The drawings show one embodiment of the present invention. FIG. 1 shows the general flow of the invention, FIG. 2 the data at each step, and FIG. 3 the basic rectangle extraction processing flow; FIGS. 4 to 7 are diagrams explaining the processing.

Claims

[Claims]

A character string extraction method characterized in that, in character string extraction processing for extracting character string regions from a document image, the character string regions are extracted separately from ruled lines by comparing the horizontal and vertical run lengths with a reference value.