JPH0351981A

JPH0351981A - character recognition device

Info

Publication number: JPH0351981A
Application number: JP1186611A
Authority: JP
Inventors: Mikio Aoki; 三喜男青木
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1989-07-19
Filing date: 1989-07-19
Publication date: 1991-03-06

Abstract

PURPOSE: To recognize characters accurately and quickly by limiting the candidate characters for an extracted character to lowercase letters once a lowercase letter has been recognized partway through a word, and by setting the candidate characters while checking the character spacing at all times. CONSTITUTION: The character recognition device comprises a character input device 102, character display means 103, a ROM 104 storing the character data for recognition and a word dictionary, and a RAM 105 for storing character images. Within a word region, a position 202 at half the height of a lowercase letter is scanned, and the 8-connected component is extracted each time a character is encountered. Features of the extracted character are extracted and compared with all characters in the character data dictionary in the ROM 104, and the best-matching character is taken as the recognition result. Once a character is determined to be lowercase, the candidate characters for subsequently extracted characters are narrowed to lowercase letters; when the character spacing within the same word is larger than necessary, the extracted character is treated as possibly either uppercase or lowercase. Characters can consequently be recognized quickly.

Description

[Detailed Description of the Invention]

[Industrial Field of Application]
The present invention relates to a character recognition device that takes in character images of Western-language documents and the like by optical means, extracts character regions from the document image, and converts them into code numbers.

[Prior Art]
In recent years, rapid progress in character recognition devices has made it possible to automatically extract character regions from a variety of character images, segment and recognize the individual characters, and automatically create document files, and various character recognition methods have been devised.

For example, one approach recognizes characters by comparing their mesh features or peripheral features. These methods are described in Research and Practical Application Report, Vol. 34, No. 1, pp. 47-58.

The mesh feature represents the rough shape distribution of the character as a whole. To extract it, the bounding rectangle of the character is divided into n × n small regions, and the area of the character portion contained in each small region is counted to give the mesh feature. The mesh feature thus consists of n × n values per character, and the character is estimated by comparing the proportion of character area in each of the n × n regions with the data held in a dictionary.

The peripheral feature focuses on information around the periphery of the character. To extract it, the bounding rectangle of the character pattern is first found and each of its sides is divided into n segments. Scanning then proceeds from each segment toward the character, and the area traversed before the character is first encountered and the area traversed before the character is encountered a second time are counted. Applying the same processing to every segment yields a peripheral feature of n × 4 × 2 values, and the character can be estimated by comparing these n × 4 × 2 values with the data held in a dictionary.

Both methods extract the features of the extracted character, compare them with all of the data in the recognition dictionary, and output the closest entry as the recognition result.

Before this recognition is performed, words are extracted one by one from the extracted character line, and characters are then extracted one by one. This extraction is generally done by counting the marginal distribution in the direction perpendicular to the line direction of the extracted character line and estimating the character spacing and word spacing from its values.

[Problems to Be Solved by the Invention]
However, this extraction method cannot extract characters accurately from a character image such as 501 in FIG. 5. In character image 501 the characters do not touch each other at all. Nevertheless, in the marginal distribution perpendicular to the line direction, the distribution of Y and the distribution of O touch and form a single marginal distribution 502. Consequently, when the character regions are estimated from the shape of the marginal distribution, Y and O are either extracted together or forcibly cut at position 503 or 504 in FIG. 5. Recognition of the extracted character image is therefore very likely to give a wrong result.
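
The failure described above can be reproduced on a toy image. The following is a minimal sketch, not taken from the patent: the 5 × 7 binary image `IMAGE` and the helper names `column_profile` and `split_on_gaps` are illustrative assumptions. The image holds two glyphs that never touch, yet the arm of the first overhangs the second, so the column-wise marginal distribution (often called a projection profile) has no empty column between them.

```python
def column_profile(image):
    """Count black pixels (1s) in each column of a binary image given as a list of rows."""
    return [sum(column) for column in zip(*image)]


def split_on_gaps(profile):
    """Return (start, end) column spans, end exclusive, where the profile is non-zero."""
    spans, start = [], None
    for x, count in enumerate(profile):
        if count and start is None:
            start = x                      # a run of inked columns begins
        elif not count and start is not None:
            spans.append((start, x))       # an empty column closes the run
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


# Hypothetical test image: a "T"-like glyph (columns 0-3) whose arm overhangs
# a small "o" (columns 3-5). The two glyphs share column 3 but do not touch.
IMAGE = [
    [1, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 1, 0],
]

profile = column_profile(IMAGE)
spans = split_on_gaps(profile)
print(profile)  # [1, 5, 1, 4, 2, 3, 0]
print(spans)    # [(0, 6)]
```

Because the profile never drops to zero between the two glyphs, gap-based splitting returns a single span covering both, which corresponds to the merged distribution 502.
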

Moreover, when the features of an extracted character are compared with the character data in the dictionary, the comparison is always made against all of the character data and the best-matching character is output. Recognition therefore takes a very long time, which reduces the throughput of the character recognition device and degrades its performance.

The present invention solves these problems. Its object is to provide a character recognition device for Western-language documents and the like that recognizes characters accurately and quickly with a simple algorithm, even in character images whose marginal distributions in the direction perpendicular to the line direction overlap.

[Means for Solving the Problems]
The character recognition device of the present invention is characterized as follows.

(1) It comprises optical image input means for inputting a document image by photoelectrically converting light reflected from the page of a Western-language document or the like; means for detecting the positions of character lines and words in the input image and extracting the words one by one; character recognition means for extracting characters one by one from an extracted word and converting them into character codes while comparing them with a character data dictionary held in advance; means for determining whether a recognized character is uppercase or lowercase; and word dictionary data. When each character is extracted, the spacing between the extracted character and the previously extracted character is checked, and if the spacing within the same word is larger than necessary, the candidate characters for the extracted character are both uppercase and lowercase letters.

(2) Once the means for determining whether a recognized character is uppercase or lowercase has determined a character to be lowercase, the candidate characters for subsequently extracted characters are lowercase letters only.

(3) Characters are extracted by scanning near the center of the height of the character line from left to right or from right to left and, when a character is encountered, extracting the 8-connected component of black pixels.

(4) Before the recognized characters are output, the recognition result is compared with the words in the word dictionary data and corrected.

(5) The comparison with the word dictionary data is made while ignoring symbols such as "-" in the word dictionary data and in the recognized word, after which the word from the word dictionary data is output.

[Embodiment]
The present invention is described in detail below on the basis of an embodiment.

FIG. 1 is a block diagram of the character recognition device of the present invention. The device consists of a CPU 101 that executes processing according to a program, an image input device 102 that inputs character images into a storage device, character display means 103 that displays the recognition results, a ROM 104 that holds the character data for recognition and the word dictionary, and a RAM 105, a storage device that stores the character images.

The character recognition method of the device is described in detail below, following the flowchart in FIG. 4 and referring to FIGS. 2 and 3. First, the image input device 102 optically reads the characters written on a page or the like and stores them as image data in the RAM 105. Word regions are then extracted from the input character image. To extract word regions, the marginal distribution of the input character image along the character line direction is first counted. This marginal distribution (not shown) takes large values where a character line is present and small values between character lines, so the positions of the character lines can easily be estimated from its values. Once the positions of the character lines have been estimated, the marginal distribution of each extracted character line in the direction perpendicular to the line direction is counted. This marginal distribution (not shown) is large where characters are present and small where they are not. By examining the regions where no characters are present, the sizes of the word spacing and character spacing can be estimated and the word regions extracted.

Once a word region has been extracted, the characters within the extracted word are extracted one by one and recognized. 201 in FIG. 2 shows the extracted word. In this word no character touches its neighbors at all. It would therefore be possible to extract the characters by estimating the regions 204 where characters are present and the regions 203 where they are not from the values of the marginal distribution perpendicular to the character line, as shown at 205 in FIG. 2. However, because a character image such as 501 in FIG. 5(a) may be present, the present invention instead scans the word region from left to right at position 202, at half the height of a lowercase letter, and extracts the 8-connected component each time a character is encountered. Next, the features of the extracted character are extracted and compared with all of the characters in the character data dictionary in the ROM 104, and the best-matching character in the dictionary is taken as the recognition result. Recognizing the extracted character at this point also makes it possible to determine whether it is uppercase or lowercase.
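
The scan-line extraction described above might be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the source does not say how the 8-connected component is traced, so a breadth-first flood fill is used here, and the names `extract_component`, `extract_characters`, and the toy `IMAGE` are hypothetical.

```python
from collections import deque


def extract_component(image, seed_row, seed_col):
    """Collect the 8-connected black pixels reachable from the seed (BFS flood fill)."""
    height, width = len(image), len(image[0])
    seen = {(seed_row, seed_col)}
    queue = deque(seen)
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):              # visit all 8 neighbours
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < height and 0 <= nc < width
                        and image[nr][nc] and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    queue.append((nr, nc))
    return seen


def extract_characters(image, scan_row):
    """Scan one row left to right; return (left, right) column bounds per component hit."""
    boxes, claimed = [], set()
    for col, pixel in enumerate(image[scan_row]):
        if pixel and (scan_row, col) not in claimed:
            component = extract_component(image, scan_row, col)
            claimed |= component           # skip later hits on the same glyph
            columns = [c for _, c in component]
            boxes.append((min(columns), max(columns)))
    return boxes


# Hypothetical image: a "T"-like glyph overhanging a small "o" without touching it.
IMAGE = [
    [1, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 1, 0],
]

print(extract_characters(IMAGE, scan_row=3))  # [(0, 3), (3, 5)]
```

The two glyphs come out as separate components even though their column ranges overlap at column 3. The left and right bounds returned here are the positions the description later uses to measure character spacing.
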
In Western-language documents, once a word changes to lowercase partway through, the subsequent characters are normally lowercase as well. Therefore, if the extracted character is determined to be lowercase, the candidate characters for subsequently extracted characters can be narrowed to lowercase letters, which speeds up recognition. In the case of 201 in FIG. 2, the first character is an uppercase A, so the candidates for the second character are all uppercase and lowercase letters. Recognizing the second character, however, shows it to be lowercase, so only lowercase letters are candidates for the characters that follow, and recognition becomes faster.

Western-language documents also contain character images such as the one shown at 202 in FIG. 2. In the marginal distribution perpendicular to the line direction, this character image is interpreted as a single word, as shown at 203 in FIG. 2. Partway through, however, there is a "-" (bar), and the character that follows it is uppercase.
This happens frequently, and recognizing the character as lowercase would give a wrong result. To prevent this it would suffice to recognize the "-" in the middle, but to keep recognition fast the present invention scans only a single line, at half the height of a lowercase letter.
Therefore, in a case such as character image 201, the "-" may not be picked up, as shown in FIG. 2. To avoid misrecognition in such a case as well, the positions of the right and left edges of each character are found when the character is extracted. Comparing the position so obtained with the right-edge position of the previously extracted character gives the spacing from the previous character. If this spacing is extremely large compared with the normal character spacing, it can be inferred that something such as a "-" lies in between, and hence that the extracted character may be uppercase. Recognizing the extracted word in this way yields word 206, without the "-", as shown at 206 in FIG. 2.
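
The candidate-narrowing rule and the spacing check can be combined in a short sketch. Everything named here is an assumption for illustration: `recognize_word`, the `classify` callback standing in for the feature matcher, the example word, the pixel coordinates, and in particular `gap_factor`, since the source says only that the spacing is "extremely large" and gives no threshold.

```python
import string

LOWERCASE = frozenset(string.ascii_lowercase)
ALL_LETTERS = frozenset(string.ascii_letters)


def recognize_word(boxes, classify, normal_gap, gap_factor=2.0):
    """Recognize characters in order, narrowing the candidate set as the text describes.

    boxes      -- (left, right) column bounds of each extracted character
    classify   -- hypothetical callback: (index, candidates) -> recognized letter
    normal_gap -- typical spacing between characters, in pixels
    gap_factor -- assumed threshold multiplier (not specified in the source)
    """
    result, lowercase_seen, prev_right = [], False, None
    for index, (left, right) in enumerate(boxes):
        wide_gap = (prev_right is not None
                    and left - prev_right > gap_factor * normal_gap)
        if wide_gap:
            lowercase_seen = False         # a symbol such as "-" may have been skipped
        candidates = LOWERCASE if lowercase_seen else ALL_LETTERS
        letter = classify(index, candidates)
        lowercase_seen = letter.islower()
        result.append((letter, len(candidates)))
        prev_right = right
    return result


# Hypothetical word "Pre-War": the scan line misses the "-", leaving a wide gap
# between "e" and "W".
TRUTH = "PreWar"
BOXES = [(0, 5), (7, 10), (12, 16), (25, 31), (33, 37), (39, 42)]


def classify(index, candidates):
    """Stub recognizer: returns the known letter and checks it was a candidate."""
    letter = TRUTH[index]
    assert letter in candidates
    return letter


outcome = recognize_word(BOXES, classify, normal_gap=2)
print([size for _, size in outcome])  # [52, 52, 26, 52, 52, 26]
```

In this run the third and sixth characters are matched against 26 candidates instead of 52, and the wide gap before "W" restores the full set so the uppercase letter is not forced into a lowercase reading.
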
Before the word 301 recognized in this way is output, it is compared with the word dictionary in the ROM 104. The purpose of this comparison is to restore symbols (non-letters) that lay in regions where the spacing was extremely large compared with normal during recognition, and the comparison is made while ignoring symbols such as "-" and "'". By comparing in this way, the word 303 in the word dictionary 302 can be found from the recognition result 301, and outputting word 303 gives an accurate recognition result.

As described above, in recognizing Western-language documents and the like, the device exploits the property that once a lowercase letter appears partway through a word the subsequent characters are lowercase: after a lowercase letter has been recognized, the candidate characters for extracted characters are limited to lowercase letters, so the time required for recognition can be greatly shortened.
In addition, characters are extracted as 8-connected components while the spacing between the extracted character and the previously extracted character is checked at all times, so candidate characters are not chosen wrongly, and characters can be recognized accurately even in character images whose marginal distributions in the direction perpendicular to the line direction overlap.

[Effects of the Invention]
As described above, once a recognized character has been determined to be lowercase partway through a word, the present invention narrows the candidate characters for subsequent extracted characters to lowercase letters, which makes recognition faster. Even when a word changes to uppercase partway through, the candidate characters are set while the character spacing is checked at all times, so errors such as recognizing an uppercase letter as lowercase are eliminated. Furthermore, because characters are extracted as 8-connected components, they can be extracted accurately, and therefore recognized correctly, even in character images whose marginal distributions in the direction perpendicular to the line direction overlap. The present invention thus enables fast and accurate character recognition and greatly improves the reliability of character recognition devices that use this method as a component.
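
The word-dictionary correction step described in the embodiment could look roughly like this. It is a sketch under assumptions: the function names, the sample dictionary, and the exact, case-sensitive comparison are all illustrative, and returning the recognized word unchanged when nothing matches is a choice made here because the source does not describe that case.

```python
IGNORED_SYMBOLS = "-'"  # the symbols the text gives as examples


def strip_symbols(word):
    """Remove the ignored symbols so words can be compared on letters alone."""
    return "".join(ch for ch in word if ch not in IGNORED_SYMBOLS)


def correct_with_dictionary(recognized, dictionary):
    """Return the dictionary word equal to `recognized` once symbols are ignored.

    Falls back to `recognized` unchanged when no entry matches (an assumption).
    """
    key = strip_symbols(recognized)
    for entry in dictionary:
        if strip_symbols(entry) == key:
            return entry                   # dictionary form restores the symbol
    return recognized


# Hypothetical dictionary contents.
WORD_DICTIONARY = ["Post-War", "Pre-War", "don't"]

print(correct_with_dictionary("PreWar", WORD_DICTIONARY))  # Pre-War
print(correct_with_dictionary("dont", WORD_DICTIONARY))    # don't
```

A linear search keeps the sketch short; a real dictionary would more plausibly be indexed by the symbol-stripped key.
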

[Brief explanation of drawings]

FIG. 1 is a block diagram of the character recognition device of the present invention.
FIGS. 2 and 3 are diagrams showing how character recognition is performed in the present invention.
201 ... input image
202 ... character extraction scanning line
206 ... word recognition result
301 ... word recognition result
302 ... word dictionary
303 ... final word recognition result
FIG. 4 is a flowchart of the character recognition device of the present invention.
FIG. 5 is a diagram showing how character segmentation positions are determined from the shape of the marginal distribution.

Claims

[Claims]

(1) A character recognition device comprising: optical image input means for inputting a document image by photoelectrically converting light reflected from the page of a Western-language document or the like; means for detecting the positions of character lines and words in the input image and extracting the words one by one; character recognition means for extracting characters one by one from an extracted word and converting them into character codes while comparing them with a character data dictionary held in advance; means for determining whether a recognized character is uppercase or lowercase; and word dictionary data, characterized in that, when each character is extracted, the spacing between the extracted character and the previously extracted character is checked, and if the spacing within the same word is larger than necessary, the candidate characters for the extracted character are both uppercase and lowercase letters.

(2) The character recognition device according to claim 1, wherein, once the means for determining whether a recognized character is uppercase or lowercase has determined a character to be lowercase, the candidate characters for subsequently extracted characters are lowercase letters only.

(3) The character recognition device according to claim 1, wherein the characters are extracted by scanning near the center of the height of the character line from left to right or from right to left and, when a character is encountered, extracting the 8-connected component of black pixels.

(4) The character recognition device according to claim 1, wherein, before the recognized characters are output, the recognition result is compared with the words in the word dictionary data and corrected.

(5) The character recognition device according to claim 1, wherein the comparison with the word dictionary data is made while ignoring symbols such as - (bar) and ' (apostrophe) in the word dictionary data and in the recognized word, after which the word from the word dictionary data is output.