JPH0573716A

JPH0573716A - English character recognition device

Info

Publication number: JPH0573716A
Application number: JP3236677A
Authority: JP
Inventors: Ryoichi Yushimo; 良一湯下
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1991-09-18
Filing date: 1991-09-18
Publication date: 1993-03-26

Abstract

(57) [Abstract]

[Purpose] To improve the recognition rate by segmenting words accurately.

[Configuration] A character recognition unit 5 recognizes characters using the position information of the text areas, line areas, and character areas cut out from a document image input by an image input unit 1. For the resulting character string of each line, a word segmentation processing unit 7 obtains definite word delimiters and candidate word delimiters from the width of the horizontal gaps between adjacent character areas. Based on these, a division candidate generation unit 8 generates multiple division candidates, and for every candidate an English word spelling determination unit 9 checks against an English word spelling dictionary 10 whether each character string produced by the division is correctly spelled as an English word.

[Effect] Performing word segmentation with the spelling information of English words in addition to character spacing information yields accurate word breaks and improves the recognition rate.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an English character recognition device that recognizes English characters.

[0002]

[Prior Art] In recent years, demand has grown for using character recognition devices as input devices for computers and similar systems, and a character recognition device that produces stable recognition results has become indispensable for improving the performance of such systems. In conventional recognition devices, word segmentation of English documents has been performed on the basis of image information alone.

[0003]

[Problems to be Solved by the Invention] As described above, conventional character recognition devices segment words in English documents using image information alone. When recognizing documents with narrow character spacing, documents set in italics, and the like, they therefore placed word breaks incorrectly, which lowered the recognition rate.

[0004] The present invention solves the above problem. Its object is to provide an English character recognition device that performs word segmentation using the spelling information of English words in addition to character spacing information, and can thereby segment words with high accuracy even in documents with narrow character spacing, italic documents, and the like.

[0005]

[Means for Solving the Problems] To solve the above problem, the English character recognition device of the present invention comprises: an image input unit that inputs a document to be recognized; a text area cutout unit that outputs a text area from the input document image; a line area cutout unit that divides the text area line by line and cuts out each line as a line area; a character area cutout unit that divides the characters in a line area one by one and cuts out each as a character area; a character recognition unit that compares the graphic features of the image of a character area with the graphic features of each character type obtained in advance as a recognition dictionary, and determines the recognition result from the degree of similarity between them; a word segmentation processing unit that, based on the spacing between adjacent character areas, obtains definite word delimiters, which are fixed as word breaks, and candidate word delimiters, which are candidates for word breaks; a division candidate generation unit that divides the character string in a region bounded by definite word delimiters in several ways according to combinations of the candidate word delimiters, thereby generating division candidates; and an English word spelling determination unit that determines whether every character string produced by a division is correctly spelled as an English word by checking it against an English word spelling dictionary holding the spellings of English words, and takes the divided character strings with correct spellings as the recognition result.

[0006]

[Operation] With the above configuration, based on the definite word delimiters and candidate word delimiters obtained by the word segmentation processing unit, the division candidate generation unit divides each character string enclosed by definite word delimiters according to combinations of the candidate word delimiters lying between them, generating multiple division candidates. For every candidate obtained, the English word spelling determination unit checks against the English word spelling dictionary whether each character string produced by the division is correctly spelled as an English word. This suppresses word segmentation errors and improves the recognition rate.

[0007]

[Embodiment] An embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram of an English character recognition device according to an embodiment of the present invention. In FIG. 1, 1 is an image input unit that inputs a document to be recognized as a document image; 2 is a text area cutout unit that finds clusters of character strings in the input document image and outputs text areas; 3 is a line area cutout unit that finds horizontal runs of characters in a text area and cuts out each continuous character string as a line; 4 is a character area cutout unit that divides the characters in a line area one by one and cuts out each as a character area; 5 is a character recognition unit that compares the graphic features of the image of a character area with the graphic features of each character type obtained in advance as a recognition dictionary, and determines the recognition result from the degree of similarity between them; 6 is the recognition dictionary, which holds the graphic features of each character type obtained in advance; 7 is a word segmentation processing unit that, based on the spacing between adjacent character areas, obtains definite word delimiters, which are fixed as word breaks, and candidate word delimiters, which cannot be fixed but are candidates for word breaks; 8 is a division candidate generation unit that divides the character string in a region bounded by definite word delimiters in several ways according to combinations of the candidate word delimiters, thereby generating division candidates; 9 is an English word spelling determination unit that determines whether every character string produced by a division is correctly spelled as an English word by checking it against an English word spelling dictionary holding the spellings of English words, and takes the divided character strings with correct spellings as the recognition result; 10 is the English word spelling dictionary, which holds the spelling information of English words; 11 is an internal bus connecting units 1 to 5 and 7 to 9; and 12 and 13 are internal buses connecting 5 with 6 and 9 with 10, respectively. The operation of the English character recognition device of this embodiment, configured as above, is described below using the overall flowchart of FIG. 2 and the conceptual diagram of FIG. 3, which shows the course of the processing.

[0008] First, the document to be recognized is input as a document image by the image input unit 1 (process 14). The input document image is sent to the text area cutout unit 2, which computes histograms of the black pixels in the document image in the horizontal and vertical directions, cuts out the text area from their distribution, and stores its position information as internal data (process 15).
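
The histogram step in process 15 can be sketched as follows. This is a minimal illustration, not the device's actual procedure: it assumes a binarized image given as a list of rows of 0/1 values (1 = black pixel), a single text area, and a simple "count reaches `min_count`" criterion. The function names `projection_histograms`, `bounding_range`, and `text_area` are hypothetical.

```python
def projection_histograms(image):
    """Return (row_hist, col_hist): black-pixel counts per row and per column."""
    row_hist = [sum(row) for row in image]
    col_hist = [sum(col) for col in zip(*image)]
    return row_hist, col_hist


def bounding_range(hist, min_count=1):
    """Return (first, last) indices whose count reaches min_count, or None."""
    hits = [i for i, count in enumerate(hist) if count >= min_count]
    return (hits[0], hits[-1]) if hits else None


def text_area(image):
    """Assumed criterion: the text area is the bounding box of all black pixels."""
    row_hist, col_hist = projection_histograms(image)
    rows = bounding_range(row_hist)
    cols = bounding_range(col_hist)
    if rows is None or cols is None:
        return None  # blank page: no text area found
    return {"top": rows[0], "bottom": rows[1], "left": cols[0], "right": cols[1]}
```

A real page with several text blocks would need the histograms to be split at wide empty bands rather than reduced to one bounding box; the text does not describe how the device handles that case.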

[0009] Next, the position information of the text area is sent to the line area cutout unit 3, which cuts out the lines within the text area (process 16). Exploiting the fact that the gaps between lines are larger than the gaps between adjacent characters, the line area cutout unit 3 finds horizontal runs of characters, cuts out each continuous character string as a line, and stores the position information of all lines cut out within the text area as internal data.
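
One simple way to approximate process 16 is to treat runs of rows that contain black pixels as lines and to close a line only after a sufficiently tall empty band. The sketch below swaps in that row-based technique for illustration; the text does not give the device's exact rule. The name `split_lines` and the parameter `min_gap` are assumptions, and the image is again a list of rows of 0/1 values.

```python
def split_lines(image, min_gap=1):
    """Return inclusive (top, bottom) row ranges, one per detected line.

    A line is closed only after min_gap consecutive rows without black pixels,
    so small vertical gaps inside a line do not split it.
    """
    lines = []
    start = None      # first inked row of the line being built
    last_ink = None   # most recent inked row seen
    for y, row in enumerate(image):
        if any(row):
            if start is None:
                start = y
            last_ink = y
        elif start is not None and y - last_ink >= min_gap:
            lines.append((start, last_ink))
            start = None
    if start is not None:
        lines.append((start, last_ink))
    return lines
```

Raising `min_gap` merges bands separated by thin gaps into one line, which is the trade-off the text alludes to when it contrasts inter-line gaps with inter-character gaps.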

[0010] Next, the position information of the line areas is sent to the character area cutout unit 4, which performs character area cutout processing (process 17). The character area cutout unit 4 examines the connectivity of black pixels within a line area and the changes in the vertical-direction histogram. It splits the line at points where the black pixels are not connected horizontally or where the histogram value falls below a fixed value, treating these as boundaries between characters, obtains the character area of each character, and stores its position information as internal data.
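
The histogram half of process 17 can be illustrated as below. This sketch covers only the "histogram value below a fixed value" criterion and leaves out the black-pixel connectivity check that the text also mentions. The name `split_characters` and the parameter `threshold` are hypothetical, and the line image is a list of rows of 0/1 values.

```python
def split_characters(line_image, threshold=0):
    """Return inclusive (left, right) column ranges, one per character area.

    Columns whose black-pixel count is at or below threshold are treated as
    boundaries between characters.
    """
    col_hist = [sum(col) for col in zip(*line_image)]
    chars = []
    start = None  # left edge of the character being built
    for x, count in enumerate(col_hist):
        if count > threshold:
            if start is None:
                start = x
        elif start is not None:
            chars.append((start, x - 1))
            start = None
    if start is not None:
        chars.append((start, len(col_hist) - 1))
    return chars
```

A threshold above zero lets the split fall on thin bridges between touching characters, at the cost of sometimes cutting through thin strokes.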

[0011] Next, the position information of the character areas is sent to the character recognition unit 5, which recognizes all characters in the text area (process 18). In character recognition, the distribution of black pixels is computed from the image in the character area as a graphic feature and compared with the corresponding graphic features of each character type prepared in advance in the recognition dictionary 6. The degree of similarity between them is computed, and the character type with the highest similarity is taken as the recognition result for that character area.
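
The matching step in process 18 amounts to a nearest-reference search over the recognition dictionary. The sketch below assumes, purely for illustration, that a graphic feature is a flat list of 0/1 pixel values and that similarity is the fraction of agreeing positions; the text specifies neither the feature layout nor the similarity measure. The names `similarity` and `recognize` are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions where two equal-length binary features agree."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def recognize(feature, dictionary):
    """Return (label, score) for the dictionary entry most similar to feature."""
    scored = ((label, similarity(feature, ref)) for label, ref in dictionary.items())
    return max(scored, key=lambda item: item[1])


# Toy 3x3 reference glyphs, flattened row by row (illustrative data only).
toy_dictionary = {
    "I": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "L": [1, 0, 0, 1, 0, 0, 1, 1, 1],
}
```

For example, `recognize([0, 1, 0, 0, 1, 0, 0, 1, 1], toy_dictionary)` picks "I", because that reference differs from the input in only one of nine positions.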

[0012] Based on the position information of the text, line, and character areas obtained in processes 15 to 18 and on the recognition results of the character areas, the word segmentation processing unit 7 performs word segmentation, which divides the continuous character string of each line into words (process 19). Word segmentation looks at the horizontal gaps between adjacent character areas within a line area. If a gap is at least the size at which it can be reliably taken as a word break (constant value 1), it is fixed as a word break (definite word delimiter). If a gap cannot be fixed as a word break but is at least the size at which it may be one (constant value 2), it is treated as a candidate for a word break (candidate word delimiter). Constant values 1 and 2 are determined for each line area from the distribution of gap sizes between the character areas in that line area.
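
The two-threshold rule of process 19 can be sketched as follows. The text says only that constant values 1 and 2 are derived per line from the gap distribution, so the derivation used here (fixed multiples of the median gap) is an assumption made for illustration, as are the names `gaps_between`, `thresholds_from_line`, `classify_gaps`, and both factor parameters. Character areas are given as inclusive (left, right) column ranges.

```python
from statistics import median


def gaps_between(boxes):
    """Horizontal gaps between consecutive inclusive (left, right) boxes."""
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(boxes, boxes[1:])]


def thresholds_from_line(gaps, definite_factor=3.0, candidate_factor=1.5):
    """Assumed rule: constant values 1 and 2 as multiples of the median gap.

    gaps must be non-empty (a line with at least two character areas).
    """
    base = median(gaps)
    return definite_factor * base, candidate_factor * base


def classify_gaps(gaps, threshold1, threshold2):
    """Label each gap 'definite', 'candidate', or 'none' (threshold1 >= threshold2)."""
    labels = []
    for gap in gaps:
        if gap >= threshold1:
            labels.append("definite")
        elif gap >= threshold2:
            labels.append("candidate")
        else:
            labels.append("none")
    return labels
```

Computing the thresholds per line, as the text requires, lets tightly set lines and loosely set lines each use their own scale instead of one page-wide constant.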

[0013] Based on the definite and candidate word delimiters, the division candidate generation unit 8 performs division candidate generation, which divides the run of characters into words and generates candidate character strings as English words (process 20). Division candidate generation divides a character string enclosed by definite word delimiters according to combinations of the candidate word delimiters lying between them, yielding multiple division candidates.

[0014] FIG. 3 shows the division process. In FIG. 3, 24 is one line of the input document, 25 is the result of applying word segmentation to 24, and 26 is the result of dividing the character string between definite word delimiters 1 and 2 by combinations of candidate word delimiters to obtain division candidates. In this example there are three words, "If", "you", and "just", between definite word delimiters 1 and 2, but because the gaps between these words are narrower than the gaps between other words, they cannot be fixed as word breaks, and three candidate word delimiters are obtained as inter-character gaps that may be word breaks. Division candidate generation combines these candidate word delimiters: each of the three is either used or not, giving 2^3 = 8 division candidates, as shown in the division candidate results 26.
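
Division candidate generation (process 20) enumerates every subset of the candidate word delimiters. The sketch below shows one way to do that; the name `division_candidates` is hypothetical, delimiters are represented as character offsets into the string, and the order of the results need not match the numbering of the candidates in FIG. 3. The text does not say where the third candidate delimiter lies, so the offset 7 in the example is an arbitrary placeholder.

```python
from itertools import combinations


def division_candidates(text, candidate_positions):
    """Return every division of text using any subset of candidate_positions.

    Each position p means "a possible break between text[p-1] and text[p]".
    With n positions the result has 2**n divisions.
    """
    positions = sorted(candidate_positions)
    results = []
    for k in range(len(positions) + 1):
        for chosen in combinations(positions, k):
            cuts = [0, *chosen, len(text)]
            results.append([text[a:b] for a, b in zip(cuts, cuts[1:])])
    return results


# Offsets 2 and 5 follow "If" and "you"; offset 7 is an assumed placeholder.
candidates = division_candidates("Ifyoujust", [2, 5, 7])
```

Because the number of divisions doubles with each candidate delimiter, restricting the enumeration to the span between two definite word delimiters keeps the search small.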

[0015] For every candidate obtained above, the English word spelling determination unit 9 determines whether each character string produced by the division is correctly spelled as an English word by checking it against the English word spelling dictionary 10 (process 21).

[0016] In the division candidate results 26, the character string of division candidate 1 is "Ifyoujust"; no English word has this spelling, so the determination result is "incorrect". The character strings of division candidate 2 are "If" and "youjust"; "If" is correct, but "youjust" is not an English word, so the result is again "incorrect". All division candidates are checked in the same way. Division candidate 4, "If" "you" "just", has every string spelled correctly, so it is the correct answer and is output as the recognition result.
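
The spelling determination of process 21 can be sketched as a filter over the division candidates. The function name `first_valid_division` is hypothetical, and two behaviors are assumptions not stated in the text: matching is case-insensitive, and when several divisions pass, the first one in the given order is returned (with `None` when none passes).

```python
def first_valid_division(candidates, dictionary):
    """Return the first division whose every string is in dictionary, else None.

    candidates is a list of divisions, each a list of strings.
    Matching ignores letter case (an assumption made for this sketch).
    """
    known = {word.lower() for word in dictionary}
    for words in candidates:
        if all(word.lower() in known for word in words):
            return words
    return None


# Toy dictionary and the divisions named in the text (illustrative data only).
toy_spelling_dictionary = {"if", "you", "just"}
toy_candidates = [
    ["Ifyoujust"],
    ["If", "youjust"],
    ["Ifyou", "just"],
    ["If", "you", "just"],
]
```

With this data, `first_valid_division(toy_candidates, toy_spelling_dictionary)` rejects the first three divisions and returns `["If", "you", "just"]`, mirroring the outcome described for division candidate 4.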

[0017] Processes 20 and 21 are carried out for every character string enclosed by definite word delimiters, yielding the recognition result for the entire text area (processes 22 and 23).

[0018]

[Effects of the Invention] As described above, by performing word segmentation using the spelling information of English words in addition to character spacing information, the present invention reduces word segmentation errors in documents with narrow character spacing, italic documents, and the like, and improves the recognition rate.

[Brief description of drawings]

FIG. 1 is a block diagram of an English character recognition device according to an embodiment of the present invention.

FIG. 2 is an overall flowchart of the character recognition processing in the same English character recognition device.

FIG. 3 is a conceptual diagram showing the course of the same processing.

[Explanation of symbols]

1 Image input unit
2 Text area cutout unit
3 Line area cutout unit
4 Character area cutout unit
5 Character recognition unit
6 Recognition dictionary
7 Word segmentation processing unit
8 Division candidate generation unit
9 English word spelling determination unit
10 English word spelling dictionary

Claims

[Claims]

1. An English character recognition device comprising: an image input unit that inputs a document to be recognized; a text area cutout unit that outputs a text area from the input document image; a line area cutout unit that divides the text area line by line and cuts out each line as a line area; a character area cutout unit that divides the characters in a line area one by one and cuts out each as a character area; a character recognition unit that compares the graphic features of the image of a character area with the graphic features of each character type obtained in advance as a recognition dictionary, and determines the recognition result from the degree of similarity between them; a word segmentation processing unit that, based on the spacing between adjacent character areas, obtains definite word delimiters, which are fixed as word breaks, and candidate word delimiters, which are candidates for word breaks; a division candidate generation unit that divides the character string in a region bounded by definite word delimiters in several ways according to combinations of the candidate word delimiters, thereby generating division candidates; and an English word spelling determination unit that determines whether every character string produced by a division is correctly spelled as an English word by checking it against an English word spelling dictionary holding the spellings of English words, and takes the divided character strings with correct spellings as the recognition result.