JPH0527157B2

Info

Publication number: JPH0527157B2
Application number: JP59045044A
Authority: JP
Inventors: Tozen Hai; Eiichiro Yamamoto; Yukikazu Kaburayama
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1984-03-09
Filing date: 1984-03-09
Publication date: 1993-04-20
Also published as: JPS60189582A

Description

[Detailed Description of the Invention]

[Technical Field of the Invention]

The present invention relates to a character recognition post-processing method. In particular, it relates to a post-processing method, used in connection with a character recognition device, that corrects the recognition errors arising after text in which kanji are mixed with non-kanji such as hiragana and katakana has been read and recognized.

[Prior art and problems]

Optical character recognition devices, for example, are already used to recognize Japanese text containing kanji and non-kanji. However, methods for correcting the errors that remain after recognition, that is, character recognition post-processing measures, have not yet reached a fully satisfactory stage.

The post-processing approaches considered so far for complex Japanese text containing such characters include two methods. In one, the text is divided into words in advance, for example by writing with spaces between words, so that the writer consciously fixes the position of each word at writing time. In the other, when an error produced by recognition is corrected, the specific position of the erroneous word is indicated.

In both methods, however, a person must intervene and take extra steps at writing time or at correction time. In the examples above, the writer must separate the words when writing, or must be aware of the positions of specific words in advance. In effect, a kind of dummy step that does not serve the final goal is required.

[Purpose of the invention]

The present invention addresses these problems. Its object is to provide a character recognition post-processing method that raises the accuracy of kanji/non-kanji discrimination by identifying kanji and non-kanji in the text with separate identification schemes, that prepares word dictionaries organized by the length of the character strings forming words for the identified kanji strings, and that thereby handles and corrects post-recognition errors efficiently.

[Structure of the invention]

To achieve this object, the character recognition post-processing method of the present invention is applied to a character recognition device capable of classifying characters into kanji and non-kanji such as hiragana. It comprises kanji/non-kanji determination means for determining whether a character is a kanji or a non-kanji, non-kanji identification means for identifying non-kanji, kanji identification means for identifying kanji, kanji character string extraction means for extracting kanji character strings, and word separation means for separating a kanji character string into units of a prescribed number of characters when the string is longer than that number. The method is characterized in that the kanji character string recognized by the kanji identification means is checked against a word dictionary matching its length, so that recognition is performed word by word.

[Embodiments of the invention]

Before the invention is described in detail on the basis of an embodiment, an outline is given with reference to FIG. 2.

Suppose the text shown in FIG. 2(a) is read, for example by OCR, and the kanji and non-kanji are recognized one character at a time to give the recognition result shown in FIG. 2(b). The operator first corrects the errors in the non-kanji portions, for example by key input, to obtain the corrected result shown in FIG. 2(c). The kanji portions are then post-processed by checking them against a word dictionary. Where many kanji occur in succession, the run is cut, for example, every two characters before post-processing, because two-character combinations are the most common form of word. In this way, misrecognized strings such as 「指足」, 「人力」, 「装直」 and 「目一」 can be corrected to the intended words 「指定」 (designation), 「入力」 (input), 「装置」 (device) and 「同一」 (same).
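The two-character cut described in the outline can be sketched as follows. This is a minimal illustration, not the patented circuit: the function name `split_into_units` is hypothetical, and letting a leftover character form a shorter final unit is an assumption, since the description does not say how odd-length runs are handled.

```python
def split_into_units(kanji_run: str, unit: int = 2) -> list[str]:
    """Cut a run of consecutive kanji into pieces of `unit` characters.

    Assumption: when the run length is not a multiple of `unit`,
    the final piece is simply shorter.
    """
    return [kanji_run[i:i + unit] for i in range(0, len(kanji_run), unit)]


if __name__ == "__main__":
    # The six-character run from the description becomes three two-character units.
    print(split_into_units("文字認識装置"))
```

Each resulting unit is what would then be checked against a dictionary of words of the same length.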

An embodiment of the present invention is described with reference to FIGS. 1 to 5.

FIG. 1 is a block diagram of an embodiment of the present invention. FIG. 2 illustrates the recognition error correction process of the invention. FIG. 3 shows the scheme for discriminating kanji from non-kanji in text. FIG. 4 shows an example of analyzing and discriminating a kanji using the number of loops, the number of connected components and the average number of black runs indicated in FIG. 3. FIG. 5 shows an example of the contour line segment sequence scheme of FIG. 3 (see Japanese Unexamined Patent Publication No. 58-225849).

In FIG. 1, 1 denotes a document input section holding the kanji and non-kanji (hiragana, katakana, etc.) to be read; 2, a kanji/non-kanji determination circuit; 3, a non-kanji identification circuit; 4, a kanji identification circuit; 5, a kanji character string extraction circuit; 6, a display device; K, its keys; 7, a word separation circuit; 8, a word dictionary section storing groups of distinct words that are kanji strings of a given length; and 9, a word post-processing circuit.

In the character recognition processing method of the present invention, text containing kanji is first examined to determine accurately, for each character string, whether it consists of kanji or non-kanji. Because a word is a sequence of several consecutive kanji, word-level post-processing is then applied automatically, and only to the kanji strings.

Next, the operation of the embodiment of FIG. 1 is described.

The characters of document input section 1, obtained for example by reading a document with OCR, that is, the kanji and non-kanji in the text, are judged one after another by kanji/non-kanji determination circuit 2. Characters judged to be non-kanji go to identification circuit 3, which identifies which character (hiragana, katakana, etc.) each one is. Characters judged to be kanji are sent to kanji identification circuit 4, which identifies each kanji one character at a time.

The non-kanji output of identification circuit 3 and the kanji output of identification circuit 4 are shown on display device 6, so the operator corrects any erroneously recognized non-kanji, for example hiragana, using keys K of display device 6. In this way 「わ」, 「ほ」 and 「加」 in FIG. 2(b) are corrected to 「れ」, 「は」 and 「が」 as shown in FIG. 2(c) (「加」 results from a kanji/non-kanji misjudgment).

Meanwhile, the kanji output of identification circuit 4 is sent to kanji character string extraction circuit 5, which extracts the kanji strings that lie between non-kanji characters. Each extracted string is passed to word separation circuit 7. If the string is long, for example four, five or six characters, it is cut there into units of a prescribed size. The string 「文字認識装置」 (character recognition device), for instance, is cut into 「文字」, 「認識」 and 「装置」. The kanji units of a given length, for example two characters, are sent to word post-processing circuit 9.

Word post-processing circuit 9 is connected to word dictionary section 8 and receives kanji output from it. Dictionary section 8 holds in advance a large number of distinct words of the given length. Circuit 9 therefore compares each word from word separation circuit 7 with words of the same length from dictionary section 8 and decides whether the recognized word is correct. Identification circuit 4 also extracts several candidates for each character and sends them along, and these candidates are used in the comparison as well.
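The extraction of kanji strings lying between non-kanji can be sketched on already-recognized text. In the patent the kanji/non-kanji decision is made on character images by circuit 2; the sketch below instead assumes, purely for illustration, that membership in the Unicode CJK Unified Ideographs block (U+4E00 to U+9FFF) stands in for that decision. The name `extract_kanji_runs` is hypothetical.

```python
import re

# Assumption: a code point in U+4E00..U+9FFF is treated as "kanji".
# This replaces the image-based judgement of circuit 2 for illustration only.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_kanji_runs(text: str) -> list[tuple[int, str]]:
    """Return (start index, run) for every maximal run of kanji in `text`."""
    return [(m.start(), m.group()) for m in KANJI_RUN.finditer(text)]


if __name__ == "__main__":
    for start, run in extract_kanji_runs("文字認識装置に関する"):
        print(start, run)
```

Keeping the start index lets a corrected word be written back into the text at the position the run came from.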

In other words, errors in non-kanji, for example hiragana such as okurigana, have already been corrected on display device 6 after recognition. If a recognized kanji is in error, the erroneous kanji from word separation circuit 7 is compared in word post-processing circuit 9 with the kanji output of word dictionary section 8, taking into account the characters before and after it. It is judged to be in error and is automatically replaced by the correct word from dictionary section 8 before output.

For example, suppose the text shown in FIG. 2(a) was written on document 1 and that, after reading, the text recognized by kanji identification circuit 4 and non-kanji identification circuit 3 contains the errors shown in FIG. 2(b). The misrecognized text is displayed as it is on display device 6, so the operator inspects it and corrects the erroneous hiragana by pressing keys, as in FIG. 2(c). In the illustrated example, 「わ」 is corrected to 「れ」, 「ほ」 to 「は」, and 「加」, which was wrongly recognized as a kanji, to 「が」. During this hiragana correction step the operator does not need to check or correct the kanji words at all.

Recognition errors in kanji words are corrected automatically, as shown in FIG. 2(d). When kanji words contain errors as in FIG. 2(b), word post-processing circuit 9 takes the erroneous words 「指足」, 「装直」, 「目一」, 「人力」 and 「便用」 supplied by word separation circuit 7, together with the candidates obtained for each of them when they were recognized, and compares them with words read out one after another from word dictionary section 8. The correct words found in this way, 「指定」, 「入力」, 「装置」 and 「同一」, are substituted automatically as shown in FIG. 2(d).
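A hedged sketch of how per-character candidates and a same-length dictionary could combine to replace an erroneous word. The function name `correct_word`, the candidate lists and the tiny dictionary are all invented for illustration; the patent does not specify the search order or how ties between several dictionary matches are resolved.

```python
from itertools import product


def correct_word(candidates: list[list[str]], dictionary: set[str]) -> str:
    """Pick a dictionary word from per-position recognizer candidates.

    candidates[i] lists the candidates for character i, best first.
    Assumption: the first combination found in the dictionary wins;
    if none matches, the top-ranked reading is returned unchanged.
    """
    top = "".join(c[0] for c in candidates)
    if top in dictionary:
        return top
    for combo in product(*candidates):
        word = "".join(combo)
        if word in dictionary:
            return word
    return top


if __name__ == "__main__":
    # Hypothetical two-character dictionary and candidate lists.
    two_char_words = {"指定", "入力", "装置", "同一"}
    print(correct_word([["指"], ["足", "定"]], two_char_words))
    print(correct_word([["人", "入"], ["力"]], two_char_words))
```

A real dictionary would be far larger, so a practical version would probably need a ranking among multiple matches; that choice is outside what the description states.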

Next, a concrete scheme for discriminating kanji from non-kanji is outlined. It is the subject of Japanese Patent Application No. 57-169510, already filed by the same applicant.

As shown in FIG. 3, complex kanji with many strokes are distinguished from non-kanji with few strokes (for example hiragana and katakana) in stages. The first stage analyzes the number of loops and the number of connected components, described below. If that does not settle the matter, the second stage examines the average number of black runs, also described below. Kanji with few strokes, whose stroke count differs little from that of non-kanji (for example 「山」 and 「川」, which are hard to tell apart from hiragana), are examined in the third stage using the contour line segment sequence, which makes the final discrimination.
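The control flow of the three stages can be sketched as a cascade. Everything numeric here is an assumption: the description gives no thresholds, so `loop_th`, `comp_th` and `run_th` are hypothetical placeholders, and the third stage is represented only by a callback because the contour line segment test is not specified in enough detail to implement.

```python
from typing import Callable


def classify(loops: int, components: int, avg_black_runs: float,
             contour_says_kanji: Callable[[], bool],
             loop_th: int = 2, comp_th: int = 4, run_th: float = 3.0) -> str:
    """Three-stage kanji / non-kanji cascade (thresholds are hypothetical)."""
    # Stage 1: many loops or many connected components suggest a complex kanji.
    if loops >= loop_th or components >= comp_th:
        return "kanji"
    # Stage 2: a high average number of black runs also suggests many strokes.
    if avg_black_runs >= run_th:
        return "kanji"
    # Stage 3: few-stroke character; defer to the contour line segment test.
    return "kanji" if contour_says_kanji() else "non-kanji"


if __name__ == "__main__":
    # Feature values of the kind described for a complex kanji.
    print(classify(2, 5, 4.5, lambda: False))
    # A simple shape that only the third stage can resolve.
    print(classify(0, 1, 1.0, lambda: False))
```

Passing the third stage as a callback means the more expensive contour analysis runs only for characters the first two stages leave undecided.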

FIG. 4 shows the discrimination scheme of the first and second stages of FIG. 3 applied to the kanji 「漢」. The loops are the parts on the right side of 「漢」 that form closed rings; in this case the number of loops is two. The number of connected components is the number of separate, independent groups of strokes. In the illustrated example there are three in the sanzui (water) radical, one in the part on the right resembling the kusakanmuri (grass) radical, and one in the block beneath it that contains the loops, giving five in total.
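Counting connected components and loops on a binary character image can be sketched with a flood fill. The conventions below are assumptions, not taken from the patent: black pixels are 1, black regions use 8-connectivity, white regions use 4-connectivity, and a loop is taken to be a white region fully enclosed by black. The function names are hypothetical.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def _count_regions(grid, target, neighbours):
    """Count maximal regions of `target` pixels using breadth-first flood fill."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if grid[y][x] != target or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in neighbours:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and grid[ny][nx] == target):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def connected_components(bitmap):
    """Number of separate black stroke groups (8-connected)."""
    return _count_regions(bitmap, 1, N8)


def loops(bitmap):
    """Number of enclosed white regions: pad with white, then discount the outside."""
    w = len(bitmap[0])
    padded = [[0] * (w + 2)] + [[0] + list(r) + [0] for r in bitmap] + [[0] * (w + 2)]
    return _count_regions(padded, 0, N4) - 1


if __name__ == "__main__":
    # Toy glyph: a ring on the left and a separate vertical bar on the right.
    glyph = ["111.1", "1.1.1", "111.1"]
    bitmap = [[1 if c == "1" else 0 for c in row] for row in glyph]
    print(connected_components(bitmap), loops(bitmap))
```

Padding the image with a white border guarantees that all background touching the edge merges into a single outer region, so subtracting one leaves only the enclosed holes.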

The average number of black runs is computed separately for columns and for rows. In the illustrated example, the number of black runs in a column is the number of black (information-bearing) stretches met when the kanji is scanned vertically: three in the sanzui part on the left and six in the right-hand part. The description then states general expressions for the column average n_y and the row average n_x, but the expressions themselves are missing from this text.
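Since the general expressions are missing, the following is only one plausible reading, stated as an assumption: a black run is a maximal stretch of consecutive black pixels along a scan line, n_y is the mean run count over all columns of the character frame, and n_x is the mean over all rows. The function names are hypothetical.

```python
def runs_in_line(line) -> int:
    """Count maximal stretches of consecutive black (1) pixels in one scan line."""
    runs, prev = 0, 0
    for pixel in line:
        if pixel == 1 and prev == 0:
            runs += 1
        prev = pixel
    return runs


def average_black_runs(bitmap) -> tuple[float, float]:
    """Return (n_y, n_x): mean black runs per column and per row.

    Assumption: the average is taken over every column (or row) of the
    frame, including scan lines that cross no black pixel.
    """
    columns = list(zip(*bitmap))
    n_y = sum(runs_in_line(c) for c in columns) / len(columns)
    n_x = sum(runs_in_line(r) for r in bitmap) / len(bitmap)
    return n_y, n_x


if __name__ == "__main__":
    bitmap = [
        [1, 0, 1],
        [0, 0, 1],
        [1, 0, 1],
    ]
    print(average_black_runs(bitmap))
```

If the intended definition averages only over scan lines that actually cross the character, the divisor would change but the run counting would stay the same.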

Using the number of loops, the number of connected components and the average number of black runs, characters are thus separated into those with many strokes and those with few. Contour line segment features are then used to divide the few-stroke characters into non-kanji and few-stroke kanji. FIG. 5 shows an example of contour line segment feature extraction. Depending on whether a segment is open (○) or closed (●) at the edge of the character, four kinds of contour line segment arise: ○―○, ○―●, ●―○ and ●―●. The sequence in which these four kinds appear is very stable when the original pattern has a simple structure. Examining the order of appearance of the contour line segments of a few-stroke character therefore reveals the category to which the character belongs.

In this way the present invention passes through the three discrimination stages of FIG. 3 and finally distinguishes kanji from non-kanji with considerable accuracy. As described above, recognition errors in non-kanji are corrected by the operator on display device 6, while recognition errors in kanji can be corrected automatically to the correct words.

[Effect of the invention]

As described above, when a document mixing kanji and non-kanji is prepared, the present invention does not require the writer to separate the words of the text, to be aware in advance of particular character positions in the text, or to designate particular words as recognition points. Recognition errors in kanji are corrected automatically, so efficient character recognition post-processing is possible without extra manpower. An optical character recognition device with these properties, for example, can be obtained.

[Brief explanation of drawings]

FIG. 1 shows the configuration of an embodiment of the character recognition post-processing method of the present invention. FIG. 2 shows the process, according to the invention, of correcting recognition errors after character recognition. FIG. 3 shows the scheme for discriminating kanji from non-kanji in text. FIG. 4 shows an example of analyzing and discriminating a kanji by the first and second stages of FIG. 3. FIG. 5 shows an example of kanji discrimination by the third stage of FIG. 3.

In the figures, 1 denotes a document input section; 2, a kanji/non-kanji determination circuit; 3, a non-kanji identification circuit; 4, a kanji identification circuit; 5, a kanji character string extraction circuit; 6, a display device; K, keys; 7, a word separation circuit; 8, a word dictionary section; and 9, a word post-processing circuit.

Claims

[Scope of Claims]

1. A character recognition post-processing method for a character recognition device capable of classifying characters into kanji and non-kanji such as hiragana, comprising: kanji/non-kanji determination means for determining whether a character is a kanji or a non-kanji; non-kanji identification means for identifying non-kanji; kanji identification means for identifying kanji; kanji character string extraction means for extracting a kanji character string; and word separation means for separating a kanji character string into units of a prescribed number of characters when the string is longer than the prescribed number of characters; characterized in that the kanji character string recognized by the kanji identification means is checked against a word dictionary matching its kanji length, so that recognition is performed word by word.