JPH04199454A - Document input device

Info

Publication number: JPH04199454A
Application number: JP2333531A
Authority: JP
Inventors: Jun Sato; 純佐藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1990-11-29
Filing date: 1990-11-29
Publication date: 1992-07-20

Abstract

PURPOSE: To improve the recognition accuracy of post-processing by carrying out the post-processing before the character codes of null characters and null lines are inserted into a character code string. CONSTITUTION: A shaping part 27 calculates the average character pitch and line pitch of a character string from the coordinates of each character read out of a character information buffer 23, then divides the coordinates of each character by the average character and line pitches to calculate the row and column of each character in the character string. The character codes of null characters and null lines are then inserted into the character code string output from a post-processing part 25 so as to approximate the character positions on the original, and the coded document is output through a terminal 28. Because the post-processing is carried out before the character codes of null characters and null lines are inserted, words in the character code string are never divided by those codes, and the recognition accuracy of the post-processing is improved.

Description

[Detailed Description of the Invention]

[Summary]

The present invention relates to a document input device that reads a document and outputs it for storage, and aims to improve recognition accuracy. The device comprises: reading means that reads an original to obtain image data; position measuring means that measures the position of each character from the image data to obtain position information; recognition means that recognizes each character from the image data and converts it into a character code; storage means that stores at least the position information of each character obtained by the position measuring means; post-processing means that compares the character code string obtained by the recognition means with a word dictionary and corrects unrecognized and misrecognized portions; and formatting means that inserts character codes for blank characters and blank lines into the character code string output by the post-processing means, based on the position information of each character held in the storage means, so that each character corresponds to its position on the original.

[Industrial application field]

The present invention relates to a document input device, and more particularly to a document input device that reads and stores documents.

[Conventional technology]

FIG. 2 shows a block diagram of an example of a conventional document input device.

In the figure, image data read from an original by an optical reading device 10 is stored in an image memory 11. A character position measuring unit 12 measures the position of each character to obtain position information, and a character recognition unit 14 recognizes each character and converts it into a character code.

To preserve the layout of the original, a formatting unit 15 inserts blank-character codes between character codes as needed, based on the character position information, so that the position of each character on the original can be reproduced in the coded document.

The recognition rate of the character recognition unit 14 is not 100%, so the coded document contains many rejected portions that could not be recognized and many misrecognized portions involving identical or similar shapes, such as the katakana "エ" and the kanji "工". A post-processing unit 16 corrects the rejected and misrecognized portions, for example by comparing the coded character string with a word dictionary 17, and outputs the character codes of the coded document from a terminal 18.

[Problem to be solved by the invention]

In the conventional device, post-processing is performed after the formatting unit 15 has inserted blank-character codes into the recognized character code string. As a result, words can be split by the blank-character codes inserted into the string and then fail to match the word dictionary, so post-processing does not improve recognition accuracy.
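The failure mode described above can be illustrated with a small sketch. It is not taken from the patent: the dictionary entries, the function name `find_words`, and the use of a full-width blank (U+3000) as the inserted blank-character code are all assumptions made for illustration.

```python
# Hypothetical stand-in for the word dictionary; the source does not list its entries.
WORD_DICTIONARY = {"文書", "入力", "装置"}  # "document", "input", "device"


def find_words(code_string, dictionary=WORD_DICTIONARY):
    """Return the dictionary words that occur as contiguous substrings."""
    return {word for word in dictionary if word in code_string}


# Recognizer output before any layout blanks are inserted.
recognized = "文書入力装置"

# The same text after formatting a widely spaced heading:
# one full-width blank (assumed blank code) between every pair of characters.
formatted = "\u3000".join(recognized)

print(find_words(recognized))  # all three dictionary words are found
print(find_words(formatted))   # empty set: every word is split by a blank
```

Matching against the unformatted string finds every word, while the same lookup on the formatted string finds none, which is the situation in which dictionary-based correction has nothing to work with.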

The present invention has been made in view of the above, and its object is to provide a document input device with improved recognition accuracy.

[Means to solve the problem]

The document input device of the present invention comprises: reading means that reads an original to obtain image data; position measuring means that measures the position of each character from the image data to obtain position information; recognition means that recognizes each character from the image data and converts it into a character code; storage means that stores at least the position information of each character obtained by the position measuring means; post-processing means that compares the character code string obtained by the recognition means with a word dictionary and corrects unrecognized and misrecognized portions; and formatting means that inserts character codes for blank characters and blank lines into the character code string output by the post-processing means, based on the position information of each character held in the storage means, so that each character corresponds to its position on the original.

[Operation]

In the present invention, post-processing is performed before the character codes for blank characters and blank lines are inserted into the character code string. Words in the string are therefore not split by those codes, matching against the word dictionary can be performed reliably, and the recognition accuracy achieved by post-processing improves.

[Embodiment]

FIG. 1 shows a block diagram of an embodiment of the device of the present invention.

In the figure, 20 is an optical reading device such as a CCD image scanner; image data read from the original document is stored in an image memory 21. A character position measuring unit 22 segments each character from the image data and stores the coordinates of the vertex positions of each segmented character in a character information buffer 23 as the character's position information.

A character recognition unit 24 extracts features from each segmented character, such as the number and direction of the strokes that make up the character, matches those features against a dictionary, and converts the recognized character into a character code. If recognition yields multiple candidates, the character codes of all candidates are obtained and stored in the character information buffer 23 in association with the position information of that character.

A post-processing unit 25 reads the character codes from the character information buffer 23 and, using for example a morphological analysis technique, compares the character code string with a word dictionary 26 to verify the validity of each word and of the connections between words. It selects the best candidate for each character and thereby corrects rejected and misrecognized portions.
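Candidate selection against a word dictionary can be sketched as follows. This is a deliberately simplified stand-in, not the morphological analysis the text mentions: it tries every combination of per-character candidates and keeps the one whose greedy longest-match segmentation covers the most characters with dictionary words. The names `coverage` and `post_process`, the scoring rule, and the sample dictionary are assumptions.

```python
from itertools import product


def coverage(text, dictionary):
    """Count characters covered by a greedy longest-match dictionary segmentation."""
    max_len = max(map(len, dictionary), default=0)
    i = covered = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                covered += n
                i += n
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return covered


def post_process(candidates, dictionary):
    """Choose one candidate per character so that dictionary coverage is maximal.

    `candidates` holds, for each character position, the recognizer's
    candidate codes in ranked order. Ties keep the earliest combination,
    which favors the higher-ranked candidates.
    """
    best = max(
        product(*candidates),
        key=lambda combo: coverage("".join(combo), dictionary),
    )
    return "".join(best)


# First position: katakana "エ" (U+30A8) ranked above kanji "工" (U+5DE5).
candidates = [["\u30a8", "\u5de5"], ["場"]]
dictionary = {"工場"}  # hypothetical entry meaning "factory"
print(post_process(candidates, dictionary))  # 工場
```

Enumerating all combinations grows exponentially with the length of the string, so a practical post-processor would search more selectively; the sketch only shows why keeping every candidate in the buffer lets the dictionary overrule the recognizer's first choice.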

A formatting unit 27 calculates the average character pitch and line pitch of the character string from the coordinates of each character read from the character information buffer 23, and divides the coordinates of each character by the average character pitch and line pitch to obtain the row and column of each character within the character string. It then inserts character codes for blank characters and blank lines into the character code string output by the post-processing unit 25 so as to approximate the character positions on the original, and outputs the coded document from a terminal 28.
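The formatting step can be sketched roughly as below. The source does not say how the average pitches are computed, so this sketch assumes a median of the gaps between neighboring distinct coordinates, measures positions from the top-left character, and uses an ASCII space and a newline as the blank-character and blank-line codes. The names `estimate_pitch` and `format_document` are hypothetical.

```python
from statistics import median


def estimate_pitch(coords):
    """Estimate pitch as the median gap between neighboring distinct coordinates."""
    distinct = sorted(set(coords))
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return median(gaps) if gaps else 1.0


def format_document(codes, positions):
    """Place each character on a row/column grid and pad with blanks.

    `codes` is the post-processed character string; `positions` holds one
    (x, y) coordinate per character, in the same order.
    """
    if not positions:
        return ""
    xs = [x for x, _ in positions]
    ys = [y for _, y in positions]
    char_pitch, line_pitch = estimate_pitch(xs), estimate_pitch(ys)
    x0, y0 = min(xs), min(ys)

    grid = {}  # row -> {column -> character code}
    for code, (x, y) in zip(codes, positions):
        row = round((y - y0) / line_pitch)   # coordinate divided by line pitch
        col = round((x - x0) / char_pitch)   # coordinate divided by character pitch
        grid.setdefault(row, {})[col] = code

    lines = []
    for row in range(max(grid) + 1):
        cells = grid.get(row, {})            # a missing row becomes a blank line
        width = max(cells) + 1 if cells else 0
        lines.append("".join(cells.get(col, " ") for col in range(width)))
    return "\n".join(lines)


codes = "ABCDEFGHIJ"
positions = [(0, 0), (10, 0), (40, 0), (50, 0),
             (0, 20), (10, 20), (20, 20), (30, 20),
             (20, 40), (0, 80)]
print(format_document(codes, positions))
```

With these sample coordinates the character pitch comes out as 10 and the line pitch as 20, so the output is `AB  CD`, `EFGH`, `  I`, an empty line, and `J`. Because the blanks are added only at this stage, the string handed to the post-processor is still the unbroken `ABCDEFGHIJ`.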

Because post-processing is thus performed before the character codes for blank characters and blank lines are inserted into the character code string, words in the string are not split by those codes, matching against the word dictionary can be performed reliably, and the recognition accuracy achieved by post-processing improves.

In addition, the character height and width contained in the character position information may of course be used to attach character size information to the coded document.

Alternatively, the character codes obtained by the character recognition unit 24 may be supplied directly to the post-processing unit 25, with only the position information of each character stored in the character information buffer 23. The invention is not limited to the embodiment described above.

[Effect of the invention]

As described above, with the document input device of the present invention, words are not split during post-processing and the recognition accuracy achieved by post-processing improves, which makes the device extremely useful in practice.

[Brief description of the drawings]

FIG. 1 is a block diagram of an embodiment of the device of the present invention, and FIG. 2 is a block diagram of an example of a conventional device. In the figures, 20 is an optical reading device, 21 is an image memory, 22 is a character position measuring unit, 23 is a character information buffer, 24 is a character recognition unit, 26 is a word dictionary, and 27 is a formatting unit.

Patent applicant: Fujitsu Ltd.

Claims

[Claims] 1. A document input device comprising: reading means (20) for reading an original to obtain image data; position measuring means (22) for measuring the position of each character from the image data to obtain position information; recognition means (24) for recognizing each character from the image data and converting it into a character code; storage means (23) for storing at least the position information of each character obtained by the position measuring means; post-processing means (25) for comparing the character code string obtained by the recognition means with a word dictionary and correcting unrecognized and misrecognized portions; and formatting means (27) for inserting character codes for blank characters and blank lines into the character code string output by the post-processing means (25), based on the position information of each character in the storage means (23), so that each character corresponds to its position on the original.