JPH04199454A - Document input device - Google Patents
Info
- Publication number
- JPH04199454A
- Authority
- JP
- Japan
- Prior art keywords
- character
- post
- character code
- null
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Character Discrimination (AREA)
- Document Processing Apparatus (AREA)
Description
[Detailed Description of the Invention]
[Summary]
The invention relates to a document input device that reads a document and outputs it for storage, and aims to improve recognition accuracy. The device comprises: reading means that reads an original to obtain image data; position measuring means that measures the position of each character from the image data to obtain position information; recognition means that recognizes each character from the image data and converts it into a character code; storage means that stores at least the position information of each character obtained by the position measuring means; post-processing means that collates the character code string obtained by the recognition means against a word dictionary and corrects unrecognized and misrecognized portions; and formatting means that inserts character codes for blank characters and blank lines into the character code string output by the post-processing means, based on the position information of each character held in the storage means, so that the string corresponds to the position of each character on the original.
The present invention relates to a document input device, and more particularly to a document input device that reads and stores documents.
FIG. 2 shows a block diagram of an example of a conventional document input device.
In the figure, image data read from an original by an optical reading device 10 is stored in an image memory 11. A character position measuring unit 12 measures the position of each character to obtain position information, and a character recognition unit 14 recognizes each character and converts it into a character code.
To preserve the layout of the original, a formatting unit 15 inserts blank character codes between character codes as needed, based on the character position information, so that the position of each character in the original can be reproduced in the coded document.
The recognition rate of the character recognition unit 14 is not 100%, so the coded document contains many rejected portions that could not be recognized, as well as many misrecognized portions involving identical or similar shapes, such as the katakana "エ" and the kanji "工". A post-processing unit 16 corrects the rejected and misrecognized portions, for example by collating the coded character string against a word dictionary 17, and outputs the character codes of the coded document from a terminal 18.
In the conventional device, post-processing is performed after the formatting unit 15 has inserted blank character codes into the recognized character code string. A blank character code inserted into the string can therefore split a word, so that the word cannot be collated against the word dictionary, and post-processing fails to improve recognition accuracy.
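The word-splitting problem can be illustrated with a small sketch. The dictionary contents, the sample strings, and the helper `all_words_known` are hypothetical and serve only to show why a dictionary lookup fails once layout blanks have been inserted inside a word.

```python
# Hypothetical word dictionary; the patent does not list its contents.
WORD_DICTIONARY = {"document", "input", "device"}


def all_words_known(text):
    """Return True when every blank-separated token is a dictionary word."""
    return all(token in WORD_DICTIONARY for token in text.split())


recognized = "document input"         # recognizer output before formatting
formatted_first = "docu  ment input"  # layout blanks inserted inside a word

print(all_words_known(recognized))       # True
print(all_words_known(formatted_first))  # False: "docu" and "ment" are not words
```

Running the dictionary check on the unformatted string avoids the failure, which is the ordering the invention adopts.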
The present invention has been made in view of the above points, and its object is to provide a document input device with improved recognition accuracy.
The document input device of the present invention comprises:
reading means that reads an original to obtain image data;
position measuring means that measures the position of each character from the image data to obtain position information;
recognition means that recognizes each character from the image data and converts it into a character code;
storage means that stores at least the position information of each character obtained by the position measuring means;
post-processing means that collates the character code string obtained by the recognition means against a word dictionary and corrects unrecognized and misrecognized portions; and
formatting means that inserts character codes for blank characters and blank lines into the character code string output by the post-processing means, based on the position information of each character in the storage means, so that the string corresponds to the position of each character on the original.
In the present invention, post-processing is performed before the character codes for blank characters and blank lines are inserted into the character code string. Words in the character code string are therefore not split by those codes, collation against the word dictionary can be performed reliably, and post-processing improves recognition accuracy.
FIG. 1 shows a block diagram of an embodiment of the device of the present invention.
In the figure, 20 denotes an optical reading device such as a CCD image scanner, and the image data it reads from the original document is stored in an image memory 21. A character position measuring unit 22 segments each character from the image data and stores the coordinates of the vertex positions of each segmented character in a character information buffer 23 as that character's position information.
A character recognition unit 24 extracts features of each segmented character, such as the number and direction of the lines that make up the character, collates these features against a dictionary, and converts the recognized character into a character code. If recognition yields multiple candidates, the character codes of all candidates are obtained and stored in the character information buffer 23 in association with the position information of that character.
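One way to picture an entry of the character information buffer is as a record that holds a position together with every candidate code. The sketch below is only an assumption about the layout: the class name `CharRecord`, its field names, and the sample coordinates are illustrative and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class CharRecord:
    """One slot of the character information buffer (names are illustrative)."""
    x: int  # vertex x coordinate of the segmented character on the page
    y: int  # vertex y coordinate of the segmented character on the page
    candidates: list = field(default_factory=list)  # candidate codes, best first


buffer = []

# The position measuring step stores the coordinates first.
buffer.append(CharRecord(x=120, y=48))

# The recognition step later attaches every candidate it found to that slot.
buffer[0].candidates.extend(["エ", "工"])

print(buffer[0])
```

Keeping the candidates next to the coordinates lets later stages read either piece of information without re-running segmentation.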
A post-processing unit 25 reads the character codes from the character information buffer 23 and, using for example a morphological analysis technique, collates the character code string against a word dictionary 26 to verify the validity of each word and of the connections between words. It selects the best candidate for each character, thereby correcting rejected and misrecognized portions.
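Candidate selection against a word dictionary can be sketched as follows. This is a deliberately simplified stand-in, not morphological analysis: it tries every combination of candidates for a single word and keeps the first one found in the dictionary. The function name `pick_candidates` and the dictionary contents are hypothetical.

```python
from itertools import product


def pick_candidates(candidate_lists, dictionary):
    """Return the first candidate combination that forms a dictionary word.

    candidate_lists holds one non-empty list of candidates per character,
    best first. If no combination is a known word, fall back to the
    top-ranked candidate of each character.
    """
    for combo in product(*candidate_lists):
        word = "".join(combo)
        if word in dictionary:
            return word
    return "".join(cands[0] for cands in candidate_lists)


dictionary = {"工場", "エラー"}

# Katakana "エ" and kanji "工" look alike, so both were kept as candidates.
print(pick_candidates([["エ", "工"], ["場"]], dictionary))  # 工場
```

A real implementation would also score connections between neighbouring words, as the description mentions, rather than checking each word in isolation.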
A formatting unit 27 calculates the average character pitch and line pitch of the text from the coordinates of each character read from the character information buffer 23, and divides the coordinates of each character by the average character pitch and line pitch to obtain the row and column of each character within the text. It then inserts character codes for blank characters and blank lines into the character code string output from the post-processing unit 25 so as to approximate the character positions on the original, and outputs the coded document from a terminal 28.
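The row and column calculation can be sketched as below. The sketch assumes the two pitches have already been estimated and are passed in, that coordinates are measured from the top-left corner of the page, and that each tuple already carries the corrected character; the function name `layout` and the sample data are hypothetical.

```python
def layout(chars, char_pitch, line_pitch):
    """Place characters on a text grid by dividing coordinates by the pitches.

    chars is a list of (x, y, character) tuples. Gaps between columns become
    blank characters and skipped rows become blank lines.
    """
    rows = {}
    for x, y, ch in chars:
        row = round(y / line_pitch)  # line index on the page
        col = round(x / char_pitch)  # column index within the line
        rows.setdefault(row, {})[col] = ch
    if not rows:
        return ""
    lines = []
    for row in range(max(rows) + 1):
        cols = rows.get(row, {})
        width = max(cols) + 1 if cols else 0
        lines.append("".join(cols.get(c, " ") for c in range(width)))
    return "\n".join(lines)


chars = [(0, 0, "A"), (10, 0, "B"), (30, 0, "C"), (0, 40, "D")]
print(repr(layout(chars, char_pitch=10, line_pitch=20)))  # 'AB C\n\nD'
```

Here the gap between "B" and "C" yields one blank character and the skipped row yields one blank line, which approximates the layout of the original.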
Because post-processing is performed before the character codes for blank characters and blank lines are inserted into the character code string, words in the string are not split by those codes. Collation against the word dictionary can therefore be performed reliably, and post-processing improves recognition accuracy.
In addition, the character height and width contained in the character position information may of course be used to add character size information to the coded document.
Alternatively, the character codes obtained by the character recognition unit 24 may be supplied directly to the post-processing unit 25, with only the position information of each character stored in the character information buffer 23. The invention is not limited to the above embodiment.
As described above, with the document input device of the present invention, words are not split during post-processing, post-processing improves recognition accuracy, and the device is extremely useful in practice.
FIG. 1 is a block diagram of an embodiment of the device of the present invention, and FIG. 2 is a block diagram of an example of a conventional device.
In the figures:
20 is the optical reading device,
21 is the image memory,
22 is the character position measuring unit,
23 is the character information buffer,
24 is the character recognition unit,
26 is the word dictionary, and
27 is the formatting unit.
Patent applicant: Fujitsu Ltd.
[FIG. 1]
[FIG. 2: block diagram of the conventional device]
Claims (1)
[Claims]
A document input device comprising:
reading means (20) that reads an original to obtain image data;
position measuring means (22) that measures the position of each character from the image data to obtain position information;
recognition means (24) that recognizes each character from the image data and converts it into a character code;
storage means (23) that stores at least the position information of each character obtained by the position measuring means;
post-processing means (25) that collates the character code string obtained by the recognition means (24) against a word dictionary and corrects unrecognized and misrecognized portions; and
formatting means (27) that inserts character codes for blank characters and blank lines into the character code string output by the post-processing means (25), based on the position information of each character in the storage means (23), so that the string corresponds to the position of each character on the original.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2333531A JPH04199454A (en) | 1990-11-29 | 1990-11-29 | Document input device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| JPH04199454A (en) | 1992-07-20 |
Family
ID=18267087
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| JP2333531A Pending JPH04199454A (en) | 1990-11-29 | 1990-11-29 | Document input device |
Country Status (1)
| Country | Link |
|---|---|
| JP (1) | JPH04199454A (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS5534786A (en) * | 1978-09-05 | 1980-03-11 | Nippon Telegr & Teleph Corp <Ntt> | Document editing device |
| JPS62197882A (en) * | 1986-02-26 | 1987-09-01 | Toshiba Corp | Sentence input device |
| JPH02125389A (en) * | 1988-07-01 | 1990-05-14 | Ricoh Co Ltd | Space detecting method |
| JPH02255947A (en) * | 1989-01-24 | 1990-10-16 | Fuji Electric Co Ltd | Production method for document file |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP2000353215A (en) | Character recognition device and recording medium where character recognizing program is recorded | |
| EP0564827A2 (en) | A post-processing error correction scheme using a dictionary for on-line handwriting recognition | |
| JPS63182793A (en) | Character segmenting system | |
| JPS63216189A (en) | Character recognition system | |
| JPH04199454A (en) | Document input device | |
| JPH11328315A (en) | Character recognition device | |
| JP3173363B2 (en) | OCR maintenance method and device | |
| KR100301216B1 (en) | Online text recognition device | |
| JPH0728935A (en) | Document image processor | |
| JP2985813B2 (en) | Character string recognition device and knowledge database learning method | |
| JPH0319589B2 (en) | ||
| JP2851865B2 (en) | Character recognition device | |
| JP2845463B2 (en) | Pattern recognition device | |
| JP2903599B2 (en) | Character recognition device | |
| JPH01265378A (en) | European character recognizing system | |
| JP2972443B2 (en) | Character recognition device | |
| JPH0877293A (en) | Character recognition apparatus and method for creating dictionary for character recognition | |
| JP3476872B2 (en) | Character recognition device | |
| JP2963474B2 (en) | Similar character identification method | |
| JP2784004B2 (en) | Character recognition device | |
| JP3345469B2 (en) | Word spacing calculation method, word spacing calculation device, character reading method, character reading device | |
| JPH06119497A (en) | Character recognition method | |
| JPH0576674B2 (en) | ||
| JPS60138689A (en) | Character recognizing method | |
| JPH01171080A (en) | Recognizing device for error automatically correcting character |