JPH03225576A

JPH03225576A - Device for segmenting word

Info

Publication number: JPH03225576A
Application number: JP2021508A
Authority: JP
Inventors: Koshi Sakurada; 桜田　孔司; Koji Ito; 伊東　晃治; Yoshiyuki Yamashita; 山下　義征
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1990-01-31
Filing date: 1990-01-31
Publication date: 1991-10-04
Anticipated expiration: 2013-05-18
Also published as: JP2753094B2

Abstract

PURPOSE: To segment words from a character string with high accuracy by providing the word segmenting device with a preprocessing means, a pitch estimating means and a word segmenting means. CONSTITUTION: The preprocessing means 100 reads out recording information by an image reading part 101 and stores black/white binary information in an image storage part 102. A character string extracting part 103 sends the character string pattern information S103 of one line from the storage part 102. In the pitch estimating means 110, a block extracting part 111 computes the positional information S111 of the black blocks along the character string, and a pitch calculation part 112 outputs an estimated character pitch P, which is the sum of the maximum width of the black blocks and the minimum width of a blank area adjacent to the widest black block. In the segmenting means 120, a threshold calculating part 121 finds a blank width deciding threshold TH based upon P, and when a blank width exceeds the threshold, a word extracting part 122 generates a word segmenting signal S122 based upon the block position information S111.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word segmentation device for cutting out words from image information of documents such as those written in European languages.

(Prior Art) A conventional technique in this field is described in Japanese Unexamined Patent Application Publication No. 62-133585.

In general, a device that processes document images of European-language text often needs to process words as well as individual characters. For example, an optical character reader (OCR) that performs word recognition after character recognition can correct a misrecognized character by means of the word recognition step. To do so, the device must cut out character lines from the image information of the target document, cut out characters from each line, and also cut out words.

A conventional word segmentation device of this type is described in the above publication. That device builds a histogram (a frequency distribution of a single random variable) of the widths of the breaks in the marginal distribution (projection profile) of the character-string image taken along the character-string direction. It then determines an inter-word gap threshold from the largest width corresponding to a peak of the histogram, judges any break in the marginal distribution at least as wide as that threshold to be an inter-word gap, and cuts out the words accordingly.

(Problems to be Solved by the Invention) However, the device configured as described above has the following problems.

(a) When a character string contains few words, the histogram is unlikely to show a peak corresponding to the inter-word gap. This makes it difficult to determine the inter-word gap threshold, and as a result the words cannot be cut out correctly.

(b) Because a histogram must be built to determine the inter-word gap threshold, the device configuration becomes complicated and the processing speed drops.

The present invention provides a word segmentation device that solves these problems of the prior art, namely that words cannot be cut out accurately and that the device configuration is complicated and the processing speed is reduced.

(Means for Solving the Problems) To solve the above problems, the present invention provides a word segmentation device that cuts out the words of a document, such as a European-language document, based on image information of the document, the device comprising: preprocessing means for extracting a character string pattern from the image information of the document; pitch estimating means for detecting an estimated character pitch of the character string pattern; and segmenting means for detecting, in the character string pattern, a blank width larger than a determination threshold set according to the estimated character pitch, and cutting out words at the position of that blank.

The pitch estimating means is configured to detect the estimated character pitch based on, for example, the marginal distribution (projection profile) of the character string pattern along the character-string direction.

(Operation) With the word segmentation device configured as above, the preprocessing means extracts a character string pattern from the image information of a document such as a European-language document and passes it to the pitch estimating means. The pitch estimating means detects an estimated character pitch based on, for example, the projection profile of the character string pattern along the character-string direction, and passes the estimated character pitch to the segmenting means. The segmenting means sets a determination threshold based on the estimated character pitch, detects blank widths larger than that threshold, and cuts out words at the positions of those blanks.

As a result, words can be cut out with high accuracy regardless of how many words the character string contains, and because the histogram required by the conventional device is no longer needed, the device configuration is simplified and processing is faster. The above problems are thus solved.

(Embodiment) FIG. 1 is a block diagram showing the configuration of a word segmentation device according to an embodiment of the present invention.

This word segmentation device comprises preprocessing means 100 for obtaining character string pattern data S103; pitch estimating means 110 for detecting an estimated character pitch p based on the character string pattern data S103; and segmenting means 120 for setting a determination threshold TH according to the estimated character pitch p, detecting blank widths larger than the threshold TH in the character string pattern data S103, and cutting out words at the positions of those blanks.

The preprocessing means 100 has an image reading unit 101 that reads black-and-white binary image information S101 from a recording medium such as a printed European-language document. The image reading unit 101 is an image scanner or the like. Connected to its output side are an image storage unit 102, such as an image buffer, that stores the image information S101, and a character string extraction unit 103. The character string extraction unit 103 extracts the character string pattern data S103 from the image information S101 stored in the image storage unit 102 and supplies it to the pitch estimating means 110.

The pitch estimating means 110 comprises a block extraction unit 111 and a pitch calculation unit 112. The block extraction unit 111 detects, based on the projection profile of the character string pattern data S103 along the character-string direction, position information S111 of the partial figures that make up the character string (called black blocks). The pitch calculation unit 112 detects the estimated character pitch p based on the black block position information S111. The segmenting means 120 is connected to the output side of the pitch estimating means 110.

The segmenting means 120 comprises a threshold calculation unit 121, which calculates the determination threshold TH from the estimated character pitch p, and a word extraction unit 122. The word extraction unit 122 detects when the blank width between adjacent black blocks, calculated from the black block position information S111, is larger than the determination threshold TH, and outputs a word segmentation signal S122 indicating that a word is to be cut out at that blank position.

FIG. 2 illustrates an example of the word segmentation operation of FIG. 1; the operation of the device of FIG. 1 is described below with reference to it.

In FIG. 2, 111a is the projection profile of the character string pattern data S103 along the character-string direction, B1 to B22 are black blocks, W1 to W4 are blank areas, and H is the combined width of black block B8 and blank area W1.

First, recorded information such as a printed European-language document is read by the image reading unit 101 and stored in the image storage unit 102 as black-and-white binary image information S101.

Next, the character string extraction unit 103 extracts character string pattern data S103 corresponding to one line (for example, "word segmentation method") from the binary image information stored in the image storage unit 102, and supplies it to the block extraction unit 111.

The block extraction unit 111 computes the projection profile 111a of the character string pattern data S103 along the character-string direction, and extracts each continuous region in which the profile 111a is positive as one of the black blocks B1 to B22.

It then calculates the position information S111 of each black block B1 to B22, consisting of the block's start position and end position along the character-string direction, and stores the result.
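The block extraction step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the line image is a list of rows of 0/1 pixels (1 = black), the end position is taken as exclusive so that a block's width is end minus start, and the function names `projection_profile` and `extract_black_blocks` are hypothetical, not taken from the patent.

```python
from typing import List, Tuple


def projection_profile(line_image: List[List[int]]) -> List[int]:
    """Count the black pixels (value 1) in each column of a binary text-line image."""
    if not line_image:
        return []
    width = len(line_image[0])
    return [sum(row[x] for row in line_image) for x in range(width)]


def extract_black_blocks(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) for each maximal run of columns where the profile is positive.

    Assumption: `end` is exclusive, so the block width is end - start.
    """
    blocks = []
    start = None
    for x, value in enumerate(profile):
        if value > 0 and start is None:
            start = x                      # a black block begins here
        elif value == 0 and start is not None:
            blocks.append((start, x))      # the block ended at the previous column
            start = None
    if start is not None:                  # block touching the right edge
        blocks.append((start, len(profile)))
    return blocks


# Tiny hand-made line image (2 rows, 12 columns), purely illustrative.
image = [
    [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
]
profile = projection_profile(image)
blocks = extract_black_blocks(profile)
print(profile)  # [0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 2, 1]
print(blocks)   # [(1, 3), (5, 6), (9, 12)]
```

With an exclusive end position, the blank width between two neighbouring blocks is simply the next block's start minus the current block's end.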

Next, based on the black block position information S111 (start and end positions), the pitch calculation unit 112 computes the sum of the maximum black block width and the minimum width of the blank areas adjacent to the black block having that maximum width, and outputs this sum to the threshold calculation unit 121 as the estimated character pitch p. That is, let S_i and E_i be the start and end positions of the i-th black block along the character-string direction (where S_i < E_i < S_{i+1}, i = 1, 2, 3, ..., N, and N is the number of black blocks). The estimated character pitch p is then given by the following equation.

p = (E_Z - S_Z) + \min(S_Z - E_{Z-1}, S_{Z+1} - E_Z)

where Z is the value of i for which (E_i - S_i) is largest, and min(A, B) denotes the smaller of A and B. The estimated character pitch p obtained in this way is approximately equal to the character pitch within each word.
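The pitch estimate described here (width of the widest black block plus the narrower of its two neighbouring blanks) can be sketched as below. This is a minimal sketch: the function name `estimate_character_pitch` is hypothetical, blocks are assumed to be (start, end) pairs with an exclusive end, and the handling of a widest block at either end of the line (where only one neighbouring blank exists) is an assumption, since the text does not cover that case.

```python
from typing import List, Tuple


def estimate_character_pitch(blocks: List[Tuple[int, int]]) -> int:
    """Estimate the character pitch from sorted (start, end) black blocks.

    p = (E_Z - S_Z) + min(S_Z - E_{Z-1}, S_{Z+1} - E_Z),
    where Z is the index of the widest block.
    """
    if not blocks:
        raise ValueError("at least one black block is required")
    # Index Z of the widest block (the first one wins on ties).
    z = max(range(len(blocks)), key=lambda i: blocks[i][1] - blocks[i][0])
    s_z, e_z = blocks[z]
    gaps = []
    if z > 0:
        gaps.append(s_z - blocks[z - 1][1])   # blank to the left of block Z
    if z < len(blocks) - 1:
        gaps.append(blocks[z + 1][0] - e_z)   # blank to the right of block Z
    # Assumption: with no neighbouring blank at all, only the width is used.
    return (e_z - s_z) + (min(gaps) if gaps else 0)


# Illustrative positions only: the widest block is (6, 14), width 8,
# with blanks of 2 (left) and 1 (right), so the estimate is 8 + 1 = 9.
example_blocks = [(0, 4), (6, 14), (15, 19), (30, 34)]
print(estimate_character_pitch(example_blocks))  # 9
```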

In other words, as shown in FIG. 2, the pitch calculation unit 112 finds the black block B8, which has the largest width among black blocks B1 to B22, and finds the blank area W1, the narrower of the blank areas W1 and W2 adjacent to B8. It calculates the combined width H of black block B8 and blank area W1 and outputs it to the threshold calculation unit 121 as the estimated character pitch p.

Based on the estimated character pitch p (= H), the threshold calculation unit 121 calculates the determination threshold TH for the blank width used to cut out words according to the following equation, and outputs it to the word extraction unit 122.

TH = a \times p

where a is a constant; in this embodiment, for example, a = 0.25.

Based on the position information S111 of each black block stored in the block extraction unit 111, the word extraction unit 122 outputs a word segmentation signal S122, indicating that a word is to be cut out at the blank position, whenever the width of a blank area between adjacent black blocks (for example, W3 and W4) exceeds the determination threshold TH. That is, when (S_{i+1} - E_i) is larger than TH, the gap between the i-th black block and the (i+1)-th black block is judged to be a word boundary and the word segmentation signal S122 is output (i = 1, 2, ..., N-1). In this embodiment, therefore, each time a blank width exceeding 0.25 times the estimated character pitch p is detected, the cut-out position (start and end positions) of one word is fixed.
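The threshold calculation and the word boundary test can be sketched together as follows. This is a hedged illustration, not the patented circuit: `segment_words` is a hypothetical name, blocks are assumed to be sorted (start, end) pairs with an exclusive end, and the result is returned as index ranges of black blocks rather than as a signal such as S122.

```python
from typing import List, Tuple


def segment_words(blocks: List[Tuple[int, int]],
                  pitch: float,
                  a: float = 0.25) -> List[Tuple[int, int]]:
    """Group black blocks into words.

    A word boundary is placed between block i and block i+1 whenever the
    blank width S_{i+1} - E_i exceeds TH = a * pitch.
    Returns (first_block_index, last_block_index) per word, 0-based, inclusive.
    """
    words = []
    if not blocks:
        return words
    threshold = a * pitch                      # TH = a x p
    first = 0
    for i in range(len(blocks) - 1):
        gap = blocks[i + 1][0] - blocks[i][1]  # blank width after block i
        if gap > threshold:
            words.append((first, i))           # close the current word
            first = i + 1
    words.append((first, len(blocks) - 1))     # last word on the line
    return words


# Illustrative positions only. With pitch 9, TH = 2.25; the blanks are
# 2, 1 and 11, so only the last blank is a word boundary.
example_blocks = [(0, 4), (6, 14), (15, 19), (30, 34)]
print(segment_words(example_blocks, pitch=9))  # [(0, 2), (3, 3)]
```

Because the threshold is a fixed fraction of the estimated pitch, no histogram of blank widths is needed, which is the simplification the description claims over the prior art.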

For example, in FIG. 2 the word segmentation signal S122 indicates that black blocks B1 to B4 ("word"), black blocks B5 to B16 ("segmentation"), and black blocks B17 to B22 ("method") each constitute a word. When this word segmentation signal S122 has been output from the word extraction unit 122, word segmentation of the character string pattern data S103 is complete.

As described above, this embodiment has the following advantages.

(a) The estimated character pitch p of the character string pattern data S103 is detected, blank widths larger than the determination threshold TH set according to p are detected in the character string pattern data S103, and word patterns are cut out at those blank positions. Words can therefore be cut out with high accuracy regardless of how many words the character string contains.

(b) The pitch estimating means 110 obtains the black block position information S111 from the projection profile 111a of the character string pattern data S103 along the character-string direction, and detects the estimated character pitch p as the sum (H) of the maximum black block width and the minimum width of the blank areas adjacent to the black block having that maximum width. Compared with the conventional word segmentation device, which had to build a histogram of the widths of the breaks in the projection profile 111a, the device configuration is simpler and processing is faster.

The present invention is not limited to the illustrated embodiment, and various modifications are possible. Examples include the following.

(i) In the above embodiment, the pitch estimating means 110 obtains the black block position information S111 from the projection profile 111a of the character string pattern data S103 along the character-string direction, and detects the estimated character pitch p as the sum (H) of the maximum black block width and the minimum width of the blank areas adjacent to the black block having that maximum width. Other configurations may be used instead. For example, the estimated character pitch p can be detected in various ways, such as obtaining the black block position information from the projection profile 111a and taking the sum of the maximum black block width and the minimum blank area width.
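One reading of this modification is that the minimum blank width is taken over the whole line rather than only next to the widest block. The sketch below follows that reading, which is an interpretation rather than something the text spells out; `estimate_pitch_global` is a hypothetical name and blocks are again (start, end) pairs with an exclusive end.

```python
from typing import List, Tuple


def estimate_pitch_global(blocks: List[Tuple[int, int]]) -> int:
    """Pitch estimate = widest black block + narrowest blank anywhere on the line."""
    if not blocks:
        raise ValueError("at least one black block is required")
    widths = [end - start for start, end in blocks]
    gaps = [blocks[i + 1][0] - blocks[i][1] for i in range(len(blocks) - 1)]
    # Assumption: a line with a single block has no blank, so only the width is used.
    return max(widths) + (min(gaps) if gaps else 0)


# Illustrative positions only. The widest block (0, 8) has a blank of 3 beside it,
# but the narrowest blank on the line is 1, so this variant gives 8 + 1 = 9
# where the adjacent-blank rule of the embodiment would give 8 + 3 = 11.
example_blocks = [(0, 8), (11, 15), (16, 20)]
print(estimate_pitch_global(example_blocks))  # 9
```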

(ii) Each block in FIG. 1 may be implemented as a discrete circuit or by software processing on a microprocessor or the like.

(Effects of the Invention) As described in detail above, according to the present invention, the pitch estimating means detects the estimated character pitch from the character string pattern extracted by the preprocessing means, and the segmenting means then detects, in the character string pattern, blank widths larger than a determination threshold set according to the estimated character pitch and cuts out word patterns at those blank positions. Words can therefore be cut out with high accuracy regardless of how many words the character string contains.

In addition, because the pitch estimating means is configured, for example, to detect the estimated character pitch from the projection profile of the character string pattern along the character-string direction, the histogram required by the conventional device is no longer needed, and a simpler device configuration and faster processing can also be expected.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the configuration of a word segmentation device according to an embodiment of the present invention, and FIG. 2 is a diagram illustrating an example of the word segmentation operation of FIG. 1.

100: preprocessing means, 101: image reading unit, 102: image storage unit, 103: character string extraction unit, 110: pitch estimating means, 111: block extraction unit, 112: pitch calculation unit, 120: segmenting means, 121: threshold calculation unit, 122: word extraction unit.

Claims

[Scope of Claims]

1. A word segmentation device that cuts out the words of a document based on image information of the document, characterized by comprising: preprocessing means for extracting a character string pattern from the image information of the document; pitch estimating means for detecting an estimated character pitch of the character string pattern; and segmenting means for detecting, in the character string pattern, a blank width larger than a determination threshold set according to the estimated character pitch and cutting out words at the position of that blank.

2. The word segmentation device according to claim 1, wherein the pitch estimating means is configured to detect the estimated character pitch based on the marginal distribution (projection profile) of the character string pattern along the character-string direction.