JPH04195672A

JPH04195672A - Phrase segmentation device

Info

Publication number: JPH04195672A
Application number: JP2331063A
Authority: JP
Inventors: Shigeki Kuga; 空閑　茂起
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1990-11-28
Filing date: 1990-11-28
Publication date: 1992-07-15
Anticipated expiration: 2012-03-19
Also published as: JP2592995B2

Abstract

PURPOSE: To speed up phrase segmentation by treating specific hiragana (the cursive Japanese syllabary) character strings as single phrases and correcting the phrase-break positions when such a string is present in a sentence into which phrase breaks have been inserted. CONSTITUTION: When a unit of segmentation is extracted from the document stored in a text storage means 1, a character type discrimination means 4 determines the type of each character in the sentence, and each result is replaced with one of two codes and stored in order in a discrimination result storage means 6. A phrase segmentation means 7 then inserts a phrase break into the read sentence whenever it detects a predetermined code transition point. Next, a phrase segmentation correction means 9 traces back through the sentence from each phrase break, compares the input character string with the hiragana strings registered in a specific character dictionary means 8, recognizes a matching hiragana string as one phrase, and corrects and outputs the phrase-break positions. Because language processing can be performed after the phrase positions have been determined, processing time can be greatly reduced.

Description

DETAILED DESCRIPTION OF THE INVENTION (a) Field of Industrial Application This invention relates to a phrase (bunsetsu) segmentation device, and more specifically to a phrase segmentation device suitable for devices that perform language processing, such as word processors, translation devices, proofreading devices, and devices that use databases.

(b) Prior Art To cut phrases out of converted Japanese text that contains no phrase-break information, for example for translation or proofreading, conventional devices segment the text while consulting dictionaries such as an independent-word dictionary, an attached-word dictionary, and an affix dictionary, together with tables such as a table describing the connection relationships between those elements and a grammar table.

Methods that use character type information have also been considered, for example treating a change in character type as a phrase break.

(c) Problems to Be Solved by the Invention Conventional phrase segmentation devices that use such dictionaries and tables have the following problems. (1) A large amount of storage is required to hold the dictionaries and tables. (2) Because dictionary or table searches are performed, phrase segmentation takes a long time. (3) The control program for phrase segmentation becomes complex. (4) When phrases are cut using character type information, they are cut mechanically by character type, so variation in okurigana (kana suffixes written after kanji) can cause phrases to be cut in the wrong place. (5) Words written in kana can cause segmentation errors.

The present invention has been made in view of the above circumstances and provides a phrase segmentation device that can solve these problems.

(d) Means for Solving the Problems FIG. 1 is a block diagram showing the basic configuration of the invention. As shown in the figure, the invention is a phrase segmentation device comprising: a text storage means 1 that stores text; an instruction means 2 that issues an instruction to read a desired sentence from the text storage means 1; a reading means 3 that reads the instructed sentence from the text storage means 1; a character type discrimination means 4 that determines, character by character, the character type of the read sentence (kanji, hiragana, katakana, and so on) and also identifies periods; a conversion means 5 that replaces each result from the character type discrimination means 4 with one of two kinds of code and outputs it; a discrimination result storage means 6 that stores the output of the conversion means 5 in order; a phrase segmentation means 7 that inserts a phrase break into the read sentence when it detects a predetermined code transition point in the code string stored in the discrimination result storage means 6; a specific character dictionary means 8 that stores a large number of specific hiragana character strings that form phrases, consisting of adnominals, pronouns, adverbs, conjunctions, and the like; a phrase segmentation correction means 9 that searches the sentence into which phrase breaks have been inserted for any of the specific hiragana character strings and, when one is present, inserts a phrase break so that the hiragana character string becomes one phrase, thereby correcting the phrase-break positions; a storage means 10 that stores the sentence output from the phrase segmentation correction means 9; and an output means 11 that visibly outputs the sentence stored in the storage means 10.

In this invention, the hiragana character strings stored in the specific character dictionary means 8 are preferably stored in the same order as the order in which hiragana characters are extracted during the search.

The phrase segmentation device of this invention can be applied to word processors, translation devices, proofreading devices, devices that use databases, and the like. Because devices that output text as speech also require phrase segmentation, the invention can be applied to them as well.

(e) Operation According to this invention, when a unit of segmentation processing, for example one sentence, is cut out of the text stored in the text storage means 1, the character type of each character in the sentence is determined by the character type discrimination means 4, and each result is replaced by the conversion means 5 with one of two kinds of code, for example L or H, and stored in order in the discrimination result storage means 6. The phrase segmentation means 7 then inserts a phrase break into the read sentence when it detects a predetermined code transition point, for example a change from L to H. Next, the phrase segmentation correction means 9 traces back from each phrase break and compares the input character string with the hiragana strings defined in the specific character dictionary means 8, recognizes a matching hiragana string as one phrase, corrects the phrase-break position, and outputs the result to the output means 11 so that the corrected result can be checked.

(f) Embodiment The invention is described in detail below based on the embodiment shown in the drawings. The invention is not limited to this embodiment.

FIG. 2 is a block diagram showing an embodiment in which the invention is applied to a word processor. In the figure, 20 is the word processor main body. 21 is a text storage device, which can be an external storage device such as a floppy disk or hard disk, an internal storage device such as RAM, or another storage device such as a database, and which stores Japanese documents written in mixed kana and kanji. 22 is a keyboard serving as the instruction means; it has character input keys and various instruction keys for editing and proofreading text, and is used both to enter text and to enter instructions for reading a desired sentence from the text storage device 21. 23 is a reading device that cooperates with the CPU 24 and reads the sentence specified at the keyboard 22 from the text storage device 21. 25 is a character type discrimination device that cooperates with the CPU 24; for a sentence read from the text storage device 21 it determines, character by character, the character type (kanji, hiragana, katakana, and so on) and also identifies periods. 26 is a result storage device that serves as both the discrimination result storage means and the storage means; it consists of RAM, stores the character type discrimination results in order, and also stores the sentence into which breaks have been inserted by the phrase segmentation device described below.

27 is a character type digitization device that serves as the conversion means and cooperates with the CPU 24. It assigns a first code, specifically "H", to kanji and katakana identified by the character type discrimination device 25, and a second code, specifically "L", to hiragana and periods, thereby replacing each character type discrimination result with one of the two codes "H" or "L". The resulting string of "H" and "L" codes is stored in the result storage device 26 via the CPU 24.

The phrase segmentation device 28 cooperates with the CPU 24 and inserts a phrase break into the read sentence when it detects a predetermined code transition point in the "H"/"L" code string stored in the result storage device 26.

29 is a specific character dictionary table held in ROM. It stores a large number of specific hiragana character strings that form phrases, consisting of adnominals, pronouns, adverbs, conjunctions, and the like. The hiragana character strings in the specific character dictionary table 29 are stored in the same order as the order in which hiragana characters are extracted during the search.

30 is a phrase segmentation correction device that cooperates with the CPU 24. It searches the sentence into which phrase breaks have been inserted for any of the specific hiragana character strings and, when one is present, inserts a phrase break so that the hiragana character string becomes one phrase, thereby correcting the phrase-break positions.

31 is a display device that serves as the output means and is connected to the CPU 24 via an output control unit 32. It is a dot-matrix display such as a CRT or LCD, and it displays the sentence with the breaks inserted so that the phrase breaks can be checked.

With this configuration, the phrase segmentation process is described below following the flowchart in FIG. 12, using the example sentence 「特許庁に出す資料をこのワープロで作成し電子出願した。」 ("The materials to be submitted to the Patent Office were created on this word processor and filed electronically.").

FIG. 3 shows the state in which a unit of processing, for example one sentence, has been cut out of the text storage device 21, which holds text files, databases, and the like, and stored in the result storage device 26. When the desired sentence has been cut out of the text storage device 21 in this way according to the processing unit, for example one sentence, one paragraph, or one chapter (step 40), the character type code of each character in the sentence is determined (step 41).

Specifically, each character of the read sentence has a unique character code assigned to it, such as a JIS code, so the character type is determined by checking that character code against each condition in the code discrimination table shown in FIG. 4. In the conditions, cc is the character being classified; a1 and b1 are the first and last kanji codes; a2 and b2 are the first and last hiragana codes; a3 and b3 are the first and last katakana codes; and a4 is the period.
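The range test described for FIG. 4 can be sketched as follows. This is a minimal illustration, not the patent's table: the source compares JIS codes whose actual values are not given, so Unicode code points are assumed here, and the function name `char_type` and the constant values are hypothetical. Only the names a1/b1 through a4 and cc follow the text.

```python
# Hypothetical range table modeled on FIG. 4. Unicode code points are an
# assumption; the source uses JIS codes whose values are not stated.
A1, B1 = 0x4E00, 0x9FFF  # kanji (CJK Unified Ideographs block)
A2, B2 = 0x3041, 0x309F  # hiragana
A3, B3 = 0x30A0, 0x30FF  # katakana (includes the long-vowel mark "ー")
A4 = 0x3002              # period "。"


def char_type(c: str) -> str:
    """Classify one character by comparing its code cc against each range."""
    cc = ord(c)
    if cc in range(A1, B1 + 1):
        return "kanji"
    if cc in range(A2, B2 + 1):
        return "hiragana"
    if cc in range(A3, B3 + 1):
        return "katakana"
    if cc == A4:
        return "period"
    return "other"  # character types the source does not discuss
```

Because the test is a handful of integer comparisons per character, no word dictionary is consulted at this stage.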

The determined character type codes are stored in the result storage device 26 (step 42) and digitized by the character type digitization device 27 (step 43).

FIG. 5 shows the code digitization table that is consulted to digitize the determined output codes. A character determined to be kanji is converted to the code "H", one determined to be hiragana is converted to the code "L", katakana is likewise converted to the code "H", and a period is converted to the code "L".

The result of digitizing the sentence by checking it against the code digitization table is stored in the result storage device 26 as the code string shown in FIG. 6.

Next, the digitized result is checked against the segmentation discrimination table shown in FIG. 7 to determine the phrase breaks (step 44). Phrase breaks are determined as follows: (1) a keyword (phrase) break is placed at each transition point between "L" and "H" in the digitized output; (2) a keyword break is placed after each period.

Next, based on the breaks determined by checking against the segmentation discrimination table, a segmentation symbol, for example "/", is inserted at each break, and the sentence with the segmentation symbols inserted is stored in the result storage device 26. FIG. 8 shows the phrase segmentation result obtained by this processing.
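Steps 44 onward can be sketched as a single pass over the code string. The direction of the transition is an assumption: the sketch breaks on a change from "L" to "H", which is the example given in the Operation section and is consistent with the FIG. 8 result. The function name `insert_breaks` is hypothetical, and the code string is the one produced by the H/L mapping described above.

```python
def insert_breaks(sentence: str, codes: str, mark: str = "/") -> str:
    """Insert `mark` after each L-to-H transition and after each period."""
    out = []
    for i, ch in enumerate(sentence):
        out.append(ch)
        is_last = i == len(sentence) - 1
        l_to_h = (not is_last) and codes[i] == "L" and codes[i + 1] == "H"
        if ch == "。" or l_to_h:  # rule (2) or rule (1)
            out.append(mark)
    return "".join(out)


sentence = "特許庁に出す資料をこのワープロで作成し電子出願した。"
codes = "HHHLHLHHLLLHHHHLHHLHHHHLLL"  # H = kanji/katakana, L = hiragana/period
print(insert_breaks(sentence, codes))
# 特許庁に/出す/資料をこの/ワープロで/作成し/電子出願した。/
```

Note that 「資料をこの」 comes out as one phrase, because 「を」, 「こ」, and 「の」 all carry the code "L" and no transition occurs between them.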

If phrases are segmented using only the information on character type transitions, a shortcoming remains: as FIG. 8 shows, when a word is written in hiragana, the phrase containing that word cannot be cut out correctly. In the example sentence, the pronoun 「この」 is absorbed into the preceding phrase 「資料を」, so that 「資料をこの」 becomes a single phrase. Correcting such segmentation errors caused by hiragana notation is the characteristic feature of this invention.

To achieve this, the specific character dictionary table 29, which holds representative words written in hiragana, is used. The set of words written in hiragana can be compiled from everyday language use as common knowledge. In terms of parts of speech, for example, many such hiragana words are found among adnominals, pronouns, adverbs, and conjunctions.

FIG. 9 shows an example of a hiragana notation table used to explain the specific character dictionary table 29. The figure lists representative words for the purpose of explanation, sorted in the normal character order of the words. FIG. 10 shows the same hiragana notation table sorted so that it can be searched in reverse (from the last character of each word), and represents the contents of the specific character dictionary table 29.
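The relationship between the two tables can be sketched as follows. The actual entries of FIG. 9 and FIG. 10 are not recoverable from the text, so the word list below is hypothetical, chosen from the parts of speech the description names; sorting by code point is likewise an assumption standing in for whatever collation the figures use.

```python
# Hypothetical entries; the real contents of FIG. 9 are not given in the text.
WORDS = ["この", "その", "あの", "しかし", "また"]


def build_forward_table(words):
    """FIG. 9 style: words sorted in normal reading order."""
    return sorted(words)


def build_reverse_table(words):
    """FIG. 10 style: each word stored last character first, then sorted,
    so entries line up with a scan that runs backward through the text."""
    return sorted(w[::-1] for w in words)


print(build_forward_table(WORDS))  # ['あの', 'この', 'しかし', 'その', 'また']
print(build_reverse_table(WORDS))  # ['しかし', 'たま', 'のあ', 'のこ', 'のそ']
```

Storing the reversed form means a backward scan can compare characters in the order it meets them, without first buffering the whole hiragana run.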

Following step 44, the specific character dictionary table 29 is consulted, and the hiragana portions of the character string segmented as shown in FIG. 8 are searched toward the beginning of the string (step 45).

As noted above, the words in the specific character dictionary table 29 are arranged in the same order as the character string search, so matching against a word can be abandoned as soon as a single character fails to match, which shortens the time needed to find a match. Because a failed match is detected partway through, only a small buffer is needed. This processing is carried out over the hiragana character string up to the point where the character type changes from hiragana to another type.

In this embodiment, 「のこ」 (「この」 read backward) is matched successfully within the 「をこの」 portion (step 46). When a match succeeds, an additional phrase break is inserted so that a phrase begins at the start of the hiragana word, and the phrase segmentation position is corrected (step 47). As a result, the segmentation of the sentence in FIG. 8 is corrected as shown in FIG. 11. The corrected result is then stored in the result storage device 26 (step 48).
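Steps 45 to 47 can be sketched as below. Several points are assumptions: the dictionary entries are hypothetical, hiragana is detected by Unicode range, and a word is matched only when it ends exactly at a phrase break, as in the example. The source describes a character-by-character comparison that is abandoned at the first mismatch; `str.startswith` is used here as a stand-in for that comparison.

```python
# Hypothetical reverse-order dictionary (each word stored last character first).
REVERSED_DICT = sorted(w[::-1] for w in ["この", "その", "あの", "しかし", "また"])


def is_hiragana(c: str) -> bool:
    return ord(c) in range(0x3041, 0x30A0)  # Unicode range assumed


def correct_breaks(segmented: str, mark: str = "/") -> str:
    """Split off a dictionary word found at the end of a phrase."""
    out = []
    for phrase in segmented.split(mark):
        # Scan backward from the break until the character type changes.
        back = ""
        for c in reversed(phrase):
            if not is_hiragana(c):
                break
            back += c
        hit = next((r for r in REVERSED_DICT if back.startswith(r)), None)
        if hit and len(phrase) > len(hit):
            cut = len(phrase) - len(hit)
            out.extend([phrase[:cut], phrase[cut:]])  # new break before the word
        else:
            out.append(phrase)
    return mark.join(out)


fig8 = "特許庁に/出す/資料をこの/ワープロで/作成し/電子出願した。/"
print(correct_breaks(fig8))
# 特許庁に/出す/資料を/この/ワープロで/作成し/電子出願した。/
```

For 「資料をこの」 the backward scan yields 「のこを」, which begins with the entry 「のこ」, so a break is added before 「この」. For 「作成し」 the scan yields only 「し」, which is shorter than the entry 「しかし」, so no match is reported and the phrase is left unchanged.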

Next, if the termination condition is "no", that is, if there is another sentence to be segmented, the next sentence is read from the text storage device 21, and control is performed so that storage locations do not overlap (step 49).
The process of reading data from the memory data and the storage location are controlled so that they do not overlap (step 49).

If the result of step 49 is "yes", that is, if nothing remains to be segmented, the necessary information is stored in the result storage device 26 and the process ends (step 50).

(g) Effects of the Invention According to this invention: (1) Because no dictionary is used when cutting out phrases, the configuration of the device can be simplified. The invention can therefore be applied not only to word processors and office computers but also to smaller devices, specifically electronic organizers and programmable calculators. (2) Phrase segmentation and keyword search can be performed at high speed. (3) The control program for phrase segmentation can be simplified. (4) Japanese is written without spaces, so the positions of phrases are not known in advance, and deciding where a phrase begins and ends has required a great deal of processing and time. According to this invention, language processing can be performed after the phrase positions have been determined, so processing time can be greatly reduced. (5) Because the character type discrimination results are replaced with binary values and processed by digital circuitry, processing is fast, the circuitry is simplified, and the phrase segmentation device can be realized at low cost. (6) Errors in phrase segmentation position caused by words written in hiragana can be reduced. (7) The hiragana character strings in the specific character dictionary can be compared with the input character string quickly, and the buffer memory needed for the dictionary during matching can be kept small.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the basic configuration of the invention. FIG. 2 is a block diagram showing the configuration of a word processor that is an embodiment of the invention. FIG. 3 is an explanatory diagram showing an example of a sentence stored in the text storage device. FIG. 4 is an explanatory diagram showing the contents of the character type discrimination table. FIG. 5 is an explanatory diagram showing the contents of the code digitization table. FIG. 6 is an explanatory diagram showing the character type discrimination result. FIG. 7 is an explanatory diagram showing the contents of the phrase segmentation discrimination table. FIG. 8 is an explanatory diagram showing the segmentation result. FIGS. 9 and 10 are explanatory diagrams showing the contents of the specific character dictionary table. FIG. 11 is an explanatory diagram showing the result of correcting the segmentation positions. FIG. 12 is a flowchart showing the processing operation of the embodiment.

1: text storage means, 2: instruction means, 3: reading means, 4: character type discrimination means, 5: conversion means, 6: discrimination result storage means, 7: phrase segmentation means, 8: specific character dictionary means, 9: phrase segmentation correction means, 10: storage means, 11: output means.

Agent: Patent Attorney Shintaro Nogawa

FIG. 8 (segmentation result): 特許庁に／出す／資料をこの／ワープロで／作成し／電子出願した。／
FIG. 11 (corrected result): 特許庁に／出す／資料を／この／ワープロで／作成し／電子出願した。／

Claims

[Scope of Claims] 1. A phrase segmentation device comprising: a text storage means that stores text; an instruction means that issues an instruction to read a desired sentence from the text storage means; a reading means that reads the instructed sentence from the text storage means; a character type discrimination means that determines, character by character, the character type of the read sentence, such as kanji, hiragana, or katakana, and also identifies periods; a conversion means that replaces each result from the character type discrimination means with one of two kinds of code and outputs it; a discrimination result storage means that stores the output of the conversion means in order; a phrase segmentation means that inserts a phrase break into the read sentence when it detects a predetermined code transition point in the code string stored in the discrimination result storage means; a specific character dictionary means that stores a large number of specific hiragana character strings that form phrases, consisting of adnominals, pronouns, adverbs, conjunctions, and the like; a phrase segmentation correction means that searches the sentence into which phrase breaks have been inserted for any of the specific hiragana character strings and, when one is present, inserts a phrase break so that the hiragana character string becomes one phrase, thereby correcting the phrase-break positions; a storage means that stores the sentence output from the phrase segmentation correction means; and an output means that visibly outputs the sentence stored in the storage means.
2. The phrase segmentation device according to claim 1, wherein the hiragana character strings stored in the specific character dictionary means are stored in the same order as the order in which hiragana characters are extracted during the search.