JPH0345421B2


Info

Publication number: JPH0345421B2
Application number: JP60279121A
Authority: JP
Inventors: Akihiro Hirai; Hideaki Shinohara; Yoichi Hitano
Original assignee: Agency of Industrial Science and Technology
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 1985-12-13
Filing date: 1985-12-13
Publication date: 1991-07-11
Also published as: JPS62139076A

Description

[Detailed description of the invention]

[Field of application of the invention]

The present invention relates to a method for analyzing language, and in particular to a language analysis method that efficiently analyzes text written in natural language.

[Background of the invention]

In conventional language analysis methods, for example the method discussed in Chapter 3 of "Machine Processing of Language" (1984), edited by Makoto Nagao, processing is performed one sentence at a time. Even when multiple sentences are analyzed in sequence, the analysis results for the preceding sentences are not used, and even when a phrase (bunsetsu) identical to one that has already appeared occurs again, all of the analysis processing, such as division into words, has to be performed from the beginning. Consequently, analyzing multiple sentences in sequence requires processing effort proportional to the number of sentences, and processing efficiency is poor.

[Purpose of the invention]

An object of the present invention is to solve this problem of the conventional methods and to provide a language processing method that improves analysis efficiency when multiple sentences are analyzed in sequence.

[Summary of the invention]

In a language processing device that analyzes or translates natural language, the language processing method of the present invention achieves the above object by storing analysis results in a storage medium and, when analyzing the next sentence, partially omitting analysis processing by using the analysis results obtained for the sentences up to the previous one.

[Embodiments of the invention]

An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of an automatic translation system from a first language to a second language, which is the language processing system of this embodiment. For convenience, the first language is taken to be Japanese and the second language English. As shown in FIG. 1, the language processing system according to the present invention comprises a processing device 1, a storage medium [1] 2 in which the processing program and the analysis results are stored, a storage medium [2] 3 in which dictionary data is stored, a storage medium [3] 4 in which the input text to be processed is stored, a display device 5, and a keyboard 6. The system takes the sentences out of storage medium [3] 4 in order, translates them using the dictionary data in storage medium [2] 3, and outputs the results to display device 5.

FIGS. 2, 3, and 4 show the flow of the analysis method according to the present invention.

FIG. 5 shows an example of input text, and FIG. 6 shows an example of stored analysis results. FIG. 7 shows a worked example of word division for an input sentence (a shows the division into substrings, b shows the result of word division), and FIG. 8 shows another storage format for the analysis results. As shown in FIGS. 6 and 8, in this embodiment the analysis results are stored with a phrase (bunsetsu) or an auxiliary-verb string as one unit. The information stored is dictionary data such as the notation of each unit as it appears in the sentence (hereinafter called the headword string), the notation in the sentence of each word making up the unit, its part of speech, and its conjugation information (conjugated form and conjugation type). The two-letter alphabetic codes in FIGS. 6, 7, and 8 are part-of-speech codes.
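The stored unit described above can be pictured as a small record type. The following is only a sketch under stated assumptions: the class names `Word` and `Unit`, the field names, and the part-of-speech codes `NN` and `PP` are hypothetical and are not taken from FIG. 6.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    surface: str                     # notation of the word as it appears in the sentence
    pos: str                         # part-of-speech code (codes used here are made up)
    conj_form: Optional[str] = None  # conjugated form, only for words that conjugate
    conj_type: Optional[str] = None  # conjugation type, only for words that conjugate


@dataclass(frozen=True)
class Unit:
    """One stored unit: a phrase (bunsetsu) or an auxiliary-verb string."""
    words: tuple  # tuple of Word objects, in sentence order

    @property
    def headword(self) -> str:
        # The headword string is the unit's notation in the sentence,
        # i.e. the concatenation of its words' surface forms.
        return "".join(w.surface for w in self.words)


# Hypothetical example unit: a noun followed by a particle.
unit = Unit((Word("太郎", "NN"), Word("は", "PP")))

# Analysis results keyed by headword string, so a later sentence
# can look a substring up directly.
store = {unit.headword: unit}
```

Keying the store by headword string is a design choice of this sketch; it makes the later "does this substring match a stored headword" test a plain dictionary lookup.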

To translate Japanese into English automatically, the first step is to divide a sentence of mixed kanji and kana, written without spaces between words, into words. This word division process is described below as an embodiment of the present invention, with reference to FIGS. 2, 3, and 4.

Suppose that the text shown in FIG. 5 is stored in storage medium [3] 4 and that analysis of the first sentence has just finished. At this point, the analysis result for the first sentence is stored in storage medium [1] 2 in the format shown in FIG. 6. Analysis of the second sentence is then executed (101).

As the first step of the analysis, the sentence to be processed is separated into substrings for which stored analysis results can be used and substrings for which they cannot (102). Specifically, a substring that matches the headword string of a stored analysis result is regarded as a substring for which analysis results can be used. As a result, the second input sentence is divided as shown in FIG. 7a, where the hatched parts are the substrings for which analysis results can be used. Next, the first of the unprocessed substrings is taken up (call it substring a) (103). If substring a is one for which analysis results can be used (104), the process of FIG. 4 (105) is executed; otherwise, the process of FIG. 3 (106) is executed. This is repeated until no unprocessed substrings remain (107), after which the analysis result (the analysis result for FIG. 7a is shown in FIG. 7b) is stored in storage medium [1] 2 (108). Analysis results are stored with a phrase or an auxiliary-verb string as one unit, but a unit whose headword string is identical to one already stored is not stored again. The above processing is repeated until no unprocessed sentences remain (109).
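The separation step (102) can be sketched as a left-to-right scan that marks each stretch of the sentence as reusable or not. This is an illustration only: the function name `split_by_headwords`, the choice to prefer the longest matching headword, and the sample sentence and headwords are assumptions, since the text says only that substrings matching a stored headword string are treated as reusable.

```python
def split_by_headwords(sentence, headwords):
    """Split a sentence into (substring, reusable) pairs, scanning left to right.

    A substring is marked reusable when it equals a stored headword string.
    Trying longer headwords first is an assumption of this sketch.
    """
    by_length = sorted((h for h in headwords if h), key=len, reverse=True)
    parts = []
    pending = ""  # characters not covered by any stored headword so far
    i = 0
    while i < len(sentence):
        match = next((h for h in by_length if sentence.startswith(h, i)), None)
        if match is None:
            pending += sentence[i]
            i += 1
            continue
        if pending:
            parts.append((pending, False))
            pending = ""
        parts.append((match, True))
        i += len(match)
    if pending:
        parts.append((pending, False))
    return parts


# Hypothetical stored headwords and input sentence (not the text of FIG. 5).
stored = {"太郎は", "食べた。"}
parts = split_by_headwords("そして、太郎は栗を食べた。", stored)
for substring, reusable in parts:
    print(substring, "reuse" if reusable else "analyze")
```

Each pair would then be routed to the process of FIG. 4 when `reusable` is true and to the process of FIG. 3 otherwise, mirroring the branch at step 104.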

Next, the process of FIG. 3 (106) is described. This process applies to substrings for which stored analysis results cannot be used. First, a word is extracted from the substring on the longest-match principle (call this word W) (201). For the first substring in FIG. 7a, for example, "そして" ("and then") is extracted. If the extracted word is one that conjugates, it is extracted together with its inflectional ending. Next, it is checked whether the extracted word can connect to the word before it (202). If it can, the word following the extracted word is extracted on the longest-match principle (203), and the possibility of backward connection is checked on the basis of its part of speech (204). If connection is possible, the word is accepted (205). If either forward or backward connection is impossible, the extraction and acceptance of word W are rejected, a different word is extracted from the same substring (207), and processing restarts from the forward-connection check. If step 207 is not possible, the extraction and acceptance of the word immediately before word W are rejected and the process is redone (208). However, if word W is the first word of the sentence, processing cannot be redone, so word division is deemed to have failed. This is repeated until no unanalyzed character string remains (206), whereupon the process ends. For the first substring in FIG. 7a, this process determines that "そして" is a conjunction and "、" is a comma.
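The longest-match extraction with connection checks and backtracking can be approximated by a short recursive routine. This is a simplified sketch, not the flow of FIG. 3 step for step: the backward-connection check is folded into the recursive call on the remainder, each lexicon entry has a single part of speech, and the lexicon, the connection table, and the codes `BOS`, `CJ`, `PN`, and `PU` are all hypothetical.

```python
def segment(text, lexicon, can_connect, prev_pos="BOS"):
    """Return a list of (word, pos) pairs covering text, or None on failure."""
    if not text:
        return []
    # Candidate words at the head of the text, longest first (longest match).
    candidates = sorted(
        (w for w in lexicon if text.startswith(w)), key=len, reverse=True
    )
    for word in candidates:
        pos = lexicon[word]
        if not can_connect(prev_pos, pos):
            continue  # forward connection impossible: try another word
        rest = segment(text[len(word):], lexicon, can_connect, pos)
        if rest is not None:
            return [(word, pos)] + rest
        # No valid continuation: reject this word and try a shorter one.
    return None  # nothing fits here; the caller backtracks or division fails


# Hypothetical dictionary and connection table.
LEXICON = {"そして": "CJ", "そ": "PN", "、": "PU"}
ALLOWED = {("BOS", "CJ"), ("BOS", "PN"), ("CJ", "PU")}


def can_connect(left_pos, right_pos):
    return (left_pos, right_pos) in ALLOWED


result = segment("そして、", LEXICON, can_connect)
print(result)
```

Returning `None` from the outermost call plays the role of the failure case in the text, where the word at the head of the sentence cannot be replaced by any alternative.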

The process of FIG. 4 (105) is described next. This process applies to substrings for which stored analysis results can be used. First, the possibility of forward connection is checked on the basis of the part of speech of the first word of the substring, as obtained from the stored analysis result (301). For the second substring in FIG. 7a, step 301 corresponds to checking whether "、" and "太郎" (Taro) can connect. If connection is possible, a word is extracted from the following substring on the longest-match principle, its part-of-speech information is obtained, and the possibility of backward connection with the last word of the substring is checked (303). For the second substring in FIG. 7a, this means checking whether "は" (the particle wa) and "栗" (chestnut) can connect. If connection is possible, the process ends. If forward connection is impossible, the analysis result corresponding to the substring is rejected and the process is executed (304). If backward connection is impossible, the analysis result corresponding to the last word of the substring is rejected and the process is executed (305).
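The two checks applied to a reused unit can be sketched as a function that classifies the outcome. This is an illustration under assumptions: the function name, the outcome labels, and the part-of-speech codes are hypothetical, and the part of speech of the following word is passed in directly rather than obtained by longest-match extraction from the next substring.

```python
def check_cached_unit(unit_words, prev_pos, next_pos, can_connect):
    """Classify a reused unit by its connections to its neighbors.

    unit_words: list of (surface, pos) pairs from the stored analysis result.
    Returns "ok", "reject_unit" (forward connection failed, step 304) or
    "reject_last_word" (backward connection failed, step 305).
    """
    first_pos = unit_words[0][1]
    last_pos = unit_words[-1][1]
    if not can_connect(prev_pos, first_pos):
        return "reject_unit"
    if not can_connect(last_pos, next_pos):
        return "reject_last_word"
    return "ok"


# Hypothetical connection table and stored unit.
ALLOWED_PAIRS = {("PU", "NN"), ("PP", "NN")}


def connects(left_pos, right_pos):
    return (left_pos, right_pos) in ALLOWED_PAIRS


taro_wa = [("太郎", "NN"), ("は", "PP")]
outcome = check_cached_unit(taro_wa, "PU", "NN", connects)
print(outcome)
```

A caller would act on the two rejection outcomes by discarding the stored result for the whole unit or for its last word, respectively; what is run afterwards is left to the caller in this sketch.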

If not all analysis results can be stored because of the limited capacity of the storage medium, the results are prioritized by the time at which they were obtained so that newer analysis results are always retained. This gives better processing efficiency than storing results without prioritization.
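One way to keep the newest results under a capacity limit is a bounded store that evicts the oldest entry first. This is a minimal sketch, assuming capacity is counted in units and that a headword already present is left untouched (consistent with the rule that identical headword strings are not stored again); the class name `RecentResultStore` is hypothetical.

```python
from collections import OrderedDict


class RecentResultStore:
    """Bounded store that drops the oldest analysis result when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # insertion order = order results were obtained

    def put(self, headword, result):
        if headword in self._items:
            return  # same headword string: do not store again
        self._items[headword] = result
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the oldest entry

    def get(self, headword):
        return self._items.get(headword)

    def __contains__(self, headword):
        return headword in self._items


results = RecentResultStore(capacity=2)
results.put("太郎は", "unit-1")
results.put("栗を", "unit-2")
results.put("食べた。", "unit-3")  # store is full, so the oldest entry is dropped
```

Priority here follows only the time a result was first obtained; refreshing an entry each time it is reused would be a different policy and is not what the text describes.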

If the analysis results are stored in the format shown in FIG. 8, that is, if an analysis result is expressed by information indicating the storage addresses of the items that make up the result, there is no need to reserve storage more than once for the same element. The storage efficiency of the analysis results improves, and so does overall processing efficiency.
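The address-based format can be illustrated with a table that stores each word entry once and lets units refer to entries by index. This is only a sketch of the idea: the class name `WordTable`, the use of list indices as "addresses", and the sample entries are assumptions and do not reproduce the layout of FIG. 8.

```python
class WordTable:
    """Stores each distinct word entry once and hands out its address."""

    def __init__(self):
        self._entries = []  # address -> entry
        self._index = {}    # entry -> address

    def intern(self, entry):
        # Reuse the existing address when the same entry was stored before.
        if entry not in self._index:
            self._index[entry] = len(self._entries)
            self._entries.append(entry)
        return self._index[entry]

    def get(self, address):
        return self._entries[address]

    def __len__(self):
        return len(self._entries)


table = WordTable()
# Each unit is a list of addresses rather than a copy of the word data.
units = {
    "太郎は": [table.intern(("太郎", "NN")), table.intern(("は", "PP"))],
    "花子は": [table.intern(("花子", "NN")), table.intern(("は", "PP"))],
}
```

Both units point at the same entry for the particle, so its dictionary data occupies storage only once however many units contain it.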

[Effect of the invention]

An embodiment of the present invention has been described above. According to the present invention, analysis of character strings identical to ones already analyzed can be omitted, so the efficiency of analyzing multiple sentences can be improved. The effect is particularly large for text with many repeated expressions and for text with habitual sentence endings.

[Brief explanation of drawings]

FIG. 1 is a block diagram of the entire language processing system according to the present invention; FIGS. 2, 3, and 4 are diagrams showing the flow of analysis processing according to the present invention; FIG. 5 is a diagram showing an example of input text; FIG. 6 is a diagram showing an example of stored analysis results; FIG. 7 is a diagram showing word division being carried out; and FIG. 8 is a diagram showing another storage format for the analysis results.

1: central processing unit, 2: storage medium [1], 3: storage medium [2], 4: storage medium [3], 5: display device, 6: keyboard.

Claims

[Claims]

1. A language analysis method for analyzing sentences of mixed kanji and kana, comprising: means for inputting a sentence of mixed kanji and kana; means for dividing the sentence of mixed kanji and kana into its constituent words; means for storing, with a phrase or an auxiliary-verb string formed by connecting a word divided by the dividing means with at least one of the connectable words immediately before and after it as a headword, information regarding the word and the connectable word; and means for separating, from a successively input sentence of mixed kanji and kana, a partial character string that matches a headword stored by the storing means; the method being characterized in that the remaining character string other than the partial character string is divided by the dividing means into the words constituting it, and the predetermined information is stored by the storing means on the basis of the division result.