JPH01211176A - Morpheme analyzing system - Google Patents

Morpheme analyzing system

Info

Publication number
JPH01211176A
Authority
JP
Japan
Prior art keywords
word
dictionary
character
jimo
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP63036876A
Other languages
Japanese (ja)
Other versions
JPH0795321B2 (en)
Inventor
Shinsuke Sakai
坂井 信輔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to JP63036876A
Publication of JPH01211176A
Publication of JPH0795321B2
Anticipated expiration
Expired - Lifetime

Landscapes

  • Machine Translation (AREA)

Abstract

PURPOSE: To improve the performance of morphological analysis by determining whether a jimo (字母, a word-forming character) found in the input can combine with an adjacent lexical item to form a compound word. CONSTITUTION: A word dictionary 103 stores information on independent words, dependent words, prefixes, and suffixes. During analysis, an analysis control unit 101 searches the word dictionary 103 at the current character position 201; since no matching word exists, it searches a jimo dictionary 104 and obtains the kanji jimo 「訪」 (hō, "to visit"). No matching word exists in the dictionary 103 at the next character position either, so the control unit 101 searches the dictionary 104 and obtains the katakana jimo 「ソ」 (so). It then decides whether 「訪」 and 「ソ」 can form a compound word; since the dictionary information 301 for 「訪」 and the dictionary information 302 for 「ソ」 do not contradict each other, the two are judged to be combinable.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a morphological analysis method, and in particular to a method for the morphological analysis of Japanese, which is an essential component of systems such as Japanese text-to-speech synthesis systems and Japanese-English machine translation systems.

[Conventional Technology]

As described in 「情報処理」 (Joho Shori, "Information Processing"), Vol. 27, No. 8, p. 951, known techniques for Japanese morphological analysis include the longest match method, the two-bunsetsu (phrase) longest match method, the minimum bunsetsu count method, and the minimum cost method on an extended bunsetsu model.

[Problem to Be Solved by the Invention]

In conventional morphological analysis methods, the only lexical categories used besides those for words, such as independent words and dependent words, were word-formation elements such as prefixes and suffixes. Providing the lexical categories of prefix and suffix made it possible to handle words coined by attaching a prefix or suffix to another word.

In Japanese text, however, coinages are not limited to that type. There are also coinages formed by combining "word-formation elements" that are not regarded as words on their own, such as 「訪中」 (visit to China), 「訪ソ」 (visit to the Soviet Union), and 「訪韓」 (visit to Korea). In conventional morphological analysis methods, such a coinage had to be treated as an unknown word unless it was registered in the dictionary as a single word.

An object of the present invention is to provide a high-accuracy morphological analysis method that remedies this drawback.

[Means for Solving the Problem]

The morphological analysis method of the present invention has a lexical category called "jimo" (字母): an element that cannot necessarily combine with words to coin new words, but that is able to coin words by combining with other affixes or jimo. Together with each jimo, the method holds the attribute information that a lexical item must have in order to combine with that jimo and form a compound word. When the notation of a jimo is present in the input, the method uses this attribute information to determine whether the jimo can combine with an adjacent lexical item in the input to form a compound word.

[Operation]

In Japanese text, besides items already regarded as words, coinages formed from jimo, that is, word-formation components with strong word-forming ability, appear frequently. These correspond to what the 「新明解国語辞典」 (Shin Meikai Kokugo Jiten; Sanseido, first edition, 1972) calls "word-formation components" (造語成分). Conventional Japanese morphological analysis handles the lexical categories called "words", such as independent words and dependent words, and the lexical categories called "affixes", such as prefixes and suffixes. The present invention performs morphological analysis using, in addition to these, the lexical category "jimo", which cannot necessarily combine with words to coin new words but is able to coin words by combining with other affixes or jimo.

As a result, when the input text contains a substring that is not registered as a word but consists of characters registered as jimo, a word candidate can be formed for that span instead of treating it as an unknown word.

[Embodiment]

Next, the present invention will be described in detail according to an embodiment, with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing an embodiment that implements the principle of the present invention described above.

In FIG. 1, the connection table 102 is a table that assigns a value of 1 or 0 to each ordered pair (r_i, r_j) of part-of-speech subcategories. A value of 1 indicates that r_i and r_j can be grammatically adjacent; otherwise they cannot be adjacent. The word dictionary 103 stores information on independent words, dependent words, prefixes, and suffixes.
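The connection table described above can be pictured as a lookup keyed by ordered category pairs. The following is only a minimal sketch: the subcategory names and the `can_adjoin` helper are hypothetical, since the patent does not list the actual categories or the table contents.

```python
# Hypothetical part-of-speech subcategories; the patent does not enumerate them.
# Each ordered pair (left, right) maps to 1 (adjacency allowed) or 0 (not allowed).
CONNECTION_TABLE = {
    ("noun", "case_particle"): 1,
    ("case_particle", "noun"): 1,
    ("prefix", "noun"): 1,
    ("noun", "suffix"): 1,
    ("suffix", "prefix"): 0,
}


def can_adjoin(left_category: str, right_category: str) -> bool:
    """Return True only when the ordered pair is marked 1 in the table.

    Pairs that are missing from the table are treated as 0 (an assumption).
    """
    return CONNECTION_TABLE.get((left_category, right_category), 0) == 1
```

Because the pairs are ordered, `can_adjoin("prefix", "noun")` and `can_adjoin("noun", "prefix")` are looked up separately and may differ.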

The jimo dictionary 104 stores, for each jimo, its notation, its category, and the categories of the partners with which it can combine to coin a word.
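One way to picture a jimo dictionary entry is as a record holding the three fields named above. This is a hedged sketch, not the patent's data layout: the `JimoEntry` class, the category labels, and the rule that "no contradiction" means each jimo's category appears among the other's partner categories are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JimoEntry:
    surface: str                   # notation of the jimo
    category: str                  # category of the jimo itself (hypothetical label)
    partner_categories: frozenset  # categories it may combine with


# Hypothetical dictionary contents; FIG. 3 of the patent is not reproduced here.
JIMO_DICTIONARY = {
    "訪": JimoEntry("訪", "visit_action", frozenset({"country_abbreviation"})),
    "ソ": JimoEntry("ソ", "country_abbreviation", frozenset({"visit_action"})),
    "中": JimoEntry("中", "country_abbreviation", frozenset({"visit_action"})),
}


def can_form_compound(left: JimoEntry, right: JimoEntry) -> bool:
    """Assumed reading of 'the entries do not contradict': a mutual category match."""
    return (right.category in left.partner_categories
            and left.category in right.partner_categories)
```

Under these assumed entries, 「訪」 followed by 「ソ」 passes the check, while 「ソ」 followed by 「中」 does not, because neither lists the other's category as a partner.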

The analysis control unit 101 searches the word dictionary 103 for the input text. When there are several candidate words, it first selects the candidate with the longest notation and uses the connection table 102 to test whether that candidate can be adjacent to the word on its left; if it can, the candidate is hypothesized to be the word for that span. When no candidate word exists in the word dictionary 103, the unit searches the jimo dictionary 104 and likewise uses the connection table 102 to test adjacency with the word or jimo on the left; if adjacency is possible, the candidate jimo is hypothesized to be able to occupy that span. If the candidate at the current position is a jimo and the item on its left is also a jimo, the unit judges whether the two can form a coinage, using the jimo's category and the categories of combinable partners obtained from the jimo dictionary 104.

If no candidate is found when the dictionaries are consulted at some position, or the adjacency test using the connection table 102 fails, or the test of whether two jimo can combine into a coinage fails, the unit returns to the position of the most recently hypothesized word, selects the next candidate at that position, and proceeds from there.
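The lookup, adjacency test, coinage test, and backtracking described above can be sketched as a depth-first search. Everything in the block is an illustrative assumption rather than the patent's implementation: the dictionary contents, the category labels, the `BOS` start marker, and the use of recursion to realize "return to the previous position and take the next candidate".

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str, str]  # (surface, kind, category); kind is "word" or "jimo"

# Hypothetical dictionaries and connection pairs, for illustration only.
WORD_DICT = {"外相": "noun", "外": "prefix", "の": "particle",
             "今日": "noun", "今": "noun", "日本": "noun"}
JIMO_DICT = {"訪": ("visit", {"country"}), "ソ": ("country", {"visit"})}
CONNECTION = {("BOS", "noun"), ("BOS", "prefix"), ("noun", "particle"),
              ("noun", "noun"), ("particle", "noun"), ("particle", "visit"),
              ("visit", "country")}


def candidates(text: str, pos: int) -> List[Token]:
    """Word candidates, longest first; jimo candidates only if no word matches."""
    words = [(s, "word", c) for s, c in WORD_DICT.items() if text.startswith(s, pos)]
    if words:
        return sorted(words, key=lambda t: len(t[0]), reverse=True)
    return [(s, "jimo", c) for s, (c, _) in JIMO_DICT.items() if text.startswith(s, pos)]


def coinable(left: Token, right: Token) -> bool:
    """Assumed coinage test: each jimo's category is among the other's partners."""
    l_cat, l_partners = JIMO_DICT[left[0]]
    r_cat, r_partners = JIMO_DICT[right[0]]
    return r_cat in l_partners and l_cat in r_partners


def analyze(text: str, pos: int = 0,
            left: Token = ("", "word", "BOS")) -> Optional[List[Token]]:
    """Return a segmentation of text[pos:], or None if every candidate fails."""
    if pos == len(text):
        return []
    for cand in candidates(text, pos):
        if (left[2], cand[2]) not in CONNECTION:
            continue  # adjacency test failed: try the next candidate
        if cand[1] == "jimo" and left[1] == "jimo" and not coinable(left, cand):
            continue  # the two jimo cannot form a coinage
        rest = analyze(text, pos + len(cand[0]), cand)
        if rest is not None:
            return [cand] + rest
    return None  # dead end: the caller backtracks to its next candidate
```

With these assumed entries, `analyze("外相の訪ソ")` segments the text into 外相, の, 訪, ソ, with the last two accepted as jimo. `analyze("今日本")` first tries the longest candidate 今日, reaches a dead end at 本, and backtracks to 今 followed by 日本.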

When the dictionary lookup, the connection test, or the jimo coinage test succeeds, the unit searches the word dictionary 103 again at the character position following the notation of the candidate word or jimo, searches the jimo dictionary 104 if there is no candidate, and repeats the above processing in this manner. On reaching a position where the character type changes from kanji to another type of character, the analysis control unit 101 fixes the words hypothesized so far as the analysis result.
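The character-type boundary mentioned above can be detected from Unicode code points. This is a rough sketch under stated assumptions: the `char_type` and `commit_points` names are hypothetical, only the basic CJK Unified Ideographs, Hiragana, and Katakana blocks are checked, and the description does not say how such a boundary is handled when a jimo compound such as 「訪ソ」 spans it.

```python
def char_type(ch: str) -> str:
    """Classify one character by its Unicode block (basic blocks only)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"      # CJK Unified Ideographs
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    return "other"


def commit_points(text: str) -> list:
    """Indices i where text[i-1] is kanji and text[i] is not.

    These are the positions at which, on the reading assumed here, the
    hypotheses made so far would be fixed as the analysis result.
    """
    return [i for i in range(1, len(text))
            if char_type(text[i - 1]) == "kanji" and char_type(text[i]) != "kanji"]
```

For the fragment 「外相の訪ソ」, the kanji run 外相 ends before の and the kanji 訪 ends before ソ, so the function reports positions 2 and 4.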

FIG. 2 is an explanatory diagram showing an example of input Japanese text. Assume that dictionary lookup has been completed up to 「外相の」 ("the foreign minister's"). The analysis control unit 101 first searches the word dictionary 103 at character position 201; no matching word exists, so it next searches the jimo dictionary 104 and obtains the jimo 「訪」. At the next character position the word dictionary 103 again contains no matching word, so the analysis control unit 101 searches the jimo dictionary 104 and obtains the jimo 「ソ」. It then judges whether the jimo 「訪」 and the jimo 「ソ」 can combine to form a coinage; since the dictionary information 301 for 「訪」 and the dictionary information 302 for 「ソ」 do not contradict each other, the two are judged to be combinable.

[Effect of the Invention]

As described above, the morphological analysis method of the present invention makes it possible to analyze Japanese sentences containing compound words that are not registered in the word dictionary but are formed through the word-forming ability of jimo. Because coinage by jimo is frequent in Japanese, the resulting improvement in morphological analysis performance is very large.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is an explanatory diagram showing an example of input Japanese text, and FIG. 3 is an explanatory diagram showing an example of the dictionary contents for jimo.

101: analysis control unit, 102: connection table, 103: word dictionary, 104: jimo dictionary.

Claims (1)

[Claims]

A morphological analysis method characterized by: having a lexical category called "jimo" (字母), which cannot necessarily combine with words to coin new words but is able to coin words by combining with other affixes or jimo; holding, together with each jimo, the attribute information that a lexical item must have in order to combine with that jimo and form a compound word; and, when the notation of a jimo is present in the input, using the attribute information to determine whether the jimo can combine with an adjacent lexical item in the input to form a compound word.
JP63036876A 1988-02-19 1988-02-19 Morphological analyzer Expired - Lifetime JPH0795321B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63036876A JPH0795321B2 (en) 1988-02-19 1988-02-19 Morphological analyzer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP63036876A JPH0795321B2 (en) 1988-02-19 1988-02-19 Morphological analyzer

Publications (2)

Publication Number Publication Date
JPH01211176A (en) 1989-08-24
JPH0795321B2 JPH0795321B2 (en) 1995-10-11

Family

ID=12481983

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63036876A Expired - Lifetime JPH0795321B2 (en) 1988-02-19 1988-02-19 Morphological analyzer

Country Status (1)

Country Link
JP (1) JPH0795321B2 (en)

Also Published As

Publication number Publication date
JPH0795321B2 (en) 1995-10-11

Similar Documents

Publication Publication Date Title
US5890103A (en) Method and apparatus for improved tokenization of natural language text
KR940022316A (en) Keyword Extractor for Japanese Documents
JPH0351020B2 (en)
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
JP3594701B2 (en) Key sentence extraction device
JPH01211176A (en) Morpheme analyzing system
Daciuk Treatment of unknown words
JPH01211175A (en) Morpheme analyzing system
JP2821143B2 (en) Morphological decomposition device
JPH01266670A (en) Extracting processing system for characteristic vocabulary in japanese object sentence
JPH10171807A (en) Device and method for canceling semantic ambiguity
JPS62203276A (en) Form element analysis device
JPS6368972A (en) Unregistered word processing system
JPS6395570A (en) Language analysis system
JP3139624B2 (en) Morphological analyzer
JPH01232471A (en) Morpheme analyzer
KR920005023A (en) Morphological Analysis of Hangul Sentences
JPH0262665A (en) Decomposition system for morpheme
JPS63103378A (en) language analysis device
JPH05233686A (en) Japanese language processor
JPS61204771A (en) Form element analyzing device
JPH05197752A (en) Machine translation system
JPH02230370A (en) System and device for morpheme analysis
JPH01236361A (en) System for processing composition written in japanese
JPH03152667A (en) Japanese sentence analysis method