JPH0558552B2

Info

Publication number: JPH0558552B2
Application number: JP62205228A
Authority: JP
Inventors: Eiji Oohira; Akio Komatsu
Original assignee: Agency of Industrial Science and Technology
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 1987-08-20
Filing date: 1987-08-20
Publication date: 1993-08-26
Also published as: JPS6449098A

Abstract

PURPOSE: To restrict the possible positions of words by selecting, as the true hypothesis, a hypothesis in which the fundamental frequency shape between every pair of assumed division points corresponds to one of the standard shapes. CONSTITUTION: Since pauses do not occur at a level smaller than a word, pauses are detected first and the conversational sentence is divided at them. Next, the positions of the fundamental frequency minima within each divided section are detected and set as division candidate points. Division hypotheses are set up that assume each candidate point either is or is not a division point, and the hypotheses in which the fundamental frequency shapes between the assumed division points all correspond to one of the standard shapes are selected. When several hypotheses are selected, the one with the highest reliability is taken as the true hypothesis. The division and shape of the conversational sentence are thus determined from the division points of the selected hypothesis and the standard shapes obtained.

Description

[Detailed Description of the Invention] [Industrial Field of Application] The present invention relates to a method for understanding spoken conversational sentences (hereinafter simply "conversational sentences"), and in particular to an apparatus for estimating the fundamental frequency shape of spoken conversational sentences, which divides a conversational sentence in which several sentences are uttered in succession and determines the fundamental frequency shape needed to estimate the sentence structure of the conversational sentence.

[Prior Art] Methods for determining the fundamental frequency shape of conversational sentences have previously been discussed in "A model of the fundamental frequency pattern of Japanese accent and its generation mechanism" (Journal of the Acoustical Society of Japan 27-9, pp. 445-453, 1971) and "A method of linear approximation of pitch patterns" (Speech Research Group Material S47-11, 1974).

[Problems to Be Solved by the Invention] The prior art above was developed for analyzing fundamental frequency shapes in the field of speech synthesis, and it can estimate them with high accuracy if a human takes part in, for example, setting the parameters. In the field of recognition and understanding, however, the fundamental frequency shape must be estimated automatically, and with the prior art the accuracy deteriorates when estimation is automatic. A further drawback is the large amount of processing required.

An object of the present invention is to estimate the fundamental frequency shape of spoken conversational sentences automatically and with good accuracy.

[Means for Solving the Problems] The fundamental frequency rises at the start of an utterance and then gradually falls, so it traces a triangle or a "へ" shape. One unit of fundamental frequency shape can therefore be cut out by finding the minimum points of the fundamental frequency. The problem is that fluctuations in the utterance also produce minimum points inside a single unit of fundamental frequency.
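As a rough illustration of the cut-point idea above, the following Python sketch lists the strict local minima of a sampled fundamental frequency contour. The function name `local_minima` and both sample contours are assumptions made for illustration; the patent text gives no implementation. The second contour shows how a small fluctuation inside one unit yields an extra, spurious minimum.

```python
from typing import List


def local_minima(f0: List[float]) -> List[int]:
    """Return indices of strict interior local minima of an F0 contour (Hz)."""
    return [
        i
        for i in range(1, len(f0) - 1)
        if f0[i] < f0[i - 1] and f0[i] < f0[i + 1]
    ]


# Hypothetical data: two "へ"-shaped units; the only minimum is their boundary.
clean = [100.0, 140.0, 130.0, 120.0, 110.0, 105.0, 150.0, 140.0, 125.0, 110.0]
# The same contour with a small wobble inside the first unit.
wobbly = [100.0, 140.0, 128.0, 131.0, 120.0, 105.0, 150.0, 140.0, 125.0, 110.0]

print(local_minima(clean))   # candidate cut points for the clean contour
print(local_minima(wobbly))  # also contains the minimum caused by the wobble
```

Minima alone therefore cannot be trusted as boundaries, which is why they are only treated as candidates in what follows.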

Moreover, in actual conversational sentences the shapes cut out as single units are not limited to triangles and "へ" shapes. When an interjection is present, or when the first mora of a word is produced at a low pitch, a short, low-frequency segment is attached in front of the triangle or similar shape. Furthermore, when a phrase with a flat (heiban) accent follows a phrase with a rising-and-falling (kifuku) accent and the two undergo accent combination, a low, flat fundamental frequency segment is attached after the triangle.

The above object is therefore achieved by (1) defining standard fundamental frequency shapes that take interjections and accent combination into account, and (2) treating the detected minimum points of the fundamental frequency as division candidate points and setting up division hypotheses that assume each candidate point either is or is not a division point. Among these hypotheses, one in which the fundamental frequency shapes between the assumed division points all correspond to one of the standard shapes is selected as the true hypothesis.
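To make step (2) concrete, here is a minimal Python sketch that enumerates every division hypothesis for one section. The names `enumerate_hypotheses` and `Span`, and the use of frame indices, are assumptions for illustration only. Each candidate point is independently kept or dropped, so n candidates give 2^n hypotheses.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start frame, end frame) of one assumed unit


def enumerate_hypotheses(
    start: int, end: int, candidates: Sequence[int]
) -> List[List[Span]]:
    """List every way of treating each candidate as a division point or not."""
    hypotheses = []
    for k in range(len(candidates) + 1):
        # Choose which k candidates are assumed to be real division points.
        for chosen in combinations(sorted(candidates), k):
            bounds = [start, *chosen, end]
            hypotheses.append(list(zip(bounds[:-1], bounds[1:])))
    return hypotheses


# One candidate point gives two hypotheses: no cut, or a cut at frame 4.
print(enumerate_hypotheses(0, 10, [4]))
```

Because the number of hypotheses doubles with each candidate, this brute-force enumeration is only practical when a section contains few candidate points.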

[Operation] Pauses do not occur at a level smaller than a word. For this reason, pauses are detected first and the conversational sentence is divided at them. Second, the positions of the fundamental frequency minima in each divided section are detected and taken as division candidate points. Division hypotheses are then set up that assume each candidate point either is or is not a division point, and the hypotheses in which the fundamental frequency shapes between the assumed division points all correspond to one of the standard shapes are selected. If several hypotheses are selected, the one with the highest reliability is taken as the true hypothesis. The conversational sentence can then be divided, and its shape determined, from the division points of the selected hypothesis and the standard shapes obtained.
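The first step, dividing at pauses, might be sketched as follows, assuming frame-wise speech power is available. The function name `split_at_pauses`, the power threshold `silence_level`, and the run-length criterion `min_pause_frames` are all hypothetical parameters of this sketch; the text does not specify how silence is decided, so the duration criterion is left to the caller.

```python
from typing import List, Tuple


def split_at_pauses(
    power: List[float], silence_level: float, min_pause_frames: int
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, between pauses."""
    segments = []
    seg_start = None  # first frame of the current speech stretch
    last_loud = None  # last non-silent frame seen in that stretch
    silent_run = 0    # length of the current run of silent frames
    for i, p in enumerate(power):
        if p >= silence_level:
            if seg_start is None:
                seg_start = i
            last_loud = i
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The silence is long enough to count as a pause.
                segments.append((seg_start, last_loud + 1))
                seg_start = None
    if seg_start is not None:
        segments.append((seg_start, last_loud + 1))
    return segments


# Hypothetical power track: the single silent frame at index 3 is too short
# to split on, while the three silent frames at 5..7 are treated as a pause.
power = [0.0, 5.0, 6.0, 0.0, 7.0, 0.0, 0.0, 0.0, 8.0, 9.0, 0.0]
print(split_at_pauses(power, silence_level=1.0, min_pause_frames=3))
```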

[Embodiment] An embodiment of the present invention is described below with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram of the apparatus for estimating the fundamental frequency shape of spoken conversational sentences. The conversational speech is converted into an electrical signal by a microphone (not shown), after which acoustic processing unit 1 extracts the speech power and the fundamental frequency. Pause detection processing unit 2 detects pauses (for example, silent sections of 300 msec or less) from the speech power and divides the conversational sentence at the positions before and after each pause.

Next, fundamental frequency minimum point detection unit 3 detects the positions of the valleys of the fundamental frequency and sets them as division candidate points. The fundamental frequency does not arise in consonant sections such as fricatives, so the fundamental frequency extracted from a conversational sentence has gaps. Unit 3 therefore approximates the fundamental frequency with straight lines. This can be done, for example, by fitting a straight line between fundamental frequency values several points apart and finding the longest such line for which the error between the line and the fundamental frequency values extracted in between stays at or below a fixed threshold. Unit 3 then detects the positions of the valleys of the approximating lines obtained in this way.
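The line-fitting step described above can be sketched in Python as a greedy left-to-right search for the longest line that stays within an error threshold. Several details are assumptions of this sketch rather than statements from the patent: unvoiced frames are represented as `None`, the error is the maximum absolute deviation in Hz, and the names `fit_polyline`, `valleys`, and `max_err` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, float]  # (frame index, F0 in Hz)


def _max_dev(pts: List[Point], a: int, b: int) -> float:
    """Largest gap between the line pts[a]..pts[b] and the points between."""
    (xa, ya), (xb, yb) = pts[a], pts[b]
    slope = (yb - ya) / (xb - xa)
    return max(
        (abs(ya + slope * (x - xa) - y) for x, y in pts[a + 1:b]), default=0.0
    )


def fit_polyline(f0: List[Optional[float]], max_err: float) -> List[Point]:
    """Greedily cover the voiced F0 points with the longest lines that fit."""
    pts = [(i, v) for i, v in enumerate(f0) if v is not None]
    if not pts:
        return []
    knots = [pts[0]]
    start = 0
    while start < len(pts) - 1:
        end = start + 1  # a line to the next voiced point always fits
        for cand in range(start + 2, len(pts)):
            if _max_dev(pts, start, cand) <= max_err:
                end = cand  # keep the farthest end point that still fits
        knots.append(pts[end])
        start = end
    return knots


def valleys(knots: List[Point]) -> List[int]:
    """Frame indices of knots that lie lower than both neighboring knots."""
    return [
        knots[k][0]
        for k in range(1, len(knots) - 1)
        if knots[k][1] < knots[k - 1][1] and knots[k][1] < knots[k + 1][1]
    ]


# Hypothetical contour with an unvoiced gap at frames 3 and 4.
f0 = [100.0, 120.0, 140.0, None, None, 110.0,
      100.0, 120.0, 140.0, 130.0, 120.0, 110.0]
knots = fit_polyline(f0, max_err=2.0)
print(knots)
print(valleys(knots))
```

Because each line is anchored on voiced points only, the unvoiced gap is bridged by the line itself and does not produce a false valley.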

Next, fundamental frequency division/labeling processing unit 4 first sets up division hypotheses for each section delimited by pauses. That is, it generates hypotheses that assume each division candidate point in the section either is or is not a division point. (For example, when there is one division candidate point, two division hypotheses are generated: one in which the whole pause-delimited section is a single unit shape, and one in which it consists of two unit shapes, before and after the candidate point.) Then, using standard fundamental frequency shape dictionary 5, it verifies whether the fundamental frequency shape of each section treated as a single unit in a hypothesis corresponds to one of the standard shapes stored in dictionary 5. A division hypothesis in which every unit shape corresponds to one of the standard shapes is taken as the true hypothesis. If several division hypotheses are selected, the one with the highest confidence is chosen. The confidence can be based, for example, on the error between each unit shape and its standard shape.
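The verification and selection just described might look like the following Python sketch. It assumes a caller-supplied `match` function that returns a standard-shape label and an error for a span, or `None` when no standard shape fits. Summing the per-unit errors and choosing the smallest total is one possible reading of "highest confidence", not something the text prescribes. All names here are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]     # (start frame, end frame) of one assumed unit
Match = Tuple[str, float]  # (standard-shape label, error against that shape)


def select_hypothesis(
    hypotheses: List[List[Span]],
    match: Callable[[Span], Optional[Match]],
) -> Optional[Tuple[List[Span], List[str]]]:
    """Pick the hypothesis whose units all match, with the lowest total error."""
    best = None
    for spans in hypotheses:
        matches = [match(s) for s in spans]
        if any(m is None for m in matches):
            continue  # some unit fits no standard shape: reject the hypothesis
        total = sum(m[1] for m in matches)
        if best is None or total < best[0]:
            best = (total, spans, [m[0] for m in matches])
    return None if best is None else (best[1], best[2])


# Toy matcher: the undivided section fits nothing, the two halves both fit.
toy = {(0, 10): None, (0, 4): ("triangle", 1.5), (4, 10): ("trapezoid", 0.5)}
hyps = [[(0, 10)], [(0, 4), (4, 10)]]
result = select_hypothesis(hyps, toy.get)
print(result)
```

Returning the labels together with the spans reflects the point that division and shape are decided in the same step.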

The standard shapes stored in standard fundamental frequency shape dictionary 5 may be, for example, those shown in FIG. 2. The fundamental frequency of a word or phrase is basically produced as a rise + flat + fall shape. Because the flat part does not occur in, for example, phrases with a head-high (atamadaka) accent, the basic form is a triangle or a trapezoid. As for the fall, phrases with a rising-and-falling (kifuku) accent show a steep fall, whereas in phrases with a flat (heiban) accent the fall is observed as the gentle decline of the phrase component. The basic form is therefore further divided in two according to the slope of the fall. In addition, when the first mora of a word is produced at a low fundamental frequency, and when two phrases undergo accent combination, a fundamental frequency segment whose slope does not exceed the rising or falling slope of the basic form is attached before and after the basic form, respectively. Also, at the end of an interrogative sentence, a fundamental frequency shape that ends in a rise is produced. FIG. 2 shows these shapes approximated by straight lines.
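Since FIG. 2 is not reproduced here, the following Python sketch shows only one possible encoding of the shape families described in the paragraph above, as a pattern over one-letter labels for the straight-line pieces. The labels, the pattern, and the name `is_standard_shape` are assumptions. In particular, where the final rise of an interrogative attaches is not stated in the text, so its placement at the very end is a guess.

```python
import re

# Hypothetical one-letter labels for the straight-line pieces of a unit shape:
#   l = short low lead-in (interjection or low first mora)
#   r = rise, f = flat, s = steep fall, g = gentle fall
#   t = low flat tail (accent combination), q = final rise (interrogative)
STANDARD_SHAPE = re.compile(r"l?rf?[sg]t?q?")


def is_standard_shape(labels: str) -> bool:
    """True if the whole label sequence fits the assumed shape pattern."""
    return STANDARD_SHAPE.fullmatch(labels) is not None


print(is_standard_shape("rs"))    # triangle with a steep fall
print(is_standard_shape("lrfg"))  # low lead-in, then trapezoid with gentle fall
print(is_standard_shape("rsrs"))  # two units in a row, not a single shape
```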

Fundamental frequency division/labeling processing unit 4 verifies each unit shape using the standard shapes of FIG. 2. For this verification, the shapes of the rising and falling lines of the standard shapes must be specified. FIG. 3 shows an example in which the shape of each line of the standard shapes is specified by its duration, its fundamental frequency change, and its slope.
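A per-line condition of the kind described above could be represented as in the Python sketch below. The values of FIG. 3 are not available in this text, so every number here is invented purely for illustration, as are the units (seconds, Hz, Hz per second) and the names `LineRule` and `RISE`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineRule:
    """Allowed ranges for one straight-line piece of a standard shape."""
    min_dur: float    # seconds
    max_dur: float    # seconds
    min_df0: float    # F0 change over the line, Hz
    max_df0: float    # F0 change over the line, Hz
    min_slope: float  # Hz per second
    max_slope: float  # Hz per second

    def accepts(self, dur: float, df0: float) -> bool:
        """Check a measured line against duration, F0 change, and slope."""
        if dur <= 0:
            return False
        slope = df0 / dur
        return (
            self.min_dur <= dur <= self.max_dur
            and self.min_df0 <= df0 <= self.max_df0
            and self.min_slope <= slope <= self.max_slope
        )


# Invented example values for a "rise" line; these are NOT taken from FIG. 3.
RISE = LineRule(0.05, 0.30, 10.0, 80.0, 100.0, 1000.0)

print(RISE.accepts(0.1, 30.0))   # short upward line
print(RISE.accepts(0.1, -30.0))  # downward line of the same length
```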

As described above, this embodiment makes it possible to automatically detect the positions of the unit fundamental frequency shapes in a spoken conversational sentence and to determine those shapes at the same time.

〔Effect of the invention〕

According to the present invention, the fundamental frequency shape can be estimated with interjections and accent combination taken into account, which has the following effects. First, when the section estimated to be the low first mora of a word attached in front of the basic form of a standard shape contains two or more syllables, everything except the last syllable is necessarily an interjection, so this interjection can be removed. Second, in a section estimated to have an accent-combination shape, dividing at that position allows the phrase boundary to be detected. Using the estimated shapes in speech recognition therefore makes it possible to restrict the positions at which words can occur, which improves performance.

In addition, because the shapes can be estimated automatically, the invention can also be used for accent generation in fields such as speech synthesis.

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing an embodiment of the present invention; FIG. 2 is a diagram showing examples of the standard fundamental frequency shapes needed for shape estimation; and FIG. 3 is a diagram showing the conditions on each element (straight line) of the standard shapes. 1: acoustic processing unit; 2: pause detection processing unit; 3: fundamental frequency minimum point detection unit; 4: fundamental frequency division/labeling processing unit; 5: standard fundamental frequency shape dictionary.

Claims

[Claims] 1. An apparatus for estimating the fundamental frequency shape of a spoken conversational sentence, which divides the spoken conversational sentence and estimates the fundamental frequency shape needed to estimate its sentence structure, comprising: means for extracting the speech power and the fundamental frequency; means for detecting pauses from the extracted speech power and dividing the spoken conversational sentence at the positions before and after each pause; means for detecting the positions of valleys of the extracted fundamental frequency as division candidate points; means for storing in advance standard fundamental frequency shapes that take into account at least interjections and accent combination in the spoken conversational sentence; and means for verifying the division candidate points against the stored standard fundamental frequency shapes and estimating the fundamental frequency shape of the conversational sentence.