JPH0558552B2 - Google Patents

Info

Publication number
JPH0558552B2
JPH0558552B2 JP62205228A JP20522887A JPH0558552B2 JP H0558552 B2 JPH0558552 B2 JP H0558552B2 JP 62205228 A JP62205228 A JP 62205228A JP 20522887 A JP20522887 A JP 20522887A JP H0558552 B2 JPH0558552 B2 JP H0558552B2
Authority
JP
Japan
Prior art keywords
fundamental frequency
hypothesis
shape
division
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
JP62205228A
Other languages
Japanese (ja)
Other versions
JPS6449098A (en)
Inventor
Eiji Oohira
Akio Komatsu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Institute of Advanced Industrial Science and Technology AIST
Original Assignee
Agency of Industrial Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency of Industrial Science and Technology
Priority to JP62205228A
Publication of JPS6449098A
Publication of JPH0558552B2
Granted

Abstract

PURPOSE: To restrict the positions at which words can occur by selecting, as the true hypothesis, a division hypothesis in which every fundamental frequency shape between assumed division points corresponds to one of the standard shapes. CONSTITUTION: Because no pause is produced at a level smaller than a word, pauses are detected first and the conversational sentence is divided at them. The positions of the fundamental frequency minima within each divided section are then detected and taken as division candidate points. Division hypotheses are set up that assume each candidate point either is or is not a division point, and the hypotheses in which every fundamental frequency shape between assumed division points corresponds to one of the standard shapes are selected. When several hypotheses are selected, the one with the highest reliability is taken as the true hypothesis. The division and shapes of the conversational sentence are thus determined from the division points of the selected hypothesis and the matched standard shapes.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to a method for understanding spoken conversational sentences (hereinafter simply "conversational sentences"). In particular, it relates to an apparatus for estimating the fundamental frequency shape of spoken conversational sentences, which divides a conversational sentence in which several sentences are uttered in succession and determines the fundamental frequency shapes needed to estimate its sentence structure.

[Prior Art] Methods for obtaining the fundamental frequency shape of conversational sentences have been discussed in "A model of the fundamental frequency pattern of Japanese accent and its generation mechanism" (Journal of the Acoustical Society of Japan, 27-9, pp. 445-453, 1971) and "A method of linear approximation of pitch patterns" (Speech Research Group Material S47-11, 1974).

[Problems to Be Solved by the Invention] The prior art above was developed for analyzing fundamental frequency shapes in the field of speech synthesis, and it permits highly accurate estimation when a human takes part in setting parameters and the like. In the field of recognition and understanding, however, the fundamental frequency shape must be estimated automatically, and the prior art loses accuracy when run automatically. It also has the drawback of requiring a large amount of processing.

An object of the present invention is to estimate the fundamental frequency shape of spoken conversational sentences accurately and automatically.

[Means for Solving the Problems] The fundamental frequency rises at the start of an utterance and then gradually falls, so it takes the shape of a triangle or of the hiragana character 「へ」. One unit of fundamental frequency shape can therefore be cut out by finding the minimum points of the fundamental frequency. However, because of fluctuations in speech, minimum points also occur within a single unit of fundamental frequency.

Moreover, in actual conversational sentences the shape cut out as one unit is not limited to a triangle or a 「へ」 shape. When an interjection or the first mora of a word is produced at low pitch, a short, low-frequency segment is attached before the triangle. Furthermore, when a phrase (bunsetsu) with a flat-type accent follows a phrase with a rise-fall accent and the two undergo accent combination, a low, flat fundamental frequency is attached after the triangle.

The above object is therefore achieved by (1) defining standard fundamental frequency shapes that take interjections and accent combination into account, and (2) treating each detected minimum point of the fundamental frequency as a division candidate point, setting up division hypotheses that assume each candidate point either is or is not a division point, and selecting as the true hypothesis one in which every fundamental frequency shape between assumed division points corresponds to one of the standard shapes.

[Operation] No pause is produced at a level smaller than a word. For this reason, pauses are detected first and the conversational sentence is divided at them. Second, the positions of the fundamental frequency minima within each divided section are detected and taken as division candidate points. Division hypotheses are then set up that assume each candidate point either is or is not a division point, and the hypotheses in which every fundamental frequency shape between assumed division points corresponds to one of the standard shapes are selected. If several hypotheses are selected, the one with the highest reliability is taken as the true hypothesis. The division points of the selected hypothesis and the matched standard shapes then determine both the division of the conversational sentence and its shapes.
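The first step, dividing at pauses, can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `split_at_pauses`, the per-frame power list, the silence threshold, and the minimum pause length in frames are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def split_at_pauses(power: Sequence[float],
                    silence_threshold: float,
                    min_pause_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of speech sections, end exclusive.

    A run of at least `min_pause_frames` frames below `silence_threshold`
    is treated as a pause (both values are hypothetical tuning parameters).
    Shorter silent runs stay inside the surrounding section.
    """
    sections: List[Tuple[int, int]] = []
    section_start = None  # first frame of the section currently open
    silent_run = 0        # length of the current run of silent frames
    for i, p in enumerate(power):
        if p >= silence_threshold:
            if section_start is None:
                section_start = i
            silent_run = 0
        else:
            silent_run += 1
            if section_start is not None and silent_run == min_pause_frames:
                # Close the section at the first frame of the pause.
                sections.append((section_start, i - min_pause_frames + 1))
                section_start = None
    if section_start is not None:
        # Drop any trailing silence that was too short to count as a pause.
        sections.append((section_start, len(power) - silent_run))
    return sections


power = [0, 0, 5, 6, 7, 0, 6, 5, 0, 0, 0, 0, 4, 5, 0]
sections = split_at_pauses(power, silence_threshold=1, min_pause_frames=3)
```

Each returned range is one pause-delimited section in which division candidate points are then searched.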

[Embodiment] An embodiment of the present invention is described below with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram of the apparatus for estimating the fundamental frequency shape of spoken conversational sentences. Conversational speech is converted into an electrical signal by a microphone (not shown), after which the acoustic processing unit 1 extracts the speech power and the fundamental frequency. The pause detection processing unit 2 detects pauses (for example, silent sections of 300 msec or less) from the speech power and divides the conversational sentence at the positions before and after each pause.

Next, the fundamental frequency minimum point detection unit 3 detects the positions of the valleys of the fundamental frequency and takes them as division candidate points. No fundamental frequency is produced in consonant sections such as fricatives, so the fundamental frequency extracted from a conversational sentence has gaps. The fundamental frequency minimum point detection unit 3 therefore approximates the fundamental frequency with straight lines. This approximation can be realized, for example, by fitting a straight line between fundamental frequency points that are several points apart and finding the longest such line for which the error between the line and the fundamental frequency values extracted in between stays at or below a fixed threshold. The fundamental frequency minimum point detection unit 3 then detects the positions of the valleys of the approximating lines obtained in this way.
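The straight-line approximation and valley detection can be sketched as below. This is one possible reading of the description, under stated assumptions: each line is taken as the chord between two measured points, the error is the maximum absolute deviation of the points in between, unvoiced frames are marked with `None`, and the 5 Hz threshold is a placeholder. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[int, float]  # (frame index, F0 in Hz)


def max_chord_error(points: Sequence[Point], start: int, end: int) -> float:
    """Largest absolute gap between the chord start->end and the points it spans."""
    (x0, y0), (x1, y1) = points[start], points[end]
    slope = (y1 - y0) / (x1 - x0)
    return max(abs(y - (y0 + slope * (x - x0))) for x, y in points[start:end + 1])


def approximate_with_lines(f0: Sequence[Optional[float]],
                           max_error: float = 5.0) -> List[Point]:
    """Return the knots of a piecewise-linear fit; None marks frames with no F0."""
    points = [(i, v) for i, v in enumerate(f0) if v is not None]
    if len(points) < 2:
        return points
    knots = [points[0]]
    start = 0
    while start < len(points) - 1:
        end = len(points) - 1
        # Shrink from the far end until the chord fits, which yields the
        # longest acceptable line; two adjacent points always fit.
        while end > start + 1 and max_chord_error(points, start, end) > max_error:
            end -= 1
        knots.append(points[end])
        start = end
    return knots


def valley_positions(knots: Sequence[Point]) -> List[int]:
    """Frame indices of interior knots that are lower than both neighbours."""
    return [knots[i][0] for i in range(1, len(knots) - 1)
            if knots[i - 1][1] > knots[i][1] < knots[i + 1][1]]


f0 = [100.0, 110.0, 120.0, 130.0, None, None, 115.0, 110.0,
      105.0, 100.0, 110.0, 120.0, 115.0, 110.0]
knots = approximate_with_lines(f0)
valleys = valley_positions(knots)
```

Because the fit runs over measured points only, a gap such as frames 4 and 5 in the example is bridged by the line rather than producing a spurious minimum.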

Next, the fundamental frequency division/labeling processing unit 4 first sets up division hypotheses for each section delimited by pauses. That is, it generates hypotheses that assume each division candidate point within the section either is or is not a division point. (For example, when there is one division candidate point, two division hypotheses are generated: one in which the whole pause-delimited section is a single unit shape, and one consisting of two unit shapes, before and after the candidate point.) Next, using the standard fundamental frequency shape dictionary 5, the unit verifies whether the fundamental frequency shape of each section treated as one unit in a hypothesis corresponds to one of the standard shapes stored in dictionary 5. A division hypothesis in which every unit shape corresponds to one of the standard shapes is taken as the true hypothesis. If more than one division hypothesis qualifies, the one with the highest confidence is selected. The confidence can be based, for example, on the error between each unit shape and its standard shape.
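Hypothesis generation and selection can be sketched as follows. With n division candidate points there are 2^n hypotheses, one per subset of candidates treated as real division points. The sketch assumes a caller-supplied `match_error` function that returns the error against the best-fitting standard shape, or `None` when no standard shape fits. Using the summed error as an inverse confidence is an assumption, as are all names; `toy_match_error` is a stand-in for a real dictionary lookup.

```python
from itertools import product
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

Unit = Tuple[int, int]  # (start frame, end frame) of one unit shape


def enumerate_hypotheses(start: int, end: int,
                         candidates: Sequence[int]) -> Iterator[List[Unit]]:
    """Yield every division hypothesis as the list of units it implies."""
    for flags in product((False, True), repeat=len(candidates)):
        cuts = [c for c, used in zip(candidates, flags) if used]
        bounds = [start, *cuts, end]
        yield list(zip(bounds[:-1], bounds[1:]))


def select_hypothesis(start: int, end: int, candidates: Sequence[int],
                      match_error: Callable[[int, int], Optional[float]]
                      ) -> Optional[List[Unit]]:
    """Keep hypotheses whose units all match; return the lowest-error one."""
    best, best_error = None, float("inf")
    for hypothesis in enumerate_hypotheses(start, end, candidates):
        errors = [match_error(a, b) for a, b in hypothesis]
        if any(e is None for e in errors):
            continue  # some unit matches no standard shape
        total = sum(errors)
        if total < best_error:
            best, best_error = hypothesis, total
    return best


def toy_match_error(a: int, b: int) -> Optional[float]:
    """Hypothetical matcher: accept units 3 to 6 frames long, ideal length 5."""
    length = b - a
    return float(abs(length - 5)) if 3 <= length <= 6 else None


chosen = select_hypothesis(0, 10, [4, 7], toy_match_error)
```

Exhaustive enumeration is practical here because pauses keep each section short, so the number of candidate points per section stays small.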

The standard shapes stored in the standard fundamental frequency shape dictionary 5 may be, for example, those shown in FIG. 2. The fundamental frequency of a word or phrase is basically produced with a rise + flat + fall shape. The flat part does not occur in, for example, phrases with a head-high accent, so the basic type is a triangle or a trapezoid. As for the fall, phrases with a rise-fall accent show a steep fall, whereas in phrases with a flat-type accent the fall is observed as the gentle decline of the phrase component. The basic type is therefore further divided in two according to the slope of the fall. In addition, when the first mora of a word is produced with a low fundamental frequency, or when two phrases undergo accent combination, a fundamental frequency segment whose slope does not exceed the rising or falling slope of the basic type is attached before or after the basic type. At the end of an interrogative sentence, a fundamental frequency shape that ends in a rise is produced. FIG. 2 shows these shapes approximated by straight lines.
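One way to hold such a dictionary in code is as named sequences of line-segment labels. FIG. 2 is not reproduced in this text, so the set below is an assumption reconstructed from the prose alone. The label names, the shape names, and the choice of which combinations to include are all hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

Shape = Tuple[str, ...]  # ordered line-segment labels


def build_shape_dictionary() -> Dict[str, Shape]:
    """Build an illustrative dictionary of standard shapes."""
    shapes: Dict[str, Shape] = {}
    for flat, fall in product((False, True), ("steep_fall", "gentle_fall")):
        # Basic type: triangle (rise + fall) or trapezoid (rise + flat + fall),
        # split again by whether the fall is steep or gentle.
        core: Shape = ("rise", "flat", fall) if flat else ("rise", fall)
        kind = ("trapezoid" if flat else "triangle") + "/" + fall
        shapes[kind] = core
        # Low first mora or interjection attached before the basic type.
        shapes["low_lead+" + kind] = ("low_lead",) + core
        # Low segment attached after the basic type by accent combination.
        shapes[kind + "+low_tail"] = core + ("low_tail",)
        shapes["low_lead+" + kind + "+low_tail"] = ("low_lead",) + core + ("low_tail",)
        # Interrogative ending: the shape finishes with a rise.
        shapes[kind + "+final_rise"] = core + ("final_rise",)
    return shapes


shapes = build_shape_dictionary()
```

A unit whose fitted line segments carry the same label sequence as one of these entries would count as matching that standard shape.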

The fundamental frequency division/labeling processing unit 4 verifies each unit shape against the standard shapes of FIG. 2. For this verification, the shapes of the rising and falling lines of the standard shapes must be defined. FIG. 3 is an example in which the shape of each straight line of the standard shapes is defined by its duration, its fundamental frequency change, and its slope.
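A line-segment rule of this kind can be sketched as three allowed ranges, one each for duration, fundamental frequency change, and slope. The values of FIG. 3 are not available in this text, so every number below is a placeholder chosen only to make the example run, and the names `LineRule`, `RULES`, and `labels_for` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[float, float]  # inclusive (low, high)


@dataclass(frozen=True)
class LineRule:
    label: str
    duration_ms: Range
    delta_hz: Range
    slope_hz_per_ms: Range

    def accepts(self, duration_ms: float, delta_hz: float) -> bool:
        """True when duration, F0 change, and slope all fall inside the ranges."""
        slope = delta_hz / duration_ms
        return (self.duration_ms[0] <= duration_ms <= self.duration_ms[1]
                and self.delta_hz[0] <= delta_hz <= self.delta_hz[1]
                and self.slope_hz_per_ms[0] <= slope <= self.slope_hz_per_ms[1])


# Placeholder ranges, not taken from the patent.
RULES = [
    LineRule("rise", (30, 300), (10, 150), (0.05, 3.0)),
    LineRule("flat", (30, 1000), (-5, 5), (-0.01, 0.01)),
    LineRule("steep_fall", (50, 400), (-200, -20), (-3.0, -0.2)),
    LineRule("gentle_fall", (100, 1500), (-100, -5), (-0.2, -0.01)),
]


def labels_for(duration_ms: float, delta_hz: float) -> List[str]:
    """Labels of every rule that accepts a line with this duration and F0 change."""
    return [r.label for r in RULES if r.accepts(duration_ms, delta_hz)]
```

For instance, `labels_for(200, -60)` describes a 200 ms line that drops 60 Hz and is checked against all four rules.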

As described above, this embodiment makes it possible to automatically detect the position of each unit fundamental frequency shape in a spoken conversational sentence and to determine its shape at the same time.

[Effects of the Invention]

According to the present invention, fundamental frequency shapes that also take interjections and accent combination into account can be estimated, which yields the following effects. First, if the section estimated to be the low first mora of a word attached before the basic type of a standard shape contains two or more syllables, everything except the last syllable is necessarily an interjection, so that interjection can be removed. Second, in a section estimated to have an accent-combination shape, a phrase boundary can be detected by dividing at that position. Using the estimated shapes in speech recognition therefore makes it possible to restrict the positions at which words can occur, which improves performance.

In addition, because the shapes can be estimated automatically, the invention can also be used for accent generation in fields such as speech synthesis.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a diagram showing examples of the standard fundamental frequency shapes needed for shape estimation, and FIG. 3 is a diagram showing the conditions on each element (straight line) of the standard shapes. 1 is the acoustic processing unit, 2 is the pause detection processing unit, 3 is the fundamental frequency minimum point detection unit, 4 is the fundamental frequency division/labeling processing unit, and 5 is the standard fundamental frequency shape dictionary.

Claims (1)

[Claims] 1. An apparatus for estimating the fundamental frequency shape of a spoken conversational sentence, which divides the spoken conversational sentence and estimates the fundamental frequency shapes needed to estimate the sentence structure of the spoken conversational sentence, comprising: means for extracting speech power and fundamental frequency from the input spoken conversational sentence; means for detecting pauses from the extracted speech power and dividing the spoken conversational sentence at positions before and after each pause; means for detecting the positions of the valleys of the extracted fundamental frequency as division candidate points; means for storing in advance standard fundamental frequency shapes that take into account at least interjections and accent combination in the spoken conversational sentence; and means for verifying the division candidate points on the basis of the stored standard fundamental frequency shapes and estimating the fundamental frequency shape of the conversational sentence.
JP62205228A 1987-08-20 1987-08-20 Fundamental frequency geometry estimation system Granted JPS6449098A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP62205228A JPS6449098A (en) 1987-08-20 1987-08-20 Fundamental frequency geometry estimation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP62205228A JPS6449098A (en) 1987-08-20 1987-08-20 Fundamental frequency geometry estimation system

Publications (2)

Publication Number Publication Date
JPS6449098A (en) 1989-02-23
JPH0558552B2 (en) 1993-08-26

Family

ID=16503526

Family Applications (1)

Application Number Title Priority Date Filing Date
JP62205228A Granted JPS6449098A (en) 1987-08-20 1987-08-20 Fundamental frequency geometry estimation system

Country Status (1)

Country Link
JP (1) JPS6449098A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010211122A (en) * 2009-03-12 2010-09-24 Nissan Motor Co Ltd Speech recognition device and method

Also Published As

Publication number Publication date
JPS6449098A (en) 1989-02-23

Similar Documents

Publication Publication Date Title
Goto et al. A real-time filled pause detection system for spontaneous speech recognition.
US7120575B2 (en) Method and system for the automatic segmentation of an audio stream into semantic or syntactic units
KR100438826B1 (en) System for speech synthesis using a smoothing filter and method thereof
US7013276B2 (en) Method of assessing degree of acoustic confusability, and system therefor
Litman et al. Predicting automatic speech recognition performance using prosodic cues
KR100309207B1 (en) Speech-interactive language command method and apparatus
Seymore et al. The 1997 CMU Sphinx-3 English broadcast news transcription system
Audhkhasi et al. Formant-based technique for automatic filled-pause detection in spontaneous spoken English
US7177810B2 (en) Method and apparatus for performing prosody-based endpointing of a speech signal
KR870009322A (en) Speaker array language recognition system
US20070136062A1 (en) Method and apparatus for labelling speech
Hieronymus et al. Spoken language identification using large vocabulary speech recognition
Gabrea et al. Detection of filled pauses in spontaneous conversational speech.
JPS6138479B2 (en)
JPH0558552B2 (en)
JP4791857B2 (en) Utterance section detection device and utterance section detection program
JPS5912185B2 (en) Voiced/unvoiced determination device
Ohtake et al. Newscast speech summarization via sentence shortening based on prosodic features
Lertwongkhanakool et al. An automatic real-time synchronization of live speech with its transcription approach
KR100350003B1 (en) A system for determining a word from a speech signal
Lertwongkhanakool et al. Real-time synchronization of live speech with its transcription
Rapp Automatic labelling of German prosody.
JPH07295588A (en) Speech rate estimation method
JPH0242238B2 (en)
Takahashi et al. Isolated word recognition using pitch pattern information

Legal Events

Date Code Title Description
EXPY Cancellation because of completion of term