JPH026080B2

Info

Publication number: JPH026080B2
Application number: JP16250582A
Authority: JP
Inventors: Taisuke Watanabe; Kenji Kaga
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1982-09-17
Filing date: 1982-09-17
Publication date: 1990-02-07
Also published as: JPS5950500A

Description

[Detailed Description of the Invention]

Industrial Field of Application

The present invention relates to a pitch extraction device that divides a speech waveform into arbitrary sections, applies two nonlinear transformations to each section, and extracts the pitch by calculating the cross-correlation coefficient of the two resulting variables.

Configuration of the Conventional Example and Its Problems

The voiced portion of a speech waveform is a periodically repeating waveform, and the way its period (the pitch) changes over time is known to be an important parameter in speech processing. In a speech analysis-synthesis system, the pitch extracted during analysis strongly affects the quality of the speech produced during synthesis.

A widely used conventional method for extracting the pitch of a speech waveform divides the speech signal into frames of a fixed duration and calculates the autocorrelation coefficient of the signal in each frame. However, this method can mistakenly extract a component at twice or half the correct pitch period as the pitch, and the complexity of the calculation demands so much computation time that the method is unsuitable for real-time speech analysis. For the same reason, hardware built for real-time analysis needs a high-speed arithmetic processor and therefore becomes a large-scale device.
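The frame-wise autocorrelation search described above can be sketched as follows. This is only an illustrative sketch, not the conventional circuit itself: the function name `autocorrelation_pitch_lag` and the parameters `tau_min` and `tau_max` are hypothetical. It shows why the cost is high: every candidate lag needs roughly N - tau multiplications of full-precision samples.

```python
def autocorrelation_pitch_lag(frame, tau_min, tau_max):
    """Return the lag in [tau_min, tau_max] with the largest autocorrelation.

    Hypothetical helper for illustration; `frame` is a list of samples.
    """
    n = len(frame)
    best_tau, best_r = tau_min, float("-inf")
    for tau in range(tau_min, tau_max + 1):
        # One full-precision multiply per overlapping sample pair.
        r = sum(frame[i] * frame[i + tau] for i in range(n - tau))
        if r > best_r:  # strict comparison keeps the smallest lag on ties
            best_tau, best_r = tau, r
    return best_tau
```

Nothing in this search distinguishes the true period from its multiples, so a noisy or amplitude-modulated frame can peak at a lag of twice the pitch period, which is the failure mode the text describes.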

As a way to shorten this computation time, a method has been proposed in which the speech waveform x(n) is nonlinearly transformed by the following equations:

$$y_1(n) = \begin{cases} x(n) - C_L, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ x(n) + C_L, & x(n) \le -C_L \end{cases} \qquad (1)$$

$$y_2(n) = \begin{cases} 1, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ -1, & x(n) \le -C_L \end{cases} \qquad (2)$$

where C_L is a threshold, and the pitch is extracted by calculating the cross-correlation coefficient of the two nonlinearly transformed signals (Lawrence R. Rabiner: On the Use of Autocorrelation Analysis for Pitch Detection, IEEE Trans. ASSP-25, No. 1, 1977). However, this method also has the drawback that components at twice or half the correct pitch period are mistakenly extracted as the pitch.
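Equations (1) and (2) can be sketched in a few lines. This is a minimal sketch assuming a positive threshold and samples held in a Python list; the function names `center_clip` and `three_level` are illustrative and do not come from the patent.

```python
def center_clip(x, c_l):
    """Equation (1): zero the band |x| < c_l and shift the rest toward zero."""
    y1 = []
    for v in x:
        if v >= c_l:
            y1.append(v - c_l)
        elif v <= -c_l:
            y1.append(v + c_l)
        else:
            y1.append(0)
    return y1


def three_level(x, c_l):
    """Equation (2): map each sample to +1, 0, or -1."""
    y2 = []
    for v in x:
        if v >= c_l:
            y2.append(1)
        elif v <= -c_l:
            y2.append(-1)
        else:
            y2.append(0)
    return y2
```

For example, with `c_l = 3` the samples `[5, 2, -1, -4, 3]` become `[2, 0, 0, -1, 0]` under equation (1) and `[1, 0, 0, -1, 1]` under equation (2).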

Purpose of the Invention

The present invention was made to solve the conventional problems described above. Its object is to provide a pitch extraction device that requires less total computation than conventional devices and obtains the pitch with high accuracy.

Structure of the Invention

To achieve this object, the present invention nonlinearly transforms a speech waveform, divided into arbitrary sections, by means of a threshold to obtain y₁(n) and y₂(n), calculates the cross-correlation value P(τ) of these functions, and takes the value of τ that maximizes P(τ) as the pitch. In this process, a newly added circuit divides the speech waveform into a first half and a second half, finds the maximum value of each part separately, and determines the threshold from the two maxima. Making the threshold variable for each section in this way reduces the amount of computation and prevents erroneous pitch extraction.

Description of the Embodiment

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram for explaining the embodiment of the present invention.

A speech waveform is input to an A/D converter 1. The A/D converter 1 samples the speech waveform at a predetermined frequency, converting it into a discrete time-series signal, and outputs the resulting sampled speech waveform to a data buffer memory 2. (For telephone-grade speech quality, a sampling frequency of 8 kHz and 8-bit quantization are normally sufficient, and the following description uses these values.) The data buffer memory 2 temporarily stores the sampled speech waveform. Once it holds one analysis frame period of samples, it outputs the first half of the sampled speech waveform to a first-half maximum value detection circuit 3 and the remaining second half to a second-half maximum value detection circuit 4. The first half and the second half contain equal amounts of data. The data buffer memory 2 also outputs the entire sampled speech waveform to a nonlinear conversion circuit 6 and a ternary classification circuit 7.

The first-half maximum value detection circuit 3 finds the maximum value MAX1 of the first half of the sampled speech waveform and outputs it to a threshold determination circuit 5. The second-half maximum value detection circuit 4 finds the maximum value MAX2 of the second half of the sampled speech waveform and outputs it to the threshold determination circuit 5.

The threshold determination circuit 5 determines the threshold C_L from the maximum values MAX1 and MAX2 according to the following equations (3), (4), and (5).

$$\mathrm{IMAX} = \max(\mathrm{MAX1}, \mathrm{MAX2}) \qquad (3)$$

$$\mathrm{IMAX1} = \min(\mathrm{MAX1}, \mathrm{MAX2}) \qquad (4)$$

$$C_L = \min(0.6 \cdot \mathrm{IMAX},\ 0.8 \cdot \mathrm{IMAX1}) \qquad (5)$$

Here max( , ) selects the larger of its two arguments and min( , ) selects the smaller.
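The threshold rule of equations (3) to (5) can be sketched as below. This is a sketch under stated assumptions: the frame is a Python list with an even number of samples, the maximum is taken over the raw sample values as the text states, and the name `decide_threshold` is hypothetical.

```python
def decide_threshold(frame):
    """Return C_L for one analysis frame per equations (3)-(5)."""
    half = len(frame) // 2
    max1 = max(frame[:half])   # MAX1: maximum of the first half
    max2 = max(frame[half:])   # MAX2: maximum of the second half
    imax = max(max1, max2)     # equation (3)
    imax1 = min(max1, max2)    # equation (4)
    return min(imax * 0.6, imax1 * 0.8)  # equation (5)
```

Because the result never exceeds 80% of the smaller half-frame maximum, the peaks in the quieter half of the frame still clear the threshold when the amplitude changes within the frame.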

The threshold determination circuit 5 outputs the threshold C_L thus determined to the nonlinear conversion circuit 6 and the ternary classification circuit 7.

The threshold determination method of the present invention is now explained with reference to FIG. 2.

In analysis frame A, the pitch is clearly T. To detect the pitch T, the threshold level must be set at or below the peak value 201, that is, at the threshold 202. In analysis frame B, the pitch is T1. If the threshold 202 were used here, however, the peak value 203 would exceed the threshold 202, and pitch extraction might return twice the true pitch period T1. In the threshold determination method of the present invention, analysis frame B is divided into a first half B₁ and a second half B₂. The first-half maximum value detection circuit 3 detects the peak value 204 as the maximum of the first half, and the second-half maximum value detection circuit 4 detects the peak value 205 as the maximum of the second half. The threshold determination circuit 5 then sets the threshold 206 to 60% of the peak value 205. As a result, pitch extraction yields the true pitch T1.

Similarly, when the threshold determination method of the present invention is applied to analysis frame A, 80% of the peak value 201 is chosen as the threshold (the same level as the threshold 202), so pitch extraction yields the true pitch T.
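As a worked illustration of which branch of equation (5) is selected, the condition can be rearranged as follows. The numerical maxima in the second and third lines are hypothetical and are not read from FIG. 2.

$$\begin{aligned}
0.8 \cdot \mathrm{IMAX1} \le 0.6 \cdot \mathrm{IMAX} \;&\Longleftrightarrow\; \mathrm{IMAX1} \le 0.75 \cdot \mathrm{IMAX} \\
\mathrm{MAX1} = 100,\ \mathrm{MAX2} = 90 \;&\Rightarrow\; C_L = \min(60,\ 72) = 60 \\
\mathrm{MAX1} = 100,\ \mathrm{MAX2} = 50 \;&\Rightarrow\; C_L = \min(60,\ 40) = 40
\end{aligned}$$

So the 80% branch applies only when the smaller half-frame maximum is at most three quarters of the larger one. Otherwise the threshold is 60% of the larger maximum.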

The nonlinear conversion circuit 6 nonlinearly transforms the sampled speech waveform, which has been transferred from the buffer memory 2 and temporarily stored, according to equation (1) using the threshold C_L. FIG. 3 shows an example of a sampled speech waveform nonlinearly transformed by the nonlinear conversion circuit 6: FIG. 3a shows the sampled speech waveform, and FIG. 3b shows the sampled speech waveform after the nonlinear transformation. The nonlinear conversion circuit 6 then outputs the nonlinearly transformed sampled speech waveform to a pitch calculation circuit 8.

The ternary classification circuit 7 classifies the sampled speech waveform, which has been transferred from the buffer memory 2 and temporarily stored, into three values according to equation (2) using the threshold C_L, and outputs the result to the pitch calculation circuit 8. FIG. 4 shows an example of a sampled speech waveform classified into three values by the ternary classification circuit 7 according to equation (2): FIG. 4a shows the sampled speech waveform, and FIG. 4b shows the signal after ternary classification.

The pitch calculation circuit 8 calculates the cross-correlation coefficient P(τ) from the signals y₁(n) and y₂(n) supplied by the nonlinear conversion circuit 6 and the ternary classification circuit 7, according to the following equation (6).

$$P(\tau) = \sum_{i=1}^{N-\tau} y_1(i) \cdot y_2(i+\tau) \qquad (6)$$

where N is the number of sampled speech waveform values per frame (i = 1, ..., N).

The pitch of ordinary adult male and female speakers varies between 50 Hz and 400 Hz. Searching this range at the 8 kHz sampling frequency corresponds to τ = 20 to 160. Let Pmax(τo) denote the largest of the values P(τ) obtained from equation (6); the corresponding τo is taken as the pitch.
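The search over τ can be sketched as follows, assuming y₁ and y₂ are Python lists of equal length N with 0-based indexing (the text indexes from 1). The function names and the default lag range of 20 to 160, which corresponds to 400 Hz down to 50 Hz at 8 kHz, are illustrative choices. Because y₂ only takes the values +1, 0, and -1, each product in equation (6) reduces to adding, skipping, or subtracting a sample of y₁, which is where the saving in computation comes from.

```python
def cross_correlation(y1, y2, tau):
    """Equation (6): sum of y1(i) * y2(i + tau) over the overlapping samples."""
    n = len(y1)
    return sum(y1[i] * y2[i + tau] for i in range(n - tau))


def find_pitch_lag(y1, y2, tau_min=20, tau_max=160):
    """Return the lag tau_o that maximizes P(tau) within [tau_min, tau_max]."""
    best_tau = tau_min
    best_p = cross_correlation(y1, y2, tau_min)
    for tau in range(tau_min + 1, tau_max + 1):
        p = cross_correlation(y1, y2, tau)
        if p > best_p:  # strict comparison keeps the smallest lag on ties
            best_tau, best_p = tau, p
    return best_tau


def lag_to_hz(tau, sampling_hz=8000):
    """Convert a lag in samples to a pitch frequency in Hz."""
    return sampling_hz / tau
```

A lag of 50 samples at 8 kHz, for instance, corresponds to a pitch of 160 Hz.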

Effects of the Invention

As described above, the present invention changes the threshold C_L every analysis frame period according to the amplitude of the speech waveform, nonlinearly transforms the speech waveform using that threshold to obtain y₁(n) and y₂(n), and calculates the cross-correlation coefficient of y₁(n) and y₂(n), thereby enabling highly accurate pitch extraction with little computation time.

[Brief Description of the Drawings]

FIG. 1 is a block diagram for explaining a pitch extraction device according to an embodiment of the present invention, FIG. 2 is an explanatory diagram for explaining the threshold determination method, FIG. 3 is a diagram showing the characteristics of the nonlinear conversion circuit 6, and FIG. 4 is a diagram showing the characteristics of the ternary classification circuit 7.

1 ... A/D converter, 2 ... data buffer memory, 3 ... first-half maximum value detection circuit, 4 ... second-half maximum value detection circuit, 5 ... threshold determination circuit, 6 ... nonlinear conversion circuit, 7 ... ternary classification circuit, 8 ... pitch calculation circuit.

Claims

[Claims]

1. A pitch extraction device that divides a speech waveform into arbitrary sections and extracts the pitch for each section, wherein, within each section, the speech waveform x(n) is nonlinearly transformed by the following equations

$$y_1(n) = \begin{cases} x(n) - C_L, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ x(n) + C_L, & x(n) \le -C_L \end{cases} \qquad (1)$$

$$y_2(n) = \begin{cases} 1, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ -1, & x(n) \le -C_L \end{cases} \qquad (2)$$

(where C_L is a threshold) to calculate y₁(n) and y₂(n), and the pitch is determined by calculating the cross-correlation coefficient of y₁(n) and y₂(n), the device being characterized in that the speech waveform within the section is divided into a first half and a second half, and the threshold is determined for each section from the amplitude information of the speech waveforms of the two parts.