JPH0446440B2

Info

Publication number: JPH0446440B2
Application number: JP59053757A
Authority: JP
Inventors: Masaaki Yoda
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1984-03-21
Filing date: 1984-03-21
Publication date: 1992-07-29
Also published as: JPS60196800A

Description

[Detailed description of the invention]

この発明は、音声波形をその相関を除去するフ
イルタに通して予測残差波形を得その予測残差波
形を用いる音声信号処理方式に関するものであ
る。＜従来技術＞従来、音声符号化には波形符号化と分析合成系
（ボコーダ）との２つのクラスがある。後者の分
析合成系のクラスに属する線形予測符号化
（LPC）方式では、第１図に示すように音声のス
ペクトル包絡を表わす全極形のフイルタ（予測フ
イルタ）を、入力端子１１からの音声波形につい
て線形予測分析によつて求めた後、それと逆の特
性をもつ全零形のフイルタ（逆フイルタ）１２に
音声波形を通して予測残差波形を求め、この残差
波形を特徴づけるパラメータとしての周期性の有
無（有声、無声の判定）、ピツチ周期、および平
均電力をパラメータ抽出部１３で抽出し、これと
前記予測フイルタ係数とを送出する。合成側では
予測残差波形の代りに、有声の場合は周期パルス
列、無声の場合は雑音波形を音源生成部１４で用
いて予測フイルタ１５を駆動し、予測フイルタ１
５のフイルタ係数を前記フイルタ係数で設定して
音声波形を生成して出力端子１６に出力する。一方、前者の波形符号化のクラスに属する適応
符号化（APC）では第２図に示すようにLPCボ
コーダと同様な手段で予測残差波形を求めた後、
この残差波形のサンプル値をそのまま量子化部１
７で量子化（符号化）し、これと予測フイルタ係
数とを送出する。合成側では、復合か部１８で復
合化された残差波形を用いて予測フイルタ１５を
駆動することにより音声波形を生成する。これら２つの従来方式の違いは予測残差波形の
符号化の方法にある。第１図のLPCボコーダで
は残差波形については、その特徴パラメータだけ
を伝送すれば良いので、残差波形のサンプル毎の
量子化値を伝送する第２図のAPC方式にくらべ
てビツトレートの大幅な低減が図れる。しかし、
その反面第１図に示した方式は残差波形をパルス
列あるいは雑音で置き換えることによる品質劣下
はまぬがれず、ビツトレートを高くしても6kb／
ｓ程度で品質が飽和し、自然な音声品質を提供し
得ない欠点があり、また残差波形の特徴パラメー
タの誤抽出が品質劣下を引き起こす欠点があつ
た。一方、第２図に示したSPC方式では残差波形
の量子化ビツト数を高めることにより原音声に限
り無く近い音声品質を実現できる反面、ビツトレ
ートが１６kb／ｓ以下になると量子化歪が増大
して音声品質が急激に劣下するこという欠点があ
つた。また従来において音声信号のピツチの変更や音
声信号の継ぎ足し時に、エネルギの集中している
個所で行うおそれがあり、その場合は不自然なも
のとなる欠点があつた。＜発明の構成＞この発明の目的は例えば音声信号を継ぎ足す場
合に自然性が得られるようにすることを可能とす
る音声信号処理方式を提供することにある。この発明の他の目的はビツトレートが16kb／
ｓ以下でも比較的良好な音声品質を保持できる音
声信号処理方式を提供することにある。この発明によれば音声波形を線形予測分析して
予測残差波形を得、この予測残差波形についてそ
の短時間（例えばピツチ周期程度以下）位相特性
と逆の特性をもつ線形フイルタ（位相等化フイル
タ）のフイルタ係数を残差波形から適応的に決定
し、その位相等化フイルタに前記音声波形又は予
測残差波形を通して予測残差波形を零位相化、つ
まり位相等化する。この位相等化された予測残差
波形エネルギーがインパルス的に集中し、従つて
そのエネルギーが集中してない部分で、例えば音
声波形の継ぎ足しを行うことにより自然性のよい
音声波形が得られる。また前記位相等化した音声
波形又は予測残差波形を符号化する際に、例えば
そのエネルギーが集中している部分に多くの情報
を割り当てることにより、効率的に符号化するこ
とができ16kb／ｓ以下でも可成り良い音声品質
を得ることができる。＜発明の原理＞まず、この発明による音声信号処理方式の原理
について述べる。音声波形のサンプル値をＳ
（ｎ）、音声波形を線形予測分析して得られる予測
係数をａ（ｋ）（ｋ＝１，２……Ｐ）とすると、予
測残差波形のサンプル値ｅ（ｎ）は次式で表わさ
れる。ｅ（ｎ）＝_p 〓^k=0 ａ（ｋ）・Ｓ（ｎ−ｋ） ……(1) ただし、ａ（０）＝１である。残差波形ｅ（ｎ）
は、音声波形のスペクトル包絡成分を除去したも
の、つまり音声波形のサンプル値間の相関を除去
したもので平坦なスペクトル包絡をもち、かつ有
声音については音声のピツチ周期成分をもつてい
る。そこで、このような残差波形の特徴を次のよ
うなパルス列として理想化して実現する。 e_M（ｎ）＝_L-1 〓^l=0 δ（ｎ−nl） ……(2) ここで、δ（ｎ）はクロネツカーのデルタ関数
で、δ（０）＝１、およびδ（ｎ）＝０（ｎ≠０）で
ある。n_lはパルス位置を表わし、n_l−n_l-1は音声
のピツチ周期に対応する。つまりこのパルス列e_M
（ｎ）はピツチ位置n_lのみにパルスが存在し、そ
の他はゼロである。残差波形ｅ（ｎ）とパルス列
e_M（ｎ）は共に平坦なスペクトル包絡とピツチ周
期成分とをもつから、両波形の差は主に短時間、
つまりピツチ周期程度以下の時間での位相特性の
違いによる。そこで、残差波形の短時間位相の逆
特性を持つ線形フイルタのインパルス応答をｈ
（ｍ）とすると、この線形フイルタ（位相等化フ
イルタ）に残差波形ｅ（ｎ）を通して位相等化
（零位相化）つまりスペクトラムの各成分が同一
位相化された残差波形ep（ｎ）は次式で算出され
る。 ep（ｎ）＝_M 〓^m=0 ｈ（ｍ）ｅ（ｎ−ｍ） ……(3) このインパルス応答ｈ（ｍ）は、e_P（ｎ）とe_M
（ｎ）との平均二乗誤差を最小化することにより
求められる。その平均二乗誤差を次式で表わす。Ｊ＝１／Ｎ_N-1 〓ⁿ⁼⁰ ｛e_P（ｎ）−e_M（ｎ）｝² ……(4) (4)式に(2)，(3)式を代入して、ｈ（ｍ）で偏微分
して零とおくとインパルス応答ｈ（ｍ）は次の連
立方程式の解として求められる。 _M 〓^k=0 ｖ（｜ｍ−ｋ｜）ｈ（ｋ）＝_L-1 〓^l=0 ｅ（nl−ｍ）
……(5) （ｍ＝０，１，……；Ｍ）ここでｖ（ｋ）は残差波形の自己相関関数であ
り次式で算出される。ｖ（ｋ）＝_N-k-1 〓ⁿ⁼⁰ ｅ（ｎ）ｅ（ｎ＋ｋ） ……(6) （ｋ＝０，１，……Ｍ）位相等化フイルタのタツプ数Ｍ＋１と対応した
時間、つまり応答時間がピツチ周期より短かい場
合は自己相関関数は、残差波形が平坦なスペクト
ルをもつことからｖ（ｋ）v₀δ（ｋ）として近
似できる。つまりｋ＝０の時だけ値をもつからそ
の場合、(5)式はｍ＝ｋの時だけ値をもち次のよう
に簡単化できる。ｈ（ｍ）＝１／v_0L-1 〓^l=0 ｅ（n_l−ｍ） ……(7) さらに、分析窓長Ｎがピツチ周期より短い場合
はＬ＝１となり（パルスが１個となり）、インパ
ルス応答は次式で算出される。ｈ（ｍ）＝１／v₀ｅ（n₀−ｍ） ……(8) 即ち、インパルス応答ｈ（ｍ）は時点n₀を原点
として残差波形の時間軸を反転したものとなる。
また、残差波形の電力スペクトルが完全に白色
（すべての周波数成分の振幅が一定）であるとす
ると、インパルス応答ｈ（ｍ）のフーリエ変換は
次式で表わされる。（ただし、ゲインは正規化）ここで、Ｅ（ｋ）は残差波形ｅ（ｎ）のフーリエ
変換を表わす。したがつて、位相等化された残差
波形ep（ｎ）のフーリエ変換Ep（ｋ）は(3)式より
Ep（ｋ）＝Ｈ（ｋ）・Ｅ（ｋ）であり、またＥ（ｋ）＝
｜Ｅ（Ｋ）｜_e ^aargE（ｋ）であるから、これに(9)式
を代入して次式が得られる。 (10)式より位相等化された残差波形ep（ｎ）は直
線位相成分

The present invention relates to a speech signal processing method in which a speech waveform is passed through a filter that removes its correlation to obtain a prediction residual waveform, and that prediction residual waveform is then used.

<Prior Art>

Conventionally, speech coding falls into two classes: waveform coding and analysis-synthesis systems (vocoders). In linear predictive coding (LPC), which belongs to the latter class, as shown in FIG. 1, an all-pole filter (prediction filter) representing the spectral envelope of the speech is obtained by linear predictive analysis of the speech waveform from the input terminal 11. The speech waveform is then passed through an all-zero filter (inverse filter) 12 with the inverse characteristic to obtain the prediction residual waveform. The parameter extraction unit 13 extracts the parameters that characterize this residual waveform, namely the presence or absence of periodicity (voiced/unvoiced decision), the pitch period, and the average power, and these are sent out together with the prediction filter coefficients. On the synthesis side, instead of the prediction residual waveform, the sound source generator 14 uses a periodic pulse train for voiced speech and a noise waveform for unvoiced speech to drive the prediction filter 15.
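The analysis side of the LPC vocoder described above (linear predictive analysis followed by inverse filtering) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the autocorrelation method with the Levinson-Durbin recursion is one common way to obtain the prediction coefficients and is assumed here, the sign convention (residual = sum of a[k] * x[n-k] with a[0] = 1) is chosen to match the residual definition used later in the description, and the toy input generated by `synth_ar2` is hypothetical.

```python
import random

def autocorr(x, max_lag):
    """r[k] = sum_n x[n] * x[n+k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion (assumed analysis method).
    Returns a[0..order] with a[0] = 1 so that e(n) = sum_k a[k] * x[n-k]."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient of stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a

def inverse_filter(x, a):
    """All-zero inverse filter: residual e[n] = sum_k a[k] * x[n-k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def synth_ar2(n, seed=0):
    """Hypothetical test signal: white noise through a fixed all-pole filter."""
    rng = random.Random(seed)
    x = []
    for i in range(n):
        w = rng.gauss(0.0, 1.0)
        x1 = x[i - 1] if i >= 1 else 0.0
        x2 = x[i - 2] if i >= 2 else 0.0
        x.append(1.3 * x1 - 0.6 * x2 + w)
    return x

x = synth_ar2(2000)
a = lpc(x, 2)              # close to [1, -1.3, 0.6] for this test signal
e = inverse_filter(x, a)   # residual with the spectral envelope removed
```

The residual `e` carries much less energy than `x` because the inverse filter removes the correlation between samples; what remains is the excitation that the vocoder replaces with pulses or noise.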
The coefficients of the prediction filter 15 are set to the transmitted filter coefficients, and the generated speech waveform is output to the output terminal 16.

On the other hand, in adaptive predictive coding (APC), which belongs to the former class of waveform coding, as shown in FIG. 2, the prediction residual waveform is obtained by the same means as in the LPC vocoder, after which the sample values of this residual waveform are quantized (coded) directly by the quantization unit 17 and sent out together with the prediction filter coefficients. On the synthesis side, the residual waveform decoded by the decoding unit 18 drives the prediction filter 15 to generate the speech waveform.

The difference between these two conventional methods lies in how the prediction residual waveform is coded. The LPC vocoder of FIG. 1 needs to transmit only the characteristic parameters of the residual waveform, so its bit rate can be made far lower than that of the APC method of FIG. 2, which transmits a quantized value for every sample of the residual waveform. However, the method of FIG. 1 cannot avoid the quality degradation caused by replacing the residual waveform with a pulse train or noise: even if the bit rate is raised, the quality saturates at about 6 kb/s and natural speech quality cannot be provided, and erroneous extraction of the characteristic parameters of the residual waveform causes further degradation. With the APC method of FIG. 2, raising the number of quantization bits for the residual waveform yields speech quality arbitrarily close to the original, but when the bit rate falls to 16 kb/s or below, quantization distortion increases and speech quality deteriorates sharply.

Furthermore, when the pitch of a speech signal was changed or speech signals were spliced together, the operation could be carried out at a point where energy is concentrated, in which case the result sounded unnatural.

<Structure of the Invention>

An object of the present invention is to provide a speech signal processing method that makes it possible to obtain natural-sounding results when, for example, speech signals are spliced together.

Another object of the invention is to provide a speech signal processing method that can maintain relatively good speech quality even at bit rates of 16 kb/s or below.

According to the invention, a speech waveform is subjected to linear predictive analysis to obtain a prediction residual waveform; the coefficients of a linear filter (phase equalization filter) having the inverse of the short-time (for example, about one pitch period or less) phase characteristic of this residual waveform are determined adaptively from the residual waveform; and the speech waveform or the prediction residual waveform is passed through the phase equalization filter so that the prediction residual waveform is made zero-phase, that is, phase-equalized. The energy of the phase-equalized prediction residual waveform is concentrated in an impulse-like fashion, so a natural-sounding speech waveform is obtained by, for example, splicing speech waveforms at portions where the energy is not concentrated. Also, when the phase-equalized speech waveform or prediction residual waveform is coded, allocating more information to the portions where the energy is concentrated, for example, allows efficient coding and gives fairly good speech quality even at 16 kb/s or below.

<Principle of the Invention>

First, the principle of the speech signal processing method according to the invention is described. Let S(n) be the sample values of the speech waveform and a(k) (k = 1, 2, ..., P) the prediction coefficients obtained by linear predictive analysis of the speech waveform. The sample values e(n) of the prediction residual waveform are then given by

$$e(n) = \sum_{k=0}^{P} a(k)\, S(n-k) \tag{1}$$

where a(0) = 1. The residual waveform e(n) is the speech waveform with its spectral envelope component removed, that is, with the correlation between sample values removed; it has a flat spectral envelope and, for voiced sounds, contains the pitch-period component of the speech. These characteristics of the residual waveform are idealized as the following pulse train:

$$e_M(n) = \sum_{l=0}^{L-1} \delta(n - n_l) \tag{2}$$

Here δ(n) is the Kronecker delta, with δ(0) = 1 and δ(n) = 0 (n ≠ 0). n_l denotes a pulse position, and n_l − n_{l−1} corresponds to the pitch period of the speech. That is, the pulse train e_M(n) has pulses only at the pitch positions n_l and is zero elsewhere. Because the residual waveform e(n) and the pulse train e_M(n) both have a flat spectral envelope and a pitch-period component, the difference between the two waveforms lies mainly in their short-time phase characteristics, that is, the phase characteristics over times of about one pitch period or less. Let h(m) be the impulse response of a linear filter having the inverse of the short-time phase characteristic of the residual waveform. Passing e(n) through this filter (phase equalization filter) gives the phase-equalized (zero-phase) residual waveform e_p(n), in which every spectral component has the same phase:

$$e_p(n) = \sum_{m=0}^{M} h(m)\, e(n-m) \tag{3}$$

The impulse response h(m) is obtained by minimizing the mean squared error between e_p(n) and e_M(n):

$$J = \frac{1}{N} \sum_{n=0}^{N-1} \{ e_p(n) - e_M(n) \}^2 \tag{4}$$

Substituting equations (2) and (3) into equation (4), differentiating partially with respect to h(m), and setting the result to zero gives h(m) as the solution of the simultaneous equations

$$\sum_{k=0}^{M} v(|m-k|)\, h(k) = \sum_{l=0}^{L-1} e(n_l - m), \quad m = 0, 1, \ldots, M \tag{5}$$

where v(k) is the autocorrelation function of the residual waveform:

$$v(k) = \sum_{n=0}^{N-k-1} e(n)\, e(n+k), \quad k = 0, 1, \ldots, M \tag{6}$$

When the time corresponding to the number of taps M + 1 of the phase equalization filter, that is, its response time, is shorter than the pitch period, the autocorrelation function can be approximated as v(k) ≈ v_0 δ(k), because the residual waveform has a flat spectrum. It then has a value only at k = 0, so the left side of equation (5) has a value only for k = m, and the equation simplifies to

$$h(m) = \frac{1}{v_0} \sum_{l=0}^{L-1} e(n_l - m) \tag{7}$$

Furthermore, when the analysis window length N is shorter than the pitch period, L = 1 (there is a single pulse) and the impulse response is

$$h(m) = \frac{1}{v_0}\, e(n_0 - m) \tag{8}$$

That is, the impulse response h(m) is the residual waveform reversed in time about the time point n_0.
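To make the single-pulse case of equation (8) concrete, the sketch below builds a toy residual whose energy is smeared over a few samples, forms the filter taps by time-reversing the residual about n0, and applies the convolution of equation (3). It is an illustration under stated assumptions: the toy pulse (a truncated all-pass response), the values of M, n0 and N, and the helper names are all hypothetical, and the delay compensation that the description introduces later is not modeled, so the output peak falls exactly at n0.

```python
def phase_equalizer_taps(e, n0, M):
    """Eq. (8): h(m) = e(n0 - m) / v0 for m = 0..M (single pitch pulse, L = 1)."""
    v0 = sum(v * v for v in e)                      # v(0) of Eq. (6)
    return [e[n0 - m] / v0 if 0 <= n0 - m < len(e) else 0.0
            for m in range(M + 1)]

def fir(h, x):
    """Eq. (3): y(n) = sum_m h(m) * x(n - m), with zero initial state."""
    return [sum(h[m] * x[n - m] for m in range(len(h)) if n - m >= 0)
            for n in range(len(x))]

def peak_ratio(x):
    """Fraction of the total energy carried by the largest sample."""
    energy = sum(v * v for v in x)
    return max(v * v for v in x) / energy

# Hypothetical residual: a flat-spectrum but phase-dispersed pulse ending at n0.
M, n0, N = 8, 20, 40
c = 0.5
pulse = [c] + [(1 - c * c) * (-c) ** (k - 1) for k in range(1, M + 1)]
e = [0.0] * N
for j, v in enumerate(pulse):
    e[n0 - M + j] = v

h = phase_equalizer_taps(e, n0, M)
ep = fir(h, e)   # phase-equalized residual: energy gathers at n0
```

Because the taps are a time-reversed copy of the residual, the filter output is the normalized autocorrelation of the pulse, which for a flat-spectrum pulse is close to a single impulse. `peak_ratio(ep)` is therefore larger than `peak_ratio(e)`.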
Further, assuming that the power spectrum of the residual waveform is completely white (all frequency components have the same amplitude), the Fourier transform of the impulse response h(m) is given, with the gain normalized, by equation (9): [Formula]. Here E(k) denotes the Fourier transform of the residual waveform e(n). From equation (3), the Fourier transform E_p(k) of the phase-equalized residual waveform e_p(n) is E_p(k) = H(k)·E(k), and since $E(k) = |E(k)|\, e^{j \arg E(k)}$, substituting equation (9) gives equation (10): [Formula]. From equation (10), the only phase that remains in the phase-equalized residual waveform e_p(n) is a linear-phase component.

【式】を除いて残差波形ｅ（ｎ）を零位相化（スペクトラムをすべて同位相化）し
たものとなる。理想的に｜Ｅ（ｋ）｜＝E₀（一定）
ならばep（ｎ）は完全に無位相となり単一パルス
波形となる。要するに前述のようなフイルタ係数
ｈ（ｍ）をもつ位相等化フイルタ残差波形ｅ（ｎ）
を通すと、ピツチ位置に主としてエネルギーが集
中した、つまり単一パルス化に近い波形となる。＜第１実施例＞次に、この発明の音声信号処理方式の具体的実
施例を第３図に沿つて説明する。入力端子１１か
らは、標本化された音声波形のサンプル値Ｓ（ｎ）
が入力され、線形予測分析部２１および逆フイル
タ部２２に供給される。線形予測分析部２１では
音声波形Ｓ（ｎ）から線形予測分析を用いて、(1)
式における予測係数ａ（ｋ）を算出する。逆フイ
ルタ部２２では、音声波形Ｓ（ｎ）を入力として
(1)式に示すようなフイルタリング演算を行い、予
測残差波形ｅ（ｎ）を出力する。予測残差波形ｅ
（ｎ）はフイルタ係数決定部２３中の有声無声判
定部２４、ピツチ位置検出部２５およびフイルタ
係数数算出部２６に供給される。有声・無声判定
部２４では、残差波形ｅ（ｎ）の自己相関関数を
一定の遅延サンプル点数で求め、その最大ピーク
値が一定のしきい値以上なら有声、それ以下なら
無声として有声・無声の判定を行なう。この判定
結果Ｖ／UVは、以降の位相等化フイルタ係数を
求める処理モードを制御するのに用いられる。位
相等化フイルタは残差波形の位相の時間的変化に
適応化するため、有声部ではピツチ周期ごとに適
応化する。いま、時点ｎがｌ−１番目のピツチ位
置n_l-1にあるとして、その時点における位相等化
フイルタ係数をh^*（m.n_l-1）（ｍ＝０，１……Ｍ）
として表わす。ピツチ位置検出部２５ではピツチ
位置n_l-1およびフイルタ係数^*（ｍ，n_l-1）を用い
て次のピツチ位置nl検出する。第４図は、ピツチ位置検出部２５の内部構成を
示す。入力端子２７からは逆フイルタ部２３より
の残差波形ｅ（ｎ）が入力され、入力端子２８か
らは有声無声判定部２４よりの有声・無声判定結
果Ｖ／UVが入力される。処理モードスイツチ２
９では有声無声判定入力Ｖ／UVに応じて処理モ
ードをスイツチする。有声Ｖの場合は残差波形ｅ
（ｎ）は位相等化フイルタ部３１に入力され、入
力端子３２から入力されるフイルタ係数h^*（ｍ，
n_l-1）との間のたたみ込み演算（(3)式と同様な演
算）が行なわれ、位相等化された残差波形e_p（ｎ）
が出力される。相対振幅算出部３３では、位相等
化された残差波形e_p（ｎ）の時点ｎでの相対振幅
を次式で算出する。振幅比較部３４では相対振幅m_eｐ（ｎ）をあら
かじめ定められたしきい値m_thと比較し、 m_eｐ（ｎ）＞m_th （ｎ＞n_l-1）
……(12) を満たす場合、時点ｎをピツチ位置nlとして出力
端子３５に出力する。次に、ピツチ位置nlは第３図中のフイルタ係数
算出部２６に入力され、ピツチ位置nlにおける位
相等化フイルタ係数h^*（ｍ，nl）がフイルタ係数
算出部２６において次式により算出され、フイル
タ係数補間部３７および第４図中の位相等化フイ
ルタ部３１へ供給される。ただし、(13)式は(8)式とくらべてフイルタのゲ
インを正規化するとともに、直線位相成分（(10)式
中の

Apart from this linear-phase component [Formula], e_p(n) is the residual waveform e(n) made zero-phase (all spectral components brought to the same phase). Ideally, if |E(k)| = E_0 (constant), e_p(n) has no phase at all and becomes a single-pulse waveform. In short, passing the residual waveform e(n) through a phase equalization filter with the filter coefficients h(m) described above yields a waveform whose energy is concentrated mainly at the pitch positions, that is, a waveform close to a single pulse.

<First Embodiment>

Next, a specific embodiment of the speech signal processing method of the invention is described with reference to FIG. 3. Sample values S(n) of the sampled speech waveform are input from the input terminal 11 and supplied to the linear prediction analysis unit 21 and the inverse filter unit 22. The linear prediction analysis unit 21 calculates the prediction coefficients a(k) of equation (1) from the speech waveform S(n) by linear predictive analysis. The inverse filter unit 22 takes S(n) as input, performs the filtering operation of equation (1), and outputs the prediction residual waveform e(n). The residual waveform e(n) is supplied to the voiced/unvoiced decision unit 24, the pitch position detection unit 25, and the filter coefficient calculation unit 26 in the filter coefficient determination unit 23. The voiced/unvoiced decision unit 24 computes the autocorrelation function of e(n) over a fixed number of delay samples and decides voiced if its maximum peak value is at or above a fixed threshold and unvoiced otherwise. The decision result V/UV controls the processing mode used subsequently to obtain the phase equalization filter coefficients. To adapt to temporal changes in the phase of the residual waveform, the phase equalization filter is adapted every pitch period in voiced segments. Suppose that time n is at the (l−1)-th pitch position n_{l−1}, and write the phase equalization filter coefficients at that time as h*(m, n_{l−1}) (m = 0, 1, ..., M). The pitch position detection unit 25 detects the next pitch position n_l using the pitch position n_{l−1} and the filter coefficients h*(m, n_{l−1}).

FIG. 4 shows the internal configuration of the pitch position detection unit 25. The residual waveform e(n) from the inverse filter unit 22 is input at the input terminal 27, and the voiced/unvoiced decision result V/UV from the voiced/unvoiced decision unit 24 is input at the input terminal 28. The processing mode switch 29 switches the processing mode according to the V/UV input. In the voiced case V, the residual waveform e(n) is input to the phase equalization filter unit 31, where it is convolved (the same operation as equation (3)) with the filter coefficients h*(m, n_{l−1}) input from the input terminal 32, and the phase-equalized residual waveform e_p(n) is output. The relative amplitude calculation unit 33 calculates the relative amplitude m_{e_p}(n) of e_p(n) at time n by equation (11): [Formula]. The amplitude comparison unit 34 compares the relative amplitude m_{e_p}(n) with a predetermined threshold m_th, and if

$$m_{e_p}(n) > m_{th} \quad (n > n_{l-1}) \tag{12}$$

is satisfied, outputs time n to the output terminal 35 as the pitch position n_l.

Next, the pitch position n_l is input to the filter coefficient calculation unit 26 in FIG. 3, which calculates the phase equalization filter coefficients h*(m, n_l) at the pitch position n_l by equation (13): [Formula], and supplies them to the filter coefficient interpolation unit 37 and to the phase equalization filter unit 31 in FIG. 4. Compared with equation (8), equation (13) normalizes the filter gain and also compensates for the delay of the linear-phase component [Formula] in equation (10).
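The voiced/unvoiced decision described for unit 24 (find the largest autocorrelation peak of the residual over a range of lags and compare it with a threshold) can be sketched as follows. This is only an illustration: normalizing the autocorrelation by its zero-lag value, the lag range, and the threshold of 0.5 are assumptions made here, since the description gives no numerical values, and the function name is hypothetical.

```python
import random

def voiced_unvoiced(e, min_lag, max_lag, threshold):
    """Return ("V" or "UV", lag of the largest autocorrelation peak).
    The autocorrelation is normalized by r(0); this normalization is an assumption."""
    r0 = sum(v * v for v in e)
    if r0 == 0.0:
        return "UV", None
    best, best_lag = -1.0, None
    for k in range(min_lag, max_lag + 1):
        rk = sum(e[i] * e[i + k] for i in range(len(e) - k)) / r0
        if rk > best:
            best, best_lag = rk, k
    return ("V" if best >= threshold else "UV"), best_lag

# Hypothetical inputs: a pulse train with a 50-sample period, and white noise.
pulses = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]

decision_pulses, lag_pulses = voiced_unvoiced(pulses, 20, 120, 0.5)
decision_noise, _ = voiced_unvoiced(noise, 20, 120, 0.5)
```

For the pulse train the largest peak sits at the pitch period, so the lag returned alongside a voiced decision can also serve as a rough pitch-period estimate.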

【式】の遅れを補正したものとなつている。つまり(10)式より明らかなように(8)式により
得られるｈ（ｍ）は実際のものよりＭ／２サンプ
ル分遅れたものとなるので、(13)式を用いる。一方、有声・無声判定結果が無声UVの場合
は、処理モードスイツチ２９により残差波形ｅ
（ｎ）をピツチ位置リセツト部３６に入力してピ
ツチ位置nlを分析窓内の最後のサンプル時点に設
定し、またフイルタ係数算出部２６において、フ
イルタ係数をh^*（ｍ，nl）＝１（ｍ＝０）h^*（ｍ，
nl）＝０（ｍ≠０）に設定する。各時点ｎにおける
フイルタ係数ｈ（ｍ，ｎ）は、フイルタ係数補間
部３７において、たとえば次式で表される一次の
フイルタを用いて平滑化した値として算出され
る。ｈ（ｍ，ｎ）＝αh（ｍ，ｍ−１）＋（１＋α）h^*
（ｍ，nl）（n_l-1＜ｎ≦nl） ……(14) ここで、αはフイルタ係数の変化速度を制御す
る係数でα＜１を満たす定数である。位相等化フイルタ部３８では入力端子１１の入
力音声波形Ｓ（ｎ）と、フイルタ係数補間部３７
のフイルタ係数ｈ（ｍ，ｎ）とを用いて、次式で
示されるたたみ込み演算を行ない、位相等化され
た音声波形S_p（ｎ）を出力端子３９に出力する。 S_p（ｎ）＝_M 〓^m=0 ｈ（ｍ，ｎ）Ｓ（ｎ−ｍ） ……(15) ＜第２実施例＞次に、位相等化された音声波形S_p（ｎ）のデジ
タル符号化について説明する。この符号化の基本
的構成例を第５図に示す。入力端子１１から入力
される音声波形Ｓ（ｎ）に対して、第３図で示さ
れた構成の位相等化処理部４１において位相等化
処理を行ない、位相等化音声波形S_p（ｎ）を出力
する、符号化部４２ではこの位相等化音声波形S_p
（ｎ）をデイジタル符号化し、符号系列を伝送路
４３に送出する。受信側では復号化部４４で位相
等化音声波形S_p（ｎ）を復元して出力端子１６に
出力する。このように、符号化・復号化は音声波
形Ｓ（ｎ）の代りに、位相等化音声波形S_p（ｎ）を
対象として行なう。音声波形Ｓ（ｎ）を位相等化
した音声波形S_p（ｎ）は原音声波形Ｓ（ｎ）と品質
的に変りない、よつてフイルタ係数ｈ（ｍ）は伝
送する必要がなく、位相等化音声S_p（ｎ）を再生
すればよい。特に残差波形ｅ（ｎ）を位相等化し
た残差波形e_p（ｎ）はエネルギーが集中するため、
その部分により多くの情報を与えるように適応的
に符号化することにより少ないビツト数で高品質
の伝送が可能となる。符号化部４２での符号化の
方法としては、種々の方法が適用できる。ここで
は、位相等化音声波形に適した符号化法として３
つの実施例を示す。可変レート木符号化を用いる方法可変レート木符号化法は音声波形を線形予測分
析して得られる予測残差波形の時間方向での振幅
変化に応じて、情報量を適応的に制御することを
特徴とした符号化方式である。第６図に、この発
明による位相等化処理と可変レート木符号化を組
み合せた符号化方式の実施例を示す。入力端子１
１から入力される音声波形Ｓ（ｎ）に対し、線形
予測分析部２１で線形予測分析を行なつて予測係
数akを算出し、逆フイルタ部２２で音声波形Ｓ
（ｎ）の予測残差波形ｅ（ｎ）を求める。フイルタ
係数決定部２３では第３図について述べたように
して残差波形ｅ（ｎ）の短時間位相を等化する位
相等化フイルタの係数ｈ（ｍ，ｎ）を算出し、こ
れを位相等化フイルタ部３８のフイルタ係数とし
て設定する。位相等化フイルタ部３８で入力音声
波形Ｓ（ｎ）を位相等化処理し、その位相等化音
声波形S_p（ｎ）を端子３９へ出力する。一方、残差波形ｅ（ｎ）は位相等化フイルタ部
４５で位相等化した後、部分区間設定部４６にお
いて残差波形振幅の偏りに応じて時間軸を分割す
る部分区間を設定し、電力算出部４７ではその設
定された各部分区間での残差波形の電力を算出す
る。部分区間の設定法としては、例えば第７図に
示すように分析窓内でのピツチ位置（nl）の区間
（ただし１サンプル点のみ）およびピツチ周期T_p
を等分割する各区間として設定する。部分区間で
の残差電力u_iは次式で算出される。 u_i＝１／N_Ti 〓ｎ∈Tie² _p（ｎ） ……(16) ここで、T_iはサンプル時点ｎが属する部分区間
を表わし、N_Tiは部分区間に含まれるサンプル点
の数である。ビツト割当て部４８では、各部分区
間の残差電力u_iから各サンプル時点に割り当てら
れる情報ビツト数Ｒ（ｎ）を次式で算出する。ここで、は残差波形e_p（ｎ）に対する平均ビ
ツトレート、N_sは部分区間数、w_iは部分区間の
時間長比率であり、 w_i＝N_Ti／_NS 〓^j=1 N_Tj で与えられる。また、量子化ステツプ幅△（ｎ）
はステツプ幅算出部４９で残差電力u_iから次式で
算出される。 △（ｎ）＝Ｑ（Ｒ（ｎ））√_i ｎ〓T_i (18) ここで、Ｑ（Ｒ（ｎ））はＲ（ｎ）ビツトのガウス
性量子化器のステツプ幅である。ビツト割当て部
４８とステツプ幅算出部４９で算出されたビツト
数Ｒ（ｎ）とステツプ幅△（ｎ）は木符号生成部
５１を制御する。木符号生成部５１は第８図に示
すように、可変レートの木構造をもち、符号系列
Ｃ（ｎ）＝｛Ｃ（ｎ−Ｌ），……，Ｃ（ｎ−１），Ｃ
（ｎ）｝によつて定まるパス経路に沿つて、各技に
対応づけられたサンプル値ｑ（ｎ）を出力する。
各ノードから出る技の数は2^R(n)として与えられ
る。また、各技に対応づけられるサンプル値ｆ
（ｌ，ｎ）は△（ｎ）とＲ（ｎ）から次式で与えら
れる。ｆ（ｌ，ｎ）＝Sgn（ｌ）｜ｌ｜＋0.5／２△（ｎ），ｌ＝±１，±２……，±2^R(n)-1 (19) ここで、Sgn（ｌ）は、ｌの正負のサイン符号
を表わす。また、ｑ（ｎ）はパス上の技をl^*とし
て、ｑ（ｎ）＝５（l^*，ｎ）として与えられる。木
符号生成部５１から出力されるサンプル値ｐ（ｎ）
は予測フイルタ部５２へ入力され、全極形のフイ
ルタを用いて局部復号化値S^_p（ｎ）を次式で算出
する。 S^_p（ｎ）＝_p 〓^k=1 ａ（ｋ）S^_p（ｎ−ｋ）＋ｑ（ｎ） ……(20) ここで、ａ（ｋ）は予測係数であり、線形予測
分析部２１からの出力によつて制御される。局部
復号化値S^_p（ｎ）と位相等化された音声波形S_p
（ｎ）は減算器５３において両値間の差がとられ、
符号系列最適化部５４へ入力される。符号系列最
適化部５４では、S^_p（ｎ）とS_p（ｎ）間の平均二乗
誤差を最小にするように、木符号のパスすなわち
符号系列Ｃ（ｎ）＝｛Ｃ（ｎ−Ｌ），……Ｃ（ｎ−１）
，
Ｃ（ｎ）｝を探索する。最適パスの探索手法として
は、例えばMLアルゴリズムを用いる。MLアル
ゴリズムでは第８図に示すような木符号におい
て、符号系列の候補をC_n（ｎ）＝｛C_n（ｎ−Ｌ），…
…，C_n（ｎ−１），C_n（ｎ）｝（ｍ＝１，２，……
M′）として、各ノードにおける誤差の評価値ｄ
（ｍ，ｎ）を、符号系列候補C_n（ｎ）に対して与
えられるサンプル値S_p（ｎ）と入力サンプル値S_p
（ｎ）の時系列間の二乗誤差として次式で算出す
る。ｄ（ｍ，ｎ）＝_M 〓^t=n-L ｛S_p（ｔ）−S^_p（ｔ）｝² 次にM′個の符号系列候補の中から評価値ｄ
（ｍ，ｎ）が最小となる符号系列C_n（ｎ）を選択
し、そのパスにおける時点ｎ−Ｌでの符号C_n（ｎ
−Ｌ）を最適な符号として決定する。ｎ＋１時点
での符号系列の候補C_n（ｎ＋１）＝｛C_n（ｎ＋１−
Ｌ），……C_n（ｎ），C_n（ｎ＋１）｝は、ｄ（ｍ，ｎ）
の値の小さい順にＭ個の符号系列C_n（ｎ）を選択
した後、各符号系列にｎ＋１時点でとり得る全て
の符号Ｃ（ｎ＋１）を追加した系列として与えら
れる。以上の処理は各時点ごとに逐次行なわれ、
時点ｎにおいて、時点ｎ−Ｌでの最適符号Ｃ（ｎ
−Ｌ）が出力される。なお第８図中の符号＊は
null符号を示す。この実施例における符号化方式では、残差波形
の符号Ｃ（ｎ）とともに補助情報として線形予測
分析部２１から出力される予測係数ak、部分区
間設定部４６から出力される部分区間の周期T_p
と位置T_d、電力算出部４７から出力される部分
区間残差電力u_iを多重化部５５で多重化した後伝
送路４３へ送出する。受信側では多重分離部５６で各情報を分離した
後、残差波形生成部５７において符号系列に応じ
て残差波形の復号化値ｑ（ｎ）を算出し、その復
号化値ｑ（ｎ）を駆動音源情報として予測フイル
タ１５を駆動して音声波形S_p（ｎ）を復元して出
力端子１６に出力する。残差波形ｅ（ｎ）を位相等化することによりパ
ルス化、つまりエネルギを果し、その部分につい
てはビツト数を多く割当て、また木符号の枚数を
多くすることにより、小さなビツトレートで効率
的に情報を伝送することができる。マルチパルス符号化を用いる方法マルチパルスの基本原理はAtalによつて1982
年の音響・音声信号国際会議（Proceedinhg
ICASSPpp.614−617）において提案された。こ
の手法は、音声の予測残差波形を複数個のパルス
列で表わし、各パルスの時間的位置と強さを、こ
のマルチパルス残差波形で合成した音声波形と入
力音声波形との誤差を最小にするように決定する
方式である。この方式では音声波形そのものを直
接符号化しているが、この発明の実施例では位相
等価した後の音声波形を入力としてマルチパルス
符号化を行なう。第９図に位相等価処理とこのマ
ルチパルス符号化を融合した符号化方式の実施例
を示す。入力端子１１から入力される音声波形のサンプ
ル値Ｓ（ｎ）に対して、線形予測分析部２１で予
測係数を算出し、予測逆フイルタ部２２で音声波
形Ｓ（ｎ）の予測残差波形を求める。次に、フイ
ルタ係数決定部２３では残差波形ｅ（ｎ）からサ
ンプル点ごとの位相等化フイルタの係数ｎ（ｍ，
ｎ）およびピツチ位置nlを出力する。位相等化フ
イルタ部３８のフイルタ係数はｈ（ｍ，ｎ）に設
定され、位相等化フイルタ部３８に音声波形S_N
を位相等化し、その出力は減算器５３でマルチパ
ルス符号化値S^_p（ｎ）との差をとり、その差出力
はパルス時点算出部５８とパルス振幅算出部５９
へ入力される。符号化値S^_p（ｎ）は、マルチパル
ス生成部６１から出力されるマルチパルス信号e^e
（ｎ）を予測フイルタ６２に通すことにより次式
で算出される。 S^_p（ｎ）＝−_p 〓^k=1 akS^_p（ｎ−ｋ）＋ｅ（ｎ）ここで、e^e（ｎ）はパルス時点をt_i。パルス振
幅をm_iとして次式で表わされる。 e^e（ｎ）＝δ（ｔ−t_i）パルス時点t_iとパルス振幅m_iは、それぞれパル
ス時点算出部５８とパルス振幅算出部５９におい
て、波形値S_p（ｎ）とS^_p（ｎ）との差の平均電力
P_eを最小とするように決定されている。前述の
論文で示されたアルゴリスムでは、ｌ−１個分の
t_iとm_iが与えられる場合、ｌ番目のパルス位置tl
は可能な全ての時点（但し、t_l≠t_i（ｉ＝１……ｌ
−１））に対して平均電力P_eが最小となるパルス
振幅m_iを最小二乗法によつて求め、その中でP_e
が最小となる時点として決定される。この手順
は、ｌ＝１より始めてｌ＝ｑまで逐次行ない、全
てのパルス時点と振幅が決定される。このアルゴ
リズムは、ピツチ時点の算出に多大な処理を必要
とする。しかしこの実施例では処理量を低減する
ため、位相等化処理の中で求まるピツチ位置n_i
（ｉ＝１，２……q′）を利用し、始めのq′個分の
パルス時点をt_i＝n_i（ｉ＝１，２……q′）として決
定する。予測係数ak、ピツチ時点（位置）t_iおよびピツ
チ振幅m_iを多重化部５５で多重化して送出し、
受信側ではこれけらを多重分離器５６で分離した
後、マルチパルス生成部６３でマルチパルス信号
を生成し、これを予測フイルタ１５に通して符号
化信号出力を端子に得る。パルス化残差波形を用いる音声分析合成系この実施例では、前述した位相等化処理によつ
て位相等化された予測残差波形のサンプル値時系
列において、ピツチ位置でのサンプル値を残し、
それ以外のサンプル値を零にすることにより、予
測残差波形をパルス化し、このパルス列を駆動音
源として予測フイルタを駆動することにより合成
音声を生成する。すなわち第１０図に示す。入力
端子１１から入力される音声波形のサンプル値Ｓ
（ｎ）に対し、線形予測分析部２１で予測係数ak
を算出後、予測逆フイルタ２２によつて音声波形
Ｓの予測残差波形ｅ（ｎ）を求める。次に残差波
形ｅ（ｎ）からフイルタ係数決定部２３において
位相等化フイルタ係数ｈ（ｍ，ｎ）、有声・無声判
定値Ｖ／UVおよびピツチ位置n_lを算出する。残
差波形ｅ（ｎ）は、位相等化フイルタ部６４で位
相等化された後、パルス化処理部６５において、
ピツチ位置n_lでの位相等化残差波形e_p（ｎ）のサ
ンプル値をm_l＝e_p（n_l）（ｌ＝１，２……Ｌ）とす
る。ここではＬは分析窓内でのピツチ位置の数で
ある。サンプル値m_lは、量子化ステツプ幅算出
部６６から与えられる量子化ステツプ幅△を用い
て量子化器６７で量子化される。多重化部５５は
量子化出力Ｃ（ｎ）、ピツチ位置n_l、予測係数ak、
有声無声判定値Ｖ／UVおよび残差電力ｖを多重
化して送出する。多重分離部５６で多重分離し、
有声部６８では量子化出力Ｃ（ｎ）を逆量子化し、
これとピツチ位置とn_lからパルス列e^_p（ｎ）＝_L 〓^l=1 m_l
〓（ｎ−n_l）を作る。無声部６９では電力がｖに
等しくなる白色雑音を駆動音源とする。有声・無
声判定値Ｖ／UVに応じてスイツチ７１を制御し
て有声Ｖで有声部６８の出力を、無声UVで無声
部６９の出力を予測フイルタ１５へ駆動音源情報
として供給し、合成音声S^_p（ｎ）を出力端子１６
に出力する。＜効果＞以上述べたように、この発明による音声信号処
理方式は予測残差波形の短時間位相特性を、その
時間的変化に応じて適応的に位相等化することに
より、残差波形振幅の時間的集中度を高める効果
を有し、それによつて音声波形のピツチ周期、ピ
ツチ位置を検出することができ、また例えばエネ
ルギーが集中していない部分を除去して時間を短
縮し、又はゼロを挿入して時間を長くして音声波
形のピツチを変更しても自然性が保持でき、更に
符号化の効率を大幅に向上させる利点をもつ。位相等化処理のみを施こした場合の音声品質
は、７，６ビツト対数圧伸PCMと同等であり、
この処理による波形歪はほとんど知覚されない。
したがつて、位相等化された音声波形を符号化入
力としても、入力段階での品質劣下は生じていな
い。また位相等化された音声波形を正しく再生で
きれば、この位相等化された音声波形を駆動音源
信号としても高い品質の音声が得られる。前記実施例で示した符号化法はいづれも、音声
の予測残差波形の振幅の時間的集中度が高められ
ることにより符号化効率が向上する。可変レート
木符号化では、波形振幅の偏りに応じて時間的に
情報を割り当てており、位相等化によつてその偏
りを高めることにより情報割当ての効果が大きく
なり、符号化効率が向上する。１ビツト１サンプ
ル（約10kb／ｓ）で符号化した時、符号化音声
のSN比は19.0dBであり、位相等化処理を含めな
い場合にくらべて4.4dB向上する。また品質的に
は5.5ビツトPCM相当の品質が6.6ビツトPCM相
当の品質に向上する。７ビツトPCMが品質的に
問題ないことより、この例では16kb／ｓ以下の
ビツトレートとしても比較的高い品質が得られ
る。マルチパルス符号化では、位相等化処理によつ
て残差波形がパルス化されるため、よりマルチパ
ルス表現が適合し、従来の入力音声そのものを用
いる場合とくらべて少ないパルス数で残差波形が
表現できる。また、マルチパルス符号化における
パルス位置の多くは、この位相等化処理における
ピツチ位置と一致するため、このピツチ位置の情
報を利用することによりマルチパルス符号化での
パルス位置の決定処理を簡単化することができ
る。パルス数を20（１ビツト１サンプル符号化に
相当し、約10kb／ｓ）とした時のマルチパルス
符号化の性能は、直接音声入力の場合SN比で
11.3dB、位相等化音声の場合は15.0dBであり、
位相等化処理によりSN比は3.7dB向上する。ま
た、品質的には4.5ビツトPCM相当が位相等化処
理により、６ビツトPCM相当に改善される。従
来はビツトレートが16kb／ｓ以下になると音声
品質が急激に劣化するが、このマルチパルス符号
化を適用する場合もビツトレートが10kb／ｓで
も可成り良好な音声品質が得られる。なお、位相等可フイルタ部３８のフイルタ係数
としてh^*（ｍ，n_l）を用い、フイルタ係数の補間
部３７を省略してもよい。また上述における各部
はそれぞれ独立したハードウエア、あるいはマイ
クロプロセツサにより構成してもよく、または複
数の部分を１つのマイクロプロセツサや電子計算
機で兼用しもよい。

As equation (10) shows, the h(m) obtained from equation (8) is delayed by M/2 samples relative to the actual response, which is why equation (13) is used.

When the voiced/unvoiced decision is unvoiced (UV), on the other hand, the processing mode switch 29 routes the residual waveform e(n) to the pitch position reset unit 36, which sets the pitch position n_l to the last sample time in the analysis window, and the filter coefficient calculation unit 26 sets the filter coefficients to h*(m, n_l) = 1 (m = 0) and h*(m, n_l) = 0 (m ≠ 0). The filter coefficients h(m, n) at each time n are calculated in the filter coefficient interpolation unit 37 as smoothed values, for example using the first-order filter

$$h(m, n) = \alpha\, h(m, n-1) + (1 - \alpha)\, h^*(m, n_l) \quad (n_{l-1} < n \le n_l) \tag{14}$$

where α is a constant satisfying α < 1 that controls the rate of change of the filter coefficients.

The phase equalization filter unit 38 uses the input speech waveform S(n) from the input terminal 11 and the filter coefficients h(m, n) from the filter coefficient interpolation unit 37 to perform the convolution

$$S_p(n) = \sum_{m=0}^{M} h(m, n)\, S(n-m) \tag{15}$$

and outputs the phase-equalized speech waveform S_p(n) to the output terminal 39.

<Second Embodiment>

Next, digital coding of the phase-equalized speech waveform S_p(n) is described. FIG. 5 shows an example of the basic configuration. The speech waveform S(n) input from the input terminal 11 is phase-equalized by the phase equalization processing unit 41, which has the configuration shown in FIG. 3, and the phase-equalized speech waveform S_p(n) is output. The coding unit 42 digitally codes S_p(n) and sends the code sequence to the transmission path 43. On the receiving side, the decoding unit 44 restores the phase-equalized speech waveform S_p(n) and outputs it to the output terminal 16. Coding and decoding thus operate on the phase-equalized speech waveform S_p(n) instead of the speech waveform S(n). Because S_p(n) does not differ in quality from the original speech waveform S(n), the filter coefficients h(m) need not be transmitted, and it suffices to reproduce the phase-equalized speech S_p(n). In particular, the energy of the residual waveform e_p(n) obtained by phase-equalizing e(n) is concentrated, so coding adaptively to give more information to those portions enables high-quality transmission with fewer bits. Various methods can be applied as the coding method in the coding unit 42. Three embodiments of coding methods suited to the phase-equalized speech waveform are shown here.
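The coefficient smoothing of equation (14) and the time-varying convolution of equation (15), described earlier for the filter coefficient interpolation unit 37 and the phase equalization filter unit 38, can be sketched as follows. The sketch assumes that the two smoothing weights are α and 1 − α (so that they sum to one); the three-tap filters, the value α = 0.5, and the constant input are hypothetical values chosen only to make the behavior easy to follow.

```python
def smooth_taps(h_prev, h_target, alpha):
    """One step of Eq. (14), assuming weights alpha and (1 - alpha):
    h(m, n) = alpha * h(m, n-1) + (1 - alpha) * h*(m, n_l)."""
    return [alpha * p + (1.0 - alpha) * t for p, t in zip(h_prev, h_target)]

def time_varying_fir(s, taps_per_sample):
    """Eq. (15): s_p(n) = sum_m h(m, n) * s(n - m), one tap set per sample."""
    out = []
    for n, h in enumerate(taps_per_sample):
        out.append(sum(h[m] * s[n - m] for m in range(len(h)) if n - m >= 0))
    return out

# Hypothetical example: start from the identity filter used for unvoiced
# frames and drift toward a new target h*(m, n_l) for the current pitch period.
h = [1.0, 0.0, 0.0]
target = [0.0, 1.0, 0.0]
alpha = 0.5
s = [1.0] * 24

taps = []
for n in range(len(s)):
    h = smooth_taps(h, target, alpha)
    taps.append(h)

sp = time_varying_fir(s, taps)
```

A larger α makes the coefficients follow each new target more slowly, which is the sense in which α controls the rate of change of the filter coefficients.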
Method using variable-rate tree coding

Variable-rate tree coding is a coding method characterized by adaptively controlling the amount of information according to the amplitude variation, along the time axis, of the prediction residual waveform obtained by linear predictive analysis of the speech waveform. FIG. 6 shows an embodiment of a coding system that combines the phase equalization processing of the invention with variable-rate tree coding. For the speech waveform S(n) input from the input terminal 11, the linear prediction analysis unit 21 performs linear predictive analysis to calculate the prediction coefficients a_k, and the inverse filter unit 22 obtains the prediction residual waveform e(n) of S(n). The filter coefficient determination unit 23 calculates, as described for FIG. 3, the coefficients h(m, n) of the phase equalization filter that equalizes the short-time phase of e(n), and sets them as the filter coefficients of the phase equalization filter unit 38. The phase equalization filter unit 38 phase-equalizes the input speech waveform S(n) and outputs the phase-equalized speech waveform S_p(n) to the terminal 39.

Meanwhile, the residual waveform e(n) is phase-equalized by the phase equalization filter unit 45. The partial interval setting unit 46 then sets partial intervals that divide the time axis according to the unevenness of the residual amplitude, and the power calculation unit 47 calculates the power of the residual waveform in each partial interval. As shown in FIG. 7, for example, the partial intervals are set as the interval at each pitch position n_l within the analysis window (a single sample point) and the intervals that divide the pitch period T_p into equal parts. The residual power u_i in a partial interval is calculated by

$$u_i = \frac{1}{N_{T_i}} \sum_{n \in T_i} e_p^2(n) \tag{16}$$

where T_i denotes the partial interval to which sample time n belongs and N_{T_i} is the number of sample points in that interval. The bit allocation unit 48 calculates the number of information bits R(n) allocated to each sample time from the residual power u_i of each partial interval by equation (17): [Formula]. The quantities in this equation are the average bit rate for the residual waveform e_p(n), the number of partial intervals N_s, and the time-length ratio w_i of each partial interval, given by

$$w_i = \frac{N_{T_i}}{\sum_{j=1}^{N_s} N_{T_j}}$$

The quantization step width Δ(n) is calculated by the step width calculation unit 49 from the residual power u_i by

$$\Delta(n) = Q(R(n)) \sqrt{u_i}, \quad n \in T_i \tag{18}$$

where Q(R(n)) is the step width of an R(n)-bit Gaussian quantizer. The number of bits R(n) and the step width Δ(n) calculated by the bit allocation unit 48 and the step width calculation unit 49 control the tree code generation unit 51. As shown in FIG. 8, the tree code generation unit 51 has a variable-rate tree structure and outputs the sample value q(n) associated with each branch along the path determined by the code sequence C(n) = {C(n−L), ..., C(n−1), C(n)}.
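The partial-interval power of equation (16) and the time-length ratios w_i used by the bit allocation can be sketched as follows. The partition rule follows the FIG. 7 description (the pitch sample forms an interval of its own and the rest of the pitch period is cut into equal parts), but the number of parts, the toy residual, and the function names are hypothetical; the bit allocation formula itself is not reproduced in the text and is therefore not implemented here.

```python
def split_period(start, end, parts):
    """Partial intervals for one pitch period [start, end): the pitch sample
    alone, then the remaining samples cut into `parts` near-equal pieces.
    Intervals are half-open (first, last + 1) index pairs."""
    intervals = [(start, start + 1)]
    rest = end - (start + 1)
    edges = [start + 1 + (rest * i) // parts for i in range(parts + 1)]
    intervals += [(edges[i], edges[i + 1]) for i in range(parts)
                  if edges[i + 1] > edges[i]]
    return intervals

def interval_power(ep, intervals):
    """Eq. (16): u_i = (1 / N_Ti) * sum of ep(n)^2 over n in T_i."""
    return [sum(ep[n] ** 2 for n in range(a, b)) / (b - a) for a, b in intervals]

def time_ratios(intervals):
    """w_i = N_Ti / sum_j N_Tj."""
    total = sum(b - a for a, b in intervals)
    return [(b - a) / total for a, b in intervals]

# Hypothetical phase-equalized residual: one strong pitch pulse, weak remainder.
ep = [1.0] + [0.1] * 40
intervals = split_period(0, len(ep), 4)
u = interval_power(ep, intervals)
w = time_ratios(intervals)
```

In this toy case the one-sample pitch interval has by far the largest power while covering only 1/41 of the time axis, which is the imbalance the variable-rate allocation is meant to exploit.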
The number of branches coming out of each node is given as $2^{R(n)}$. In addition, the sample value f(l, n) associated with each branch is given by the following equation from Δ(n) and R(n).
$$f(l,n) = \mathrm{Sgn}(l)\,\frac{|l| + 0.5}{2}\,\Delta(n), \quad l = \pm 1, \pm 2, \ldots, \pm 2^{R(n)-1} \qquad (19)$$
Here, Sgn(l) represents the positive or negative sign of l. Furthermore, q(n) is given as $q(n) = f(l^*, n)$, where $l^*$ is the branch on the path.
The sample value q(n) output from the tree code generation unit 51 is input to the prediction filter unit 52, and the locally decoded value $\hat{S}_p(n)$ is calculated with an all-pole filter using the following equation.
$$\hat{S}_p(n) = \sum_{k=1}^{p} a(k)\,\hat{S}_p(n-k) + q(n) \qquad (20)$$
Here, a(k) is the prediction coefficient, which is controlled by the output from the linear prediction analysis unit 21. The difference between the locally decoded value $\hat{S}_p(n)$ and the phase-equalized speech waveform $S_p(n)$ is taken in the subtracter 53 and input to the code sequence optimization unit 54.
The code sequence optimization unit 54 searches for the optimal path of the tree code, that is, the code sequence C(n) = {C(n−L), ..., C(n−1), C(n)}. For example, the M-L algorithm is used as the optimal path search method. In the M-L algorithm, for a tree code as shown in FIG. 8 with code sequence candidates $C_m(n) = \{C_m(n-L), \ldots, C_m(n-1), C_m(n)\}$ (m = 1, 2, ..., M′), the error evaluation value d(m, n) at each node is calculated as the squared error between the time series of sample values $\hat{S}_p(n)$ obtained for the code sequence candidate $C_m(n)$ and the time series of input sample values $S_p(n)$, using the following formula.
$$d(m,n) = \sum_{t=n-L}^{n} \{S_p(t) - \hat{S}_p(t)\}^2$$
The evaluation value d(m, n) is then compared among the M′ code sequence candidates.
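As a rough illustration of the search this passage describes, the following Python sketch extends each surviving path with every branch, decodes locally with the all-pole filter of equation (20), scores each candidate with d(m, n) over the last L+1 samples, and keeps the M best. Several points are assumptions rather than statements from the patent: the function names are hypothetical, R(n) is taken to be an integer of at least 1, the branch levels follow equation (19) as it reads in this text, and candidates that disagree with an already released code are dropped so that the output is a single consistent path.

```python
def branch_levels(r_bits, delta):
    """Branch values per equation (19) as printed here: (l, f(l, n)) pairs."""
    half = 2 ** (r_bits - 1)
    out = []
    for l in list(range(-half, 0)) + list(range(1, half + 1)):
        sign = 1 if l > 0 else -1
        out.append((l, sign * (abs(l) + 0.5) / 2 * delta))
    return out

def predict(a, decoded):
    """Sum over k of a(k) * S^(n-k), using the decoded samples so far."""
    return sum(a[k] * decoded[-1 - k] for k in range(min(len(a), len(decoded))))

def local_decode(q, a):
    """Equation (20): all-pole filtering of the branch values q(n)."""
    decoded = []
    for qn in q:
        decoded.append(predict(a, decoded) + qn)
    return decoded

def ml_search(s_p, a, bits, deltas, M, L):
    """Simplified M-L search; returns one branch index l per sample."""
    survivors = [([], [])]          # (branch indices, decoded samples)
    released = []
    for n in range(len(s_p)):
        levels = branch_levels(bits[n], deltas[n])
        extended = []
        for path, dec in survivors:
            pred = predict(a, dec)
            for l, f in levels:
                extended.append((path + [l], dec + [pred + f]))

        def d(cand):
            # Squared error over the window t = n-L .. n.
            lo = max(0, n - L)
            return sum((s_p[t] - cand[1][t]) ** 2 for t in range(lo, n + 1))

        extended.sort(key=d)
        if n >= L:
            # Release the code at time n-L taken from the best candidate and
            # keep only candidates that agree with it (assumed pruning step).
            sym = extended[0][0][n - L]
            released.append(sym)
            extended = [c for c in extended if c[0][n - L] == sym]
        survivors = extended[:M]
    # Flush the codes still held back at the end of the input.
    released.extend(survivors[0][0][len(released):])
    return released
```

A larger M widens the search at a cost that grows with M times the number of branches per node, while L sets the decision delay before a code is released.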
The code sequence $C_m(n)$ for which d(m, n) is the minimum is selected, and its code $C_m(n-L)$ is determined as the optimal code. The candidate code sequences at time n+1, $C_m(n+1) = \{C_m(n+1-L), \ldots, C_m(n), C_m(n+1)\}$, are formed by selecting the M code sequences $C_m(n)$ with the smallest values of d(m, n) and appending every possible code C(n+1) at time n+1 to each of them. This processing is performed sequentially at each time point, and at time n the optimal code C(n−L) is output. Note that the symbol * in FIG. 8 indicates a null code.
In the encoding method of this embodiment, the code C(n) of the residual waveform is multiplexed by the multiplexing unit 55 together with the following auxiliary information and then sent to the transmission path 43: the prediction coefficients $a_k$ output from the linear prediction analysis unit 21, the period $T_p$ and position $T_d$ of the subintervals output from the subinterval setting unit 46, and the subinterval residual power $u_i$ output from the power calculation unit 47. On the receiving side, after each piece of information is separated in a demultiplexing unit 56, a residual waveform generation unit 57 calculates the decoded value q(n) of the residual waveform according to the code sequence, and the prediction filter 15 is driven using the decoded value q(n) as driving sound source information to restore the speech waveform $S_p(n)$ and output it to the output terminal 16.
Phase-equalizing the residual waveform e(n) makes it pulse-like, that is, concentrates its energy. By allocating many bits to that portion and increasing the number of tree code branches there, information can be transmitted efficiently at a low bit rate.
Method using multipulse coding
The basic principle of multipulse coding was described by Atal in 1982.
It was proposed at the International Conference on Acoustics, Speech, and Signal Processing (Proceedings of ICASSP, pp. 614-617). In this method, the speech prediction residual waveform is represented by a train of multiple pulses, and the time position and amplitude of each pulse are determined so as to minimize the error between the speech waveform synthesized from this multi-pulse residual waveform and the input speech waveform. In that method the speech waveform itself is encoded directly, whereas in the embodiment of the present invention the phase-equalized speech waveform is used as the input to multipulse encoding.
FIG. 9 shows an embodiment of an encoding method that combines phase equalization processing and multi-pulse encoding. The linear prediction analysis unit 21 calculates the prediction coefficients for the sample values S(n) of the speech waveform input from the input terminal 11, and the prediction inverse filter unit 22 obtains the prediction residual waveform e(n) of the speech waveform S(n). Next, the filter coefficient determination unit 23 uses the residual waveform e(n) to determine the phase equalization filter coefficients h(m, n) and the pitch positions $n_l$. The filter coefficients of the phase equalization filter unit 38 are set to h(m, n), the phase equalization filter unit 38 phase-equalizes the input speech waveform S(n), and the subtracter 53 calculates the difference between that output and the multi-pulse coded value $\hat{S}_p(n)$. The coded value $\hat{S}_p(n)$ is calculated by passing the multi-pulse signal $\hat{e}(n)$ output from the multi-pulse generation unit 61 through the prediction filter 62, using the following formula.
$$\hat{S}_p(n) = -\sum_{k=1}^{p} a_k\,\hat{S}_p(n-k) + \hat{e}(n)$$
Here, $\hat{e}(n)$ is expressed by the following equation, where $t_i$ is the pulse time point and $m_i$ is the pulse amplitude.
$$\hat{e}(n) = \sum_{i=1}^{q} m_i\,\delta(n - t_i)$$
The pulse time points $t_i$ and the pulse amplitudes $m_i$ are determined in the pulse time point calculation unit 58 and the pulse amplitude calculation unit 59, respectively, so as to minimize the average power $P_e$ of the difference between the waveform values $S_p(n)$ and $\hat{S}_p(n)$.
In the algorithm presented in the above-mentioned paper, given $t_i$ and $m_i$ (i = 1, ..., l−1), the l-th pulse position $t_l$ is found as follows: for every possible time point (where $t_l \neq t_i$, i = 1, ..., l−1), the pulse amplitude $m_l$ that minimizes the average power $P_e$ is determined by the least squares method, and $t_l$ is set to the time point at which $P_e$ is smallest. This procedure is performed sequentially from l = 1 to l = q, so that all pulse time points and amplitudes are determined. This algorithm requires a large amount of processing to calculate the pulse time points. In this embodiment, to reduce the amount of processing, the pitch positions $n_i$ (i = 1, 2, ..., q′) determined during the phase equalization process are used, and the first q′ pulse time points are set to $t_i = n_i$ (i = 1, 2, ..., q′). The prediction coefficients $a_k$, the pulse time points (positions) $t_i$, and the pulse amplitudes $m_i$ are multiplexed by the multiplexing unit 55 and sent out. On the receiving side, after these signals are separated by the demultiplexing unit 56, a multipulse generation unit 63 generates the multipulse signal, which is passed through the prediction filter 15 to obtain the coded speech output at the output terminal.
Speech analysis and synthesis system using pulsed residual waveform
In this example, in the sample value time series of the prediction residual waveform that has been phase-equalized by the process described above, the sample values at the pitch positions are kept and the other sample values are set to zero. This turns the prediction residual waveform into a pulse train, and synthesized speech is generated by driving a prediction filter with this pulse train as the driving sound source. That is, as shown in FIG. 10, the linear prediction analysis unit 21 calculates the prediction coefficients $a_k$ for the sample values S(n) of the speech waveform input from the input terminal 11, and the prediction inverse filter 22 then obtains the prediction residual waveform e(n) of the speech waveform S(n). Next, the filter coefficient determination unit 23 calculates the phase equalization filter coefficients h(m, n), the voiced/unvoiced decision value V/UV, and the pitch positions $n_l$ from the residual waveform e(n). After the residual waveform e(n) is phase-equalized by the phase equalization filter unit 64, it is converted into pulses in the pulse processing unit 65.
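The pulsing step and the synthesis it feeds can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, pitch positions are assumed to be zero-based sample indices, the filter starts from a zero state, and the sign convention of the all-pole filter follows equation (20).

```python
def pulse_residual(e_p, pitch_positions):
    """Keep the phase-equalized residual only at the pitch positions.

    All other samples are set to zero, which yields the pulse train used
    as the driving sound source for voiced speech.
    """
    pulses = [0.0] * len(e_p)
    for n_l in pitch_positions:
        pulses[n_l] = e_p[n_l]
    return pulses

def synthesize(excitation, a):
    """Drive an all-pole prediction filter with the given excitation.

    out(n) = sum_k a[k-1] * out(n-k) + excitation(n), starting from a
    zero filter state (an assumption of this sketch).
    """
    out = []
    for x in excitation:
        pred = sum(a[k] * out[-1 - k] for k in range(min(len(a), len(out))))
        out.append(pred + x)
    return out

# Example: two pitch pulses driving a first-order filter.
example = synthesize(pulse_residual([0.1, 0.9, -0.2, 0.05, 1.1, 0.0], [1, 4]), [0.5])
```

Because only one amplitude per pitch position has to be quantized and sent, the excitation is far cheaper to transmit than the full residual, which is the point of this analysis-synthesis variant.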
Let the sample value of the phase-equalized residual waveform $e_p(n)$ at pitch position $n_l$ be $m_l = e_p(n_l)$ (l = 1, 2, ..., L). Here, L is the number of pitch positions within the analysis window. The sample value $m_l$ is quantized by a quantizer 67 using the quantization step width Δ supplied by a quantization step width calculation unit 66. The multiplexing unit 55 multiplexes and sends the quantized output C(n), the pitch positions $n_l$, the prediction coefficients $a_k$, the voiced/unvoiced decision value V/UV, and the residual power v. The demultiplexing unit 56 separates the received information. The voiced sound unit 68 dequantizes the quantized output C(n) and, from the result and the pitch positions $n_l$, forms the pulse train
$$\hat{e}_p(n) = \sum_{l=1}^{L} m_l\,\delta(n - n_l)$$
In the unvoiced sound unit 69, white noise whose power equals v is used as the driving sound source. The switch 71 is controlled according to the voiced/unvoiced decision value V/UV so that the output of the voiced sound unit 68 (for voiced, V) or the output of the unvoiced sound unit 69 (for unvoiced, UV) is supplied to the prediction filter 15 as driving sound source information, and the synthesized speech $\hat{S}_p(n)$ is output to the output terminal 16.
<Effects>
As described above, the audio signal processing method according to the present invention adaptively equalizes the short-time phase characteristic of the prediction residual waveform according to its temporal change, which increases the temporal concentration of the amplitude of the residual waveform. This makes it possible to detect the pitch period and pitch positions of the speech waveform. For example, even if the pitch of the speech waveform is changed by removing parts where energy is not concentrated to shorten the time, or by inserting zeros to lengthen it, naturalness can be maintained, and coding efficiency can be greatly improved. The speech quality when only phase equalization processing is performed is equivalent to 7.6-bit logarithmic companding PCM, and the waveform distortion due to this processing is hardly perceptible.
Therefore, even if a phase-equalized speech waveform is used as the input to be encoded, no quality deterioration occurs at the input stage. Furthermore, if the phase-equalized speech waveform can be correctly reproduced, high-quality speech can be obtained by using the phase-equalized waveform as the driving sound source signal.
In all of the encoding methods shown in the above embodiments, coding efficiency is improved by increasing the temporal concentration of the amplitude of the speech prediction residual waveform. In variable-rate tree coding, information is allocated over time according to the unevenness of the waveform amplitude, and increasing this unevenness through phase equalization makes the information allocation more effective and improves coding efficiency. When encoding at 1 bit per sample (approximately 10 kb/s), the SN ratio of the coded speech is 19.0 dB, which is 4.4 dB better than without phase equalization processing. In terms of quality, this is an improvement from the equivalent of 5.5-bit PCM to the equivalent of 6.6-bit PCM. Since 7-bit PCM has no quality problems, relatively high quality can be obtained in this example even at a bit rate of 16 kb/s or less.
In multi-pulse encoding, the residual waveform is made pulse-like by phase equalization processing, so the multi-pulse representation fits it better, and the residual waveform can be expressed with fewer pulses than when the input speech itself is used as in the conventional method. In addition, many of the pulse positions in multi-pulse encoding coincide with the pitch positions found in the phase equalization process, so using this pitch position information simplifies the pulse position determination in multi-pulse encoding. When the number of pulses is 20 (equivalent to 1 bit per sample, approximately 10 kb/s), the SN ratio of multi-pulse encoding is 11.3 dB for direct speech input and 15.0 dB for phase-equalized speech, so phase equalization processing improves the SN ratio by 3.7 dB. In terms of quality, phase equalization raises the quality from the equivalent of 4.5-bit PCM to the equivalent of 6-bit PCM. Conventionally, speech quality deteriorates rapidly when the bit rate falls to 16 kb/s or less, but when this multi-pulse encoding is applied, quite good speech quality is obtained even at a bit rate of 10 kb/s.
Note that $h^*(m, n_l)$ may be used directly as the filter coefficients of the phase equalization filter unit 38, in which case the filter coefficient interpolation unit 37 may be omitted. Further, each of the above-mentioned units may be built from independent hardware or a microprocessor, or several units may be implemented together on a single microprocessor or electronic computer.
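The simplification noted above for multi-pulse encoding, in which the first pulse positions are taken from the pitch positions instead of being searched, can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the synthesis filter is assumed to start from a zero state with the sign convention of equation (20), amplitudes are fitted one pulse at a time by least squares, no perceptual weighting is applied, and the number of pulses q is assumed not to exceed the frame length.

```python
def impulse_response(a, length):
    """Impulse response h(n) of the all-pole prediction filter."""
    h = []
    for n in range(length):
        pred = sum(a[k] * h[-1 - k] for k in range(min(len(a), len(h))))
        h.append(pred + (1.0 if n == 0 else 0.0))
    return h

def best_amplitude(r, h, t):
    """Least-squares amplitude for one pulse at time t, and the error drop."""
    num = sum(r[n] * h[n - t] for n in range(t, len(r)))
    den = sum(h[n - t] ** 2 for n in range(t, len(r)))  # >= 1 since h[0] == 1
    return num / den, num * num / den

def multipulse(s_p, a, q, pitch_positions=()):
    """Sequential pulse search; the first pulses are seeded at pitch positions.

    Returns a list of (time, amplitude) pairs.
    """
    h = impulse_response(a, len(s_p))
    r = list(s_p)                     # error still to be explained
    pulses = []
    for l in range(q):
        if l < len(pitch_positions):
            t = pitch_positions[l]    # seeded position, no search needed
        else:
            used = {pt for pt, _ in pulses}
            cands = [c for c in range(len(s_p)) if c not in used]
            # Full search: pick the time that removes the most error power.
            t = max(cands, key=lambda c: best_amplitude(r, h, c)[1])
        m, _ = best_amplitude(r, h, t)
        pulses.append((t, m))
        for n in range(t, len(r)):
            r[n] -= m * h[n - t]
    return pulses
```

Each seeded pulse skips the scan over all candidate time points, which is where most of the search cost lies; only the remaining q − q′ pulses need the full search.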

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing the basic configuration of a conventional linear prediction analysis-synthesis system. FIG. 2 is a block diagram showing the basic configuration of conventional adaptive predictive coding. FIG. 3 is a block diagram showing a configuration example of the audio signal processing method according to this invention, in particular the adaptive phase equalization processing method. FIG. 4 is a block diagram showing the internal configuration of the pitch position detection unit 25. FIG. 5 is a block diagram showing the basic configuration of speech encoding using phase equalization processing. FIG. 6 is a block diagram showing a configuration example of variable-rate tree encoding using phase equalization processing. FIG. 7 is an explanatory diagram of the method of setting subintervals. FIG. 8 is an explanatory diagram showing the structure of the variable-rate tree code. FIG. 9 is a block diagram showing a configuration example of multi-pulse encoding using phase equalization processing. FIG. 10 is a block diagram showing a configuration example of a speech analysis-synthesis system using a pulsed residual waveform.
11... input terminal, 21... linear prediction analysis unit, 22... inverse filter unit, 24... voiced/unvoiced determination unit, 25... pitch position detection unit, 26... filter coefficient calculation unit, 37... filter coefficient interpolation unit, 38... phase equalization filter unit, 39... output terminal, 41... phase equalization processing unit, 45... phase equalization filter unit, 46... subinterval calculation unit, 47... power calculation unit, 48... bit allocation unit, 49... step width calculation unit, 51... tree code generation unit, 52... prediction filter unit, 53... subtracter, 54... code sequence optimization unit.

Claims

[Scope of Claims]
1. An audio signal processing method comprising: means for obtaining sample values of a prediction residual waveform using a prediction filter that removes the correlation between sample values of a speech waveform; a phase equalization filter to which the prediction residual waveform or the speech waveform is supplied and which brings the prediction residual waveform to zero phase; means for determining, from the prediction residual waveform, phase equalization filter coefficients having a phase characteristic opposite to that of the prediction residual waveform; and means for adapting the phase equalization filter according to temporal changes in the phase of the prediction residual waveform.
2. An audio signal processing method in which the input speech signal is phase-equalized by the phase equalization filter, the phase-equalized output is encoded using a digital encoding method, and the result is used directly as the coded speech output.
3. The audio signal processing method according to claim 2, wherein the digital encoding method is a variable-rate tree encoding method, and the information needed to control the number of branches (number of bits) emerging from each node of the tree code and the output sample value of the tree code assigned to each branch is extracted from the prediction residual signal obtained during the phase equalization process.
4. The audio signal processing method according to claim 2, wherein the digital encoding method is a multi-pulse encoding method, and some of the pulse positions are set to the pitch positions obtained in the phase equalization process.
5. The audio signal processing method according to claim 2, wherein the digital encoding method comprises means for creating a pulse train in which, in the sample value time series of the phase-equalized prediction residual waveform, only the sample values at the pitch positions are kept and the other sample values are set to zero, and synthesized speech is obtained by driving a prediction filter using the pulse train as the driving sound source for voiced sound.