JPH04249300A

JPH04249300A - Speech encoding/decoding method and device

Info

Publication number: JPH04249300A
Application number: JP3035149A
Authority: JP
Inventors: Seiji Sasaki; 誠司佐々木
Original assignee: Kokusai Electric Co Ltd
Current assignee: Kokusai Denki Electric Inc
Priority date: 1991-02-05
Filing date: 1991-02-05
Publication date: 1992-09-04

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

[Detailed description of the invention]

[0001]

[Industrial Field of Application] The present invention relates to a speech encoding/decoding method, and an apparatus therefor, in which an analog speech signal is encoded with high efficiency and sent to a transmission path, and is decoded on the receiving side to reproduce the analog speech signal.

[0002]

[Prior Art] FIG. 3 is a block diagram of a conventional equally spaced pulse-driven (regular-pulse-excited) speech codec using long-term prediction, in which (A) is the speech encoding device and (B) is the speech decoding device. This is the speech encoding/decoding method adopted, at a coding rate of 13 Kbps, in the pan-European digital mobile telephone (GSM) system. The method of compressing the coding rate from 64 Kbps to 13 Kbps is described below.

[0003] In FIG. 3(A), the input speech signal (64 Kbps), sampled at 8 kHz and quantized to 8 bits, undergoes short-term prediction analysis (also called linear prediction analysis) in the short-term prediction analyzer 11 once per frame (160 samples: 20 msec). That is, spectral envelope information Pa is extracted from the input speech signal and output, and a short-term prediction residual signal a (160 samples), which is the signal with the spectral envelope component removed, is generated and output. Next, the long-term prediction analyzer 12 divides the short-term prediction residual signal a into four subframes (40 samples each), extracts and outputs pitch information Pb for each subframe, and generates and outputs a long-term prediction residual signal b (40 samples), which is the signal with the pitch component further removed. This long-term prediction residual signal b (40 samples) is band-limited to 1/3 by the LPF (low-pass filter) 13 to give signal c. Signal c is downsampled to 1/3 by the switch S, which has three grids; at this point the grid selector 14 selects the grid whose signal sequence (13 samples) has the greatest power, and that sequence is output as the equally spaced pulse information Pc. Pa, Pb, and Pc are encoded and multiplexed by the encoder 15 and sent to the receiving side as a digital sequence. The bit allocation per frame for each parameter is shown in Table 1. The equally spaced pulse information Pc consists of the maximum value among the 13 samples, the 13 samples normalized by that maximum, and position information (the grid number). Since one frame is 20 msec, the coding rate is 13 Kbps.
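The grid selection and normalization described above can be sketched as follows. This is a minimal illustration, not the codec's actual implementation: the function names `select_grid` and `normalize` are hypothetical, and it assumes that the three candidate grids start at sample offsets 0, 1, and 2 of the 40-sample subframe and that each is limited to 13 samples.

```python
def select_grid(c, step=3, count=13):
    """Pick the downsampling grid with the largest energy.

    c     : band-limited subframe (40 samples in the text)
    step  : downsampling factor (1/3 in the text)
    count : samples kept per grid (13 in the text)
    Assumption: candidate grids start at offsets 0 .. step-1.
    """
    best_grid, best_pulses, best_energy = 0, [], -1.0
    for grid in range(step):
        pulses = c[grid::step][:count]
        energy = sum(p * p for p in pulses)
        if energy > best_energy:
            best_grid, best_pulses, best_energy = grid, pulses, energy
    return best_grid, best_pulses


def normalize(pulses):
    """Return the peak magnitude and the pulses scaled by that peak."""
    peak = max(abs(p) for p in pulses)
    if peak == 0:
        return 0.0, [0.0] * len(pulses)
    return peak, [p / peak for p in pulses]


if __name__ == "__main__":
    subframe = [0.0] * 40
    subframe[3] = 1.0   # lies on grid 0
    subframe[4] = 2.0   # lies on grid 1
    grid, pulses = select_grid(subframe)
    peak, scaled = normalize(pulses)
    print(grid, peak, scaled)
```

The grid number, the peak, and the 13 normalized samples correspond to the three parts of Pc named in the text; quantization of those values is not shown.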

[Table 1]

[0004] In FIG. 3(B), the separation circuit 21 separates the received digital sequence into equally spaced pulse information Pd, pitch information Pe, and spectral envelope information Pf, after which the long-term prediction residual signal regenerator 22 reproduces the long-term prediction residual signal d from the equally spaced pulse information Pd. Here the received equally spaced pulses are placed back at their original grid positions, and 0 is inserted at sample points where no pulse exists. Next, the long-term prediction synthesizer 23 adds the pitch information Pe to the long-term prediction residual signal d to reproduce the short-term prediction residual signal e. Finally, the short-term prediction synthesizer 24 adds the spectral envelope information Pf to the short-term prediction residual signal e and outputs the reproduced speech signal.
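The pulse repositioning step in the conventional decoder can be sketched as below. The function name `restore_residual` is hypothetical, and the sketch assumes the received pulses have already been denormalized and that the grid number is the sample offset (0, 1, or 2) of the first pulse.

```python
def restore_residual(grid, pulses, length=40, step=3):
    """Place equally spaced pulses back on their grid and zero-fill the rest.

    grid   : offset of the first pulse (assumed to be 0, 1, or 2)
    pulses : received pulse amplitudes (13 per subframe in the text)
    """
    d = [0.0] * length
    for i, p in enumerate(pulses):
        d[grid + i * step] = p
    return d


if __name__ == "__main__":
    print(restore_residual(1, [1.0] * 13))
```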

[0005]

[Problems to Be Solved by the Invention] The problems of the conventional method are explained with reference to FIG. 4. FIG. 4(A) is the spectrum of the long-term prediction residual signal b. If it is filtered by the ideal LPF of FIG. 4(B), the signal of FIG. 4(C) is obtained, and no aliasing distortion arises even when this is downsampled to 1/3. In the conventional method, however, the filtering is implemented in the time domain by convolving the impulse response of the ideal LPF with the long-term prediction residual signal sequence, and the impulse response, which should be infinitely long, is truncated to only 11 samples. The filter therefore has the spectrum shown in FIG. 4(D), and the filtered signal is as shown in FIG. 4(E). Downsampling this to 1/3 produces the aliasing distortion (hatched area) shown in FIG. 4(F). This affects the reproduced sound, so that speech uttered by one person sounds as if it were uttered by two people. Moreover, because the aliasing distortion would grow, the samples of the long-term prediction residual signal cannot be thinned out any further, and it is difficult to lower the coding rate below this.
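The effect of truncating the ideal impulse response can be checked numerically. The sketch below is an illustration under stated assumptions, not the filter actually used in the conventional codec: it takes the ideal low-pass impulse response with a cutoff of one third of the 4 kHz band, keeps only the 11 central taps with no window, and evaluates the resulting gain at a few frequencies. The names `ideal_tap` and `gain` are hypothetical.

```python
import math

FS = 8000.0        # sampling rate in Hz, from the text
CUTOFF = FS / 6.0  # about 1.33 kHz: one third of the 4 kHz band
HALF = 5           # 11 taps in total: n = -5 .. 5


def ideal_tap(n):
    """Sample n of the ideal (infinitely long) low-pass impulse response."""
    if n == 0:
        return 2.0 * CUTOFF / FS
    return math.sin(2.0 * math.pi * CUTOFF * n / FS) / (math.pi * n)


def gain(freq_hz):
    """Frequency response of the 11-tap truncated filter at freq_hz.

    The taps are symmetric about n = 0, so the response is real.
    """
    w = 2.0 * math.pi * freq_hz / FS
    return sum(ideal_tap(n) * math.cos(w * n) for n in range(-HALF, HALF + 1))


if __name__ == "__main__":
    for f in (0.0, 1000.0, 2000.0, 3000.0):
        print(f, round(abs(gain(f)), 4))
```

An ideal filter would have zero gain at 2 kHz; the truncated one does not, and whatever passes above 1.33 kHz folds back into the band after 1/3 downsampling.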

[0006] To summarize, the drawbacks of the conventional method are as follows.
(1) Aliasing distortion arises from the imperfection of the LPF and degrades the quality of the reproduced sound.
(2) It is difficult to lower the bit rate.

[0007] An object of the present invention is to provide a speech encoding/decoding method and apparatus that reduce the adverse effect on quality of the aliasing distortion that is a drawback of the conventional method, and that operate at a lower coding rate.

[0008]

[Means for Solving the Problems] FIG. 1 is a block diagram of a speech encoding/decoding apparatus showing an embodiment of the present invention, in which (A) is the speech encoding device and (B) is the speech decoding device.

[0009] In FIG. 1(A), the input speech signal (64 Kbps), sampled at 8 kHz and quantized to 8 bits, undergoes short-term prediction analysis in the short-term prediction analyzer 31 once per frame (160 samples: 20 msec). That is, spectral envelope information Pg is extracted from the input speech signal and output, and a short-term prediction residual signal g (160 samples), which is the signal with the spectral envelope component removed, is generated and output. Next, the long-term prediction analyzer 32 divides this short-term prediction residual signal g into four subframes (40 samples each), extracts and outputs pitch information Ph for each subframe, and generates and outputs a long-term prediction residual signal h (40 samples), which is the signal with the pitch component further removed. The discrete cosine transform (DCT) unit 33 transforms this long-term prediction residual signal h (40 samples) into the frequency domain and outputs DCT coefficients i. The DCT transform formula is given later. Next, the decimator 34 thins out the DCT coefficients i (40 samples) so that they are represented by 7 samples. FIG. 2 illustrates the thinning method. FIG. 2(A) is the result of applying the DCT to the long-term prediction residual signal. As shown in FIG. 2(B), the coefficients at 1.33 kHz and above are first discarded. This has the same effect as band-limiting to 1/3 with the LPF in the conventional method, but because the components are eliminated in the frequency domain, no components at 1.33 kHz or above remain, and aliasing distortion is reduced. Then, to lower the coding rate further, whichever of the solid-line and dotted-line coefficient sets in the same figure has the greater power is selected, yielding 7 samples. These become the DCT coefficient information Pi. Pg, Ph, and Pi are encoded and multiplexed by the encoder 35 and sent to the receiving side as a digital sequence. The bit allocation per frame for each parameter is shown in Table 2. The DCT coefficient information Pi consists of the maximum value among the 7 samples, the 7 samples normalized by that maximum, and position information (the grid number).
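The thinning step can be sketched as follows. Several details here are assumptions made for illustration, since FIG. 2 is not reproduced: for a 40-point DCT at 8 kHz sampling each coefficient is taken to span about 100 Hz, so the band below 1.33 kHz is taken as the 14 lowest coefficients; and the solid-line and dotted-line sets are taken to be the even-indexed and odd-indexed coefficients of that band. The function name `decimate_dct` is hypothetical.

```python
def decimate_dct(coeffs, keep=14):
    """Thin 40 DCT coefficients down to 7 plus side information.

    coeffs : DCT coefficients of one subframe, lowest frequency first
    keep   : number of low-frequency coefficients retained (assumed 14)
    Returns (grid, peak, normalized), where grid is 0 for the even-indexed
    set and 1 for the odd-indexed set (an assumed reading of FIG. 2).
    """
    low = coeffs[:keep]                 # drop everything above the cutoff
    grids = [low[0::2], low[1::2]]      # two interleaved sets of 7
    energies = [sum(v * v for v in g) for g in grids]
    grid = 0 if energies[0] >= energies[1] else 1
    selected = grids[grid]
    peak = max(abs(v) for v in selected)
    if peak == 0:
        return grid, 0.0, [0.0] * len(selected)
    return grid, peak, [v / peak for v in selected]


if __name__ == "__main__":
    coeffs = [0.0] * 40
    coeffs[2] = 1.0
    coeffs[3] = 4.0
    coeffs[20] = 100.0   # above the cutoff, so it is discarded
    print(decimate_dct(coeffs))
```

Because the high-frequency coefficients are set aside directly rather than attenuated by a finite filter, nothing above the cutoff leaks into the retained band, which is the point the text makes about reduced aliasing.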

[Table 2]

Since one frame is 20 msec, the coding rate is 9.2 Kbps, so a low bit rate is achieved.
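The relation between coding rate and frame size can be worked through directly. The sketch below only does the arithmetic implied by the rates and the 20 msec frame stated in the text; the per-parameter allocations of Tables 1 and 2 are not reproduced here, and the function name `bits_per_frame` is hypothetical.

```python
FRAME_MS = 20  # frame length in milliseconds, from the text


def bits_per_frame(rate_bps):
    """Number of bits available in one frame at the given bit rate."""
    return rate_bps * FRAME_MS // 1000


if __name__ == "__main__":
    pcm_rate = 8000 * 8  # 8 kHz sampling, 8 bits per sample
    for label, rate in (("input PCM", pcm_rate),
                        ("conventional", 13000),
                        ("proposed", 9200)):
        print(label, rate, bits_per_frame(rate))
```

At 9.2 Kbps each 20 msec frame carries 184 bits, compared with 260 bits at 13 Kbps.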

[0010] In FIG. 1(B), the separation circuit 41 separates the received digital sequence into DCT coefficient information Pj, pitch information Pk, and spectral envelope information Pm, after which the DCT coefficient interpolator 42 reproduces the DCT coefficients j. Here, as shown in FIG. 2(C), the received DCT coefficients (7 samples) are placed back at their original frequency positions, and either 0 or a value obtained by interpolation is inserted at the sample points where no DCT coefficient exists. FIG. 2(D) shows linear interpolation as one example of an interpolation method. In this example only the components that were thinned out at equal intervals are linearly interpolated, and 0 is inserted for the other discarded components, giving 40 samples. Next, the inverse DCT (IDCT) unit 43 transforms the coefficients into the time domain to reproduce the long-term prediction residual signal k. The long-term prediction synthesizer 44 then adds the pitch information Pk to the long-term prediction residual signal k to reproduce the short-term prediction residual signal m. Finally, the short-term prediction synthesizer 45 adds the spectral envelope information Pm to the short-term prediction residual signal m and outputs the reproduced speech signal.
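The coefficient restoration in the decoder can be sketched as below. This is an illustration under assumptions: the grid number is taken to be 0 or 1, meaning the 7 received coefficients sit at even or odd positions among the 14 lowest frequencies, and linear interpolation between two neighbours two positions apart reduces to their average. The function name `restore_coeffs` is hypothetical, and the position at the edge of the band that has only one received neighbour is simply left at 0.

```python
def restore_coeffs(grid, peak, normalized, length=40, interpolate=False):
    """Rebuild a 40-coefficient vector from 7 received coefficients.

    grid        : 0 or 1, offset of the first received coefficient (assumed)
    peak        : maximum value sent as side information
    normalized  : the 7 coefficients normalized by peak
    interpolate : if True, fill gaps between received coefficients by
                  linear interpolation; otherwise leave them at 0
    """
    coeffs = [0.0] * length
    positions = [grid + 2 * i for i in range(len(normalized))]
    for pos, value in zip(positions, normalized):
        coeffs[pos] = value * peak
    if interpolate:
        for a, b in zip(positions, positions[1:]):
            # midpoint between two received neighbours
            coeffs[a + 1] = 0.5 * (coeffs[a] + coeffs[b])
    return coeffs


if __name__ == "__main__":
    received = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(restore_coeffs(0, 2.0, received, interpolate=True)[:14])
```

The coefficients above the cutoff stay at 0 in both modes, matching the statement that only the equally spaced gaps are interpolated. The rebuilt vector would then be passed to the IDCT.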

[0011] With the input signal denoted X(n), the transform formulas for the DCT and the IDCT are as follows.
(1) For the DCT, the DCT coefficient Xc(k) to be obtained is

where N is the number of samples per block and
g(k) = 1 (k = 0)
g(k) = √2 (k = 1, 2, ..., N−1)
(2) For the IDCT, the restored signal X(n) is
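The formulas themselves are not reproduced in this text; only the weights g(k) survive. As a hedged illustration, the sketch below implements one common transform pair that is consistent with those weights, the orthonormal DCT-II and its inverse, with a scale factor of 1/√N on both sides. That scale factor and the cosine argument are assumptions; the normalization used in the original formulas may differ. The function names `g`, `dct`, and `idct` are chosen for this example.

```python
import math


def g(k):
    """Weight from the text: 1 for k = 0, sqrt(2) otherwise."""
    return 1.0 if k == 0 else math.sqrt(2.0)


def dct(x):
    """Forward transform (assumed orthonormal DCT-II)."""
    n_samples = len(x)
    scale = 1.0 / math.sqrt(n_samples)  # assumed normalization
    return [
        scale * g(k) * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_samples))
            for n in range(n_samples)
        )
        for k in range(n_samples)
    ]


def idct(xc):
    """Inverse transform matching dct() above."""
    n_samples = len(xc)
    scale = 1.0 / math.sqrt(n_samples)  # assumed normalization
    return [
        scale * sum(
            g(k) * xc[k] * math.cos((2 * n + 1) * k * math.pi / (2 * n_samples))
            for k in range(n_samples)
        )
        for n in range(n_samples)
    ]


if __name__ == "__main__":
    x = [float((7 * n) % 11 - 5) for n in range(40)]
    y = idct(dct(x))
    print(max(abs(a - b) for a, b in zip(x, y)))
```

With this normalization the transform preserves signal energy, so comparing the power of coefficient sets in the frequency domain is equivalent to comparing the power they contribute in the time domain.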

[0012]

[Effects of the Invention] As described in detail above, when the present invention is implemented, the DCT coefficients are thinned out in the frequency domain, so no aliasing distortion arises and the quality of the reproduced speech is improved over the conventional method. In addition, the coding rate can be lowered to 9.2 Kbps at the cost of only a slight loss of quality, which is an extremely large benefit.

[Brief Description of the Drawings]

[FIG. 1] Block diagram showing an embodiment of the present invention

[FIG. 2] Explanatory diagram of the thinning method of the present invention

[FIG. 3] Block diagram of a conventional speech encoding/decoding apparatus

[FIG. 4] Explanatory diagram of the occurrence of aliasing distortion

[Explanation of symbols]

11 Short-term prediction analyzer
12 Long-term prediction analyzer
13 LPF
14 Grid selector
15 Encoder
21 Separation circuit
22 Long-term prediction residual regenerator
23 Long-term prediction synthesizer
24 Short-term prediction synthesizer
31 Short-term prediction analyzer
32 Long-term prediction analyzer
33 Discrete cosine transform (DCT) unit
34 Decimator
35 Encoder
41 Separation circuit
42 DCT coefficient interpolator
43 Inverse discrete cosine transform (IDCT) unit
44 Long-term prediction synthesizer
45 Short-term prediction synthesizer

Claims

[Claims]

1. A speech encoding/decoding method in which: spectral envelope information is extracted from an input speech signal by short-term prediction analysis, and a short-term prediction residual signal is generated by removing the spectral envelope information; pitch information is extracted from the short-term prediction residual signal by long-term prediction analysis, and a long-term prediction residual signal is generated by removing the pitch information; the long-term prediction residual signal is converted into the frequency domain by a discrete cosine transform to output DCT coefficients, which are frequency components; DCT coefficient information is output by removing 2/3 of the DCT coefficients from the high-frequency side and thinning out the remaining 1/3 of the DCT coefficients at equal intervals; the DCT coefficient information, the pitch information, and the spectral envelope information are encoded in the form of a digital sequence signal, multiplexed, and sent to a transmission path; the digital sequence signal received via the transmission path is separated to extract the DCT coefficient information, the pitch information, and the spectral envelope information; the DCT coefficients are reproduced from the DCT coefficient information and placed back at their original frequencies, and all frequency components are then reproduced by inserting 0 in place of every thinned-out component or by inserting values obtained by interpolation; the long-term prediction residual signal is reproduced by conversion into the time domain by an inverse cosine transform; the short-term prediction residual signal is reproduced by adding the pitch information to the long-term prediction residual signal by long-term prediction synthesis; and the speech signal is decoded and reproduced by adding the spectral envelope information to the short-term prediction residual signal by short-term prediction synthesis.

2. A speech encoding device comprising: a short-term prediction analyzer that separates an input speech signal into spectral envelope information and a short-term prediction residual signal obtained by removing the spectral envelope component from the speech signal, and outputs them; a long-term prediction analyzer that separates the short-term prediction residual signal into pitch information and a long-term prediction residual signal from which the pitch component has been removed, and outputs them; a discrete cosine transformer that converts the long-term prediction residual signal into the frequency domain and outputs DCT coefficients; a decimator that removes the 2/3 of the DCT coefficients on the high-frequency side and thins out the remaining 1/3 of the DCT coefficients at equal intervals to extract DCT coefficient information; and an encoder that encodes the spectral envelope information, the pitch information, and the DCT coefficient information in the form of a digital sequence signal, multiplexes them, and sends them to a transmission path.

3. A speech decoding device comprising: a separator that receives a multiplexed signal encoded in the form of a digital sequence signal including DCT coefficient information, pitch information, and spectral envelope information, separates the digital sequence signal, and extracts the DCT coefficient information, the pitch information, and the spectral envelope information; a DCT coefficient interpolator that reproduces the DCT coefficients of all frequency components by placing the DCT coefficient information back at its original frequencies and inserting 0 in place of the thinned-out frequency components or inserting values obtained by interpolation; an inverse cosine transformer that applies an inverse cosine transform to the DCT coefficients, converting them into the time domain to reproduce a long-term prediction residual signal; a long-term prediction synthesizer that adds the pitch information to the long-term prediction residual signal to reproduce a short-term prediction residual signal; and a short-term prediction synthesizer that decodes and reproduces a speech signal by adding the spectral envelope information to the short-term prediction residual signal.