JPS6272000A

JPS6272000A - Encoding method for voice

Info

Publication number: JPS6272000A
Application number: JP60213193A
Authority: JP
Inventors: 白木　善尚; 雅彰誉田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1985-09-26
Filing date: 1985-09-26
Publication date: 1987-04-02
Anticipated expiration: 2009-05-25
Also published as: JPH0640278B2

Abstract

(57) [Summary] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Application Field] The present invention relates to a speech encoding method that extracts spectral parameters from input speech and encodes them at a low bit rate.

[Conventional technology]

Conventionally, there have been two approaches to encoding speech at a low bit rate of 1000 bps or less: vector quantization (for example, A. Buzo et al., "Speech Coding Based upon Vector Quantization," IEEE, ASSP-28, 1980) and variable frame rate coding (for example, Sugamura and Itakura, "Speech information compression by linear approximation of parameters," Speech Research Group Material S-78-13, 1978). The former, vector quantization, keeps the frame unit (the speech analysis unit) constant and quantizes the spectral parameter information of each frame with about 8 bits; its distinguishing feature is that the parameters are treated as a single vector.
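As a rough illustration of the frame-wise vector quantization described above, the following sketch replaces each frame's parameter vector with the index of its nearest codeword. The four-entry codebook, the function names, and the plain squared Euclidean distance are assumptions made for the example; the cited method uses a trained codebook of about 2^8 = 256 entries and its own distortion measure.

```python
import math

def nearest_codeword(frame, codebook):
    """Return the index of the codeword closest to `frame` (squared Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bits_per_frame(codebook_size):
    """Number of bits needed to transmit one codeword index."""
    return math.ceil(math.log2(codebook_size))

# Hypothetical toy codebook of 2-dimensional vectors (not from the patent).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index = nearest_codeword([0.9, 0.2], toy_codebook)
```

Only the index is transmitted per frame, which is why a 256-entry codebook costs about 8 bits per frame regardless of the vector dimension.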

However, this method removes only spatial redundancy, that is, redundancy along the frequency axis. Because the frame unit is constant, quality deteriorates sharply when the bit rate is reduced to 500 bps or less.

The latter, variable frame rate coding, changes the frame unit (frame length) adaptively according to temporal changes in the spectrum. It thereby removes temporal redundancy, and quality deteriorates little even when the average transmission rate is reduced to about 1/3. However, this method depends essentially on the (linear) interpolation characteristics of the parameters, so quality deteriorates sharply when the transmission rate falls to 25 frames per second (600 bps in total) or less.

More recently, a method has been proposed that encodes the time series of spectral parameters in units of segments (Japanese Patent Application No. 59-80855; Shiraki and Honda, "Very low bit rate speech coding using spatio-temporal spectra," Proceedings of the Acoustical Society of Japan, 1-2-3, March 1984). In that method, matching with the standard patterns is performed in a fixed dimension, and the determination of segment positions is not integrated with the selection of standard patterns, so the coding distortion cannot be made sufficiently small.

An object of the present invention is to provide a speech encoding method whose output can be reproduced as speech with good sentence intelligibility even at a low bit rate of 600 bps or less.

[Means for solving the problem]

According to the present invention, the spectral parameter time series of the input speech is divided into segments, and the segment division positions are corrected so that the matching distance between the segment series and standard patterns of fixed time length is minimized; the most similar standard patterns and the segment division positions are determined together. In other words, the invention minimizes coding distortion by integrating the determination of segment division positions with the selection of standard patterns. Furthermore, the standard patterns are created by clustering the segment series of the spectral parameters of training speech to obtain a standard pattern for each class, and then repeating two procedures with these standard patterns, correction of the segment division positions of the training speech and updating of the standard patterns, which reliably reduces coding distortion.

In the conventional segment-based encoding method, the segmentation of the speech spectral parameter time series (division into segments) and the selection of standard patterns are processed separately. Good segmentation is therefore not obtained, and the coding distortion is not minimized. In addition, when the standard patterns are created, good segmentation is likewise not obtained, and the standard patterns are built on that segmentation, which also makes it difficult to reduce the coding distortion sufficiently. Furthermore, in the present invention the measure used for matching with a standard pattern is not one of fixed time length: the standard pattern is made to match the time length of the input segment, so that the spectral distortion after decoding is minimized.

[Example]

FIG. 1 shows an embodiment of the speech encoding method of the present invention. Speech input from input terminal 1 is band-limited by low-pass filter 2 and fed to AD converter 3, where it is sampled periodically (8000 times per second in this example) and converted into a digital signal. The output of AD converter 3 is passed to LPC analysis unit 4, which extracts the spectral parameters of the input speech. The spectral parameter time series obtained by LPC analysis is divided into segments by segmentation unit 5, for example at boundary points determined in advance by visual inspection.

In this embodiment, phoneme boundaries read from sonagrams are used as the segment points. For the divided segment series, correction unit 6 corrects the segment division positions by dynamic programming so that, within a continuous short speech section, the matching distance to the standard patterns prepared in advance in standard pattern memory 7 is minimized. The corrected segment division positions, or the segment lengths, are then encoded and output together with the numbers of the most similar standard patterns. The matching distance between a segment and a standard pattern is defined as a weighted Euclidean distance, including power, computed after a linear transformation has been applied to the standard pattern to make its length equal to the input segment length.

In this example, each standard pattern is a 13 × 10 matrix: 13 parameters, consisting of the 12th-order LSP parameters (L1, L2, ..., L12) representing the spectral envelope and the logarithmic speech power P, arranged in 10 columns along the time axis. When the input segment length is l, let X_l (a 13 × l matrix) be the matrix of the input segment, and let H_l be the projection matrix that converts the time dimension from 10 to l by linear transformation. The matching distance between X_l and a standard pattern X^0 is then calculated as

$$ d(X^0, X_l)^2 = \| X_l - X^0 H_l \|^2 \qquad (1) $$

In this way, the present invention makes the standard pattern equal in length to the input segment before obtaining the matching distance between the input segment and the standard pattern.
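The matching distance of equation (1) can be sketched as follows. The text states only that the projection is a linear transformation from 10 columns to the segment length, so the linear-interpolation form of `warp_matrix`, the function names, and the optional per-parameter `weights` are assumptions for illustration rather than the patent's exact definition.

```python
def warp_matrix(n_in, n_out):
    """Build an n_in x n_out matrix H that linearly resamples n_in columns to n_out columns."""
    H = [[0.0] * n_out for _ in range(n_in)]
    for t in range(n_out):
        # Position of output column t on the input time axis.
        pos = (n_in - 1) / 2 if n_out == 1 else t * (n_in - 1) / (n_out - 1)
        lo = min(int(pos), n_in - 1)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        H[lo][t] += 1.0 - frac
        H[hi][t] += frac
    return H

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matching_distance_sq(pattern, segment, weights=None):
    """Weighted squared distance between `segment` and `pattern` warped to the segment length."""
    warped = matmul(pattern, warp_matrix(len(pattern[0]), len(segment[0])))
    weights = weights or [1.0] * len(segment)
    return sum(w * (x - y) ** 2
               for w, seg_row, warp_row in zip(weights, segment, warped)
               for x, y in zip(seg_row, warp_row))

# Toy 1-parameter pattern with 3 columns, compared against segments of length 5 and 3.
d_same_shape = matching_distance_sq([[0.0, 1.0, 2.0]], [[0.0, 0.5, 1.0, 1.5, 2.0]])
d_flat = matching_distance_sq([[0.0, 1.0, 2.0]], [[1.0, 1.0, 1.0]])
```

In the patent's setting `pattern` would be 13 × 10 and `segment` 13 × l; the toy matrices here have a single row only to keep the example readable.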

The input segment division positions are corrected by dynamic programming as follows. Let σ(T_s) be the cumulative distance (the sum of matching distances) up to time T_s within one short speech section I_m, and let M be the number of segments in I_m. With a suitably chosen division position correction width Δ, the time T_{s-1} is determined according to the following recurrence:

$$ \sigma(T_s) = \min_{T_{s-1}} \bigl( \sigma(T_{s-1}) + d(T_{s-1}, T_s) \bigr) \qquad (2) $$

where |T_s - T_s^{(0)}| ≤ (Δ - 1)/2, with T_s^{(0)} the position of the s-th boundary before correction; s = 1, 2, ..., M; σ(T_0) = 0; and d(T_{s-1}, T_s) is the distortion obtained when the input segment from time T_{s-1} to T_s is quantized using equation (1). The time T_M that minimizes the cumulative distortion σ(T_M) at the end point is determined, and the corrected position of each segment boundary obtained from equation (2) is then determined successively.
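The recurrence of equation (2) can be sketched as a small dynamic program over candidate boundary positions. In this sketch `segment_cost(a, b)` is a hypothetical stand-in for d(T_{s-1}, T_s); both end points of the section are pinned, which is an assumption of the example; and the toy cost (squared deviation from the segment mean) is not the patent's distance measure.

```python
import math

def refine_boundaries(initial, delta, segment_cost):
    """Correct boundaries [T_0, ..., T_M] within +-(delta - 1) // 2 frames.

    Returns (minimum cumulative cost, corrected boundary list).
    """
    radius = (delta - 1) // 2
    last = len(initial) - 1
    # Candidate positions per boundary; the two end points stay fixed (assumption).
    candidates = [[initial[0]]]
    for s in range(1, last):
        candidates.append(list(range(initial[s] - radius, initial[s] + radius + 1)))
    candidates.append([initial[last]])

    sigma = [{initial[0]: 0.0}]  # sigma[s][t]: best cumulative cost with boundary s at t
    back = [{}]                  # back[s][t]: best position of boundary s - 1
    for s in range(1, last + 1):
        sigma_s, back_s = {}, {}
        for t in candidates[s]:
            best, arg = math.inf, None
            for prev, acc in sigma[s - 1].items():
                if prev < t:  # segments must contain at least one frame
                    total = acc + segment_cost(prev, t)
                    if total < best:
                        best, arg = total, prev
            if arg is not None:
                sigma_s[t], back_s[t] = best, arg
        sigma.append(sigma_s)
        back.append(back_s)

    # Trace back from the fixed end point to recover every corrected boundary.
    t = initial[last]
    path = [t]
    for s in range(last, 0, -1):
        t = back[s][t]
        path.append(t)
    path.reverse()
    return sigma[last][initial[last]], path

# Toy 1-D "parameter track" whose true change point is at frame 4.
signal = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]

def toy_cost(a, b):
    seg = signal[a:b]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)

total, refined = refine_boundaries([0, 6, 10], delta=5, segment_cost=toy_cost)
```

Starting from a misplaced boundary at frame 6, the search with Δ = 5 (two frames either side) moves it to frame 4, where both segments are constant.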

Next, the method of creating the standard patterns is described with reference to FIG. 2. First, speech prepared in advance for training the standard patterns is input, the spectral parameter time series of this training speech is computed, and the time series is divided into segments. The parameter time series with these known segment boundaries is clustered, and an initial standard pattern is created for each class. This is done, for example, by Gray's method using the distance measure of equation (1).

For Gray's method, see for example A. Buzo et al., "Speech Coding Based upon Vector Quantization," IEEE, ASSP-28, pp. 562-574 (1980). Using the initial standard patterns, the segment division positions of the training parameter time series are corrected by the dynamic programming procedure described above. This correction does not increase the total quantization distortion. That is, if D(0) is the total quantization distortion at the initial segment division positions and D(1) is the total quantization distortion after correction, then

$$ D(0) \ge D(1) \qquad (3) $$

holds. Next, the standard patterns are updated from the segment series of the training speech spectral parameter time series with corrected division positions, by the following procedure.

Let X^i be any standard pattern before updating, and let N_i be the number of segments that it quantized during division position correction. The segment series with corrected division positions is clustered again; the set of segments that were quantized by this standard pattern during division position correction forms one class {X_j ; j = 1, 2, ..., N_i}. The standard pattern X^i is updated to the centroid of this set (equation (4)), which is expressed using the projection matrix H_j corresponding to each X_j, the transpose H^t, and a generalized inverse B^-. In general, the standard pattern X^i before updating does not coincide with the standard pattern after updating, so updating the standard patterns does not increase the total quantization distortion. That is, if D(2) is the quantization distortion after the standard patterns are updated, then

$$ D(1) \ge D(2) \qquad (5) $$

holds. By repeating the correction of segment division positions and the updating of standard patterns in the same way, the total quantization distortion forms the non-increasing sequence

$$ D(0) \ge D(1) \ge D(2) \ge D(3) \ge \cdots \ge D(K) \ge D(K+1) \qquad (6) $$

The standard patterns are updated until the rate of decrease of the total quantization distortion falls to or below a predetermined value. For the generalized inverse, see for example Rao and Mitra (translated by Shibuya et al.), "Generalized Inverse of Matrices and Its Applications," Tokyo Tosho (1973).
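The centroid formula itself is not legible in this text. One form consistent with the surrounding definitions can be derived by minimizing the summed distance of equation (1) over a class; the identification of B with the sum of H_j H_j^t is an assumption made here, not something the text confirms.

```latex
% Assumed reconstruction: least-squares centroid of the class {X_j} under distance (1).
\begin{aligned}
J(X) &= \sum_{j=1}^{N_i} \| X_j - X H_j \|^2 \\
\frac{\partial J}{\partial X} &= -2 \sum_{j=1}^{N_i} \left( X_j - X H_j \right) H_j^{t} = 0 \\
\Rightarrow\quad X \, B &= \sum_{j=1}^{N_i} X_j H_j^{t},
  \qquad B = \sum_{j=1}^{N_i} H_j H_j^{t} \\
\Rightarrow\quad \hat{X}^{i} &= \Bigl( \sum_{j=1}^{N_i} X_j H_j^{t} \Bigr) B^{-}
\end{aligned}
```

A generalized inverse is needed because B is a 10 × 10 matrix that can be singular, for example when every segment in the class is shorter than 10 frames. Because this choice minimizes J(X) for the fixed class, replacing the old pattern with it cannot increase the class distortion, which agrees with inequality (5).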

FIG. 3 shows an actual example in which the total quantization distortion forms a non-increasing sequence.

In this example, the number of segments is 2000, the time dimension of the standard patterns is 10, and the number of standard patterns is 64; segment division positions are corrected with a maximum segment length of 32 or less and a correction width Δ = 33. (LPC analysis uses an analysis window length of 30 msec and a shift length of 10 msec; the speaker is one male.)

The figure confirms that the quantization distortion forms a non-increasing sequence. The distortion falls to about 80% of its initial value, and it decreases markedly with the first update.
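The non-increasing behavior of the alternating procedure can be reproduced on a toy problem. In the sketch below, scalar "levels" stand in for the 13 × 10 standard patterns, the signal is a made-up 1-D track, and all function names are hypothetical; it is meant only to show why alternating boundary correction and centroid updating cannot raise the total distortion, not to reproduce the experiment of FIG. 3.

```python
def seg_cost(signal, a, b, level):
    """Squared error when frames a..b-1 are all represented by `level`."""
    return sum((x - level) ** 2 for x in signal[a:b])

def best_level(signal, a, b, levels):
    """Index of the level that quantizes frames a..b-1 with least distortion."""
    return min(range(len(levels)), key=lambda k: seg_cost(signal, a, b, levels[k]))

def total_distortion(signal, bounds, levels):
    return sum(seg_cost(signal, a, b, levels[best_level(signal, a, b, levels)])
               for a, b in zip(bounds, bounds[1:]))

def refine(signal, bounds, levels, radius):
    """DP boundary correction; interior boundaries move at most `radius` frames."""
    def cost(a, b):
        return seg_cost(signal, a, b, levels[best_level(signal, a, b, levels)])
    last = len(bounds) - 1
    sigma = [{bounds[0]: (0.0, None)}]  # position -> (cumulative cost, previous position)
    for s in range(1, last + 1):
        cands = [bounds[s]] if s == last else range(bounds[s] - radius, bounds[s] + radius + 1)
        layer = {}
        for t in cands:
            options = [(acc + cost(p, t), p) for p, (acc, _) in sigma[s - 1].items() if p < t]
            if options:
                layer[t] = min(options)
        sigma.append(layer)
    path = [bounds[last]]
    for s in range(last, 0, -1):
        path.append(sigma[s][path[-1]][1])
    return path[::-1]

def update_levels(signal, bounds, levels):
    """Replace each level by the mean (centroid) of the frames it quantizes."""
    sums = [0.0] * len(levels)
    counts = [0] * len(levels)
    for a, b in zip(bounds, bounds[1:]):
        k = best_level(signal, a, b, levels)
        sums[k] += sum(signal[a:b])
        counts[k] += b - a
    return [sums[k] / counts[k] if counts[k] else levels[k] for k in range(len(levels))]

signal = [0.1, -0.1, 0.0, 4.9, 5.1, 5.0, 5.2, 0.2, 0.0, -0.2]
bounds = [0, 4, 8, 10]   # deliberately misplaced initial boundaries
levels = [1.0, 4.0]      # deliberately poor initial "standard patterns"
history = [total_distortion(signal, bounds, levels)]
for _ in range(3):
    bounds = refine(signal, bounds, levels, radius=1)
    history.append(total_distortion(signal, bounds, levels))
    levels = update_levels(signal, bounds, levels)
    history.append(total_distortion(signal, bounds, levels))
```

Each entry of `history` plays the role of one D(k) in sequence (6): boundary correction keeps the old boundaries among its candidates, and the centroid update minimizes distortion for the current assignment, so neither step can increase the total.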

Returning to FIG. 1: as described above, segment position correction unit 6 corrects the segment division positions of the input spectral time series, and the division positions (segment lengths) are encoded. The numbers of the optimal standard patterns, the pitch information of the input speech, and the duration information of each segment are combined by multiplexer 8 and output as the encoded output.

This encoded speech output is transmitted or stored. For decoding, standard pattern memory 9 is referenced with the standard pattern number to obtain the standard pattern, a linear transformation is applied to it according to the duration information, and the spectral parameter time series is restored. From this and the pitch information, LPC synthesis unit 10 synthesizes a signal corresponding to the LPC analysis input. The synthesized output is converted to analog by DA converter 11, and analog speech is output to output terminal 13 through low-pass filter 12.
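The spectral part of the decoder described above can be sketched as a table lookup followed by time stretching. The linear-interpolation stretch, the `(pattern_index, duration)` code format, and the toy codebook are assumptions for illustration; pitch decoding and LPC synthesis are outside the scope of this sketch.

```python
def stretch(pattern, length):
    """Linearly resample each row of `pattern` to `length` frames."""
    n = len(pattern[0])
    out = []
    for row in pattern:
        new_row = []
        for t in range(length):
            # Position of output frame t on the pattern's own time axis.
            pos = (n - 1) / 2 if length == 1 else t * (n - 1) / (length - 1)
            lo = min(int(pos), n - 1)
            hi = min(lo + 1, n - 1)
            frac = pos - lo
            new_row.append((1 - frac) * row[lo] + frac * row[hi])
        out.append(new_row)
    return out

def decode(codes, codebook):
    """Rebuild the parameter time series from (pattern_index, duration_in_frames) pairs."""
    n_params = len(codebook[0])
    frames = [[] for _ in range(n_params)]
    for index, duration in codes:
        for row, piece in zip(frames, stretch(codebook[index], duration)):
            row.extend(piece)
    return frames

# Hypothetical codebook: two 1-parameter patterns with 2 columns each.
toy_codebook = [[[0.0, 2.0]], [[5.0, 5.0]]]
restored = decode([(0, 3), (1, 2)], toy_codebook)
```

Each row of `restored` is one parameter track; in the patent's configuration there would be 13 rows, one per LSP coefficient plus the log power.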

[Effect of the invention]

With 20,000 segments and 1024 standard patterns, the segment division positions were corrected once (correction width Δ = 9), and a 100-syllable intelligibility test was carried out using the updated standard patterns. With a correction width Δ = 13, good speech with a phoneme intelligibility of 78% was obtained. In this case the average number of segments is about 8 per second, so the spectral information of the encoded speech, at 5 bits for the duration and 10 bits for the standard pattern per segment, amounts to 8 × (10 + 5) = 120 bps. When the phoneme intelligibility is 75% or higher, sentence intelligibility reaches 100% for 50 out of 100 listeners.

The phoneme intelligibility of 78% is therefore a good result.
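The bit-rate figure quoted above follows from simple arithmetic, shown here as a small check. The function name and its parameters are introduced only for this example.

```python
import math

def spectral_bit_rate(segments_per_second, codebook_size, duration_bits):
    """Bits per second spent on spectral information: pattern index plus duration per segment."""
    pattern_bits = math.ceil(math.log2(codebook_size))
    return segments_per_second * (pattern_bits + duration_bits)

# 1024 standard patterns -> 10 bits; 5 bits for duration; about 8 segments per second.
rate = spectral_bit_rate(8, 1024, 5)
```

With these figures `rate` is 120 bps. A 5-bit duration field can distinguish 2^5 = 32 segment lengths.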

As explained above, the present invention yields sufficiently clear encoded speech even when the spectral information is reduced to a very low rate, for example about 120 bps. It therefore has the advantage that it can be used for efficient use of transmission lines, for constructing highly secure communication channels, and for similar purposes.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an example of the present invention, FIG. 2 is a diagram showing the procedure in the standard pattern creation section in FIG. 1, and FIG. 3 is a diagram showing the relationship between the number of repetitions of segment correction and standard pattern updating and the reduction in coding distortion.

Claims

[Claims]

(1) A speech encoding method in which spectral parameters of input speech are extracted frame by frame, the time series of the extracted spectral parameters is divided into segments, the division positions and the most similar standard patterns are determined, while the division positions are corrected, so that the matching distance between the segment series and standard patterns of fixed time length prepared in advance is minimized, and codes indicating the determined division positions and standard patterns are output.

(2) A speech encoding method according to claim 1, characterized in that training speech is input, its spectral parameters are extracted frame by frame, the time series of the extracted spectral parameters is divided into segments, the segments are clustered, and a standard pattern is determined for each class; using the determined standard patterns, the most similar standard pattern is selected while the division positions of the segment series of the training speech are corrected; the segment series at the corrected division positions is clustered again and the standard pattern of each class is redetermined; the division position correction, re-clustering, and redetermination of the standard patterns are performed at least once; and the finally redetermined standard patterns are used as the standard patterns for encoding the input speech.