EP1808852A1 - Method for interoperation between adaptive multi-rate wideband and multi-mode variable bit-rate wideband codecs - Google Patents

Method for interoperation between adaptive multi-rate wideband and multi-mode variable bit-rate wideband codecs

Info

Publication number: EP1808852A1
Authority: EP; European Patent Office
Prior art keywords: speech; frame; rate; encoded; frames
Prior art date: 2002-10-11
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Withdrawn

Application number

EP07105041A

Other languages

English (en)

French (fr)

Inventor

Milan Jelinek

Redwan Salami

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

Nokia Oyj

Nokia Inc

Original Assignee

Nokia Oyj

Nokia Inc

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2002-10-11

Filing date

2003-10-10

Publication date

2007-07-18

2003-10-10 Application filed by Nokia Oyj, Nokia Inc filed Critical Nokia Oyj

2003-10-10 Priority claimed from EP03769097A external-priority patent/EP1554718B1/de

2007-07-18 Publication of EP1808852A1 publication Critical patent/EP1808852A1/de

Status Withdrawn legal-status Critical Current

Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/173—Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/012—Comfort noise or silence coding

Definitions

the present invention relates to digital encoding of sound signals, in particular but not exclusively a speech signal, in view of transmitting and synthesizing this sound signal.
the present invention relates to a method for interoperation between adaptive multi-rate wideband and multi-mode variable bit-rate wideband codecs.
a speech encoder converts a speech signal into a digital bit stream, which is transmitted over a communication channel or stored in a storage medium.
the speech signal is digitized, that is, sampled and quantized with usually 16-bits per sample.
the speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality.
the speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
CELP Code-Excited Linear Prediction
This coding technique is a basis of several speech coding standards both in wireless and wireline applications.
the sampled speech signal is processed in successive blocks of L samples usually called frames, where L is a predetermined number corresponding typically to 10-30 ms.
a linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, a 5-15 ms speech segment from the subsequent frame.
the L-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4-10 ms subframes.
an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation.
the component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation.
the parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
VBR variable bit rate
the codec operates at several bit rates, and a rate selection module is used to determine the bit rate used for encoding each speech frame based on the nature of the speech frame (e.g. voiced, unvoiced, transient, background noise).
the goal is to attain the best speech quality at a given average bit rate, also referred to as average data rate (ADR).
ADR average data rate
the codec can operate at different modes by tuning the rate selection module to attain different ADRs at the different modes where the codec performance is improved at increased ADRs.
the mode of operation is imposed by the system depending on channel conditions. This gives the codec a mechanism for trading off speech quality against system capacity.
an eighth-rate is used for encoding frames without speech activity (silence or noise-only frames).
when the frame is stationary voiced or stationary unvoiced, half-rate or quarter-rate is used depending on the operating mode. If half-rate can be used, a CELP model without the pitch codebook is used in the unvoiced case, and signal modification is used in the voiced case to enhance the periodicity and reduce the number of bits for the pitch indices. If the operating mode imposes quarter-rate, waveform matching is usually not possible because the number of bits is insufficient, and some form of parametric coding is generally applied.
Full-rate is used for onsets, transient frames, and mixed voiced frames (a typical CELP model is usually used).
the system can limit the maximum bit-rate in some speech frames in order to send in-band signalling information (called dim-and-burst signalling) or during bad channel conditions (such as near the cell boundaries) in order to improve the codec robustness. This is referred to as half-rate max.
if the rate-selection module chooses to encode the frame as a full-rate frame and the system imposes, for example, an HR frame, the speech performance is degraded, since the dedicated HR modes are not capable of efficiently encoding onsets and transient signals.
Another HR (or quarter-rate (QR)) coding model can be provided to cope with these special cases.
Rate selection is the key part for attaining the lowest average data rate with the best possible quality.
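The rate selection rules stated above can be illustrated with a minimal Python sketch. The class names, the function `select_rate` and its two flags are hypothetical names introduced only for this illustration; the sketch mirrors the stated rules (eighth-rate for inactive frames, half-rate or quarter-rate for stationary frames depending on the operating mode, full-rate otherwise, with an optional half-rate max imposed by the system) and is not the rate selection module of any particular codec.

```python
from enum import Enum


class FrameClass(Enum):
    """Hypothetical frame classes used only for this illustration."""
    INACTIVE = 0             # silence or noise-only frame
    STATIONARY_UNVOICED = 1
    STATIONARY_VOICED = 2
    OTHER = 3                # onsets, transients, mixed voiced frames


def select_rate(frame_class, mode_allows_half_rate=True, half_rate_max=False):
    """Return 'eighth', 'quarter', 'half' or 'full' for one frame."""
    if frame_class is FrameClass.INACTIVE:
        return "eighth"
    if frame_class in (FrameClass.STATIONARY_UNVOICED,
                       FrameClass.STATIONARY_VOICED):
        # The operating mode decides between half-rate and quarter-rate.
        return "half" if mode_allows_half_rate else "quarter"
    # Full-rate by default; the system may cap the frame at half-rate.
    return "half" if half_rate_max else "full"
```

A frame that the module would encode at full-rate is thus reported as half-rate when `half_rate_max` is set, which is the situation in which a dedicated HR coding model for onsets and transients becomes useful.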
AMR-WB adaptive multi-rate wideband
ITU-T International Telecommunications Union - Telecommunication Standardization Sector
3GPP third generation partnership project
GSM Global System for Mobile communications
W-CDMA third generation wireless systems
The AMR-WB codec supports nine bit rates, namely 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, and 23.85 kbit/s. Interoperation between CDMA-WB and the AMR-WB codec is thus desirable.
An object of the present invention is to provide improved signal classification and rate selection methods for variable-rate wideband speech coding in general, and in particular for variable-rate multi-mode wideband speech coding suitable for CDMA systems. Another object is to provide techniques for efficient interoperation between the wideband VBR codec for CDMA systems and the standard AMR-WB codec.
VMR-WB Variable bit-rate Multi-mode WideBand
AMR-WB Adaptive Multi-Rate wideband
a method for translating an Adaptive Multi-Rate wideband (AMR-WB) signal frame into a Variable bit rate multi-mode wideband (VMR-WB) signal frame comprising:
the speech communication system 10 supports transmission and reproduction of a speech signal across a communication channel 12.
the communication channel 12 may comprise for example a wire, optical or fibre link, or a radio frequency link.
the communication channel 12 can be also a combination of different transmission media, for example in part fibre link and in part a radio frequency link.
the radio frequency link may support multiple, simultaneous speech communications requiring shared bandwidth resources, such as may be found in cellular telephony.
the communication channel may be replaced by a storage device (not shown) in a single device embodiment of the communication system that records and stores the encoded speech signal for later playback.
the communication system 10 includes an encoder device comprised of a microphone 14, an analog-to-digital converter 16, a speech encoder 18, and a channel encoder 20 on the emitter side of the communication channel 12, and a channel decoder 22, a speech decoder 24, a digital-to-analog converter 26 and a loudspeaker 28 on the receiver side.
the microphone 14 produces an analog speech signal that is conducted to an analog-to-digital (A/D) converter 16 for converting it into a digital form.
a speech encoder 18 encodes the digitized speech signal producing a set of parameters that are coded into a binary form and delivered to a channel encoder 20.
the optional channel encoder 20 adds redundancy to the binary representation of the coding parameters before transmitting them over the communication channel 12. Also, in some applications such as packet-network applications, the encoded frames are packetized before transmission.
a channel decoder 22 utilizes the redundant information in the received bitstream to detect and correct channel errors that occurred during transmission.
a speech decoder 24 converts the bitstream received from the channel decoder 22 back to a set of coding parameters for creating a synthesized speech signal.
the synthesized speech signal reconstructed at the speech decoder 24 is converted to an analog form in a digital-to-analog (D/A) converter 26 and played back in a loudspeaker unit 28.
D/A digital-to-analog
the microphone 14 and/or the A/D converter 16 may be replaced in some embodiments by other speech sources for the speech encoder 18.
the encoder 20 and decoder 22 are configured so as to embody a method for encoding a speech signal according to the present invention as described hereinbelow.
the method 100 includes a speech signal classification method according to an illustrative embodiment of a second aspect of the present invention.
the expression speech signal refers to voice signals as well as any multimedia signal that may include a voice portion such as audio with speech content (speech in between music, speech with background music, speech with special sound effects, etc.)
the signal classification is done in three steps 102, 106 and 110, each of them discriminating a specific signal class.
a first-level classifier in the form of a voice activity detector (VAD) discriminates between active and inactive speech frames. If an inactive speech frame is detected then the encoding method 100 ends with the encoding of the current frame with, for example, comfort noise generation (CNG) (step 104). If an active speech frame is detected in step 102, the frame is subjected to a second level classifier (not shown) configured to discriminate unvoiced frames.
VAD voice activity detector
CNG comfort noise generation
in step 106, if the classifier classifies the frame as an unvoiced speech signal, the encoding method 100 ends in step 108, where the frame is encoded using a coding technique optimized for unvoiced signals. Otherwise, the speech frame is passed, in step 110, through a third-level classifier (not shown) in the form of a "stable voiced" classification module (not shown). If the current frame is classified as a stable voiced frame, then the frame is encoded using a coding technique optimized for stable voiced signals (step 112). Otherwise, the frame is likely to contain a non-stationary speech segment such as a voiced onset or a rapidly evolving voiced speech signal portion, and the frame is encoded using a general purpose speech coder with a high bit rate that sustains good subjective quality (step 114). Note that if the relative energy of the frame is lower than a certain threshold, the frame can be encoded with a generic lower rate coding type to further reduce the average data rate.
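The three-level classification chain of steps 102 to 114 can be sketched as a simple cascade. In this minimal Python sketch the three classifier decisions are passed in as booleans, and the function name, the returned labels and the parameter `low_energy_threshold_db` are hypothetical; no threshold value is given in the text, so the caller has to supply one.

```python
def classify_frame(is_active, is_unvoiced, is_stable_voiced,
                   relative_energy_db, low_energy_threshold_db):
    """Return a label for the coding technique chosen for one frame.

    The three booleans stand for the outputs of the first-level (VAD),
    second-level (unvoiced) and third-level (stable voiced) classifiers.
    """
    if not is_active:
        return "CNG"                 # step 104: comfort noise generation
    if is_unvoiced:
        return "unvoiced"            # step 108
    if is_stable_voiced:
        return "stable voiced"       # step 112
    if relative_energy_db < low_energy_threshold_db:
        # Low-energy frame: a generic lower rate coding type may be used.
        return "generic low rate"
    return "generic"                 # step 114
```

The order of the tests matters: each level is consulted only when all earlier levels have declined the frame, so the later classifiers never see inactive frames.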
the classifiers and encoders may take many forms, from electronic circuitry to a chip processor.
the unvoiced parts of a speech signal are characterized by missing periodicity and can be further divided into unstable frames, where the energy and the spectrum change rapidly, and stable frames, where these characteristics remain relatively stable.
in step 106, unvoiced frames are discriminated using at least three of the following parameters:
Figure 3 illustrates a method 200 for discriminating unvoiced frames according to an illustrative embodiment of a third aspect of the present invention.
the normalized correlation used to determine the voicing measure is computed as part of the open-loop pitch search module 214.
the open-loop pitch search module usually outputs the open-loop pitch estimate p every 10 ms (twice per frame).
it also outputs the normalized correlation measures $r_x$.
These normalized correlations are computed on the weighted speech and the past weighted speech at the open-loop pitch delay.
the weighted speech signal $s_w(n)$ is computed in a perceptual weighting filter 212.
a perceptual weighting filter 212 with a fixed denominator, suited for wideband signals, is used.
the following relation gives an example of transfer function for the perceptual weighting filter 212:
$W(z) = A(z/\gamma_1) / (1 - \gamma_2 z^{-1})$, where $0 < \gamma_2 < \gamma_1 \le 1$
A(z) is the transfer function of the linear prediction (LP) filter computed in module 218, which is given by the following relation:
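As a hedged illustration of the weighting filter, the following Python sketch filters a block of samples through $W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$. It assumes the commonly used form $A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$ for the LP filter, which is an assumption of this sketch and is not reproduced from the text; the function name, the argument layout and the zero initial filter state are likewise illustrative choices.

```python
def perceptual_weighting(speech, lp_coeffs, gamma1, gamma2):
    """Filter `speech` through W(z) = A(z/gamma1) / (1 - gamma2 * z^-1).

    lp_coeffs holds a_1..a_p of an assumed A(z) = 1 + sum(a_i * z^-i).
    Filter memories start at zero (an assumption of this sketch).
    """
    if not 0.0 < gamma2 < gamma1 <= 1.0:
        raise ValueError("expected 0 < gamma2 < gamma1 <= 1")
    # Numerator A(z/gamma1): coefficient a_i is scaled by gamma1**i.
    numerator = [1.0] + [a * gamma1 ** (i + 1) for i, a in enumerate(lp_coeffs)]
    output = []
    previous = 0.0
    for n in range(len(speech)):
        acc = 0.0
        for i, b in enumerate(numerator):
            if n - i >= 0:
                acc += b * speech[n - i]
        # Fixed one-pole denominator 1 / (1 - gamma2 * z^-1).
        acc += gamma2 * previous
        output.append(acc)
        previous = acc
    return output
```

Because the denominator does not depend on the LP coefficients, only the numerator has to be recomputed when a new LP filter is available, which is one practical reading of the "fixed denominator" mentioned above.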
the computation of the correlations is as follows.
the correlations $r_x(k)$ are computed on the weighted speech signal $s_w(n)$.
the length of the autocorrelation computation $L_k$ is dependent on the pitch period.
$L_k = 80$ samples for $p_k \le 62$ samples
$L_k = 124$ samples for $62 < p_k \le 122$ samples
$L_k = 230$ samples for $p_k > 122$ samples
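The pitch-dependent correlation length can be sketched as follows. The boundaries at 62 and 122 samples are read here as inclusive upper limits of the first two ranges, and the normalized correlation uses the usual cross-correlation divided by the geometric mean of the two segment energies; both are assumptions of this sketch, as are the function names and the `start` index convention.

```python
import math


def correlation_length(pitch_lag):
    """Return L_k for an open-loop pitch lag p_k (non-decimated signal)."""
    if pitch_lag <= 62:
        return 80
    if pitch_lag <= 122:
        return 124
    return 230


def normalized_correlation(weighted, start, pitch_lag):
    """Correlate weighted[start:start+L_k] with the segment one lag earlier."""
    length = correlation_length(pitch_lag)
    cross = energy_now = energy_past = 0.0
    for n in range(start, start + length):
        now = weighted[n]
        past = weighted[n - pitch_lag]
        cross += now * past
        energy_now += now * now
        energy_past += past * past
    denom = math.sqrt(energy_now * energy_past)
    return cross / denom if denom > 0.0 else 0.0
```

For a perfectly periodic signal whose period equals the pitch lag, the function returns a value very close to 1, while aperiodic (unvoiced) segments give values near 0.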
the weighted speech signal can be decimated by 2 to simplify the open loop pitch search.
the weighted speech signal can be low-pass filtered before decimation.
$L_k = 40$ samples for $p_k \le 31$ samples
$L_k = 62$ samples for $31 < p_k \le 61$ samples
$L_k = 115$ samples for $p_k > 61$ samples
Other methods can be used to compute the correlations. For example, only one normalized correlation value can be computed for the whole frame instead of averaging several normalized correlations. Further, the correlations can be computed on signals other than the weighted speech such as the residual signal, the speech signal, or a low-pass filtered residual, speech, or weighted speech signal.
the spectral tilt parameter contains the information about the frequency distribution of energy.
the spectral tilt is estimated in the frequency domain as a ratio between the energy concentrated in low frequencies and the energy concentrated in high frequencies. However, it can be also estimated in different ways such as a ratio between the two first autocorrelation coefficients of the speech signal.
the discrete Fourier Transform is used to perform the spectral analysis in module 210 of Figure 10.
the frequency analysis and the tilt computation are done twice per frame. A 256-point Fast Fourier Transform (FFT) is used with 50 percent overlap.
FFT Fast Fourier Transform
the analysis windows are placed so that the entire lookahead is exploited. The beginning of the first window is placed 24 samples after the beginning of the current frame. The second window is placed 128 samples further. Different windows can be used to weight the input signal for the frequency analysis.
the square root of a Hann window (which is equivalent to a sine window) is used. This window is particularly well suited for overlap-add methods; this spectral analysis can therefore also be used in an optional noise suppression algorithm based on spectral subtraction and overlap-add analysis/synthesis. Since noise suppression algorithms are believed to be well known in the art, they will not be described herein in more detail.
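The equivalence to a sine window holds exactly for the raised-cosine (Hann) window $0.5 - 0.5\cos(2\pi n/L)$, since $0.5 - 0.5\cos(2x) = \sin^2(x)$. The short Python sketch below checks this numerically and also shows why the window suits overlap-add processing with 50 percent overlap: the squared windows of two overlapping frames sum to one. The function names and the periodic window definition are choices of this sketch.

```python
import math


def sqrt_hann_window(length):
    """Square root of a periodic Hann window of the given length."""
    return [math.sqrt(0.5 - 0.5 * math.cos(2.0 * math.pi * n / length))
            for n in range(length)]


def sine_window(length):
    """Sine window sin(pi * n / length), n = 0..length-1."""
    return [math.sin(math.pi * n / length) for n in range(length)]


def max_abs_difference(a, b):
    """Largest absolute element-wise difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))
```

With a 256-point analysis and a hop of 128 samples, applying the window once at analysis and once at synthesis gives `w[n]**2 + w[n + 128]**2`, which equals 1 up to rounding, so the overlapped frames reconstruct the signal without amplitude modulation.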
Critical bands = {100.0, 200.0, 300.0, 400.0, 510.0, 630.0, 770.0, 920.0, 1080.0, 1270.0, 1480.0, 1720.0, 2000.0, 2320.0, 2700.0, 3150.0, 3700.0, 4400.0, 5300.0, 6350.0} Hz.
the energy in low frequencies is computed as the average of the energies in the first 10 critical bands.
the middle critical bands have been excluded from the computation to improve the discrimination between frames with high-energy concentration in low frequencies (generally voiced) and with high-energy concentration in high frequencies (generally unvoiced). In between, the energy content is not characteristic for any of the classes and increases the decision confusion.
$w_h(k)$ is set to 1 if the distance between the bin and the nearest harmonic is not larger than a certain frequency threshold (50 Hz) and is set to 0 otherwise.
a priori unvoiced sounds are determined when $r_x(0) + r_x(1) + r_e < 0.6$, where the value $r_e$ is a correction added to the normalized correlation as described above.
the estimated noise energies have been added to the tilt computation to account for the presence of background noise.
spectral tilt computation is performed twice per frame to obtain $e_{tilt}(0)$ and $e_{tilt}(1)$, corresponding to the two spectral analyses per frame.
the signal energy is evaluated twice per subframe, i.e. 8 times per frame, based on short-time segments of length 32 samples. Further, the short-term energies of the last 32 samples from the previous frame and the first 32 samples from next frame are also computed.
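A minimal Python sketch of the short-term energy evaluation follows. It assumes a 256-sample frame (20 ms at 12.8 kHz), which gives the eight 32-sample segments mentioned above, and it uses the sum of squared samples as the segment energy; the energy definition and the function names are assumptions of this sketch.

```python
def segment_energy(samples):
    """Sum of squared samples (assumed energy measure for this sketch)."""
    return sum(x * x for x in samples)


def short_term_energies(prev_frame, frame, next_frame, seg_len=32):
    """Return the energy of the last segment of the previous frame,
    of every segment of the current frame, and of the first segment
    of the next frame, in that order."""
    energies = [segment_energy(prev_frame[-seg_len:])]
    for start in range(0, len(frame), seg_len):
        energies.append(segment_energy(frame[start:start + seg_len]))
    energies.append(segment_energy(next_frame[:seg_len]))
    return energies
```

For a 256-sample frame the function returns ten values, so that energy changes across both frame boundaries can be taken into account when the variation within the frame is evaluated.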
other methods can be used to evaluate the energy variation in the frame.
the relative energy of the frame is given by the difference between the frame energy in dB and the long-term average energy.
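The relative frame energy can be written as a one-line difference in the dB domain. In this Python sketch the frame energy is taken as the mean squared sample value, and a small floor avoids taking the logarithm of zero; the energy definition, the floor value and the function names are assumptions made for illustration.

```python
import math


def frame_energy_db(frame, floor=1e-10):
    """Mean squared sample value of the frame, expressed in dB."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))


def relative_energy_db(frame, long_term_average_db):
    """Frame energy in dB minus the long-term average energy in dB."""
    return frame_energy_db(frame) - long_term_average_db
```

A negative result indicates a frame that is quieter than the long-term average.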
the relative frame energy is used to identify low energy frames that have not been classified as background noise frames or unvoiced frames. These frames can be encoded with a generic HR encoder in order to reduce the ADR.
the classification of unvoiced speech frames is based on the parameters described above, namely: the voicing measure $r_x$, the spectral tilt $e_t$, the energy variation within a frame $dE$, and the relative frame energy $E_{rel}$.
the decision is made based on at least three of these parameters.
the decision thresholds are set based on the operating mode (the required average data rate). For operating modes with lower desired data rates, the thresholds are set to favor unvoiced classification (since half-rate or quarter-rate coding will then be used to encode the frame).
Unvoiced frames are usually encoded with the unvoiced HR encoder. However, in the economy mode, unvoiced QR may also be used to further reduce the ADR if certain additional conditions are satisfied.
a decision hangover is used.
when the algorithm decides that the frame is an inactive speech frame, a local VAD is set to zero, but the actual VAD flag is set to zero only after a certain number of frames have elapsed (the hangover period). This avoids clipping of speech offsets.
if the local VAD is zero, the frame is classified as an unvoiced frame.
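The hangover behaviour can be sketched with a small counter. In this Python sketch the class name `HangoverVad` and the parameter `hangover_frames` are hypothetical, and the hangover length is left to the caller because no value is stated in the text.

```python
class HangoverVad:
    """Keep the actual VAD flag at 1 for a number of frames after the
    local VAD decision has dropped to 0."""

    def __init__(self, hangover_frames):
        self.hangover_frames = hangover_frames
        self.remaining = 0

    def update(self, local_vad):
        """Return the actual VAD flag (1 or 0) for the current frame."""
        if local_vad:
            # Active frame: reload the hangover counter.
            self.remaining = self.hangover_frames
            return 1
        if self.remaining > 0:
            # Hangover period: still flagged active although local VAD is 0.
            self.remaining -= 1
            return 1
        return 0
```

Frames that fall inside the hangover period have a local VAD of zero but an actual VAD flag of one; these are the frames that the preceding paragraph classifies as unvoiced.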
method 200 can be used for discriminating unvoiced frame.
the Voiced HR coding type makes use of signal modification for efficiently encoding stable voiced frames.
Signal modification techniques adjust the pitch of the signal to a predetermined delay contour.
Long term prediction maps the past excitation signal to the present subframe using this delay contour and scaling by a gain parameter.
the delay contour is obtained straightforwardly by interpolating between two open-loop pitch estimates, the first obtained in the previous frame and the second in the current frame. Interpolation gives a delay value for every time instant of the frame.
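A delay contour of this kind can be sketched in a few lines of Python. Linear interpolation is assumed here, with the contour reaching the current open-loop pitch estimate at the last sample of the frame; the interpolation rule, the default frame length of 256 samples and the function name are assumptions of this sketch.

```python
def delay_contour(previous_pitch, current_pitch, frame_len=256):
    """Return one interpolated delay value per sample of the frame.

    The contour moves linearly from the previous frame's open-loop
    pitch estimate towards the current one (assumed interpolation).
    """
    step = (current_pitch - previous_pitch) / frame_len
    return [previous_pitch + step * (n + 1) for n in range(frame_len)]
```

When both estimates are equal the contour is constant, which corresponds to a perfectly stationary pitch.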
the pitch in the subframe to be coded currently is adjusted to follow this artificial contour by warping, changing the time scale of the signal.
discontinuous warping [1, 4, 5]
a signal segment is shifted either to the left or to the right without altering the segment length.
Discontinuous warping requires a procedure for handling the resulting overlapping or missing signal portions.
the tolerated change in the time scale is kept small.
warping is typically done using the LP residual signal or the weighted speech signal to reduce the resulting distortions.
the use of these signals instead of the speech signal also facilitates detection of pitch pulses and low-power regions in between them, and thus the determination of the signal segments for warping.
the actual modified speech signal is generated by inverse filtering.
the coding can proceed in conventional manner except the adaptive codebook excitation is generated using the predetermined delay contour.
signal modification is done pitch and frame synchronously, that is, adapting one pitch cycle segment at a time in the current frame such that a subsequent speech frame starts in perfect time alignment with the original signal.
the pitch cycle segments are limited by frame boundaries. This prevents time shifts from propagating over frame boundaries, which simplifies the encoder implementation and reduces the risk of artifacts in the modified speech signal. This also simplifies variable bit rate operation between signal modification enabled and disabled coding types, since every new frame starts in time alignment with the original signal.
if a frame is classified neither as an inactive speech frame nor as an unvoiced frame, it is tested whether it is a stable voiced frame (step 110).
Classification of stable voiced frames is performed using a closed-loop approach in conjunction with the signal modification procedure used for encoding stable voiced frames.
Figure 4 illustrates a method 300 for discriminating stable voiced frame according to an illustrative embodiment of a fourth aspect of the present invention.
the sub-procedures in the signal modification yield indicators quantifying the attainable performance of long term prediction in the current frame. If any of these indicators is outside its allowed limits, the signal modification procedure is terminated by one of the logic blocks. In this case, the original signal is preserved intact, and the frame is not classified as a stable voiced frame. This integrated logic allows maximizing the quality of the modified speech signal after signal modification and coding at a low bit rate.
the pitch pulse search procedure of step 302 produces several indicators on the periodicity of the current frame. Hence the logic block following it is an important component of the classification logic. The evolution of the pitch-cycle length is observed. The logic block compares the distance of the detected pitch pulse positions against the interpolated open-loop pitch estimate as well as against the distance of previously detected pitch pulses. The signal modification procedure is terminated if the difference to the open-loop pitch estimate or to the previous pitch cycle lengths is too large.
the selection of the delay contour in step 304 gives additional information on the evolution of the pitch cycles and the periodicity of the current speech frame.
the signal modification procedure is continued from this block if the condition
the shape of pitch cycle segments is kept similar over the frame to allow faithful signal modeling by long-term prediction and thus coding at a low bit rate without degrading the subjective quality.
the similarity of successive segments can be quantified by the normalized correlation between the current segment and the target signal at the optimal shift. Shifting of the pitch cycle segments maximizing their correlation with the target signal enhances the periodicity and yields a high long-term prediction gain if the signal modification is useful.
the success of the procedure is guaranteed by requiring that all the correlation values must be larger than a predefined threshold. If this condition is not fulfilled for all segments, the signal modification procedure is terminated and the original signal is kept intact. In general, a slightly lower gain threshold range can be allowed on male voices with equal coding performance. Gain thresholds can be changed in different operating modes of the VBR codec to adjust the usage of the coding modes that apply the signal modification and thus change the targeted average bit rate.
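The acceptance test on the segment correlations reduces to a check that every value exceeds the threshold. In this Python sketch the function name is hypothetical, the threshold is supplied by the caller (it may differ between operating modes), and treating an empty list as a failure is a choice of the sketch rather than something stated in the text.

```python
def signal_modification_accepted(segment_correlations, threshold):
    """Return True only if every pitch cycle segment correlates with the
    target signal above the threshold.

    An empty list is treated as a failure (assumption of this sketch).
    """
    if not segment_correlations:
        return False
    return all(c > threshold for c in segment_correlations)
```

Raising the threshold makes fewer frames qualify for the coding types that use signal modification, which is how the targeted average bit rate can be adjusted.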
the complete rate selection logic according to the method 100 comprises three steps, each of them discriminating a specific signal class.
One of the steps includes the signal modification algorithm as its integral part.
a VAD discriminates between active and inactive speech frames. If an inactive speech frame is detected, the classification method ends as the frame is regarded as background noise and encoded, for example, with a comfort noise generator. If an active speech frame is detected, the frame is subjected to the second step, dedicated to discriminating unvoiced frames. If the frame is classified as an unvoiced speech signal, the classification chain ends, and the frame is encoded with a mode dedicated to unvoiced frames. As the last step, the speech frame is processed through the proposed signal modification procedure, which enables the modification if the conditions described earlier in this subsection are verified.
if the frame is classified as a stable voiced frame, the pitch of the original signal is adjusted to an artificial, well-defined delay contour, and the frame is encoded using a specific mode optimized for these types of frames. Otherwise, the frame is likely to contain a non-stationary speech segment such as a voiced onset or a rapidly evolving voiced speech signal. These frames typically require a more generic coding model and are usually encoded with a Generic FR coding type. However, if the relative energy of the frame is lower than a certain threshold, these frames can be encoded with a Generic HR coding type to further reduce the ADR.
the described codec is based on the adaptive multi-rate wideband (AMR-WB) speech codec that was recently selected by the ITU-T (International Telecommunications Union - Telecommunication Standardization Sector) for several wideband speech services and by 3GPP (third generation partnership project) for GSM and W-CDMA third generation wireless systems.
AMR-WB codec consists of nine bit rates, namely 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, and 23.85 kbit/s.
An AMR-WB-based source-controlled VBR codec for CDMA systems enables interoperation between CDMA and other systems using the AMR-WB codec.
the AMR-WB bit rate of 12.65 kbit/s, which is the closest rate that can fit in the 13.3 kbit/s full-rate of Rate Set II, can be used as the common rate between a CDMA wideband VBR codec and AMR-WB. This enables interoperability without the need for transcoding (which degrades the speech quality).
Lower rate coding types are provided specifically for the CDMA VBR wideband solution to enable the efficient operation in the Rate Set II framework.
the codec can then operate in a few CDMA-specific modes using all rates, but it will also have a mode that enables interoperability with systems using the AMR-WB codec.
The coding methods according to embodiments of the present invention are summarized in Table 1 and will be generally referred to as coding types.

Table 1. Coding types used in the illustrative embodiments with corresponding bit rates.

Coding Type         Bit Rate [kbit/s]   Bits / 20 ms frame
Generic FR          13.3                266
Interoperable FR    13.3                266
Voiced HR           6.2                 124
Unvoiced HR         6.2                 124
Interoperable HR    6.2                 124
Generic HR          6.2                 124
Unvoiced QR         2.7                 54
CNG QR              2.7                 54
CNG ER              1.0                 20
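The bits-per-frame column of Table 1 follows directly from the bit rate and the 20 ms frame length, since kbit/s multiplied by ms gives bits. The short Python sketch below reproduces the column; the dictionary and function names are illustrative only.

```python
FRAME_MS = 20

# Bit rates in kbit/s as listed in Table 1.
RATES_KBPS = {
    "Generic FR": 13.3,
    "Interoperable FR": 13.3,
    "Voiced HR": 6.2,
    "Unvoiced HR": 6.2,
    "Interoperable HR": 6.2,
    "Generic HR": 6.2,
    "Unvoiced QR": 2.7,
    "CNG QR": 2.7,
    "CNG ER": 1.0,
}


def bits_per_frame(rate_kbps, frame_ms=FRAME_MS):
    """Bits in one frame; round() absorbs floating-point noise."""
    return round(rate_kbps * frame_ms)


BITS = {name: bits_per_frame(rate) for name, rate in RATES_KBPS.items()}
```

The four distinct frame sizes, 266, 124, 54 and 20 bits, correspond to the full-rate, half-rate, quarter-rate and eighth-rate frames.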
The full-rate (FR) coding types are based on the AMR-WB standard codec at 12.65 kbit/s.
The use of the 12.65 kbit/s rate of the AMR-WB codec enables the design of a variable bit rate codec for the CDMA system capable of interoperating with other systems using the AMR-WB codec standard.
An extra 13 bits per frame are added to fit in the 13.3 kbit/s full-rate of CDMA Rate Set II. These bits are used to improve the codec robustness in case of erased frames and essentially make the difference between the Generic FR and Interoperable FR coding types (they are unused in the Interoperable FR).
The FR coding types are based on the algebraic code-excited linear prediction (ACELP) model optimized for general wideband speech signals. The model operates on 20 ms speech frames with a sampling frequency of 16 kHz. Before further processing, the input signal is down-sampled to a 12.8 kHz sampling frequency and preprocessed. The LP filter parameters are encoded once per frame using 46 bits. The frame is then divided into four subframes, where adaptive and fixed codebook indices and gains are encoded once per subframe.
The fixed codebook is constructed using an algebraic codebook structure where the 64 positions in a subframe are divided into 4 tracks of interleaved positions and where 2 signed pulses are placed in each track.
The two pulses per track are encoded using 9 bits, giving a total of 36 bits per subframe. More details about the AMR-WB codec can be found in reference [1].
The bit allocations for the FR coding types are given in Table 2.

Table 2. Bit allocation of Generic and Interoperable full-rate CDMA2000 Rate Set II based on the AMR-WB standard at 12.65 kbit/s.

Parameter	Generic FR (bits per frame)	Interoperable FR (bits per frame)
Class Info	-	-
VAD bit	-	1
LP Parameters	46	46
Pitch Delay	30	30
Pitch Filtering	4	4
Gains	28	28
Algebraic Codebook	144	144
FER protection bits	14	-
Unused bits	-	13
Total	266	266
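The entries of Table 2 can be cross-checked against the figures given in the text: 4 tracks at 9 bits each over 4 subframes for the algebraic codebook, and 13 bits of headroom between the 253-bit AMR-WB frame and the 266-bit Rate Set II full-rate frame. The sketch below only restates those numbers; the dictionary layout and key names are assumptions made for illustration.

```python
SUBFRAMES_PER_FRAME = 4
TRACKS_PER_SUBFRAME = 4
BITS_PER_TRACK = 9  # two signed pulses per track, as stated in the text

# 4 subframes x 4 tracks x 9 bits
ALGEBRAIC_BITS = SUBFRAMES_PER_FRAME * TRACKS_PER_SUBFRAME * BITS_PER_TRACK

AMR_WB_12_65_BITS = 253    # 12.65 kbit/s over a 20 ms frame
RATE_SET_II_FR_BITS = 266  # 13.3 kbit/s over a 20 ms frame

# Bits per frame copied from Table 2 ("-" entries are written as 0).
TABLE_2 = {
    "Generic FR": {
        "vad": 0, "lp": 46, "pitch_delay": 30, "pitch_filtering": 4,
        "gains": 28, "algebraic": ALGEBRAIC_BITS,
        "fer_protection": 14, "unused": 0,
    },
    "Interoperable FR": {
        "vad": 1, "lp": 46, "pitch_delay": 30, "pitch_filtering": 4,
        "gains": 28, "algebraic": ALGEBRAIC_BITS,
        "fer_protection": 0, "unused": 13,
    },
}


def frame_total(coding_type: str) -> int:
    """Sum of all fields of one Table 2 column."""
    return sum(TABLE_2[coding_type].values())
```

Note that Generic FR has 14 protection bits rather than 13 because it does not carry the VAD bit that the AMR-WB payload contains.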
The Half-Rate Voiced coding is used.
The half-rate voiced bit allocation is given in Table 3. Since the frames to be coded in this mode are characteristically very periodic, a substantially lower bit rate suffices for sustaining good subjective quality compared, for instance, to transition frames.
Signal modification is used, which allows efficient coding of the delay information using only nine bits per 20-ms frame, saving a considerable proportion of the bit budget for other signal-coding parameters. In signal modification, the signal is forced to follow a certain pitch contour that can be transmitted with 9 bits per frame. Good performance of long-term prediction allows using only 12 bits per 5-ms subframe for the fixed-codebook excitation without sacrificing the subjective speech quality.
The fixed codebook is an algebraic codebook and comprises two tracks with one pulse each, where each track has 32 possible positions.
Table 3. Bit allocation of half-rate Generic, Voiced, Unvoiced, and Interoperable coding types according to CDMA2000 Rate Set II.

Parameter	Generic HR	Voiced HR	Unvoiced HR	Interoperable HR
Class Info	1	3	2	3
VAD bit	-	-	-	1
LP Parameters	36	36	46	46
Pitch Delay	13	9	-	30
Pitch Filtering	-	2	-	4
Gains	26	26	24	28
Algebraic Codebook	48	48	52	-
FER protection bits	-	-	-	-
Unused bits	-	-	-	12
Total	124	124	124	124
In the Unvoiced HR coding type, the adaptive codebook (or pitch codebook) is not used.
A 13-bit Gaussian codebook is used in each subframe, where the codebook gain is encoded with 6 bits per subframe. It is to be noted that in cases where the average bit rate needs to be further reduced, unvoiced quarter-rate can be used in case of stable unvoiced frames.
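The per-subframe figures quoted for the Voiced HR and Unvoiced HR coding types can be related to the per-frame entries of Table 3 with a few lines of arithmetic. This is only a sketch: the split of each 6-bit pulse code into 5 position bits plus 1 sign bit is an assumption inferred from the 32 positions per track and the 12 bits per subframe, and all names are illustrative.

```python
from math import log2

SUBFRAMES_PER_FRAME = 4


def pulse_bits(positions_per_track: int) -> int:
    """Bits for one signed pulse: position index plus one sign bit (assumed split)."""
    return int(log2(positions_per_track)) + 1


# Voiced HR: two tracks with one pulse each, 32 positions per track.
VOICED_FIXED_BITS_PER_SUBFRAME = 2 * pulse_bits(32)
VOICED_FIXED_BITS_PER_FRAME = SUBFRAMES_PER_FRAME * VOICED_FIXED_BITS_PER_SUBFRAME

# Unvoiced HR: 13-bit Gaussian codebook and 6-bit gain in each subframe.
UNVOICED_CODEBOOK_BITS_PER_FRAME = SUBFRAMES_PER_FRAME * 13
UNVOICED_GAIN_BITS_PER_FRAME = SUBFRAMES_PER_FRAME * 6

# Column totals rebuilt from Table 3 (class info, LP, pitch, filtering, gains, codebook).
VOICED_HR_TOTAL = 3 + 36 + 9 + 2 + 26 + VOICED_FIXED_BITS_PER_FRAME
UNVOICED_HR_TOTAL = 2 + 46 + UNVOICED_GAIN_BITS_PER_FRAME + UNVOICED_CODEBOOK_BITS_PER_FRAME
```

Both totals come out at the 124 bits available in a Rate Set II half-rate frame.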
A generic half-rate mode is used for low energy segments.
This Generic HR mode can also be used in maximum half-rate operation, as will be explained later.
The bit allocation of the Generic HR is shown in Table 3 above.
In Generic HR, 1 bit is used to indicate whether the frame is Generic HR or another HR type.
In Unvoiced HR, 2 bits are used for classification: the first bit indicates that the frame is not Generic HR and the second bit indicates that it is Unvoiced HR and not Voiced HR or Interoperable HR (to be explained later).
In Voiced HR, 3 bits are used: the first 2 bits indicate that the frame is not Generic or Unvoiced HR, and the third bit indicates whether the frame is Voiced or Interoperable HR.
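The class information described above behaves like a prefix code of 1, 2, or 3 bits. The sketch below shows how a decoder could read it. The concrete bit values (0 for "this type", 1 for "continue") are an assumption chosen for illustration; the text only fixes the number of bits per type, not their values. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def read_hr_class_info(bits: Sequence[int]) -> Tuple[str, int]:
    """Return (coding type, number of class-info bits consumed).

    Assumed bit values (not given in the text):
      0       -> Generic HR
      1 0     -> Unvoiced HR
      1 1 0   -> Voiced HR
      1 1 1   -> Interoperable HR
    """
    if bits[0] == 0:
        return "Generic HR", 1
    if bits[1] == 0:
        return "Unvoiced HR", 2
    if bits[2] == 0:
        return "Voiced HR", 3
    return "Interoperable HR", 3
```

The number of bits consumed matches the Class Info row of Table 3 (1, 2, 3, and 3 bits respectively).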
In the Economy mode, most of the unvoiced frames can be encoded using the Unvoiced QR coder.
The Gaussian codebook indices are generated randomly and the gain is encoded with only 5 bits per subframe.
the LP filter coefficients are quantized with lower bit rate. 1 bit is used for the discrimination among the two quarter-rate coding types: Unvoiced QR and CNG QR. The bit allocation for unvoiced coding types is given in 6.
The Interoperable HR coding type makes it possible to cope with situations where the CDMA system imposes HR as a maximum rate for a particular frame while the frame has been classified as full rate.
The Interoperable HR is directly derived from the full rate coder by dropping the fixed codebook indices after the frame has been encoded as a full rate frame (Table 4).
The fixed codebook indices can be randomly generated and the decoder will operate as if it were in full-rate.
This design has the advantage that it minimizes the impact of the forced half-rate mode during tandem-free operation between the CDMA system and other systems using the AMR-WB standard (such as the mobile GSM system or the W-CDMA third generation wireless system).
The Interoperable FR coding type or CNG QR is used for tandem-free operation (TFO) with AMR-WB.
The VMR-WB codec will use the Interoperable HR coding type.
Randomly generated algebraic codebook indices are added to the bit stream to output a 12.65 kbit/s rate.
The AMR-WB decoder at the receiver side will interpret it as an ordinary 12.65 kbit/s frame.
The Comfort Noise Generation (CNG) technique is used for processing of inactive speech frames.
The CNG eighth rate (ER) coding type is used to encode inactive speech frames when operating within the CDMA system.
The CNG ER cannot always be used, as its bit rate is lower than the bit rate necessary to transmit the update information for the CNG decoder in AMR-WB [3].
In this case, the CNG QR is used.
The AMR-WB codec often operates in Discontinuous Transmission Mode (DTX). During discontinuous transmission, the background noise information is not updated every frame. Typically, only one frame out of 8 consecutive inactive speech frames is transmitted.
This update frame is referred to as Silence Descriptor (SID) [4].
The DTX operation is not used in the CDMA system, where every frame is encoded. Consequently, only SID frames need to be encoded with CNG QR at the CDMA side, and the remaining frames can still be encoded with CNG ER to lower the ADR, as they are not used by the AMR-WB counterpart.
In CNG coding, only the LP filter parameters and a gain are encoded once per frame.
The bit allocation for the CNG QR is given in Table 4 and that of CNG ER is given in Table 5.
A method 400 for digitally encoding a sound signal according to a second illustrative embodiment of the second aspect of the present invention is illustrated in Figure 5. It is to be noted that the method 400 is a specific application of the method 100 in the Premium mode, which is provided for maximum synthesized speech quality given the available bit rates (the case where the system limits the maximum available rate for a particular frame is described in a separate subsection). Consequently, most of the active speech frames are encoded at full rate, i.e. 13.3 kbit/s.
A voice activity detector (VAD) discriminates between active and inactive speech frames (step 102).
The VAD algorithm can be identical for all modes of operation. If an inactive speech frame is detected (background noise signal), the classification method stops and the frame is encoded with the CNG ER coding type at 1.0 kbit/s according to CDMA Rate Set II (step 402). If an active speech frame is detected, the frame is subjected to a second classifier dedicated to discriminating unvoiced frames (step 404). As the Premium mode is aimed at the best possible quality, the unvoiced frame discrimination is very strict and only highly stationary unvoiced frames are selected. The unvoiced classification rules and decision thresholds are as given above.
If the second classifier classifies the frame as an unvoiced speech signal, the classification method stops, and the frame is encoded using the Unvoiced HR coding type (step 408) optimized for unvoiced signals (6.2 kbit/s according to CDMA Rate Set II). All other frames are processed with the Generic FR coding type, based on the AMR-WB standard at 12.65 kbit/s (step 406).
A method 500 for digitally encoding a sound signal according to a third illustrative embodiment of the second aspect of the present invention is illustrated in Figure 6.
The method 500 allows the classification of a speech signal and its encoding in the Standard mode.
A VAD discriminates between active and inactive speech frames. If an inactive speech frame is detected, the classification method stops and the frame is encoded as a CNG ER frame (step 510). If an active speech frame is detected, the frame is subjected to a second-level classifier dedicated to discriminating unvoiced frames (step 404). The unvoiced classification rules and decision thresholds are described above. If the second-level classifier classifies the frame as an unvoiced speech signal, the classification method stops, and the frame is encoded with an Unvoiced HR coding type (step 508). Otherwise, the speech frame is passed through to the "stable voiced" classification module (step 502). The discrimination of the voiced frames is an inherent feature of the signal modification algorithm as described hereinabove.
If the frame is suitable for signal modification, the frame is classified as a stable voiced frame and encoded with the Voiced HR coding type (step 506) in a module optimized for stable voiced signals (6.2 kbit/s according to CDMA Rate Set II). Otherwise, the frame is likely to contain a nonstationary speech segment such as a voiced onset or a rapidly evolving voiced speech signal. These frames typically require a high bit rate for sustaining good subjective quality. However, if the energy of the frame is lower than a certain threshold, the frames can be encoded with a Generic HR coding type. Thus, if in step 512 the fourth-level classifier detects a low energy signal, the frame is encoded using Generic HR (step 514). Otherwise, the speech frame is encoded as a Generic FR frame (13.3 kbit/s according to CDMA Rate Set II) (step 504).
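The Standard mode decision chain described above is a cascade of four tests. A minimal sketch follows, assuming the four classifier outcomes are already available as booleans; how each flag is computed (VAD, unvoiced rules, signal modification, energy threshold) is outside this sketch, and the function and parameter names are illustrative.

```python
def standard_mode_coding_type(
    is_active: bool,
    is_unvoiced: bool,
    is_stable_voiced: bool,
    is_low_energy: bool,
) -> str:
    """Select a coding type for one frame in Standard mode.

    Each test is only reached if all earlier tests failed, mirroring
    the order in which the classifiers are applied in the text.
    """
    if not is_active:
        return "CNG ER"        # inactive frame, 1.0 kbit/s
    if is_unvoiced:
        return "Unvoiced HR"   # 6.2 kbit/s
    if is_stable_voiced:
        return "Voiced HR"     # suitable for signal modification, 6.2 kbit/s
    if is_low_energy:
        return "Generic HR"    # low-energy remainder, 6.2 kbit/s
    return "Generic FR"        # everything else, 13.3 kbit/s
```

The Premium mode described earlier corresponds to the same cascade with the stable-voiced and low-energy tests removed and a stricter unvoiced test.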
A method 600 for digitally encoding a sound signal according to a fourth illustrative embodiment of the first aspect of the present invention is illustrated in Figure 7.
The method 600, which is a four-level classification method, allows the classification of a speech signal and its encoding in the Economy mode.
The Economy mode allows for maximum system capacity while still producing high quality wideband speech.
The rate determination logic is similar to that of the Standard mode, with the exception that the Unvoiced QR coding type is also used and the use of Generic FR is reduced.
A VAD discriminates between active and inactive speech frames. If an inactive speech frame is detected, the classification method stops and the frame is encoded as a CNG ER frame (step 402). If an active speech frame is detected, the frame is subjected to a second classifier dedicated to discriminating all unvoiced frames (step 106). The unvoiced classification rules and decision thresholds have been described above. If the second classifier classifies the frame as an unvoiced speech signal, the speech frame is passed to a first third-level classifier (step 602). The third-level classifier checks whether the frame is on a voiced-unvoiced transition using the rules described above.
This third-level classifier tests whether the last frame is either an unvoiced or a background noise frame, and whether at the end of the frame the energy is concentrated in high frequencies and no potential voiced onset is detected in the lookahead.
If the frame is on a voiced-unvoiced transition, the frame is encoded in step 508 with the Unvoiced HR coding type. Otherwise, the speech frame is encoded with the Unvoiced QR coding type (step 604). Frames not classified as unvoiced are passed through to a "stable voiced" classification module, which is a second third-level classifier (step 110). The discrimination of the voiced frames is an inherent feature of the signal modification algorithm as explained earlier. If the frame is suitable for signal modification, it is classified as a stable voiced frame and encoded with Voiced HR in step 506. As in the Standard mode, the remaining frames (not classified as unvoiced or stable voiced) are tested for low energy content. If a low energy signal is detected in step 512, the frame is encoded in step 514 using Generic HR. Otherwise, the speech frame is encoded as a Generic FR frame (13.3 kbit/s according to CDMA Rate Set II) (step 504).
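The Economy mode adds one branch to the Standard mode cascade: unvoiced frames are split between Unvoiced HR and Unvoiced QR. The sketch below assumes, in line with the earlier remark that quarter-rate is used for stable unvoiced frames, that frames on a voiced-unvoiced transition get Unvoiced HR and all other unvoiced frames get Unvoiced QR. It also shows how an average data rate (ADR) could be computed from the Table 1 rates. All names are illustrative.

```python
from typing import Iterable

# Bit rates in kbit/s from Table 1 for the coding types used in Economy mode.
RATE_KBPS = {
    "CNG ER": 1.0,
    "Unvoiced QR": 2.7,
    "Unvoiced HR": 6.2,
    "Voiced HR": 6.2,
    "Generic HR": 6.2,
    "Generic FR": 13.3,
}


def economy_mode_coding_type(
    is_active: bool,
    is_unvoiced: bool,
    on_transition: bool,
    is_stable_voiced: bool,
    is_low_energy: bool,
) -> str:
    """Select a coding type for one frame in Economy mode."""
    if not is_active:
        return "CNG ER"
    if is_unvoiced:
        # Assumption: transition frames keep the higher rate.
        return "Unvoiced HR" if on_transition else "Unvoiced QR"
    if is_stable_voiced:
        return "Voiced HR"
    if is_low_energy:
        return "Generic HR"
    return "Generic FR"


def average_data_rate(coding_types: Iterable[str]) -> float:
    """Mean bit rate in kbit/s over a sequence of per-frame coding types."""
    rates = [RATE_KBPS[t] for t in coding_types]
    return sum(rates) / len(rates)
```

Because all frames have the same 20 ms duration, the ADR is simply the arithmetic mean of the per-frame rates.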
A method 700 for digitally encoding a sound signal according to a fifth illustrative embodiment of the second aspect of the present invention is illustrated in Figure 8.
The method 700 allows the classification of a speech signal and the encoding in the Interoperable mode.
The Interoperable mode allows for tandem-free operation between the CDMA system and other systems using the AMR-WB standard at 12.65 kbit/s (or lower rates). In the absence of a rate limitation imposed by the CDMA system, only Interoperable FR and Comfort Noise Generators are used.
A VAD discriminates between active and inactive speech frames. If an inactive speech frame is detected, a decision is made in step 702 whether the frame should be encoded as a SID frame. As mentioned earlier, the SID frame serves to update the CNG parameters at the AMR-WB side during DTX operation [4]. Typically, only one of 8 inactive speech frames is encoded during silence periods. However, after an active speech segment, the SID update must already be sent in the 4th frame (see reference [4] for more details). As the ER is not sufficient to encode a SID frame, SID frames are encoded with CNG QR in step 704. Inactive frames other than SID frames are encoded with CNG ER in step 402.
The CNG ER frames are discarded at the system interface, as AMR-WB does not make use of them. In the opposite direction, those frames are not available (AMR-WB generates only SID frames) and are declared as frame erasures. All active speech frames are processed with the Interoperable FR coding type (step 706), which is essentially the AMR-WB coding standard at 12.65 kbit/s.
A method 800 for digitally encoding a sound signal according to a sixth illustrative embodiment of the second aspect of the present invention is illustrated in Figure 9.
The method 800 allows the classification of a speech signal and the encoding in Half-Rate Max operation for the Premium and Standard modes.
The CDMA system imposes a maximum bit rate for a particular frame. Most often, the maximum bit rate imposed by the system is limited to HR. However, the system can also impose lower rates.
All active speech frames that would conventionally be classified as FR during normal operation are now encoded using HR coding types.
The classification and rate selection mechanism then encodes all such voiced frames using Voiced HR (step 506) and all such unvoiced frames using Unvoiced HR (step 408). All remaining frames that would be classified as FR during normal operation are encoded using the Generic HR coding type in step 514, except in the Interoperable mode, where the Interoperable HR coding type is used (step 908 on Figure 10).
The signal classification and encoding mechanism is similar to the normal operation in the Standard mode.
The Generic HR (step 514) is used instead of the Generic FR coding (step 406 on Figure 5), and the thresholds used to discriminate unvoiced and voiced frames are more relaxed to allow as many frames as possible to be encoded using the Unvoiced HR and Voiced HR coding types.
The thresholds for the Economy mode are used in case of Premium or Standard mode half-rate max operation.
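Once a frame has been classified, the half-rate max constraint amounts to replacing any full-rate coding type by its half-rate counterpart. The following sketch shows only that final substitution; the relaxed voiced and unvoiced thresholds mentioned above, which move more frames into Voiced HR and Unvoiced HR before this step, are not modelled. The mapping table and function name are illustrative.

```python
# Full-rate coding types and the half-rate types that replace them
# when the CDMA system imposes HR as the maximum rate.
HALF_RATE_FALLBACK = {
    "Generic FR": "Generic HR",
    "Interoperable FR": "Interoperable HR",
}


def apply_half_rate_max(coding_type: str) -> str:
    """Return the coding type to use under a half-rate max request.

    Types that are already at half rate or below are left unchanged.
    """
    return HALF_RATE_FALLBACK.get(coding_type, coding_type)
```

Keeping the substitution separate from the classifier makes it possible to apply the same function on a frame-by-frame basis, since the rate limit is imposed per frame.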
A method 900 for digitally encoding a sound signal according to a seventh illustrative embodiment of the first aspect of the present invention is illustrated in Figure 10.
The method 900 allows the classification of a speech signal and the encoding in Half-Rate Max operation for the Economy mode.
The method 900 in Figure 10 is similar to the method 600 in Figure 7, with the exception that all frames that would have been encoded with Generic FR are now encoded with Generic HR (there is no need for low energy frame classification in half-rate max operation).
A method 920 for digitally encoding a sound signal according to an eighth illustrative embodiment of the first aspect of the present invention is illustrated in Figure 11.
The method 920 allows the classification of a speech signal and the rate determination in the Interoperable mode during half-rate max operation. Since the method 920 is very similar to the method 700 from Figure 8, only the differences between the two methods will be described herein.
A method 1000 for coding a speech signal for interoperation between AMR-WB and VMR-WB codecs will now be described according to an illustrative embodiment of the fourth aspect of the present invention with reference to Figure 12.
The method 1000 enables tandem-free operation between the AMR-WB standard codec and the source-controlled VBR codec designed, for example, for CDMA2000 systems (referred to here as the VMR-WB codec).
The VMR-WB codec makes use of bit rates that can be interpreted by the AMR-WB codec and still fit within the Rate Set II bit rates used in a CDMA codec, for example.
As the bit rates of Rate Set II are 13.3 kbit/s (FR), 6.2 kbit/s (HR), 2.7 kbit/s (QR), and 1.0 kbit/s (ER), the AMR-WB codec bit rates that can be used are 12.65, 8.85, or 6.6 kbit/s in the full rate, and the SID frames at 1.75 kbit/s in the quarter rate.
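The assignment of AMR-WB rates to Rate Set II rates stated above can be reproduced by picking, for each AMR-WB rate, the smallest Rate Set II rate that is at least as large. This is a sketch of that reasoning only; the function name and data layout are illustrative.

```python
from typing import Optional

# Rate Set II channel rates in kbit/s, as listed in the text.
RATE_SET_II_KBPS = {"FR": 13.3, "HR": 6.2, "QR": 2.7, "ER": 1.0}


def smallest_fitting_rate(codec_kbps: float) -> Optional[str]:
    """Name of the smallest Rate Set II rate that can carry codec_kbps, if any."""
    fitting = [(cap, name) for name, cap in RATE_SET_II_KBPS.items() if cap >= codec_kbps]
    return min(fitting)[1] if fitting else None


# AMR-WB speech rates and the SID rate mentioned in the text.
PLACEMENT = {rate: smallest_fitting_rate(rate) for rate in (12.65, 8.85, 6.6, 1.75)}
```

Note that 6.6 kbit/s does not fit into the 6.2 kbit/s half rate, which is why even the lowest AMR-WB speech rate has to be carried in a full-rate frame.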
AMR-WB at 12.65 kbit/s is the closest in bit rate to the CDMA2000 FR of 13.3 kbit/s, and it is used as the FR codec in this illustrative embodiment.
The link adaptation algorithm can lower the bit rate to 8.85 or 6.6 kbit/s depending on the channel conditions (in order to allocate more bits to channel coding).
The 8.85 and 6.6 kbit/s bit rates of AMR-WB can be part of the Interoperable mode and can be used at the CDMA2000 receiver in case the GSM system decides to use either of these bit rates.
Three types of I-FR are used, corresponding to the AMR-WB rates of 12.65, 8.85, and 6.6 kbit/s, and will be denoted I-FR-12, I-FR-8, and I-FR-6, respectively.
In I-FR-12, there are 13 unused bits.
The first 8 bits are used to distinguish I-FR frames from Generic FR frames (which use the extra bits to improve frame erasure concealment).
The other 5 bits are used to signal the three types of I-FR frames.
I-FR-12 is used and the lower rates are used if required by the GSM link adaptation.
The average data rate (ADR) of the speech codec is directly related to the system capacity. Therefore, attaining the lowest possible ADR with a minimal loss in speech quality is of significant importance.
The AMR-WB codec was mainly designed for GSM cellular systems and third generation wireless systems based on GSM evolution. Thus, an Interoperable mode for a CDMA2000 system may result in a higher ADR compared to a VBR codec specifically designed for CDMA2000 systems. The main reasons are:
A method for coding a speech signal for interoperation between AMR-WB and VMR-WB codecs makes it possible to overcome the above-mentioned limitations and results in a reduced ADR of the Interoperable mode, such that it is equivalent to that of the CDMA2000-specific modes with comparable speech quality.
The methods are described below for both directions of operation: VMR-WB encoding - AMR-WB decoding, and AMR-WB encoding - VMR-WB decoding.
The VAD/DTX/CNG operation of the AMR-WB standard is not required.
The VAD/CNG operation is made to be as close as possible to the AMR DTX operation.
The VAD/DTX/CNG operation in the AMR-WB codec works as follows. Seven background noise frames after an active speech period are encoded as speech frames but the VAD bit is set to zero (DTX hangover). Then an SID_FIRST frame is sent. In an SID_FIRST frame the signal is not encoded and the CNG parameters are derived from the DTX hangover (the 7 speech frames) at the decoder. It is to be noted that AMR-WB does not use DTX hangover after active speech periods which are shorter than 24 frames, in order to reduce the DTX hangover overhead.
The VAD in the VMR-WB codec does not use DTX hangover.
The first background noise frame after an active speech period is encoded at 1.75 kbit/s and sent in QR, then there are 2 frames encoded at 1 kbit/s (eighth rate) and then another frame at 1.75 kbit/s sent in QR. After that, 7 frames are sent in ER followed by one QR frame, and so on. This corresponds roughly to AMR-WB DTX operation, with the exception that no DTX hangover is used in order to reduce the ADR.
QR CNG frames can be sent less frequently, e.g. once every 12 frames.
The noise variations can be evaluated at the encoder and QR CNG frames can be sent only when the noise characteristics change (not once every 8 or 12 frames).
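The periodic QR/ER pattern described above for background noise frames can be written as a function of the frame position within the inactive period. The sketch below covers only the fixed schedule (QR, two ER, QR, then one QR every `update_period` frames); the variant that sends QR frames only when the noise characteristics change is not modelled. The function name and the 0-based indexing are assumptions made for illustration.

```python
def cng_frame_rate(n: int, update_period: int = 8) -> str:
    """Rate of the n-th background noise frame after an active speech period.

    n is 0-based: n == 0 is the first inactive frame.
    update_period is the spacing of QR updates once the pattern is
    established (8 by default, 12 for the less frequent variant).
    """
    if n == 0:
        return "QR"  # first noise frame carries a 1.75 kbit/s update
    if n < 3:
        return "ER"  # next two frames are sent at eighth rate
    # Frame 3 is the second update; afterwards one update per period.
    return "QR" if (n - 3) % update_period == 0 else "ER"


# First twelve frames of an inactive period with the default period.
PATTERN = [cng_frame_rate(n) for n in range(12)]
```

With the default period, the second update falls on the 4th inactive frame, which matches the requirement that the SID update be sent in the 4th frame after an active speech segment.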
An Interoperable half rate (I-HR) coding type is provided, which consists of encoding the frame as a full rate frame and then dropping the bits corresponding to the algebraic codebook indices (144 bits per frame in AMR-WB at 12.65 kbit/s). This reduces the bit rate to 5.45 kbit/s, which fits in the CDMA2000 Rate Set II half rate.
The dropped bits can be generated either randomly (i.e. using a random generator), pseudo-randomly (i.e. by repeating part of the existing bitstream), or in some predetermined manner.
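The I-HR bit rate and the drop-and-regenerate step can be illustrated as follows. This is a sketch under stated assumptions: a frame is represented as a dictionary of named fields rather than as a packed bitstream, since the actual bit ordering is not given here, and the field name `algebraic_codebook` is hypothetical. Only the random-generation option is shown.

```python
import random

FR_BITS = 253         # AMR-WB at 12.65 kbit/s, 20 ms frame
ALGEBRAIC_BITS = 144  # algebraic codebook indices per frame
FRAME_MS = 20

I_HR_BITS = FR_BITS - ALGEBRAIC_BITS  # bits left after dropping the indices
I_HR_KBPS = I_HR_BITS / FRAME_MS      # bits per millisecond equals kbit/s


def to_interoperable_hr(fr_frame: dict) -> dict:
    """Drop the algebraic codebook indices from a full-rate frame."""
    return {k: v for k, v in fr_frame.items() if k != "algebraic_codebook"}


def to_full_rate(hr_frame: dict, rng: random.Random) -> dict:
    """Rebuild a full-rate frame by filling in random codebook bits."""
    restored = dict(hr_frame)
    restored["algebraic_codebook"] = rng.getrandbits(ALGEBRAIC_BITS)
    return restored
```

The 109 remaining bits, together with the 3 class-info bits and 12 unused bits listed for Interoperable HR in Table 3, account for the full 124-bit half-rate frame.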
The I-HR can be used when a dim-and-burst or half-rate max request is signaled by the CDMA2000 system. This avoids declaring the speech frame as a lost frame.
The I-HR can also be used by the VMR-WB codec in the Interoperable mode to encode unvoiced frames or frames where the algebraic codebook contribution to the synthesized speech quality is minimal. This results in a reduced ADR. It should be noted that in this case, the encoder can choose the frames to be encoded in I-HR mode and thus minimize the speech quality degradation caused by the use of such frames.
The speech frames are encoded with the Interoperable mode of the VMR-WB encoder 1002, which outputs one of the following possible bit rates: I-FR for active speech frames (I-FR-12, I-FR-8, or I-FR-6); I-HR in case of dim-and-burst signaling or, as an option, to encode some unvoiced frames or frames where the algebraic codebook contribution to the synthesized speech quality is minimal; QR CNG to encode relevant background noise frames (one out of eight background noise frames as described above, or when a variation in noise characteristic is detected); and ER CNG frames for most background noise frames (background noise frames not encoded as QR CNG frames).
The validity of the frame received by the gateway from the VMR-WB encoder is tested. If it is not a valid Interoperable mode VMR-WB frame, it is sent as an erasure (speech lost type of AMR-WB). The frame is considered invalid, for example, if one of the following conditions occurs:
The method 1000 is limited by the AMR-WB DTX operation.
In the bitstream, the 1st data bit is the VAD_flag (0 for DTX hangover period, 1 for active speech).
In ER blank frames the first two bytes are set to 0x00, and in ER erasure frames the first two bytes are set to 0x04.
The first 14 bits correspond to the ISF indices, and two patterns are reserved to indicate blank frames (all-zero) or erasure frames (all-zero except the 14th bit set to 1, which is 0x04 in hexadecimal).
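The two reserved ISF patterns can be detected by looking at the first 14 bits of an ER frame. The sketch below assumes the frame is delivered as bytes with the first transmitted bit in the most significant position of the first byte; that bit ordering and the function name are assumptions for illustration.

```python
def classify_er_frame(frame: bytes) -> str:
    """Classify an eighth-rate frame as 'blank', 'erasure', or 'cng'.

    Only the first 14 bits (the ISF indices) are inspected.
    Assumes MSB-first packing of the bitstream into bytes.
    """
    if len(frame) < 2:
        raise ValueError("an ER frame needs at least two bytes")
    word = int.from_bytes(frame[:2], "big")
    isf_bits = word >> 2  # keep the first 14 of the 16 bits
    if isf_bits == 0:
        return "blank"    # all-zero pattern
    if isf_bits == 1:
        return "erasure"  # all-zero except the 14th bit
    return "cng"          # ordinary CNG parameters
```

With this packing, the erasure pattern appears as the byte pair 0x00 0x04, consistent with the hexadecimal value quoted in the text.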
When blank ER frames are detected, they are processed by the CNG decoder 1004 using the last received good CNG parameters. An exception is the case of the first received blank ER frame (CNG decoder initialization; no old CNG parameters are known yet).
The decoder uses the concealment procedure used for erased frames.
The link adaptation module in the GSM system may decide to lower the bit rate to 8.85 or 6.6 kbit/s in case of bad channel conditions. In this case, these lower bit rates need to be included in the CDMA VMR-WB solution.
In Rate Set I, the bit rates used are 8.55 kbit/s for FR, 4.0 kbit/s for HR, 2.0 kbit/s for QR, and 800 bit/s for ER.
The AMR-WB codec at 6.6 kbit/s can be used at FR, and CNG frames can be sent at either QR (SID_UPDATE) or ER for other background noise frames (similar to the Rate Set II operation described above).
An 8.55 kbit/s rate is provided which is interoperable with the 8.85 kbit/s bit rate of the AMR-WB codec. It will be referred to as Rate Set I Interoperable FR (I-FR-I).
The VAD_flag bit and an additional 5 bits are dropped to obtain an 8.55 kbit/s rate.
The dropped bits can be easily introduced at the decoder or system interface so that the 8.85 kbit/s decoder can be used.
Several methods can be used to drop the 5 bits in a way that causes little impact on the speech quality.
In Configuration 1 shown in Table 6, the 5 bits are dropped from the linear prediction (LP) parameter quantization.
In AMR-WB, 46 bits are used to quantize the LP parameters in the ISP (immittance spectrum pair) domain (using mean removal and moving average prediction).
The 16-dimensional ISP residual vector (after prediction) is quantized using split-multistage vector quantization.
The vector is split into 2 subvectors of dimensions 9 and 7, respectively.
The 2 subvectors are quantized in two stages. In the first stage each subvector is quantized with 8 bits.
The quantization error vectors are split in the second stage into 3 and 2 subvectors, respectively.
The second stage subvectors are of dimension 3, 3, 3, 3, and 4, and are quantized with 6, 7, 7, 5, and 5 bits, respectively.
The 5 bits of the last second stage subvector are dropped. These have the least impact since they correspond to the high frequency portion of the spectrum. Dropping these 5 bits is done in practice by fixing the index of the last second stage subvector to a certain value that does not need to be transmitted.
The fact that this 5-bit index is fixed is easily taken into account during the quantization at the VMR-WB encoder.
The fixed index is added either at the system interface (i.e. during VMR-WB encoder/AMR-WB decoder operation) or at the decoder (i.e. during AMR-WB encoder/VMR-WB decoder operation).
The AMR-WB decoder at 8.85 kbit/s is used to decode the Rate Set I I-FR frame.
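The bit budget of the split-multistage ISP quantizer, and the effect of dropping the last second-stage index together with the VAD_flag, can be checked numerically. The sketch below only restates the dimensions and bit counts given in the text; the tuple layout and variable names are illustrative.

```python
# (dimension, bits) for each codebook of the split-multistage quantizer.
STAGE1 = [(9, 8), (7, 8)]
STAGE2 = [(3, 6), (3, 7), (3, 7), (3, 5), (4, 5)]


def total_bits(codebooks) -> int:
    """Sum of the index sizes of a list of (dimension, bits) codebooks."""
    return sum(bits for _, bits in codebooks)


FULL_LP_BITS = total_bits(STAGE1 + STAGE2)          # all seven indices
REDUCED_LP_BITS = total_bits(STAGE1 + STAGE2[:-1])  # last index fixed, not sent

AMR_WB_8_85_BITS = 177  # 8.85 kbit/s over a 20 ms frame
VAD_FLAG_BITS = 1
DROPPED_LP_BITS = FULL_LP_BITS - REDUCED_LP_BITS

# Frame size after dropping the VAD_flag and the last second-stage index.
I_FR_I_BITS = AMR_WB_8_85_BITS - VAD_FLAG_BITS - DROPPED_LP_BITS
```

The resulting 171 bits per 20 ms frame correspond to the 8.55 kbit/s full rate of Rate Set I.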
The 5 bits are dropped from the algebraic codebook indices.
In AMR-WB at 8.85 kbit/s, a frame is divided into four 64-sample subframes.
The algebraic excitation codebook consists of dividing the subframe into 4 tracks of 16 positions and placing a signed pulse in each track. Each pulse is encoded with 5 bits: 4 bits for the position and 1 bit for the sign. Thus, for each subframe, a 20-bit algebraic codebook is used.
One way of dropping the five bits is to drop one pulse from a certain subframe, for example the 4th pulse in the 4th position-track in the 4th subframe.
This pulse can be fixed to a predetermined value (position and sign) during the codebook search.
This known pulse index can then be added at the system interface and sent to the AMR-WB decoder.
The index of this pulse is dropped at the system interface, and at the CDMA VMR-WB decoder, the pulse index can be randomly generated. Other methods can also be used to drop these bits.
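The pulse-dropping approach for the VMR-WB encoder to AMR-WB decoder direction can be sketched as follows. The packing of a pulse into a 5-bit index (sign in the top bit, position in the low 4 bits) and the chosen fixed pulse value are assumptions made for illustration; the text only states that each pulse uses 4 position bits and 1 sign bit and that the fixed value is predetermined.

```python
from typing import List

TRACKS_PER_SUBFRAME = 4
POSITIONS_PER_TRACK = 16
SUBFRAMES_PER_FRAME = 4


def pack_pulse(position: int, sign: int) -> int:
    """5-bit pulse index: assumed layout is sign bit followed by 4 position bits."""
    if not 0 <= position < POSITIONS_PER_TRACK:
        raise ValueError("position out of range")
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    return ((0 if sign > 0 else 1) << 4) | position


# Hypothetical predetermined value of the 4th pulse in the 4th subframe.
FIXED_PULSE_INDEX = pack_pulse(0, 1)


def drop_fixed_pulse(frame_pulses: List[List[int]]) -> List[int]:
    """Flatten the 4x4 pulse indices and omit the last (fixed) one."""
    flat = [index for subframe in frame_pulses for index in subframe]
    if flat[-1] != FIXED_PULSE_INDEX:
        raise ValueError("encoder did not fix the last pulse")
    return flat[:-1]


def restore_fixed_pulse(sent: List[int]) -> List[List[int]]:
    """Re-insert the known pulse index, as the system interface would."""
    flat = list(sent) + [FIXED_PULSE_INDEX]
    n = TRACKS_PER_SUBFRAME
    return [flat[i:i + n] for i in range(0, len(flat), n)]
```

Fifteen transmitted indices of 5 bits each give 75 bits, which is 5 bits fewer than the 80 algebraic codebook bits of an unmodified 8.85 kbit/s frame.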
An Interoperable HR mode is also provided for the Rate Set I codec (I-HR-I).
Some bits must be dropped at the system interface during AMR-WB encoding/VMR-WB decoding operation, or generated at the system interface during VMR-WB encoding/AMR-WB decoding.
A bit allocation of the 8.85 kbit/s rate and an example configuration of I-HR-I are shown in Table 7. Table 7. Example bit allocation of the I-HR-I coding type in Rate Set I configuration.
The 10 bits of the last 2 second stage subvectors in the quantization of the LP filter parameters are dropped or generated at the system interface in a manner similar to Rate Set II described above.
The pitch delay is encoded only with integer resolution and with a bit allocation of 7, 3, 7, 3 bits in the four subframes. In the AMR-WB encoder/VMR-WB decoder operation, this translates to dropping the fractional part of the pitch at the system interface and clipping the differential delay to 3 bits for the 2nd and 4th subframes.
Algebraic codebook indices are dropped altogether similarly as in the I-HR solution of Rate Set II. The signal energy information is kept intact.
Rate Set I Interoperable mode is similar to the operation of the Rate Set II mode explained above in Figure 12 (in terms of VAD/DTX/CNG operation) and will not be described herein in more detail.

EP07105041A 2002-10-11 2003-10-10 Method for interoperation between adaptive multi-rate wideband codecs and multi-mode variable bit-rate wideband codecs Withdrawn EP1808852A1 (de)

Applications Claiming Priority (2)

Application Number	Priority Date	Filing Date	Title
US41766702P	2002-10-11	2002-10-11
EP03769097A EP1554718B1 (de)	2002-10-11	2003-10-10	Methods for interoperability between adaptive multi-rate wideband speech coders (AMR-WB) and multi-mode variable bit-rate wideband speech coders (VMR-WB)

Related Parent Applications (1)

Application Number	Title	Priority Date	Filing Date
EP03769097A Division EP1554718B1 (de)	2002-10-11	2003-10-10	Methods for interoperability between adaptive multi-rate wideband speech coders (AMR-WB) and multi-mode variable bit-rate wideband speech coders (VMR-WB)

Publications (1)

Publication Number	Publication Date
EP1808852A1 (de)	2007-07-18

Family

ID=38156869

Family Applications (1)

Application Number	Title	Priority Date	Filing Date
EP07105041A Withdrawn EP1808852A1 (de)	2002-10-11	2003-10-10	Method for interoperation between adaptive multi-rate wideband codecs and multi-mode variable bit-rate wideband codecs

Country Status (1)

Country	Link
EP (1)	EP1808852A1 (de)

2003
- 2003-10-10: EP application EP07105041A, published as EP1808852A1 (status: Withdrawn)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
WO2001008136A1 (en) *	1999-07-14	2001-02-01	Nokia Corporation	Method for decreasing the processing capacity required by speech encoding and a network element
US20020101844A1 (en) *	2001-01-31	2002-08-01	Khaled El-Maleh	Method and apparatus for interoperability between voice transmission systems during speech inactivity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JELINEK M ET AL.: "Advances in Source-controlled variable bit rate wideband speech coding", SPECIAL WORKSHOP IN MAUI (SWIM): LECTURES BY MASTERS IN SPEECH PROCESSING, 12 January 2004 (2004-01-12) - 14 January 2004 (2004-01-14), Maui, Hawaii, USA, XP002272510 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
EP1556978A4 (de) *	2002-10-28	2010-09-15	Qualcomm Inc	Reformatting of variable-rate vocoder frames for inter-system transmissions
WO2009103608A1 (de) *	2008-02-19	2009-08-27	Siemens Enterprise Communications Gmbh & Co. Kg	Method and means for encoding background noise information
CN101946281B (zh) *	2008-02-19	2012-08-15	西门子企业通讯有限责任两合公司	Method and device for decoding background noise information
RU2461080C2 (ru) *	2008-02-19	2012-09-10	Сименс Энтерпрайз Коммьюникейшнз Гмбх Унд Ко.Кг	Method and means for encoding background noise information
CN101952886B (zh) *	2008-02-19	2013-03-06	西门子企业通讯有限责任两合公司	Method and device for encoding background noise information
WO2016164231A1 (en) *	2015-04-05	2016-10-13	Qualcomm Incorporated	Encoder selection
KR20170134430A (ko) *	2015-04-05	2017-12-06	퀄컴 인코포레이티드	Encoder selection
US9886963B2 (en)	2015-04-05	2018-02-06	Qualcomm Incorporated	Encoder selection
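TWI640979B (zh) *	2015-04-05	2018-11-11	美商高通公司	Device and apparatus for encoding an audio signal, method of selecting an encoder for encoding an audio signal, computer-readable storage device, and method of selecting a value of an adjustment parameter to bias a selection toward a particular encoder for encoding an audio signal
CN115831132A (zh) *	2021-09-17	2023-03-21	腾讯科技（深圳）有限公司	Audio encoding and decoding method, apparatus, medium and electronic device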

Legal Events

Date	Code	Title	Description
2007-06-15	PUAI	Public reference made under Article 153(3) EPC to a published international application that has entered the European phase	Free format text: ORIGINAL CODE: 0009012
2007-07-18	AC	Divisional application: reference to earlier application	Ref document number: 1554718 Country of ref document: EP Kind code of ref document: P
2007-07-18	AK	Designated contracting states	Kind code of ref document: A1 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR
2008-03-26	AKX	Designation fees paid
2008-05-08	REG	Reference to a national code	Ref country code: DE Ref legal event code: 8566
2008-06-20	STAA	Information on the status of an EP patent application or granted EP patent	Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN
2008-07-23	18D	Application deemed to be withdrawn	Effective date: 20080119

Publication	Publication Date	Title
EP1554718B1 (de)	2011-04-13	Methods for interoperability between adaptive multi-rate wideband (AMR-WB) and multi-mode variable bit-rate wideband (VMR-WB) speech codecs
US7657427B2 (en)	2010-02-02	Methods and devices for source controlled variable bit-rate wideband speech coding
JP5173939B2 (ja)	2013-04-03	Method and device for efficient in-band dim-and-burst signaling and half-rate max operation in variable bit-rate wideband speech coding for CDMA wireless systems
JP4778010B2 (ja)	2011-09-21	Method and apparatus for performing reduced-rate, variable-rate vocoding
EP1758101A1 (de)	2007-02-28	Signal modification method for efficient coding of speech signals
JP2004287397A (ja)	2004-10-14	Interoperable vocoder
JP2010176145A (ja)	2010-08-12	Method and apparatus for robust speech classification
Jelinek et al.	2007	Wideband speech coding advances in VMR-WB standard
EP1808852A1 (de)	2007-07-18	Method for interoperation between adaptive multi-rate wideband codecs and multi-mode variable bit-rate wideband codecs
Jelinek et al.	2004	Advances in source-controlled variable bit rate wideband speech coding
CA2491623C (en)	2014-01-28	Method and device for efficient in-band dim-and-burst signaling and half-rate max operation in variable bit-rate wideband speech coding for cdma wireless systems
HK1130558B (en)	2013-07-12	Method and device for cdma wireless systems
Paksoy	1994	Variable rate speech coding with phonetic classification
HK1069472B (en)	2007-09-21	Signal modification method for efficient coding of speech signals