EP0140249B1 - Analyse et synthèse de la parole avec normalisation de l'énergie - Google Patents

Analyse et synthèse de la parole avec normalisation de l'énergie Download PDF

Info

Publication number
EP0140249B1
EP0140249B1 EP19840112266 EP84112266A EP0140249B1 EP 0140249 B1 EP0140249 B1 EP 0140249B1 EP 19840112266 EP19840112266 EP 19840112266 EP 84112266 A EP84112266 A EP 84112266A EP 0140249 B1 EP0140249 B1 EP 0140249B1
Authority
EP
European Patent Office
Prior art keywords
frames
frame
energy
speech
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
EP19840112266
Other languages
German (de)
English (en)
Other versions
EP0140249A1 (fr
Inventor
George R. Doddington
Panos E. Papamichalis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Texas Instruments Inc
Original Assignee
Texas Instruments Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US06/541,497 external-priority patent/US4696039A/en
Priority claimed from US06/541,410 external-priority patent/US4696040A/en
Application filed by Texas Instruments Inc filed Critical Texas Instruments Inc
Publication of EP0140249A1 publication Critical patent/EP0140249A1/fr
Application granted granted Critical
Publication of EP0140249B1 publication Critical patent/EP0140249B1/fr
Expired legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L2025/783Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786Adaptive threshold
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain

Definitions

  • the present invention relates to voice coding systems and in particular to a voice mail system and a method of encoding speech as defined in the precharacterizing parts of claims 1, 5, 7 and 8.
  • a speech encoding system of the type referred to in the preamble of claim 7 is disclosed in EP-A-47 589.
  • voice coding systems including voice mail in microcomputer networks, voice mail sent and received over telephone lines by microcomputers, user-programmed synthetic speech, etc.
  • the requirements of many of these applications are quite different from those of simple speech synthesis applications wherein synthetic speech can be carefully encoded and then stored in a ROM or on disk.
  • high speed computers with elaborate algorithms, combined with hand tweaking can be used to optimize encoded speech for good intelligibility and low bit requirements.
  • the speech encoding step does not have such large resources available. This is most obviously true in voice mail microcomputer networks, but it is also important in applications where a user may wish to generate his own reminder messages, diagnostic messages, signals during program operation, etc.
  • a microcomputer system wherein the user could generate synthetic speech messages in his own software would be highly desirable, not only for the individual user but also for the software production houses which do not have trained speech scientists available.
  • a particular problem in such application is energy variation. That is, not only will a speaker's voice intensity typically contain a large dynamic range related to sentence inflection, but different speakers will have different volume levels, and the same speaker's voice level may vary widely at different times. Untrained speakers are especially likely to use nonuniform uncontrolled variations in volume, which the listener normally ignores. This large dynamic range would mean that the voice coding method used must accommodate a wide dynamic range, and therefore an increased number of bits would be required for coding at reasonable resolution.
  • Energy normalization also improves the intelligibility of the speech received. That is, the dynamic range available from audio amplifiers and loudspeakers is much less than that which can easily be perceived by the human ear. In fact, the dynamic range of loudspeakers is typically much less than that of microphones. This means that a dynamic range which is perfectly intelligible to a human listener may be hard to understand if communicated through a loudspeaker, even if absolutely perfect encoding and decoding is used.
  • a further desideratum is that, in many attractive applications, the person listening to synthesized speech should not be required to twiddle a volume control frequently. Where a volume control is available, dynamic range can be analog- adjusted for each received synthetic speech signal, to shift the narrow window provided by the loudspeaker's narrow dynamic range, but this is obviously undesirable for voice mail systems and many other applications.
  • analog automatic gain controls have been used to achieve energy normalization of raw signals.
  • analog automatic gain controls distort the signal input to the analog to digital converter. That is, where (e.g.) reflection coefficients are used to encode speech data, use of an automatic gain control in the analog signal will introduce error into the calculated reflection coefficients. While it is hard to analyze the nature of this error, error is in fact introduced.
  • use of an analog automatic gain control requires an analog part, and every introduction of special analog parts into a digital system greatly increases the cost of the digital system. If an AGC circuit having a fast response is used, the energy levels of consecutive allophones may be inappropriate.
  • the sibilant /s/ will normally show a much lower energy than the vowel /I/. If a fast-response AGC circuit is used, the energy-normalized-word “six" is left with a sound extremely hissy, since the initial /s/ will be raised to the same energy as the /i/, inappropriately. Even if a slower-response AGC circuit is used, substantial problems still may exist, such as raising the noise floor up to signal levels during periods of silence, or inadequate limiting of a loud utterance following a silent period.
  • a further general problem with energy normalization is caused by the existence of noise during silent periods. That is, if an energy normalization system brings the noise floor up towards the expected normal energy level during periods when no speech signal is present, the intelligibility of speech will be degraded and the speech will be unpleasant to listen to. In addition, substantial bandwidth will be wasted encoding noise signals during speech silence periods.
  • the present invention solves the problems of energy normalization digitally, by using look-ahead energy normalization. That is, an adaptive energy normalization parameter is carried from frame to frame during a speech analysis portion of an analysis-synthesis system. Speech frames are buffered for a fairly long period, e.g. 1/2 second, and then are normalized according to the current energy normalization parameter. That is, energy normalization is "look ahead" normalization in that each frame of speech (e.g. each 20 millisecond interval of speech) is normalized according to the energy normalization value from much later, e.g. from 25 frames later. The energy normalization value is calculated for the frames as received by using a fast-rising slow-falling peak-tracking value.
  • a novel silence suppression scheme is used.
  • Silence suppression is achieved by tracking 2 additional energy contours.
  • One contour is a slow-rising fast-falling value, which is updated only during unvoiced speech frames, and therefore tracks a lower envelope of the energy contour. (This in effect tracks the ambient noise level).
  • the other parameter is a fast-rising slow-falling parameter, which is updated only during voiced speech frames, and thus tracks an upper envelope of the energy contour. (This in effect tracks the average speech level).
  • a threshold value is calculated as the maximum of respective multiples of these 2 parameters, e.g. the greater of: (5 times the lower envelope parameter), and (one fifth of the upper envelope parameter).
  • Speech is not considered to have begun unless a first frame which both has an energy above the threshold level and is also voiced in detected.
  • the system then backtracks among the buffered frames to include as "speech" all immediately preceding frames which also have energy greater than the threshold. That is, after a period during which the frames of parameters received have been identified as silent frames, all succeeding frames are tentively identified as silent frames, until a super-threshold-energy voiced frame is found.
  • the silence suppression system backtracks among frames immediately preceding this super-threshold energy voiced frame until a broken string subthreshold-energy frames at least to 0.4 seconds long is found. When such a 0.4 second interval of silence is found, backtracking ceases, and only those frames after the 0.4 seconds of silence and before the first voiced super-threshold energy frame are identified as non- silent frames.
  • a waiting counter is started. If the waiting reaches an upper limit (e.g. 0.4 seconds), without the energy again increasing above T, the utterance is considered to have stopped.
  • an upper limit e.g. 0.4 seconds
  • these objects are achieved by means for normalising the energy parameter of each said speech frame, wherein said energy parameter of each frame is normalised primarily with respect to an energy parameter of a subsequent frame occurring at least 0.1 seconds after said each frame.
  • the method of encoding speech as defined in the precharacterising part of claim 5 is characterised in that the energy parameters of each of said speech frames is normalised with respect to an energy parameter of a subsequent frame occurring later than each said respective frame by at least 0.1 seconds, with normalisation being done prior to encoding said speech parameters into the data channel.
  • the speech encoding system as defined in the precharacterising part of claim 7 is characterised by means for normalising the energy parameter of each said speech frame with respect to the energy parameter of a subsequent frame occurring at least 0.1 seconds after said each frame, and wherein said silence suppression means identifies each said frame as silent or non-silent by comparing the energy parameter of each successive one of said frames against a function of first and second adaptively updated threshold values, said first adaptively updated threshold value corresponding to a multiple of an upper envelope of said successive energy parameters of successive ones of said frames and said second threshold value corresponding to a multiple of a lower envelope of said successive values of said frames.
  • the voice mail system as defined in the precharacterising part of claim 8 is characterised by means for normalising the energy parameter of each said speech frame with respect to the energy parameter of a subsequent frame occurring at least 0.1 seconds after said each frame, and wherein said silence suppression means identifies each said frame as silent or non-silent by comparing the energy parameter of each successive one of said frames against a function of first and second adaptively updated threshold values, said first adaptively updated threshold value corresponding to a multiple of an upper envelope of said successive energy parameters of successive ones of said frames and said second threshold value corresponding to a multiple of a lower envelope of said successive values of said frames.
  • the present invention provides a novel speech analysis/synthesis system, which can be configured in a wide variety of embodiments.
  • the presently preferred embodiment uses a VAX 11/780 computer, coupled with a Digital Sound Corporation Model 200 A/D and D/A converter to provided high-resolution high-bitrate digitizing and to provide speech synthesis.
  • a conventional microphone and loudspeaker, with an analog amplifier such as a Digital Sound Corporation Model 240 are also used in conjunction with the system.
  • the present invention contains novel teachings which are also particularly applicable to microcomputer-based systems. That is, the high resolution provided by the above digitizer is not necessary, and the computing power available on the VAX is also not necessary. In particular, it is expected that a highly attractive embodiment of the present invention will use a TI Professional Computer (TM), using the built in low-quality speaker and an attached microphone as discussed below.
  • TM TI Professional Computer
  • the system configuration of the presently preferred embodiment is shown schematically in Figure 5. That is, a raw voice input is received by microphone, amplified by microphone amplifier, and digitized by D/A converter.
  • the D/A converter used in the presently presently embodiment, as noted, is an expensive high-resolution instrument, which provides 16 bits of resolution at a sample rate of 8 kHz.
  • the data received at this high sample rate will be transformed to provide speech parameters at a desired frame rate.
  • the frame rate is 50 frames per second, but the frame period can easily range between 10 milliseconds and 30 milliseconds, or over an even wider range.
  • linear predictive coding based analysis is used to encode the speech. That is, the successive samples (at the original high bit rate, of, in this example, 8000 per second) are used as inputs to derive a set of linear predictive coding parameters, for example 10 reflection coefficants k,-k, o plus pitch and energy, as described below.
  • the audible speech is first translated into a meaningful input for the system.
  • a microphone within range of the audible speech is connected to a microphone preamplifier and to an analog-to- digital converter.
  • the input stream is sampled 8000 times per second, to an accuracy of 16 bits.
  • the stream of input data is then arbitrarily divided up into successive "frames", and, in the presently preferred embodiment, each frame is defined to include 160 samples. That is, the interval between frames is 20 msec, but the LPC parameters of each frame are calculated over a range of 240 samples (30 msec).
  • the sequence of samples in each speech input frame is first transformed into a set of inverse filter coefficients a k , as conventionally defined.
  • a k inverse filter coefficients
  • the a k 's are the predictor coefficients with which a signal S k in a time series can be modeled as the sum of an input Uk and a linear combination of past values S k - n in the series. That is:
  • Each input frame contains a large number of sampling points, and the sampling points within any one input frame can themselves be considered as a time series.
  • the actual derivation of the filter coefficients a k for the sample frame is as follows: First, the time- series autocorrelation values R, are computed as where the summation is taken over the range of samples within the input frame. In this embodiment, 11 autocorrelation values are calculated (R o R, o ). A recursive procedure is now used to derive the inverse filter coefficients as follows: for
  • the presently preferred embodiment uses a procedure due to Leroux-Gueguen.
  • the normalized error energy E i.e. the self-residual energy of the input frame
  • the Leroux-Gueguen algorithm also produces the reflection coefficients (also referred to as partial correlation coefficients) k i .
  • the reflection co- efficients k r are very stable parameters, and are insensitive to coding errors (quantization noise).
  • the Leroux-Gueguen procedure is set forth, for example, in IEEE Transactions on Acoustic Speech and Signal Processing, page 257 (June 1977). This. algorithm is a recursive procedure, defined as follows:
  • This algorithm computres the reflection co- efficients k, using as intermediaries impulse response estimates e k rather then the filter co- efficients a k .
  • Linear predictive coding models generally are well known in the art, and can be found extensively discussed in such references as Rabiner and Schafer, Digital Processing of speech Signals (1978), Markel and Gray, Linear Predictive Coding of Speech (1976).
  • the excitation coding transmitted need not be merely energy and pitch, but may also contain . some additional information regarding a residual signal. For example, it would be possible to encode a bandwidth of the residual signal which was an integral multiple of the pitch, and approximately equal to 1000 Hz, as an excitation signal. Many other well-known variations of encoding the excitation information can also be used alternatively.
  • the LPC parameters can be encoded in various ways. For example, as is also well known in the art, there are numerous equivalent formulations of linear predictive co- efficients.
  • LPC filter coefficients a k can be expressed as the LPC filter coefficients a k , or as the reflection co- efficients k, or as the autocorrelations R, or as other parameter sets such as the impulse response estimates parameters E(i) which are provided by the LeRoux-Guegen procedure.
  • the LPC model order is not necessarily 10, but can be 8, 12, 14, or other.
  • the present invention does not necessarily have to be used in combination with an LPC speech encoding model at all. That is, the present invention provides an energy normalization method which digitally modifies only the energy of each of a sequence of speech frames, with regard to only the energy and voicing of each of a sequence of speech frames.
  • the present invention is applicable to energy normalization of the systems using any one of a great variety of speech encoding methods, including transform techniques, formant encoding techniques, etc.
  • the present invention operates on the energy value of the data vectors.
  • the encoded parameters are the reflection coefficients k 1 -k 1a , the energy, and pitch.
  • ENORM is subsequently updated, for each successive frame, as follows:
  • ENORM (i) is set equal to alpha times E(i)+(1-alpha) times ENORM(i-1);
  • ENORM(i) is set equal to beta times E(i)+(1-beta) times ENORM(i-1), where alpha is given a value close to 1 to provide a fast rising time constant (preferably about 0.1 seconds), and Beta has given a value close to 0, to provide a slow falling time constant (preferably in the neighborhood of 4 seconds).
  • the adaptive parameter ENORM provides an envelope tracking measure which tracks the peak energy of the sequence of frames I.
  • This adaptive peak-tracking parameter ENORM(i) is used to normalize the energy of the frames, but this not done directly.
  • the energy of each frame I is normalized by dividing it by a look ahead normalized energy ENORM * (i), where ENORM * (i) is defined to be equal to ENORM(i+d), where d represents a number of frames of delay which is typically chosen to be equivalent to 1/2 second (but must be at least 0.1 seconds.
  • ENORM * (i) is defined to be equal to ENORM(i+d)
  • d represents a number of frames of delay which is typically chosen to be equivalent to 1/2 second (but must be at least 0.1 seconds.
  • E * (i) is set equal to E(i)/ENORM * (i). This is accomplished by buffering a number of speech frames equal to the delay d, so that the value of ENORM for the last frame loaded into the buffer provides the value of ENORM * for the oldest frame in the buffer, i.e. for the frame currently being taken out of the buffer.
  • the falling time constant (corresponding to the parameter beta) is so long, energy normalization at the end of a word will not be distorted by the approximately zero-energy value of the following frames of silence.
  • the silence suppression will prevent ENORM from falling very far in this situation. That is, for a final unvoiced consonant, the long time constant corresponding to beta will mean that the energy normalization value ENORM of the silient frames 1/2 second after the end of a word will be still be dominated by the voiced phonemes immediately preceding the final unvoiced consonant.
  • the final unvoiced constant will be normalized with respect to preceding voiced frames, and its energy also will not be unduly raised.
  • the foregoing steps provide a normalized energy E * (i) for each speech frame i.
  • a further novel step is used to suppress silent periods.
  • silence detection is used to selectively prevent certain frames from being encoded. Those frames which are encoded are encoded with a normalized energy E * (i), together with the remaining speech parameters in the chosen model (which in the presently preferred embodiment are the pitch P and the reflection coefficients k l -k lo ).
  • Silence suppression is accomplished in a further novel aspect of the present invention, by carrying 2 envelope parameters: ELOW and EHIGH. Both of these parameters are started from some initial value (e.g. 100) and then are updated depending on the energy E(i) of each frame i and on the voiced or unvoiced status of that frame. If the frame is unvoiced, then only the lower parameter ELOW is updated as follows:
  • the 2 envelope parameters ELOW and EHIGH are then used to generate 2 threshold parameters TLOW and THIGH, defined as:
  • the current frame is a silient frame
  • all following frames will be tentatively assumed to be silent unless a voiced super-threshold-energy (and therefore nonsilent) frame is deteched.
  • the frames tentatively assumed to be silent will be stored in a buffer (preferable containing at least one second of data), since they may be identified later as not silent.
  • a speech frame is detected only when some frame is found which has a frame energy E(i) greater than the threshold T and which "is voiced. That is, an unvoiced super-threshold-energy frame is not by itself enough to cause a decision that speech has begun.
  • the unvoiced super-threshold-energy frames in the constant /s/ would not immediately trigger a decision that a speech signal had begun, but, when the voiced super-threshold-energy frames in the /i/ are detected, the immediately preceding frames are reexamined, and the frames corresponding to t /s/ which have energy greater than T are then also designated as "speech" frames.
  • the end of the word i.e. the beginning of "silent" frames which need not be encoded
  • a voiced frame is found which has its energy E(i) less than T
  • a waiting counter is started. If the waiting reaches an upper limit (e.g. 0.4 seconds) without the energy ever rising above T, then speech is determined to have stopped, and frames after the last frame which had energy E(i) greater than T are considered to be silent frames. These frames are therefore not encoded.
  • the energy normalization and silence suppression features of the system of the present invention are both dependent in important ways on the voicing decision. It is preferable, although not strictly necessary, that the voicing decision be made by means of a dynamic programming procedure which makes pitch and voicing decision simultaneously, using an interrelated distance measure.
  • the actual encoding can now be performed with a minimum bit rate.
  • 5 bits are used to encode the energy of each frame, 3 bits are used for each of the ten reflection coefficients, and 5 bits are used for the pitch.
  • this bit rate can be further compressed by one of the many variations of delta coding, e.g. by fitting a polynomial to the sequence of parameter values across successive frames and then encoding merely the coefficients of that polynomial, by simple linear delta coding, or by any of the various well known methods.
  • an analysis system as described above is combined with speech synthesis capability, to provide a voice mail station, or a station capable of generating used- generated spoken reminder messages.
  • the encoded output of the analysis section, as described above is connected to a data channel of some sort.
  • This may be a wire to which an RS 232 UART chip is connected, or may be a telephone line accessed by a modem, or may be simply a local data buss which is also connected to a memory board or memory chips, or may of course be any of a tremendous variety of other data channels.
  • connection to any of these normal data channels is easily and conveniently made two way, so that data may be received from a communications channel or recalled from memory. Such data received from the channel will thus contain a plurality of speech parameters, including an energy value.
  • the encoded data received from the data channel will contain LPC filter parameters for each speech frame, as well as some excitation information.
  • the data vector for each speech frame contains 10 reflection coefficients as well as pitch and energy. The reflection co-efficients configure a tense-order lattice filter, and an excitation signal is generated from the excitation parameters and provided as input to this lattice filter.
  • the excitation parameters are pitch and energy
  • a pulse at intervals equal to the pitch period, is provided as the excitation function during voiced frames (i.e.
  • the energy parameter can be used to define the power provided in the excitation function.
  • the output of the lattice filter provides the LPC-modeled synthetic signal, which will typically be of good intelligible quality, although not absolutely transparent. This output is then digital-to-analog converted, and the analogue output of the d-a converter is provided to an audio amplifier, which drives a loudspeaker or headphones.
  • such a voice mail system is configured in a microcomputer-based system.
  • This configuration uses a 8088-based system, together with a special board having a TMS 320 numeric processure chip mounted thereon.
  • the fast multiple provided by the TMS 320 is very convenient in performing signal processing functions.
  • a pair of audio amplifiers for input and output is also provided on the speech board, as is an 8 bit mu-law codec.
  • the function of this embodiment is essentially identical to that of the VAX embodiment described above, except for a slight difference regarding the converters.
  • the 8 bit codec performs mu-law conversion, which is non linear but provides enhanced dynamic range.
  • a lookup table is used to transform the 8 bit mu-law output provided from the codec chip into a 13 bit linear output.
  • This microcomputer embodiment also includes an internal speaker, and a microphone jack.
  • a further preferred realization is the use of multiple micro-computer based voice mail stations, as described above, to configure a microcomputer-based voice mail system.
  • microcomputers are conventionally connected in a local area network, using one of the many conventional LAN protoacalls, or are connected using PBX tilids.
  • the only slightly distinctive feature of this voice mail system embodiment is that the transfer mechanism used must be able to pass binary data, and not merely ASCII data.
  • the voice mail operation is simply a straight forward file transfer, wherein a file representing encoded speech data is generated by an analysis operation at one station, is transferrd as a file to another station, and then is converted to analog speech data by a synthesis operation at the second station.
  • the crucial changes taught by the present invention are changes in the analysis portion of an analysis/synthesis system, but these changes affect the system as a whole. That is, the system as a whole will achieve higher throughput of intelligible speech information per transmitted bit, better perceptual quality of synthesized sound at the synthesis section, and other system-level advantages.
  • microcomputer network voice mail systems perform better with minimized channel loading according to the present invention.
  • the present invention provides the objects described above, of energy normalization and of silent suppression, as well as other objects, advantageously.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Claims (15)

1. Système de communication vocale, comportant un analyseur connecté pour recevoir un signal de parole numérique et pour produire à partir de ce dernier une séquence de trames de paramètres de parole, lesdits paramètres de chaque trame contenant un paramètre d'énergie, des paramètres d'excitation et des paramètres de codage linéaire par prévision, un dispositif de sortie pour charge desdits paramètres de chaque trame de parole dans un canal de données, un dispositif d'entrée pour recevoir une séquence de trames de paramètres de parole, un dispositif de configuration d'un filtre en réseau en fonction desdits paramètres de codage linéaire par prévision, un dispositif générateur d'un signal d'excitation en fonction desdits paramètres d'excitation, ladite excitation étant produite comme entrée audit filtre en réseau, et un dispositif de modulation de la sortie dudit filtre en réseau en fonction dudit paramétre d'énergie pour produire une sortie de signal de parole, caractérisé en ce que:
un dispositif est prévu pour normalizer le paramètre d'énergie de chacune desdites trames de parole, ledit paramètre d'énergie de chaque trame étant normalisé principalement par rapport à un paramètre d'énergie d'une trame ultérieure apparaissant au moins 0,1 seconde après chacune desdites trames.
2. Système selon la revendication 1, dans lequel ledit paramètre d'énergie de chacune desdites trames de parole est normalisé par rapport à un paramètre de poursuite de crête desdites trames suivantes, ledit paramètre de poursuite de crête correspondant généralement à une enveloppe supérieure de la séquence desdits paramètres d'énergie desdites trames.
3. Système selon la revendication 1, dans lequel lesdits paramètres de parole de chacune desdites trames indiquent également l'état sonore/sourd de chacune desdites trames respectives.
4. Système selon la revendication 3, dans lequel lesdits paramètres comprennent également des informations de hauteur pour chacune desdites trames de parole, et dans lequel ledit analyseur détermine conjointement la hauteur et la sonorité de chaque trame de manière que lesdites décisions de hauteur de de sonorité varient aussi régulièrement que possible entre des trames voisines.
5. Procédé de codage de parole, consistant à analyser un signal de parole pour produire une séquence de trames comme des paramètres de parole, chacune desdites trames de ladite séquence de paramètres contenant un paramètre d'énergie, et à coder lesdits paramètres de parole. dans un canal de données, caractérisé en ce que les paramètres d'énergie de chacune desdites trames de parole sont normalisés par rapport à un paramètre d'énergie d'une trame suivante apparaissant plus tard que chacune desdites trames respectives, d'au moins 0,1 seconde, la normalisation étant faite avant le codage desdits paramètres de parole dans le canal de données.
6. Procédé selon la revendication 5, dans lequel ladite valeur d'énergie de chacune desdites trames de parole est normalisée par rapport à un paramètre de poursuite de crête desdites trames suivantes, ledit paramètre de poursuite de crête correspondant généralement à une enveloppe supérieure de la séquence desdites valeurs d'énergie de ladite trame.
7. Système de codage de la parole, comportant un analyseur connecté pour recevoir des données d'entrée de parole et pour produite à partir de ces données une séquence de trames de paramètres de parole, lesdites trames étant produites à une fréquence de trames prédéterminée, lesdites trames contenant plusieurs paramètres comprenant un paramètre d'énergie, un codeur pour coder des trames de parole successives comme des valeurs numériques et un dispositif de suppression de silence connecté audit dispositif de codage, ledit dispositif de suppression de silence évitant que ledit codeur code celles desdites séquences de trames qui ne correspondnt pas à un signal de parole réelle, et un dispositif de sortie pour charger lesdites valeurs numériques codées dans un canal de données, caractérisé par: un dispositif de normalisation du paramètre d'énergie de chacune desdites trames de parole par rapport au paramètre d'énergie d'une trame ultérieure apparaissant au moins 0,1 seconde après chacune desdites trames, et dans lequel ledit dispositif de suppression de silence identifie chacune desdites trames comme silencieuses ou non silencieuses en comparant le paramètre d'énergie de chacune successive desdites trames à une fonction d'une première et d'une seconde valeurs seuil corrigées de façon adaptative, ladite première valeur seuil corrigée de façon adaptative correspondant à un multiple d'une enveloppe supérieure desdits paramères d'énergie successifs de certaines successives desdites trames et ladite seconde valeur seuil correspondant à un multiple d'une enveloppe inférieure desdites valeurs successives desdites trames.
8. Système de communication vocale, comportant un analyseur connecté pour recevoir des données d'entrée de parole et pour produire à partir de ces données une séquence de trames de paramètres de parole, lesdites trames étant produites avec un débit de trames prédéterminé, lesdites trames contenant plusieurs paramètres comprenant un paramètre d'énergie, un codeur pour coder des trames de parole successives comme des valeurs numériques et un dispositf de suppression de silence connecté audit dispositif de codage, ledit dispositif de suppression de silence évitant que ledit codeur ne code celles de ladite séquence de trames ne correspondant pas à un signal de parole réelle, un dispositif de sortir pour charger lesdites valeurs numériques codées dans un canal de données, un dispositif d'entrée pour recevoir une séquence de trames de paramètres de parole, un dispositif de configuration d'un filtre en réseau en fonction desdits paramètres de codage linéaires à prévision, un dispositif générateur d'un signal d'excitation en fonction desdits paramètres d'excitation, ladite excitation étant produite comme une entrée dudit filtre en réseau et un dispositif de modulation de la sortie dudit filtre en réseau en fonction dudit paramètre d'énergie pour produire une sortie de signal de parole, caractérisé par:
un dispositif de normalisation du paramètre d'énergie de chaçune desdites trames de parole par rapport au paramètre d'énergie d'une trame ultérieure apparaissant au moins 0,1 seconde après chacune desdits trames et dans lequel dedit dispositif de suppression de silence identifie chacune desdites trames comme silencieuse ou non silencieuse en comparant le paramètre d'énergie de chacune successive desdites trames à une fonction d'une première et une seconde valeurs seuil corrigées de façon adaptative, ladite première valeur seuil corrigée de façon adaptative correspondant à un multiple d'une enveloppe supérieure desdits paramètres successifs d'énergie de certaines successives desdites trames et la seconde valeur seuil correspondant à un multiple d'une enveloppe inférieure desdites valeurs successives desdites trames.
9. Système selon la revendication 8, dans lequel l'analyseur produite une décision de vocalisation pour chacune desdites trames de parole, et dans lequel ledit dispositif de suppression de silence corrige ledit premier seuil seulement pendant celles sonores desdites trames et corrige seulement ledit second seuil pendant celles sourdes desdites trames.
10. Système selon la revendication 8, dans lequel ledit dispositif de suppression de silence, une fois qu'une trame silencieuse a été identifiée, n'identifie pas une trame non silencieuse ensuite jusqu'à ce qu'une trame sonore d'énergie supérieure au seuil soit détectée, auquel cas ladite trame sonore d'énergie supérieure au seuil et toutes les trames de parole sourdes d'énergie supérieure au seuil qui ne sont pas séparées de ladite trame sonore d'énergie supérieure au seuil par au moins un nombre prédéterminé des trames successives ayant chacune un niveau d'énergie au-dessous dudit niveau seuil, sont identifiées comme non silencieuses.
11. Systèm selon la revendication 8, dans lequel le dispositif de suppression de silence, une fois qu'une trame non silencieuse a été identifiée, identifie une trame silencieuse seulement lorsqu'une succession continue de trames d'énergie au-dessous du seuil a été identifiée pendant un intervalle de temps prédéterminé.
12. Système selon la revendication 10 ou 11, dans lequel ledit intervalle de temps prédéterminé est compris entre 0,2 et 0,8 secondes.
13. Système selon la revendication 8, dans lequel ladite valeur d'énergie de chacune desdites trames de parole est normalisée par rapport auxdites valeurs d'énergie, principalement celles desdites trames qui sont ultérieures à ladite trame respective d'au moins 0,1 seconde.
14. Système selon l'une quelconque des revendications 8 et 13, dans lequel ladite valeur d'énergie de chacune desdites trames de parole est normalisée par rapport à un paramètre de poursuite de crête desdites trames suivantes, ledit paramètre de poursuite de crête correspondant généralement à une enveloppe supérieure de la séquence desdites valeurs d'énergie desdites trames.
15. Système selon la revendication 11, dans lequel ledit dispositif de suppression de silence, une fois qu'une trame non silencieuse a été identifiée, identifie une trame silencieuse seulement si ladite succession continue de trames d'énergie au-dessous du seuil pendant un intervalle de temps prédéterminée est trouvée après une trame sonore d'énergie au-dessous du seuil.
EP19840112266 1983-10-13 1984-10-12 Analyse et synthèse de la parole avec normalisation de l'énergie Expired EP0140249B1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US541410 1983-10-13
US06/541,497 US4696039A (en) 1983-10-13 1983-10-13 Speech analysis/synthesis system with silence suppression
US06/541,410 US4696040A (en) 1983-10-13 1983-10-13 Speech analysis/synthesis system with energy normalization and silence suppression
US541497 1983-10-13

Publications (2)

Publication Number Publication Date
EP0140249A1 EP0140249A1 (fr) 1985-05-08
EP0140249B1 true EP0140249B1 (fr) 1988-08-10

Family

ID=27066699

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19840112266 Expired EP0140249B1 (fr) 1983-10-13 1984-10-12 Analyse et synthèse de la parole avec normalisation de l'énergie

Country Status (3)

Country Link
EP (1) EP0140249B1 (fr)
JP (1) JPH0644195B2 (fr)
DE (1) DE3473373D1 (fr)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2631147B1 (fr) * 1988-05-04 1991-02-08 Thomson Csf Procede et dispositif de detection de signaux vocaux
DE69127134T2 (de) * 1990-05-28 1998-02-26 Matsushita Electric Ind Co Ltd Sprachkodierer
FR2686183A1 (fr) * 1992-01-15 1993-07-16 Idms Sa Systeme de numerisation d'un signal audio, procede et dispositif de mise en óoeuvre pour constituer une base de donnees numeriques.
JP3484757B2 (ja) * 1994-05-13 2004-01-06 ソニー株式会社 音声信号の雑音低減方法及び雑音区間検出方法
US5991718A (en) * 1998-02-27 1999-11-23 At&T Corp. System and method for noise threshold adaptation for voice activity detection in nonstationary noise environments
US6314396B1 (en) * 1998-11-06 2001-11-06 International Business Machines Corporation Automatic gain control in a speech recognition system
US6889186B1 (en) * 2000-06-01 2005-05-03 Avaya Technology Corp. Method and apparatus for improving the intelligibility of digitally compressed speech
GB2367467B (en) * 2000-09-30 2004-12-15 Mitel Corp Noise level calculator for echo canceller
CN1867965B (zh) * 2003-10-16 2010-05-26 Nxp股份有限公司 使用自适应噪声基底跟踪的语音活动检测
US7660715B1 (en) 2004-01-12 2010-02-09 Avaya Inc. Transparent monitoring and intervention to improve automatic adaptation of speech models
US7529670B1 (en) 2005-05-16 2009-05-05 Avaya Inc. Automatic speech recognition system for people with speech-affecting disabilities
US7653543B1 (en) 2006-03-24 2010-01-26 Avaya Inc. Automatic signal adjustment based on intelligibility
US7925508B1 (en) 2006-08-22 2011-04-12 Avaya Inc. Detection of extreme hypoglycemia or hyperglycemia based on automatic analysis of speech patterns
US7962342B1 (en) 2006-08-22 2011-06-14 Avaya Inc. Dynamic user interface for the temporarily impaired based on automatic analysis for speech patterns
US7675411B1 (en) 2007-02-20 2010-03-09 Avaya Inc. Enhancing presence information through the addition of one or more of biotelemetry data and environmental data
US8041344B1 (en) 2007-06-26 2011-10-18 Avaya Inc. Cooling off period prior to sending dependent on user's state
US10049685B2 (en) 2013-03-12 2018-08-14 Aaware, Inc. Integrated sensor-array processor
WO2014165032A1 (fr) * 2013-03-12 2014-10-09 Aawtend, Inc. Processeur intégré pour réseau de capteurs
US10204638B2 (en) 2013-03-12 2019-02-12 Aaware, Inc. Integrated sensor-array processor

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AT347504B (de) * 1975-04-18 1978-12-27 Siemens Ag Oesterreich Einrichtung zur automatischen lautstaerke- regelung
US4071695A (en) * 1976-08-12 1978-01-31 Bell Telephone Laboratories, Incorporated Speech signal amplitude equalizer
US4280192A (en) * 1977-01-07 1981-07-21 Moll Edward W Minimum space digital storage of analog information
FR2380612A1 (fr) * 1977-02-09 1978-09-08 Thomson Csf Dispositif de discrimination des signaux de parole et systeme d'alternat comportant un tel dispositif
US4351983A (en) * 1979-03-05 1982-09-28 International Business Machines Corp. Speech detector with variable threshold
FR2451680A1 (fr) * 1979-03-12 1980-10-10 Soumagne Joel Discriminateur parole/silence pour interpolation de la parole
FR2466825A1 (fr) * 1979-09-28 1981-04-10 Thomson Csf Dispositif de detection de signaux vocaux et systeme d'alternat comportant un tel dispositif
CA1147071A (fr) * 1980-09-09 1983-05-24 Northern Telecom Limited Methode et appareil de detection de paroles dans un signal de voie telephonique
JPS58171099A (ja) * 1982-03-31 1983-10-07 富士通株式会社 音声パラメ−タ修正方法

Also Published As

Publication number Publication date
DE3473373D1 (en) 1988-09-15
JPH0644195B2 (ja) 1994-06-08
EP0140249A1 (fr) 1985-05-08
JPS60107700A (ja) 1985-06-13

Similar Documents

Publication Publication Date Title
US4696040A (en) Speech analysis/synthesis system with energy normalization and silence suppression
US4696039A (en) Speech analysis/synthesis system with silence suppression
EP0140249B1 (fr) Analyse et synthèse de la parole avec normalisation de l'énergie
EP0993670B1 (fr) Procede et appareil d'amelioration de qualite de son vocal dans un systeme de communication par son vocal
EP1159736B1 (fr) Systeme reparti de reconnaissance vocale
US6889186B1 (en) Method and apparatus for improving the intelligibility of digitally compressed speech
EP0785541B1 (fr) Usage de la détection d'activité de parole pour un codage efficace de la parole
WO1998040878A1 (fr) Vocodeur mettant en oeuvre une correlation entre des magnitudes spectrales et des excitations candidates pour coder la parole
EP0814458A2 (fr) Améliorations en relation avec le codage des signaux vocaux
US5706392A (en) Perceptual speech coder and method
JPH09152895A (ja) 合成フィルタの周波数応答に基づく知覚ノイズマスキング測定法
JPH09152898A (ja) 符号化されたパラメータのない音声信号の合成法
JPH08254994A (ja) 分類化及び輪郭の目録(インベントリー)による音声符号化パラメータの配列の再構成
WO1994025959A1 (fr) Utilisation d'un modele auditif pour ameliorer la qualite ou reduire le debit binaire de systemes de synthese de la parole
Van Schalkwyk et al. Linear predictive speech coding at 2400 b/s
GB2343822A (en) Using LSP to alter frequency characteristics of speech
Chandra et al. Linear prediction with a variable analysis frame size
JP2003323200A (ja) 音声符号化のための線形予測係数の勾配降下最適化
O’Shaughnessy “Speech Technology
Parihar Performance analysis of advanced front ends on the Aurora Large Vocabulary Evaluation
Xydeas Differential encoding techniques applied to speech signals
JP2847730B2 (ja) 音声符号化方式
Yegnanarayana Effect of noise and distortion in speech on parametric extraction
JPH0414813B2 (fr)
KR100210444B1 (ko) 대역 분할을 통한 음성신호 부호화 방법

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Designated state(s): DE FR GB

17P Request for examination filed

Effective date: 19851023

17Q First examination report despatched

Effective date: 19861212

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE FR GB

REF Corresponds to:

Ref document number: 3473373

Country of ref document: DE

Date of ref document: 19880915

ET Fr: translation filed
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed
REG Reference to a national code

Ref country code: GB

Ref legal event code: IF02

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20030915

Year of fee payment: 20

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20031003

Year of fee payment: 20

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20031031

Year of fee payment: 20

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF EXPIRATION OF PROTECTION

Effective date: 20041011

REG Reference to a national code

Ref country code: GB

Ref legal event code: PE20