EP0441642A2 - Methods and apparatus for spectral analysis

Info

Publication number
EP0441642A2
EP0441642A2 EP91301034A EP91301034A EP0441642A2 EP 0441642 A2 EP0441642 A2 EP 0441642A2 EP 91301034 A EP91301034 A EP 91301034A EP 91301034 A EP91301034 A EP 91301034A EP 0441642 A2 EP0441642 A2 EP 0441642A2
Authority
EP
European Patent Office
Prior art keywords
power
signal
logarithm
band
frequency
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP91301034A
Other languages
English (en)
French (fr)
Other versions
EP0441642A3 (en)
Inventor
John Nicholas Holmes
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BTG International Ltd
Original Assignee
BTG International Ltd
National Research Development Corp UK
Application filed by BTG International Ltd, National Research Development Corp UK

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
    • H04R2225/00 Details of deaf aids covered by H04R25/00, not provided for in any of its subgroups
    • H04R2225/43 Signal processing in hearing aids to enhance the speech intelligibility

Definitions

  • the present invention relates to methods and apparatus for spectral analysis, particularly the spectral analysis of sounds produced in speech.
  • spectral analysis finds applications in, for example, automatic speech recognition and speech coding for bandwidth reduction and the storage of speech.
  • each feature vector may typically contain between 5 and 20 features, depending on the method of analysis adopted. It is well known that the frequencies and intensities of the main short-term concentrations of power in a speech signal (formants) are highly correlated with the phonetic realization of the associated speech sound. The frequencies and intensities of the small number of formants that occur within the most significant part of the speech spectrum are useful as features for speech recognition.
  • An important object of the present invention is to provide a set of features that have most of the desirable properties of formants, but also carry phonetically significant information even for those sounds for which the conventional definition of formants does not seem appropriate.
  • a further object is to provide methods and apparatus for calculating these features easily.
  • a method for use in speech recognition of determining short-term characteristic features of a first signal representative of a speech signal comprising the steps of
  • the powers in the bands in which the said centroids are measured are also determined as further characteristic features.
  • the first signal can be regarded as an electrical signal representative of a speech sound.
  • the filtering to obtain further time varying signals is electrical filtering carried out for example by filters constructed from discrete components or by digital filters implemented by a computer such as a microprocessor.
  • the standard method of calculating the centroid of a distribution is to take the ratio of two integrals. If the distribution is represented graphically, the numerator of this ratio is the integral of the product of the ordinate and the abscissa, whereas the denominator is the integral of the ordinate. For spectral analysis these quantities refer to measurements in the frequency domain; the denominator integral is the total power in the duration of signal that is being analysed, which is the same in time domain as in the frequency domain and so can be computed in the time domain by summing the squares of the signal waveform samples. The numerator represents the sum of the powers of all spectral components after each component is multiplied by a quantity proportional to frequency.
  • The numerator can also be computed in the time domain by passing the waveform samples through a filter whose gain is proportional to the square root of frequency over the relevant band, and then squaring and summing the filtered waveform samples.
  • the filter gain characteristic required has a positive slope of three dB per octave and can be approximated very closely by a sampled data filter of moderate order, using standard filter design methods.
  • the power of each frequency band is given by the denominator integral.
  • the step of determining at least approximations to the frequencies at which the centroids occur may comprise, for each filter output, summing the squares of samples of the time varying signals at the filter output to provide a denominator which indicates the power of that filter output, applying at least an approximation to a frequency weighting of three dB per octave to the samples, summing the squares of the resultant samples to provide a numerator, and dividing the numerator by the denominator to indicate the frequency of the centroid.
  • the invention also includes apparatus for carrying out the first aspect thereof.
  • the method of finding the frequency at which a centroid occurs from signals in the time domain can be generally applied. Therefore according to a second aspect of the invention there is provided a method of determining short-term characteristic features of a first signal having a time-varying value comprising the steps of filtering the first signal to obtain second time-varying signals each in one of a plurality of frequency bands, and determining at least approximate indications of the frequencies at which the centroids of respective frequency versus power distributions in the said bands occur as the characteristic features by, for each frequency band, determining the total power of the second signal for that band in the time domain to provide a first power value, applying spectral weighting to frequency components of the second signal for that band, determining the total power of the spectrally weighted signal in the time domain to provide a second power value, and dividing the second power value by the first to provide an indication of the frequency of the centroid of that band.
  • apparatus for determining short-term characteristic features of a signal having a time-varying value comprising means for filtering a first signal having a time-varying value to obtain second time-varying signals each in one of a plurality of frequency bands, and means for determining at least approximate indications of the frequencies at which the centroids of the frequency versus power distributions in the said bands occur as the characteristic features by, for each frequency band, determining the total power of the second signal for that band in the time domain to provide a first power value, applying spectral weighting to frequency components of the second signal for that band, determining the total power of the spectrally weighted signal in the time domain to provide a second power value, and dividing the second power value by the first to provide an indication of the frequency of the centroid of that band.
  • the spectral weighting may be at least an approximation to three dB per octave.
  • the signal from the filter may be differentiated which, as is well known, is equivalent to applying a 6 dB per octave increase. If an approximation to differentiation is carried out on a waveform represented by samples, by subtracting each sample from the previous sample, the increase is about 6 dB per octave at low frequencies, gradually reducing to zero slope as the half-sampling rate frequency is reached. The effects of the variation from the ideal 3 dB per octave slope are two-fold. First, the signals at higher and lower frequencies than the spectral peak are not given the correct relative weight in the centroid calculation.
  • the filtering of the time varying signals can be carried out by any bandpass filters which correspond approximately to the ranges of the three lower formants; typically 250 to 900 Hz, 700 to 3,000 Hz and 1,800 to 3,500 Hz respectively.
  • some shaping of the filter characteristics is preferable in order to separate the two formants when they occur in overlapping parts of the filter characteristics.
  • a time-window may be applied to the band-limited signal at the output of each filter and then followed by "three dB per octave" filtering and summation operations applied to the entire duration of the windowed output.
  • the accuracy of this process can be ensured by using a finite-impulse-response "three dB per octave" filter, so that it is known that the output is exactly zero once the impulse response duration has passed.
  • any such systematic variation could be corrected by subsequently applying a non-linear function to the formant measurement, but in practice it would not matter for any type of pattern-matching speech recognizer because the same systematic effect would apply similarly during the recognizer's training process.
  • An advantage of using the centroid instead of direct measurement of the peak frequency is that it will always give an unambiguous result even when the spectral peak or peaks are not clearly defined. Provided the same method of analysis is used for setting up patterns for the pattern-matching process in the recognition algorithm, the fact that the measurements do not always correspond to formants is not important.
  • Speech signals from a microphone are converted to a linear digital representation by a suitable A-D conversion system sampling at 8 kHz.
  • Preliminary audio spectral shaping and gain control is provided such that the full range of the A-D conversion system is used and there is a good average balance between high and low frequency components of the signal.
  • the shaping and gain control is also arranged to attenuate the low frequency prominence normally occurring below the frequency of the first formant during voiced sounds.
  • the output of the A-D conversion system is connected to a computer such as a digital signal processor (DSP) integrated circuit or a microprocessor which in speech recognition carries out the recognition algorithms in addition to feature extraction based on centroids.
  • a general algorithm (see Figure 1) is first described and then followed by a description of more specific algorithms for use with an 8-bit microprocessor.
  • the sampled signal at the output of the A-D conversion system may be divided into frames each containing a predetermined number of samples.
  • a portion of each frame of duration of, for example, 2 to 30 ms is selected, longer durations (even up to a complete frame) being preferred if sufficient computation power for the centroid measuring process is available.
  • Samples from operation 1 are digitally filtered in an operation 2 in three pass bands to obtain three groups of samples relating to the three lowest formants.
  • Figure 1 shows the processing of only one of these formants and therefore in the complete analysis algorithm, all the operations of Figure 1 following operation 2 are repeated for the other two formants.
  • the formant filtering of operation 2 may be carried out by any suitable method.
  • the pass bands for the filters have already been mentioned but for some speech sounds two formants can sometimes move very close together in frequency, and fall within the pass bands of two filters. It is a consequence of the acoustic theory of speech production that when two formants are close they are usually of approximately equal intensity, and so any centroid measurement in these cases is likely to give the mean of the two frequencies of the formants within the band instead of the one formant that was desired. This error can be appreciably reduced if the formant band filters are designed to have a sloping characteristic of, say, about 3 dB per 100 Hz in the overlap regions.
  • the resultant error in the formant intensity measurement is corrected by applying the inverse of the filter characteristic to the intensity result, as a function of the measured frequency.
  • the next step in finding the frequency of the centroid of the selected formant band is to measure the total power in the signal at the output of the window.
  • this signal is typically as shown in Figure 3 and has the distribution of Figure 4 in the frequency domain.
  • Operations 4 to 7 of Figure 1 have the object of finding a denominator and a numerator from time domain signals as already outlined.
  • the denominator is the power in the waveform of Figure 3 while the numerator is found by applying a spectral weighting in the form of a gain characteristic with a positive slope of 3 dB per octave to samples of the waveform of Figure 3 (the operation 5), measuring the total power in the resultant waveform (the operation 6) and dividing the output of the operation 6 by the output of the operation 4 to derive a power ratio (the operation 7) representative of the frequency of the centroid.
  • In an operation 8 the power ratio from the operation 7 is multiplied by a scale factor to convert it to formant frequency, and in the operation 9 a scaled logarithm of the unfiltered power from the operation 4 is calculated to represent formant intensity in dB.
  • a type 6502 for example, running at 4 MHz may be used for the recognition of continuous speech with a fairly small recognition vocabulary if the techniques described below are used to simplify multiplication and division during feature extraction, and very efficient computational techniques (not relevant to the present invention) are used for recognition.
  • the input signal may be divided into frames containing 256 samples, that is 31.25 frames per second.
  • the limited computational power of a microprocessor means that a detailed analysis to determine the formant centroids can only be carried out on a selected part of each frame.
  • each sample is multiplied by the largest power of 2 that does not cause any samples of the fifty to exceed the range -128 to +127.
  • filters for separation for the formant bands can be made using a cascade connection of simple finite impulse response (FIR) sections each with one or at the most two multiplication and addition operations.
  • Within each section signal delays can be one or two sample periods or integer multiples of these numbers. Delays of two or more sample periods in one filter section imply multiple sets of zeroes in the transfer function, thus enabling higher order filters to be achieved without significantly increasing the computational load.
  • the filter coefficients can be chosen to have values such as +1, -1, 0.5, 1.75, which can be implemented by at most a very small number of shift and add/subtract operations.
  • the filter transfer function for the first formant filter may be $(1 - Z^{-6})(1 + Z^{-2})(1 + Z^{-1})(1 + Z^{-1})$, while that for the second formant filter may be $(1 - Z^{-3})(1 - Z^{-3})(1 + Z^{-2} + Z^{-4})(1 + 1.75Z^{-2} + Z^{-4})(0.5 + Z^{-2})$, where $Z^{-1}$ denotes a delay of one sample interval.
  • each sample is processed by the algorithm shown in order to separate the first formant. Similar algorithms are required to select the second and third formants but different transfer functions are employed. First the sample number is initialized to zero in an operation 16 and then the appropriate overall transfer function is achieved by applying the transfer functions of operations 17 to 20 in turn.
  • the first ten samples are disregarded with respect to zero crossing detection to exclude the initial transient of the first formant filter which is tenth order and uses a maximum delay of ten sample intervals.
  • the test 22 allows the operation 23 to increment the sample number and input a new sample, if the sample number is less than 10.
  • the test 24 is carried out if the sample number is greater than 10 and determines whether the polarity of the current sample is opposite to that of the previous sample; if not the operation 23 is carried out and the next sample is taken but if so then a test 25 is carried out to determine whether this is the first zero crossing.
  • the test 27 is carried out to determine whether more than 2 ms have elapsed since the first zero crossing. If less than 2 ms have elapsed then the next sample is taken, but if the output of the test 27 is positive then sampling ceases and the sample number of this final zero crossing is stored in an operation 28. Analysis of the output of the filtering operation 2 for the first formant is therefore applied only to the first sample following the first zero crossing after the first ten samples, the samples following in the next 2 ms and the following interval up to the next zero crossing.
  • LSUMD is found by a process which includes finding an approximation to the logarithm of the sum of two numbers without using antilogarithms. This process is described by Kingsbury and Rayner (1971) "Digital Filtering Using Logarithmic Arithmetic", Electronics Letters, 7, pages 56 to 58 and in the inventor's book "Speech Synthesis and Recognition", published by Van Nostrand Reinhold (UK) Co. Ltd. in 1988, pages 149 and 150.
  • the look-up table is entered as log(B/A) and the table output is log(1+B/A), where A is the greater of two values: the power found up to the current sample; and the square of the current sample. B is the smaller of these two values.
  • In an operation 39 the logarithm of the square of each difference from the operation 38 is found by means of a look-up table and designated LSDIF.
  • the logarithm of the sum of LSDIF and LSUMN is found using the Kingsbury and Rayner process in operations 40 to 44 in the same way as is described for LSSAM and LSUMD in the operations 33 to 37.
  • Test 45 determines whether the last sample as stored in operation 28 of Figure 6 has been reached and if not a jump back to the operation 32 occurs and the next sample is taken. Otherwise operations 4, 5 and 6 have been carried out for a complete interval covered by the samples and an exit occurs from the algorithm of Figures 7a and 7b.
  • the replacement for the approximate 3 dB/octave filtering is the differencing operation 38, but for the third formant, where some frequency components may be near to half the sampling rate, the frequency-domain slope of the differencing operation tends to zero.
  • the problem is avoided by spectrally inverting the signal before taking measurements in the range of the third formant so bringing high frequencies down to near zero where the differencing is effective.
  • the spectral inversion can be achieved by inverting every alternate waveform sample, and the effects of spectral inversion and subsequent differencing are combined into the single operation of adding pairs of adjacent samples instead of differencing.
  • the operation 7 of deriving the power ratio which gives the centroid frequency is now carried out by an operation 47 where the logarithm (LSUMD) of the denominator of the power ratio is subtracted from the logarithm of the numerator (LSUMN).
  • the resulting value is then converted to a formant frequency by means of a further look-up table in the operation 48 which for the third formant includes subtraction of the frequency obtained from half the sampling frequency.
  • the power in the band in which the centroid is measured is obtained in an operation 49 by a further look-up table which converts LSUMD to dB, taking account of the sloping characteristics of the formant filters in the overlap regions.
  • Apparatus which includes the invention is shown in Figure 9 and comprises a signal capture portion 51 which includes the microphone, the audio spectral shaping and the A-D conversion system, a feature extraction portion 52 for carrying out the algorithm of Figure 1, or Figure 1 as approximated by Figures 5 to 8, and a pattern/modelling portion 53 for speech recognition from features obtained from the portion 52.
  • the portions 52 and 53 are usually in the form of a single computer, DSP circuit or microprocessor as indicated above, which may also include some of the portion 51.
  • the invention can of course be put into operation in many other ways than those specifically described; for example a 16-bit or 32-bit microprocessor may be used and gives more accurate results since fewer approximations have to be made and larger signal portions are analysed.
  • a DSP integrated circuit gives better results but may involve greater expense both in hardware and power consumption. Any other computers, apparatus or method for finding the centroids of spectral peaks can be used in spectral analysis according to the invention.
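
The time-domain centroid measurement described above (operations 4 to 7, with the differencing operation 38 standing in for the 3 dB per octave weighting) and the Kingsbury and Rayner style logarithmic summation can be illustrated with a minimal sketch. It is not the patented implementation. All function names are hypothetical. The conversion from power ratio to frequency inverts the first-difference gain in closed form, whereas the description uses a look-up table. The base-2 logarithms and the table step of 0.125 are assumptions chosen for illustration. A single test tone stands in for one formant-band filter output.

```python
import math

FS = 8000.0  # sampling rate in Hz, as stated in the description


def band_power(samples):
    """Denominator: total power of the band signal (sum of squared samples)."""
    return sum(s * s for s in samples)


def differenced_power(samples):
    """Numerator: power after first differencing (the 8-bit variant's weighting)."""
    return sum((b - a) ** 2 for a, b in zip(samples, samples[1:]))


def ratio_to_frequency(ratio, fs=FS):
    """Convert the power ratio to Hz.

    Assumption: a first difference has power gain 4*sin^2(pi*f/fs), so the
    mapping is inverted in closed form; the description uses a look-up table.
    """
    clipped = min(max(ratio, 0.0), 4.0)
    return (fs / math.pi) * math.asin(math.sqrt(clipped) / 2.0)


def centroid_estimate(samples, fs=FS):
    """Return (approximate centroid frequency in Hz, band power)."""
    den = band_power(samples)
    if den == 0.0:
        return None, 0.0
    return ratio_to_frequency(differenced_power(samples) / den, fs), den


def build_log_add_table(step=0.125, max_diff=16.0):
    """Table of log2(1 + B/A), indexed by quantised log2(A/B) >= 0."""
    entries = int(max_diff / step) + 1
    return [math.log2(1.0 + 2.0 ** (-i * step)) for i in range(entries)]


def log_add(log_a, log_b, table, step=0.125):
    """Approximate log2(A + B) from log2(A) and log2(B) by table look-up."""
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    index = int(round((hi - lo) / step))
    if index >= len(table):
        return hi  # the smaller term is negligible
    return hi + table[index]


def sine(freq, count, fs=FS):
    """Test tone standing in for one formant-band filter output."""
    return [math.sin(2.0 * math.pi * freq * n / fs) for n in range(count)]


if __name__ == "__main__":
    freq, power = centroid_estimate(sine(500.0, 800))
    print("estimated centroid (Hz):", freq)
    table = build_log_add_table()
    print("log2(3 + 5) approx:", log_add(math.log2(3), math.log2(5), table))
```

For a band containing several spectral components, the ratio is a power-weighted average of the differencer gain rather than an exact centroid. As the description notes, a systematic deviation of this kind would apply equally during a recognizer's training.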

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Measuring Frequencies, Analyzing Spectra (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
EP19910301034 1990-02-08 1991-02-08 Methods and apparatus for spectral analysis Withdrawn EP0441642A3 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9002852 1990-02-08
GB9002852A GB2240867A (en) 1990-02-08 1990-02-08 Speech analysis

Publications (2)

Publication Number Publication Date
EP0441642A2 (de) 1991-08-14
EP0441642A3 (en) 1993-03-10

Family

ID=10670649

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19910301034 Withdrawn EP0441642A3 (en) 1990-02-08 1991-02-08 Methods and apparatus for spectral analysis

Country Status (3)

Country Link
EP (1) EP0441642A3 (de)
JP (1) JPH05143098A (de)
GB (1) GB2240867A (de)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2762180A1 (fr) * 1997-04-15 1998-10-16 Roland Roger Carrat Procede et dispositif d'amplification et de codage du signal vocal destine a l'amelioration de l'intelligibilite en milieu bruyant et a la correction des surdites
EP0713076B1 (de) * 1994-11-15 2001-06-06 Raytheon Company Fehlererkennungsgerät mit digitaler Koordinatentransformation
WO2005055645A1 (en) * 2003-12-01 2005-06-16 Koninklijke Philips Electronics N.V. Selective audio signal enhancement

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5623609A (en) * 1993-06-14 1997-04-22 Hal Trust, L.L.C. Computer system and computer-implemented process for phonology-based automatic speech recognition
GB9323991D0 (en) * 1993-11-22 1994-01-12 Holmes John N Method and apparatus for spectral analysis
JP4568826B2 (ja) * 2005-09-08 2010-10-27 株式会社国際電気通信基礎技術研究所 声門閉鎖区間検出装置および声門閉鎖区間検出プログラム

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3649765A (en) * 1969-10-29 1972-03-14 Bell Telephone Labor Inc Speech analyzer-synthesizer system employing improved formant extractor
DE2313141A1 (de) * 1973-03-16 1974-09-19 Philips Patentverwaltung Verfahren und anordnung zur echtzeitermittlung der uebertragungsfunktionen von systemen
DE2448909B2 (de) * 1974-10-15 1978-12-07 Olympia Werke Ag, 2940 Wilhelmshaven Elektrische Schaltungsanordnung für eine Einrichtung zur Spracherkennung
NL8203520A (nl) * 1982-09-10 1984-04-02 Philips Nv Digitale filterinrichting.

Also Published As

Publication number Publication date
GB9002852D0 (en) 1990-04-04
GB2240867A (en) 1991-08-14
EP0441642A3 (en) 1993-03-10
JPH05143098A (ja) 1993-06-11

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): DE FR GB IT SE

17P Request for examination filed

Effective date: 19910828

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: BRITISH TECHNOLOGY GROUP LTD

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): DE FR GB IT SE

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Withdrawal date: 19960404