WO2000072308A1 - Interval normalization device for voice recognition input voice - Google Patents
- Publication number
- WO2000072308A1 (PCT/JP2000/003113)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pitch
- voice
- signal
- frequency
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
Definitions
- This invention enables a wide range of voice recognition processing for low-pitched male voices and high-pitched female and child voices in a voice recognition device that recognizes the voice of an unspecified speaker.
- The present invention relates to a speech recognition device and, more specifically, to an input speech pitch normalization device that normalizes the pitch of the speech to be recognized according to the pitch of the standard speech of the speech recognition device.
- Voice recognition technology has come into wide use in consumer electronics, helped by improvements in digital signal processing and by high-performance, low-cost LSIs. It has been introduced to improve the operability of such equipment.
- The basic principle of a speech recognition device is that the input sound is converted into a digital sound signal, and the input speech is recognized by comparing that signal with standard speech data registered in a speech dictionary prepared in advance.
- For this reason, measures have been taken so that the input can be compared easily with the standard voice data, such as requiring a particular way of speaking from the specific speaker who is the subject of recognition, or registering that speaker's voice in the speech recognition device in advance.
- When a speech recognition device is used as a consumer device, however, restricting it to a specific speaker significantly reduces convenience, and the product loses its value.
- Utterances by unspecified speakers vary widely. The factors that impair speech recognition accuracy for such utterances can be broadly divided into utterance speed and voice pitch.
- The first obstacle to speech recognition is utterance speed. Speech recognition is realized by comparing the input speech with standard-speed speech registered in a speech dictionary prepared in advance. If the difference between the two utterance speeds exceeds a certain value, a correct comparison is impossible and speech recognition fails.
- The second obstacle to speech recognition is voice pitch. The pitch of the voice differs from speaker to speaker, for example the low-pitched voices of men and the high-pitched voices of women and children.
- If the difference between the pitch of the voice registered in the speech dictionary and the pitch of the voice uttered by the unspecified speaker exceeds a certain level, the two voices cannot be compared correctly and speech recognition becomes impossible.
- FIG. 5 shows a speech recognition apparatus proposed in Japanese Patent Laid-Open No. 9-325579 to solve the above-mentioned problem.
- The voice recognition device VRAc includes a voice input unit 111, an utterance speed calculation unit 112, an utterance speed conversion rate determination unit 113, an utterance speed conversion unit 114, and a speech recognition unit 115.
- The voice input unit 111 captures a voice uttered by an unspecified speaker as an analog voice signal and A/D-converts it to generate a digital voice signal.
- The utterance speed calculation unit 112 calculates the utterance speed of the unspecified speaker's input voice based on the voice signal.
- The utterance speed conversion rate determination unit 113 compares the utterance speed calculated by the utterance speed calculation unit 112 with the reference speed and determines the speed conversion rate.
- The utterance speed conversion unit 114 converts the utterance speed based on the speed conversion rate.
- The speech recognition unit 115 performs voice recognition on the input voice signal whose speed has been converted by the utterance speed conversion unit 114.
- The voice uttered by the unspecified speaker is captured via the microphone and the amplifier of the voice input unit 111, and the analog signal is then converted to a digital signal by the A/D converter.
- The utterance speed calculation unit 112 cuts out one sound of the input voice from the converted digital voice signal and calculates the utterance speed of that sound from the time taken to cut it out.
- Let Ts be the time required for the utterance speed calculation unit 112 to cut out one sound (hereinafter "one-sound cut-out time"), and let Th be the reference time required for an unspecified speaker to utter one sound (hereinafter "one-sound utterance reference time").
- Based on the one-sound cut-out time Ts and the one-sound utterance reference time Th, the utterance speed conversion rate determination unit 113 compares the one-sound utterance speed 1/Ts with the reference one-sound utterance speed 1/Th to determine the speed conversion rate α.
- The speed conversion rate α can be calculated by the following equation (1).
- When the one-sound cut-out time Ts is shorter than the one-sound utterance reference time Th, that is, when the utterance speed of the input speech is faster than the speed that the speech recognition device VRAc can recognize accurately, the speed conversion rate α is smaller than 1. In this case the utterance speed of the input voice must be reduced. Conversely, when Ts is longer than Th, that is, when the utterance speed of the input voice is slower than the speed that VRAc can recognize accurately, α is larger than 1. In this case the utterance speed of the input voice must be increased.
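Equation (1) is not reproduced in this text. The sketch below assumes α = Ts / Th, which is the simplest form consistent with the behaviour described (α below 1 when the input is faster than the reference, above 1 when it is slower). The function names are hypothetical and are not taken from the cited publication.

```python
def speed_conversion_rate(ts: float, th: float) -> float:
    """Assumed form of equation (1): alpha = Ts / Th.

    ts: one-sound cut-out time of the input voice (seconds)
    th: one-sound utterance reference time (seconds)
    """
    if ts <= 0 or th <= 0:
        raise ValueError("Ts and Th must be positive")
    return ts / th


def required_adjustment(alpha: float) -> str:
    """Map alpha to the adjustment the text says is needed."""
    if alpha < 1.0:
        return "slow down"   # input was uttered faster than the reference
    if alpha > 1.0:
        return "speed up"    # input was uttered slower than the reference
    return "none"


if __name__ == "__main__":
    alpha = speed_conversion_rate(ts=0.1, th=0.2)
    print(alpha, required_adjustment(alpha))
```

Under this assumption a sound cut out in half the reference time gives α = 0.5, so the utterance has to be stretched before recognition.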
- The utterance speed conversion unit 114 converts the input voice signal based on the speed conversion rate α so that the utterance speed becomes constant, and generates a speed-converted input voice signal.
- The speech recognition unit 115 performs speech recognition processing on the speed-converted input voice signal and outputs the recognition result.
- This speed conversion can be realized easily with modern digital technology. For example, to slow down the utterance speed of the input voice, multiple vowel waveforms correlated with one sound of the input voice may be added to the voice signal to extend its utterance time. To increase the utterance speed, the vowel waveform of one sound of the input voice may be thinned out from the voice signal multiple times.
- This speech rate conversion technology targets unspecified speakers, whose utterance speed varies from person to person, and is intended in particular to improve the recognition rate for speech uttered by fast-talking speakers.
- It can improve the recognition rate for speech uttered by an unspecified speaker whose utterance speed differs from the reference one-sound utterance speed 1/Th; that is, it is effective against the first obstacle to speech recognition.
- However, it cannot be expected to improve the recognition rate for speech whose pitch differs from that of the reference speech, which is the second obstacle to speech recognition.
- That is, the speech recognition device VRAc cannot achieve a high recognition rate across the wide range of frequencies it must handle, such as the low voices of men and the high voices of women and children. A speaker who is asked to do so can speak faster or more slowly, but it is difficult to ask a speaker to utter a voice with a different timbre. The reference utterance frequency of a speaker is determined by the shape and size of the speaker's throat; since the shape of the throat cannot be changed, the timbre of the utterance cannot be changed either.
- To cope with this, the voice recognition device VRAc would have to hold multiple sets of standard voice data at the different pitches needed for recognition, such as the voices of men, women, and children, and switch the standard voice data it refers to according to the pitch of the speaker.
- Disclosure of invention
- The present invention has the following features in order to achieve the above-mentioned object.
- A first aspect is an input voice pitch normalization device which is used in a voice recognition device that recognizes input voices uttered by unspecified speakers based on speech recognition standard data, and which converts the pitch of the input voice into a predetermined relationship with the pitch of the speech recognition standard data, comprising:
- a pitch difference determiner that determines the pitch difference between the input speech and the speech recognition standard data; and
- a pitch converter that converts the frequency of the input voice so that the pitch of the input voice has a predetermined relationship with the pitch of the speech recognition standard data.
- According to the first aspect, the pitch of the input voice is adjusted to the pitch of the speech recognition standard data, so the speech recognition rate can be improved.
- a read controller that reads a series of the input voices from the memory and generates a speech signal to be recognized;
- the pitch difference determiner comprises:
- a frequency component analyzer that analyzes the frequency components of the speech signal to be recognized and generates a frequency component signal; and
- a pitch determiner that determines the pitch difference between the speech recognition standard data and the basic frequency and generates a pitch difference signal.
- The input voice may be a single sound or a word composed of several sounds.
- In a third aspect, in the second aspect, the pitch determiner obtains the first formant of the speech signal to be recognized as the basic frequency, so that the pitch difference can be determined stably regardless of whether the speech to be recognized is a single sound or a plurality of sounds.
- Because the pitch comparison with the recognition standard data is performed using the first formant, whose frequency characteristic is stable over the unit of input voice, processing such as cutting out one sound of the input voice is not required; processing is fast and the device configuration is simplified.
- The pitch converter comprises a read clock controller that, based on the pitch difference signal, determines the frequency of the timing clock used to read the memory so that the frequency of the speech signal to be recognized is converted, and generates a read clock signal;
- based on the read clock signal, the memory outputs the speech signal to be recognized at a pitch having a predetermined relationship with the pitch of the speech recognition standard data.
- By changing the memory read timing, the pitch can be changed without damaging the waveform characteristics of the speech signal to be recognized. No thinning process is required.
- the fifth aspect is a speech recognition device provided with the input speech pitch normalization device according to the fourth aspect.
- A sixth aspect is a speech recognition device that recognizes an input speech uttered by an unspecified speaker based on speech recognition standard data, comprising:
- an input voice pitch normalization device that converts the pitch of the input speech into a predetermined relationship with the pitch of the speech recognition standard data; and
- a voice analyzer that compares the pitch-converted input voice with the speech recognition standard data and generates a recognition signal indicating the speech recognition standard data matching the input voice.
- the pitch of the input voice is adjusted according to the pitch of the voice recognition standard data, so that the voice recognition rate can be improved.
- In a seventh aspect, in the sixth aspect, the device further comprises a memory that temporarily stores the input voice, and
- a read controller that reads a series of input voices from the memory and generates a speech signal to be recognized;
- the pitch difference determiner comprises:
- a frequency component analyzer that analyzes the frequency components of the speech signal to be recognized and generates a frequency component signal; and
- a pitch determiner that determines the basic frequency of the speech signal to be recognized, determines the pitch difference between the speech recognition standard data and the basic frequency, and generates a pitch difference signal.
- The input voice may be a single sound or a word composed of several sounds.
- The pitch determiner obtains the first formant of the speech signal to be recognized as the fundamental frequency and compares it with the first formant of the speech recognition standard data to determine the pitch difference, so the pitch difference can be determined stably whether the speech to be recognized is one sound or multiple sounds.
- Because the pitch is compared with the recognition standard data using the first formant, whose frequency characteristic is stable over the unit of input speech, there is no need for processing such as cutting out one sound of the input voice; processing is fast and the device configuration can be simplified.
- The pitch converter comprises a read clock controller that, based on the pitch difference signal, determines the frequency of the timing clock used to read the memory so that the frequency of the speech signal to be recognized is converted, and generates a read clock signal;
- based on the read clock signal, the memory outputs the speech signal to be recognized at a pitch having a predetermined relationship with the pitch of the speech recognition standard data.
- By changing the memory read timing, the pitch can be changed without impairing the waveform characteristics of the speech signal to be recognized. This eliminates the need for interpolation and decimation processing.
- FIG. 1 is a block diagram showing a configuration of a speech recognition device incorporating the input speech normalization device according to the embodiment of the present invention.
- FIG. 3 is an explanatory diagram showing examples of the change of a speech waveform over time and the pitch conversion performed between them.
- FIG. 4 is a flowchart showing the operation of the input speech normalizing apparatus shown in FIG. 1.
- FIG. 5 is a block diagram showing the configuration of a conventional speech recognition apparatus.
- Best mode for carrying out the invention
- The present invention will now be described in more detail with reference to the accompanying drawings.
- The voice recognition device VRAp includes an A/D converter 1, an input voice normalization device Tr, a standard voice data storage unit 13, a voice analyzer 15, and a controller 17.
- The standard voice data storage unit 13 stores a voice frequency component pattern Psf that serves as the reference for voice recognition, and outputs the stored pattern Psf at a predetermined timing.
- The uttered voice is input to the voice recognition device VRAp as an analog voice signal SVa via a microphone and an amplifier (not shown).
- Based on an operating state signal Ss that is output from the other components 1, Tr, 13, and 15 of the voice recognition device VRAp and indicates their operating state, the controller 17 generates a control signal Sc that controls the operation of components 1, Tr, 13, and 15, and thereby controls the operation of the entire voice recognition device VRAp. The operating state signal Ss, the control signal Sc, and the controller 17 are well-known technologies and, for simplicity, are not described further except where necessary.
- The A/D converter 1 performs A/D conversion on the input analog audio signal SVa to generate a digital audio signal SVd, and inputs SVd to the input voice normalization device Tr.
- Based on the input digital voice signal SVd, the input voice normalization device Tr generates a pitch-normalized digital voice signal SVc, whose pitch has been converted to the standard pitch of the voice recognition device VRAp, and outputs it to the voice analyzer 15.
- Based on the voice frequency component pattern Psf read from the standard voice data storage unit 13, the voice analyzer 15 analyzes the pitch-normalized digital voice signal SVc received from the input voice normalization device Tr and outputs a recognition signal Src indicating the speech recognition standard data that matches the input voice.
- The input voice normalization device Tr includes a memory 3, a read controller 5, a frequency component analyzer 7, a pitch determiner 9, and a read clock controller 11. The memory 3 temporarily stores the digital audio signal SVd output from the A/D converter 1.
- The read controller 5 monitors the storage of the digital audio signal SVd in the memory 3 and generates a read control signal Src, which controls the memory 3 so that the portion of the stored digital audio signal SVd corresponding to an independent utterance is read out as a digital audio signal unit SVu.
- The frequency component analyzer 7 applies a fast Fourier transform to the digital audio signal unit SVu output from the memory 3 and performs frequency spectrum analysis.
- The frequency component analyzer 7 generates a frequency component signal Sfc based on the result of the frequency spectrum analysis of the digital audio signal unit SVu.
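The spectrum analysis step can be illustrated with a minimal Python sketch. It is not the patented implementation: a naive discrete Fourier transform stands in for the fast Fourier transform, the lowest prominent spectral peak is taken as a stand-in for the value the text calls the first formant, and every name and the `threshold_ratio` parameter are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2 (an FFT gives the same values faster)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


def lowest_peak_hz(samples, sample_rate_hz, threshold_ratio=0.5):
    """Return the frequency of the lowest local peak that reaches
    threshold_ratio of the strongest non-DC bin, or None if there is none."""
    spectrum = magnitude_spectrum(samples)
    limit = threshold_ratio * max(spectrum[1:])
    for k in range(1, len(spectrum) - 1):
        is_local_peak = spectrum[k] >= spectrum[k - 1] and spectrum[k] >= spectrum[k + 1]
        if spectrum[k] >= limit and is_local_peak:
            return k * sample_rate_hz / len(samples)
    return None


if __name__ == "__main__":
    rate = 8000
    # Synthetic unit: a 200 Hz component plus a weaker 600 Hz component.
    unit = [math.sin(2 * math.pi * 200 * t / rate)
            + 0.6 * math.sin(2 * math.pi * 600 * t / rate) for t in range(400)]
    print(lowest_peak_hz(unit, rate))
```

The frequency resolution is `sample_rate_hz / len(samples)`, so a longer signal unit gives a finer estimate at the cost of more computation.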
- The pitch determiner 9 extracts the first formant from the frequency component signal Sfc output by the frequency component analyzer 7 and, based on the first formant of the standard voice stored in the standard voice data storage unit 13, finds the pitch difference between the input voice (SVa, SVd, SVu) and the standard voice.
- Based on the obtained pitch difference, the pitch determiner 9 further generates a pitch conversion rate signal Scr indicating by how much the pitch of the input voice (SVa, SVd, SVu) must be converted to reach the standard pitch.
- Based on the pitch conversion rate signal Scr output from the pitch determiner 9, the read clock controller 11 generates a read clock Scc by controlling the frequency of the clock used to read the memory 3.
- The memory 3 outputs the digital audio signal SVd at the timing specified by the read clock Scc, so that the pitch of the digital audio signal SVd is adjusted to have the predetermined relationship with the pitch of the standard voice.
- This predetermined pitch relationship does not necessarily mean an exact match; it goes without saying that a tolerance determined by the performance of the voice recognition device VRAp (especially the voice analyzer 15) is allowed.
- The voice analyzer 15 analyzes the pitch-normalized digital voice signal SVc input from the memory 3 and outputs a recognition signal Src indicating which of the reference voice frequency component patterns Psf read from the standard voice data storage unit 13 it matches.
- FIG. 2 shows an example of a frequency spectrum obtained by applying a fast Fourier transform to the digital audio signal SVd in the frequency component analyzer 7.
- The horizontal axis indicates the frequency f, and the vertical axis indicates the intensity A.
- The dashed line L1 shows a typical voice frequency spectrum when the digital voice signal SVd is a voice uttered by a man, and the dashed line L2 shows a typical voice frequency spectrum when the digital voice signal SVd is a voice uttered by a woman or a child.
- The solid line Ls shows an example of the voice frequency spectrum stored in the standard voice data storage unit 13 as the standard voice data for speech recognition.
- In the case of a man, as shown by the dashed line L1, the frequency spectrum appears on the lower-frequency side of the standard voice; in the case of a woman or a child, as shown by the dashed line L2, the frequency spectrum appears on the higher-frequency side of the standard voice.
- The first formant frequencies, which are the fundamental frequencies of these frequency components, are f1, f2, and fs, respectively. These fundamental frequencies are generally constant for each speaker.
- The first formant frequency mentioned here is briefly explained below.
- The formants are named the first, second, third, ... formant in order from the lowest frequency. The first formant of speech uttered by the same speaker is almost constant, whether the speech is a single sound or a phrase composed of multiple sounds.
- As described above, the reference utterance frequency of a speaker's voice is determined by the shape and size of the speaker's throat.
- It therefore differs not only with the gender and age differences described above but from one individual to another, while remaining substantially constant for each individual speaker regardless of the content of the uttered words.
- That is, the first formant of a voice sequence is constant for each individual speaker.
- The pitch determiner 9 obtains the first formant of the speech uttered by the unspecified speaker and takes it as the fundamental frequency fi of that speaker's voice (hereinafter "input voice basic frequency fi"). The pitch determiner 9 then compares the input voice basic frequency fi with the basic frequency fs of the standard voice data (hereinafter "standard voice basic frequency fs").
- The pitch ratio CR of the input voice basic frequency fi to the standard voice basic frequency fs is calculated according to the following equation (2).
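Equation (2) is not reproduced in this text. A form consistent with the surrounding description (CR larger than 1 when the input voice is lower than the standard voice, smaller than 1 when it is higher) would be the following, stated here as an assumption. The period symbols P_i and P_s are introduced only for this illustration and denote the reciprocals of f_i and f_s.

```latex
CR = \frac{f_s}{f_i} = \frac{P_i}{P_s},
\qquad P_i = \frac{1}{f_i}, \quad P_s = \frac{1}{f_s}
```

For example, with f_s = 200 Hz and f_i = 100 Hz this form gives CR = 2, and with f_i = 400 Hz it gives CR = 0.5.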
- The first formant frequency is acoustically determined uniquely by the shape (length and thickness) of the speaker's throat.
- For a low-pitched male voice, the fundamental frequency fm of the voice is lower than the fundamental frequency fs of the standard voice, so the pitch ratio CR is larger than 1.
- Women and children with high-pitched voices have short, thin throats, so their basic frequency fc is higher than the basic frequency fs of the standard voice, and the pitch ratio CR is smaller than 1.
- The pitch determiner 9 generates a pitch conversion rate signal Scr indicating the value of the pitch ratio CR.
- Based on the pitch conversion rate signal Scr output from the pitch determiner 9, the read clock controller 11 generates the read clock Scc. By reading the digital audio signal SVd from the memory 3 at a timing that is CR times the sampling timing of SVd, the pitch-normalized digital audio signal SVc is generated.
- The memory 3 is a circular memory, commonly called a ring memory.
- When the pitch ratio CR is greater than 1, that is, when the pitch of the input voice (SVd) is low, the digital audio signal SVd is read from the memory 3 with a clock faster than the sampling clock to generate the pitch-normalized digital audio signal SVc.
- When the pitch ratio CR is smaller than 1, that is, when the pitch of the input voice (SVd) is high, the digital audio signal SVd is read out with a clock slower than the sampling clock to generate the pitch-normalized digital audio signal SVc.
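The read-clock idea can be sketched in Python under one stated assumption: CR = fs / fi, which matches the cases described (CR above 1 for a low input voice, below 1 for a high one). The stored samples are left untouched and only the rate at which they are read changes. All function names are hypothetical.

```python
def pitch_ratio(standard_hz: float, input_hz: float) -> float:
    """Assumed pitch ratio CR = fs / fi."""
    if standard_hz <= 0 or input_hz <= 0:
        raise ValueError("frequencies must be positive")
    return standard_hz / input_hz


def read_clock_hz(sampling_hz: float, cr: float) -> float:
    """Read clock that is CR times the sampling clock used at A/D conversion."""
    return sampling_hz * cr


def normalize_pitch(samples, sampling_hz, standard_hz, input_hz):
    """Return the unchanged samples together with the clock to read them at.

    No interpolation or thinning is applied; only the read rate differs
    from the rate at which the samples were written.
    """
    cr = pitch_ratio(standard_hz, input_hz)
    return list(samples), read_clock_hz(sampling_hz, cr)


def apparent_frequency_hz(original_hz, sampling_hz, clock_hz):
    """Frequency of a component once the samples are read at clock_hz."""
    return original_hz * clock_hz / sampling_hz


if __name__ == "__main__":
    stored = [0.0, 1.0, 0.0, -1.0]
    # Low input voice (100 Hz) against a 200 Hz standard: read faster.
    _, clock = normalize_pitch(stored, 8000.0, 200.0, 100.0)
    print(clock, apparent_frequency_hz(100.0, 8000.0, clock))
```

Reading faster also shortens the signal and reading slower lengthens it, so the duration of the utterance changes by a factor of 1 / CR in this sketch.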
- The pitch conversion process will now be described further.
- The horizontal axis represents time t, and the vertical axis represents voice intensity A.
- The waveform WS shows an example of the change over time of the voice waveform stored in the standard voice data storage unit 13.
- The waveform WL is a voice waveform with a lower pitch than the standard voice data (for example, a male voice), and the waveform WH is a voice waveform with a higher pitch than the standard voice data (for example, the voice of a woman or a child).
- One cycle of the waveforms WL, WS, and WH is denoted PL, PS, and PH, respectively.
- The periods PL and PH correspond to the reciprocal of the input voice basic frequency fi described above, and the period PS corresponds to the reciprocal of the standard voice basic frequency fs.
- To raise the pitch of the waveform WL to that of the waveform WS, the waveform is read out with a clock that is faster, by a factor of PL/PS, than the sampling clock used when the input voice waveform was A/D-converted.
- To lower the pitch of the waveform WH to that of the waveform WS, the waveform is read out with a clock that is PH/PS times the sampling clock used at A/D conversion, that is, a slower clock.
- The read clock is obtained by converting the sampling clock based on the pitch ratio CR defined by equation (2) above.
- In this way, a pitch-normalized digital voice signal SVc is obtained in which the pitch of the digital voice signal SVd has been converted to the pitch of the standard voice.
- When the pitch is raised, the time axis of the voice waveform becomes shorter, and when the pitch is lowered, the time axis of the voice waveform becomes longer. As a result, the speech rate changes.
- The speech rate can be adjusted by adding vowel waveforms when the pitch is raised and by thinning out vowel waveforms when the pitch is lowered.
- This technique is well known and is not the object of the present invention, so its description and illustration are omitted.
- The frequency conversion of the read clock can be achieved easily using the conventionally known technique of dividing a master clock.
- In step S2, the voice uttered by the unspecified speaker is input to the A/D converter 1 as the analog voice signal SVa through a device such as a microphone. The process then proceeds to step S4.
- In step S4, the A/D converter 1 sequentially A/D-converts the input analog audio signal SVa to generate the digital audio signal SVd, which is output to the memory 3.
- Steps S2 and S4 form subroutine #100, which receives the voice uttered by the speaker.
- In step S6, the read controller 5 monitors the input state of the memory 3 and determines whether the speaker's voice input (analog voice signal SVa) has been completed. This determination is made, for example, based on whether the interruption of the analog voice signal SVa input has lasted for a predetermined threshold time.
- Alternatively, the speaker may indicate to the voice recognition device VRAp or the input voice normalization device Tr, by appropriate means, that the input has been completed.
- If the speaker's utterance is continuing, the determination is No, the process returns to step S4, and generation of the digital voice signal SVd and its input to the memory 3 continue. When the input of the analog voice signal SVa for an independent voice sequence consisting of one or several sounds is complete, the determination is Yes and the process proceeds to step S8.
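The end-of-input decision in step S6 can be sketched as a trailing-silence check. This is only an illustration under stated assumptions: the amplitude threshold `silence_level`, the gap length `min_gap_s`, and the function name are hypothetical, since the text gives no specific values.

```python
def utterance_finished(samples, sample_rate_hz, silence_level=0.01, min_gap_s=0.3):
    """Return True when voiced input is followed by at least min_gap_s of silence.

    samples: digital audio stored so far (floats in the range -1.0 to 1.0)
    """
    needed = round(min_gap_s * sample_rate_hz)
    if needed <= 0 or len(samples) <= needed:
        return False
    head, tail = samples[:-needed], samples[-needed:]
    # Something must have been uttered before the pause.
    spoke = any(abs(x) >= silence_level for x in head)
    # The most recent stretch must be silent for the whole threshold period.
    paused = all(abs(x) < silence_level for x in tail)
    return spoke and paused


if __name__ == "__main__":
    rate = 100
    still_talking = [0.5] * 100 + [0.0] * 10
    done = [0.5] * 100 + [0.0] * 30
    print(utterance_finished(still_talking, rate), utterance_finished(done, rate))
```

A caller would run this check each time new samples are stored and move on to reading out the signal unit once it returns True.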
- In step S8, the read controller 5 reads out, from the digital audio signal SVd stored in the memory 3, the portion corresponding to the independent voice sequence as the digital audio signal unit SVu and outputs it to the frequency component analyzer 7.
- The digital audio signal unit SVu is the target of voice recognition by the voice recognition device VRAp.
- Steps S6 and S8 form the recognition target voice extraction subroutine #200, which extracts the voice to be recognized from the voice uttered by the speaker.
- In step S10, the frequency component analyzer 7 applies a fast Fourier transform to the digital audio signal unit SVu input from the memory 3 and analyzes its frequency spectrum (FIG. 2). The process then proceeds to step S12.
- In step S12, the frequency component analyzer 7 generates the frequency component signal Sfc as described above. The process then proceeds to step S14.
- In step S14, the frequency component analyzer 7 outputs the generated frequency component signal Sfc to the pitch determiner 9. The process then proceeds to step S16.
- Steps S10, S12, and S14 form the frequency spectrum analysis subroutine #300 for the digital audio signal unit SVu.
- In step S16, the pitch determiner 9 determines, based on the frequency component signal Sfc input from the frequency component analyzer 7, the first formant, which is the fundamental frequency of the input voice (digital audio signal unit SVu). The process then proceeds to step S18.
- In step S18, the pitch determiner 9 compares the first formant determined in step S16 with the first formant of the standard voice data stored in the standard voice data storage unit 13 and calculates the pitch ratio CR according to equation (2) above. The process then proceeds to step S20.
- In step S20, the pitch determiner 9 generates a pitch conversion rate signal Scr representing the pitch ratio CR and outputs it to the read clock controller 11. The process then proceeds to step S22.
- Steps S16, S18, and S20 form the pitch determination subroutine #400, which determines the pitch of the input voice relative to the standard voice.
- In step S22, based on the pitch conversion rate signal Scr output from the pitch determiner 9, the read clock controller 11 generates a read clock Scc that determines the timing at which the memory 3 is read. The process then proceeds to the next step S24.
- In step S24, the pitch-normalized digital voice signal SVc is read from the memory 3 based on the read clock Scc.
- Steps S22 and S24 described above form the pitch normalization subroutine #500 for the input voice.
- The pitch-normalized digital voice signal SVc generated through the processing of #200, #300, #400, and #500 is sent to the voice analyzer 15, where it is collated with the standard voice data stored in the standard voice data storage unit 13 and subjected to recognition processing.
- The voice analyzer 15 then generates and outputs a recognition signal Src indicating the recognition result.
- The fundamental frequency (first formant) detected in the pitch determination subroutine #400 can be obtained from a single sound, but the average value over the whole uttered word may be taken instead. This is because, as described above, whether the voice uttered by the speaker is a single sound or consists of several sounds, the first formant is generally constant for each speaker.
- For the digital voice signal that has been pitch-converted in this way (the pitch-normalized digital voice signal SVc), the voice analyzer 15 refers to the standard voice data storage unit 13, calculates the degree of coincidence between the voice frequency component patterns stored there for voice recognition and the frequency component pattern of the input voice, and thereby performs voice recognition.
- Because the pitch of the input voice uttered by an unspecified speaker is converted in advance to the pitch of the stored standard voice data, it is not necessary to hold multiple sets of standard voice data. The device can therefore cope with the wide frequency range of unspecified speakers, and the speech recognition rate can be improved.
- Note that the pitch of the standard voice data may instead be converted to match the pitch of the input voice (digital voice signal SVd).
- As described above, the apparatus of the present invention analyzes the frequency components of the input voice signal and converts the pitch of the input voice to that of the standard voice data used for voice recognition.
- This conversion improves the speech recognition rate despite differences in tone color between speakers, and eliminates the need to hold multiple sets of standard voice data, thus reducing the required memory capacity.
- This invention is intended for applications, such as television, that require recognition of speech uttered by an unspecified number of speakers.
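The flow of subroutines #300 to #500 described above (spectrum analysis, pitch determination, and re-reading the memory at a changed clock) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: all function names are hypothetical; the strongest spectral peak is taken as the fundamental frequency; equation (2) is not reproduced in this section, so the pitch ratio is assumed here to be CR = (standard fundamental) / (input fundamental); and linear interpolation over a Python list stands in for reading the memory 3 with the read clock Scc.

```python
import cmath


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fundamental_frequency(unit, sample_rate):
    """Sketch of S10 to S16: analyze the spectrum and pick the strongest peak.

    Assumption: the strongest non-DC peak is the fundamental. The unit is
    truncated to the largest power-of-two length (at least 4 samples needed).
    """
    n = 1 << (len(unit).bit_length() - 1)
    magnitudes = [abs(c) for c in fft(unit[:n])[1:n // 2]]  # skip the DC bin
    peak_bin = magnitudes.index(max(magnitudes)) + 1
    return peak_bin * sample_rate / n


def pitch_ratio(input_f0, standard_f0):
    """Sketch of S18. Assumed form of CR; equation (2) is not shown here."""
    return standard_f0 / input_f0


def read_with_clock(unit, ratio):
    """Sketch of S22 and S24: step through the stored samples at `ratio`
    samples per output sample, interpolating linearly between neighbors."""
    last = len(unit) - 1
    n_out = int(last / ratio) + 1
    out = []
    for i in range(n_out):
        pos = min(i * ratio, last)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(unit[lo] * (1.0 - frac) + unit[hi] * frac)
    return out


def normalize_pitch(unit, sample_rate, standard_f0):
    """Chain the three sketches: returns (normalized samples, CR)."""
    cr = pitch_ratio(fundamental_frequency(unit, sample_rate), standard_f0)
    return read_with_clock(unit, cr), cr
```

Under the assumed definition, an input whose fundamental is twice that of the standard voice gives CR = 0.5, so the stored samples are read out at half speed. Like any change of read clock, this also changes the duration of the unit along with its pitch.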
Landscapes
- Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Electrophonic Musical Instruments (AREA)
- Telephonic Communication Services (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP00925673A EP1102240A4 (en) | 1999-05-21 | 2000-05-16 | DEVICE FOR INTERVAL NORMALIZING AN INPUT SIGNAL FOR VOICE RECOGNITION |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP11/141838 | 1999-05-21 | | |
| JP14183899 | 1999-05-21 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2000072308A1 (en) | 2000-11-30 |
Family
ID=15301333
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2000/003113 Ceased WO2000072308A1 (en) | 1999-05-21 | 2000-05-16 | Interval normalization device for voice recognition input voice |
Country Status (4)
| Country | Link |
|---|---|
| EP (1) | EP1102240A4 (ja) |
| KR (1) | KR100423630B1 (ja) |
| CN (1) | CN1136538C (ja) |
| WO (1) | WO2000072308A1 (ja) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR100803894B1 (ko) * | 2001-05-17 | 2008-02-15 | 신세다이 가부시키 가이샤 | 음계 인식 방법 및 그 장치 |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1793370B1 (en) | 2001-08-31 | 2009-06-03 | Kabushiki Kaisha Kenwood | apparatus and method for creating pitch wave signals and apparatus and method for synthesizing speech signals using these pitch wave signals |
| CN100458914C (zh) * | 2004-11-01 | 2009-02-04 | 英业达股份有限公司 | 语音识别系统以及方法 |
| AU2006272451B2 (en) * | 2005-07-18 | 2010-10-14 | Diego Giuseppe Tognola | A signal process and system |
| EP1904816A4 (en) | 2005-07-18 | 2014-12-24 | Diego Giuseppe Tognola | SIGNAL PROCESS AND SYSTEM |
| JP4882899B2 (ja) * | 2007-07-25 | 2012-02-22 | ソニー株式会社 | 音声解析装置、および音声解析方法、並びにコンピュータ・プログラム |
| KR101674597B1 (ko) * | 2014-03-28 | 2016-11-22 | 세종대학교산학협력단 | 음성 인식 시스템 및 방법 |
| CN107895579B (zh) * | 2018-01-02 | 2021-08-17 | 联想(北京)有限公司 | 一种语音识别方法及系统 |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS46205B1 (ja) * | 1966-03-24 | 1971-01-06 | ||
| JPH02275999A (ja) * | 1989-04-18 | 1990-11-09 | Oki Electric Ind Co Ltd | 音紋の照合方法 |
| JPH02275997A (ja) * | 1989-04-18 | 1990-11-09 | Oki Electric Ind Co Ltd | 音紋照合方法における測定音紋変換処理方法 |
| EP0290190B1 (en) * | 1987-04-30 | 1991-10-09 | Oki Electric Industry Company, Limited | Pattern matching system |
| JPH04102900A (ja) * | 1990-08-22 | 1992-04-03 | Matsushita Electric Ind Co Ltd | 音程変換装置 |
| JPH06214596A (ja) * | 1993-01-14 | 1994-08-05 | Ricoh Co Ltd | 音声認識装置および話者適応化方法 |
| EP0390037B1 (en) * | 1989-03-27 | 1994-08-10 | Matsushita Electric Industrial Co., Ltd. | Pitch shift apparatus |
| JPH09325798A (ja) * | 1996-06-06 | 1997-12-16 | Matsushita Electric Ind Co Ltd | 音声認識装置 |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5839099A (en) * | 1996-06-11 | 1998-11-17 | Guvolt, Inc. | Signal conditioning apparatus |
2000
- 2000-05-16 EP EP00925673A patent/EP1102240A4/en not_active Ceased
- 2000-05-16 WO PCT/JP2000/003113 patent/WO2000072308A1/ja not_active Ceased
- 2000-05-16 CN CNB00800952XA patent/CN1136538C/zh not_active Expired - Fee Related
- 2000-05-16 KR KR10-2001-7000649A patent/KR100423630B1/ko not_active Expired - Fee Related
Non-Patent Citations (2)
| Title |
|---|
| See also references of EP1102240A4 * |
| SEICHI NAKAGAWA ET AL.: "Spoken word recognition based on normalization of speaker differences spectra", IEICE TECHNICAL REPORT (AUTOMATON), vol. 79, no. 200, 20 December 1979 (1979-12-20), pages 79 - 86, AL79-78, XP002933260 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP1102240A1 (en) | 2001-05-23 |
| CN1310839A (zh) | 2001-08-29 |
| KR100423630B1 (ko) | 2004-03-22 |
| CN1136538C (zh) | 2004-01-28 |
| KR20010053542A (ko) | 2001-06-25 |
| EP1102240A4 (en) | 2001-10-10 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| JP3180655B2 (ja) | パターンマッチングによる単語音声認識方法及びその方法を実施する装置 | |
| KR100629669B1 (ko) | 분산 음성인식 시스템 | |
| NL8300718A (nl) | Werkwijze en inrichting voor herkenning van een foneem in een stemsignaal. | |
| JPH0990974A (ja) | 信号処理方法 | |
| WO2013002674A1 (ru) | Система и способ распознавания речи | |
| JP2018040982A (ja) | 発話区間検出装置、発話区間検出方法及び発話区間検出用コンピュータプログラム | |
| JP2002536691A (ja) | 音声認識除去方式 | |
| Magre et al. | A comparative study on feature extraction techniques in speech recognition | |
| US20020065649A1 (en) | Mel-frequency linear prediction speech recognition apparatus and method | |
| WO2007046267A1 (ja) | 音声判別システム、音声判別方法及び音声判別用プログラム | |
| WO2000072308A1 (en) | Interval normalization device for voice recognition input voice | |
| JP2016042152A (ja) | 音声認識装置及びプログラム | |
| JP2002236494A (ja) | 音声区間判別装置、音声認識装置、プログラム及び記録媒体 | |
| JP2019032400A (ja) | 発話判定プログラム、発話判定方法、及び発話判定装置 | |
| JP2001042889A (ja) | 音声認識入力音声の音程正規化装置 | |
| JP3354252B2 (ja) | 音声認識装置 | |
| JP2002189487A (ja) | 音声認識装置および音声認識方法 | |
| JP4328423B2 (ja) | 音声識別装置 | |
| JP2004341340A (ja) | 話者認識装置 | |
| JP2009058548A (ja) | 音声検索装置 | |
| JPH0345839B2 (ja) | ||
| JP2004139049A (ja) | 話者正規化方法及びそれを用いた音声認識装置 | |
| JPH11338492A (ja) | 話者認識装置 | |
| Kim et al. | Speech/music discrimination using mel-cepstrum modulation energy | |
| KR100322704B1 (ko) | 음성신호의지속시간변경방법 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | WIPO information: entry into national phase | Ref document number: 00800952.X; Country of ref document: CN |
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): CN KR US |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
| | WWE | WIPO information: entry into national phase | Ref document number: 2000925673; Country of ref document: EP |
| | WWE | WIPO information: entry into national phase | Ref document number: 09743578; Country of ref document: US |
| | WWE | WIPO information: entry into national phase | Ref document number: 1020017000649; Country of ref document: KR |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
| | WWP | WIPO information: published in national office | Ref document number: 2000925673; Country of ref document: EP |
| | WWP | WIPO information: published in national office | Ref document number: 1020017000649; Country of ref document: KR |
| | WWR | WIPO information: refused in national office | Ref document number: 1020017000649; Country of ref document: KR |
| | WWR | WIPO information: refused in national office | Ref document number: 2000925673; Country of ref document: EP |
| | WWW | WIPO information: withdrawn in national office | Ref document number: 2000925673; Country of ref document: EP |