WO2022134894A1 - Speech recognition method, apparatus, computer device, and storage medium
- Publication number
- WO2022134894A1 (PCT/CN2021/129223)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phoneme
- speech
- probability
- recognition result
- speech frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the present application relates to the technical field of speech recognition, and in particular, to a speech recognition method, apparatus, computer equipment and storage medium.
- Speech recognition is a technology that recognizes speech as text, which has a wide range of applications in various artificial intelligence (Artificial Intelligence, AI) scenarios.
- the speech recognition framework usually includes an acoustic model part and a decoding part, wherein the acoustic model part is used to recognize the phonemes of each speech frame in the input speech signal, and the decoding part outputs the text sequence of the speech signal through the recognized phonemes of each speech frame.
- the Recurrent Neural Network Transducer (RNN-T) model introduces the concept of empty output in the phoneme recognition process, that is, it can predict that a certain speech frame does not contain a valid phoneme.
- in some application scenarios, the introduction of empty output increases the error rate of the subsequent decoding process, in particular the number of deletion errors, which affects the accuracy of speech recognition.
- a speech recognition method executed by computer equipment, the method comprising:
- the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains each phoneme and empty output;
- the phoneme recognition results corresponding to the adjusted speech frames are input into a decoding map to obtain a recognized text sequence corresponding to the speech signal, and the decoding map includes the mapping relationship between characters and phonemes.
- a voice recognition device comprising:
- a speech signal processing module configured to perform phoneme recognition on the speech signal, and obtain a phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space;
- the phoneme space contains each phoneme and empty output;
- a probability adjustment module configured to suppress and adjust the probability of empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of empty output in the phoneme recognition result to the probability of each phoneme;
- a decoding module configured to input the phoneme recognition results corresponding to the adjusted speech frames into a decoding map to obtain a recognized text sequence corresponding to the speech signal, and the decoding map includes the mapping relationship between characters and phonemes.
- a speech recognition method comprising:
- the voice signal including each voice frame obtained by segmenting the original voice
- the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space includes each phoneme and empty output; and
- among the phoneme recognition results corresponding to the respective speech frames, the phoneme recognition results whose probability of empty output satisfies a specified condition are input into the decoding map to obtain the recognized text sequence corresponding to the speech signal, and the decoding map includes the mapping relationship between characters and phonemes.
- a voice recognition device comprising:
- a voice signal acquisition module configured to acquire a voice signal, the voice signal including each voice frame obtained by dividing the original voice
- the phoneme recognition result obtaining module is used to perform phoneme recognition on the speech signal, and obtain the phoneme recognition result corresponding to each speech frame; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme The space contains individual phonemes and empty outputs; and
- the recognition text sequence acquisition module is used to input the phoneme recognition results whose probability of empty output satisfies the specified condition in the phoneme recognition results corresponding to the respective speech frames into the decoding map, and obtain the recognition text sequence corresponding to the speech signal , the decoding map includes the mapping relationship between characters and phonemes.
- a computer device comprising a processor and a memory, wherein at least one computer instruction is stored in the memory, and the at least one computer instruction is loaded and executed by the processor to implement the above-mentioned speech recognition method.
- a computer-readable storage medium where at least one computer instruction is stored, the at least one computer instruction is loaded and executed by a processor to implement the above speech recognition method.
- a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
- the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the computer program method described above.
- FIG. 1 is a system configuration diagram of a speech recognition system involved in various embodiments of the present application.
- FIG. 2 is a schematic flowchart of a speech recognition method according to an exemplary embodiment
- FIG. 3 is a schematic flowchart of a speech recognition method according to an exemplary embodiment
- FIG. 4 is a schematic diagram of an alignment process involved in the embodiment shown in FIG. 3;
- FIG. 5 is a schematic structural diagram of an acoustic model involved in the embodiment shown in FIG. 3;
- Fig. 6 is the network structure diagram of the predictor involved in the embodiment shown in Fig. 3;
- Fig. 7 is the model training and application flow chart involved in the embodiment shown in Fig. 3;
- FIG. 8 is a frame diagram of a speech recognition system according to an exemplary embodiment
- FIG. 9 is a block diagram showing the structure of an apparatus for labeling objects in a video according to an exemplary embodiment
- Fig. 10 is a structural block diagram of a computer device according to an exemplary embodiment.
- AI is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
- the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- the key technologies of speech technology include Automatic Speech Recognition (ASR), Text To Speech (TTS), and voiceprint recognition technology. Making computers able to hear, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the most promising human-computer interaction methods.
- Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance.
- Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence.
- Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other techniques.
- Referring to FIG. 1, a system structure diagram of a speech recognition system involved in various embodiments of the present application is shown. As shown in FIG. 1, the system includes a sound collection component 120 and a speech recognition device 140.
- the sound collection component 120 and the speech recognition device 140 are connected in a wired or wireless manner.
- the sound collection component 120 may be implemented as a microphone, a microphone array, or a pickup, or the like.
- the sound collecting component 120 is used for collecting voice data when the user speaks.
- the speech recognition device 140 is used for recognizing the speech data collected by the sound collection component 120 to obtain the recognized text sequence.
- the speech recognition device 140 may also perform natural semantic processing on the recognized text sequence to respond to the user's speech.
- the sound collection component 120 and the speech recognition device 140 may be implemented as two independent hardware devices.
- for example, the sound collection component 120 is a microphone arranged on the steering wheel of a vehicle, and the speech recognition device 140 may be an in-vehicle smart device; or, the sound collection component 120 is a microphone arranged on a remote control, and the speech recognition device 140 may be a smart home device controlled by the remote control (such as a smart TV, set-top box, or air conditioner).
- the sound collection component 120 and the speech recognition device 140 may be implemented as the same hardware device.
- the speech recognition device 140 may be a smart device such as a smart phone, a tablet computer, a smart watch, and smart glasses, and the sound collection component 120 may be a microphone built in the speech recognition device 140 .
- the speech recognition system described above may also include a server 160 .
- the server 160 may be used to deploy and update the speech recognition model in the speech recognition device 140 .
- the server 160 may also provide the cloud speech recognition service to the speech recognition device 140, that is, receive the speech data sent by the speech recognition device 140, perform speech recognition on the speech data, and return the recognition result to the speech recognition device 140.
- the server 160 may also cooperate with the speech recognition device 140 to complete operations such as recognizing the speech data and responding to the speech data.
- the server 160 is a server, or consists of several servers, or a virtualization platform, or a cloud computing service center.
- the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
- the server 160 and the speech recognition device 140 are connected through a communication network.
- the communication network is a wired network or a wireless network.
- the system may further include a management device (not shown in FIG. 1 ), and the management device and the server 160 are connected through a communication network.
- the communication network is a wired network or a wireless network.
- the above-mentioned wireless network or wired network uses standard communication technologies and/or protocols.
- the network is usually the Internet, but can be any network, including but not limited to any combination of a Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), mobile, wired or wireless network, private network, or virtual private network.
- data exchanged over the network is represented using technologies and/or formats including Hyper Text Mark-up Language (HTML), Extensible Markup Language (XML), and the like.
- you can also use services such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec) and other conventional encryption techniques to encrypt all or some of the links.
- custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
- the computer device may be the speech recognition device 140 or the server 160 in the system shown in FIG. 1 , or the computer device may include both the speech recognition device 140 and the server 160 in the system shown in FIG. 1 .
- the speech recognition method may include the following steps:
- Step 21 perform phoneme recognition on the speech signal, and obtain the phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains each phoneme and empty output;
- the phoneme recognition result may be a result obtained by performing phoneme recognition on the speech signal through an acoustic model.
- the acoustic model is obtained by training the speech signal samples and the actual phonemes of each speech frame in the speech signal samples.
- a phoneme is the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the pronunciation actions in a syllable, and one action constitutes one phoneme. Phonemes are divided into vowels and consonants. For example, the Chinese syllable ah (ā) has only one phoneme, love (ài) has two phonemes, and dai (dài) has three phonemes.
- a phoneme is the smallest unit or the smallest speech segment that constitutes a syllable, and is the smallest linear unit of speech divided from the perspective of sound quality.
- Phonemes are physical phenomena that exist concretely.
- the phonetic symbols of the International Phonetic Alphabet (the alphabet developed by the International Phonetic Society and used to uniformly indicate the pronunciation of various countries. Also known as "International Phonetic Alphabet” and “Universal Phonetic Alphabet”) correspond one-to-one with the phonemes of all human languages.
- the number of empty outputs included in the phoneme space may be greater than or equal to 1, e.g., one empty output.
- the acoustic model can identify the phoneme corresponding to the speech frame, and obtain the probability that the phoneme of the speech frame belongs to each preset phoneme and an empty output.
- for example, the above phoneme space includes 212 phonemes and an empty output (indicating that the corresponding speech frame has no user pronunciation); that is, for an input speech frame, the acoustic model shown in the embodiments of the present application can output the probabilities that the speech frame corresponds to each of the 212 phonemes and to the empty output.
- Step 22 Suppress and adjust the probability of empty output in the phoneme recognition result corresponding to each speech frame to reduce the ratio of the probability of empty output in the phoneme recognition result to the probability of each phoneme.
- Step 23 Input the adjusted phoneme recognition result corresponding to each speech frame into a decoding map to obtain a recognized text sequence corresponding to the speech signal.
- the decoding map is used to determine the phoneme corresponding to the speech frame based on the phoneme recognition result.
- the decoding map may include a mapping relationship between characters and phonemes, and a character may be a Chinese character or a word.
- after the phoneme recognition result is input into the decoding map, the decoding map determines, according to the probabilities of each phoneme and of the empty output in the phoneme recognition result, whether the result corresponds to a certain phoneme or to the empty output, and the corresponding text is determined from the determined phoneme. If the phoneme recognition result corresponds to the empty output, the speech frame corresponding to the phoneme recognition result is determined not to contain user pronunciation, that is, it has no corresponding text.
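The decoding step described above can be pictured with a deliberately simplified sketch. It assumes a per-frame greedy choice and a toy dictionary in place of a real decoding map; the names `PHONEMES`, `LEXICON`, `best_label`, and `decode`, and the five-entry phoneme space, are hypothetical and chosen only for illustration.

```python
EMPTY = "<empty>"
# Hypothetical miniature phoneme space: index 0 is the empty output.
PHONEMES = [EMPTY, "n", "i", "h", "ao"]
# Hypothetical stand-in for the decoding map: phoneme sequence -> character.
LEXICON = {("n", "i"): "你", ("h", "ao"): "好"}


def best_label(frame_probs):
    """Pick the entry of the phoneme space with the highest probability."""
    index = max(range(len(frame_probs)), key=frame_probs.__getitem__)
    return PHONEMES[index]


def decode(frames):
    """Drop frames decoded as empty output, then map phonemes to characters."""
    phonemes = [label for label in map(best_label, frames) if label != EMPTY]
    text, pending = [], []
    for phoneme in phonemes:
        pending.append(phoneme)
        character = LEXICON.get(tuple(pending))
        if character is not None:
            text.append(character)
            pending = []
    return "".join(text)


frames = [
    [0.10, 0.70, 0.10, 0.05, 0.05],  # "n"
    [0.80, 0.05, 0.05, 0.05, 0.05],  # empty output: no pronunciation
    [0.10, 0.10, 0.70, 0.05, 0.05],  # "i"
    [0.10, 0.05, 0.05, 0.70, 0.10],  # "h"
    [0.10, 0.05, 0.05, 0.10, 0.70],  # "ao"
]
print(decode(frames))
```

In this sketch a frame whose empty-output probability wrongly wins simply disappears from the phoneme sequence, which is how a deletion error arises.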
- with empty output in the phoneme space, the recognition error rate may increase. For example, a speech frame that contains pronunciation may be mistakenly recognized as an empty output (this situation is also called a deletion error), thereby affecting the accuracy of speech recognition.
- therefore, the probability of empty output in the phoneme recognition result is suppressed. Once that probability is suppressed, the possibility that the phoneme recognition result is recognized as a certain phoneme increases, which can effectively reduce the cases in which speech frames with pronunciation are mistakenly recognized as empty output.
- the probability of empty output is suppressed to reduce the probability of speech frame being recognized as empty output, thereby reducing the possibility of speech frame being mistakenly recognized as empty output, that is, reducing the deletion error of the model, thereby improving the recognition accuracy of the model.
- the speech recognition method can be executed by a computer device.
- the computer device may be the speech recognition device 140 or the server 160 in the system shown in FIG. 1 , or the computer device may include both the speech recognition device 140 and the server 160 in the system shown in FIG. 1 .
- the speech recognition method may include the following steps:
- Step 301 Acquire a voice signal, where the voice signal includes each voice frame obtained by segmenting the original voice.
- after the sound collection component collects the original voice while the user is speaking, it sends the collected original voice to a computer device, for example, to a voice recognition device, and the voice recognition device divides the original voice to obtain several speech frames.
- the speech recognition device may segment the original speech into overlapping short-term speech segments. For example, for speech with a sampling rate of 16 kHz, each segment is 25 ms long and adjacent frames overlap by 15 ms. This process is also called "framing".
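The framing arithmetic can be sketched as follows. The function name `frame_signal` and the choice to drop a trailing partial frame are assumptions made for this illustration; the 16 kHz rate, 25 ms length, and 15 ms overlap come from the example above.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, overlap_ms=15):
    """Split a 1-D sequence of samples into overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000               # 400 samples
    hop_len = sample_rate * (frame_ms - overlap_ms) // 1000  # 160 samples (10 ms shift)
    frames = []
    start = 0
    # Assumption: a trailing partial frame is dropped rather than padded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # 98 400
```

A 15 ms overlap on 25 ms frames means consecutive frames start 10 ms apart, so one second of audio yields 98 full frames.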
- Step 302 Perform phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal.
- the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space;
- the phoneme space includes each phoneme and an empty output;
- the acoustic model is obtained by training on speech signal samples and the actual phoneme of each speech frame in the speech signal samples.
- the acoustic model is an end-to-end machine learning model whose input data includes a speech frame in a speech signal (for example, a feature vector of the speech frame) and whose output data is the predicted probability distribution of the phoneme of the speech frame in the phoneme space, that is, the phoneme recognition result.
- the above phoneme recognition result can be expressed as a probability vector as shown below:
- $(p_0, p_1, p_2, \ldots, p_{212})$
- $p_0$ represents the probability that the speech frame is an empty output
- $p_1$ represents the probability that the speech frame corresponds to the first phoneme.
- the entire phoneme space contains 212 phonemes, plus an empty output.
- phoneme recognition is performed on the speech signal, and phoneme recognition results corresponding to each speech frame in the speech signal are obtained, including:
- the feature extraction is performed on the target speech frame through the trained acoustic model to obtain the feature vector of the target speech frame;
- the target speech frame is any one of the respective speech frames;
- the acoustic hidden layer representation vector of the target speech frame and the text hidden layer representation vector of the target speech frame are input into the joint network to obtain the phoneme recognition result of the target speech frame.
- the above acoustic model may be implemented by using a transducer (Transducer) model.
- the Transducer model is introduced as follows:
- $\mathcal{X}^*$ represents the set of all input sequences
- $\mathcal{Y}^*$ represents the set of all output sequences
- $x_t \in \mathcal{X}$ and $y_u \in \mathcal{Y}$ are real vectors
- $\mathcal{X}$ and $\mathcal{Y}$ represent the input and output spaces, respectively.
- when the Transducer model is used for phoneme recognition
- the input sequence $x$ is a sequence of feature vectors, such as filter bank (Filter Bank, FBank) features or Mel Frequency Cepstrum Coefficient (MFCC) features
- $x_t$ represents the feature vector at time $t$
- the output sequence $y$ is the phoneme sequence
- $y_u$ represents the phoneme of the $u$-th step.
- an extended output space $\bar{\mathcal{Y}} = \mathcal{Y} \cup \{\varnothing\}$ is defined, where $\varnothing$ indicates the empty output symbol, which means that the model has no output.
- a sequence in $\bar{\mathcal{Y}}^*$ made up of $y_1, y_2, y_3$ interleaved with empty output symbols is equivalent to $(y_1, y_2, y_3) \in \mathcal{Y}^*$.
- with the empty output symbols inserted, the output sequence has the same length as the input sequence, so the elements $a \in \bar{\mathcal{Y}}^*$ are also called "alignments".
- the Transducer model defines a conditional distribution $P(a \mid x)$ over alignments. This conditional distribution is used to calculate the probability of outputting sequence $y$ given input sequence $x$:
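As a sketch under the standard transducer formulation (an assumption, since the formula itself does not appear in this text), the probability of an output sequence is the sum over all alignments that reduce to it. Here $a$ is an alignment over the extended output space, and $\mathcal{B}$ is notation introduced for this sketch: the mapping that removes empty output symbols from an alignment.

$$
P(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x)
$$

Read this way, training that maximizes "the sum of the probabilities on all possible alignment paths" maximizes the right-hand side for the reference phoneme sequence.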
- Referring to FIG. 5, a schematic structural diagram of an acoustic model involved in an embodiment of the present application is shown. As shown in FIG. 5, the acoustic model includes an encoder 51, a predictor 52, and a joint network 53.
- the encoder 51 can be a recurrent neural network, such as a Long Short-Term Memory (LSTM) network, which accepts the audio feature input at time t and outputs the acoustic hidden layer representation.
- the predictor 52 can be a recurrent neural network, such as an LSTM, which accepts the non-empty output labels from the model's history and outputs a text hidden layer representation.
- the joint network 53 can be a fully connected neural network, such as a linear layer plus an activation unit, which combines the acoustic hidden layer representation and the text hidden layer representation: after linear transformation and summation, it outputs the hidden unit representation $z_i$, which is finally converted into a probability distribution through a softmax function.
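The joint network can be sketched in plain Python as below. This is an illustration under assumptions, not the patented implementation: the tanh activation, the separate output projection `w_out`, the layer sizes, and the random weights are all hypothetical; only the "linear transform, sum, softmax" pattern and the 213-entry phoneme space come from the text.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def joint_network(acoustic, text, w_acoustic, w_text, bias, w_out):
    """Combine acoustic and text hidden vectors into a phoneme distribution."""
    projected_a = matvec(w_acoustic, acoustic)  # linear transform of encoder output
    projected_t = matvec(w_text, text)          # linear transform of predictor output
    # Sum the two projections, then apply the (assumed) tanh activation.
    hidden = [math.tanh(a + t + b) for a, t, b in zip(projected_a, projected_t, bias)]
    z = matvec(w_out, hidden)                   # one score per phoneme-space entry
    return softmax(z)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
ACOUSTIC_DIM, TEXT_DIM, JOINT_DIM = 8, 6, 5   # hypothetical sizes
PHONEME_SPACE = 213                           # 212 phonemes plus the empty output
acoustic = [rng.uniform(-1, 1) for _ in range(ACOUSTIC_DIM)]
text = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]
probs = joint_network(
    acoustic,
    text,
    random_matrix(JOINT_DIM, ACOUSTIC_DIM, rng),
    random_matrix(JOINT_DIM, TEXT_DIM, rng),
    [0.0] * JOINT_DIM,
    random_matrix(PHONEME_SPACE, JOINT_DIM, rng),
)
print(len(probs), round(sum(probs), 6))
```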
- the encoder is Feedforward Sequential Memory Networks (FSMN).
- the predictor is a one-dimensional convolutional network.
- the solutions shown in the embodiments of the present application can be applied to scenarios with limited computing capabilities, such as a vehicle-mounted offline speech recognition system.
- In-vehicle equipment has high requirements for model parameters and calculation, and the computing power of the Central Processing Unit (CPU) is limited. Therefore, the requirements for model parameters and model structure are relatively high.
- the scheme shown in this application uses the fully feedforward neural network FSMN as the Encoder of the model, and uses a one-dimensional convolutional network, in place of the commonly used Long Short-Term Memory (LSTM) network, as the Predictor.
- the Encoder and Predictor networks generally use a Recurrent Neural Network (RNN) structure, such as an LSTM or a Gated Recurrent Unit (GRU).
- this scheme uses an FSMN-based Encoder and a one-dimensional convolution-based Predictor network.
- on the one hand, model parameters can be compressed; on the other hand, computing resources can be greatly saved, computing speed improved, and the real-time performance of speech recognition ensured.
- the Encoder structure based on FSMN is adopted.
- FSMN networks are applied to large vocabulary speech recognition tasks.
- the FSMN structure used in this scheme can be a structure with projection layers and residual connections.
- a one-dimensional convolutional network is used in this scheme to generate the current output according to the limited historical prediction output.
- FIG. 6 shows a network structure diagram of the predictor involved in the embodiment of the present application.
- the Predictor network uses 4 non-empty historical outputs to predict the frame of the current output. That is, after the four non-empty historical outputs 61 corresponding to the current input are subjected to vector mapping, they are input into the one-dimensional convolutional network 62 to obtain the text hidden layer representation vector.
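A minimal sketch of such a predictor follows, assuming a single convolution kernel that spans all four history positions, so that it yields one output vector. The embedding size, output size, random weights, and the use of label 0 as the preset padding label are hypothetical choices for illustration.

```python
import random

CONTEXT = 4        # number of non-empty historical outputs, from the text
NUM_LABELS = 213   # 212 phonemes plus one extra index
EMB_DIM = 8        # hypothetical embedding size
OUT_DIM = 6        # hypothetical size of the text hidden layer representation
PAD = 0            # hypothetical preset label used when history is too short

rng = random.Random(0)
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
             for _ in range(NUM_LABELS)]
# kernel[o][k][e]: weight for output channel o, history position k, embedding dim e.
kernel = [[[rng.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
           for _ in range(CONTEXT)]
          for _ in range(OUT_DIM)]
bias = [0.0] * OUT_DIM


def predictor(history):
    """Map the last CONTEXT non-empty labels to a text hidden vector."""
    recent = list(history)[-CONTEXT:]
    padded = [PAD] * (CONTEXT - len(recent)) + recent
    vectors = [embedding[label] for label in padded]  # vector mapping
    output = []
    for o in range(OUT_DIM):
        acc = bias[o]
        for k in range(CONTEXT):  # 1-D convolution across history positions
            acc += sum(w * v for w, v in zip(kernel[o][k], vectors[k]))
        output.append(acc)
    return output


print(len(predictor([5, 9, 17, 3])))
```

Because only the last four labels are read, the cost per step is fixed and does not depend on how long the utterance is, unlike a recurrent predictor that carries state across the whole history.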
- the above acoustic model may be obtained by training a preset speech sample and actual phonemes of each speech frame in the speech signal sample.
- during training, a speech frame in the speech sample is input into the FSMN-based Encoder network in the acoustic model, and the actual phonemes of the 4 non-empty speech frames preceding that speech frame are input into the Predictor network based on one-dimensional convolution (when there are no historical non-empty speech frames, or not enough of them, preset phonemes can be used instead).
- as the acoustic model processes the input data, the parameters of its three parts are updated to maximize the sum of the probabilities on all possible alignment paths, that is, the result of the above formula (2), thereby realizing the training of the acoustic model.
- Step 303 Suppress and adjust the probability of empty output in the phoneme recognition result corresponding to each speech frame to reduce the ratio of the probability of empty output in the phoneme recognition result to the probability of each phoneme.
- the suppression adjustment of the probability of empty output in the phoneme recognition result corresponding to each speech frame includes:
- the phoneme recognition result corresponding to each speech frame is adjusted by at least one of the following adjustment methods:
- reducing the probability of empty output in the phoneme recognition result corresponding to each speech frame includes:
- the probability of empty output in the phoneme recognition result corresponding to each speech frame is multiplied by a first weight, where the first weight is less than 1 and greater than 0.
- when suppressing the probability of empty output in the phoneme recognition result, the computer device may reduce only the probability of empty output in the phoneme recognition result.
- for example, the probability of empty output in the phoneme recognition result is multiplied by a number between 0 and 1. In this way, with the probability of each phoneme in the phoneme recognition result unchanged, the ratio between the probability of empty output and the probability of each phoneme is reduced.
- reducing the probability of empty output in the phoneme recognition result corresponding to each speech frame includes:
- the probability of each phoneme in the phoneme recognition result corresponding to each speech frame is multiplied by a second weight, where the second weight is greater than 1.
- to suppress the probability of the empty output in the phoneme recognition result, the computer device may instead increase only the probability of each phoneme, for example by multiplying it by a number greater than 1. With the probability of the empty output in the phoneme recognition result unchanged, this also reduces the ratio between the probability of the empty output and the probability of each phoneme.
- the computer device may also increase the probability of each phoneme in the phoneme recognition result while reducing the probability of null output in the phoneme recognition result. For example, the probability of a null output in the phoneme recognition result is multiplied by a number between 0 and 1, while the probability of each phoneme in the phoneme recognition result is multiplied by a number greater than 1.
- the first weight or the second weight is preset in the computer device by the developer or the administrator.
- alternatively, the first weight or the second weight can be preset in the speech recognition model by the developer.
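The two adjustment methods of step 303 can be sketched in a few lines. This is only an illustration under assumptions: a frame's phoneme recognition result is represented as a dictionary from symbol to probability, `<blank>` is a hypothetical name for the empty output, and a weight of 1.0 is used here to mean "this method is not applied". The adjusted values are scores and are not renormalized.

```python
from typing import Dict

BLANK = "<blank>"  # hypothetical symbol for the empty output


def suppress_blank(probs: Dict[str, float],
                   first_weight: float = 1.0,
                   second_weight: float = 1.0) -> Dict[str, float]:
    """Scale the empty output by first_weight and every phoneme by second_weight."""
    if not 0.0 < first_weight <= 1.0:
        raise ValueError("first_weight must be in (0, 1]")
    if second_weight < 1.0:
        raise ValueError("second_weight must be >= 1")
    return {
        sym: p * (first_weight if sym == BLANK else second_weight)
        for sym, p in probs.items()
    }


def blank_ratio(probs: Dict[str, float], phoneme: str) -> float:
    """Ratio of the empty-output probability to one phoneme's probability."""
    return probs[BLANK] / probs[phoneme]
```

For a frame such as `{"<blank>": 0.6, "a": 0.3, "b": 0.1}`, calling `suppress_blank(frame, first_weight=0.5)` halves the empty-output score while leaving the phoneme scores unchanged, so `blank_ratio` drops for every phoneme.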
- Step 304: among the phoneme recognition results corresponding to the speech frames, input those whose probability of empty output satisfies the specified condition into the decoding map, and obtain the recognized text sequence corresponding to the speech signal.
- the phoneme recognition results corresponding to the adjusted speech frames are input into a decoding map to obtain a recognized text sequence corresponding to the speech signal, including:
- the target phoneme recognition result is input into the decoding map, and the recognition text corresponding to the target phoneme recognition result is obtained;
- the target phoneme recognition result is any one of the phoneme recognition results corresponding to each speech frame.
- the specified conditions include:
- the probability of a null output in the target phoneme recognition result is less than the probability threshold.
- Input feature sequence (x 1 , x 2 , ..., x T ); empty output weight adjustment coefficient ⁇ blank ;
- the 6th line in the above algorithm has adjusted the weight
- the ⁇ blank in the algorithm is 1/ ⁇ in the formula (3)
- lines 13 to 17 of the above algorithm are the PSD algorithm proposed in this scheme; that is, only when the probability of the blank output is less than a certain threshold ⁇ blank does the probability distribution output by the network participate in the subsequent decoding on the decoding map.
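The frame-skipping condition described above can be sketched as a simple filter. The representation of a frame as a dictionary, the symbol `<blank>`, and the function name `frames_to_decode` are assumptions made for illustration; the original algorithm listing is not reproduced here.

```python
from typing import Dict, List, Tuple

BLANK = "<blank>"  # hypothetical symbol for the empty output


def frames_to_decode(frames: List[Dict[str, float]],
                     threshold: float) -> List[Tuple[int, Dict[str, float]]]:
    """Keep only frames whose empty-output probability is below the threshold.

    Returns (frame_index, distribution) pairs; all other frames are skipped
    and never reach the decoding map.
    """
    return [(i, f) for i, f in enumerate(frames) if f[BLANK] < threshold]
```

The comparison is strict, matching the stated condition that the probability of the empty output must be less than the probability threshold.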
- before the phoneme recognition results corresponding to the adjusted speech frames are input into the decoding map to obtain the recognized text sequence corresponding to the speech signal, the method further includes:
- acquiring a threshold influence parameter, where the threshold influence parameter includes at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information;
- determining the probability threshold based on the threshold influence parameter.
- the above probability threshold may also be adjusted by a computer device during the speech recognition process. That is to say, the computer device can acquire relevant parameters that may affect the value of the probability threshold, and flexibly set the probability threshold through the relevant parameters.
- the ambient sound may interfere with the voice made by the user. Therefore, when the ambient sound is strong, the computer device can set the probability threshold to a higher value, so that more phoneme recognition results are input into the decoding map for decoding, which ensures recognition accuracy; conversely, when the ambient sound is weak, the computer device can set the probability threshold to a lower value, so that more phoneme recognition results are skipped, which ensures recognition efficiency.
- the accuracy of decoding the phoneme recognition results with the decoding map affects the success rate of speech recognition. When the number of speech recognition failures within the specified time period is large, the computer device can set the probability threshold to a higher value, so that more phoneme recognition results are input into the decoding map for decoding, which ensures recognition accuracy; conversely, when the number of speech recognition failures within the specified time period is small or zero, the computer device can set the probability threshold to a lower value, so that more phoneme recognition results are skipped, which ensures recognition efficiency.
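One possible way to derive the probability threshold from the threshold influence parameters is sketched below. The text only states the direction of the adjustment (louder environment or more failures means a higher threshold); every number, the decibel unit, and the function name `probability_threshold` are hypothetical choices for illustration.

```python
def probability_threshold(ambient_db: float,
                          recent_failures: int,
                          user_offset: float = 0.0,
                          base: float = 0.5) -> float:
    """Map threshold influence parameters to a probability threshold.

    All constants are illustrative assumptions, not values from the source.
    """
    threshold = base
    if ambient_db > 60.0:        # noisy environment: decode more frames
        threshold += 0.2
    elif ambient_db < 40.0:      # quiet environment: skip more frames
        threshold -= 0.1
    # More recent recognition failures push the threshold up (capped).
    threshold += 0.05 * min(max(recent_failures, 0), 4)
    # User setting information applied as a direct offset.
    threshold += user_offset
    # Keep the result a usable probability bound.
    return min(max(threshold, 0.05), 0.99)
```

A higher return value lets more frames satisfy the "empty output probability less than threshold" condition, trading efficiency for accuracy.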
- the decoding graph is formed by composing a phoneme dictionary and a language model.
- the decoding graph used in this scheme is composed of two Weighted Finite State Transducer (WFST) sub-graphs: the phoneme dictionary and the language model.
- Language model WFST: this WFST is usually converted from an n-gram language model, which computes the probability that a sentence occurs and is trained on training data using statistical methods.
- texts in different fields, such as news texts and spoken dialogues, differ greatly in commonly used words and in collocations between words. Therefore, when performing speech recognition in different fields, the language model WFST can be replaced to achieve adaptation.
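To make the role of the n-gram language model concrete, the sketch below trains a bigram model by counting and scores a sentence. It is a simplified stand-in, not the WFST construction itself: add-one smoothing, the sentence boundary markers `<s>` and `</s>`, and the function names are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count history tokens and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])                 # tokens used as history
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams


def sentence_logprob(sentence: List[str], unigrams: Counter,
                     bigrams: Counter, vocab_size: int) -> float:
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logp
```

Training the counts on text from a different field changes which word sequences score highly, which is the adaptation effect described above; the acoustic model is not involved.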
- FIG. 7 shows a flow chart of model training and application involved in the embodiments of the present application.
- libtorch is used to quantize and deploy the model.
- the Android version of libtorch uses the QNNPACK library for INT8 matrix calculation, which greatly speeds up matrix operations.
- the model is trained in the Python environment 71 using PyTorch; after training, the model is quantized, that is, the model parameters are quantized to INT8 and INT8 matrix multiplication is used to speed up computation; the quantized model is then exported for forward inference in the C++ environment 72 and tested with test data.
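The idea behind INT8 quantization and integer matrix multiplication can be illustrated without any framework. The sketch below uses symmetric per-matrix quantization, which is an assumption chosen for simplicity; it does not reproduce the scheme that libtorch or QNNPACK actually applies, and the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
IntMatrix = List[List[int]]


def quantize(matrix: Matrix) -> Tuple[IntMatrix, float]:
    """Symmetric quantization of a float matrix to integers in [-127, 127]."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale


def int8_matmul(qa: IntMatrix, scale_a: float,
                qb: IntMatrix, scale_b: float) -> Matrix:
    """Multiply in the integer domain, then rescale the result to floats."""
    cols_b = list(zip(*qb))
    return [
        [sum(x * y for x, y in zip(row, col)) * scale_a * scale_b for col in cols_b]
        for row in qa
    ]
```

The accumulation happens entirely on small integers, and a single floating-point rescale per output element recovers an approximation of the original product.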
- the Transducer-based end-to-end model does not need frame-level alignment information during training, which greatly simplifies the modeling process; in addition, the decoding graph is simplified and the search space is reduced.
- because phoneme modeling is used, the decoding map only needs to be composed of L and G, so the search space is greatly reduced.
- phoneme modeling combined with a custom decoding map allows flexible customization. For different business scenarios, the acoustic model does not need to change; only the language model needs to be customized to fit the business scenario.
- the system model shown in this scheme still has a CPU occupancy rate similar to that of the DNN-HMM system model when its number of model parameters is 4 times that of the DNN-HMM system.
- the speech recognition rates are compared as follows:
- Table 1 below shows the character error rate (Character Error Rate, CER) comparison between the existing DNN-HMM system and the Transducer system proposed by this solution on three data sets.
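Character error rate is commonly computed as the character-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below follows that common definition; the source does not spell out its exact scoring procedure, so treat this as an assumption, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

A frame wrongly recognized as empty output typically shows up here as a deletion, which is why suppressing the empty output can lower the CER.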
- the probability of empty output is suppressed to reduce the probability of speech frame being recognized as empty output, thereby reducing the possibility of speech frame being mistakenly recognized as empty output, that is, reducing the deletion error of the model, thereby improving the recognition accuracy of the model.
- the null output weight adjustment (step 303) and the decoding frame skipping (step 304) can also be applied independently of each other.
- the solution shown in the present application may be as follows:
- a speech signal is acquired, the speech signal including the speech frames obtained by segmenting the original speech;
- phoneme recognition is performed on the speech signal, and the phoneme recognition result corresponding to each speech frame is obtained; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space includes each phoneme and an empty output;
- among the phoneme recognition results corresponding to the speech frames, those whose probability of empty output satisfies the specified condition are input into the decoding map, and the recognized text sequence corresponding to the speech signal is obtained.
- in this solution, only the phoneme recognition results whose probability of empty output satisfies the condition are decoded. This reduces the number of phoneme recognition results that need to be decoded and skips unnecessary decoding steps, thereby effectively improving speech recognition efficiency.
- FIG. 8 is a frame diagram of a speech recognition system according to an exemplary embodiment.
- the audio collection device 81 is connected to a speech recognition device 82, and the speech recognition device 82 includes an acoustic model 82a, a probability adjustment unit 82b, a decoding map input unit 82c, a decoding map 82d, and a feature extraction unit 82e.
- the decoding map 82d consists of a phoneme dictionary and a language model.
- the audio collection device 81 collects the original voice of the user and transmits it to the feature extraction unit 82e in the speech recognition device 82; the feature extraction unit 82e segments the original voice and extracts features for each speech frame. The speech features of a speech frame, and the phonemes of the text recognized by the decoding map 82d for the 4 non-empty speech frames preceding that speech frame, are then input into the FSMN and the one-dimensional convolutional network in the acoustic model 82a, respectively, to obtain the phoneme recognition result of the speech frame output by the acoustic model 82a.
- the phoneme recognition result is input to the probability adjustment unit 82b, which adjusts the probability of the empty output to obtain the adjusted phoneme recognition result. The adjusted phoneme recognition result is then judged by the decoding map input unit 82c: when the adjusted probability of the empty output is less than the threshold, decoding is required, and the decoding map input unit 82c inputs the adjusted phoneme recognition result into the decoding map 82d, which recognizes the text; conversely, when the adjusted probability of the empty output is not less than the threshold, decoding is not required, and the adjusted phoneme recognition result is discarded.
- the decoding map 82d recognizes the adjusted phoneme recognition results of the speech frames and outputs a text sequence.
- the text sequence can be output to the natural language processing component, and the natural language processing component responds to the voice input by the user.
- Fig. 9 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment.
- the speech recognition apparatus may implement all or part of the steps in the method provided by the embodiment shown in FIG. 2 or FIG. 3 .
- the speech recognition device may include:
- the speech signal processing module 901 is used to perform phoneme recognition on the speech signal, and obtain the phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains individual phonemes and empty outputs;
- the probability adjustment module 902 is used to suppress and adjust the probability of the empty output in the corresponding phoneme recognition result of each speech frame, to reduce the ratio of the probability of the empty output in the phoneme recognition result to the probability of each phoneme;
- the decoding module 903 is configured to input the phoneme recognition results corresponding to the adjusted speech frames into the decoding map to obtain the recognized text sequence corresponding to the speech signal, and the decoding map includes the mapping relationship between characters and phonemes.
- the probability adjustment module 902 is configured to adjust the phoneme recognition result corresponding to each speech frame by at least one of the following adjustment methods:
- the probability adjustment module 902 is configured to multiply the probability of the null output in the phoneme recognition result corresponding to each speech frame by a first weight, where the first weight is less than 1 and greater than 0.
- the probability adjustment module 902 is configured to multiply the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by a second weight, where the second weight is greater than 1.
- the decoding module 903 is used to:
- the target phoneme recognition result is any one of the phoneme recognition results corresponding to each speech frame.
- the specified conditions include:
- the probability of a null output in the target phoneme recognition result is less than the probability threshold.
- the apparatus further includes:
- a parameter acquisition module configured to acquire threshold influence parameters, where the threshold influence parameters include at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information;
- the threshold value determination module is used for determining the probability threshold value based on the threshold value influence parameter.
- the speech signal processing module 901 is used to:
- the feature extraction is performed on the target speech frame through the trained acoustic model to obtain the feature vector of the target speech frame;
- the target speech frame is any one of the individual speech frames;
- n is an integer greater than or equal to 1;
- the acoustic hidden layer representation vector of the target speech frame and the text hidden layer representation vector of the target speech frame are input into the joint network to obtain the phoneme recognition result of the target speech frame.
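How a joint network can turn the two hidden representation vectors into a distribution over the phoneme space is sketched below. The source does not specify the joint network's internal form, so the additive combination, the `tanh` nonlinearity, the single linear projection, and all names here are assumptions borrowed from common Transducer designs.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint(acoustic: List[float], text: List[float],
          weights: List[List[float]], bias: List[float]) -> List[float]:
    """Combine the acoustic and text hidden vectors into phoneme-space probabilities.

    `weights` has one row per output symbol (each phoneme plus the empty output).
    """
    hidden = [math.tanh(a + t) for a, t in zip(acoustic, text)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

The returned list is the per-frame phoneme recognition result: one probability per symbol in the phoneme space, summing to 1.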
- the encoder is a feedforward sequential memory network (FSMN).
- the predictor is a one-dimensional convolutional network.
- the decoding graph is formed by composing a phoneme dictionary and a language model.
- the probability of empty output is suppressed to reduce the probability of speech frame being recognized as empty output, thereby reducing the possibility of speech frame being mistakenly recognized as empty output, that is, reducing the deletion error of the model, thereby improving the recognition accuracy of the model.
- Fig. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment.
- the computer device may be implemented as the computer device in each of the above method embodiments.
- the computer device 1000 includes a central processing unit 1001, a system memory 1004 including a random access memory (Random Access Memory, RAM) 1002 and a read-only memory (Read-Only Memory, ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001.
- the computer device 1000 also includes a basic input/output system 1006 that facilitates the transfer of information between various components within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
- the mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005 .
- the mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000 . That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
- the computer-readable media can include computer storage media and communication media.
- Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media include RAM, ROM, flash memory, or other solid-state storage technology, CD-ROM, or other optical storage, tape cartridges, magnetic tape, magnetic disk storage, or other magnetic storage devices.
- the computer device 1000 can be connected to the Internet or other network devices through a network interface unit 1011 connected to the system bus 1005 .
- the memory also includes at least one computer instruction, which is stored in the memory, and the processor implements all or part of the steps of the method shown in FIG. 2 or FIG. 3 by loading and executing the at least one computer instruction.
- a non-transitory computer-readable storage medium including instructions is also provided, such as a memory including a computer program (instructions) executable by a processor of a computer device to carry out the methods shown in the embodiments of the present application.
- the non-transitory computer-readable storage medium may be Read-Only Memory (ROM), Random Access Memory (RAM), Compact Disc Read-Only Memory (CD-ROM), magnetic tapes, floppy disks, optical data storage devices, and the like.
- a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
- the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the methods shown in the foregoing embodiments.
- a computer program product comprising a computer program, is characterized in that, when the computer program is executed by a processor, the methods shown in the above embodiments are implemented.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
Description
| Model | Number of parameters | Test set 1 CER (%) | Test set 2 CER (%) |
|---|---|---|---|
| DNN-HMM | 0.7M | 14.88 | 19.77 |
| Transducer1 | 0.8M | 12.1 | 16.09 |
| Transducer2 | 1.9M | 9.76 | 13.4 |
| Transducer3 | 2.1M | 8.93 | 13.18 |

| Model | Number of parameters | CPU usage (peak) |
|---|---|---|
| DNN-HMM | 0.7M | 16% |
| Transducer1 | 0.8M | 18% |
| Transducer2 | 1.9M | 20% |
| Transducer3 | 2.1M | 20% |
Claims (20)
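- A speech recognition method, characterized in that it is performed by a computer device, the method comprising: performing phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal, the phoneme recognition result being used to indicate the probability distribution of the corresponding speech frame in a phoneme space, the phoneme space containing each phoneme and an empty output; suppressing and adjusting the probability of the empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result; and inputting the adjusted phoneme recognition results corresponding to the speech frames into a decoding map to obtain a recognized text sequence corresponding to the speech signal, the decoding map including a mapping relationship between characters and phonemes.
- The method according to claim 1, characterized in that the suppressing and adjusting of the probability of the empty output in the phoneme recognition result corresponding to each speech frame comprises: reducing the probability of the empty output in the phoneme recognition result corresponding to each speech frame.
- The method according to claim 2, characterized in that the reducing of the probability of the empty output in the phoneme recognition result corresponding to each speech frame comprises: multiplying the probability of the empty output in the phoneme recognition result corresponding to each speech frame by a first weight, the first weight being less than 1 and greater than 0.
- The method according to claim 2, characterized in that the reducing of the probability of the empty output in the phoneme recognition result corresponding to each speech frame comprises: multiplying the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by a second weight, the second weight being greater than 1.
- The method according to claim 1, characterized in that the suppressing and adjusting of the probability of the empty output in the phoneme recognition result corresponding to each speech frame comprises: increasing the probability of each phoneme in the phoneme recognition result corresponding to each speech frame.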
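- The method according to claim 1, characterized in that the inputting of the adjusted phoneme recognition results corresponding to the speech frames into the decoding map to obtain the recognized text sequence corresponding to the speech signal comprises: when the probability of the empty output in a target phoneme recognition result satisfies a specified condition, inputting the target phoneme recognition result into the decoding map to obtain recognized text corresponding to the target phoneme recognition result; wherein the target phoneme recognition result is any one of the phoneme recognition results corresponding to the speech frames.
- The method according to claim 6, characterized in that the specified condition comprises: the probability of the empty output in the target phoneme recognition result being less than a probability threshold.
- The method according to claim 7, characterized in that, before the inputting of the adjusted phoneme recognition results corresponding to the speech frames into the decoding map to obtain the recognized text sequence corresponding to the speech signal, the method further comprises: acquiring a threshold influence parameter, the threshold influence parameter including at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information; and determining the probability threshold based on the threshold influence parameter.
- The method according to claim 1, characterized in that the performing of phoneme recognition on the speech signal to obtain the phoneme recognition result corresponding to each speech frame in the speech signal comprises: performing feature extraction on a target speech frame through a trained acoustic model to obtain a feature vector of the target speech frame, the target speech frame being any one of the speech frames; inputting the target speech frame into an encoder in the acoustic model to obtain an acoustic hidden layer representation vector of the target speech frame; inputting phoneme information of the historical recognized text of the target speech frame into a predictor in the acoustic model to obtain a text hidden layer representation vector of the target speech frame, the historical recognized text of the target speech frame being the text obtained by the decoding map recognizing the phoneme recognition results of the n speech frames with non-empty output preceding the target speech frame, n being an integer greater than or equal to 1; and inputting the acoustic hidden layer representation vector of the target speech frame and the text hidden layer representation vector of the target speech frame into a joint network in the acoustic model to obtain the phoneme recognition result of the target speech frame.
- The method according to claim 9, characterized in that the encoder is a feedforward sequential memory network (FSMN).
- The method according to claim 9, characterized in that the predictor is a one-dimensional convolutional network.
- The method according to any one of claims 1 to 9, characterized in that the decoding map is formed by composing a phoneme dictionary and a language model.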
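- A speech recognition method, characterized in that it is performed by a computer device, the method comprising: acquiring a speech signal, the speech signal including speech frames obtained by segmenting original speech; performing phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame, the phoneme recognition result being used to indicate the probability distribution of the corresponding speech frame in a phoneme space, the phoneme space containing each phoneme and an empty output; and inputting, among the phoneme recognition results corresponding to the speech frames, the phoneme recognition results whose probability of empty output satisfies a specified condition into a decoding map to obtain a recognized text sequence corresponding to the speech signal, the decoding map including a mapping relationship between characters and phonemes.
- A speech recognition apparatus, characterized in that the apparatus comprises: a speech signal processing module, configured to perform phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal, the phoneme recognition result being used to indicate the probability distribution of the corresponding speech frame in a phoneme space, the phoneme space containing each phoneme and an empty output; a probability adjustment module, configured to suppress and adjust the probability of the empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result; and a decoding module, configured to input the adjusted phoneme recognition results corresponding to the speech frames into a decoding map to obtain a recognized text sequence corresponding to the speech signal, the decoding map including a mapping relationship between characters and phonemes.
- The apparatus according to claim 14, characterized in that the probability adjustment module is further configured to: reduce the probability of the empty output in the phoneme recognition result corresponding to each speech frame.
- The apparatus according to claim 15, characterized in that the probability adjustment module is further configured to: multiply the probability of the empty output in the phoneme recognition result corresponding to each speech frame by a first weight, the first weight being less than 1 and greater than 0.
- A speech recognition apparatus, characterized in that the apparatus comprises: a speech signal acquisition module, configured to acquire a speech signal, the speech signal including speech frames obtained by segmenting original speech; a phoneme recognition result obtaining module, configured to perform phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame, the phoneme recognition result being used to indicate the probability distribution of the corresponding speech frame in a phoneme space, the phoneme space containing each phoneme and an empty output; and a recognized text sequence acquisition module, configured to input, among the phoneme recognition results corresponding to the speech frames, the phoneme recognition results whose probability of empty output satisfies a specified condition into a decoding map to obtain a recognized text sequence corresponding to the speech signal, the decoding map including a mapping relationship between characters and phonemes.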
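- A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one computer instruction, the at least one computer instruction being loaded and executed by the processor to implement the speech recognition method according to any one of claims 1 to 13.
- A computer-readable storage medium, characterized in that the storage medium stores at least one computer instruction, the at least one computer instruction being loaded and executed by a processor to implement the speech recognition method according to any one of claims 1 to 13.
- A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.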
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023518016A JP7570760B2 (ja) | 2020-12-23 | 2021-11-08 | 音声認識方法、音声認識装置、コンピュータ機器、及びコンピュータプログラム |
| EP21908894.5A EP4191576B1 (en) | 2020-12-23 | 2021-11-08 | SPEECH RECOGNITION METHOD, COMPUTER DEVICE AND STORAGE MEDIA |
| US17/977,496 US12367861B2 (en) | 2020-12-23 | 2022-10-31 | Phoneme recognition-based speech recognition method and apparatus, computer device, and storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011536771.4A CN113539242B (zh) | 2020-12-23 | 2020-12-23 | 语音识别方法、装置、计算机设备及存储介质 |
| CN202011536771.4 | 2020-12-23 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/977,496 Continuation US12367861B2 (en) | 2020-12-23 | 2022-10-31 | Phoneme recognition-based speech recognition method and apparatus, computer device, and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022134894A1 (zh) | 2022-06-30 |
Family
ID=78124211
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/129223 Ceased WO2022134894A1 (zh) | 2020-12-23 | 2021-11-08 | 语音识别方法、装置、计算机设备及存储介质 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US12367861B2 (zh) |
| EP (1) | EP4191576B1 (zh) |
| JP (1) | JP7570760B2 (zh) |
| CN (1) | CN113539242B (zh) |
| WO (1) | WO2022134894A1 (zh) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116229950A (zh) * | 2023-03-01 | 2023-06-06 | 北京奕斯伟计算技术股份有限公司 | 命令词识别模型的模型训练装置、命令词识别装置及方法 |
| CN116364062A (zh) * | 2023-05-30 | 2023-06-30 | 广州小鹏汽车科技有限公司 | 语音识别方法、装置及车辆 |
| CN116580701A (zh) * | 2023-05-19 | 2023-08-11 | 国网物资有限公司 | 告警音频识别方法、装置、电子设备和计算机介质 |
Families Citing this family (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113539242B (zh) * | 2020-12-23 | 2025-05-30 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、计算机设备及存储介质 |
| US12211509B2 (en) * | 2021-10-06 | 2025-01-28 | Google Llc | Fusion of acoustic and text representations in RNN-T |
| CN114220444B (zh) * | 2021-10-27 | 2022-09-06 | 安徽讯飞寰语科技有限公司 | 语音解码方法、装置、电子设备和存储介质 |
| CN113936643B (zh) * | 2021-12-16 | 2022-05-17 | 阿里巴巴达摩院(杭州)科技有限公司 | 语音识别方法、语音识别模型、电子设备和存储介质 |
| CN114724544B (zh) * | 2022-04-13 | 2022-12-06 | 北京百度网讯科技有限公司 | 语音芯片、语音识别方法、装置、设备及智能汽车 |
| CN114822535B (zh) * | 2022-04-19 | 2025-08-22 | 时擎智能科技(上海)有限公司 | 语音关键词识别方法、装置、介质及设备 |
| CN115132196B (zh) * | 2022-05-18 | 2024-09-10 | 腾讯科技(深圳)有限公司 | 语音指令识别的方法、装置、电子设备及存储介质 |
| CN115499541A (zh) * | 2022-09-15 | 2022-12-20 | 华能国际电力股份有限公司 | 一种语音检测模型构建和语音识别方法、装置及电子设备 |
| CN116052643A (zh) * | 2022-12-30 | 2023-05-02 | 西安讯飞超脑信息科技有限公司 | 一种语音识别方法、装置、存储介质及设备 |
| CN116434738A (zh) * | 2023-02-16 | 2023-07-14 | 北京有竹居网络技术有限公司 | 噪音数据提取方法、装置、介质及电子设备 |
| CN116453504B (zh) * | 2023-02-21 | 2026-03-17 | 杭州网之易创新科技有限公司 | 语音音素识别方法、介质、装置和计算设备 |
| CN116403587B (zh) * | 2023-03-28 | 2025-12-16 | 中国科学院深圳先进技术研究院 | 一种基于音素信息的声纹识别方法及电子设备 |
| CN116110574B (zh) * | 2023-04-14 | 2023-06-20 | 武汉大学人民医院(湖北省人民医院) | 一种基于神经网络实现的眼科智能问诊方法和装置 |
| CN116844529A (zh) * | 2023-05-25 | 2023-10-03 | 深圳华为云计算技术有限公司 | 语音识别方法、装置及计算机存储介质 |
| CN116665652A (zh) * | 2023-06-07 | 2023-08-29 | 平安科技(深圳)有限公司 | 语音识别方法、语音识别系统、计算机设备和存储介质 |
| CN119548810B (zh) * | 2023-08-22 | 2026-01-13 | 荣耀终端股份有限公司 | 预测帧生成方法、终端设备及存储介质 |
| CN116798052B (zh) * | 2023-08-28 | 2023-12-08 | 腾讯科技(深圳)有限公司 | 文本识别模型的训练方法和装置、存储介质及电子设备 |
| CN117524198B (zh) * | 2023-12-29 | 2024-04-16 | 广州小鹏汽车科技有限公司 | 语音识别方法、装置及车辆 |
| US20250292764A1 (en) * | 2024-03-15 | 2025-09-18 | Microsoft Technology Licensing, Llc | Space efficient training for sequence transduction machine learning |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105529027A (zh) * | 2015-12-14 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | 语音识别方法和装置 |
| CN105895081A (zh) * | 2016-04-11 | 2016-08-24 | 苏州思必驰信息科技有限公司 | 一种语音识别解码的方法及装置 |
| CN108269568A (zh) * | 2017-01-03 | 2018-07-10 | 中国科学院声学研究所 | 一种基于ctc的声学模型训练方法 |
| CN108389575A (zh) * | 2018-01-11 | 2018-08-10 | 苏州思必驰信息科技有限公司 | 音频数据识别方法及系统 |
| CN109559735A (zh) * | 2018-10-11 | 2019-04-02 | 平安科技(深圳)有限公司 | 一种基于神经网络的语音识别方法、终端设备及介质 |
| CN110164421A (zh) * | 2018-12-14 | 2019-08-23 | 腾讯科技(深圳)有限公司 | 语音解码方法、装置及存储介质 |
| WO2020195068A1 (en) * | 2019-03-25 | 2020-10-01 | Mitsubishi Electric Corporation | System and method for end-to-end speech recognition with triggered attention |
| CN113539242A (zh) * | 2020-12-23 | 2021-10-22 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、计算机设备及存储介质 |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7493259B2 (en) * | 2002-01-04 | 2009-02-17 | Siebel Systems, Inc. | Method for accessing data via voice |
| US9728185B2 (en) * | 2014-05-22 | 2017-08-08 | Google Inc. | Recognizing speech using neural networks |
| US10127904B2 (en) * | 2015-05-26 | 2018-11-13 | Google Llc | Learning pronunciations from acoustic sequences |
| US9818409B2 (en) * | 2015-06-19 | 2017-11-14 | Google Inc. | Context-dependent modeling of phonemes |
| US10229672B1 (en) * | 2015-12-31 | 2019-03-12 | Google Llc | Training acoustic models using connectionist temporal classification |
| JP6727607B2 (ja) | 2016-06-09 | 2020-07-22 | 国立研究開発法人情報通信研究機構 | 音声認識装置及びコンピュータプログラム |
| US11195093B2 (en) * | 2017-05-18 | 2021-12-07 | Samsung Electronics Co., Ltd | Apparatus and method for student-teacher transfer learning network using knowledge bridge |
| JP7092953B2 (ja) | 2019-05-03 | 2022-06-28 | グーグル エルエルシー | エンドツーエンドモデルによる多言語音声認識のための音素に基づく文脈解析 |
| US11862146B2 (en) * | 2019-07-05 | 2024-01-02 | Asapp, Inc. | Multistream acoustic models with dilations |
-
2020
- 2020-12-23 CN CN202011536771.4A patent/CN113539242B/zh active Active
-
2021
- 2021-11-08 WO PCT/CN2021/129223 patent/WO2022134894A1/zh not_active Ceased
- 2021-11-08 JP JP2023518016A patent/JP7570760B2/ja active Active
- 2021-11-08 EP EP21908894.5A patent/EP4191576B1/en active Active
-
2022
- 2022-10-31 US US17/977,496 patent/US12367861B2/en active Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4191576A4 |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116229950A (zh) * | 2023-03-01 | 2023-06-06 | 北京奕斯伟计算技术股份有限公司 | 命令词识别模型的模型训练装置、命令词识别装置及方法 |
| CN116580701A (zh) * | 2023-05-19 | 2023-08-11 | 国网物资有限公司 | 告警音频识别方法、装置、电子设备和计算机介质 |
| CN116580701B (zh) * | 2023-05-19 | 2023-11-24 | 国网物资有限公司 | 告警音频识别方法、装置、电子设备和计算机介质 |
| CN116364062A (zh) * | 2023-05-30 | 2023-06-30 | 广州小鹏汽车科技有限公司 | 语音识别方法、装置及车辆 |
| CN116364062B (zh) * | 2023-05-30 | 2023-08-25 | 广州小鹏汽车科技有限公司 | 语音识别方法、装置及车辆 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4191576A1 (en) | 2023-06-07 |
| EP4191576A4 (en) | 2024-05-29 |
| US12367861B2 (en) | 2025-07-22 |
| CN113539242A (zh) | 2021-10-22 |
| US20230074869A1 (en) | 2023-03-09 |
| EP4191576B1 (en) | 2025-12-31 |
| JP7570760B2 (ja) | 2024-10-22 |
| JP2023542685A (ja) | 2023-10-11 |
| CN113539242B (zh) | 2025-05-30 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN113539242B (zh) | 语音识别方法、装置、计算机设备及存储介质 | |
| US11848008B2 (en) | Artificial intelligence-based wakeup word detection method and apparatus, device, and medium | |
| US11270694B2 (en) | Artificial intelligence apparatus and method for recognizing speech by correcting misrecognized word | |
| CN112528637B (zh) | 文本处理模型训练方法、装置、计算机设备和存储介质 | |
| CN112259089B (zh) | 语音识别方法及装置 | |
| CN114596844A (zh) | 声学模型的训练方法、语音识别方法及相关设备 | |
| EP4409568B1 (en) | Contrastive siamese network for semi-supervised speech recognition | |
| CN110473531A (zh) | 语音识别方法、装置、电子设备、系统及存储介质 | |
| CN113555006B (zh) | 一种语音信息识别方法、装置、电子设备及存储介质 | |
| CN111653274B (zh) | 唤醒词识别的方法、装置及存储介质 | |
| CN111161724B (zh) | 中文视听结合语音识别方法、系统、设备及介质 | |
| CN113393841A (zh) | 语音识别模型的训练方法、装置、设备及存储介质 | |
| KR20230156425A (ko) | 자체 정렬을 통한 스트리밍 asr 모델 지연 감소 | |
| KR20230156795A (ko) | 단어 분할 규칙화 | |
| EP4528580A1 (en) | Training method for translation model, translation method, and device | |
| JP2017076127A (ja) | 音響モデル入力データの正規化装置及び方法と、音声認識装置 | |
| CN111862956A (zh) | 一种数据处理方法、装置、设备及存储介质 | |
| CN119889282A (zh) | 语音合成模型的训练方法、装置、设备及存储介质 | |
| HK40054494B (zh) | 语音识别方法、装置、计算机设备及存储介质 | |
| HK40054494A (zh) | 语音识别方法、装置、计算机设备及存储介质 | |
| KR102944446B1 (ko) | 프롬프트에 기반하여 감정을 표현하는 음성을 합성하는 방법, 장치, 및 프로그램 | |
| HK40052275B (zh) | 语音识别模型的训练方法、装置、设备及存储介质 | |
| HK40055187A (zh) | 一种语音信息识别方法、装置、电子设备及存储介质 | |
| HK40055187B (zh) | 一种语音信息识别方法、装置、电子设备及存储介质 | |
| HK40092618A (zh) | 语音音素识别方法、装置、设备及存储介质 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21908894; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202347015921; Country of ref document: IN |
| | ENP | Entry into the national phase | Ref document number: 2021908894; Country of ref document: EP; Effective date: 20230302 |
| | ENP | Entry into the national phase | Ref document number: 2023518016; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | Wipo information: grant in national office | Ref document number: 202347015921; Country of ref document: IN |
| | WWG | Wipo information: grant in national office | Ref document number: 2021908894; Country of ref document: EP |


