WO2022134894A1 - Speech recognition method and apparatus, computer device, and storage medium - Google Patents

Speech recognition method and apparatus, computer device, and storage medium

Info

Publication number
WO2022134894A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
speech
probability
recognition result
speech frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/129223
Other languages
English (en)
French (fr)
Inventor
孙思宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to JP2023518016A (patent JP7570760B2)
Priority to EP21908894.5A (patent EP4191576B1)
Publication of WO2022134894A1
Priority to US17/977,496 (patent US12367861B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • The present application relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, computer device, and storage medium.
  • Speech recognition is a technology that recognizes speech as text, which has a wide range of applications in various artificial intelligence (Artificial Intelligence, AI) scenarios.
  • the speech recognition framework usually includes an acoustic model part and a decoding part, wherein the acoustic model part is used to recognize the phonemes of each speech frame in the input speech signal, and the decoding part outputs the text sequence of the speech signal through the recognized phonemes of each speech frame.
  • RNN-T Recurrent Neural Network Transducer
  • the RNN-T model introduces the concept of empty output in the phoneme recognition process, that is, predicting that a certain speech frame does not contain valid phonemes.
  • in some application scenarios, however, the introduction of empty output increases the error rate of the subsequent decoding process, in particular the number of deletion errors, which affects the accuracy of speech recognition.
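  • a speech recognition method executed by computer equipment, the method comprising:
  • performing phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains each phoneme and empty output;
  • suppressing and adjusting the probability of empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of empty output in the phoneme recognition result to the probability of each phoneme; and
  • inputting the adjusted phoneme recognition results corresponding to the speech frames into a decoding map to obtain a recognized text sequence corresponding to the speech signal, where the decoding map includes the mapping relationship between characters and phonemes.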
  • a voice recognition device comprising:
  • a speech signal processing module configured to perform phoneme recognition on the speech signal, and obtain a phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space;
  • the phoneme space contains each phoneme and empty output;
  • a probability adjustment module configured to suppress and adjust the probability of empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of empty output in the phoneme recognition result to the probability of each phoneme;
  • a decoding module configured to input the adjusted phoneme recognition results corresponding to the speech frames into a decoding map to obtain a recognized text sequence corresponding to the speech signal, where the decoding map includes the mapping relationship between characters and phonemes.
  • a speech recognition method comprising:
  • acquiring a voice signal, the voice signal including each voice frame obtained by segmenting the original voice;
  • performing phoneme recognition on the voice signal to obtain a phoneme recognition result corresponding to each speech frame; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space includes each phoneme and empty output; and
  • among the phoneme recognition results corresponding to the respective speech frames, inputting the phoneme recognition results whose probability of empty output satisfies a specified condition into the decoding map to obtain the recognized text sequence corresponding to the speech signal, where the decoding map includes the mapping relationship between characters and phonemes.
  • a voice recognition device comprising:
  • a voice signal acquisition module configured to acquire a voice signal, the voice signal including each voice frame obtained by dividing the original voice;
  • a phoneme recognition result obtaining module configured to perform phoneme recognition on the speech signal and obtain the phoneme recognition result corresponding to each speech frame; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains each phoneme and empty output; and
  • a recognized text sequence acquisition module configured to input, among the phoneme recognition results corresponding to the respective speech frames, the phoneme recognition results whose probability of empty output satisfies the specified condition into the decoding map, and obtain the recognized text sequence corresponding to the speech signal, where the decoding map includes the mapping relationship between characters and phonemes.
  • a computer device comprising a processor and a memory, wherein at least one computer instruction is stored in the memory, and the at least one computer instruction is loaded and executed by the processor to implement the above-mentioned speech recognition method.
  • a computer-readable storage medium where at least one computer instruction is stored, the at least one computer instruction is loaded and executed by a processor to implement the above speech recognition method.
  • a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the speech recognition method described above.
  • FIG. 1 is a system configuration diagram of a speech recognition system involved in various embodiments of the present application.
  • FIG. 2 is a schematic flowchart of a speech recognition method according to an exemplary embodiment;
  • FIG. 3 is a schematic flowchart of a speech recognition method according to an exemplary embodiment;
  • FIG. 4 is a schematic diagram of an alignment process involved in the embodiment shown in FIG. 3;
  • FIG. 5 is a schematic structural diagram of an acoustic model involved in the embodiment shown in FIG. 3;
  • FIG. 6 is a network structure diagram of the predictor involved in the embodiment shown in FIG. 3;
  • FIG. 7 is a flowchart of model training and application involved in the embodiment shown in FIG. 3;
  • FIG. 8 is a framework diagram of a speech recognition system according to an exemplary embodiment;
  • FIG. 9 is a block diagram showing the structure of an apparatus for labeling objects in a video according to an exemplary embodiment;
  • FIG. 10 is a structural block diagram of a computer device according to an exemplary embodiment.
  • AI is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the key technologies of speech technology include Automatic Speech Recognition (ASR), Text To Speech (TTS) synthesis, and voiceprint recognition. Enabling computers to hear, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the most promising human-computer interaction methods.
  • Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other techniques.
  • Referring to FIG. 1, a system structure diagram of a speech recognition system involved in various embodiments of the present application is shown. As shown in FIG. 1, the system includes a sound collection component 120 and a speech recognition device 140.
  • the sound collection component 120 and the speech recognition device 140 are connected in a wired or wireless manner.
  • the sound collection component 120 may be implemented as a microphone, a microphone array, or a pickup, or the like.
  • the sound collecting component 120 is used for collecting voice data when the user speaks.
  • the speech recognition device 140 is used for recognizing the speech data collected by the sound collection component 120 to obtain the recognized text sequence.
  • the speech recognition device 140 may also perform natural semantic processing on the recognized text sequence to respond to the user's speech.
  • the sound collection component 120 and the speech recognition device 140 may be implemented as two independent hardware devices.
  • for example, the sound collection component 120 is a microphone arranged on the steering wheel of a vehicle, and the speech recognition device 140 may be an in-vehicle smart device; or, the sound collection component 120 is a microphone arranged on a remote control, and the speech recognition device 140 may be a smart home device controlled by the remote control (such as a smart TV, set-top box, or air conditioner).
  • the sound collection component 120 and the speech recognition device 140 may be implemented as the same hardware device.
  • the speech recognition device 140 may be a smart device such as a smart phone, a tablet computer, a smart watch, and smart glasses, and the sound collection component 120 may be a microphone built in the speech recognition device 140 .
  • the speech recognition system described above may also include a server 160 .
  • the server 160 may be used to deploy and update the speech recognition model in the speech recognition device 140 .
  • the server 160 may also provide the cloud speech recognition service to the speech recognition device 140, that is, receive the speech data sent by the speech recognition device 140, perform speech recognition on the speech data, and return the recognition result to the speech recognition device 140.
  • the server 160 may also cooperate with the speech recognition device 140 to complete operations such as recognizing the speech data and responding to the speech data.
  • the server 160 is a server, or consists of several servers, or a virtualization platform, or a cloud computing service center.
  • the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the server 160 and the speech recognition device 140 are connected through a communication network.
  • the communication network is a wired network or a wireless network.
  • the system may further include a management device (not shown in FIG. 1 ), and the management device and the server 160 are connected through a communication network.
  • the communication network is a wired network or a wireless network.
  • the above-mentioned wireless network or wired network uses standard communication technologies and/or protocols.
  • the network is usually the Internet, but can be any network, including but not limited to any combination of a Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), mobile, wired or wireless network, private network, or virtual private network.
  • data exchanged over a network is represented using technologies and/or formats including Hyper Text Mark-up Language (HTML), Extensible Markup Language (XML), and the like.
  • all or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec).
  • custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
  • the computer device may be the speech recognition device 140 or the server 160 in the system shown in FIG. 1 , or the computer device may include both the speech recognition device 140 and the server 160 in the system shown in FIG. 1 .
  • the speech recognition method may include the following steps:
  • Step 21 Perform phoneme recognition on the speech signal to obtain the phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains each phoneme and empty output.
  • the phoneme recognition result may be a result obtained by performing phoneme recognition on the speech signal through an acoustic model.
  • the acoustic model is obtained by training the speech signal samples and the actual phonemes of each speech frame in the speech signal samples.
  • a phoneme is the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the pronunciation actions in a syllable, and one action constitutes one phoneme. Phonemes are divided into vowels and consonants. For example, the Chinese syllable ā has only one phoneme, ài has two phonemes, and dài has three phonemes.
  • a phoneme is the smallest unit or the smallest speech segment that constitutes a syllable, and is the smallest linear unit of speech divided from the perspective of sound quality.
  • Phonemes are physical phenomena that exist concretely.
  • the phonetic symbols of the International Phonetic Alphabet (the alphabet developed by the International Phonetic Association to transcribe the pronunciation of all languages uniformly, also known as the "Universal Phonetic Alphabet") correspond one-to-one with the phonemes of all human languages.
  • the number of empty outputs included in the phoneme space may be greater than or equal to 1; for example, the phoneme space may include exactly one empty output.
  • the acoustic model can identify the phoneme corresponding to the speech frame, and obtain the probability that the phoneme of the speech frame belongs to each preset phoneme and an empty output.
  • for example, the above phoneme space includes 212 phonemes and an empty output (indicating that the corresponding speech frame has no user pronunciation); that is, for an input speech frame, the acoustic model shown in the embodiments of the present application can output the probabilities that the speech frame corresponds to each of the 212 phonemes and to the empty output.
  • Step 22 Suppress and adjust the probability of empty output in the phoneme recognition result corresponding to each speech frame to reduce the ratio of the probability of empty output in the phoneme recognition result to the probability of each phoneme.
  • Step 23 Input the adjusted phoneme recognition result corresponding to each speech frame into a decoding map to obtain a recognized text sequence corresponding to the speech signal.
  • the decoding map is used to determine the phoneme corresponding to the speech frame based on the phoneme recognition result.
  • the decoding map may include a mapping relationship between characters and phonemes, and a character may be a Chinese character or a word.
  • after the phoneme recognition result is input into the decoding map, the decoding map determines, according to the probabilities of each phoneme and of the empty output in the phoneme recognition result, whether the result corresponds to a particular phoneme or to the empty output, and the corresponding text is determined from the identified phoneme. If the phoneme recognition result corresponds to an empty output, the speech frame corresponding to that result is determined to contain no user pronunciation, that is, it has no corresponding text.
  • the recognition error rate may increase; for example, a frame that contains pronunciation may be mistakenly recognized as an empty output (a deletion error), which affects the accuracy of speech recognition.
  • the probability of empty output in the phoneme recognition result is therefore suppressed. When the probability of the empty output is suppressed, the likelihood that the phoneme recognition result is recognized as a phoneme increases, which can effectively reduce the cases in which speech frames with pronunciation are mistakenly recognized as empty output.
  • the probability of empty output is suppressed to reduce the probability of speech frame being recognized as empty output, thereby reducing the possibility of speech frame being mistakenly recognized as empty output, that is, reducing the deletion error of the model, thereby improving the recognition accuracy of the model.
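The effect of Steps 21 to 23 on deletion errors can be illustrated with a minimal Python sketch. It is not the method of the application: the three-symbol phoneme space, the weight `0.5`, the index `BLANK = 0`, and the per-frame arg-max in `greedy_decode` (a toy stand-in for the decoding map) are all assumptions made for illustration.

```python
BLANK = 0  # assumed index of the empty output in each probability vector


def suppress_blank(frame_probs, blank_weight=0.5):
    """Scale the empty-output probability of one frame by blank_weight (0 < w < 1)."""
    adjusted = list(frame_probs)
    adjusted[BLANK] *= blank_weight
    return adjusted


def greedy_decode(frames):
    """Toy stand-in for the decoding map: best symbol per frame, empty outputs dropped."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frames]
    return [symbol for symbol in best if symbol != BLANK]


# Three frames over a toy phoneme space: [empty output, phoneme 1, phoneme 2]
frames = [
    [0.10, 0.80, 0.10],
    [0.50, 0.05, 0.45],  # pronounced frame that narrowly favours the empty output
    [0.90, 0.05, 0.05],  # genuinely silent frame
]

before = greedy_decode(frames)
after = greedy_decode([suppress_blank(p) for p in frames])
print(before, after)
```

In this toy input the second frame is dropped without suppression and recovered with it, while the silent third frame is still treated as an empty output.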
  • the speech recognition method can be executed by a computer device.
  • the computer device may be the speech recognition device 140 or the server 160 in the system shown in FIG. 1 , or the computer device may include both the speech recognition device 140 and the server 160 in the system shown in FIG. 1 .
  • the speech recognition method may include the following steps:
  • Step 301 Acquire a voice signal, where the voice signal includes each voice frame obtained by segmenting the original voice.
  • after the sound collection component collects the original voice while the user is speaking, it sends the collected original voice to a computer device, for example a speech recognition device, and the speech recognition device divides the original voice to obtain several speech frames.
  • the speech recognition device may segment the original speech into overlapping short-term speech segments. For example, for speech with a sampling rate of 16 kHz, each segment is 25 ms long and adjacent frames overlap by 15 ms. This process is also called "framing".
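As a rough illustration of the framing arithmetic, the sketch below splits a sample list into 25 ms frames that overlap by 15 ms, which at 16 kHz means 400-sample frames taken every 160 samples. The function name `frame_signal` and the choice to drop a trailing partial frame are assumptions, not details from the application.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, overlap_ms=15):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    frame_len = sample_rate * frame_ms // 1000            # 400 samples at 16 kHz
    hop = sample_rate * (frame_ms - overlap_ms) // 1000   # 160 samples (10 ms shift)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# One second of dummy audio whose sample values equal their own index
one_second = list(range(16000))
frames = frame_signal(one_second)
print(len(frames), len(frames[0]), frames[1][0])
```

Each frame shares its last 240 samples (15 ms) with the start of the next frame.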
  • Step 302 Perform phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal.
  • the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space;
  • the phoneme space includes each phoneme and an empty output;
  • the acoustic model is obtained by training on speech signal samples and the actual phoneme of each speech frame in the speech signal samples.
  • the acoustic model is an end-to-end machine learning model whose input data includes a speech frame in a speech signal (for example, a feature vector of the speech frame) and whose output data is the predicted probability distribution of the phoneme of that speech frame over the phoneme space, that is, the phoneme recognition result.
  • the above phoneme recognition result can be expressed as a probability vector of the form $(p_0, p_1, p_2, \ldots, p_{212})$, where
  • $p_0$ represents the probability that the speech frame is an empty output
  • $p_1$ represents the probability that the speech frame corresponds to the first phoneme, and so on.
  • the entire phoneme space contains 212 phonemes, plus an empty output.
  • phoneme recognition is performed on the speech signal, and phoneme recognition results corresponding to each speech frame in the speech signal are obtained, including:
  • the feature extraction is performed on the target speech frame through the trained acoustic model to obtain the feature vector of the target speech frame;
  • the target speech frame is any one of the respective speech frames;
  • the acoustic hidden layer representation vector of the target speech frame and the text hidden layer representation vector of the target speech frame are input into the joint network to obtain the phoneme recognition result of the target speech frame.
  • the above acoustic model may be implemented by using a transducer (Transducer) model.
  • the Transducer model is introduced as follows:
  • $\mathcal{X}^*$ represents the set of all input sequences
  • $\mathcal{Y}^*$ represents the set of all output sequences
  • $x_t \in \mathcal{X}$ and $y_u \in \mathcal{Y}$ are real vectors
  • $\mathcal{X}$ and $\mathcal{Y}$ represent the input and output spaces, respectively.
  • when the Transducer model is used for phoneme recognition
  • the input sequence $x$ is a sequence of feature vectors, such as filter bank (FBank) features or Mel Frequency Cepstrum Coefficient (MFCC) features
  • $x_t$ represents the feature vector at time $t$
  • the output sequence $y$ is the phoneme sequence
  • $y_u$ represents the phoneme of the $u$-th step.
  • an extended output space $\bar{\mathcal{Y}} = \mathcal{Y} \cup \{\varnothing\}$ is defined, where $\varnothing$ indicates an empty output symbol, which means that the model has no output.
  • a sequence in $\bar{\mathcal{Y}}^*$ that interleaves empty outputs with $y_1, y_2, y_3$ is equivalent to $(y_1, y_2, y_3) \in \mathcal{Y}^*$.
  • the output sequence will have the same length as the input sequence, so the elements $a \in \bar{\mathcal{Y}}^*$ are called "alignments".
  • the Transducer model defines a conditional distribution $\Pr(a \in \bar{\mathcal{Y}}^* \mid x)$. This conditional distribution is used to calculate the probability of outputting sequence $y$ given input sequence $x$:
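For reference, in the standard Transducer formulation this probability is obtained by summing over every alignment that reduces to $y$ once the empty outputs are removed. The removal map $\mathcal{B}$ and the notation below are assumptions borrowed from that standard formulation; they are not symbols recovered from this text.

```latex
\Pr(y \mid x) \;=\; \sum_{a \,\in\, \mathcal{B}^{-1}(y)} \Pr(a \mid x),
\qquad
\mathcal{B} : \bar{\mathcal{Y}}^{*} \to \mathcal{Y}^{*} \ \text{deletes every } \varnothing \text{ from an alignment.}
```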
  • Referring to FIG. 5, a schematic structural diagram of an acoustic model involved in an embodiment of the present application is shown.
  • As shown in FIG. 5, the acoustic model includes an encoder 51, a predictor 52, and a joint network 53.
  • the encoder 51 can be a recurrent neural network, such as a long short-term memory (Long Short-Term Memory, LSTM) network, which accepts the audio feature input at time t and outputs the acoustic hidden layer representation
  • the predictor 52 can be a recurrent neural network, such as an LSTM, that accepts the non-empty output labels from the model history as input
  • and outputs a text hidden layer representation.
  • the joint network 53 can be a fully connected neural network, such as a linear layer plus an activation unit, used to combine the acoustic hidden layer representation and the text hidden layer representation: after linear transformation and summation, it outputs the hidden unit representation z_i, which is finally converted into a probability distribution through a softmax function.
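The joint network described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the layer sizes, the `tanh` activation, the separate output projection `w_out`, and the random weights are hypothetical; only the structure (linear transforms, summation, softmax over 212 phonemes plus one empty output) follows the description.

```python
import math
import random


def linear(matrix, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_network(h_enc, h_pred, w_enc, w_pred, w_out):
    """Linearly transform both hidden vectors, sum them, activate, then apply softmax."""
    z = [a + b for a, b in zip(linear(w_enc, h_enc), linear(w_pred, h_pred))]
    z = [math.tanh(v) for v in z]  # activation choice is an assumption
    return softmax(linear(w_out, z))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
ENC_DIM, PRED_DIM, JOINT_DIM, NUM_SYMBOLS = 8, 6, 5, 213  # 212 phonemes + empty output
w_enc = random_matrix(JOINT_DIM, ENC_DIM, rng)
w_pred = random_matrix(JOINT_DIM, PRED_DIM, rng)
w_out = random_matrix(NUM_SYMBOLS, JOINT_DIM, rng)

h_enc = [rng.uniform(-1, 1) for _ in range(ENC_DIM)]    # acoustic hidden representation
h_pred = [rng.uniform(-1, 1) for _ in range(PRED_DIM)]  # text hidden representation
probs = joint_network(h_enc, h_pred, w_enc, w_pred, w_out)
print(len(probs), round(sum(probs), 6))
```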
  • the encoder is a Feedforward Sequential Memory Network (FSMN).
  • the predictor is a one-dimensional convolutional network.
  • the solutions shown in the embodiments of the present application can be applied to scenarios with limited computing capabilities, such as a vehicle-mounted offline speech recognition system.
  • in-vehicle equipment places strict limits on model parameter count and computation, and the computing power of its Central Processing Unit (CPU) is limited, so the requirements on model parameters and model structure are relatively high.
  • the scheme shown in this application uses the fully feedforward neural network FSMN as the Encoder of the model and uses a one-dimensional convolutional network, instead of the commonly used long short-term memory (LSTM) network, as the Predictor.
  • the Encoder and Predictor networks generally use a Recurrent Neural Network (RNN) structure, such as LSTM or Gated Recurrent Unit (GRU).
  • this scheme uses an FSMN-based Encoder and a one-dimensional convolution-based Predictor network.
  • on the one hand, model parameters can be compressed; on the other hand, computing resources can be greatly reduced and computing speed improved, which helps ensure the real-time performance of speech recognition.
  • the Encoder structure based on FSMN is adopted.
  • FSMN networks are applied to large vocabulary speech recognition tasks.
  • the FSMN structure used in this scheme can be a structure with projection layers and residual connections.
  • a one-dimensional convolutional network is used in this scheme to generate the current output according to the limited historical prediction output.
  • FIG. 6 shows a network structure diagram of the predictor involved in the embodiment of the present application.
  • the Predictor network uses 4 non-empty historical outputs to predict the frame of the current output. That is, after the four non-empty historical outputs 61 corresponding to the current input are subjected to vector mapping, they are input into the one-dimensional convolutional network 62 to obtain the text hidden layer representation vector.
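A predictor of this kind can be sketched as an embedding lookup followed by a one-dimensional convolution whose kernel spans all four history positions, so that it yields a single output vector. The sizes, the random weights, and `pad_id` (a hypothetical preset phoneme id used when fewer than four non-empty outputs exist) are assumptions for illustration only.

```python
import random

CONTEXT = 4  # number of non-empty historical outputs, as in the text


def predictor(history, embedding, kernels, pad_id=0):
    """Map the last CONTEXT non-empty phoneme ids to a text hidden vector."""
    recent = list(history[-CONTEXT:])
    padded = [pad_id] * (CONTEXT - len(recent)) + recent  # left-pad short histories
    embedded = [embedding[i] for i in padded]             # CONTEXT x EMB_DIM
    # 1-D convolution with kernel width CONTEXT: one output value per channel
    return [
        sum(w * x
            for k_row, e_row in zip(kernel, embedded)
            for w, x in zip(k_row, e_row))
        for kernel in kernels
    ]


rng = random.Random(0)
NUM_PHONEMES, EMB_DIM, HIDDEN_DIM = 213, 3, 5  # illustrative sizes only
embedding = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(NUM_PHONEMES)]
kernels = [[[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(CONTEXT)]
           for _ in range(HIDDEN_DIM)]

hidden = predictor([17, 4, 101, 56, 9], embedding, kernels)
print(len(hidden))
```

Because only a fixed window of history is read, the output does not depend on anything older than the last four non-empty outputs, unlike a recurrent predictor.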
  • the above acoustic model may be obtained by training on preset speech samples and the actual phonemes of each speech frame in the speech samples.
  • during training, a speech frame in the speech sample is input into the FSMN-based Encoder network of the acoustic model, and the actual phonemes of the 4 preceding non-empty speech frames (when there are no historical non-empty speech frames, or too few of them, preset phonemes can be substituted) are input into the Predictor network based on one-dimensional convolution.
  • as the acoustic model processes the input data, the parameters of its three parts are updated to maximize the sum of the probabilities over all possible alignment paths, that is, the result of the above formula (2), thereby training the acoustic model.
  • Step 303 Suppress and adjust the probability of empty output in the phoneme recognition result corresponding to each speech frame to reduce the ratio of the probability of empty output in the phoneme recognition result to the probability of each phoneme.
  • the suppression adjustment of the probability of empty output in the phoneme recognition result corresponding to each speech frame includes:
  • the phoneme recognition result corresponding to each speech frame is adjusted by at least one of the following adjustment methods:
  • reducing the probability of empty output in the phoneme recognition result corresponding to each speech frame includes:
  • the probability of empty output in the phoneme recognition result corresponding to each speech frame is multiplied by a first weight, where the first weight is less than 1 and greater than 0.
  • to suppress the probability of the empty output in the phoneme recognition result, it is possible to reduce only the probability of the empty output in the phoneme recognition result.
  • for example, the probability of the empty output in the phoneme recognition result is multiplied by a number between 0 and 1. In this way, while the probability of each phoneme in the phoneme recognition result remains unchanged, the ratio between the probability of the empty output and the probability of each phoneme is reduced.
  • increasing the probability of each phoneme in the phoneme recognition result corresponding to each speech frame includes:
  • the probability of each phoneme in the phoneme recognition result corresponding to each speech frame is multiplied by a second weight, where the second weight is greater than 1.
  • to suppress the probability of the empty output in the phoneme recognition result, it is also possible to increase only the probability of each phoneme in the phoneme recognition result.
  • for example, the probability of each phoneme in the phoneme recognition result is multiplied by a number greater than 1. In this way, while the probability of the empty output in the phoneme recognition result remains unchanged, the ratio between the probability of the empty output and the probability of each phoneme is reduced.
  • the computer device may also increase the probability of each phoneme in the phoneme recognition result while reducing the probability of null output in the phoneme recognition result. For example, the probability of a null output in the phoneme recognition result is multiplied by a number between 0 and 1, while the probability of each phoneme in the phoneme recognition result is multiplied by a number greater than 1.
  • the first weight or the second weight is preset in the computer device by the developer or the administrator.
  • for example, the first weight or the second weight can be preset in the speech recognition model by the developer.
  • Step 304: among the phoneme recognition results corresponding to the speech frames, those whose probability of empty output satisfies the specified condition are input into the decoding graph, and the recognized text sequence corresponding to the speech signal is obtained.
  • the phoneme recognition results corresponding to the adjusted speech frames are input into a decoding map to obtain a recognized text sequence corresponding to the speech signal, including:
  • the target phoneme recognition result is input into the decoding map, and the recognition text corresponding to the target phoneme recognition result is obtained;
  • the target phoneme recognition result is any one of the phoneme recognition results corresponding to each speech frame.
  • the specified conditions include:
  • the probability of a null output in the target phoneme recognition result is less than the probability threshold.
  • Input: feature sequence $(x_1, x_2, \ldots, x_T)$; empty output weight adjustment coefficient $\beta_{blank}$;
  • line 6 of the above algorithm performs the weight adjustment
  • $\beta_{blank}$ in the algorithm is $1/\alpha$ in formula (3)
  • lines 13 to 17 of the above algorithm are the PSD algorithm proposed in this scheme; that is, only when the probability of the empty output is less than a threshold $\gamma_{blank}$ does the probability distribution output by the network take part in the subsequent decoding-graph search.
  • before the adjusted phoneme recognition results corresponding to the speech frames are input into the decoding graph to obtain the recognized text sequence corresponding to the speech signal, the method further includes:
  • acquiring a threshold influence parameter, where the threshold influence parameter includes at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information;
  • determining the probability threshold based on the threshold influence parameter.
  • the above probability threshold may also be adjusted by a computer device during the speech recognition process. That is to say, the computer device can acquire relevant parameters that may affect the value of the probability threshold, and flexibly set the probability threshold through the relevant parameters.
  • the intensity of the ambient sound may interfere with the voice of the user. Therefore, when the ambient sound is strong, the computer device can set the probability threshold to a higher value, so that more phoneme recognition results are input into the decoding graph for decoding, thereby ensuring recognition accuracy; conversely, when the ambient sound is weak, the computer device can set the probability threshold to a lower value, so that more phoneme recognition results are skipped, thereby ensuring recognition efficiency.
  • the accuracy of decoding the phoneme recognition results based on the decoding graph affects the success rate of speech recognition.
  • when the number of speech recognition failures within the specified time period is large, the computer device can set the probability threshold to a higher value, so that more phoneme recognition results are input into the decoding graph for decoding, thereby ensuring recognition accuracy; conversely, when the number of speech recognition failures within the specified time period is small or zero, the computer device can set the probability threshold to a lower value, so that more phoneme recognition results are skipped, thereby ensuring recognition efficiency.
  • the decoding graph is the composition of a phoneme dictionary and a language model.
  • the decoding graph used in this scheme is composed of two weighted finite-state transducer (Weighted Finite State Transducer, WFST) subgraphs: the phoneme dictionary and the language model.
  • Language model WFST: this WFST is usually converted from an n-gram language model, which is used to calculate the probability of a sentence appearing and is trained on training data with statistical methods.
  • texts in different fields such as texts of news and spoken dialogues, have great differences in commonly used words and collocations between words. Therefore, when performing speech recognition in different fields, the language model WFST can be changed to achieve adaptation.
  • FIG. 7 shows a flow chart of model training and application involved in the embodiments of the present application.
  • libtorch is used to quantize and deploy the model.
  • the Android version of libtorch uses the QNNPACK library for INT8 matrix calculation, which greatly speeds up matrix operations.
  • the model is trained in Python environment 71 using pytorch and is quantized after training, that is, the model parameters are quantized to INT8 and INT8 matrix multiplication is used to speed up the calculation; the quantized model is then exported for forward inference in the C++ environment 72 and tested with test data.
  • first, the Transducer-based end-to-end model does not need frame-level alignment information during training, which greatly simplifies the modeling process; second, the decoding graph is simplified and the search space is reduced.
  • because phoneme modeling is used, the decoding graph only needs to be composed from L (the phoneme dictionary) and G (the language model), and the search space is greatly reduced.
  • phoneme modeling combined with a custom decoding graph allows flexible customization: for a different business scenario, only the language model needs to be customized, without changing the acoustic model.
  • the system model shown in this scheme still has a CPU occupancy rate similar to that of the DNN-HMM system model when its number of model parameters is 4 times that of the DNN-HMM system.
  • the speech recognition rates are compared as follows:
  • Table 1 below shows the character error rate (Character Error Rate, CER) comparison between the existing DNN-HMM system and the Transducer system proposed by this solution on three data sets.
  • the probability of empty output is suppressed to reduce the probability of speech frame being recognized as empty output, thereby reducing the possibility of speech frame being mistakenly recognized as empty output, that is, reducing the deletion error of the model, thereby improving the recognition accuracy of the model.
  • the empty output weight adjustment (step 303) and decoding frame skipping (step 304) can also be applied independently.
  • when decoding frame skipping is applied independently, the solution shown in the present application may be as follows:
  • acquiring a speech signal, the speech signal including the speech frames obtained by segmenting the original speech;
  • performing phoneme recognition on the speech signal to obtain the phoneme recognition result corresponding to each speech frame; the phoneme recognition result indicates the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space includes each phoneme and an empty output;
  • among the phoneme recognition results corresponding to the speech frames, inputting those whose probability of empty output satisfies the specified condition into the decoding graph to obtain the recognized text sequence corresponding to the speech signal.
  • only the phoneme recognition results whose empty-output probability satisfies the condition are decoded, which reduces the number of phoneme recognition results that need to be decoded and skips unnecessary decoding steps, thereby effectively improving speech recognition efficiency.
  • FIG. 8 is a frame diagram of a speech recognition system according to an exemplary embodiment.
  • the audio collection device 81 is connected to a speech recognition device 82, and the speech recognition device 82 includes an acoustic model 82a, a probability adjustment unit 82b, a decoding map input unit 82c, a decoding map 82d, and a feature extraction unit 82e.
  • the decoding map 82d consists of a phoneme dictionary and a language model.
  • the audio collection device 81 collects the original voice of the user
  • the original voice is transmitted to the feature extraction unit 82e in the speech recognition device 82
  • the feature extraction unit performs segmentation and feature extraction to obtain the speech features of each speech frame; the speech features of a speech frame, together with the phonemes of the text that the decoding graph 82d recognized for the 4 non-empty speech frames preceding it, are input into the FSMN and the one-dimensional convolutional network in the acoustic model 82a respectively, and the acoustic model 82a outputs the phoneme recognition result of the speech frame.
  • the phoneme recognition result is input to the probability adjustment unit 82b, which adjusts the probability of the empty output to obtain the adjusted phoneme recognition result. The decoding map input unit 82c then checks the adjusted phoneme recognition result: when the adjusted empty-output probability is less than the threshold, decoding is required, and the decoding map input unit 82c inputs the adjusted phoneme recognition result into the decoding graph 82d, which recognizes the text; conversely, when the adjusted empty-output probability is not less than the threshold, decoding is not required and the adjusted phoneme recognition result is discarded.
  • the decoding graph 82d decodes the adjusted phoneme recognition results of the speech frames and outputs a text sequence.
  • the text sequence can be output to the natural language processing component, and the natural language processing component responds to the voice input by the user.
  • Fig. 9 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment.
  • the speech recognition apparatus may implement all or part of the steps in the method provided by the embodiment shown in FIG. 2 or FIG. 3 .
  • the speech recognition device may include:
  • the speech signal processing module 901 is used to perform phoneme recognition on the speech signal, and obtain the phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains individual phonemes and empty outputs;
  • the probability adjustment module 902 is used to suppress and adjust the probability of the empty output in the corresponding phoneme recognition result of each speech frame, to reduce the ratio of the probability of the empty output in the phoneme recognition result to the probability of each phoneme;
  • the decoding module 903 is configured to input the phoneme recognition results corresponding to the adjusted speech frames into the decoding map to obtain the recognized text sequence corresponding to the speech signal, and the decoding map includes the mapping relationship between characters and phonemes.
  • the probability adjustment module 902 is configured to adjust the phoneme recognition result corresponding to each speech frame by at least one of the following adjustment methods:
  • the probability adjustment module 902 is configured to multiply the probability of the null output in the phoneme recognition result corresponding to each speech frame by a first weight, where the first weight is less than 1 and greater than 0.
  • the probability adjustment module 902 is configured to multiply the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by a second weight, where the second weight is greater than 1.
  • the decoding module 903 is used to:
  • the target phoneme recognition result is any one of the phoneme recognition results corresponding to each speech frame.
  • the specified conditions include:
  • the probability of a null output in the target phoneme recognition result is less than the probability threshold.
  • the apparatus further includes:
  • a parameter acquisition module configured to acquire threshold influence parameters, where the threshold influence parameters include at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information;
  • the threshold value determination module is used for determining the probability threshold value based on the threshold value influence parameter.
  • the speech signal processing module 901 is used to:
  • the feature extraction is performed on the target speech frame through the trained acoustic model to obtain the feature vector of the target speech frame;
  • the target speech frame is any one of the individual speech frames;
  • n is an integer greater than or equal to 1;
  • the acoustic hidden layer representation vector of the target speech frame and the text hidden layer representation vector of the target speech frame are input into the joint network to obtain the phoneme recognition result of the target speech frame.
  • the encoder is a feedforward sequential memory network (FSMN).
  • the predictor is a one-dimensional convolutional network.
  • the decoding graph is composed of a phoneme dictionary and a language model composite.
  • the probability of empty output is suppressed to reduce the probability of speech frame being recognized as empty output, thereby reducing the possibility of speech frame being mistakenly recognized as empty output, that is, reducing the deletion error of the model, thereby improving the recognition accuracy of the model.
  • Fig. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment.
  • the computer device may be implemented as the computer device in each of the above method embodiments.
  • the computer device 1000 includes a central processing unit 1001, a system memory 1004 including a random access memory (Random Access Memory, RAM) 1002 and a read-only memory (Read-Only Memory, ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001.
  • the computer device 1000 also includes a basic input/output system 1006 that facilitates the transfer of information between various components within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
  • the mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005 .
  • the mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000 . That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
  • the computer-readable media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, flash memory, or other solid-state storage technology, CD-ROM, or other optical storage, tape cartridges, magnetic tape, magnetic disk storage, or other magnetic storage devices.
  • the computer device 1000 can be connected to the Internet or other network devices through a network interface unit 1011 connected to the system bus 1005 .
  • the memory also includes at least one computer instruction, which is stored in the memory, and the processor implements all or part of the steps of the method shown in FIG. 2 or FIG. 3 by loading and executing the at least one computer instruction.
  • in an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as a memory including a computer program (instructions) executable by a processor of a computer device to carry out the methods of the present application.
  • the non-transitory computer-readable storage medium may be Read-Only Memory (ROM), Random Access Memory (RAM), Compact Disc Read-Only Memory (CD-ROM), magnetic tapes, floppy disks, optical data storage devices, etc.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the methods shown in the foregoing embodiments.
  • a computer program product comprising a computer program, is characterized in that, when the computer program is executed by a processor, the methods shown in the above embodiments are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

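A speech recognition method and apparatus, a computer device, and a storage medium. The method includes: performing phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal (21); suppressing the probability of the empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result (22); and inputting the adjusted phoneme recognition results corresponding to the speech frames into a decoding graph to obtain a recognized text sequence corresponding to the speech signal (23). The method can improve the recognition accuracy of the model in speech recognition scenarios in the field of artificial intelligence.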

Description

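Speech recognition method and apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 202011536771.4, entitled "Speech recognition method and apparatus, computer device, and storage medium" and filed with the China Patent Office on December 23, 2020, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of speech recognition technologies, and in particular to a speech recognition method and apparatus, a computer device, and a storage medium.
Background
Speech recognition is a technology that recognizes speech as text and is widely used in various artificial intelligence (AI) scenarios.
A speech recognition framework usually includes an acoustic model part and a decoding part. The acoustic model part recognizes the phoneme of each speech frame in the input speech signal, and the decoding part outputs the text sequence of the speech signal from the recognized phonemes. In the related art, implementing the acoustic model with a Recurrent Neural Network Transducer (RNN-T) is one of the focuses of industry research.
However, the RNN-T model introduces the concept of an empty output into phoneme recognition, that is, a prediction that a speech frame contains no valid phoneme. In some application scenarios, the empty output raises the error rate of the subsequent decoding process, in particular the number of deletion errors, which affects the accuracy of speech recognition.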
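Summary
According to various embodiments provided in this application, a speech recognition method and apparatus, a computer device, and a storage medium are provided.
A speech recognition method, performed by a computer device, the method including:
performing phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result indicating the probability distribution of the corresponding speech frame in a phoneme space; the phoneme space including each phoneme and an empty output;
suppressing the probability of the empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result; and
inputting the adjusted phoneme recognition results corresponding to the speech frames into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph including a mapping relationship between characters and phonemes.
A speech recognition apparatus, the apparatus including:
a speech signal processing module, configured to perform phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result indicating the probability distribution of the corresponding speech frame in a phoneme space; the phoneme space including each phoneme and an empty output;
a probability adjustment module, configured to suppress the probability of the empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result; and
a decoding module, configured to input the adjusted phoneme recognition results corresponding to the speech frames into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph including a mapping relationship between characters and phonemes.
A speech recognition method, the method including:
acquiring a speech signal, the speech signal including the speech frames obtained by segmenting an original speech;
performing phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame; the phoneme recognition result indicating the probability distribution of the corresponding speech frame in a phoneme space; the phoneme space including each phoneme and an empty output; and
among the phoneme recognition results corresponding to the speech frames, inputting those whose probability of empty output satisfies a specified condition into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph including a mapping relationship between characters and phonemes.
A speech recognition apparatus, the apparatus including:
a speech signal acquisition module, configured to acquire a speech signal, the speech signal including the speech frames obtained by segmenting an original speech;
a phoneme recognition result obtaining module, configured to perform phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame; the phoneme recognition result indicating the probability distribution of the corresponding speech frame in a phoneme space; the phoneme space including each phoneme and an empty output; and
a recognized text sequence obtaining module, configured to input, among the phoneme recognition results corresponding to the speech frames, those whose probability of empty output satisfies a specified condition into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph including a mapping relationship between characters and phonemes.
A computer device, including a processor and a memory, the memory storing at least one computer instruction, the at least one computer instruction being loaded and executed by the processor to implement the above speech recognition method.
A computer-readable storage medium, storing at least one computer instruction, the at least one computer instruction being loaded and executed by a processor to implement the above speech recognition method.
A computer program product or computer program, including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the above speech recognition method.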
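Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a system composition diagram of a speech recognition system involved in the embodiments of this application;
FIG. 2 is a schematic flowchart of a speech recognition method according to an exemplary embodiment;
FIG. 3 is a schematic flowchart of a speech recognition method according to an exemplary embodiment;
FIG. 4 is a schematic diagram of the alignment process involved in the embodiment shown in FIG. 3;
FIG. 5 is a schematic structural diagram of an acoustic model involved in the embodiment shown in FIG. 3;
FIG. 6 is a network structure diagram of the predictor involved in the embodiment shown in FIG. 3;
FIG. 7 is a flowchart of model training and application involved in the embodiment shown in FIG. 3;
FIG. 8 is a framework diagram of a speech recognition system according to an exemplary embodiment;
FIG. 9 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment;
FIG. 10 is a structural block diagram of a computer device according to an exemplary embodiment.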
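Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
Before the embodiments shown in this application are described, several concepts involved in this application are introduced first:
1) Artificial Intelligence (AI)
AI is the theory, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
2) Speech Technology (ST)
The key technologies of speech technology are Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to hear, see, speak, and feel is the future direction of human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
3) Machine Learning (ML)
Machine learning is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to keep improving their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The solutions provided in the embodiments of this application are applied in scenarios involving the speech technology and machine learning technology of artificial intelligence, so as to accurately recognize user speech as the corresponding text.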
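FIG. 1 shows a system composition diagram of a speech recognition system involved in the embodiments of this application. As shown in FIG. 1, the system includes a sound collection component 120 and a speech recognition device 140.
The sound collection component 120 and the speech recognition device 140 are connected in a wired or wireless manner.
The sound collection component 120 may be implemented as a microphone, a microphone array, a pickup, or the like. The sound collection component 120 collects speech data while the user is speaking.
The speech recognition device 140 recognizes the speech data collected by the sound collection component 120 and obtains the recognized text sequence.
Optionally, the speech recognition device 140 may also perform natural language processing on the recognized text sequence to respond to the user's speech.
The sound collection component 120 and the speech recognition device 140 may be implemented as two independent hardware devices. For example, the sound collection component 120 is a microphone on a vehicle steering wheel, and the speech recognition device 140 is an in-vehicle smart device; or the sound collection component 120 is a microphone on a remote control, and the speech recognition device 140 is a smart home device controlled by the remote control (such as a smart TV, a set-top box, or an air conditioner).
Alternatively, the sound collection component 120 and the speech recognition device 140 may be implemented as the same hardware device. For example, the speech recognition device 140 may be a smart device such as a smartphone, a tablet computer, a smart watch, or smart glasses, and the sound collection component 120 may be a microphone built into the speech recognition device 140.
In some embodiments, the above speech recognition system may further include a server 160.
The server 160 may be used to deploy and update the speech recognition model in the speech recognition device 140. Alternatively, the server 160 may provide a cloud speech recognition service to the speech recognition device 140, that is, receive the speech data sent by the speech recognition device 140, perform speech recognition on it, and return the recognition result to the speech recognition device 140. Alternatively, the server 160 may cooperate with the speech recognition device 140 to recognize the speech data and respond to it.
The server 160 is one server, several servers, a virtualization platform, or a cloud computing service center.
The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
The server 160 and the speech recognition device 140 are connected through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the system may further include a management device (not shown in FIG. 1), which is connected to the server 160 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the above wireless or wired network uses standard communication technologies and/or protocols. The network is usually the Internet, but may be any network, including but not limited to any combination of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, technologies and/or formats including Hyper Text Mark-up Language (HTML) and Extensible Markup Language (XML) are used to represent data exchanged over the network. In addition, conventional encryption technologies such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) may be used to encrypt all or some links. In other embodiments, customized and/or dedicated data communication technologies may be used in place of, or in addition to, the above data communication technologies.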
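FIG. 2 shows a schematic flowchart of a speech recognition method, which may be performed by a computer device. The computer device may be the speech recognition device 140 or the server 160 in the system shown in FIG. 1, or it may include both the speech recognition device 140 and the server 160. As shown in FIG. 2, the speech recognition method may include the following steps:
Step 21: perform phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result indicates the probability distribution of the corresponding speech frame in a phoneme space; the phoneme space includes each phoneme and an empty output.
The phoneme recognition result may be obtained by performing phoneme recognition on the speech signal with an acoustic model. The acoustic model is trained on speech signal samples and the actual phoneme of each speech frame in those samples.
A phoneme (phone) is the smallest speech unit divided according to the natural attributes of speech. It is analyzed according to the articulatory actions in a syllable, with one action constituting one phoneme. Phonemes are divided into vowels and consonants. For example, the Chinese syllable 啊 (ā) has only one phoneme, 爱 (ài) has two phonemes, and 代 (dài) has three phonemes.
A phoneme is the smallest unit or smallest speech segment that makes up a syllable, and the smallest linear speech unit divided from the perspective of sound quality. A phoneme is a concrete physical phenomenon. The symbols of the International Phonetic Alphabet (the alphabet formulated by the International Phonetic Association to uniformly transcribe the speech sounds of all languages) correspond one-to-one to the phonemes of all human languages.
The number of empty outputs in the phoneme space may be greater than or equal to 1; for example, the phoneme space includes one empty output.
In the embodiments of this application, for each speech frame in the speech signal, the acoustic model recognizes the phoneme corresponding to the speech frame and obtains the probability that the phoneme of the speech frame is each preset phoneme or the empty output.
For example, in some embodiments, the phoneme space includes 212 phonemes and one empty output (indicating that the corresponding speech frame contains no user pronunciation). That is, for an input speech frame, the acoustic model shown in the embodiments of this application outputs the probabilities that the speech frame corresponds to each of the 212 phonemes and to the empty output.
Step 22: suppress the probability of the empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result.
Step 23: input the adjusted phoneme recognition results corresponding to the speech frames into a decoding graph to obtain the recognized text sequence corresponding to the speech signal.
The decoding graph determines the phoneme corresponding to a speech frame based on the phoneme recognition result. The decoding graph may include a mapping relationship between characters and phonemes, where a character may be a single Chinese character or a word.
In the embodiments of this application, after a phoneme recognition result is input into the decoding graph, the decoding graph uses the probabilities of each phoneme and of the empty output in the phoneme space to determine whether the phoneme recognition result corresponds to a particular phoneme or to the empty output, and determines the corresponding text from the determined phoneme. If the phoneme recognition result corresponds to the empty output, the corresponding speech frame is determined to contain no user pronunciation, that is, it has no corresponding text.
The speech frames in the speech signal may be the frames obtained by segmenting the original speech collected by the sound collection component. The speech frames are arranged in order, for example by their position in the original speech: the earlier a speech frame appears in the original speech, the earlier it appears in the speech signal. The text obtained from the decoding graph for each speech frame is arranged according to the position of that frame in the speech signal to obtain the recognized text sequence.
Because the phoneme recognition result contains the empty output, the recognition error rate may rise. For example, a speech frame that contains pronunciation may be misrecognized as the empty output (this is also called a deletion error), which affects the accuracy of speech recognition. To address this, in the solution shown in the embodiments of this application, after the acoustic model outputs the phoneme recognition result, the probability of the empty output in the phoneme recognition result is suppressed. As that probability is suppressed, the likelihood that the phoneme recognition result is recognized as a particular phoneme rises, which effectively reduces the cases in which a speech frame containing pronunciation is misrecognized as the empty output.
In summary, for a phoneme recognition result that contains the probability distribution of a speech frame over each phoneme and the empty output, the solution shown in the embodiments of this application suppresses the probability of the empty output before the phoneme recognition result is input into the decoding graph. This lowers the chance that a speech frame is recognized as the empty output and therefore the chance that it is misrecognized as the empty output, that is, it reduces the deletion errors of the model and improves its recognition accuracy.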
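FIG. 3 shows a schematic flowchart of a speech recognition method, which may be performed by a computer device. For example, the computer device may be the speech recognition device 140 or the server 160 in the system shown in FIG. 1, or it may include both the speech recognition device 140 and the server 160. As shown in FIG. 3, the speech recognition method may include the following steps:
Step 301: acquire a speech signal, the speech signal including the speech frames obtained by segmenting the original speech.
In the embodiments of this application, after the sound collection component collects the original speech while the user is speaking, it sends the collected original speech to the computer device, for example to the speech recognition device, which segments the original speech into several speech frames.
In some embodiments, the speech recognition device may segment the original speech into overlapping short-time speech segments. For example, for speech with a 16 kHz sampling rate, a frame after segmentation is typically 25 ms long with a 15 ms overlap between adjacent frames. This process is also called "framing".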
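The framing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `frame_boundaries` and the choice to drop a trailing partial frame are assumptions. A 25 ms frame with a 15 ms overlap implies a 10 ms shift between frame starts.

```python
def frame_boundaries(num_samples, sample_rate=16000, frame_ms=25, overlap_ms=15):
    """Return (start, end) sample indices of overlapping frames.

    Assumption: a trailing segment shorter than one frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000            # 400 samples at 16 kHz
    hop = sample_rate * (frame_ms - overlap_ms) // 1000   # 160 samples (10 ms shift)
    frames = []
    start = 0
    while start + frame_len <= num_samples:
        frames.append((start, start + frame_len))
        start += hop
    return frames


# One second of 16 kHz audio.
frames = frame_boundaries(16000)
```

Each tuple can then be used to slice the raw sample buffer before feature extraction (for example FBank or MFCC).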
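Step 302: perform phoneme recognition on the speech signal to obtain the phoneme recognition result corresponding to each speech frame in the speech signal.
The phoneme recognition result indicates the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space includes each phoneme and one empty output; the acoustic model is trained on speech signal samples and the actual phoneme of each speech frame in those samples.
In the embodiments of this application, the acoustic model is an end-to-end machine learning model. Its input data includes the speech frames in the speech signal (for example, the feature vector of a speech frame), and its output data is the predicted probability distribution of the phoneme of that speech frame over the phoneme space, that is, the phoneme recognition result.
For example, the phoneme recognition result can be expressed as the following probability vector:
$(p_0, p_1, p_2, \ldots, p_{212})$
In this probability vector, $p_0$ is the probability that the speech frame is the empty output and $p_1$ is the probability that the speech frame corresponds to the first phoneme; the whole phoneme space includes 212 phonemes plus one empty output.
In some embodiments, performing phoneme recognition on the speech signal to obtain the phoneme recognition result corresponding to each speech frame in the speech signal includes:
performing feature extraction on a target speech frame through the trained acoustic model to obtain the feature vector of the target speech frame, the target speech frame being any one of the speech frames;
inputting the target speech frame into the encoder in the acoustic model to obtain the acoustic hidden-layer representation vector of the target speech frame;
inputting the phoneme information of the historical recognized text of the target speech frame into the predictor in the acoustic model to obtain the text hidden-layer representation vector of the target speech frame; the historical recognized text of the target speech frame is the text obtained by the decoding graph from the phoneme recognition results of the n speech frames with non-empty output preceding the target speech frame; n is an integer greater than or equal to 1;
inputting the acoustic hidden-layer representation vector and the text hidden-layer representation vector of the target speech frame into the joint network to obtain the phoneme recognition result of the target speech frame.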
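In the embodiments of this application, the acoustic model may be implemented by a Transducer model, which is introduced as follows.
Given an input sequence
$$x = (x_1, x_2, \ldots, x_T) \in \mathcal{X}^*$$
and an output sequence
$$y = (y_1, y_2, \ldots, y_U) \in \mathcal{Y}^*$$
where $\mathcal{X}^*$ denotes the set of all input sequences, $\mathcal{Y}^*$ denotes the set of all output sequences, $x_t \in \mathcal{X}$ and $y_u \in \mathcal{Y}$ are real-valued vectors, and $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces respectively. For example, in this scheme the Transducer model is used for phoneme recognition: the input sequence $x$ is a sequence of feature vectors, such as filter bank (Filter Bank, FBank) features or Mel frequency cepstrum coefficient (Mel Frequency Cepstrum Coefficient, MFCC) features, and $x_t$ is the feature vector at time $t$; the output sequence $y$ is a phoneme sequence, and $y_u$ is the phoneme at step $u$.
Define an extended output space $\bar{\mathcal{Y}} = \mathcal{Y} \cup \{\varnothing\}$, where $\varnothing$ is the empty output symbol, meaning that the model outputs nothing. With the empty output symbol introduced, the sequence $(y_1, \varnothing, \varnothing, y_2, \varnothing, y_3) \in \bar{\mathcal{Y}}^*$ is equivalent to $(y_1, y_2, y_3) \in \mathcal{Y}^*$. In this scheme, because of the empty output, the output sequence will have the same length as the input sequence; therefore the elements $a \in \bar{\mathcal{Y}}^*$ are also called "alignments". Given any input sequence, the Transducer model defines a conditional distribution $\Pr(a \in \bar{\mathcal{Y}}^* \mid x)$, which is used to compute the probability of the output sequence $y$ given the input sequence $x$:
$$\Pr(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} \Pr(a \mid x) \tag{1}$$
where $\mathcal{B}: \bar{\mathcal{Y}}^* \to \mathcal{Y}^*$ removes the empty outputs from an alignment, and $\mathcal{B}^{-1}$ adds empty outputs to an output sequence to generate alignments. Formula (1) shows that computing the probability of the output sequence $y$ requires summing the conditional probabilities of all possible alignments $a$ corresponding to $y$. FIG. 4 shows a schematic diagram of the alignment process involved in the embodiments of this application and gives an example that illustrates formula (1).
In FIG. 4, $U=3$ and $T=5$, and every possible path from the lower-left corner to the upper-right corner is an alignment. The bold arrows mark one possible path. When the model advances one step vertically, it outputs a non-empty symbol (a phoneme); when it advances one step horizontally, it outputs the empty symbol (the empty output above), meaning that no output is produced. The model allows multiple outputs at the same time step.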
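As a small illustration of the mapping that removes empty outputs and of its inverse, the sketch below enumerates alignments for U=3 and T=5. The names `EMPTY`, `collapse`, and `alignments` are hypothetical, and the lattice convention is an assumption borrowed from the common transducer formulation (each alignment holds T empty outputs and U labels and ends with an empty output); the exact grid of FIG. 4 is not recoverable from the text.

```python
from itertools import combinations

EMPTY = None  # stands in for the empty-output symbol


def collapse(alignment):
    """Remove every empty output from an alignment."""
    return tuple(symbol for symbol in alignment if symbol is not EMPTY)


def alignments(labels, num_frames):
    """Enumerate the alignments of `labels` under the assumed convention.

    Assumption: an alignment holds len(labels) labels and num_frames empty
    outputs, and its last symbol is an empty output.
    """
    free_slots = len(labels) + num_frames - 1  # slots before the final empty output
    for label_slots in combinations(range(free_slots), len(labels)):
        sequence = [EMPTY] * (free_slots + 1)
        for slot, label in zip(label_slots, labels):
            sequence[slot] = label
        yield tuple(sequence)


y = ("y1", "y2", "y3")
paths = list(alignments(y, num_frames=5))
```

Every element of `paths` collapses back to `y`, so the probability of `y` is the sum of the probabilities of these paths. The number of paths grows combinatorially with T and U, which is why direct enumeration is only practical for toy sizes.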
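To model
Figure PCTCN2021129223-appb-000008
three sub-networks are generally used for joint modeling. FIG. 5 shows a schematic structural diagram of an acoustic model involved in the embodiments of this application. As shown in FIG. 5, the acoustic model includes an encoder 51, a predictor 52, and a joint network 53.
The encoder 51 (Encoder) may be a recurrent neural network, such as a Long Short-Term Memory (LSTM) network. It receives the audio feature input at time t and outputs the acoustic hidden-layer representation
Figure PCTCN2021129223-appb-000009
The predictor 52 (Predictor) may be a recurrent neural network, such as an LSTM. It receives the model's historical non-empty output labels
Figure PCTCN2021129223-appb-000010
and outputs the text hidden-layer representation
Figure PCTCN2021129223-appb-000011
The joint network 53 (Joint Network) may be a fully connected neural network, such as a linear layer plus an activation unit. It applies a linear transformation to each of
Figure PCTCN2021129223-appb-000012
Figure PCTCN2021129223-appb-000013
sums the results, and outputs the hidden unit representation $z_i$; finally, a softmax function converts it into a probability distribution.
In FIG. 5 above,
Figure PCTCN2021129223-appb-000014
and the alignment
Figure PCTCN2021129223-appb-000015
Finally, formula (1) is computed as:
Figure PCTCN2021129223-appb-000016
Evaluating formula (2) directly requires traversing all possible alignment paths, which is computationally expensive. During model training, the forward-backward algorithm can be used to compute the probability in formula (2).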
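The forward pass of such a computation can be sketched as below. This is an illustration of the standard transducer forward recursion, not a reconstruction of formula (2), which survives only as an image. The table layout (`blank[t][u]` and `label[t][u]` as per-node emission probabilities) and the convention that the final symbol is an empty output are assumptions.

```python
def forward_probability(blank, label):
    """Sum the probabilities of all alignment paths with dynamic programming.

    blank[t][u]: probability of the empty output at frame t after u labels.
    label[t][u]: probability of the (u+1)-th target label at frame t after
                 u labels (label[t][U] is never read).
    Both tables have T rows and U + 1 columns.
    """
    T = len(blank)
    U = len(blank[0]) - 1
    alpha = [[0.0] * (U + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Reach (t, u) by an empty output from the previous frame ...
            from_blank = alpha[t - 1][u] * blank[t - 1][u] if t > 0 else 0.0
            # ... or by emitting a label within the same frame.
            from_label = alpha[t][u - 1] * label[t][u - 1] if u > 0 else 0.0
            alpha[t][u] = from_blank + from_label
    # Assumed convention: every path closes with one final empty output.
    return alpha[T - 1][U] * blank[T - 1][U]


# Toy tables for T = 2 frames and U = 1 label, all probabilities 0.5.
toy_blank = [[0.5, 0.5], [0.5, 0.5]]
toy_label = [[0.5, 0.5], [0.5, 0.5]]
toy_probability = forward_probability(toy_blank, toy_label)
```

The recursion visits each of the T x (U + 1) lattice nodes once, instead of enumerating every path. A practical implementation would work with log probabilities to avoid underflow on long utterances.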
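In some embodiments, the encoder is a feedforward sequential memory network (Feedforward Sequential Memory Networks, FSMN).
In some embodiments, the predictor is a one-dimensional convolutional network.
The solution shown in the embodiments of this application can be applied in scenarios with limited computing power, such as in-vehicle offline speech recognition systems. In-vehicle devices have a Central Processing Unit (CPU) with limited computing power, which places strict demands on the number of model parameters, the amount of computation, and the model structure. To reduce the amount of computation and suit such scenarios, the solution shown in this application uses the fully feedforward neural network FSMN as the Encoder of the model and uses a one-dimensional convolutional network as the Predictor in place of the commonly used long short-term memory network (LSTM).
In the above Transducer model, the Encoder and Predictor networks generally use a Recurrent Neural Network (RNN) structure, such as LSTM or Gated Recurrent Unit (GRU), to capture the model's history. On embedded devices with limited computing resources, however, recurrent neural networks require a large amount of computation and occupy a large share of CPU resources. On the other hand, in-vehicle offline speech recognition mainly handles query and control commands, whose sentences are relatively short and do not need a long history. For this reason, this scheme uses an FSMN-based Encoder and a one-dimensional-convolution-based Predictor network. On the one hand, this compresses the model parameters; on the other hand, it greatly saves computing resources and improves computing speed, which ensures real-time speech recognition.
This scheme adopts an FSMN-based Encoder structure. FSMN networks are applied to large-vocabulary speech recognition tasks. The FSMN structure used in this scheme may be a structure with projection layers and residual connections.
For the Predictor network, this scheme uses a one-dimensional convolutional network that generates the current output from a limited number of historical predicted outputs. FIG. 6 shows a network structure diagram of the predictor involved in the embodiments of this application. As shown in FIG. 6, the Predictor network uses a framework in which 4 non-empty historical outputs predict the current output. That is, the 4 non-empty historical outputs 61 corresponding to the current input are vector-mapped and then input into the one-dimensional convolutional network 62 to obtain the text hidden-layer representation vector.
In the embodiments of this application, the acoustic model may be trained on preset speech signal samples and the actual phoneme of each speech frame in those samples. For example, during training, a speech frame in the speech sample is input into the FSMN-based Encoder network in the acoustic model, and the actual phonemes of the 4 non-empty speech frames preceding it are input into the one-dimensional-convolution-based Predictor network (when no historical non-empty speech frames exist at the start of training, or when there are too few of them, preset phonemes can be used instead). While the acoustic model processes the input data, the parameters of its three parts (Encoder, Predictor, and joint network) are updated to maximize the sum of the probabilities over all possible alignment paths, that is, the result of formula (2) above, thereby training the acoustic model.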
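The limited-history predictor can be illustrated with a short sketch in plain Python. It is not the network of FIG. 6: the helper names `history_window` and `conv1d_predictor`, the padding label id 0, the toy embedding table, and the use of a single kernel that spans the whole 4-label window are all assumptions made for illustration.

```python
def history_window(non_empty_history, size=4, pad_id=0):
    """Return the last `size` non-empty labels, left-padded with `pad_id`.

    `pad_id` plays the role of the preset phoneme used when the history
    is too short (an assumed id, not taken from the source).
    """
    padded = [pad_id] * size + list(non_empty_history)
    return padded[-size:]


def conv1d_predictor(window, embedding, kernel, bias):
    """One-dimensional convolution whose kernel spans the whole window.

    embedding[label_id] -> list of D floats (the vector mapping)
    kernel[k][d][o]     -> weight for window slot k, input dim d, channel o
    bias[o]             -> bias of output channel o
    """
    output = list(bias)
    for k, label_id in enumerate(window):
        for d, value in enumerate(embedding[label_id]):
            for o in range(len(output)):
                output[o] += kernel[k][d][o] * value
    return output


# Toy parameters: 3 labels, embedding size 2, 2 output channels.
embedding = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
kernel = [[[1.0, 0.5] for _ in range(2)] for _ in range(4)]
bias = [0.0, 0.0]
window = history_window([2, 1], size=4, pad_id=0)
hidden = conv1d_predictor(window, embedding, kernel, bias)
```

Because the output depends only on a fixed window of four labels, there is no recurrent state to carry between steps, which is the property that makes this kind of predictor cheaper than an LSTM on a constrained CPU.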
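Step 303: Perform suppression adjustment on the probability of the blank output in the phoneme recognition result corresponding to each speech frame, to reduce the ratio of the probability of the blank output to the probability of each phoneme in the phoneme recognition result.
In some embodiments, performing suppression adjustment on the probability of the blank output in the phoneme recognition result corresponding to each speech frame includes:
adjusting the phoneme recognition result corresponding to each speech frame in at least one of the following ways:
reducing the probability of the blank output in the phoneme recognition result corresponding to each speech frame;
and increasing the probability of each phoneme in the phoneme recognition result corresponding to each speech frame.
In some embodiments, reducing the probability of the blank output in the phoneme recognition result corresponding to each speech frame includes:
multiplying the probability of the blank output in the phoneme recognition result corresponding to each speech frame by a first weight, the first weight being less than 1 and greater than 0.
In this embodiment of this application, the blank output probability may be suppressed by lowering only the blank output probability, for example by multiplying it by a number between 0 and 1. With the probability of each phoneme unchanged, this lowers the ratio of the blank output probability to the probability of each phoneme.
In some embodiments, reducing the probability of the blank output in the phoneme recognition result corresponding to each speech frame includes:
multiplying the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by a second weight, the second weight being greater than 1.
In this embodiment of this application, the blank output probability may also be suppressed by raising only the probability of each phoneme, for example by multiplying the probability of each phoneme by a number greater than 1. With the blank output probability unchanged, this likewise lowers the ratio of the blank output probability to the probability of each phoneme.
In another exemplary solution, the computer device may lower the blank output probability and raise the probability of each phoneme at the same time, for example by multiplying the blank output probability by a number between 0 and 1 while multiplying the probability of each phoneme by a number greater than 1.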
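The three adjustment variants described above can be sketched with one small function. The name `suppress_blank` and the convention that index 0 holds the blank output are assumptions for illustration; the weights mirror the first weight (between 0 and 1) and second weight (greater than 1) in the text.

```python
def suppress_blank(probs, first_weight=1.0, second_weight=1.0, blank=0):
    """Scale the blank probability down and/or the phoneme probabilities up."""
    if not 0.0 < first_weight <= 1.0:
        raise ValueError("first_weight must be in (0, 1]")
    if second_weight < 1.0:
        raise ValueError("second_weight must be >= 1")
    # The result is a list of decoding scores; it is not renormalized to sum to 1.
    return [
        p * first_weight if i == blank else p * second_weight
        for i, p in enumerate(probs)
    ]

probs = [0.8, 0.15, 0.05]                       # blank, phoneme 1, phoneme 2
only_blank = suppress_blank(probs, first_weight=0.5)
only_phonemes = suppress_blank(probs, second_weight=2.0)
both = suppress_blank(probs, first_weight=0.5, second_weight=2.0)
```

In all three cases the ratio of the blank score to each phoneme score drops, which is the effect step 303 asks for.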
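In this solution, to obtain the alignment between input and output, the acoustic model needs to insert a blank output symbol into the phoneme sequence, namely
Figure PCTCN2021129223-appb-000017
This symbol is predicted by the model in the same way as the other phonemes. Assuming the total number of non-blank phonemes is P, the final output dimension of the model is P+1, and dimension 0 usually represents the blank output
Figure PCTCN2021129223-appb-000018
Experiments show that introducing the blank output substantially increases the model's deletion errors, which indicates that many phonemes are misrecognized as the blank output. To address the excessively high blank output probability, this application adjusts the probability weight of the blank output during Transducer decoding to reduce deletion errors.
Taking the case of multiplying the blank output probability in the phoneme recognition result corresponding to each speech frame by the first weight as an example, assume the probability of the blank output is
Figure PCTCN2021129223-appb-000019
To lower the blank output probability, this solution divides the original blank output probability by a weight α greater than 1 (α > 1), called the discount factor; this is equivalent to multiplying by a first weight of 1/α. The adjusted blank output probability is:
Figure PCTCN2021129223-appb-000020
In general, log probabilities are used as the final values in the decoding score computation. Taking the logarithm of both sides of formula (3) therefore gives:
Figure PCTCN2021129223-appb-000021
The result of formula (4) can be used as the adjusted probability of the blank output in subsequent decoding.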
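Formulas (3) and (4) survive here only as image placeholders. From the surrounding definitions (division by a discount factor α > 1, then taking logarithms), one consistent way to write them is the following; the symbol $p_{\text{blank}}$ for the original blank output probability is an assumption, since the original notation is not recoverable from the text.

```latex
\begin{align}
\tilde{p}_{\text{blank}} &= \frac{p_{\text{blank}}}{\alpha}, \qquad \alpha > 1 \tag{3} \\
\log \tilde{p}_{\text{blank}} &= \log p_{\text{blank}} - \log \alpha \tag{4}
\end{align}
```

For instance, with $p_{\text{blank}} = 0.9$ and $\alpha = 2$, the adjusted probability is 0.45, and in the log domain the blank score is simply lowered by $\log 2 \approx 0.693$.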
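In some embodiments, the first weight or the second weight is preset in the computer device by a developer or an administrator; for example, the first weight or the second weight may be preset in the speech recognition model by a developer.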
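Step 304: Among the phoneme recognition results corresponding to the speech frames, input those whose blank output probability meets a specified condition into the decoding graph, to obtain the recognized text sequence corresponding to the speech signal.
In some embodiments, inputting the adjusted phoneme recognition results corresponding to the speech frames into the decoding graph to obtain the recognized text sequence corresponding to the speech signal includes:
when the blank output probability in a target phoneme recognition result meets the specified condition, inputting the target phoneme recognition result into the decoding graph to obtain the recognized text corresponding to the target phoneme recognition result;
where the target phoneme recognition result is any one of the phoneme recognition results corresponding to the speech frames.
In some embodiments, the specified condition includes:
the blank output probability in the target phoneme recognition result is less than a probability threshold.
Experiments show that, compared with the DNN-HMM model, the output of the Transducer model exhibits a pronounced spike effect: at certain moments the model outputs a prediction with very high confidence. Exploiting this effect, frames that the model predicts as blank output can be skipped during decoding, so that their probabilities do not take part in the decoding-graph search. Because this application uses phonemes as the modeling unit and skips blank outputs during decoding, the number of search steps in the decoding graph depends only on the number of phonemes. This solution calls the approach "phone synchronous decoding" (PSD). The following algorithm gives the complete flow of the proposed PSD algorithm together with the blank output weight adjustment: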
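Algorithm 1: PSD algorithm
Input:   feature sequence (x_1, x_2, ..., x_T); blank output weight adjustment coefficient β_blank;
         blank output threshold γ_blank; history output size M; decoding graph LG
Output:  predicted word sequence w*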
Figure PCTCN2021129223-appb-000022
Figure PCTCN2021129223-appb-000023
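In the algorithm above, line 6 performs the weight adjustment, where β_blank is the 1/α of formula (3). Lines 13 to 17 are the PSD algorithm proposed in this solution: the probability distribution output by the network takes part in the subsequent decoding-graph search only when the blank output probability is less than the threshold γ_blank.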
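Since the algorithm listing itself is only available as an image, the two steps just described (blank weight adjustment, then frame skipping) can be sketched as below. This is not a reconstruction of Algorithm 1: it assumes per-frame probability distributions are already available, that index 0 is the blank output, and that the threshold is compared against the adjusted blank probability; the function name `psd_filter` is hypothetical.

```python
import math

def psd_filter(frame_posteriors, beta_blank, gamma_blank, blank=0):
    """Return (frame index, log scores) for frames that should reach the decoding graph."""
    kept = []
    for t, probs in enumerate(frame_posteriors):
        adjusted = list(probs)
        # Blank weight adjustment: beta_blank corresponds to 1/alpha in formula (3).
        adjusted[blank] = probs[blank] * beta_blank
        # PSD: frames dominated by the blank output are skipped entirely.
        if adjusted[blank] < gamma_blank:
            log_scores = [math.log(p) if p > 0 else float("-inf") for p in adjusted]
            kept.append((t, log_scores))
    return kept

posteriors = [
    [0.95, 0.03, 0.02],   # spike on blank: skipped
    [0.20, 0.70, 0.10],   # spike on phoneme 1: decoded
    [0.90, 0.05, 0.05],   # skipped
    [0.10, 0.10, 0.80],   # spike on phoneme 2: decoded
]
kept = psd_filter(posteriors, beta_blank=0.5, gamma_blank=0.3)
```

In this toy input only two of the four frames are handed to the decoding graph, so the number of search steps follows the number of phonemes rather than the number of frames.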
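In some embodiments, the probability threshold is preset in the computer device by a developer or an administrator; for example, the probability threshold may be preset in the speech recognition model by a developer.
In some embodiments, before the adjusted phoneme recognition results corresponding to the speech frames are input into the decoding graph to obtain the recognized text sequence corresponding to the speech signal, the method further includes:
obtaining a threshold-influencing parameter, the threshold-influencing parameter including at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information; and
determining the probability threshold based on the threshold-influencing parameter.
In this embodiment of this application, the probability threshold may also be adjusted by the computer device during speech recognition. That is, the computer device may obtain parameters that can influence the value of the probability threshold and use them to set the threshold flexibly.
For example, ambient sound may interfere with the speech uttered by the user. When the ambient sound is strong, the computer device may set the probability threshold higher so that more phoneme recognition results are input into the decoding graph for decoding, which helps recognition accuracy. Conversely, when the ambient sound is weak, the computer device may set the probability threshold lower so that more phoneme recognition results are skipped, which improves recognition efficiency.
As another example, the accuracy with which the decoding graph decodes phoneme recognition results affects the success rate of speech recognition. When speech recognition has failed too many times within a specified time period (a period before the current moment, for example 5 minutes), the computer device may set the probability threshold higher so that more phoneme recognition results are input into the decoding graph for decoding, which helps recognition accuracy. Conversely, when speech recognition has failed only a few times or not at all within the specified time period, the computer device may set the probability threshold lower so that more phoneme recognition results are skipped, which improves recognition efficiency.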
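The threshold selection described above could look like the following sketch. Everything numeric here is hypothetical: the base threshold, the step size, the 60 dB noise boundary, the failure limit, and the choice to let a user setting override the other parameters are assumptions made for illustration, not values from the application.

```python
def choose_blank_threshold(noise_db=None, recent_failures=None, user_value=None,
                           base=0.5, step=0.2, noisy_db=60.0, max_failures=3):
    """Pick a blank probability threshold in [0, 1] from threshold-influencing parameters."""
    # Assumption: an explicit user setting takes precedence over measured parameters.
    if user_value is not None:
        return min(max(user_value, 0.0), 1.0)
    threshold = base
    if noise_db is not None:
        # Strong ambient sound: raise the threshold so more frames get decoded.
        threshold += step if noise_db >= noisy_db else -step
    if recent_failures is not None:
        # Many recent failures: raise the threshold to favor accuracy.
        threshold += step if recent_failures >= max_failures else -step
    return min(max(threshold, 0.0), 1.0)

quiet = choose_blank_threshold(noise_db=40.0)
noisy = choose_blank_threshold(noise_db=70.0)
```

A higher return value lets more frames pass the "blank probability less than threshold" test, trading speed for accuracy, and a lower value does the opposite.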
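In some embodiments, the decoding graph is composed of a phoneme dictionary and a language model.
The decoding graph used in this solution is the composition of two weighted finite-state transducer (WFST) sub-graphs: a phoneme dictionary and a language model.
Phoneme dictionary WFST: a mapping between Chinese characters or words and phoneme sequences. Given a phoneme sequence as input, the WFST outputs the corresponding characters or words. This WFST is usually independent of the text domain and is shared across recognition tasks.
Language model WFST: this WFST is usually converted from an n-gram language model. The language model computes the probability of a sentence and is trained from data by statistical methods. Texts from different domains, such as news and spoken dialogue, differ considerably in their common words and word collocations, so speech recognition can be adapted to a domain by changing the language model WFST.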
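To make the division of labor between the two sub-graphs concrete, the sketch below replaces the composed WFST with a toy dynamic program: a lexicon list stands in for the phoneme dictionary and a bigram table stands in for the language model. This is not a WFST implementation; the function name, the data structures, the `<s>` start token, and the `unk_logp` fallback score are all illustrative assumptions.

```python
import math

def decode_phonemes(phonemes, lexicon, bigram, unk_logp=-10.0):
    """Find the best-scoring word sequence for a phoneme list (toy stand-in for L∘G)."""
    n = len(phonemes)
    # best[i] maps the last word to (score, words) for segmentations of phonemes[:i].
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, [])
    for i in range(n):
        for prev, (score, words) in best[i].items():
            for word, pron in lexicon:            # dictionary: phonemes -> word
                j = i + len(pron)
                if tuple(phonemes[i:j]) != tuple(pron):
                    continue
                # Language model: bigram log probability of `word` after `prev`.
                new_score = score + bigram.get((prev, word), unk_logp)
                if word not in best[j] or new_score > best[j][word][0]:
                    best[j][word] = (new_score, words + [word])
    if not best[n]:
        return None
    return max(best[n].values(), key=lambda item: item[0])[1]

lexicon = [
    ("打开", ("d", "a", "k", "ai")), ("空调", ("k", "ong", "t", "iao")),
    ("大", ("d", "a")), ("开", ("k", "ai")),
]
bigram = {
    ("<s>", "打开"): math.log(0.5), ("打开", "空调"): math.log(0.6),
    ("<s>", "大"): math.log(0.1), ("大", "开"): math.log(0.01),
    ("开", "空调"): math.log(0.05),
}
words = decode_phonemes(["d", "a", "k", "ai", "k", "ong", "t", "iao"], lexicon, bigram)
```

Swapping in a different `bigram` table while keeping `lexicon` fixed mirrors the domain adaptation described above: the dictionary part stays general and only the language model changes.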
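FIG. 7 is a flowchart of model training and application according to an embodiment of this application. As shown in FIG. 7, taking application to an in-vehicle device as an example, after the model shown in the embodiments of this application is trained, libtorch is used to quantize and deploy it. The Android version of libtorch uses the QNNPACK library for INT8 matrix computation, which greatly accelerates matrix operations. The model is trained with PyTorch in the Python environment 71 and is then quantized after training, that is, the model parameters are quantized to INT8 and INT8 matrix multiplication is used to speed up computation. The quantized model is exported and used for forward inference in the C++ environment 72, where it is tested on test data.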
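As a rough illustration of what "quantizing the model parameters to INT8" means numerically, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It does not use or reproduce the libtorch or QNNPACK APIs, and the scheme shown (a single scale, values clamped to ±127) is an assumption; the actual toolchain may use a different scheme.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale factor."""
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale for all-zero input
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is stored in one byte instead of four, and the rounding error per weight stays within half a quantization step, which is why accuracy often degrades only slightly.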
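The solution shown in this application has several benefits. First, the Transducer-based end-to-end model needs no frame-level alignment information during training, which greatly simplifies modeling. Second, the decoding graph is simplified and the search space is reduced: because the method uses phoneme modeling, the decoding graph only requires the composition of L and G, and the search space is much smaller. Finally, phoneme modeling combined with a custom decoding graph supports flexible customization: for different business scenarios, only the language model needs to be customized, without changing the acoustic model, to adapt to each scenario.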
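Compared with offline recognition systems in the related art, this solution has advantages in both recognition rate and CPU usage:
In terms of recognition rate, the system model shown in this solution achieves a substantial improvement over a system that combines a DNN with a hidden Markov model (HMM), that is, the DNN-HMM model.
In terms of CPU usage, the system model shown in this solution has a CPU usage similar to that of the DNN-HMM system model even when its number of model parameters is 4 times that of the DNN-HMM system.
The speech recognition rates compare as follows:
Table 1 below compares the character error rate (CER) of the existing DNN-HMM system and the Transducer system proposed in this solution on two test sets.
Table 1
Model        Parameters  Test set 1 CER (%)  Test set 2 CER (%)
DNN-HMM      0.7M        14.88               19.77
Transducer1  0.8M        12.1                16.09
Transducer2  1.9M        9.76                13.4
Transducer3  2.1M        8.93                13.18
Table 1 shows that, with a similar number of parameters, the Transducer1 model achieves relative CER reductions of 18.7% and 18.6% on the two test sets. With a larger number of model parameters, Transducer3 reaches CERs of 8.93% and 13.18%, respectively.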
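The relative reductions quoted above follow directly from the Table 1 values. A short check, with the helper name `relative_reduction` chosen only for illustration:

```python
def relative_reduction(baseline, new):
    """Relative error reduction in percent: (baseline - new) / baseline * 100."""
    return (baseline - new) / baseline * 100.0

# DNN-HMM vs. Transducer1, values taken from Table 1.
test_set_1 = round(relative_reduction(14.88, 12.1), 1)    # 18.7
test_set_2 = round(relative_reduction(19.77, 16.09), 1)   # 18.6
```

The same function can be applied to the Transducer2 and Transducer3 rows to compare the larger models against the DNN-HMM baseline.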
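CPU usage comparison:
Table 2
Model        Parameters  Peak CPU usage
DNN-HMM      0.7M        16%
Transducer1  0.8M        18%
Transducer2  1.9M        20%
Transducer3  2.1M        20%
Comparing Transducer1 with DNN-HMM in Table 2, at a comparable number of parameters the peak CPU usage of the Transducer1 model is 2 percentage points higher than that of the DNN-HMM model. As the number of model parameters increases, however, the peak CPU usage of the Transducer models does not change noticeably: even with a much larger model and a lower recognition error rate, CPU usage remains at a low level.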
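In summary, for a phoneme recognition result that contains the probability distribution of a speech frame over the phonemes and the blank output, the solution shown in the embodiments of this application suppresses the blank output probability in the phoneme recognition result before inputting the result into the decoding graph. This lowers the chance that a speech frame is recognized as the blank output and therefore the chance that it is misrecognized as the blank output, that is, it reduces the model's deletion errors and improves the model's recognition accuracy.
The solution in the embodiment shown in FIG. 3 is described using the example in which the blank output weight adjustment (step 303) and frame skipping during decoding (corresponding to step 304) are applied together. In other implementations, the blank output weight adjustment and frame skipping during decoding may also be applied independently. For example, in an exemplary embodiment of this application, when frame skipping during decoding is applied on its own, the solution may be as follows:
obtaining a speech signal, the speech signal including speech frames obtained by segmenting original speech;
performing phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame, where the phoneme recognition result indicates the probability distribution of the corresponding speech frame over a phoneme space, and the phoneme space contains the phonemes and one blank output;
among the phoneme recognition results corresponding to the speech frames, inputting those whose blank output probability meets a specified condition into the decoding graph, to obtain the recognized text sequence corresponding to the speech signal.
In summary, for a phoneme recognition result that contains the probability distribution of a speech frame over the phonemes and the blank output, the solution shown in the embodiments of this application decodes only those phoneme recognition results whose blank output probability meets the condition when inputting results into the decoding graph. This reduces the number of phoneme recognition results that need to be decoded and skips unnecessary decoding steps, which effectively improves speech recognition efficiency.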
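FIG. 8 is a framework diagram of a speech recognition system according to an exemplary embodiment. As shown in FIG. 8, an audio acquisition device 81 is connected to a speech recognition device 82, and the speech recognition device 82 includes an acoustic model 82a, a probability adjustment unit 82b, a decoding graph input unit 82c, a decoding graph 82d, and a feature extraction unit 82e. The decoding graph 82d is composed of a phoneme dictionary and a language model.
In use, after the audio acquisition device 81 captures the user's original speech, it transmits the original speech to the feature extraction unit 82e in the speech recognition device 82. The feature extraction unit segments the speech and extracts features from each speech frame. The speech features of a speech frame, and the phonemes of the text recognized by the decoding graph 82d from the four preceding non-blank speech frames, are input into the FSMN and the one-dimensional convolutional network of the acoustic model 82a respectively, and the acoustic model 82a outputs the phoneme recognition result of that speech frame.
The phoneme recognition result is input into the probability adjustment unit 82b, which adjusts the blank output probability to obtain an adjusted phoneme recognition result. The decoding graph input unit 82c then evaluates the adjusted phoneme recognition result. If the adjusted blank output probability is less than the threshold, decoding is required, and the decoding graph input unit 82c inputs the adjusted phoneme recognition result into the decoding graph 82d, which recognizes the text. Conversely, if the adjusted blank output probability is not less than the threshold, decoding is not required, and the adjusted phoneme recognition result is discarded.
After the decoding graph has recognized the adjusted phoneme recognition results of the speech frames and output a text sequence, the text sequence may be passed to a natural language processing component, which responds to the speech input by the user.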
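FIG. 9 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment. The speech recognition apparatus can implement all or some of the steps of the method provided in the embodiment shown in FIG. 2 or FIG. 3. The speech recognition apparatus may include:
a speech signal processing module 901, configured to perform phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal, where the phoneme recognition result indicates the probability distribution of the corresponding speech frame over a phoneme space, and the phoneme space contains the phonemes and a blank output;
a probability adjustment module 902, configured to perform suppression adjustment on the blank output probability in the phoneme recognition result corresponding to each speech frame, to reduce the ratio of the blank output probability to the probability of each phoneme in the phoneme recognition result; and
a decoding module 903, configured to input the adjusted phoneme recognition results corresponding to the speech frames into a decoding graph to obtain the recognized text sequence corresponding to the speech signal, the decoding graph including a mapping between characters and phonemes.
In some embodiments, the probability adjustment module 902 is configured to adjust the phoneme recognition result corresponding to each speech frame in at least one of the following ways:
reducing the blank output probability in the phoneme recognition result corresponding to each speech frame;
and
increasing the probability of each phoneme in the phoneme recognition result corresponding to each speech frame.
In some embodiments, the probability adjustment module 902 is configured to multiply the blank output probability in the phoneme recognition result corresponding to each speech frame by a first weight, the first weight being less than 1 and greater than 0.
In some embodiments, the probability adjustment module 902 is configured to multiply the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by a second weight, the second weight being greater than 1.
In some embodiments, the decoding module 903 is configured to:
when the blank output probability in a target phoneme recognition result meets a specified condition, input the target phoneme recognition result into the decoding graph to obtain the recognized text corresponding to the target phoneme recognition result;
where the target phoneme recognition result is any one of the phoneme recognition results corresponding to the speech frames.
In some embodiments, the specified condition includes:
the blank output probability in the target phoneme recognition result is less than a probability threshold.
In some embodiments, the apparatus further includes:
a parameter obtaining module, configured to obtain a threshold-influencing parameter, the threshold-influencing parameter including at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information;
a threshold determining module, configured to determine the probability threshold based on the threshold-influencing parameter.
In some embodiments, the speech signal processing module 901 is configured to:
perform feature extraction on a target speech frame through the trained acoustic model to obtain a feature vector of the target speech frame, the target speech frame being any one of the speech frames;
input the target speech frame into the encoder of the acoustic model to obtain the acoustic hidden-layer representation vector of the target speech frame;
input the phoneme information of the historical recognized text of the target speech frame into the predictor of the acoustic model to obtain the text hidden-layer representation vector of the target speech frame, where the historical recognized text of the target speech frame is the text obtained when the decoding graph recognizes the phoneme recognition results of the n preceding non-blank-output speech frames of the target speech frame, and n is an integer greater than or equal to 1; and
input the acoustic hidden-layer representation vector and the text hidden-layer representation vector of the target speech frame into the joint network to obtain the phoneme recognition result of the target speech frame.
In some embodiments, the encoder is a feedforward sequential memory network (FSMN).
In some embodiments, the predictor is a one-dimensional convolutional network.
In some embodiments, the decoding graph is composed of a phoneme dictionary and a language model.
In summary, for a phoneme recognition result that contains the probability distribution of a speech frame over the phonemes and the blank output, the solution shown in the embodiments of this application suppresses the blank output probability in the phoneme recognition result before inputting the result into the decoding graph. This lowers the chance that a speech frame is recognized as the blank output and therefore the chance that it is misrecognized as the blank output, that is, it reduces the model's deletion errors and improves the model's recognition accuracy.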
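FIG. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment. The computer device may be implemented as the computer device in the foregoing method embodiments. The computer device 1000 includes a central processing unit 1001, a system memory 1004 that includes a random access memory (RAM) 1002 and a read-only memory (ROM) 1003, and a system bus 1005 that connects the system memory 1004 and the central processing unit 1001. The computer device 1000 further includes a basic input/output system 1006 that helps transfer information between the components in the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable medium provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
Without loss of generality, the computer-readable medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, flash memory or other solid-state memory technologies, CD-ROM or other optical storage, tape cassettes, magnetic tape, disk storage, or other magnetic storage devices. A person skilled in the art will understand that computer storage media are not limited to those listed above. The system memory 1004 and the mass storage device 1007 may be collectively referred to as the memory.
The computer device 1000 may be connected to the Internet or other network devices through a network interface unit 1011 connected to the system bus 1005.
The memory further includes at least one computer instruction stored in the memory, and the processor loads and executes the at least one computer instruction to implement all or some of the steps of the method shown in FIG. 2 or FIG. 3.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is further provided, for example a memory including a computer program (instructions), where the program (instructions) can be executed by a processor of a computer device to perform the methods shown in the embodiments of this application. For example, the non-transitory computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product or computer program is further provided. The computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods shown in the foregoing embodiments.
A computer program product includes a computer program, where the computer program, when executed by a processor, implements the methods shown in the foregoing embodiments.
After considering the specification and practicing the invention disclosed here, a person skilled in the art will readily conceive of other implementations of this application. This application is intended to cover any variations, uses, or adaptive changes of this application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of this application are indicated by the claims.
It should be understood that this application is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.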

Claims (20)

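  1. A speech recognition method, performed by a computer device, the method comprising:
    performing phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal, the phoneme recognition result indicating a probability distribution of the corresponding speech frame over a phoneme space, and the phoneme space containing phonemes and a blank output;
    performing suppression adjustment on a probability of the blank output in the phoneme recognition result corresponding to each speech frame, to reduce a ratio of the probability of the blank output to a probability of each phoneme in the phoneme recognition result; and
    inputting the adjusted phoneme recognition results corresponding to the speech frames into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph comprising a mapping between characters and phonemes.
  2. The method according to claim 1, wherein performing suppression adjustment on the probability of the blank output in the phoneme recognition result corresponding to each speech frame comprises:
    reducing the probability of the blank output in the phoneme recognition result corresponding to each speech frame.
  3. The method according to claim 2, wherein reducing the probability of the blank output in the phoneme recognition result corresponding to each speech frame comprises:
    multiplying the probability of the blank output in the phoneme recognition result corresponding to each speech frame by a first weight, the first weight being less than 1 and greater than 0.
  4. The method according to claim 2, wherein reducing the probability of the blank output in the phoneme recognition result corresponding to each speech frame comprises:
    multiplying the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by a second weight, the second weight being greater than 1.
  5. The method according to claim 1, wherein performing suppression adjustment on the probability of the blank output in the phoneme recognition result corresponding to each speech frame comprises:
    increasing the probability of each phoneme in the phoneme recognition result corresponding to each speech frame.
  6. The method according to claim 1, wherein inputting the adjusted phoneme recognition results corresponding to the speech frames into the decoding graph to obtain the recognized text sequence corresponding to the speech signal comprises:
    when the probability of the blank output in a target phoneme recognition result meets a specified condition, inputting the target phoneme recognition result into the decoding graph to obtain recognized text corresponding to the target phoneme recognition result;
    wherein the target phoneme recognition result is any one of the phoneme recognition results corresponding to the speech frames.
  7. The method according to claim 6, wherein the specified condition comprises:
    the probability of the blank output in the target phoneme recognition result being less than a probability threshold.
  8. The method according to claim 7, wherein before inputting the adjusted phoneme recognition results corresponding to the speech frames into the decoding graph to obtain the recognized text sequence corresponding to the speech signal, the method further comprises:
    obtaining a threshold-influencing parameter, the threshold-influencing parameter comprising at least one of ambient sound intensity, a number of speech recognition failures within a specified time period, and user setting information; and
    determining the probability threshold based on the threshold-influencing parameter.
  9. The method according to claim 1, wherein performing phoneme recognition on the speech signal to obtain the phoneme recognition result corresponding to each speech frame in the speech signal comprises:
    performing feature extraction on a target speech frame through a trained acoustic model to obtain a feature vector of the target speech frame, the target speech frame being any one of the speech frames;
    inputting the target speech frame into an encoder of the acoustic model to obtain an acoustic hidden-layer representation vector of the target speech frame;
    inputting phoneme information of historical recognized text of the target speech frame into a predictor of the acoustic model to obtain a text hidden-layer representation vector of the target speech frame, the historical recognized text of the target speech frame being text obtained when the decoding graph recognizes the phoneme recognition results of the n preceding non-blank-output speech frames of the target speech frame, and n being an integer greater than or equal to 1; and
    inputting the acoustic hidden-layer representation vector of the target speech frame and the text hidden-layer representation vector of the target speech frame into a joint network of the acoustic model to obtain the phoneme recognition result of the target speech frame.
  10. The method according to claim 9, wherein the encoder is a feedforward sequential memory network (FSMN).
  11. The method according to claim 9, wherein the predictor is a one-dimensional convolutional network.
  12. The method according to any one of claims 1 to 9, wherein the decoding graph is composed of a phoneme dictionary and a language model.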
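  13. A speech recognition method, performed by a computer device, the method comprising:
    obtaining a speech signal, the speech signal comprising speech frames obtained by segmenting original speech;
    performing phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame, the phoneme recognition result indicating a probability distribution of the corresponding speech frame over a phoneme space, and the phoneme space containing phonemes and a blank output; and
    among the phoneme recognition results corresponding to the speech frames, inputting those whose probability of the blank output meets a specified condition into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph comprising a mapping between characters and phonemes.
  14. A speech recognition apparatus, the apparatus comprising:
    a speech signal processing module, configured to perform phoneme recognition on a speech signal to obtain a phoneme recognition result corresponding to each speech frame in the speech signal, the phoneme recognition result indicating a probability distribution of the corresponding speech frame over a phoneme space, and the phoneme space containing phonemes and a blank output;
    a probability adjustment module, configured to perform suppression adjustment on a probability of the blank output in the phoneme recognition result corresponding to each speech frame, to reduce a ratio of the probability of the blank output to a probability of each phoneme in the phoneme recognition result; and
    a decoding module, configured to input the adjusted phoneme recognition results corresponding to the speech frames into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph comprising a mapping between characters and phonemes.
  15. The apparatus according to claim 14, wherein the probability adjustment module is further configured to:
    reduce the probability of the blank output in the phoneme recognition result corresponding to each speech frame.
  16. The apparatus according to claim 15, wherein the probability adjustment module is further configured to:
    multiply the probability of the blank output in the phoneme recognition result corresponding to each speech frame by a first weight, the first weight being less than 1 and greater than 0.
  17. A speech recognition apparatus, the apparatus comprising:
    a speech signal obtaining module, configured to obtain a speech signal, the speech signal comprising speech frames obtained by segmenting original speech;
    a phoneme recognition result obtaining module, configured to perform phoneme recognition on the speech signal to obtain a phoneme recognition result corresponding to each speech frame, the phoneme recognition result indicating a probability distribution of the corresponding speech frame over a phoneme space, and the phoneme space containing phonemes and a blank output; and
    a recognized text sequence obtaining module, configured to input, among the phoneme recognition results corresponding to the speech frames, those whose probability of the blank output meets a specified condition into a decoding graph to obtain a recognized text sequence corresponding to the speech signal, the decoding graph comprising a mapping between characters and phonemes.
  18. A computer device, comprising a processor and a memory, the memory storing at least one computer instruction, and the at least one computer instruction being loaded and executed by the processor to implement the speech recognition method according to any one of claims 1 to 13.
  19. A computer-readable storage medium, storing at least one computer instruction, the at least one computer instruction being loaded and executed by a processor to implement the speech recognition method according to any one of claims 1 to 13.
  20. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.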
PCT/CN2021/129223 2020-12-23 2021-11-08 语音识别方法、装置、计算机设备及存储介质 Ceased WO2022134894A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023518016A JP7570760B2 (ja) 2020-12-23 2021-11-08 音声認識方法、音声認識装置、コンピュータ機器、及びコンピュータプログラム
EP21908894.5A EP4191576B1 (en) 2020-12-23 2021-11-08 SPEECH RECOGNITION METHOD, COMPUTER DEVICE AND STORAGE MEDIA
US17/977,496 US12367861B2 (en) 2020-12-23 2022-10-31 Phoneme recognition-based speech recognition method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011536771.4A CN113539242B (zh) 2020-12-23 2020-12-23 语音识别方法、装置、计算机设备及存储介质
CN202011536771.4 2020-12-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/977,496 Continuation US12367861B2 (en) 2020-12-23 2022-10-31 Phoneme recognition-based speech recognition method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022134894A1 true WO2022134894A1 (zh) 2022-06-30

Family

ID=78124211

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/129223 Ceased WO2022134894A1 (zh) 2020-12-23 2021-11-08 语音识别方法、装置、计算机设备及存储介质

Country Status (5)

Country Link
US (1) US12367861B2 (zh)
EP (1) EP4191576B1 (zh)
JP (1) JP7570760B2 (zh)
CN (1) CN113539242B (zh)
WO (1) WO2022134894A1 (zh)

Also Published As

Publication number Publication date
EP4191576A1 (en) 2023-06-07
EP4191576A4 (en) 2024-05-29
US12367861B2 (en) 2025-07-22
CN113539242A (zh) 2021-10-22
US20230074869A1 (en) 2023-03-09
EP4191576B1 (en) 2025-12-31
JP7570760B2 (ja) 2024-10-22
JP2023542685A (ja) 2023-10-11
CN113539242B (zh) 2025-05-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908894

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202347015921

Country of ref document: IN

ENP Entry into the national phase

Ref document number: 2021908894

Country of ref document: EP

Effective date: 20230302

ENP Entry into the national phase

Ref document number: 2023518016

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWG Wipo information: grant in national office

Ref document number: 202347015921

Country of ref document: IN

WWG Wipo information: grant in national office

Ref document number: 2021908894

Country of ref document: EP